Bovicide 5: Parse-time Error Reporting
This post is one of a series prompted by an exchange of papers on the Internet. That exchange discussed what it would take to produce an acceptable replacement for yacc and the entire family of LALR-based parsers.
My first post concentrated on speed and power -- the ability to parse all context-free grammars in O(n**3) time, and the ability to parse just about every grammar in practical use, including all yacc-able grammars, in linear (O(n)) time. This is the second of two posts about error reporting.
Error reporting is often ignored, but it is extremely important. In a previous post, I suggested that its poor error reporting properties are the real reason why production parsers are abandoning the once-dominant LALR in favor of hand-written recursive descent. LALR's feedback on grammar errors could pass for low-grade encryption.
Theory heavily influences how programmers look at their craft. Error detection is a good example. Great theoreticians are people who know what to ignore. As Michangelo is alleged to have said, the statue is already in the marble, you just have to chip away the extra stuff. Error detection is certainly grotty detail. So the first examinations of parsing pretty much ignored it. As did the second, third, fourth, ...
Things did eventually get better. These days parsing texts often look carefully at an algorithm's parse-time error detection properties. Classification of the parsers by error detection currently focuses on their ability to determine where the error is. It makes sense to focus on this because it's simpler than determining why there is an error, and in any case, you aren't likely to find the why if you don't know the where.
A error is said to occur at the first token which is not part of a correct prefix. (I will call that token the error location.) A correct prefix is a string of tokens which is a prefix of some input that parses successfully. Note that this definition of error location may not always match your intuition of where the error is. Intuitively, the error is the first token which cannot be part of one of the inputs that you intended -- it is the point where you "went wrong". But neither the theoreticians or I have any idea of how to determine what a programmer really intended, short of a Vulcan Mind Meld. So correct prefixes are going to have to be good enough.
Here's the theoretician's breakdown, using my names.
- Clueless Parsers. After it's all over, these parsers realize the something went wrong, but they have no idea where.
- Clueish Parsers. These parsers don't always know it immediately when things go wrong, but they clue in soon after.
- Clueful Parsers. These parsers know exactly when things go wrong.
The Fifth Requirement for Replacing yacc: Good Reporting of Parse-time Errors
Clueless parsers stay confined to the pages of textbooks, as you might expect. You might also expect this of clueish parsers, but in fact LALR is clueish, which means clueish parsing was the industry standard for production parsing for decades. Generations of programmers got used to mysterious error messages from even the best compilers, and as a result standards for parse-time error reporting remain low.
I don't expect them to stay that way. I hope in the future clueful parsing will be seen as a minimum in production quality parsing. Regular expressions are clueful.
Recursive descent is hard to characterize. The underlying algorithm is usually less than full LL(1), which makes recursive descent in theory clueish. But hand-written recursive descent is attractive because it can be extended as needed with hacks, so in practice a hand-written recursive descent grammar might be clueful when it counts.
The Ruby Slippers Property
Marpa has the Ruby Slippers Property -- Marpa not only knows exactly where the parse went wrong, it knows why, and can report that to the user in convenient form. If a lexer passes Marpa a bad token, Marpa can easily tell the lexer which token or tokens it will accept.
The Ruby Slippers are useful for much more than error reporting. For example, writing BNF to parse very liberal HTML is a difficult task using conventional methods. Just dealing with the cases of missing start or end tags makes the grammar very complex, and very hard to maintain and modify.
With Marpa, you can skip all that work. You write the BNF for strict HTML, on the assumption that all the start and end tags will be there. Then, while running, if the parser rejects a token, the lexer can ask Marpa what it wanted instead. If it was an start or end tag, the lexer can invent one and pass it on to keep the parse going. And that's all you need to do to handle the issue of missing HTML start and end tags.
Marpa knows not only what tokens it is looking for, but what rules it is working on and how far it has progressed into them. So far, the only application I've put this additional information to is debugging. I call the lists of rules in progress, "progress reports".
I created progress reports when I started to work on complex grammars in Marpa, hoping to make a traditionally difficult task easier. Right off the bat, these "progress reports" were a big improvement. Instead of struggling with the internals of the parser generator, as I then had to do with Marpa, and as users of yacc still must, I now had a parser which provided a window directly into my grammar.
Note 2: Why my terms? Because the theoretician's terms are cumbersome and misleading. Theoreticians defined "the immediate error detection property" and "the correct prefix property". Clueless parsers lack both properties. Clueish parsers have the correct prefix property, but lack the immediate error detection property. Clueful parsers have both properties.
The "immediate error detection property" is what you'd think it is, which is why all parsers with that property are clueful. What's not so clear is why the "immediate detection property" is different from the "correct prefix property". Parsers with the "correct prefix property" are those which reject as soon as an incorrect prefix has been processed. That is not the same as "the immediate error detection property" because "processing" often involves reading input well beyond the correct prefix, as well as destroying useful evidence about where the last correct prefix was.
Note 3: Marpa's speed improvements derive from one algorithm published by John Aycock and Nigel Horspool, and another published by Joop Leo. Marpa is the first parser to combine the algorithms into one, and Marpa is the first practical implementation of Leo's algorithm.