What is the Marpa algorithm?
I have referred to "the Marpa algorithm" many times. What is that? The implementation involves many details, but the Marpa algorithm itself is basically four ideas. Of these only the most recent is mine. The other three come from papers spanning over 40 years.
Idea 1: Parse by determining which rules can be applied where
The first idea is to track the progress of the a parse by determining, for each token, which rules can be applied and where. Sounds pretty obvious. Not-so-obvious is how to do this efficiently.
In fact, most parsing these days uses some sort of shortcut. Regexes and LALR (yacc, etc.) require the grammar to take a restricted form, so that they can convert the rules into a state machine. Recursive descent, rather than list the possibilities, dives into them one by one. It, too, only works well with grammars of a certain kind.
In 1970, Jay Earley described an parsing algorithm that went from left to right, and worked by determining which rules applied where. Earley's was reasonably fast, but the severe limits of 1970's hardware pushed less powerful parsing algorithms to the forefront, where they remain. Jay Earley soon left the computer field to become a psychotherapist. His ideas remain the basis of much of general BNF parsing. Marpa is an Earley's parser.
Idea 2: Right-recursion is left-recursion in the mirror
Earley's original algorithm could handle anything you could write in BNF: ambiguous, infinitely ambiguous, left-recursive, right-recursive, you name it. And Earley's was very good at left-recursion -- blazingly fast. With right-recursion, however, Earley's original algorithm went quadratic. For many applications, quadratic time is unacceptable.
Right-recursion is just left-recursion seen backwards. In 1991, Joop Leo figured out how to take efficient advantage of this. Leo modified Earley's to identify potential right-recursions and parse them "in the mirror", as if they were left-recursions.
With this improvement, Earley's algorithm was now linear for the LR-regular grammars, a vast class which includes every other class of grammar in practical use, then and now. Startlingly, Leo's result went 20 years with few implementations. Marpa is its first implementation in a general-purpose utility.
Idea 3: Don't dance around the issue of nullables
A pesky problem with Earley's algorithm remained: nullable symbols.
A symbol
is a nullable symbol
if it represents something that might be omitted,
Examples are the three expression's
in C language for
statements:
any or all of these may be omitted.
To be considered practical,
a parsing algorithm must work well with grammars
that contain nullables.
Earley's original 1970 algorithm actually had a bug in its handling of nullables. There was an easy fix, but it made the algorithm slower. Since efficiency was already seen as the reason to prefer other parsers, this was a big deal.
In 2002, Aycock & Horspool stopped the dancing around the nullable issue. They rewrote Earley's parse engine, targeting nullables. The result was an improvement in other respects. Marpa's parse engine is not that of Aycock & Horspool. But Marpa's strategy for handling nullables comes directly from their work. And the design of Marpa's parse engine is heavily influenced by that described in Aycock & Horspool 2002.
Idea 4: Use the Ruby Slippers
For Marpa to get the benefits of both Leo and Aycock&Horspool, I needed to combine their quite different parse engines. In the process, I realized that behind Earley's original algorithm lay a promise, never-fulfilled. If your parser knows which rules are applicable where, then it should, in principle, allow you to use this information to guide the parsing process.
What was needed was a parse engine which was careful to do all the other processing BEFORE it looked at the input. The Marpa parse engine does this --- at each parse location, examining the input is the last thing Marpa does. By examining the input last, Marpa makes the information from the other processing available for determining what that input should be.
To see how this can be useful, take the example of liberal HTML --- HTML in which tags might be missing. With Marpa, it is possible to parse liberal HTML with an unrealistically simple grammar. An HTML parser can use a reference grammar which assumes that start and end tags are always present, even when not required by the standard. When the input does not conform to this grammar's unrealistic expectations, and it very often will not, the application can ask the parser what it DOES expect according to that grammar. If it's a missing end tag, the application can invent it and pass it on to the parser. The parse then continues, quite happily.
I call this technique, where the grammar demands a perfect world and the application changes the input to match the grammar's expectations, Ruby Slippers parsing. I've implemented an HTML parser which uses the Ruby Slippers. I use Marpa::HTML myself a good deal. It is fast and flexible and IMHO the best way to parse liberal HTML.
Summary
Four ideas are essential to the Marpa algorithm:- Marpa is an Earley's parser, in that it works by creating lists of possibilities at each token.
- Marpa handles right-recursion using Joop Leo's method. This makes it O(n) for every class of grammar in practical use today.
- Marpa handles nullables using the ideas of Aycock&Horspool 2002. As a side effect, this opens the way to further improvements in the Earley parse engine.
- The Marpa algorithm allows the application to be fully aware, at all times, of what is going on in the parse. Obviously, this benefits error-handling. More importantly, it opens the road to powerful new parsing strategies.
Notes
- "Earley 1970": Earley, Jay, "An efficient context-free parsing algorithm", Communications of the Association for Computing Machinery, 13:2:94-102, 1970.
- "Leo 1991": Leo, Joop M.I.M., "A General Context-Free Parsing Algorithm Running in Linear Time on Every LR(k) Grammar Without Using Lookahead", Theoretical Computer Science, Vol. 82, No. 1, 1991, pp 165-176.
- "Aycock&Horspool 2002": Aycock, John and Horspool, R. Nigel, "Practical Earley Parsing", The Computer Journal, Vol. 45, No. 6, 2002, pp. 620-630. Available online.
This is awesome, .. especially Marpa::HTML. I stumbled across this post because I frequent blogs.perl.org but coincidentally I was looking into how I would approach a PP HTMLTidy solution.
I thought about the different ways to parse the source, quick-and-dirty, lexer, etc. Then by pure happenstance I seen this post which gave me a nice insight into Marpa (always wondered what it was) and gave me my PP HTMLTidy'er.
THanks a MiL
Perhaps you could publish your "Ruby Slippers" algorithm too?
About 4., CoffeeScript does the same thing, as a trick to help simplify the parser.
@Jakub: I have late drafts of papers on the algorithm, with the usual apparatus of correctness proofs, etc. I'd like to put them at arXiv.org, but I don't currently have an academic post and I'd need to find an "endorser" in computer science. (These days my academic contacts are mainly in philosophy.)
I published in a refereed journal many years ago, but that process is slow & dicey & no longer the only game in town.
I wrote a "ruby slippers" algo for HTML/aXML input on PerlNights recently to prevent layout breakup when users want to display markup. My solution is reasonably efficient but I wouldn't mind having a look at how others have accomplished the same effect.