Announcing Marpa::XS 0.8.0
I have just released Marpa::XS 0.008000. With this release the core Marpa algorithm has been converted to C, vastly speeding it up. Marpa::XS is still alpha, but the additional development needed at this point is a matter of packaging (See Note 1).
It is my hope that Marpa will become the standard parsing algorithm for problems too big for regular expressions.
- Marpa parses all classes of grammar that are currently in practical use in linear time. (See Note 2).
- Marpa is a general BNF parser -- that means if you feed it anything written in BNF, it "just parses" it. This includes grammars which are left-recursive, right-recursive and ambiguous -- even infinitely ambiguous.
- Marpa never goes exponential -- worst case, even for highly ambiguous grammars, is O(n**3), which is considered optimal (See Note 3).
Limitations and Comparisons
The foremost limitation of Marpa is, of course, that it is alpha. Development is well advanced, but the interface remains subject to change.
There are several parsing tasks where current technology will be superior even to a stable, production Marpa. For problems which work well as regexes (compact and with no backtracking), Marpa will never replace a good regular expression engine. On the other hand, Marpa may well be the answer to a lot of problems which are forced to fit into the regular expression paradigm for lack of anything better. Hand-written recursive descent could compete with Marpa, but not having to write your own parser is a huge advantage.
I've made no secret that I consider yacc and LALR theoretical milestones and industry traditions which are best honored in the breach. I've pointed out LALR's horrendous error-handling properties elsewhere (See Note 4). LALR does offer some apparent speed advantages, but in the context of practical applications these are more apparent than real.
Notes
Note 1: Inside the code are many relics of the development process, and these need to be cleaned up. The user documentation is finished, but the internals documentation needs a lot of work.
Some additional C code may be written to help in the evaluation phase but, while this code could produce some significant additional efficiencies, I don't consider it part of the core Marpa development project for two reasons. First, evaluation is to a large degree a matter of calling higher-level and application code.
Second, the core Marpa algorithm was new and extremely difficult mathematics -- built on top of a large body of prior work to be sure, but new nonetheless. With the evaluation code, my original thinking involves making making choices of what to use in existing mathematics -- evaluation consists of problems like tree traversal, where there is lots of prior work to consult.
Note 2: To be specific, Marpa parses any LR-regular grammar in linear time -- O(n). LR-regular is a vast class of grammars that includes LALR, LR(k) for all k, and LL(k) for all k. This means that Marpa parses, in linear time, every grammar parsed by yacc, recursive descent and regular expressions.
Note 3: The phrase "considered optimal" elides some irrelevant bits of theory. It would be a bit of a surprise if it is possible to do general BNF parsing in O(n), but nobody has proved that it can't be done. The Valiant algorithm parses general BNF and is O(n**2.376...) or better. But Valiant's algorithm is only faster for huge problems, and for those it needs a machine with many terabytes of main memory to deliver on its speed claims. So it won't be competing with Marpa any time soon.
Note 4: My series on yacc ran in four blog posts: Killing Yacc, Why the Bovicidal Rage?, Bovicide 5: Parse-time Error Reporting, and Bovicide 6: The Final Requirement
Hi Jeffrey
You know I'm a big fan of Marpa, having used it to such great effect in Graph::Easy::Marpa.
Demo here.
For anyone interested in lexing and parsing, about which I knew virtually nothing, getting your head around Marpa is arguably the best way to start.
My apology is in advance if this is just another stupid question, but I can't help thinking that what you call ambiguity (as per [1]) is really a BNF alternation in disguise, all the more that Marpa::Grammar seems to provide no support for BNF alternations directly, unless I'm missing something.
I'd be very grateful for an explanation and thanks a lot for all of your work.
[1]
— http://search.cpan.org/dist/Marpa-PP/pod/Marpa_PP.pod#EXAMPLE_2:_AN_AMBIGUOUS_PARSE
Ambiguity and alternation are two different things. Ambiguity (in a grammar) means that from a single input it can produce more than one parse, which raises the question of which one to use. Alternation is when one symbol appears on the LHS of more than one rule. You can have ambiguity without alternation (sequences of nullable symbols will do it). And it is very common for a useful unambiguous grammar to have alternation.
When I say that Marpa can handle ambiguity, I mean it can handle situations where more than one parse is possible. Whether to have explicit alternation or not is purely a interface/notation issue, whereas ambiguity is about the capabilities of the algorithm.
About "disquising" alternation: Explicit alternation (sometimes using the vertical bar: '|') is syntactic sugar. BNF can be defined without it. Alternation is convenient when defining a theoretical grammar, one where you don't have to specify the semantic and other properties for each RHS. I respect other points of view in the matter, but when each RHS must have properties attached to it, my personal feeling is that explicit alternation looks cluttered.
For the purposes of the above definition, [1] below is an alternation with E being the symbol appearing on the LHS of more than one rule, is it not?
However, it also serves to showcase ambiguity in Marpa's documentation and, as such, "can produce more than one parse".
Hence, [1] somehow serves as both alternation and ambiguity.
But then, another question: suppose a lexem cannot be classified unambigiously at lexing time but can be so classified at parse time based on the preceeding/succedding lexems, as e.g. in "they are jogging" where 'jogging' can be a gerund or a present participle and that ambiguity is resolved by the preceding 'are'.
Can Marpa, given rules like
and token stream like
parse "they are jogging" as 'Sentence' or am I just asking too much and knocking to wrong door?
[1]
@rns: With respect to "jogging" being ambiguous, that is, either a gerund or a present participle: Marpa allows ambiguous tokens. That's documented under "Advanced Input Models": http://search.cpan.org/~jkegl/Marpa-PP-0.008000/pod/Advanced/Models.pod.
Ambiguous tokens are one requirement of natural language parsing that Marpa provides. But natural language parsing presents many other problems, so while I like to think that Marpa might be a useful tool in the field, I am reluctant to advance any claims.
Marpa would have the advantage of being a packaged algorithm in hand-tweaked C code, with good time complexity. It would be faster than chart parsing, but less general. Would Marpa be general enough? Could the best of both worlds be had with a little hackery? I don't know.
@Jeffrey Kegler: thanks for pointing out to the doc on ambiguous tokens.
As for natural language parsing: couldn't agree more, use of Earley in comp-ling notwithstanding. :)
Thanks for a wonderful module and a great speed-up.
@ron, @acme: Thanks for the kind words. Marpa is the product of 4 years of full-time work, and it is my hope that it represents a step forward in the technology of parsing.