Announcing Marpa::XS 0.8.0

I have just released Marpa::XS 0.008000. With this release the core Marpa algorithm has been converted to C, vastly speeding it up. Marpa::XS is still alpha, but the additional development needed at this point is a matter of packaging (See Note 1).

It is my hope that Marpa will become the standard parsing algorithm for problems too big for regular expressions.

  • Marpa parses all classes of grammar that are currently in practical use in linear time. (See Note 2).
  • Marpa is a general BNF parser -- that means if you feed it anything written in BNF, it "just parses" it. This includes grammars which are left-recursive, right-recursive and ambiguous -- even infinitely ambiguous.
  • Marpa never goes exponential -- worst case, even for highly ambiguous grammars, is O(n**3), which is considered optimal (See Note 3).

Limitations and Comparisons

The foremost limitation of Marpa is, of course, that it is alpha. Development is well advanced, but the interface remains subject to change.

There are several parsing tasks where current technology will be superior even to a stable, production Marpa. For problems which work well as regexes (compact and with no backtracking), Marpa will never replace a good regular expression engine. On the other hand, Marpa may well be the answer to a lot of problems which are forced to fit into the regular expression paradigm for lack of anything better. Hand-written recursive descent could compete with Marpa, but not having to write your own parser is a huge advantage.

I've made no secret that I consider yacc and LALR theoretical milestones and industry traditions which are best honored in the breach. I've pointed out LALR's horrendous error-handling properties elsewhere (See Note 4). LALR does offer some apparent speed advantages, but in the context of practical applications these are more apparent than real.


Note 1: Inside the code are many relics of the development process, and these need to be cleaned up. The user documentation is finished, but the internals documentation needs a lot of work.

Some additional C code may be written to help in the evaluation phase but, while this code could produce some significant additional efficiencies, I don't consider it part of the core Marpa development project for two reasons. First, evaluation is to a large degree a matter of calling higher-level and application code.

Second, the core Marpa algorithm was new and extremely difficult mathematics -- built on top of a large body of prior work to be sure, but new nonetheless. With the evaluation code, my original thinking involves making making choices of what to use in existing mathematics -- evaluation consists of problems like tree traversal, where there is lots of prior work to consult.

Note 2: To be specific, Marpa parses any LR-regular grammar in linear time -- O(n). LR-regular is a vast class of grammars that includes LALR, LR(k) for all k, and LL(k) for all k. This means that Marpa parses, in linear time, every grammar parsed by yacc, recursive descent and regular expressions.

Note 3: The phrase "considered optimal" elides some irrelevant bits of theory. It would be a bit of a surprise if it is possible to do general BNF parsing in O(n), but nobody has proved that it can't be done. The Valiant algorithm parses general BNF and is O(n**2.376...) or better. But Valiant's algorithm is only faster for huge problems, and for those it needs a machine with many terabytes of main memory to deliver on its speed claims. So it won't be competing with Marpa any time soon.

Note 4: My series on yacc ran in four blog posts: Killing Yacc, Why the Bovicidal Rage?, Bovicide 5: Parse-time Error Reporting, and Bovicide 6: The Final Requirement


Hi Jeffrey

You know I'm a big fan of Marpa, having used it to such great effect in Graph::Easy::Marpa.

Demo here.

For anyone interested in lexing and parsing, about which I knew virtually nothing, getting your head around Marpa is arguably the best way to start.

grammars which are left-recursive, right-recursive and ambiguous – even infinitely ambiguous

My apology is in advance if this is just another stupid question, but I can't help thinking that what you call ambiguity (as per [1]) is really a BNF alternation in disguise, all the more that Marpa::Grammar seems to provide no support for BNF alternations directly, unless I'm missing something.

I'd be very grateful for an explanation and thanks a lot for all of your work.

rules => [
    [ 'E', [qw/E Add E/],      'do_add' ],
    [ 'E', [qw/E Multiply E/], 'do_multiply' ],
    [ 'E', [qw/Number/], ],
Alternation is when one symbol appears on the LHS of more than one rule.

For the purposes of the above definition, [1] below is an alternation with E being the symbol appearing on the LHS of more than one rule, is it not?

However, it also serves to showcase ambiguity in Marpa's documentation and, as such, "can produce more than one parse".

Hence, [1] somehow serves as both alternation and ambiguity.

But then, another question: suppose a lexem cannot be classified unambigiously at lexing time but can be so classified at parse time based on the preceeding/succedding lexems, as e.g. in "they are jogging" where 'jogging' can be a gerund or a present participle and that ambiguity is resolved by the preceding 'are'.

Can Marpa, given rules like

rules => [
[ 'Sentence', [qw/Pronoun Verb PresentParticiple/], 'do_sentence' ],

and token stream like

$recce->read( 'they', 'Pronoun');
$recce->read( 'are', 'Verb');
$recce->read( 'jogging', 'Gerund', 'PresentParticiple'); # ambigious lexem

parse "they are jogging" as 'Sentence' or am I just asking too much and knocking to wrong door?

rules => [
[ 'E', [qw/E Add E/], 'do_add' ],
[ 'E', [qw/E Multiply E/], 'do_multiply' ],
[ 'E', [qw/Number/], ],

@Jeffrey Kegler: thanks for pointing out to the doc on ambiguous tokens.

As for natural language parsing: couldn't agree more, use of Earley in comp-ling notwithstanding. :)

Thanks for a wonderful module and a great speed-up.

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.