Developing parsers incrementally with Marpa

By Jeffrey Kegler on January 13, 2012 8:58 AM

Marpa::XS is a general context-free parser. What does that mean? For a grammar writer, it means that he doesn't need to worry that the next rule he adds to the grammar is the one that makes it hit the invisible wall that most other parser generators set up. If you can write it in BNF, Marpa will parse it. Which makes Marpa::XS good at incremental development.

As you develop your Marpa grammar, you can track the tradeoffs you are making between features and efficiency. Chances are, if your grammar is unambiguous or lightly ambiguous, there are no tradeoffs -- you're getting everything you want in linear time. Marpa is linear for every class of grammar currently in practical use.

Marpa is also linear with many ambiguous grammars and, in the worst case, Marpa's time complexity is what is accepted as optimal in practice. Whatever the time complexity that you're seeing with Marpa, it's probably as good or better than you're going to get from another parser generator.

One way to start the semantics

As you build your grammar out, of course, you'll need a semantics. I often start my Marpa semantics with a single routine, something like:

sub My_Actions::do_what_I_mean {

    # The first argument is the per-parse variable.
    # At this stage, just throw it away
    shift;

    # Throw away any undef's
    my @children = grep { defined } @_;

    # Return what's left
    return scalar @children > 1 ? \@children : shift @children;
}

When defining the grammar, I set do_what_I_mean as the default action:

    default_action => 'My_Actions::do_what_I_mean',

This one function is enough to get me started working out the semantics of my parse tree. At every node of the parse tree, it throws away all the undefined child values. If no child value remains, an undef is returned. If only one child value remains, it is the result. If more than one child value remains, a reference to an array of them is returned. The result, when handed to Data::Dumper, is a reasonable representation of your parse tree.

The "per-parse variable" is also thrown away. By default, this is an empty hash which can be used as a kind of global scratchpad in the parse. This will be useful, for example, if your language has a symbol table.

The do_what_I_mean function produces a first cut at a parse tree, which I can hack away at. For example, if I'm writing a calculator, I can change the actions rule by rule to perform the calculations.

Your do_what_I_mean function might vary, depending on your application. You might find it is better to keep all the undefineds, for example. Also, once your grammar is complete, you should check to see if you can do without a default_action. If you can, eliminating it is an efficiency gain. This efficiency gain will be even bigger in Marpa::R2, Marpa's next release.

2 comments

Tagged as:

Marpa Earley parsing parser

2 Comments

Jakub Narebski | January 13, 2012 1:08 PM

I have read it somewhere that C++ has undecidable grammar. Does it mean that it cannot be parsed using Marpa?

Jeffrey Kegler replied to comment from Jakub Narebski | January 13, 2012 5:57 PM

@Jakub: C++ is not one of the languages Wikipedia lists as having an undecidable syntax. (The list is at the end of this section.) Marpa really isn't an issue here either way -- Marpa cannot parse 100% of an undecidable language, but neither can anything else. Incidentally, with undecidable languages, the important thing is what portion of the programs you'd actually write are undecidable. The trick (yet another thing to be credited to Larry Wall) is to get the power that comes with undecidability, without ever having to actually pay the price.

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.

More info »

Ocean of Awareness