A Marpa tutorial: iterative parser development

[ This is cross-posted from the new home of the Ocean of Awareness blog. ]

Developing a parser iteratively

This post describes a manageable way to write a complex parser, a little bit at a time, testing as you go. This tutorial will "iterate" a parser through one development step. As the first iteration step, we will use the example parser from the previous tutorial in this series, which parsed a Perl subset.

You may recall that the topic of that previous tutorial was pattern search. Pattern search and iterative parser development are essentially the same thing, and the same approach can be used for both. Each development stage of our Perl parser will do a pattern search for the Perl subset it parses. We can use the accuracy of this pattern search to check our progress. The subset we are attempting to parse is our "search target". When our "searches" succeed in finding all instances of the target, we have successfully written a parser for that subset, and can move on to the next step of the iteration.

What we need to do

This tutorial is the latest of a series, each of which describes one self-contained example of a Marpa-based parser. In this tutorial we use the example from the previous tutorial as the first iteration step in the iterative development of a Perl parser. For the iteration step in this example, we will add two features.

  • The previous iteration step was more of a recognizer than a parser. In particular, its grammar was too simplified to support a semantics, even for the Perl subset it recognized. We will fix that.

  • Having amplified the grammar, we will add a semantics, simple, but quite powerful enough to use in checking our progress in developing the parser.

The grammar

Here is our grammar from the previous post:

    
start ::= prefix target
prefix ::= any_token*
target ::= expression
expression ::=
       number | scalar | scalar postfix_op
    || op_lparen expression op_rparen assoc => group
    || unop expression
    || expression binop expression`
    
    

The format is documented here. These eight lines were enough to descibe arithmetic expressions sufficiently well for a recognizer, as well as to provide the "scaffolding" for the unanchored search. Nice compression, but now that we are talking about supporting a Perl semantics, we will need more.

Adding the appropriate grammar is a matter of turning to the appropriate section of the perlop man page and copying it. I needed to change the format and name the operators, but the process was pretty much rote, as you can see:

    
my $perl_grammar = Marpa::R2::Grammar->new(
    {   start          => 'start',
        actions        => 'main',
        default_action => 'do_what_I_mean',
        rules          => [ <<'END_OF_RULES' ]
start ::= prefix target action => do_arg1
prefix ::= any_token* action => do_undef
target ::= expression action => do_target
expression ::=
     number
   | scalar
   | op_lparen expression op_rparen assoc => group
  || op_predecrement expression
   | op_preincrement expression
   | expression op_postincrement
   | expression op_postdecrement
  || expression op_starstar expression assoc => right
  || op_uminus expression
   | op_uplus expression
   | op_bang expression
   | op_tilde expression
  || expression op_star expression
   | expression op_slash expression
   | expression op_percent expression
   | expression kw_x expression
  || expression op_plus expression
   | expression op_minus expression
  || expression op_ltlt expression
   | expression op_gtgt expression
  || expression op_ampersand expression
  || expression op_vbar expression
   | expression op_caret expression
  || expression op_equal expression assoc => right
  || expression op_comma expression
END_OF_RULES
    }
);
    
    

The lexer

The lexer is table-driven. I've used this same approach to lexing in every post in this tutorial series. Those interested in an explanation of how the lexer works can find one in the first tutorial. Having broken out the operators, I had to rewrite the lexing table, but that was even more rote than rewriting the grammar. I won't repeat the lexer table here -- it can be found in the Github gist.

Adding the semantics

Our semantics will create a syntax tree. Here is that logic. (Note that the first argument to these semantic closures is a per-parse "object", which we don't use here.)

    
sub do_undef       { undef; }
sub do_arg1        { $_[2]; }
sub do_what_I_mean { shift; return $_[0] if scalar @_ == 1; return \@_ }

sub do_target {
    my $origin = ( Marpa::R2::Context::location() )[0];
    return if $origin != $ORIGIN;
    return $_[1];
} ## end sub do_target
    
    

There is some special logic in the do_target() method, involving the "origin", or starting location of the target. Perl arithmetic expressions, when they are the target of an unanchored search, are ambiguous. For example, in the string "abc 1 + 2 + 3 xyz", there are two targets ending at the same position: "2 + 3" and "1 + 2 + 3". We are interested only in longest of these, whose start location is indicated by the $ORIGIN variable.

The next logic will be familiar from our pattern search tutorial. It repeatedly looks for non-overlapping occurrences of target, starting from the end and going back to the beginning of the input.

    
my $end_of_search;
my @results = ();
RESULTS: while (1) {
    my ( $origin, $end ) =
        $self->last_completed_range( 'target', $end_of_search );
    last RESULTS if not defined $origin;
    push @results, [ $origin, $end ];
    $end_of_search = $origin;
} ## end RESULTS: while (1)
    
    

This final code sample is the logic that unites pattern search with incremental parsing. It is a loop through @results that prints the original text and, depending on a flag, its syntax tree.

Near the top of the loop, the "$recce->set( { end => $end } )" call sets the end of parse location to the current result. At the bottom of the loop, we call "$recce->reset_evaluation()". This is necessary to allow us to evaluate the input stream again, but with a new $end location.

    
RESULT: for my $result ( reverse @results ) {
    my ( $origin, $end ) = @{$result};

    ... Print out the original text ...

    $recce->set( { end => $end } );
    my $value;
    VALUE: while ( not defined $value ) {
        local $main::ORIGIN = $origin;
        my $value_ref = $recce->value();
        last VALUE if not defined $value_ref;
        $value = ${$value_ref};
    } ## end VALUE: while ( not defined $value )
    if ( not defined $value ) {
        say 'No parse'
            or die "say() failed: $ERRNO";
        next RESULT;
    }
    say Data::Dumper::Dumper($value)
        or die "say() failed: $ERRNO"
        if not $quiet_flag;
    $recce->reset_evaluation();
} ## end RESULT: for my $result ( reverse @results )
    
    

The VALUE sub-loop is where the $ORIGIN variable was set. In the semantics, do_target() checks this. In the case of an ambiguous parse, do_target() turns any target which does not cover the full span from $origin to $end into a Perl undef, which will eventually become the value of its parse. The logic in the VALUE loop ignores parses whose value is a Perl undef, so that only the longest target for each $end location is printed.

Code and comments

The example in this post is available as a Github gist. It was run with Marpa::R2 2.024000, as of this writing the latest full release. Its main test, which is included in the gist, used displays from the perlop man page.

Comments on this post can be sent to the Marpa Google Group: marpa-parser@googlegroups.com

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.