A Marpa tutorial: pattern searches

[ This is cross-posted from the new home of the Ocean of Awareness blog. ]

Pattern searches

We use regular expressions for pattern searching these days. But what if your search target is not a regular expression? In this post I will show how to use Marpa to search text files for arbitrary context-free expressions.

This tutorial builds on earlier tutorials. It is possible to simply dive into it, but it may be easier to start with two of my earlier posts, here and here.

The grammar

I will use arithmetic expressions as the example of a search target. Even the arithmetic subset of Perl expressions is quite complex, but in this case we can get the job done with eight lines of grammar and a lexer driven by a table of just over a dozen lines. Here is the grammar:

    
start ::= prefix target
prefix ::= any_token*
target ::= expression
expression ::=
       number | scalar | scalar postfix_op
    || op_lparen expression op_rparen assoc => group
    || unop expression
    || expression binop expression`
    
    

This grammar uses Marpa::R2's BNF interface. It takes considerable advantage of the fact that we are not parsing these expressions, but recognizing them. Because of this, we don't have to specify whether expressions left- or right-associate. We can also ignore what operators mean and group them according to syntax only -- binary, prefix unary and postfix unary. Similarly, we can ignore the precedence within these large groups. This leaves us with numbers, scalars, parentheses, and binary, prefix unary and postfix unary operators. (To keep this example simple, we restrict the primaries to numeric constants and Perl scalars.)

What we are searching for is defined by the target symbol. For target you could substitute the start symbol of any context-free grammar, and the structure of this example will still work. To turn a parser for target into a pattern searcher, we add a new start symbol (unimaginatively named "start") and two rules that allow the target to have a prefix.

Ambiguous parsing

To do an anchorless pattern search, this example will use ambiguous parsing. This grammar always has at least one parse going, representing the prefix for the zero or more targets that our parser expects to find in the future. The prefix will never end, because any token (as indicated by a token named, literally, any_token) extends it.

If we are in the process of recognizing a target, we will have one or more other parses going. I say "one or more" because the search method described in this post allows target to be ambiguous. But arithmetic expressions, the target pattern used in this example, are not ambiguous. So our example will have at most two parses active at any point: one for the prefix and another for the target.

Ambiguous parsing has a serious potential downside -- it is not necessarily linear and therefore not necessarily efficient. But Marpa can parse many classes of ambiguous grammar in linear time. Grammars like the one in this post -- a prefix and an unambiguous search target -- fall into one of the linearly parseable classes. Keeping the prefix going requires a tiny constant overhead per token.

The lexer table

The lexer is driven by a table of pairs: token name and regex.


my @lexer_table = (
    [ number     => qr/(?:\d+(?:\.\d*)?|\.\d+)/xms ],
    [ scalar     => qr/ [\$] \w+ \b/xms ],
    [ postfix_op => qr/ [-][-] | [+][+] /xms ],
    [ unop       => qr/ [-][-] | [+][+] /xms ],
    [   binop => qr/
          [*][*] | [>][>] | [<][<]
        | [*] | [\/] | [%] | [x] \b
        | [+] | [-] | [&] | [|] | [=] | [,]
    /xms
    ],
    [   unop => qr/ [-] | [+] | [!] | [~] /xms
    ],
    [ op_lparen => qr/[(]/xms ],
    [ op_rparen => qr/[)]/xms ],
);

Order is significant here. In particular two-character operators are checked for first. This guarantees that two consecutive minus signs will be seen as an decrement operator, and not as a double negation.

Ambiguous lexing

The very careful reader may have noticed that any_token is not in the lexing table. The main loop is written so that every token is read as an any_token. If no token from the lexing table is accepted, the next character in the input stream is read as an any_token. If a token from the lexing table is accepted, then it gets read twice, once as an any_token, and once as the token type taken from the lexing table entry.

Ambiguous lexing is a familiar technique to the Natural Language Processing community. Engish, in particular, is a language that abounds in lexemes that can play multiple roles. The word "sort", for example, can easily be an noun, a verb or an adjective.

The Ruby Slippers

The main loop will also be a simple case of the use of the Ruby Slippers. For those unfamiliar, the "Ruby Slippers" parsing technique handles difficult lexing and parsing problems by asking the parser, at the problem point, what it is looking for, and providing it. This seems a fairly obvious approach, but the Ruby Slippers are new with Marpa -- traditional parsers could not easily determine where they were in a parse.

One way to use the Ruby Slippers is to ask the parser in advance what it is looking for. The code that follows uses another method. Instead of determining in advance what tokens to read, it simply feeds tokens to the parser.

Token rejection is a "soft" error -- it costs little to try, and little to retry. The following code can efficiently determine which entry in the lexing table is appropriate, simply by trying each of them in order. If the alternative() method returns a Perl undef, indicating that a token was rejected, then the main loop will try later entries in the lexing table.

When a token is accepted, the main loop can safely assume that it is on the right track. Marpa is 100% accurate about which tokens can and cannot result in a successful parse.

The main loop

The main loop iterates through input looking for tokens. Whitespace is skipped. Comments are not skipped. Finding arithmetic expressions in strings and/or comments can be useful. We will assume that is the case here.


my $length = length $string;
pos $string = $positions[-1];
TOKEN: while ( pos $string < $length ) {
    next TOKEN if $string =~ m/\G\s+/gcxms;    # skip whitespace
    my $position = pos $string;
    FIND_ALTERNATIVE: {
        TOKEN_TYPE: for my $t (@lexer_table) {
            my ( $token_name, $regex ) = @{$t};
            next TOKEN_TYPE if not $string =~ m/\G($regex)/gcxms;
            if ( not defined $recce->alternative($token_name) ) {
                pos $string = $position;       # reset position for matching
                next TOKEN_TYPE;
            }
            $recce->alternative('any_token');
            last FIND_ALTERNATIVE;
        } ## end TOKEN_TYPE: for my $t (@lexer_table)
        ## Nothing in the lexer table matched
        ## Just read the currrent character as an 'any_token'
        pos $string = $position + 1;
        $recce->alternative('any_token');
    } ## end FIND_ALTERNATIVE:
    $recce->earleme_complete();
    my $latest_earley_set_ID = $recce->latest_earley_set();
    $positions[$latest_earley_set_ID] = pos $string;
} ## end TOKEN: while ( pos $string < $length )

The earleme_complete() method tells Marpa that all the alternatives at one location have been entered, and that the parse should now move on to the next location. (Marpa's idea of location is called an "earleme", in honor of the great parsing theorist, Jay Earley.)

How to parse without really trying

At this point, I want to draw the reader's attention to the code that deals with special cases for the minus sign. Specifically, to the fact that there is no such code. The more familiar you are with PPI and/or perly.y, the more remarkable this will seem.

To take one example, PPI correctly realizes that the minus sign in "1+2-3" is a binary operator. However PPI fails on "(1+2)-3" -- it thinks the minus sign is part of the number "-3". Why don't the authors of PPI just look at the Perl interpreter and copy the logic there? Take a glance at perly.y and toke.c and you will know the answer to that question.

What is PPI's problem here? The problem is that, without knowing where you are in the expression, you cannot tell whether a minus sign is a unary operator or a binary operator. And the parse engines for PPI and for Perl itself, while quite different in many respects, share a property common to traditional parsers -- in determining context they offer the lexer, respectively, little and no help.

In the code in this example, Marpa's alternative() method is, by accepting and rejecting tokens, guiding the lexer to the right choice. Because of Perl's grammar, a minus sign at a given position cannot be both a unary operator and a binary operator. And Marpa is 100% accurate in its knowledge of which tokens are possible. So Marpa's alternative() method always knows whether a minus sign can be a unary or binary operator and accepts or rejects the token accordingly.

This is the Ruby Slippers in action -- a very simple solution to what for the Perl interpreter and PPI is a very complicated problem. When I developed the Ruby Slippers technique, my most serious problem was convincing myself that something so simple could really work.

Finding the targets

Once the parse is complete, it remains to find and print the "targets" found by the search. In a previous post, I showed how, given a symbol name, to find the last occurrence of the symbol in a Marpa parse. That routine needed to be modified to allow repeated searches, but the change was straightforward. The code is in the gist, and the ideas behind it were explained in the previous post, so I won't repeat them here.

Code and comments

The example in this post is available as a Github gist. It was run with Marpa::R2 2.024000, as of this writing the latest full release. My main test, which is included in the gist, used displays from the perlop man page.

Comments on this post can be sent to the Marpa Google Group: marpa-parser@googlegroups.com

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.