Precedence parsing made simpler

By Jeffrey Kegler on August 22, 2012 9:58 AM

This post describes a new approach to precedence parsing, one that makes it simpler and more flexible. Many programmers find precedence is an intuitive way to look at problems. The traditional rules of arithmetic are a familiar example:

E ::= ( E )
E ::= n
E ::= E * E
E ::= E + E

Here, as in the rest of this post, the rules are ranked from tightest ("highest") precedence to loosest ("lowest"). The order in the above display indicates that multiplication takes precedence over addition, and parentheses take precedence over everything else.

The old way and the new way

The traditional way to deal with precedence centers on symbols. The symbols are divided sharply in two: those that define structure and those that carry information. The structural symbols (often called "operators") are assigned associativities and precedence. To help guide the parse, symbols may be further classified as infix, prefix, circumfix, etc.

Many rules, even those which intuitively seem part of the precedence order, do not fit into this symbol-oriented view of precedence. Implied operators are ruled out, as is any rule with two non-terminals in a row. Rules with an arity of 3 or above, when not also impossible, are a challenge.

The approach of this post is 100% rule-based. There is no attempt to identify operators or structural symbols, and no attempt to assign properties to them. This rule-based approach allows the convenient expression and efficient implementation of implied operators, of rules of arity 3 or higher, and of rules with any pattern of terminals or non-terminals.

Simpler

Before getting into new features, it is probably best to show the new approach as applied to a grammar that can be parsed with the traditional methods. My notation is mostly standard or transparent, but here are details:

    ::=       separates the lhs of a rule from its rhs alternatives
    |         separates alteratives at the same precedence level
    ||        separates alteratives at different precedence levels
    :group    indicates 'grouping' associativity
    :left     indicates left associativity (the default)
    :right    indicates right associativity

Here is the grammar:

e ::=
     NUM
   | VAR
   | :group '(' e ')'
  || '-' e
  || :right e '^' e
  || e '*' e
   | e '/' e
  || e '+' e
   | e '-' e
  || VAR '=' e

The above fully states the precedence and associativity for the grammar's rules. (As a reminder, the precedence follows the order of the rules, from tightest to loosest.) This is significantly simpler than what is required to set up a traditional precedence parser. On the other hand, intuitively, it looks like all the required information is there. And, in fact, this is the source from which Marpa::Demo::OP1 creates the grammar for a calculator. The code is a Github gist.

In real life, users of a calculator grammar, like the above, will be interested in a numeric result. However, in this post we are not interested in double-checking Perl's ability to do basic arithmetic, so instead we capture the syntactic structure that the calculator creates. Here are sample outputs, with square brackets added to show the parse.

Input: "4 * 3 + 42 / 1"
  Parse: [[4*3]+[42/1]]
Input: "4 * 3 / (a = b = 5) + 42 - 1"
  Parse: [[[[4*3]/[([a=[b=5]])]]+42]-1]
Input: "4 * 3 /  5 - - - 3 + 42 - 1"
  Parse: [[[[[4*3]/5]-[-[-3]]]+42]-1]
Input: "- a - b"
  Parse: [[-a]-b]
Input: "1 * 2 + 3 * 4 ^ 2 ^ 2 ^ 2 * 42 + 1"
  Parse: [[[1*2]+[[3*[4^[2^[2^2]]]]*42]]+1]

More flexible

In the next grammar, I'll introduce an implied operator. An implied operator is prominent among the features that traditional precedence parsers simply could not handle. In the grammar that follows, a missing operator will indicate multiplication, just as in algebra.

Traditional precedence parsers also were stymied by rules with an arity of 3 or more. For Marpa::Demo::OP1, these are no problem at all. I'll introduce two ternary operations, and a quaternary operation. (New in the notation below is the "=> xyz", which specifies a non-default semantics, in this case "xyz()".)

e ::=
     NUM
   | VAR
   | :group '(' e ')'
  || '-' e
  || :right e '^' e
  || e '*' e
   | e e                                          => implied_multiply
   | e '/' e
  || e '+' e
   | e '-' e
  || VAR '=' e
  || :right e '?' e ':' e                         => spaced
   | :right e '??' e ':' e ':' e                  => spaced
  || 'payment' 'on' e 'over' e 'years' 'at' e '%' => spaced

The code for this second example is also a Github gist. And here is the output. (To make it easy to spot them, implied multiplications are shown with an "x" instead of a "*".)

Input: "4 3 42 + 1"
  Parse: [[[4 x 3] x 42]+1]
Input: "e = m c^2"
  Parse: [e=[m x [c^2]]]
Input: "4 * 3 5 (6 7) 8 9 10"
  Parse: [[[[[[4*3] x 5] x [([6 x 7])]] x 8] x 9] x 10]
Input: "1 ? 42 : 2 ?? 3 : 4 : 5 ? 6 : 7"
  Parse: [1 ? 42 : [2 ?? 3 : 4 : [5 ? 6 : 7]]]
Input: "payment on 1000 + 1000 over months/12 years at 5 + 1 %"
  Parse: [payment on [1000+1000] over [months/12] years at [5+1] %]

How rule-based precedence works

Rule-based precedence parsing uses Marpa, a new and efficient general BNF parsing algorithm. With Marpa, the rest is straightforward. The grammars in the examples above are rewritten, using the included precedence and associativity information, into an "order-explicit grammar". The BNF of an order-explicit grammar enforces the precedence and associativity of the original, source grammar. Many textbooks on parsing describe how to write order-explicit BNF by hand.

Creating an algorithm to produce an order-explicit BNF grammar required some careful thought, but no flashes of brilliance. Previously, the major obstacle to this approach would have been the parse engine. Traditional parsers did not "just parse" arbitrary BNF -- far from it. Without (and often even with) programmer intervention, there would be little reason to hope that an LALR or LL parse engine would parse an arbitrary order-explicit BNF grammar. Marpa, on the other hand, does "just parse" arbitrary BNF, and a successful parse is guaranteed.

Any grammar which could have been parsed by yacc (LALR) or an operator-precedence parser will be parsed by Marpa in linear time. LALR and operator precedence are subsets of LR(1), while Marpa is linear for LR-regular, and for LR(k) for all k. This means that Marpa will stay linear for vast classes of grammars that the traditional techniques had no hope of ever parsing.

Acknowledgements

This post is the outcome of a line of thinking started by an exchange with Alberto Simões, begun when he graciously shared with me a pre-release copy of an article on lexical analysis, which he authored jointly. And, in creating the DSL used for the examples, I benefited immensely from studying the approaches used by Peter Stuifzand.

2 comments

Tagged as:

Earley, Marpa, parser, parsing

2 Comments

tlrrd | August 23, 2012 4:30 AM

Did marpa really use 8GB of memory in that paper, or did I missread table 3?

Jeffrey Kegler replied to comment from tlrrd | August 23, 2012 8:14 AM

No, you read it right. Marpa did not do well in the test in Alberto's table 3.

What happened was that Alberto wanted Marpa to do precedence parsing directly, without rewriting. The deadline was near, so I suggested using Marpa's ability to do ambiguous parsing. This was not a good suggestion on my part and (as Alberto says on p. 45) the result was not completely fair to Marpa. Alberto reran with the grammar rewritten for Marpa, and got the numbers reported on page 46 of the text, just below the tables: 9.218295 seconds, and 279 MB.

This was my motivation for the method described in the post. With it, Marpa handles precedence directly -- much better than yacc does. yacc will handle the first example in the post, but not the second.

In fact, yacc does not really do precedence parsing directly -- it uses precedence to resolve conflicts, which is a quite different thing. In practice, that means that yacc sometimes gives you a grammar with the precedence you specify, and sometimes silently gives you a grammar that parses some other language. It was nasty behaviors like this and the requirement for programmer-tweaking to parse most practical grammars, that made yacc's greatest practitioners (Thompson, the GNU compiler group, Larry Wall) turn their back on yacc and LALR parsing in general.

To be clear, if your grammar is a yacc-able grammar, and if you can put up with getting it working and maintaining it, yacc will run faster than Marpa. Marpa has to compete in this race but, given yacc's other issues, this is not a race Marpa needs to win.

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.

More info »

Ocean of Awareness