BNF to AST

[ This is cross-posted from the new home of the Ocean of Awareness blog. ]

The latest version of Marpa takes parsing "whipitupitude" one step further. You can now go straight from a BNF description of your language, and an input string, to an abstract syntax tree (AST).

To illustrate, I'll use an example from the Gang of Four's (Go4's) chapter on the Interpreter pattern. (It's pages 243-255 of the Design Patterns book.) The Go4 knew of no easy general way to go from BNF to AST, so they dealt with that part of the interpreter problem by punting -- they did not even try to parse the input string. Instead they constructed the BNF they'd just presented and constructed an AST directly in their code.

The reason the Go4 didn't know of an easy, generally-applicable way to parse their example was that there was none. Now there is. In this post, Marpa will take us quickly and easily from BNF to AST. (Full code for this post can be found in a Github gist.)

The Go4's example was a simple boolean expression language, whose primary input was

true and x or y and not x

Here, in full, is the BNF for an slight elaboration of the Go4 example. It is written in the DSL for Marpa's Scanless interface (SLIF DSL), and includes specifications for building the AST.

:default ::= action => ::array

:start ::= <boolean expression>
<boolean expression> ::=
       <variable> bless => variable
     | '1' bless => constant
     | '0' bless => constant
     | ('(') <boolean expression> (')') action => ::first bless => ::undef
    || ('not') <boolean expression> bless => not
    || <boolean expression> ('and') <boolean expression> bless => and
    || <boolean expression> ('or') <boolean expression> bless => or

<variable> ~ [[:alpha:]] <zero or more word characters>
<zero or more word characters> ~ [\w]*

:discard ~ whitespace
whitespace ~ [\s]+

This syntax should be fairly transparent. In previous posts I've given a tutorial, and a a mini-tutorial. And of course, the interface is documented.

For those skimming, here are a few quick comments on less-obvious features. To guide Marpa in building the AST, the BNF statements have action and bless adverbs. The bless adverbs indicate a Perl class into which the node should be blessed. This is convenient for using an object-oriented approach with the AST. The action adverb tells Marpa how to build the nodes. "action => ::array" means the result of the rule should be an array containing its child nodes. "action => ::first" means the result of the rule should just be its first child. Many of the child symbols, especially literal strings of a structural nature, are in parentheses. This makes them invisible to the semantics.

A :default pseudo-rule specifies the defaults -- in this case the "action => ::array" adverb setting. The :start pseudo-rule specified the start symbol. The :discard pseudo-rule indicates that whitespace is to be discarded.

The Go4 did not deal with precedence. In their example, the input string is fully parenthesized, even though its priorities are the standard ones. I've eliminated the parentheses, because the standard precedence is implemented in SLIF grammar. The double vertical bar ("||") is a "loosen" operator -- an alternative after "loosen" operator will be at a looser precedence than the one before. Alternatives separated by a single bar are at the same precedence.

Creating the AST

Creating the AST is simple. First, we use Marpa to turn the above DSL for boolean expressions into a parser. (We'd saved the SLIF DSL source in the string $rules.)

my $grammar = Marpa::R2::Scanless::G->new(
    {   bless_package => 'Boolean_Expression',
        source        => \$rules,
    }   
);  

Next we define a closure that uses $grammar to turn BNF into AST's.

sub bnf_to_ast {
    my ($bnf) = @_;
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    $recce->read( \$bnf );
    my $value_ref = $recce->value();
    if ( not defined $value_ref ) {
        die "No parse for $bnf";
    }
    return ${$value_ref};
} ## end sub bnf_to_ast

Where $bnf is our input string, we run it as follows:

my $ast1 = bnf_to_ast($bnf);

The AST

If we use Data::Dumper to examine the AST,

say Data::Dumper::Dumper($ast1) if $verbose_flag;

we see this:

$VAR1 = bless( [
                 bless( [
                          bless( [
                                   'true'
                                 ], 'Boolean_Expression::variable' ),
                          bless( [
                                   'x'
                                 ], 'Boolean_Expression::variable' )
                        ], 'Boolean_Expression::and' ),
                 bless( [
                          bless( [
                                   'y'
                                 ], 'Boolean_Expression::variable' ),
                          bless( [
                                   bless( [
                                            'x'
                                          ], 'Boolean_Expression::variable' )
                                 ], 'Boolean_Expression::not' )
                        ], 'Boolean_Expression::and' )
               ], 'Boolean_Expression::or' );

Processing the AST

In their example, the Go4 processed their AST in several ways: straight evaluation, copying, and substitution of the occurrences of a variable in one boolean expression by another boolean expression. It is obvious that the AST above is the computational equivalent of the Go4's AST, but for the sake of completeness I carry out the same operations in the Github gist.

AST creation via Marpa's SLIF is self-hosting -- the SLIF DSL is parsed into an AST, and a parser created by interpreting the AST. The Marpa SLIF DSL source file in this post, that describes boolean expressions, was itself turned into an AST on its way to becoming a parser that turns boolean expressions into AST's.

Comments

Comments on this post can be sent to the Marpa Google Group: marpa-parser@googlegroups.com

About Jeffrey Kegler

user-pic I blog about Marpa, my parsing algorithm, and other things of interest to techies.