A self-parsing and self-lexing grammar

[ This is cross-posted from the new home of the Ocean of Awareness blog. ]

In a previous post, I showed a self-parsing grammar, written in Marpa's new BNF interface. That grammar was in a tradition going back to the 70's. Following the tradition, I cheated a bit. That grammar required, but did not include, a lexer to make a prepass over its input.

This post contains a self-parsing and self-lexing grammar, the one for Marpa's forthcoming Scanless interface. This grammar is about as self-contained as a grammar can get, short of being encoded into a Universal Turing machine.

Many readers will prefer to be introduced to the Scanless interface by way of a simpler example, but based on the response to the previous post I know there are others who share my fascination with self-description and self-exemplification. And there is something to be said for reading an example that is a final authority on itself.

This is certainly a practical example. The grammar that follows is used to parse itself and all other grammars written for the Marpa's Scanless interface. It is also used to parse the strings written for Marpa's BNF interface, the Scanless interface's predecessor.

Starting out

The grammar in this blog post is abridged a bit, and rearranged for ease of explanation. The original is here.

# Copyright 2012 Jeffrey Kegler

The file starts with legalese, which I've cut. (The grammar is under GNU's LGPL 3.0.) Note the hash comment -- since this is a self-describing self-lexer, the grammar will eventually tell us how it deals with hash comments.

:start ::= rules
rules ::= rule+
rule ::= <start rule>
  | <empty rule>
  | <priority rule>
  | <quantified rule>
  | <discard rule>

The rules symbol is the start symbol. Our grammar consists of a series of one or more rules, each one of which falls into one of five types.

By my count, three of the rules in our self-describing grammar are themselves self-describing. The first and last rule above are two of them. The definition of rules is itself a series of one or more rule's. And the definition of rule is a <priority rule>, which is one of the 5 types of rule that are allowed.

The first two rules exemplify two of the other rule types: the first is a <start rule> and the second is a <quantified rule>. A start rule is defined as consisting of the :start pseudo-symbol, followed by a ::= operator, followed by a symbol:

<start rule> ::= (':start' <op declare bnf>) symbol
<op declare bnf> ~ '::='

The parentheses can be ignored. Their purpose is to make life easier on the semantics -- they surround symbols that are hidden from the semantics.

"What I say" versus "what I mean"

When we define a grammar rule, sometimes we are asking Marpa to do exactly what we say, character for character. For example, in the "<op declare bnf> ~ '::='" rule above, we are saying that the symbol <op declare bnf> is exactly the string '::=', nothing more and nothing less. "Do exactly what I say" rules are called lexical rules, and for them we use the match operator ("~").

In other cases, we are asking Marpa to "do what I mean". That is, we are saying that "the rule essentially consists of these symbols, but I also want you to do what is reasonable with whitespace and comments." An example of a "do what I mean" rule was "rules ::= rule+". Within, between, before and after the rule symbols, there will be comments and whitespace. The whitespace will sometimes be optional, and will sometimes be required.

Spelling all the comment and whitespace handling out would be very tedious and error-prone. We want there to be rules of a kind that "do what we mean". Marpa's "do what I mean" rules are called "structural" rules, and are identified by the BNF operator ("::=").

Some structural rules have lexical content within them. An example was "<start rule> ::= (':start' <op declare bnf>) symbol". That is basically a structural rule, where Marpa should "do what I mean" with whitespace and comments. But it also contains a string, ':start'. When a string or a character class occurs inside a structural rule, Marpa knows how to treat them properly. Marpa knows, for example, that whitespace is not OK between the "a" and the "r" of ":start".

The "what I mean" versus "what I say" distinction corresponds very closely to the distinction in Perl 6 grammars between the "rule" and "token". It also corresponds to the traditional division of labor in compilers, between the lexer and the parser proper.

Rules

Above we saw an example of a quantified rule: "rules ::= rule+". Here is the definition:

<quantified rule>
  ::= lhs <op declare>
    <single symbol> quantifier <adverb list>
lhs ::= <symbol name>
<op declare>
  ::= <op declare bnf> | <op declare match>
<op declare match> ~ '~'
quantifier ::= '*' | '+'

A quantified rule contains a left hand side (LHS) symbol name, one of the two declaration operators, a single symbol, a plus or minus "quantifier", and an adverb list. The adverb list can be empty, as it was in our example. (In fact the adverb list has been empty in every rule so far.)

Next come the two rule types that we've yet to see:

<discard rule>
  ::= (':discard' <op declare match>) <single symbol>
<empty rule> ::= lhs <op declare> <adverb list>

We'll explain what a "discard rule" is when we encounter one. An empty rule indicates that its LHS symbol is nullable. We won't encounter an empty rule in this grammar.

<priority rule> ::= lhs <op declare> priorities
priorities ::= alternatives+
    separator => <op loosen> proper => 1
<op loosen> ~ '||'
alternatives ::= alternative+
    separator => <op equal priority> proper => 1
alternative ::= rhs <adverb list>
<op equal priority> ~ '|'

Most rules, including most of the rules we've already seen, are priority rules. Priority rules are so-called because in their most complicated form they can express a precedence scheme. The typical rule in a grammar is a priority rule with only one priority -- a "simple" priority rule. All priority rules in this grammar will be of the simple kind.

Within priorities, there can be alternatives, and we have seen examples of these. The self-defining rule that defined a rule stated that it was one of a set of 5 possible rule types (priority rule being among those types). In that self-defining rule, the different types of rule were alternatives within a single priority.

And, as long as we are on the subject of self-defining rules, there are three of them in this grammar, of which we have previously identified two. The definition of <priority rule> is the third and last self-defining rule -- it is itself a priority rule.

Symbols

We've used symbol, <symbol name>, and <single symbol> a few times. It's time to see how they are defined:

<single symbol> ::=
    symbol
  | <character class>
symbol ::= <symbol name>
<symbol name> ::= <bare name>
<symbol name> ::= <bracketed name>
<bare name> ~ [\w]+
<bracketed name> ~ '<' <bracketed name string> '>'
<bracketed name string> ~ [\s\w]+

At this point, symbol and <symbol name> are essentially the same thing: someday there may be another way to specify symbols other than by name. <single symbol> means any expression guaranteed to produce a single symbol. symbol is obviously one; a character class is the other.

The rules above contain our first mention of character classes and, by coincidence, our first use of character classes. Character classes are enclosed in square brackets, and look exactly like Perl character classes. In fact, they are implemented as Perl character classes, memoized for efficiency.

Now that we know what a symbol can be, let's look at how right hand sides are built up:

rhs ::= <rhs primary>+
<rhs primary> ::= <single symbol>
<rhs primary> ::= <single quoted string>
<rhs primary> ::= <parenthesized rhs primary list>
<parenthesized rhs primary list>
  ::= ('(') <rhs primary list> (')')
<rhs primary list> ::= <rhs primary>+

A right hand side (RHS) is a sequence of one or more RHS "primaries". A RHS primary can be a single symbol, a string in single quotes, or a sublist of one or more RHS primaries in parentheses.

Adverbs

<adverb list> ::= <adverb item>*
<adverb item> ::=
      action
    | <left association>
    | <right association>
    | <group association>
    | <separator specification>
    | <proper specification>

Adverb lists are lists of zero or more adverbs, which can be of one of six kinds. Of these six, four do not occur in this grammar:

<left association> ::= ('assoc' '=>' 'left')
<right association> ::= ('assoc' '=>' 'right')
<group association> ::= ('assoc' '=>' 'group')
action ::= ('action' '=>') <action name>
<action name> ::= <bare name>

Three of the unused adverbs have to do with the associativity (right/left/group) of priorities. Since all our "prioritized" rules are trivial (have only one priority), this grammar does not use them. (Their use is described in the documentation of Marpa's BNF interface.) We also will not see action adverbs in this grammar, for reasons explained below.

Here are the two adverbs that we do see:

<separator specification>
  ::= ('separator' '=>') <single symbol>
<proper specification> ::= ('proper' '=>') boolean
boolean ~ [01]

These adverbs are used for quantified rules. One specifies a "separator" that can go between items of the series. The other specifies whether separation is "proper" or not. (When proper is 0, a separator is allowed after the last item of a series. When that is the case, the separator does not really always separate two items and in that sense the separator is not "proper".)

Discarded tokens

:discard ~ whitespace
whitespace ~ [\s]+

The two rules say that sequences of whitespace are recognized as tokens, then discarded. Perl-style comments are handled in the same way:

# allow comments
:discard ~ <hash comment>
<hash comment> ~ <terminated hash comment>
  | <unterminated final hash comment>
<terminated hash comment>
  ~ '#' <hash comment body> <vertical space char>
<unterminated final hash comment>
  ~ '#' <hash comment body>
<hash comment body> ~ <hash comment char>*
<vertical space char> ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
<hash comment char> ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]

"Unterminated final hash comments" deal with the special case of hash comments at the end of a file, when that file is not properly terminated with a newline. The <unterminated final hash comment> symbol is an example of how a long angle bracketed symbol name can make things clearer. Without the long name, it might not be evident what that rule and symbol were trying to accomplish.

Long symbol names have a cleaner look than comments, but they are not a panacea. They have the special advantage that the description goes wherever the symbol name goes. But when the description is too long, that advantage becomes a disadvantage.

Strings

In the next snippet, defining single-quoted strings, the description is clearly too long for the symbol name, so much of it does go into a comment.

# In single quotes strings and character classes
# no escaping or internal newlines, and disallow empty string
<single quoted string>
  ~ ['] <string without single quote or vertical space> [']
<string without single quote or vertical space>
  ~ [^'\x{0A}\x{0B}\x{0C}\x{0D}\x{0085}\x{2028}\x{2029}]+

Note two Unicode vertical whitespace codepoints, U+2028 and U+2029, are included. The implementation only supports 7-bit ASCII, so for the moment these accomplish nothing. But when Unicode support is added, the grammar won't need to be changed.

Character classes

Finally, there are character classes:

<character class> ~ '[' <cc string> ']'
<cc string> ~ <cc character>+
<cc character> ~ <escaped cc character>
  | <safe cc character>
<escaped cc character> ~ '\' <horizontal character>

# hex 5d is right square bracket
<safe cc character>
  ~ [^\x{5d}\x{0A}\x{0B}\x{0C}\x{0D}\x{0085}\x{2028}\x{2029}]

# a horizontal character is any character that is not vertical space
<horizontal character> ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]

These are Perl character classes, and are passed unaltered to Perl for interpretation. Marpa needs to recognize that they start and end with square brackets. And it also must recognize enough of their internals to deal with escaped square brackets, which makes them the most complicated lexeme of the grammar.

Longest tokens matching

When there is a choice of lexicals, Marpa follows a longest tokens matching strategy. The effect is usually that it does what you mean. (In part this is because longest token match is the usual default for regular expressions and lexical analyzers, so that programmers are trained by their tools to really mean a longest match whenever they specify a match.)

Most of what longest-token-match does is obvious. For example, in the rule "weeknights ~ 'Mon' | 'Tue' | 'Wed' | 'Thu'", it recognizes that weeknights is one symbol and not two symbols like "wee" and "knights". Longest tokens match means that if you have a grammar where you specify both ++ and + as operators, Marpa will always prefer the longer operator: ++.

Because this is Marpa, there is a slight difference from the traditional longest token matching. Note that in Marpa's matching strategy, "tokens" is plural. If more than one possibility has the same length, Marpa will try them all. This plays a role in our meta-grammar. For example, separator is a keyword. But it is also a valid symbol name. Marpa allows it to be both, and figures out which is meant at the structural level, based on context.

Semantics

In practice, a grammar is usually tied tightly to a single semantics. This is an exception. The Scanless interface's meta-grammar is also the grammar for Marpa's BNF interface, and the BNF interface has a different semantics.

For most grammars in Marpa's BNF or Scanless interface, the semantics would be specified using action adverbs. Examples of the normal method of specifying semantics are in the documentation for the BNF interface and for the Scanless interface.

In this special-case grammar, there are no action adverbs. Marpa waits until it knows which interface the grammar will be used for. At that point the internals use the symbol names to map rules to actions on a "just in time" basis.

Comments

Comments on this post can be sent to the Marpa Google Group: marpa-parser@googlegroups.com

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.