Smart whitespace and the Ruby Slippers

Scannerless parsing

[ This is cross-posted from the new home of the Ocean of Awareness blog. ]

I've been working on a "scannerless" Marpa interface. "Scannerless" means that the user does not need to write a separate lexer -- the lexer (scanner) is included in the parser. One of my working examples is the synopsis from the main Marpa::R2 POD page, rewritten to do its own lexing:

:start ::= Expression
Expression ::=
    || Expression '*' Expression action => do_multiply
    || Expression '+' Expression action => do_add
Number ~ digits '.' digits action => do_literal
Number ~ digits action => do_literal
digits ~ [\d]+

Here the notation is that of my last post, as documented here. New for the scannerless parser are

  • the :start pseudo-symbol, which indicates the start rule;
  • rules with a tilde ("~") to separate LHS from RHS: these indicate rules whose whitespace is to be left as-is
  • single-quoted strings, to tell Marpa which character to look for; and
  • square-bracketed character classes, to tell Marpa to look for a class of characters. Their interpretation is done by Perl, and therefore the allowed classes are exactly those accepted by your version of Perl.

Valid strings in this language are "15329 + 42 * 290 * 711", "42*3+7", "3*3+4* 4", along with all their whitespace variants.

My recent posts have been tutorial. My work on scannerless parsing is not quite ready for a tutorial presentation, so this post will be conceptual. It is about an interesting issue that arises in scannerless parsing, one which Perl 6 also had to solve, and which Marpa solves in a new and different way. That issue is whitespace.

Dealing with whitespace

For the statements with a declaration operator of ::=, whitespace is handled automatically by Marpa. Valid strings in the above language are "42*3+7", "42 * 3 + 7" and "42 * 3+7", all of which yield 133 as the answer. The trick is to, on one hand, allow whitespace to be optional and, on the other hand, recognize that strings like "42" must be a single number. That is, the parser should not recognize optional whitespace between the two digits and decide that "42", is actually two numbers: "4" and "2".

The Perl 6 project has already taken on scannerless parsing. My methods for dealing with whitespace are based on theirs. Central to their solution is "smart whitespace". ("Smart whitespace" is my term -- the Perl 6 doc is more matter-of-fact.) Smart whitespace is whitespace which is optional, except between word characters. Stated another way, smart whitespace is either explicit whitespace, or a word boundary. In the case of "42", "4" and "2" are both word characters, so there is no word boundary between them, and therefore no smart whitespace.

Implementing smart whitespace

Left parsers (like that which Perl 6 uses) often know very little about the context of the parse. But left parsers do know the current "character transition" -- what the previous character was, and what the current character is. In a left parser, finding word boundaries for the purpose of detecting smart whitespace fits in nicely with the way it works in general.

Marpa, of course, also knows the previous and current characters. It is certainly possible for Marpa to check every transition for a word boundary. But in Marpa's case, this check would be an additional overhead, handling just one special case. It'd be nice if we could look for word boundaries in a cool Marpa-ish way, preferably one with efficiency advantages.

Out come the Ruby Slippers

"Ruby Slippers" parsing, as a reminder, is new with Marpa, despite seeming a very obvious concept. It amounts to adjusting the input to the parser based on what the parser wants. This can be seen as assuring the parser that whatever it wishes for will happen, the same power that was conferred on Dorothy in Wizard of Oz by a happy choice of footware.

To make the Ruby Slippers work in this case, we make a word boundary a special kind of virtual token, and we define smart whitespace to be one of two things:

  • A sequence of one or more characters of real, physical whitespace.
  • A virtual word-boundary token.

We then proceed normally with the parse, until there's a problem. When the parser reports a problem, we ask it if it is looking for one of the virtual word boundary tokens. If so, we give it one and continue. Why does life have to be difficult?

Comments on this post can be sent to the Marpa Google Group:

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.