More Marpa Madness
For the past year or so, I've been following the posts on Marpa with interest, but I never got around to writing anything with it, because honestly, the docs seemed a little bit opaque and incomplete to me.
Then, the other day, I saw Jeffrey's post about lexing with Marpa and I took it as a challenge. You see, I've never written a lexer. I've written grammars using "lexer-free" parser builders like Parse::RecDescent and Regexp::Grammars, but when it came to writing anything that required a lexer, I was paralyzed. It seemed to me that lexing was frequently ambiguous, and dealing with that ambiguity was a black art that I couldn't understand.
But I knew that Marpa could handle parsing ambiguities, and the docs told me that ambiguous lexing was tameable too. After a few false starts, I figured out how to work the character-per-earleme model (which is sadly lacking in examples) and started writing the parts I would need to port my TAP reference parser to Marpa.
For those who want to see the code, it's on the marpa branch of the TAP::Spec::Parser git repository, and it passes the (admittedly rather small) test suite. But my purpose in this post is to describe the lexing technique, which is pretty simple and, I think, nicely generic.
The token table begins at line 203, and looks like this:
my %tokens = (
  # ...
  'TODO'             => [ qr/\GTODO/i, 'TODO' ],
  'SKIP'             => [ qr/\GSKIP/i, 'SKIP' ],
  'ok'               => [ qr/\Gok/i, 'ok' ],
  'not ok'           => [ qr/\Gnot ok/i, 'not ok' ],
  # ...
  'Positive_Integer' => [ qr/\G([1-9][0-9]*)/, sub { 0 + $1 } ],
  'Number_Of_Tests'  => [ qr/\G(\d+)/, sub { 0 + $1 } ],
);
Each entry in this table has a key, which corresponds to one of the terminal names in the grammar; a match rule, which is a \G-anchored regex; and optionally a rule for deriving a token value. To match a token, the lexer checks whether its regex matches at the current pos() of the input, and if so, it records the name of the token, the number of characters matched, and the token value (the value rule is called if it's a coderef, otherwise taken verbatim).
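In code, the match step might look like this. This is a stand-alone sketch: the token table entries are the ones above, but the lex_at helper and its signature are my own illustration, not TAP::Spec::Parser's actual code.

```perl
use strict;
use warnings;

# A few entries from the token table above.
my %tokens = (
  'ok'               => [ qr/\Gok/i,           'ok' ],
  'not ok'           => [ qr/\Gnot ok/i,       'not ok' ],
  'Positive_Integer' => [ qr/\G([1-9][0-9]*)/, sub { 0 + $1 } ],
);

# Try each token named in @wanted at position $pos of the input; return a
# [ name, length, value ] triple for every one whose regex matches there.
sub lex_at {
  my ($input_ref, $pos, @wanted) = @_;
  my @found;
  for my $name (@wanted) {
    my ($regex, $value_rule) = @{ $tokens{$name} };
    pos($$input_ref) = $pos;                # anchor \G at the lexing position
    next unless $$input_ref =~ /$regex/gc;
    my $value = ref $value_rule eq 'CODE' ? $value_rule->() : $value_rule;
    push @found, [ $name, pos($$input_ref) - $pos, $value ];
  }
  return @found;
}
```

For the line "not ok 42", asking for 'ok' and 'not ok' at position 0 returns only the 'not ok' match (6 characters), and asking for 'Positive_Integer' at position 7 returns a 2-character match with the numeric value 42.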
To scan and parse an input string, parse_line initializes a Marpa recognizer, sets the lexing position to the beginning of the string, and then does the following:
1. Ask Marpa which tokens are "expected" (allowed to start at the current position).
2. Ask the lexer whether any of the expected tokens are present at the current position (the regexes for other tokens aren't attempted).
3. Inform Marpa of whatever tokens actually matched (there may be more than one, in which case Marpa speculatively keeps track of multiple parses).
4. Advance the Marpa recognizer by one earleme (which is one character, since we're doing character-per-earleme matching).
5. Advance the lexer by one character.
6. Repeat from step 1 until we run out of string.
7. Let Marpa know that the input is complete (so any "in-progress" matches that would need more input to succeed are discarded).
8. Ask Marpa for the valid parse or parses.
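Putting the table and the loop together, here is a minimal runnable sketch of that driver. Note that StubRecognizer below is *not* Marpa: it's a hand-rolled stand-in that accepts one fixed token sequence, just to make the shape of the loop (terminals_expected, reporting a token, complete_earleme) concrete. The method names follow the description in this post, while lex_at, the stub, and the fixed sequence are my own inventions.

```perl
use strict;
use warnings;

my %tokens = (
  'not ok'           => [ qr/\Gnot ok/i,       'not ok' ],
  'Space'            => [ qr/\G /,             ' ' ],
  'Positive_Integer' => [ qr/\G([1-9][0-9]*)/, sub { 0 + $1 } ],
);

# Return [ name, length, value ] for each of @wanted that matches at $pos.
sub lex_at {
  my ($input_ref, $pos, @wanted) = @_;
  my @found;
  for my $name (@wanted) {
    my ($regex, $value_rule) = @{ $tokens{$name} };
    pos($$input_ref) = $pos;
    next unless $$input_ref =~ /$regex/gc;
    my $value = ref $value_rule eq 'CODE' ? $value_rule->() : $value_rule;
    push @found, [ $name, pos($$input_ref) - $pos, $value ];
  }
  return @found;
}

# Stand-in for a Marpa recognizer: accepts one fixed token sequence, and
# expects nothing while it's still "inside" a multi-character token.
package StubRecognizer;
sub new {
  my $class = shift;
  return bless { seq => [@_], i => 0, remaining => 0, values => [] }, $class;
}
sub terminals_expected {    # step 1: what may start at this earleme?
  my $s = shift;
  return [] if $s->{remaining} > 0 || $s->{i} >= @{ $s->{seq} };
  return [ $s->{seq}[ $s->{i} ] ];
}
sub token {                 # step 3: accept a matched token
  my ($s, $name, $value, $length) = @_;
  $s->{remaining} = $length;
  $s->{i}++;
  push @{ $s->{values} }, $value;
}
sub complete_earleme {      # step 4: move forward one earleme/character
  my $s = shift;
  $s->{remaining}-- if $s->{remaining} > 0;
}
sub value {                 # step 8: the "parse" (here, just the token values)
  my $s = shift;
  return undef unless $s->{i} == @{ $s->{seq} } && $s->{remaining} == 0;
  return $s->{values};
}

package main;

sub parse_line {
  my ($input) = @_;
  my $recce = StubRecognizer->new('not ok', 'Space', 'Positive_Integer');
  for my $pos (0 .. length($input) - 1) {
    my $expected = $recce->terminals_expected;              # step 1
    my @matches  = lex_at(\$input, $pos, @$expected);       # step 2
    $recce->token($_->[0], $_->[2], $_->[1]) for @matches;  # step 3
    $recce->complete_earleme;                               # step 4
  }                                                         # steps 5-6
  return $recce->value;                                     # steps 7-8
}
```

With this stub, parse_line("not ok 42") succeeds and yields the token values ('not ok', ' ', 42), while input that doesn't fit the sequence yields undef; a real Marpa recognizer would of course do the expectation tracking and parse building itself, and can expect several tokens at once.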
In steps 1 and 2 it's possible that there are either no expected tokens or no matched tokens at the current character position. This is not an error, because we may simply be in the middle of a token and need to scan to the end of it (which is accomplished by advancing the lexer and calling complete_earleme).
In step 4, it's possible that Marpa will announce that the parse is "exhausted". This is an error, or at least an exceptional condition, as it means the parse can't succeed with the input it's been given. TAP::Spec::Parser's line parser does something a little bit weird here: it converts any such failure into a "Junk Line" token. In general, though, exhaustion is an opportunity to either put on the Ruby Slippers or report an error, and in either case the expected tokens from the terminals_expected method come in handy. You could also add a flag that makes lex attempt to match any token at the current location, instead of only the "expected" ones, which would allow generating nice "Expected 'x', got 'y'" error messages.
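As a sketch of that idea (the token table and the error_message helper here are illustrative, not TAP::Spec::Parser's actual code): when all the expected tokens fail, try every regex in the table and report whichever tokens did match.

```perl
use strict;
use warnings;

my %tokens = (
  'ok'               => [ qr/\Gok/i ],
  'Plan'             => [ qr/\G\d+\.\.\d+/ ],
  'Positive_Integer' => [ qr/\G[1-9][0-9]*/ ],
);

# When the "expected" tokens all fail, try *every* token at $pos so the
# error can say what was actually found there.
sub error_message {
  my ($input_ref, $pos, @expected) = @_;
  my @found;
  for my $name (sort keys %tokens) {
    my $regex = $tokens{$name}[0];
    pos($$input_ref) = $pos;
    push @found, $name if $$input_ref =~ /$regex/gc;
  }
  my $got = @found
    ? join ' or ', map { "'$_'" } @found
    : "'" . substr($$input_ref, $pos, 10) . "'";   # raw text as a last resort
  return sprintf 'Expected %s, got %s',
    join(' or ', map { "'$_'" } @expected), $got;
}
```

On the input "1..5" with 'ok' expected, this produces "Expected 'ok', got 'Plan' or 'Positive_Integer'" — the ambiguity of the lex shows up right in the diagnostic.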
P.S. TAP::Spec::Parser's marpa branch uses two Marpa parsers. At first this was done to make TAP's "junk line" handling easier, but it's also good for error reporting. TAP is a line-oriented protocol, so the "line parser" turns a line of input into an object that's used as an input token to the "stream parser", and every EOL is a commit point. Since the stream parser expects lines as tokens and doesn't know anything about the insides of lines, out-of-sequence TAP generates errors like "Expected 'Test_Result', found 'Plan'" instead of errors like "Expected 'Status', found '1..'". I think that's a pretty cool side effect.
I liked the aggressive use of two Marpa parsers to tackle this problem. Using simultaneous Marpa parsers on some grammars is something I had thought about but never implemented. In particular, even though I've looked at alternative ways to use Marpa to deal with line-oriented languages like TAP, your approach never occurred to me. It's very attractive.
Marpa once had not only its own lexer, but its own specification language, called MDL. The "Ruby Slippers" were invented for MDL -- MDL used the trick of looking at the tokens expected by the parser in order to simplify lexing. But MDL looked to most people like a clumsy combination of COBOL and Flex. MDL seemed to leave people totally confused as to what Marpa was or why they might want to use it.
Killing MDL looks like it was a good idea. You rediscovered all MDL's lexing insights, plus some. Perhaps somebody will also come up with a domain-specific language for Marpa parsers, parsed by Marpa itself. Chances are I will like it better than MDL.