<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Andrew Rodland</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/andrew_rodland/" />
    <link rel="self" type="application/atom+xml" href="http://blogs.perl.org/users/andrew_rodland/atom.xml" />
    <id>tag:blogs.perl.org,2009-11-03:/users/andrew_rodland//123</id>
    <updated>2012-01-09T04:15:13Z</updated>
    <subtitle>A blog about the Perl programming language</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 4.38</generator>

<entry>
    <title>More Marpa Madness</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/andrew_rodland/2012/01/more-marpa-madness.html" />
    <id>tag:blogs.perl.org,2012:/users/andrew_rodland//123.2658</id>

    <published>2012-01-09T03:24:44Z</published>
    <updated>2012-01-09T04:15:13Z</updated>

    <summary>For the past year or so, I&apos;ve been following the posts on Marpa with interest, but I never got around to writing anything with it, because honestly, the docs seemed a little bit opaque and incomplete to me. Then, the...</summary>
    <author>
        <name>Andrew Rodland</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/andrew_rodland/">
        <![CDATA[<p>For the past year or so, I've been following the posts on <a href="https://metacpan.org/module/Marpa::XS">Marpa</a> with interest, but I never got around to <em>writing</em> anything with it, because honestly, the docs seemed a little bit opaque and incomplete to me.</p>

<p>Then, the other day, I saw Jeffrey's post about <a href="http://blogs.perl.org/users/jeffrey_kegler/2012/01/what-no-lexer.html">lexing with Marpa</a> and I took it as a challenge. You see, I've never written a lexer. I've written grammars using "lexer-free" parser builders like <a href="https://metacpan.org/module/Parse::RecDescent">Parse::RecDescent</a> and <a href="https://metacpan.org/module/Regexp::Grammars">Regexp::Grammars</a>, and hand-written recursive-descent parsers with the help of <a href="https://metacpan.org/module/Parser::MGC">Parser::MGC</a>, but when it came to writing anything that required a lexer, I was paralyzed. It seemed to me that lexing was frequently ambiguous, and dealing with that ambiguity was a black art that I couldn't understand.</p>

<p>But I knew that Marpa could handle parsing ambiguities, and the docs told me that <a href="https://metacpan.org/module/Marpa::XS::Advanced::Models#AMBIGUOUS-LEXING">ambiguous lexing</a> was tameable too. After a few false starts, I figured out how to work the <a href="https://metacpan.org/module/Marpa::XS::Advanced::Models#THE-CHARACTER-PER-EARLEME-MODEL">character-per-earleme model</a> (which is sadly lacking in examples) and started writing the parts I would need to port my <a href="https://metacpan.org/module/TAP::Spec::Parser">TAP reference parser</a> to Marpa.</p>

<p>For those who want to see the code, it's on <a href="https://github.com/arodland/TAP-Spec-Parser/blob/marpa/lib/TAP/Spec/Parser.pm">the marpa branch</a> of the TAP::Spec::Parser git repository, which passes the (admittedly rather small) test suite. But my purpose in this post is to describe the lexing technique, which is pretty simple and, I think, nicely generic.</p>

<p>The token table begins at line 203, and looks like this:</p>

<pre><code>my %tokens = (
  #  ... 
  'TODO'   =&gt; [ qr/\GTODO/i, 'TODO' ],
  'SKIP'   =&gt; [ qr/\GSKIP/i, 'SKIP' ],
  'ok'     =&gt; [ qr/\Gok/i, 'ok' ],
  'not ok' =&gt; [ qr/\Gnot ok/i, 'not ok' ],
  # ... 
  'Positive_Integer' =&gt; [ qr/\G([1-9][0-9]*)/, sub { 0 + $1 } ],
  'Number_Of_Tests' =&gt; [ qr/\G(\d+)/, sub { 0 + $1 } ],
);
</code></pre>

<p>Each entry in this table has a key, which is one of the terminal names in the grammar, and a value holding a match rule (a regex anchored with <code>\G</code>) and, optionally, a rule for deriving the token value. To match a token, the lexer checks whether its regex matches at the current <code>pos()</code> of the input, and if so, it records the name of the token, the number of characters matched, and the token value (a coderef is called to produce the value; anything else is taken verbatim).</p>
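
<p>To make that concrete, here is a minimal sketch of the matching step. It is not the code from the branch: the helper name <code>lex</code>, its argument order, and the three-entry table are assumptions chosen for illustration.</p>

<pre><code>use strict;
use warnings;

# Illustrative table in the same shape as the one above:
# terminal name =&gt; [ anchored regex, value (literal or coderef) ].
my %tokens = (
  'ok'               =&gt; [ qr/\Gok/i,           'ok' ],
  'not ok'           =&gt; [ qr/\Gnot ok/i,       'not ok' ],
  'Positive_Integer' =&gt; [ qr/\G([1-9][0-9]*)/, sub { 0 + $1 } ],
);

# Hypothetical helper: try only the named tokens at position $pos and
# return [ name, length, value ] for each one that matches there.
sub lex {
  my ($input_ref, $pos, @expected) = @_;
  my @matches;
  for my $name (@expected) {
    my $entry = $tokens{$name} or next;      # skip names with no table entry
    my ($regex, $value) = @$entry;
    pos($$input_ref) = $pos;                 # \G anchors the match here
    next unless $$input_ref =~ /$regex/g;
    my $length = pos($$input_ref) - $pos;    # characters consumed
    $value = $value-&gt;() if ref $value eq 'CODE';   # $1 is still set
    push @matches, [ $name, $length, $value ];
  }
  return @matches;
}
</code></pre>

<p>With the input <code>not ok 12</code>, calling <code>lex(\$line, 0, 'ok', 'not ok')</code> returns a single match for <code>not ok</code> with length 6, because <code>\Gok</code> cannot match at position 0.</p>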

<p>To scan and parse an input string, <code>parse_line</code> initializes a Marpa recognizer, sets the lexing position to be the beginning of the string, and then does the following:</p>

<ol>
<li>Ask Marpa what tokens are "expected" (allowed to start at the current position).</li>
<li>Ask the lexer if any of the expected tokens are present at the current position (the regexes for other tokens aren't attempted).</li>
<li>Inform Marpa of whatever tokens actually matched (there may be multiple, in which case Marpa will speculatively keep track of multiple parses).</li>
<li>Advance the Marpa recognizer by one earleme (this is one character, since we're doing "character-per-earleme" matching).</li>
<li>Advance the lexer by one character.</li>
<li>Repeat (goto 1) until we run out of string.</li>
<li>Let Marpa know that the input is complete (so any "in-progress" matches that would need more input to be successful are discarded).</li>
<li>Ask Marpa for the valid parse or parses.</li>
</ol>
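
<p>The steps above can be sketched as a short driver loop. This is a hedged outline rather than the real <code>parse_line</code>: the name <code>scan_line</code> and the <code>$lex</code> callback are hypothetical, and the recognizer method names (<code>terminals_expected</code>, <code>alternative</code>, <code>earleme_complete</code>, <code>end_input</code>) are the ones I understand Marpa::XS to provide, so check them against the Marpa docs before relying on them.</p>

<pre><code>use strict;
use warnings;

# Hypothetical driver for steps 1-7. $recce is any object offering the four
# recognizer methods used below; $lex is a callback that takes
# (\$input, $pos, @expected_names) and returns [ name, length, value ] triples.
sub scan_line {
  my ($recce, $lex, $input) = @_;
  for my $pos (0 .. length($input) - 1) {               # steps 5 and 6
    my $expected = $recce-&gt;terminals_expected;          # step 1
    for my $match ($lex-&gt;(\$input, $pos, @$expected)) { # step 2
      my ($name, $length, $value) = @$match;
      $recce-&gt;alternative($name, $value, $length);      # step 3
    }
    $recce-&gt;earleme_complete;                           # step 4
  }
  $recce-&gt;end_input;                                    # step 7
  return $recce;   # step 8: the caller asks the recognizer for its parses
}
</code></pre>

<p>Note that the loop calls <code>earleme_complete</code> at every character, whether or not any token matched there. Handling of an exhausted parse is left out of this sketch.</p>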

<p>In steps 1 and 2 it's possible that there are either no expected tokens, or no matched tokens, at the current character position. This is <em>not</em> an error, because it's possible that we're just in the middle of a token, and we need to scan to the end of it (which is accomplished by advancing the lexer and calling <code>earleme_complete</code>).</p>

<p>In step 4, it's possible that Marpa will announce that the parse is "exhausted". This <em>is</em> an error, or at least an exceptional condition, as it means the parse can't succeed with the input it's been given. In <code>TAP::Spec::Parser</code>'s line parser we do something a little bit weird, which is to convert any such failure into a "Junk Line" token, but in general it's an opportunity to either <a href="http://blogs.perl.org/users/jeffrey_kegler/2011/11/marpa-and-the-ruby-slippers.html">put on the ruby slippers</a> or report an error. In either case, the expected tokens from the <code>terminals_expected</code> method come in handy. You could also add a flag that makes <code>lex</code> attempt to match <em>any</em> token at the current location, instead of only "expected" ones, which would allow generating nice "Expected 'x', got 'y'" type error messages.</p>
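
<p>As a rough illustration of that last idea, the sketch below tries every token in a table at the failing position and builds the message from whatever matched. The helper name <code>describe_failure</code> and the two-entry table in the usage note are hypothetical, not part of the parser.</p>

<pre><code>use strict;
use warnings;

# Hypothetical helper: $tokens is a table of name =&gt; [ anchored regex, ... ],
# $expected is the list of names the recognizer said it would accept.
sub describe_failure {
  my ($tokens, $expected, $input, $pos) = @_;
  my @found;
  for my $name (sort keys %$tokens) {        # try *all* tokens, not just expected ones
    my $regex = $tokens-&gt;{$name}[0];
    pos($input) = $pos;                      # \G anchors the match here
    push @found, $name if $input =~ /$regex/g;
  }
  # Fall back to the raw remaining input when no token matches at all.
  my $got  = @found ? join(' or ', map { "'$_'" } @found)
                    : "'" . substr($input, $pos) . "'";
  my $want = join(' or ', map { "'$_'" } sort @$expected);
  return "Expected $want, got $got";
}
</code></pre>

<p>For example, with a table containing <code>'ok' =&gt; [ qr/\Gok/i ]</code> and <code>'Plan' =&gt; [ qr/\G1\.\./ ]</code>, calling <code>describe_failure(\%table, ['ok'], "1..4", 0)</code> returns <code>Expected 'ok', got 'Plan'</code>.</p>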

<p>P.S. The marpa branch of <code>TAP::Spec::Parser</code> uses two Marpa parsers. At first, this was done to make TAP's "junk line" handling easier, but it's also good for error reporting. Since TAP is a line-oriented protocol, the "line parser" turns a line of input into an object that's used as an input token to the "stream parser", and every EOL is a commit point. Since the "stream parser" expects lines as tokens, and doesn't know anything about the <em>insides</em> of lines, out-of-sequence TAP generates errors like "Expected 'Test_Result', found 'Plan'" instead of errors like "Expected 'Status', found '1..'". I think that's a pretty cool side effect.</p>
]]>
        

    </content>
</entry>

</feed>
