<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Ocean of Awareness</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/" />
    <link rel="self" type="application/atom+xml" href="http://blogs.perl.org/users/jeffrey_kegler/atom.xml" />
    <id>tag:blogs.perl.org,2009-11-03:/users/jeffrey_kegler//63</id>
    <updated>2013-04-30T02:44:19Z</updated>
    <subtitle>Jeffrey Kegler&apos;s tech blog.  Topics include his Marpa parser and parsing in general.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 4.38</generator>

<entry>
    <title>Is Earley parsing fast enough?</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2013/04/is-earley-parsing-fast-enough.html" />
    <id>tag:blogs.perl.org,2013:/users/jeffrey_kegler//63.4627</id>

    <published>2013-04-29T19:28:04Z</published>
    <updated>2013-04-30T02:44:19Z</updated>

    <summary> &quot;First we ask, what impact will our algorithm have on the parsing done in production compilers for existing programming languages? The answer is, practically none.&quot; -- Jay Earley&apos;s Ph.D thesis, p. 122. [ This is cross posted from its home at the Ocean of Awareness blog. ] In the above quote, the inventor of the Earley parsing algorithm poses a question. Is his algorithm fast enough for a production compiler? His answer is a stark &quot;no&quot;. This is the verdict on Earley&apos;s that you often hear repeated today, 45 years later. Earley&apos;s, it is said, has too high a &quot;constant factor&quot;. Verdicts tend to be repeated more often than examined. This particular verdict originates with the inventor himself. So perhaps it is not astonishing that many treat the dismissal of Earley&apos;s on grounds of speed to be as valid today as it was in 1968. But in the past 45 years, computer technology has changed beyond recognition and researchers have made several significant improvements to Earley&apos;s. It is time to reopen this case. What is a &quot;constant factor&quot; The term &quot;constant factor&quot; here has a special meaning, one worth looking at carefully. Programmers talk about time efficiency in two...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<blockquote>
      "First we ask, what impact will our algorithm have on the parsing
      done in production compilers for existing programming languages?
      The answer is, practically none." -- Jay Earley's Ph.D thesis, p. 122.
    </blockquote>
<p>[ This is <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2013/04/fast_enough.html">cross posted from its home at the Ocean of Awareness blog</a>. ]</p>
    <p>In the above quote, the inventor of the Earley parsing
      algorithm poses a question.
      Is his algorithm fast enough for a production compiler?  His answer is a
      stark "no".
    </p>
    <p>
      This is the verdict on Earley's that you often
      hear repeated today, 45 years later.
      Earley's, it is said, has too high a "constant factor".
      Verdicts tend to be repeated more often than examined.
      This particular verdict originates with the inventor himself.
      So perhaps it is not astonishing
      that many treat the dismissal
      of Earley's on grounds of speed to be as valid today as it
      was in 1968.
    </p>
    <p>But in the past 45 years,
      computer technology has changed beyond recognition
      and researchers
      have made several significant improvements to Earley's.
      It is time to reopen this case.
    </p><h3>What is a "constant factor"?</h3>
    <p>The term "constant factor" here has a special meaning,
      one worth looking at carefully.
      Programmers talk about time efficiency in two ways:
      time complexity and speed.
    </p>
    <p>
      Speed is simple:
      It's how fast the algorithm is against the clock.
      To make comparison easy,
      the clock can be an abstraction.
      The clock ticks could be, for example, weighted instructions
      on some convenient and mutually-agreed architecture.
    </p>
    <p>
      By the time Earley was writing, programmers had discovered that simply comparing
      speeds,
      even on well-chosen abstract clocks, was not enough.
      Computers were improving very quickly.
      A speed result
      that was clearly significant when the comparison was made
      could quickly become unimportant.
      Researchers needed to
      talk about time efficiency in a way that made what they said as true
      decades later as on the day they said it.
      To do this, researchers created the idea of time complexity.
    </p>
    <p>Time complexity is measured using several notations, but the most
      common is
      <a href="http://en.wikipedia.org/wiki/Big_O_notation">big-O
        notation</a>.
      Here's the idea:
      Assume we are comparing two algorithms, Algorithm A and Algorithm B.
      Assume that algorithm A uses 42 weighted instructions for each input symbol.
      Assume that algorithm B uses 1792 weighted instructions for each input symbol.
      Where the count of input symbols is N,
      A's speed is 42*N, and B's is 1792*N.
      But the time complexity of both is the same: O(N).
      The big-O notation throws away the two "constant factors", 42 and 1792.
      Both are said to be "linear in N".
      (Or more often, just "linear".)
    </p>
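    <p>A minimal sketch of that comparison, in Python.
      The function names <tt>cost_a</tt> and <tt>cost_b</tt> are hypothetical,
      introduced only for illustration;
      the figures 42 and 1792 are the ones from the paragraph above.
      Whatever the input size, the ratio between the two costs stays the same,
      and that fixed ratio is what big-O notation throws away.
    </p>

```python
def cost_a(n):
    """Weighted instructions used by the hypothetical Algorithm A on n symbols."""
    return 42 * n


def cost_b(n):
    """Weighted instructions used by the hypothetical Algorithm B on n symbols."""
    return 1792 * n


# Both costs grow in step with n, so their ratio never changes.
for n in (10, 1_000, 100_000):
    ratio = cost_b(n) / cost_a(n)
    print(n, cost_a(n), cost_b(n), round(ratio, 2))
```

    <p>Every row prints the same ratio, roughly 42.67.
      B is always slower than A by that factor, but neither
      pulls away from the other as the input grows.
    </p>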
    <p>It often happens that algorithms we need to compare for time efficiency
      have different speeds,
      but the same time complexity.
      In practice,
      this usually means we can treat them as having essentially
      the same time efficiency.
      But not always.
      It sometimes happens that this difference is relevant.
      When this happens, the rap against the slower algorithm is that it
      has a "high constant factor".
    </p>
    <h3>OK, about that high constant factor</h3>
    <p>What is the "constant factor" between Earley and the current favorite
      parsing algorithm, as a number?
      (My interest is practical, not historic,
      so I will be talking about Earley's
      as modernized by Aycock, Horspool, Leo and myself.
      But much of what I say applies to Earley's algorithm in general.)
    </p>
    <p>What the current favorite parsing algorithm is
      can be an interesting question.
      When Earley wrote, it was hand-written recursive descent.
      The next year (1969) LALR parsing was invented,
      and in the early 1970s a tool that used it was introduced -- yacc.
      At points over the next decades,
      yacc chased both Earley's
      and recursive descent almost completely out of the textbooks.
      <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2010/09/perl-and-parsing-6-rewind.html">
        But as I have detailed elsewhere</a>,
          yacc had serious problems.
          In 2006 things went full circle -- the industry's standard C
          compiler, GCC, replaced LALR with recursive descent.
        </p>
    <p>So back to 1970.
    That year, Jay Earley wrote up his algorithm for
    "Communications of the ACM",
      and put a rough number on his "constant factor".
      He said that his algorithm was an "order of magnitude" slower
      than the current favorites -- a factor of 10.
      Earley suggested ways to lower this 10-handicap,
      and modern implementations have followed up on them
      and found others.
      But for this post,
      let's concede the factor of ten and throw
      in another.
      Let's say Earley's is 100 times slower than the current favorite,
      whatever that happens to be.
    </p>
    <h3>Moore's Law and beyond</h3>
    <p>Let's look at the handicap of 100
      in the light of Moore's Law.
      Since 1968, computers have gotten a billion times faster -- nine orders
      of magnitude. Nine factors of ten.
      This means that today Earley's runs
      seven factors of ten faster than
      the current favorite algorithm did in
      1968.
      Earley's is 10 million times as fast as the algorithm that was
      then considered practical.
    </p>
    <p>
      Of course, our standard of "fast enough to be practical" also evolves.
      But it evolves a lot more slowly.
      Let's exaggerate
      and say that "practical" meant "takes an hour" in 1968,
      but that today we would demand that the same program take only a second.
      Do the arithmetic and you find that Earley's is now
      more than 2,000 times faster than it needs to be to be practical.
    </p>
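    <p>The arithmetic can be checked with a few lines of Python.
      This is only a sketch of the back-of-the-envelope reasoning above;
      the variable names are illustrative, and the inputs are the
      deliberately rough figures already given (a billion-fold hardware speedup,
      a conceded handicap of 100, and "an hour" shrinking to "a second").
    </p>

```python
hardware_speedup = 10 ** 9    # "a billion times faster" since 1968
conceded_handicap = 10 ** 2   # Earley's taken to be 100 times slower

# How much faster Earley's runs today than the favorite did in 1968.
earley_vs_1968_favorite = hardware_speedup // conceded_handicap

# "Practical" tightened from an hour to a second: a factor of 3600.
seconds_per_hour = 60 * 60
margin = earley_vs_1968_favorite / seconds_per_hour

print(earley_vs_1968_favorite)  # 10000000
print(round(margin))            # 2778
```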
    <p>Bringing in Moore's Law is just the beginning.
      The handicap Jay Earley gave his algorithm
      is based on a straight comparison of CPU speeds.
      But parsing, in practical cases, involves I/O.
      And the "current favorite" needs to do as much I/O as Earley's.
      I/O overheads, and the accompanying context switches,
      swamp considerations of CPU speed,
      and that is more true today
      than it was in 1968.
      When an application is I/O bound, CPU is in effect free.
      Parsing may not be I/O bound in this sense, but neither
      is it one of those applications where the comparison can be made
      in raw CPU terms.
    </p>
    <p>Finally, pipelining has changed
      the nature of the CPU overhead itself radically.
      In 1968, the time to run a series of CPU
      instructions varied linearly with the number of instructions.
      Today, that is no longer true,
      and the change favors strategies like Earley's,
      which require a higher instruction count,
      but achieve efficiency in other ways.
    </p>
    <h3>Achievable speed</h3>
    <p>
      So far, I've spoken in terms of theoretical speeds, not achievable ones.
      That is, I've assumed that both Earley's
      and the current favorite are producing their best speed, unimpeded by
      implementation considerations.
    </p>
    <p>
      Earley, writing in 1968 and thinking of hand-written recursive descent,
      assumed that production compilers
      could be, and in practice usually would be,
      written by
      programmers with plenty of time to do
      careful and well-thought-out hand-optimization.
      After forty-five years of real-life experience,
      we know better.
    </p>
    <p>
      In those widely used practical compilers and interpreters
      that rely on lots of procedural logic --
      and these days that is almost all of them --
      it is usually all the maintainers can do to keep the procedural logic correct.
      In all but a few cases, optimization is opportunistic,
      not systematic.
      Programmers have been exposed to
      the realities of parsing with
      large amounts of complex procedural logic,
      and hand-written recursive descent has acquired a
      reputation for being slow.
    </p>
   <p>
      In theory,
      LALR based compilers are less dependent on procedural
      parsing and therefore easier to keep optimal.
      In practice they are as bad or worse.
      LALR parsers usually still need a considerable amount of procedural logic,
      but procedural logic is harder to write for LALR than it
      is for recursive descent.
    </p>
    <p>Modern Earley parsing
      has a much easier time actually delivering
      its theoretical best speed in practice.
      Earley's is powerful enough,
      and in its modern version well-enough aware of the state of the parse,
      that procedural logic can be kept to a minimum or eliminated.
      Most of the parsing is done by the mathematics at its core.
    </p>
    <p>
      The math at Earley's core can be heavily optimized,
      and any optimization benefits all applications.
      Optimization of special-purpose procedural logic benefits
      only the application that uses that logic.
    </p>
    <h3>Other considerations</h3>
    <p>But you might say,
    </p><blockquote>
      "A lot of interesting points, Jeffrey, but all things being
      equal, a factor of 10,
      or even what's left from a factor of ten once I/O,
      pipelining and implementation inefficiencies have all nibbled away at it,
      is still worth having.
      It may in a lot of instances not even be measurable, but why not grab
      it for the sake of the cases where it is?"
    </blockquote><p>
      Which is a good point.
      The "implementation inefficiencies" can be nasty enough that Earley's is in
      fact faster in raw terms,
      but let's assume
      that some cost in speed is still being paid for the use of Earley's.
      Why incur that cost?
    </p><h4>Error diagnosis</h4><p>
      The parsing algorithms currently favored,
      in their quest for efficiency,
      do not maintain full
      information about the state of the parse.
      This is fine when the source is 100% correct,
      but in practice an important function of a parser is to find and
      diagnose errors.
      When the parse fails, the current favorites
      often have little idea of why.
      An Earley parser knows the full state of the parse.
      This added knowledge can save a lot of
      programmer time.
    </p><h4>Readability</h4>
    <p>
      The more that a parser does from the grammar,
      and the less procedural logic it uses,
      the more readable the code will be.
      This has a determining effect on maintenance costs
      and the software's ability to evolve over time.
    </p><h4>Accuracy</h4>
    <p>Procedural logic can produce inaccuracy -- inability
      to describe or control the actual language being parsed.
      Some parsers, particularly LALR and PEG,
      have a second major source of inaccuracy -- they use
      a precedence scheme for conflict resolution.
      In specific cases, this can work, but
      precedence-driven conflict resolution
      produces a language without
      a "clean" theoretical description.
    </p>
    <p>
      The obvious problem with not knowing what language you
      are parsing is failure to parse correct source code.
      But another, more subtle, problem can be worse over the
      life cycle of a language ...
    </p>
    <h4>False positives</h4>
    <p>False positives are cases
      where the input is in error,
      and should be reported as such, but instead
      the result is what you wanted.
      This may sound like unexpected good news,
      but when a false positive does surface,
      it is quite possible that it cannot be fixed
      without breaking code that, while incorrect, does work.
      Over the life of a language, false positives are deadly.
      False positives produce buggy and poorly understood code
      which must be preserved and maintained forever.
    </p>
    <h4>Power</h4>
    <p>
      The modern Earley implementation can parse vast classes
      of grammar in linear time.
      These classes include all those currently in practical use.
    </p><h4>Flexibility</h4>
    <p>Modern Earley implementations
      parse all context-free grammars in times that are, in practice,
      considered optimal.
      With other parsers,
      the class of grammars parsed is highly restricted,
      and there is usually a real danger that a new change
      will violate those restrictions.
      As mentioned,
      the favorite alternatives to Earley's
      make it hard to know exactly what language you are,
      in fact, parsing.
      A change can break one of these parsers
      without there being any indication.
      By comparison,
      syntax changes and extensions to Earley's grammars
      are carefree.
    </p>
    <h3>For more about Marpa</h3>
    <p>
      Above I've spoken of "modern Earley parsing",
      by which I've meant Earley parsing as amended and improved
      by the efforts of Aycock, Horspool, Leo and myself.
      At the moment, the only implementation that contains
      all of these modernizations is Marpa.
    </p>
    <p>
      Marpa's latest version is
      <a href="https://metacpan.org/module/Marpa::R2">Marpa::R2,
        which is available on CPAN</a>.
      Marpa's
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.052000/pod/Scanless/DSL.pod">SLIF
        is
        a new interface</a>,
      which represents a major increase
      in Marpa's "whipitupitude".
      The SLIF has tutorials
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/dsl_simpler2.html">here
      </a>
      and
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/announce_scanless.html">
        here</a>.
      Marpa has
      <a href="http://jeffreykegler.github.com/Marpa-web-site/">a web page</a>,
      and of course it is the focus of
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/">
        my "Ocean of Awareness" blog</a>.
    </p>
    <p>
      Comments on this post
      can be sent to Marpa's Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>]]>
        
    </content>
</entry>

<entry>
    <title>Marpa&apos;s SLIF now allows procedural parsing</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2013/04/marpas-slif-now-allows-procedural-parsing.html" />
    <id>tag:blogs.perl.org,2013:/users/jeffrey_kegler//63.4597</id>

    <published>2013-04-22T14:13:37Z</published>
    <updated>2013-04-23T09:43:24Z</updated>

    <summary> [ This is cross-posted from the Ocean of Awareness blog. ] Marpa&apos;s SLIF (scanless interface) allows an application to parse directly from any BNF grammar. Marpa parses vast classes of grammars in linear time, including all those classes currently in practical use. With its latest release, Marpa::R2&apos;s SLIF also allows an application to intermix its own custom lexing and parsing logic with Marpa&apos;s, and to switch back and forth between them. This means, among other things, that Marpa&apos;s SLIF can now do procedural parsing. What is procedural parsing? Procedural parsing is parsing using ad hoc code in a procedural language. The opposite of procedural parsing is declarative parsing -- parsing driven by some kind of formal description of the grammar. Procedural parsing may be described as what you do when you&apos;ve given up on your parsing algorithm. Dissatisfaction with parsing theory has left modern programmers accustomed to procedural parsing. And in fact some problems are best tackled with procedural parsing....</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[ <p>[ This is <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2013/04/procedural.html">cross-posted from the Ocean of Awareness blog</a>. ]</p>
<p>
      Marpa's SLIF (scanless interface)
      allows an application to parse directly from any BNF grammar.
      Marpa parses vast classes of grammars in linear time,
      including all those classes currently in practical use.
      With
      <a href="https://metacpan.org/release/Marpa-R2">
        its latest release</a>,
      Marpa::R2's SLIF
      also allows an application to intermix
      its own custom lexing and parsing logic
      with Marpa's,
      and to switch back and forth between them.
      This means,
      among other things,
      that Marpa's SLIF can now
      do procedural parsing.
    </p>
    <p>
      What is procedural parsing?
      Procedural parsing is parsing using
      ad hoc code in a procedural language.
      The opposite of procedural parsing is declarative parsing
      -- parsing driven by some kind of formal description
      of the grammar.
      Procedural parsing
      may be described as what you do when you've given up
      on your parsing algorithm.
      Dissatisfaction with parsing theory
      has left modern programmers accustomed to procedural parsing.
      And in fact some problems are best tackled with procedural parsing.
    </p>
]]>
        <![CDATA[   <h3>An example</h3>
    <p>
      One such problem is parsing Perl-style here-documents.
      Peter Stuifzand has tackled this using
      <a href="https://metacpan.org/release/JKEGL/Marpa-R2-2.052000">
        the
        just-released version of Marpa::R2</a>.
      For those unfamiliar, Perl allows documents to be incorporated
      into its source files in line-oriented fashion as "here-documents".
      Here-documents can be used in expressions.
      The syntax to do this is very handy, if a little strange.
      For example,
    </p>
    <blockquote><pre>
say &lt;&lt;ENDA, &lt;&lt;ENDB, &lt;&lt;ENDC; say &lt;&lt;ENDD;
a
ENDA
b
ENDB
c
ENDC
d
ENDD</pre></blockquote>
    <p>
      starts with a single line declaring four here-documents spread out
      over two
      <tt>say</tt>
      statements.
      The expressions of the form
    </p><blockquote><pre>&lt;&lt;ENDX</pre></blockquote><p>
      are here-document expressions.
      <tt>&lt;&lt;</tt>
      is the heredoc operator.
      The string which follows it (in this example,
      <tt>ENDA</tt>,
      <tt>ENDB</tt>, etc.) is the heredoc terminator string --
      the string that will signal end
      of body of the here-document.
      The bodies of the here-documents follow, in order, over the next eight lines.
      More details of here-document syntax, with examples, can be found
      in
      <a href="http://perldoc.perl.org/perlop.html#Quote-Like-Operators">the
        Perl documentation</a>.
    </p>
    <p>All of this poses quite a challenge to a parser-lexer combination,
      which is one reason I chose it as an example --
      to illustrate that Marpa's SLIF support for procedural parsing can
      handle genuinely difficult cases.
      There are a few ways Marpa could approach this.
      The one
      Peter Stuifzand chose was
      to read the
      here-document's body as the value of the terminator in
      each
      <tt>&lt;&lt;ENDX</tt>
      expression.
    </p>
    <p>
      The strategy works this way:
      Marpa allows the application to mark certain lexemes as "pause" lexemes.
      Whenever a "pause" lexeme is encountered, Marpa's internal scanning stops,
      and control is handed over to the application.
      In this case, the application is set up to pause after every newline,
      and before the terminator in every here-document expression.
    </p>
    <p>
      While reading the line containing the four here-document expressions,
      Marpa's SLIF pauses and resumes five times -- once for each here-document expression,
      then once for the final newline.
      Details can be found in compact form in the heavily commented code
      in
      <a href="https://gist.github.com/jeffreykegler/5431739">this
        GitHub gist</a>.
    </p>
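    <p>To make the idea concrete without reproducing the gist,
      here is a rough sketch in plain Python.
      It does not use Marpa or its pause mechanism:
      the function <tt>read_heredocs</tt> and the pattern <tt>HEREDOC</tt>
      are hypothetical names,
      and the sketch assumes bare-word terminators and ignores quoting and
      the rest of Perl's syntax.
      What it shows is the procedural step itself:
      note each terminator on the declaring line,
      then, after the newline, read each body as the value of its terminator.
    </p>

```python
import re

# The heredoc operator followed by a bare-word terminator such as ENDA.
HEREDOC = re.compile(r"<<(\w+)")


def read_heredocs(source):
    """Return (terminator, body) pairs in the order the expressions appear."""
    lines = source.splitlines()
    found = []
    i = 0
    while i < len(lines):
        # One "pause" per here-document expression on this line.
        terminators = HEREDOC.findall(lines[i])
        i += 1  # the newline "pause": bodies begin on the next line
        for terminator in terminators:
            body = []
            while i < len(lines) and lines[i] != terminator:
                body.append(lines[i])
                i += 1
            i += 1  # step over the terminator line itself
            found.append((terminator, "\n".join(body)))
    return found


example = "\n".join([
    "say <<ENDA, <<ENDB, <<ENDC; say <<ENDD;",
    "a", "ENDA", "b", "ENDB", "c", "ENDC", "d", "ENDD",
])
print(read_heredocs(example))
```

    <p>On the example from earlier in this post, the sketch pairs
      <tt>ENDA</tt> with <tt>a</tt>, <tt>ENDB</tt> with <tt>b</tt>, and so on.
      Unlike Perl, it drops the trailing newline of each body,
      which keeps the illustration short.
    </p>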
    <h3>Marpa as a better procedural parser</h3>
    <p>So far I've talked in terms of Marpa "allowing" procedural parsing.
      In fact, there can be much more to it.
      Marpa can make procedural parsing easier and more accurate.
    </p>
    <p>Marpa knows, at every point, which rules it is recognizing, and how far it
      is into them.
      Marpa also knows which new rules the grammar expects, and which terminals.
      The procedural parsing logic can consult this information to guide its decisions.
      Marpa can provide your procedural parsing logic with radar,
      as well as the option to use a very smart autopilot.
    </p>
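    <p>What "knowing which rules it is recognizing, and how far it is into them"
      looks like can be sketched with a toy Earley recognizer in Python.
      This is a textbook-style illustration, not Marpa's implementation or API:
      the grammar, the <tt>Item</tt> tuple and the function names are all hypothetical,
      and the sketch assumes a grammar with no empty rules.
      Each item records a rule, a dot position, and the location where the rule started,
      so the terminals expected next can be read directly off the current set of items.
    </p>

```python
from collections import namedtuple

# An Earley item: a rule, how far into it we are (dot), and where it began.
Item = namedtuple("Item", "lhs rhs dot origin")

# Hypothetical toy grammar: sums of the token "n". No empty rules.
GRAMMAR = {
    "Sum": [("Sum", "+", "Num"), ("Num",)],
    "Num": [("n",)],
}


def earley_sets(tokens, start="Sum"):
    """Build the Earley sets; set i is the parse state after i tokens."""
    sets = [[] for _ in range(len(tokens) + 1)]

    def add(i, item):
        if item not in sets[i]:
            sets[i].append(item)

    for rhs in GRAMMAR[start]:
        add(0, Item(start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        for item in sets[i]:  # the list may grow while we walk it
            if item.dot == len(item.rhs):
                # Completion: advance items that were waiting for item.lhs.
                for waiting in sets[item.origin]:
                    if (waiting.dot != len(waiting.rhs)
                            and waiting.rhs[waiting.dot] == item.lhs):
                        add(i, waiting._replace(dot=waiting.dot + 1))
            else:
                symbol = item.rhs[item.dot]
                if symbol in GRAMMAR:
                    # Prediction: start every rule for the expected symbol.
                    for rhs in GRAMMAR[symbol]:
                        add(i, Item(symbol, rhs, 0, i))
                elif i != len(tokens) and tokens[i] == symbol:
                    # Scanning: the next token is the one this item expects.
                    add(i + 1, item._replace(dot=item.dot + 1))
    return sets


def expected_terminals(earley_set):
    """Terminals that some item in this set is waiting to see next."""
    return sorted({item.rhs[item.dot] for item in earley_set
                   if item.dot != len(item.rhs)
                   and item.rhs[item.dot] not in GRAMMAR})


sets = earley_sets(["n", "+"])
print(expected_terminals(sets[1]))  # ['+']
print(expected_terminals(sets[2]))  # ['n']
```

    <p>Procedural logic that can ask this kind of question at any point
      (after <tt>n +</tt>, only another <tt>n</tt> is acceptable)
      has much more to go on than logic that only learns the parse has failed.
    </p>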
    <h3>For more about Marpa</h3>
    <p>
      Marpa's latest version is
      <a href="https://metacpan.org/module/Marpa::R2">Marpa::R2,
        which is available on CPAN</a>.
      Marpa's
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.052000/pod/Scanless/DSL.pod">SLIF
        is
        a new interface</a>,
      which represents a major increase
      in Marpa's "whipitupitude".
      The SLIF has tutorials
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/dsl_simpler2.html">here
      </a>
      and
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/announce_scanless.html">
        here</a>.
      Marpa has
      <a href="http://jeffreykegler.github.com/Marpa-web-site/">a web page</a>,
      and of course it is the focus of
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/">
        my "Ocean of Awareness" blog</a>.
    </p>
    <p>
      Comments on this post
      can be sent to Marpa's Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>]]>
    </content>
</entry>

<entry>
    <title>What if languages were free?</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2013/03/what-if-languages-were-free.html" />
    <id>tag:blogs.perl.org,2013:/users/jeffrey_kegler//63.4463</id>

    <published>2013-03-24T00:07:41Z</published>
    <updated>2013-03-24T05:53:00Z</updated>

    <summary>[ This is cross-posted from the Ocean of Awareness blog. ] In 1980, George Copeland wrote an article titled &quot;What if Mass Storage were Free?&quot;. Costs of mass storage were showing signs that they might fall dramatically. Copeland, as a thought exercise, took this trend to its extreme. Among other things, he predicted that deletion would become unnecessary, and in fact, undesirable. Copeland&apos;s thought experiment has proved prophetic. For many purposes, mass storage is treated as if it were free. For example, you probably retrieved this blog post from a server provided to me at no charge, in the hope that I might write and upload something interesting. Until now languages were high-cost efforts. Worse, language projects ran a high risk of disappointment, up to and including total failure. I believe those days are coming to an end. Small languages, shaped to the problem domain What if whenever you needed a new language, poof, it was there? You would be encouraged to tackle each problem domain with a new language dedicated to dealing with that domain. Since each language is no larger than its problem domain, learning a language would be essentially the same as learning the problem domain. The...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>[ This is <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/03/what_if_free.html">cross-posted from the Ocean of Awareness blog</a>. ]</p>
    <p>In 1980, George Copeland wrote
      <a href="http://dl.acm.org/citation.cfm?id=802685">
    an article</a>
    titled "What if Mass Storage were Free?".
      Costs of mass storage were showing signs
      that they might fall dramatically.
      Copeland, as a thought exercise, took this trend to its extreme.
      Among other things, he predicted that deletion would become
      unnecessary, and in fact, undesirable.
    </p>
    <p>Copeland's
      thought experiment has proved prophetic.
      For many purposes, mass storage is treated as if it were free.
      For example, you probably retrieved this blog post from a server
      provided to me at no charge, in the hope
      that I might write and upload something interesting.
    </p>
    <p>
      Until now languages were high-cost efforts.
      Worse, language projects ran a high risk of disappointment,
      up to and including total failure.
      I believe those days are coming to an end.
    </p>
    <h3>Small languages, shaped to the problem domain</h3>
    <p>What if whenever you needed a new language, poof, it was there?
      You would be encouraged to tackle each problem domain with
      a new language dedicated to dealing with that domain.
      Since each language is no larger than its problem domain,
      learning a language would be essentially the same as learning
      the problem domain.
      The incremental effort required to learn the language
      itself would head toward zero.
    </p>
    <h3>No more language bloat</h3>
    <p>Language bloat would end.
      Currently, the risk and cost of developing languages
      make it imperative to extend the ones we have.
      Free languages mean fewer reasons to add features
      to existing languages.
    </p>
    <h3>No more search for THE perfect language</h3>
    <p>
      No language is perfect for all tasks.
      But because the high cost of languages favors
      large, general-purpose languages,
      we are compelled to try for perfection anyway.
      Ironically, we are often making the language worse,
      and we know it.
    </p>
    <h3>A world full of perfect languages</h3>
    <p>An older sense of the word perfect is
      "having all the properties or qualities requisite to its nature and kind".
      The C language might be called perfect in this sense.
      C lacks a lot of features that are highly desirable in most contexts.
      But for programming that is portable
and close to the hardware,
      the C language is perfect or close to it.
      If languages were free, this is the kind of perfection
      that we would seek --
      languages precisely fitted to their domain,
      so that adding to them cannot make them better.
    </p>
    <h3>Moving toward free</h3>
    <p>
      My own effort to contribute to 
      a fall in the cost of languages is the Marpa parser.
      Marpa
      produces a reasonable parser for every language you can write in BNF.
      If the BNF is for a grammar in any of the classes currently in practical
      use, the parser Marpa produces will have linear speed.
      In one case, using Marpa,
      <a href="https://gist.github.com/4447349">a targeted language</a>
      was written
      in less than an hour.
      <a href="http://blogs.perl.org/users/jeffrey_kegler/2013/01/a-language-for-writing-languages.html">
        More typically</a>, Marpa reduces the time needed to create new languages to hours.
    </p>
    <p>As one example of going from "impossible" to "easy",
      I have written a drop-in solution to an example in the
      <a href="http://en.wikipedia.org/wiki/Design_Patterns">Gang
        of Four book</a>.
      The Gang of Four described a language
      and its interpretation,
      but they did not include a parser.
      Creating a parser
      to fit their example would have been
      impossibly hard when the Gang of Four wrote.
      Using Marpa, it is easy.
      The parser can be found in
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/03/bnf_to_ast.html">this
        earlier blog post</a>.
    </p>
    <p>
      Marpa's latest version is
      <a href="https://metacpan.org/module/Marpa::R2">Marpa::R2,
        which is available on CPAN</a>.
      Recently, it has gained immensely in "whipitupitude" with
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.048000/pod/Scanless/DSL.pod">
        a new interface</a>,
      which has tutorials
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/dsl_simpler2.html">here
      </a>
      and
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/announce_scanless.html">
        here</a>.
      Marpa has
      <a href="http://jeffreykegler.github.com/Marpa-web-site/">a web page</a>,
      and of course it is the focus of
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/">
        my "Ocean of Awareness" blog</a>.
    </p>
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>]]>
        
    </content>
</entry>

<entry>
    <title>The Interpreter Design Pattern</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2013/03/the-interpreter-design-pattern.html" />
    <id>tag:blogs.perl.org,2013:/users/jeffrey_kegler//63.4450</id>

    <published>2013-03-20T17:52:11Z</published>
    <updated>2013-03-20T17:59:54Z</updated>

    <summary> [ This is cross-posted from the Ocean of Awareness blog. ] The influential Design Patterns book lays out 23 patterns for programming. One of them, the Interpreter Pattern, is rarely used. Steve Yegge puts it a bit more strikingly -- he says that the book contains 22 patterns and a practical joke. That sounds (and in fact is) negative, but elsewhere Yegge says that &quot;[t]ragically, the only [Go4] pattern that can help code get smaller (Interpreter) is utterly ignored by programmers&quot;. (The Design Patterns book has four authors, and is often called the Gang of Four book, or Go4.) In fact, under various names and definitions, the Interpreter Pattern and its close relatives and/or identical twins are widely cited, much argued and highly praised [1]. As they should be. Languages are the most powerful and flexible design pattern of all. A language can include all, and only, the concepts relevant to your domain. A language can allow you to relate them in all, and only, the appropriate ways. A language can identify errors with pinpoint precision, hide implementation details, allow invisible &quot;drop-in&quot; enhancements, etc., etc., etc. In fact languages are so powerful and flexible, that their use is pretty...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>
      [ This is <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/03/interpreter.html">cross-posted from the Ocean of Awareness blog</a>. ]
    </p>
    <p>The influential
      <a href="http://en.wikipedia.org/wiki/Design_Patterns">
        <em>Design Patterns</em>
        book</a>
      lays out 23 patterns for programming.
      One of them, the Interpreter Pattern, is rarely used.
      Steve Yegge puts it a bit more strikingly -- he says
      that the book contains
      <a href="https://sites.google.com/site/steveyegge2/ten-great-books">22
        patterns and a practical joke</a>.
    </p>
    <p>That sounds (and in fact is) negative, but
      <a href="http://steve-yegge.blogspot.com/2007/12/codes-worst-enemy.html">
        elsewhere</a>
      Yegge says that
      "[t]ragically, the only [Go4] pattern that can help code get smaller
      (Interpreter) is utterly ignored by programmers".
      (The
      <i>Design Patterns</i>
      book has four authors,
      and is often called the Gang of Four book, or Go4.)
    </p>
    <p>
      In fact, under various names and definitions, the
      Interpreter Pattern and its close relatives and/or identical twins
      are widely cited,
      much argued and highly praised <a href="#NOTE1">[1]</a>.
</p>
<p>
          As they should be.
          Languages are the most powerful and flexible design pattern of all.
          A language can include all, and only, the concepts relevant
          to your domain.
          A language can allow you to relate them in all, and only, the appropriate ways.
          A language can identify errors with pinpoint precision,
          hide implementation details,
          allow invisible "drop-in" enhancements, etc., etc., etc.
     </p>
    <p>
      In fact languages are so powerful and flexible,
      that their use is pretty much universal.
      The choice is not whether or not to use a language to solve
      the problem,
      but whether to use
      a general-purpose language,
      or a domain-specific language.
      Put another way,
      if you decide not to use a language targeted
      to your domain,
      it almost always means that you
      are choosing to use another language that is not specifically
      fitted to your domain.
    </p>
    <p>
      Why, then, is the Interpreter Pattern so little used?
      Why does Yegge call it a practical joke?
    </p>
    <h3>There's a problem</h3>
    <p>The problem with the Interpreter Pattern is that you must
      turn your language into an AST --
      that is,
      you must parse it somehow.
      Simplifying the language can help here.
      But if the point is to be simple at the expense of power
      and flexibility,
      you might as well
      stick with the other 22 design patterns.
    </p>
    <p>
      On the other hand,
      creating a parser for anything but the simplest languages
      has been a time-consuming effort,
      and one of a kind known for disappointing results.
      In fact,
      language development efforts run
      a real risk of total failure.
    </p>
    <p>How did the Go4 deal with this?
      They defined the problem away.
      They stated that the parsing issue was separate from the
      Interpreter Pattern, which was limited to what you did with the AST
      once you'd somehow come up with one.
    </p>
    <p>
      But AST's don't (so to speak) grow on trees.
      You have to get one from somewhere.
      In their example, the Go4 simply built an AST in their code,
      node by node.
      In doing this, they bypassed the BNF and the problem of parsing.
      But they also bypassed their language and the whole point
      of the Interpreter Pattern.
    </p>
    <p>
      Which is why Yegge characterized the chapter as a practical joke.
      And why other programming techniques and patterns are almost
      always preferred to the Interpreter Pattern.
    </p>
    <h3>Finding that one missing piece</h3>
    <p>So that's how the Go4 left things.
      A potentially great programming technique,
      made almost useless because
      of a missing piece.
      There was no easy, general, and practical way to generate AST's.
    </p>
    <p>
      Few expected that to change.
      I was more optimistic than most.
      In 2007 I embarked on a full-time project:
      to create a parser based on Earley's algorithm.
      I was sure that it would fulfill two of the criteria --
      it would be easy to use, and it would be general.
      As for practical -- well, a lot of parsing problems
      are small, and a lot of applications don't require a lot
      of speed, and for these I expected the result to be good enough.
    </p>
    <p>What I didn't realize was that
      all of the problems preventing
      Earley's from seeing real, practical use
      had already been solved in the academic literature.
      I was not alone in not having put the picture together.
      The people who had solved the problems
      had focused on two disjoint sets of issues,
      and were unaware of each other's
      work.
      In 1991, in the Netherlands,
      the mathematician Joop Leo had
      arrived at an astounding result --
      he showed how to make Earley's run in linear time for LR-regular grammars.
      LR-regular is a vast class of grammars.
      It easily includes, as a proper subset, every class of grammar now
      in practical use -- regular expressions, PEG, recursive descent,
      the LALR on which yacc and bison are based, you name it.
      (For those into the math,
      LR-regular includes LR(k)
      for all <i>k</i>,
      and therefore LL(k),
      also for all <i>k</i>.)
      </p>
      <p>
      Leo's mathematical approach did not address some nagging practical issues,
      foremost among them the handling of nullable rules and symbols.
      But ten years later in Canada,
      Aycock and Horspool focused on exactly these issues,
      and solved them.
      Aycock-Horspool
      seem to have been unaware of Leo's earlier result.
      The time complexity of the Aycock-Horspool
      algorithm was essentially that of
      Earley's original algorithm.
    </p>
    <p>
      Because of Leo's work,
      for any grammar in any class currently in practical use,
      an Earley's parser could be fast.
      If only it could be combined with the approach
      of Aycock and Horspool, I realized,
      Leo's speeds could be available in an everyday programming tool.
    </p>
    <p>
      In changing the Earley parse engine,
      Aycock-Horspool and Leo had branched off in different directions.
      It was not obvious that their approaches could be combined, much less how.
      And in fact, the combination of the two is not a simple algorithm.
      But it is fast,
      and the new Marpa parse engine makes full information
      about the state of the parse (rules recognized, symbols expected, etc.)
      available as it proceeds.
      This is very convenient for, among other things, error reporting.
    </p>
    <h3>Eureka and all that</h3>
    <p>The result is an algorithm which parses anything
      you can write in BNF and
      does it in times considered optimal in practice.
      Unlike recursive descent, you don't have to write out the parser --
      Marpa generates a parser for you, from the BNF.
      It's the easy, "drop-in" solution that the Go4 needed and did not have.
      A reworking of the Go4 example, with the missing parser added,
      is in
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/03/bnf_to_ast.html">a
        previous blog post</a>, and the code for the reworking is in
      <a href="https://gist.github.com/jeffreykegler/5121769">
        a Github gist</a>.
    </p>
    <h3>More about Marpa</h3>
    <p>
      Marpa's latest version is
      <a href="https://metacpan.org/module/Marpa::R2">Marpa::R2,
        which is available on CPAN</a>.
      Recently, it has gained immensely in "whipitupitude" with
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.048000/pod/Scanless/DSL.pod">
        a new interface</a>,
      which has tutorials
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/dsl_simpler2.html">here
      </a>
      and
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/announce_scanless.html">
        here</a>.
      Marpa has
      <a href="http://jeffreykegler.github.com/Marpa-web-site/">a web page</a>,
      and of course it is the focus of
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/">
        my "Ocean of Awareness" blog</a>.
    </p>
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>
    <h3>Notes</h3>
    <p><a name="NOTE1">Note 1</a>:
      For example,
      <a href="http://en.wikipedia.org/wiki/Domain-specific_language">the Wikipedia article on DSL's</a>;
      <a href="http://www.faqs.org/docs/artu/minilanguageschapter.html">Eric Raymond discussing mini-languages</a>;
      <a href="http://www.dmst.aueb.gr/dds/pubs/jrnl/2000-JSS-DSLPatterns/html/dslpat.html">
        "Notable Design Patterns for Domain-Specific Languages"</a>, Diomidis Spinellis; and
      <a href="http://www.c2.com/cgi/wiki?DomainSpecificLanguage">the c2.com wiki</a>.
    </p>]]>
        
    </content>
</entry>

<entry>
    <title>BNF to AST</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2013/03/bnf-to-ast.html" />
    <id>tag:blogs.perl.org,2013:/users/jeffrey_kegler//63.4414</id>

    <published>2013-03-11T23:40:04Z</published>
    <updated>2013-03-11T23:44:14Z</updated>

    <summary>[ This is cross-posted from the new home of the Ocean of Awareness blog. ] The latest version of Marpa takes parsing &quot;whipitupitude&quot; one step further. You can now go straight from a BNF description of your language, and an input string, to an abstract syntax tree (AST). To illustrate, I&apos;ll use an example from the Gang of Four&apos;s (Go4&apos;s) chapter on the Interpreter pattern. (It&apos;s pages 243-255 of the Design Patterns book.) The Go4 knew of no easy general way to go from BNF to AST, so they dealt with that part of the interpreter problem by punting -- they did not even try to parse the input string. Instead they bypassed the BNF they&apos;d just presented and constructed an AST directly in their code. The reason the Go4 didn&apos;t know of an easy, generally-applicable way to parse their example was that there was none. Now there is. In this post, Marpa will take us quickly and easily from BNF to AST. (Full code for this post can be found in a Github gist.) The Go4&apos;s example was a simple boolean expression language, whose primary input was true and x or y and not x Here, in full, is...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>[ This is <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/03/bnf_to_ast.html">cross-posted from the new home of the Ocean of Awareness blog</a>. ]</p>
  <p>
      <!--
      marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
      -->
      The latest version of
      <a href="https://metacpan.org/module/Marpa::R2">
      Marpa</a> takes parsing "whipitupitude" one step further.
      You can now go straight from
      a BNF description of your language,
      and an input string,
      to an abstract syntax tree (AST).
    </p>
    <p>To illustrate, I'll use an example from the
      Gang of Four's (Go4's) chapter
      on the Interpreter pattern.
      (It's pages 243-255 of the
      <a href="http://en.wikipedia.org/wiki/Design_Patterns">
      <em>Design Patterns</em> book</a>.)
      The Go4 knew of no easy general way to go from BNF to AST,
      so they dealt with that part of the interpreter problem
      by punting --
      they did not even try to parse the input string.
      Instead they bypassed the BNF they'd just presented and
      constructed an AST directly in their code.
    </p>
    <p>The reason the Go4 didn't know of an easy,
    generally-applicable way
      to parse their example was that
      there was none.
      Now there is.
      In this post, Marpa will take us
      quickly and easily
      from BNF to AST.
      (Full code for this post can
      be found in
      <a href="https://gist.github.com/jeffreykegler/5121769">a Github gist</a>.)
    </p>
    <p>
      The Go4's example was a simple boolean expression language,
      whose primary input was
    </p>
    <blockquote>
      <pre>
true and x or y and not x
</pre>
    </blockquote>
    <p>Here, in full, is the BNF for a slight elaboration of the
      Go4 example.
      It is written in the DSL for Marpa's Scanless interface (SLIF DSL),
      and includes specifications for building the AST.
    </p><blockquote>
      <pre>
:default ::= action =&gt; ::array

:start ::= &lt;boolean expression&gt;
&lt;boolean expression&gt; ::=
       &lt;variable&gt; bless =&gt; variable
     | '1' bless =&gt; constant
     | '0' bless =&gt; constant
     | ('(') &lt;boolean expression&gt; (')') action =&gt; ::first bless =&gt; ::undef
    || ('not') &lt;boolean expression&gt; bless =&gt; not
    || &lt;boolean expression&gt; ('and') &lt;boolean expression&gt; bless =&gt; and
    || &lt;boolean expression&gt; ('or') &lt;boolean expression&gt; bless =&gt; or

&lt;variable&gt; ~ [[:alpha:]] &lt;zero or more word characters&gt;
&lt;zero or more word characters&gt; ~ [\w]*

:discard ~ whitespace
whitespace ~ [\s]+
</pre>
    </blockquote>
    <p>This syntax should be fairly transparent.
      In previous posts I've given
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/dsl_simpler2.html">
        a tutorial</a>,
      and
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/announce_scanless.html">a
        mini-tutorial</a>.
      And of course, the interface is
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.048000/pod/Scanless/DSL.pod">
        documented</a>.
    </p>
    <p>For those skimming, here are a few quick comments on less-obvious features.
      To guide Marpa in building the AST,
      the BNF statements have
      <tt>action</tt>
      and
      <tt>bless</tt>
      adverbs.
      The
      <tt>bless</tt>
      adverbs indicate a Perl class into which the node should be
      blessed.
      This is convenient for using an object-oriented approach with the AST.
      The
      <tt>action</tt>
      adverb tells Marpa how to build the nodes.
      "<tt>action =&gt; ::array</tt>" means the result of the rule should
      be an array containing its child nodes.
      "<tt>action =&gt; ::first</tt>" means the result of the rule should just be
      its first child.
      Many of the child symbols,
      especially literal strings of a structural nature,
      are in parentheses.
      This makes them invisible to
      the semantics.
    </p>
    <p>A
      <tt>:default</tt>
      pseudo-rule specifies the defaults -- in this case the
      "<tt>action =&gt; ::array</tt>" adverb setting.
      The
      <tt>:start</tt>
      pseudo-rule specifies the start symbol.
      The <tt>:discard</tt> pseudo-rule
      indicates that whitespace is to be discarded.
    </p>
    <p>The Go4 did not deal with precedence.
      In their example, the input string is fully parenthesized,
      even though its priorities are the standard ones.
      I've eliminated the parentheses, because
      the standard precedence is implemented in the SLIF grammar.
      The double vertical bar ("<tt>||</tt>") is a "loosen" operator --
      an alternative after a "loosen" operator will be
      at a looser precedence than the one before.
      Alternatives separated by a single bar are at the same precedence.
    </p><h3>Creating the AST</h3><p>
      Creating the AST is simple.
      First, we use Marpa to turn the above DSL for boolean expressions
      into a parser.
      (We'd saved the SLIF DSL source in the string
      <tt>$rules</tt>.)
    </p><blockquote>
      <pre>
my $grammar = Marpa::R2::Scanless::G->new(
    {   bless_package => 'Boolean_Expression',
        source        => \$rules,
    }   
);  
</pre>
    </blockquote>
    <p>Next we define a closure that uses
      <tt>$grammar</tt>
      to turn
      BNF into AST's.
    </p><blockquote>
      <pre>
sub bnf_to_ast {
    my ($bnf) = @_;
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    $recce->read( \$bnf );
    my $value_ref = $recce->value();
    if ( not defined $value_ref ) {
        die "No parse for $bnf";
    }
    return ${$value_ref};
} ## end sub bnf_to_ast
</pre>
    </blockquote><p>
Where <tt>$bnf</tt> is our input string,
we run it as follows:
    </p><blockquote>
      <pre>
my $ast1 = bnf_to_ast($bnf);
</pre>
    </blockquote>
    <h3>The AST</h3>
    <p>If we use Data::Dumper to examine the AST,
    </p><blockquote>
      <pre>
say Data::Dumper::Dumper($ast1) if $verbose_flag;
</pre>
    </blockquote><p>
      we see this:
    </p><blockquote>
      <pre>
$VAR1 = bless( [
                 bless( [
                          bless( [
                                   'true'
                                 ], 'Boolean_Expression::variable' ),
                          bless( [
                                   'x'
                                 ], 'Boolean_Expression::variable' )
                        ], 'Boolean_Expression::and' ),
                 bless( [
                          bless( [
                                   'y'
                                 ], 'Boolean_Expression::variable' ),
                          bless( [
                                   bless( [
                                            'x'
                                          ], 'Boolean_Expression::variable' )
                                 ], 'Boolean_Expression::not' )
                        ], 'Boolean_Expression::and' )
               ], 'Boolean_Expression::or' );
</pre>
    </blockquote>
    <h3>Processing the AST</h3>
    <p>In their example,
    the Go4 processed their AST in several ways:
    straight evaluation, copying,
      and substitution of the occurrences of a variable in one boolean expression
      by another boolean expression.
      It is obvious that the AST above is the computational
      equivalent of the Go4's AST,
      but for the sake of completeness I carry out the same operations
      <a href="https://gist.github.com/jeffreykegler/5121769">in the Github gist</a>.
    </p>
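    <p>As a minimal sketch of the first of these operations, straight evaluation,
      the Perl below walks an AST of the shape shown above.
      It is not the code from the gist:
      the <tt>node</tt> helper, the <tt>evaluate</tt> function
      and the <tt>%context</tt> hash are names assumed here for illustration,
      and the AST is rebuilt by hand so that the sketch runs without Marpa.
      Note that, in the grammar above, <tt>true</tt> parses as a variable,
      so the context has to give it a value.
    </p><blockquote>
      <pre>
use strict;
use warnings;

# Hypothetical helper: build a node blessed the way the AST dump above shows.
sub node {
    my ( $class, @children ) = @_;
    return bless [@children], "Boolean_Expression::$class";
}

# Hand-built copy of the AST for "true and x or y and not x".
my $ast = node(
    'or',
    node( 'and', node( 'variable', 'true' ), node( 'variable', 'x' ) ),
    node(
        'and',
        node( 'variable', 'y' ),
        node( 'not', node( 'variable', 'x' ) )
    )
);

# Hypothetical evaluator: dispatch on the class of each node.
# $context is a hash ref mapping variable names to truth values.
sub evaluate {
    my ( $node, $context ) = @_;
    my $class = ref $node;
    if ( $class eq 'Boolean_Expression::variable' ) {
        my $name = $node->[0];
        die "Unknown variable: $name" if not exists $context->{$name};
        return $context->{$name} ? 1 : 0;
    }
    if ( $class eq 'Boolean_Expression::constant' ) {
        # Assumption: a constant node holds the literal '1' or '0'.
        return $node->[0] ? 1 : 0;
    }
    if ( $class eq 'Boolean_Expression::not' ) {
        return evaluate( $node->[0], $context ) ? 0 : 1;
    }
    if ( $class eq 'Boolean_Expression::and' ) {
        my $left  = evaluate( $node->[0], $context );
        my $right = evaluate( $node->[1], $context );
        return 1 if $left and $right;
        return 0;
    }
    if ( $class eq 'Boolean_Expression::or' ) {
        my $left  = evaluate( $node->[0], $context );
        my $right = evaluate( $node->[1], $context );
        return 1 if $left or $right;
        return 0;
    }
    die "Unknown node type: $class";
}

my %context = ( 'true' => 1, 'x' => 0, 'y' => 1 );
my $result = evaluate( $ast, \%context );
print "$result\n";    # prints 1
</pre>
    </blockquote>
    <p>Dispatching on <tt>ref $node</tt> is only one way to do this.
      Because the nodes are blessed, the same logic could instead be written
      as an <tt>evaluate</tt> method in each of the
      <tt>Boolean_Expression</tt> classes.
    </p>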
    <p>
      AST creation via Marpa's SLIF is self-hosting --
      the SLIF DSL is parsed into an AST,
      and a parser is created by interpreting that AST.
      The Marpa SLIF DSL source file in this post,
      that describes boolean expressions,
      was itself turned into an AST on its way to becoming a parser
      that turns boolean expressions into AST's.
    </p><h3>Comments</h3>
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>
]]>
        
    </content>
</entry>

<entry>
    <title>A language for writing languages</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2013/01/a-language-for-writing-languages.html" />
    <id>tag:blogs.perl.org,2013:/users/jeffrey_kegler//63.4197</id>

    <published>2013-01-14T00:15:33Z</published>
    <updated>2013-01-14T00:21:57Z</updated>

    <summary>[ This is cross-posted from the new home of the Ocean of Awareness blog. ] Marpa::R2&apos;s Scanless interface is not yet two weeks old, but already there are completed applications. Significantly, two of them are for work. A JSON Parser The non-work-related application is a JSON parser. Given what it does, it easily could have been work-related. (It&apos;s been available for a few days as a gist, so it may well be in production use somewhere.) It was written by Peter Stuifzand, runs 185 lines and took him less than 30 minutes to write. Peter reports that it was a matter of typing in the grammar, and adding a few Perl functions to provide the semantics. There are, of course, other JSON parsers out there, many of which run faster. These, however, took weeks to write. If you are, for example, thinking of extending JSON, and development time is a major consideration, the Marpa-based solution will be attractive. Printer escape codes Peter also did a Marpa-based language for work -- a solution to the problem of printer escape codes. For those unfamiliar, a printer&apos;s special features can often be invoked by &quot;escape sequences&quot; -- byte sequences which control things like...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>[ This is cross-posted from <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/language_for_languages.html">the new home of the Ocean of Awareness blog</a>. ] 
  <p><a href="https://metacpan.org/release/Marpa-R2">
      Marpa::R2</a>'s
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/announce_scanless.html">
      Scanless interface</a>
      is not yet two weeks old,
      but already there are completed applications.
      Significantly, two of them are for work.
      <h3>A JSON Parser</h3>
      <p>The non-work-related application is
      <a href="https://gist.github.com/4447349">
      a JSON parser</a>.
      Given what it does,
      it easily could have been work-related.
      (It's been available for a few days as a gist,
      so it may well be in production use somewhere.)
      It was written by Peter Stuifzand,
      runs 185 lines
      and took him less than 30 minutes to write.
      Peter reports that it was a matter of
      typing in the grammar,
      and adding a few Perl functions to provide the semantics.
      <p>
      There are, of course, other JSON parsers out there,
      many of which run faster.
      These, however, took weeks to write.
      If you are, for example,
      thinking of
      <a href="http://bolinfest.com/essays/json.html">
      extending JSON</a>,
      and development time is a major consideration,
      the Marpa-based solution will be attractive.
      <h3>Printer escape codes</h3>
      <p>Peter also did
      <a href="https://groups.google.com/d/msg/marpa-parser/n4ouLW0e6P8/vdrku9fczZEJ">
      a Marpa-based language for work</a> --
      a solution to the problem of printer escape codes.
      For those unfamiliar, a printer's special features can often be invoked
      by "escape sequences" --
      byte sequences which control things like cursor motion, color, character sets,
      graphics, etc., etc.
      It's nice to invoke them with a set of well-named functions.
      </p><p>Escape sequences are usually repetitive,
      and when complex, are usually not complex in an interesting way.
      They can be programmed with regex or eval hacks.
      But this time
      Peter chose to write 
      a mini-language that specifies 
      escape sequences,
      and to use Marpa to
      compile the mini-language into Perl code.
      He was done in an hour.
      <h3>A log file query language</h3>
      <p>Meanwhile, an interesting and adventurous language effort
      was underway on
      the other side of the Atlantic, where Paul Bennett,
      faced with analyzing lots of nginx log files,
      <a href="https://plus.google.com/u/0/110360907592575381901/posts/XdTPRHvbA8w#110360907592575381901/posts/XdTPRHvbA8w">
      decided a powerful custom log query language was
      the best way to address his issue</a>.
      Paul needed to design and specify his language from scratch.
      Paul was also facing a learning curve,
      but he read the gist for a Scanless interface example,
      and apparently was able to teach himself quickly from there.
      (He doesn't say, but it might have been the one for
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/dsl_simpler2.html">
      this post</a>.)
      <p>Like Peter's escape sequence language,
      Paul's log query language program compiles to Perl.
      Its writing and debugging
      were spread out over 3 days.
      Paul reports that his language is on the job already,
      but that
      it needs some clean-up before going onto CPAN.
      <p>
      The snippets Paul shows are enticing.
      The language seems to include 
      strings, integers and timestamps as supported types;
      regexes;
      a full set of comparison and boolean operators;
      and helpful new "any", "between" and "one" operators.
      Pretty good for three days.
      A lot of nasty problems snuggled away
      in log files may find their
      hiding places are not nearly as safe 
      as they have been able to expect.
    <h3>Where to start</h3>
    <p>If you're interested in learning more about Marpa's Scanless
    interface, there is
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/dsl_simpler2.html">
      a tutorial</a>.
      Additionally,
      the announcement of the Scanless interface contained
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/announce_scanless.html">a
      mini-tutorial</a>.
    <h3>Comments</h3>
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>
]]>
        
    </content>
</entry>

<entry>
    <title>Making DSL&apos;s even simpler</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2013/01/making-dsls-even-simpler.html" />
    <id>tag:blogs.perl.org,2013:/users/jeffrey_kegler//63.4183</id>

    <published>2013-01-08T16:59:04Z</published>
    <updated>2013-01-08T17:02:48Z</updated>

    <summary>[ This is cross-posted from the new home of the Ocean of Awareness blog. ] In a previous post, I described a method of writing powerful domain-specific languages (DSLs), one that was simpler and faster than previous approaches. This post takes things significantly further. The approach described in the previous post was not itself directly DSL-based, and it required the programmer to write a separate lexer. This post uses Marpa::R2&apos;s new Scanless interface. The Scanless interface is a DSL for writing DSL&apos;s and it incorporates the specification of the lexer into the language description. When it comes to dealing with a programming problem, no tool is as powerful and flexible as a custom language targeted to the problem domain. But writing a domain specific language (DSL) is among the least used approaches, and for what has been a very good reason -- in the past, DSL&apos;s have been very difficult to write. This post takes a tutorial approach. It does not assume knowledge of the previous tutorials on this blog. The full code for this post is in a Github gist. Our example DSL is a calculator, one whose features are chosen for the purpose of illustration. It is not...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>[ This is <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/dsl_simpler2.html">cross-posted from the new home of the Ocean of Awareness blog</a>. ]
  <p><a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/dsl.html">
        In a previous post</a>, I described a method of writing
	powerful domain-specific languages (DSLs),
	one that was simpler and faster
	than previous approaches.
      This post takes things significantly further.
      <p>
      The approach described in the previous post was not itself directly
      DSL-based,
      and it required the programmer to write a separate lexer.
      This post uses
      <a href="https://metacpan.org/module/Marpa::R2">Marpa::R2</a>'s
      new Scanless interface.
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.040000/pod/Scanless.pod">
      The Scanless interface</a>
      is a DSL for writing DSL's
      and it incorporates the specification of the lexer into
      the language description.
    </p>
    <p>
      When it comes to dealing with a programming problem,
      no tool is as powerful and flexible as
      a custom language targeted to the problem domain.
      But writing a domain-specific language (DSL) is among the
      least used approaches,
      and for what has been a very good reason --
      in the past,
      DSL's have been very difficult to write.
    </p>
    <p>This post takes a tutorial approach.
      It does
      <b>not</b>
      assume knowledge of the previous tutorials
      on this blog.
    </p><p>
      The full code for this post is in
      <a href="https://gist.github.com/4480523">
        a Github gist</a>.
      Our example DSL is a calculator,
      one whose features
      are chosen for the purpose of illustration.
      It is not a "toy" example -- its error reporting
      is quite good and it has a test suite.
      Nonetheless, it is both short and easy to read,
      capable of being
      written quickly and maintained and extended easily.
    </p>
    <h3>The Grammar</h3>
    <p>
      The grammar for our calculator
      divides naturally into two parts.  Here is the first:
    </p><blockquote>
      <pre>
:start ::= script
script ::= expression
script ::= script ';' expression action =&gt; do_arg2
&lt;reduce op&gt; ::= '+' | '-' | '/' | '*'
expression ::=
     number
   | variable action =&gt; do_is_var
   | '(' expression ')' assoc =&gt; group action =&gt; do_arg1
  || '-' expression action =&gt; do_negate
  || expression '^' expression action =&gt; do_caret assoc =&gt; right
  || expression '*' expression action =&gt; do_star
   | expression '/' expression action =&gt; do_slash
  || expression '+' expression action =&gt; do_plus
   | expression '-' expression action =&gt; do_minus
  || expression ',' expression action =&gt; do_array
  || &lt;reduce op&gt; 'reduce' expression action =&gt; do_reduce
  || variable '=' expression action =&gt; do_set_var
</pre></blockquote>
    <p>The format of the grammar is documented
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.040000/pod/Scanless.pod">
        here</a>.
      It consists of a series of rules.
      Each rule has a
      left hand side (LHS)
      and a right hand side (RHS),
      which are separated by a rule operator.
      In the rules above, the rule operator is the BNF operator (<tt>::=</tt>).
      <p>
      The first rule is a pseudo-rule -- its LHS is the pseudo-symbol
      <tt>:start</tt>,
      and indicates that
      <tt>script</tt>
      is the grammar's start symbol.
      The next two rules indicate that
      <tt>script</tt>
      is a series of one
      or more
      <tt>expression</tt>'s, separated by a semicolon.
    </p><p>
      Rules can have action "adverbs"
      to describe the semantics.
      For example, the adverb "<tt>action =&gt; do_arg2</tt>"
      says that the semantics for
      the preceding RHS are implemented by a Perl closure named
      <tt>do_arg2</tt>.
    </p><p>The rule for
      <tt>&lt;reduce op&gt;</tt>
      introduces two new features: symbol names
      in angle brackets, and alternatives,
      separated by a vertical bar ("<tt>|</tt>").
    </p>
    <p>The last and longest rule, which defines an
      <tt>expression</tt>,
      is a precedence rule.
      It is a series of alternatives, some separated by
      a single vertical bar,
      and others separated by a double vertical bar ("<tt>||</tt>").
      The double vertical bar indicates that the alternatives after it
      are at a looser ("lower") precedence than the alternatives before it.
      The single vertical bar separates alternatives at the same precedence level.
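    </p><p>
      As a minimal sketch of how the two separators behave,
      the script below builds a much smaller grammar using the same
      <tt>Marpa::R2::Scanless::G</tt> and <tt>Marpa::R2::Scanless::R</tt>
      calls that appear later in this post.
      The package name <tt>Demo_Actions</tt>, its closures and the input string
      are invented for the illustration and are not part of the calculator.
      The named arguments are those of the Marpa::R2 release this post was written
      against, so other releases may want them spelled differently.
    </p><blockquote>
      <pre>
use strict;
use warnings;
use Marpa::R2;

# Hypothetical semantics package, for this sketch only.
package Demo_Actions;
sub new     { return bless {}, shift }
sub do_arg0 { return $_[1] }            # pass the first child through
sub do_star { return $_[1] * $_[3] }    # expression '*' expression
sub do_plus { return $_[1] + $_[3] }    # expression '+' expression

package main;

# '*' comes before the second '||', so it is at a tighter precedence than '+'.
my $rules = &lt;&lt;'END_OF_GRAMMAR';
:start ::= expression
expression ::=
     number
  || expression '*' expression action =&gt; do_star
  || expression '+' expression action =&gt; do_plus
number ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_GRAMMAR

my $grammar = Marpa::R2::Scanless::G-&gt;new(
    {   action_object  =&gt; 'Demo_Actions',
        default_action =&gt; 'do_arg0',
        source         =&gt; \$rules,
    }
);
my $recce = Marpa::R2::Scanless::R-&gt;new( { grammar =&gt; $grammar } );

my $input = '2 + 3 * 4';
$recce-&gt;read( \$input );
my $value_ref = $recce-&gt;value();
die "No parse was found\n" if not defined $value_ref;
print ${$value_ref}, "\n";
</pre></blockquote>
    <p>
      Because the <tt>'*'</tt> alternative is listed before the double bar that
      introduces the <tt>'+'</tt> alternative,
      <tt>3 * 4</tt> should be grouped first, giving 14 rather than 20.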
    </p><p>
      While Marpa's Scanless interface allows lexical and structural rules
      to be intermixed,
      it is usually convenient to have the lexical rules come after
      the structural rules:
    </p>
    <blockquote>
      <pre>
number ~ [\d]+
variable ~ [\w]+
:discard ~ whitespace
whitespace ~ [\s]+
# allow comments
:discard ~ &lt;hash comment&gt;
&lt;hash comment&gt; ~ &lt;terminated hash comment&gt; | &lt;unterminated
   final hash comment&gt;
&lt;terminated hash comment&gt; ~ '#' &lt;hash comment body&gt; &lt;vertical space char&gt;
&lt;unterminated final hash comment&gt; ~ '#' &lt;hash comment body&gt;
&lt;hash comment body&gt; ~ &lt;hash comment char&gt;*
&lt;vertical space char&gt; ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
&lt;hash comment char&gt; ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
END_OF_GRAMMAR
</pre></blockquote>
    <p>
      Rules in this second set of rules have
      the same syntax as rules in the first set,
      but instead of the BNF operator (<tt>::=</tt>),
      they have a match operator (<tt>~</tt>) separating the LHS and RHS.
      The BNF operator can be seen as telling Marpa, "When it comes to whitespace and comments,
      do what I
      <b>mean</b>".
      The match operator tells Marpa to "Do exactly what I
      <b>say</b>
      on a literal,
      character-by-character basis."
    </p><p>
      The first two lines indicate how
      <tt>number</tt>'s and
      <tt>variable</tt>'s
      are formed.
      The square bracketed character classes accept anything acceptable to Perl.
      <tt>:discard</tt>
      is another pseudo-symbol -- any lexeme recognized as a
      <tt>:discard</tt>
      symbol is thrown away.
    </p><p>
      This is how whitespace and comments are dealt with.
      Note that our calculator recognizes "hash comments",
      and takes some care to do the right thing even when the hash comment is at
      the end of a string which does not end in vertical whitespace.
      It is interesting to compare the representation of hash comments here with
      the usual regular expression notation.
      Regular expressions are much more concise, but the BNF-ish form
      can be easier to read.
      In this example,
      long descriptive angle-bracketed symbol names
      save the reader the trouble of
      puzzling out the purpose of some of the more obscure cases.
    </p>
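    <p>
      For comparison, here is one way the same comment syntax might be written
      as a single Perl regular expression.
      This is only a sketch in plain Perl, not code from the calculator;
      the variable names and the sample text are made up for the illustration.
    </p>
    <blockquote>
      <pre>
use strict;
use warnings;

# The characters treated as vertical space, as in the grammar above.
my $vertical_space = '\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}';

# A '#', then any run of non-vertical-space characters, then either
# one vertical space character or the end of the string.
my $hash_comment = qr/[#][^$vertical_space]*(?:[$vertical_space]|\z)/;

my $text     = "1 + 2 # add\n3 # final";
my @comments = $text =~ /($hash_comment)/g;
print scalar(@comments), " comments found\n";
</pre></blockquote>
    <p>
      The "terminated" and "unterminated final" cases,
      which the grammar spells out as two named alternatives,
      are folded here into the closing <tt>(?:...|\z)</tt> group.
    </p>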
    <p>
      Now that we have defined the grammar, we need to pre-process it:
    </p>
    <blockquote>
      <pre>
      my $grammar = Marpa::R2::Scanless::G-&gt;new(
	{ action_object  =&gt; 'My_Actions',
	  default_action =&gt; 'do_arg0',
	  source =&gt; \$rules,
	}
      );
</pre></blockquote>
    <p>
      The
      <tt>action_object</tt>
      named argument specifies a package to implement
      the semantics -- Marpa will look up the names of the Perl closures in that
      package.
      The
      <tt>default_action</tt>
      named argument specifies the action name for RHS's
      which do not explicitly specify one with an
      <tt>action</tt>
      adverb.
    </p>
    <h3>Running a parse</h3>
    <p>
      The <tt>calculate()</tt> closure uses our grammar to parse a string.
    </p>
    <blockquote>
      <pre>
sub calculate {
    my ($p_string) = @_;

    my $recce = Marpa::R2::Scanless::R-&gt;new( { grammar =&gt; $grammar } );

    my $self = bless { grammar =&gt; $grammar }, 'My_Actions';
    $self-&gt;{recce}        = $recce;
    $self-&gt;{symbol_table} = {};
    local $My_Actions::SELF = $self;

    if ( not defined eval { $recce-&gt;read($p_string); 1 } ) {

        # Add last expression found, and rethrow
        my $eval_error = $EVAL_ERROR;
        chomp $eval_error;
        die $self-&gt;show_last_expression(), "\n", $eval_error, "\n";
    } ## end if ( not defined eval { $recce-&gt;read($p_string); 1 })
    my $value_ref = $recce-&gt;value();
    if ( not defined $value_ref ) {
        die $self-&gt;show_last_expression(), "\n",
            "No parse was found, after reading the entire input\n";
    }
    return ${$value_ref}, $self-&gt;{symbol_table};

} ## end sub calculate

</pre></blockquote>
    <p>Walking through the code,
    we first create a recognizer ("recce" for short) from our grammar.
      Next, we define a parse object named "<tt>$self</tt>".
      (Object enthusiasts will, I hope, forgive a certain awkwardness at this stage.)
    <p>
    Next, we call the
      <tt>read()</tt>
      method on the recognizer with our string.
      We then check the result of the <tt>read()</tt> method for errors.
      <p>
      Finally, we return our results.
      This calculator allows variables, whose values it keeps in a symbol table.
      Since these can be important side effects, the symbol table is returned
      as part of the results.
    </p>
    <h3>Dealing with errors</h3>
    <p>This calculator has error reporting that compares favorably with
      production languages.
      (Unfortunately, these often do not set the bar very high.)
      The methods of the Scanless interface return diagnostics that
      pinpoint, from the technical point of view,
      where things went wrong
      and what the problem was.
      As a diagnostic, this is often adequate, but not always.
      Marpa's diagnostics have 100% technical accuracy, but
      the parsing may have ceased to reflect the programmer's intent before
      there is a technical problem.
    </p>
    <p>To help the programmer sync his intent to what Marpa is seeing,
    when there is a problem,
      this calculator reports to the user the text for the last
      <tt>expression</tt>
      that was successfully
      recognized.
      Here's the code that finds it:
    </p><blockquote>
      <pre>
sub show_last_expression {
    my ($self) = @_;
    my $recce = $self-&gt;{recce};
    my ( $start, $end ) = $recce-&gt;last_completed_range('expression');
    return 'No expression was successfully parsed' if not defined $start;
    my $last_expression = $recce-&gt;range_to_string( $start, $end );
    return "Last expression successfully parsed was: $last_expression";
} ## end sub show_last_expression
</pre></blockquote>
    <h3>The semantics</h3>
    <p>Here is a snippet of the semantics, with a few of the simpler semantic closures.
    </p><blockquote>
      <pre>
package My_Actions;
our $SELF;
sub new { return $SELF }
sub do_set_var {
    my ( $self, $var, undef, $value ) = @_;
    return $self-&gt;{symbol_table}-&gt;{$var} = $value;
}
sub do_negate { return -$_[2]; }
sub do_arg0 { return $_[1]; }
sub do_arg1 { return $_[2]; }
sub do_arg2 { return $_[3]; }
</pre></blockquote>
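    <p>
      To make the argument positions concrete,
      the sketch below calls three of these closures by hand,
      passing the arguments in the order the closures above expect:
      the per-parse object first, then one value for each RHS symbol.
      It does not use Marpa at all.
      The package name <tt>Sketch_Actions</tt> and the sample values are
      assumptions made for the illustration.
    </p><blockquote>
      <pre>
use strict;
use warnings;

# Copies of three closures from above, in a package of their own.
package Sketch_Actions;

sub do_set_var {
    my ( $self, $var, undef, $value ) = @_;
    return $self-&gt;{symbol_table}-&gt;{$var} = $value;
}
sub do_negate { return -$_[2]; }
sub do_arg1   { return $_[2]; }

package main;

my $self = bless { symbol_table =&gt; {} }, 'Sketch_Actions';

# variable '=' expression
my $assigned = Sketch_Actions::do_set_var( $self, 'x', '=', 42 );

# '-' expression
my $negated = Sketch_Actions::do_negate( $self, '-', 7 );

# '(' expression ')'
my $grouped = Sketch_Actions::do_arg1( $self, '(', 5, ')' );

my $stored = $self-&gt;{symbol_table}-&gt;{x};
print "$assigned $negated $grouped $stored\n";
</pre></blockquote>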
    <h3>About this example</h3>
    <p>Full code for this example can be found in 
      <a href="https://gist.github.com/4480523">a Github gist</a>.
      Semantics, legalese, a test suite and other packaging
      bring its total length to not quite 300 lines.
    It uses the latest indexed CPAN release
    of <a href="https://metacpan.org/module/Marpa::R2">Marpa::R2</a>.
    Marpa also has <a href="http://jeffreykegler.github.com/Marpa-web-site/">
    a web page</a>.
    <h3>Comments</h3>
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>]]>
        
    </content>
</entry>

<entry>
    <title>Announcing Marpa&apos;s Scanless interface</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2013/01/announcing-marpas-scanless-interface.html" />
    <id>tag:blogs.perl.org,2013:/users/jeffrey_kegler//63.4170</id>

    <published>2013-01-03T03:56:29Z</published>
    <updated>2013-01-03T03:59:57Z</updated>

    <summary><![CDATA[[ This is cross-posted from the new home of the Ocean of Awareness blog. ] Marpa::R2's Scanless interface is now out of beta and available in full release on CPAN. This interface allows Marpa to be used without the need to create a separate lexer (scanner), and increases Marpa's level of "whipitupitude". Here's what a simple calculator looks like in the Scanless interface: :start ::= Script Script ::= Expression+ separator =&gt; comma action =&gt; do_script comma ~ [,] Expression ::= Number | '(' Expression ')' action =&gt; do_parens assoc =&gt; group || Expression '**' Expression action =&gt; do_pow assoc =&gt; right || Expression '*' Expression action =&gt; do_multiply | Expression '/' Expression action =&gt; do_divide || Expression '+' Expression action =&gt; do_add | Expression '-' Expression action =&gt; do_subtract Number ~ [\d]+ :discard ~ whitespace whitespace ~ [\s]+ # allow comments :discard ~ &lt;hash comment&gt; &lt;hash comment&gt; ~ &lt;terminated hash comment&gt; | &lt;unterminated final hash comment&gt; &lt;terminated hash comment&gt; ~ '#' &lt;hash comment body&gt; &lt;vertical space char&gt; &lt;unterminated final hash comment&gt; ~ '#' &lt;hash comment body&gt; &lt;hash comment body&gt; ~ &lt;hash comment char&gt;* &lt;vertical space char&gt; ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}] &lt;hash comment char&gt; ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}] The full example, with semantics, is in...]]></summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>[ This is cross-posted from <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2013/01/announce_scanless.html">the new home of the Ocean of Awareness blog</a>. ]
<p>
      <a href="https://metacpan.org/release/Marpa-R2">
        Marpa::R2</a>'s
      Scanless interface is now out of beta and
      available in full release on CPAN.
      This interface allows Marpa to be used without the need to create
      a separate lexer (scanner),
      and increases Marpa's level of "whipitupitude".
      Here's what a simple calculator looks like in the Scanless interface:
    </p><blockquote>
      <pre>
:start ::= Script
Script ::= Expression+ separator =&gt; comma action =&gt; do_script
comma ~ [,]
Expression ::=
    Number
    | '(' Expression ')' action =&gt; do_parens assoc =&gt; group
   || Expression '**' Expression action =&gt; do_pow assoc =&gt; right
   || Expression '*' Expression action =&gt; do_multiply
    | Expression '/' Expression action =&gt; do_divide
   || Expression '+' Expression action =&gt; do_add
    | Expression '-' Expression action =&gt; do_subtract
Number ~ [\d]+
 
:discard ~ whitespace
whitespace ~ [\s]+
# allow comments
:discard ~ &lt;hash comment&gt;
&lt;hash comment&gt; ~ &lt;terminated hash comment&gt; | &lt;unterminated
   final hash comment&gt;
&lt;terminated hash comment&gt; ~ '#' &lt;hash comment body&gt; &lt;vertical space char&gt;
&lt;unterminated final hash comment&gt; ~ '#' &lt;hash comment body&gt;
&lt;hash comment body&gt; ~ &lt;hash comment char&gt;*
&lt;vertical space char&gt; ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
&lt;hash comment char&gt; ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
</pre></blockquote>
    <p>The full example, with semantics,
      is in
      <a href="https://gist.github.com/4440418">
        a Github gist</a>.
      It is almost identical to the example in
      the Scanless interface documents, and to a test in Marpa::R2's test suite.
    </p><p>
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.038000/pod/BNF.pod">
        Marpa's BNF interface</a>
	came out of beta and into full release at the same time as the Scanless interface.
      Like the Scanless interface,
      the BNF interface allows you to write your grammar in a BNF variant.
      Unlike the Scanless interface, it requires you to do your own lexing.
    </p><p>The Scanless interface is a superset of the BNF interface, so
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.038000/pod/BNF.pod">
    the documentation of the BNF interface</a>
      is the best place to start for learning both.
      However, to work with either, you probably should already
      have at least some familiarity with
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.038000/pod/Marpa_R2.pod">Marpa's
        standard interface</a>.
    </p>
    <p>
      Not long ago, my work on Marpa was a lone endeavour.
      One sign of Marpa's emergence is that my work
      now is often based on insights gained by others who have used Marpa.
      The BNF interface is based on one written by Peter Stuifzand.
      And the approach to scannerless parsing that I finally settled on
      was suggested to me by Andrew Rodland's prior work on pairing Marpa grammars.
    </p>
    <h3>Comments</h3>
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>
]]>
        
    </content>
</entry>

<entry>
    <title>A self-parsing and self-lexing grammar</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2012/12/a-self-parsing-and-self-lexing-grammar.html" />
    <id>tag:blogs.perl.org,2012:/users/jeffrey_kegler//63.4166</id>

    <published>2013-01-01T03:43:21Z</published>
    <updated>2013-01-01T03:47:22Z</updated>

    <summary> [ This is cross-posted from the new home of the Ocean of Awareness blog. ] In a previous post, I showed a self-parsing grammar, written in Marpa&apos;s new BNF interface. That grammar was in a tradition going back to the 70&apos;s. Following the tradition, I cheated a bit. That grammar required, but did not include, a lexer to make a prepass over its input. This post contains a self-parsing and self-lexing grammar, the one for Marpa&apos;s forthcoming Scanless interface. This grammar is about as self-contained as a grammar can get, short of being encoded into a Universal Turing machine. Many readers will prefer to be introduced to the Scanless interface by way of a simpler example, but based on the response to the previous post I know there are others who share my fascination with self-description and self-exemplification. And there is something to be said for reading an example that is a final authority on itself. This is certainly a practical example. The grammar that follows is used to parse itself and all other grammars written for the Marpa&apos;s Scanless interface. It is also used to parse the strings written for Marpa&apos;s BNF interface, the Scanless interface&apos;s predecessor. Starting...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[ <p>[ This is cross-posted from <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/12/self_lex.html">the new home of the Ocean of Awareness blog</a>. ]
<p><a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/self_parse.html">In
        a previous post</a>, I showed a self-parsing grammar,
      written in Marpa's new BNF interface.
      That grammar was in a tradition going back to the 70's.
      Following the tradition, I cheated a bit.
      That grammar required,
      but did not include, a lexer to make a prepass over
      its input.
    </p><p>
      This post contains a self-parsing
      and self-lexing grammar,
      the one for Marpa's forthcoming Scanless interface.
      This grammar is about as self-contained as a grammar can get,
      short of being encoded into a
      <a href="http://en.wikipedia.org/wiki/Universal_Turing_machine">Universal
        Turing machine</a>.
    </p>
    <p>
      Many readers will
      prefer to be introduced to the Scanless interface
      by way of
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.035_003/pod/Scanless.pod">
        a simpler example</a>,
      but based on the response to the previous post I know
      there are others who share my fascination with
      self-description and self-exemplification.
      And there is something to be said for reading an example
      that is a final authority on itself.
    </p>
    <p>
      This is certainly a practical example.
      The grammar that follows is used
      to parse itself and all other grammars written for
      Marpa's Scanless interface.
      It is also used to parse the strings written
      for Marpa's BNF interface,
      the Scanless interface's predecessor.
    </p>
    <h3>Starting out</h3>
    <p>
      The grammar in this blog post is abridged a bit,
      and rearranged for ease of explanation.
      The original is
      <a href="https://metacpan.org/source/JKEGL/Marpa-R2-2.035_003/lib/Marpa/R2/meta/metag.bnf">
        here</a>.
    </p><blockquote>
      <pre>
# Copyright 2012 Jeffrey Kegler
</pre></blockquote>
    <p>The file starts with legalese, which I've cut.
      (The grammar is under GNU's LGPL 3.0.)
      Note the hash comment -- since this is a self-describing self-lexer,
      the grammar will eventually tell us how it deals with hash comments.
    </p>
    <blockquote>
      <pre>
:start ::= rules
rules ::= rule+
rule ::= &lt;start rule&gt;
  | &lt;empty rule&gt;
  | &lt;priority rule&gt;
  | &lt;quantified rule&gt;
  | &lt;discard rule&gt;
</pre></blockquote>
    <p>The
      <tt>rules</tt>
      symbol is the start symbol.
      Our grammar consists of a series of one or more rules,
      each one of which falls into one of five types.
    </p>
    <p>
      By my count,
      three of the rules in our self-describing grammar are
      themselves self-describing.
      The last two rules above are two of them.
      The definition of
      <tt>rules</tt>
      is itself a series
      of one or more
      <tt>rule</tt>'s.
      And the definition of
      <tt>rule</tt>
      is a
      <tt>&lt;priority rule&gt;</tt>,
      which is
      one of the 5 types of
      <tt>rule</tt>
      that are allowed.
    </p>
    <p>
      The first two rules exemplify two of the other rule types:
      the first is a
      <tt>&lt;start rule&gt;</tt>
      and the second is a
      <tt>&lt;quantified rule&gt;</tt>.
      A start rule is defined as
      consisting of the
      <tt>:start</tt>
      pseudo-symbol,
      followed by a
      <tt>::=</tt>
      operator,
      followed by a
      <tt>symbol</tt>:
    </p>
    <blockquote>
      <pre>
&lt;start rule&gt; ::= (':start' &lt;op declare bnf&gt;) symbol
&lt;op declare bnf&gt; ~ '::='
</pre></blockquote>
    <p>
      The parentheses can be ignored.
      Their purpose is to make life easier on the semantics
      -- they surround symbols
      that are hidden from the semantics.
    </p>
    <h3>"What I say" versus "what I mean"</h3>
    <p>When we define a grammar rule,
      sometimes we are asking Marpa to do exactly what
      we say, character for character.
      For example, in the
      "<tt>&lt;op declare bnf&gt; ~ '::='</tt>" rule above,
      we are saying that the symbol
      <tt>&lt;op declare bnf&gt;</tt>
      is exactly the string '::=',
      nothing more and nothing less.
      "Do exactly what I say" rules are called lexical rules,
      and for them we use the match operator ("<tt>~</tt>").
    </p>
    <p>In other cases, we are asking Marpa to "do what I mean".
      That is, we are saying that "the rule essentially consists of these symbols,
      but I also want you
      to do what is reasonable with whitespace and comments."
      An example of a "do what I mean" rule was
      "<tt>rules ::= rule+</tt>".
      Within, between, before and after the
      <tt>rule</tt>
      symbols,
      there will be comments and whitespace.
      The whitespace will sometimes be optional,
      and will sometimes be required.
    </p>
    <p>
      Spelling all the comment and whitespace handling
      out would be very tedious and error-prone.
      We want there to be rules of a kind that "do what we mean".
      Marpa's "do what I mean" rules are called "structural" rules,
      and are identified by the BNF operator ("<tt>::=</tt>").
    </p><p>Some structural rules have lexical content within them.
      An example was
      "<tt>&lt;start rule&gt; ::= (':start' &lt;op declare bnf&gt;) symbol</tt>".
      That is basically a structural rule, where Marpa should "do what I mean"
      with whitespace and comments.
      But it also contains a string,
      <tt>':start'</tt>.
      When a string or a character class
      occurs inside a structural rule,
      Marpa knows how to treat them properly.
      Marpa knows, for example,
      that whitespace is not OK
      between the "<tt>a</tt>" and the "<tt>r</tt>" of "<tt>:start</tt>".
    </p>
    <p>
      The "what I mean" versus "what I say" distinction
      corresponds very closely to the distinction in Perl 6 grammars
      between the "rule" and "token".
      It also corresponds to the traditional division of labor in compilers,
      between the lexer and the parser proper.
    </p>
    <h3>Rules</h3>
    <p>Above we saw an example of a quantified rule: "<tt>rules ::= rule+</tt>".
      Here is the definition:
    </p>
    <blockquote>
      <pre>
&lt;quantified rule&gt;
  ::= lhs &lt;op declare&gt;
    &lt;single symbol&gt; quantifier &lt;adverb list&gt;
lhs ::= &lt;symbol name&gt;
&lt;op declare&gt;
  ::= &lt;op declare bnf&gt; | &lt;op declare match&gt;
&lt;op declare match&gt; ~ '~'
quantifier ::= '*' | '+'
</pre></blockquote>
    <p>
      A quantified rule contains a left hand side (LHS) symbol name,
      one of the two declaration operators,
      a
      <tt>single symbol</tt>,
      a star or plus "quantifier",
      and an adverb list.
      The adverb list can be empty, as it was in our example.
      (In fact the adverb list has been empty in every rule so far.)
    </p>
    <p>
      Next come the two rule types that we've yet to see:
    </p>
    <blockquote>
      <pre>
&lt;discard rule&gt;
  ::= (':discard' &lt;op declare match&gt;) &lt;single symbol&gt;
&lt;empty rule&gt; ::= lhs &lt;op declare&gt; &lt;adverb list&gt;
</pre></blockquote>
    <p>
      We'll explain what a "discard rule" is when we encounter one.
      An empty rule indicates that its LHS symbol is nullable.
      We won't encounter an empty rule in this grammar.
    </p>
    <blockquote>
      <pre>
&lt;priority rule&gt; ::= lhs &lt;op declare&gt; priorities
priorities ::= alternatives+
    separator =&gt; &lt;op loosen&gt; proper =&gt; 1
&lt;op loosen&gt; ~ '||'
alternatives ::= alternative+
    separator =&gt; &lt;op equal priority&gt; proper =&gt; 1
alternative ::= rhs &lt;adverb list&gt;
&lt;op equal priority&gt; ~ '|'
</pre></blockquote>
    <p>
      Most rules, including most of the rules we've already seen,
      are priority rules.
      Priority rules are
      so-called because in their most complicated form they can express
      a precedence scheme.
      The typical rule in a grammar is a priority rule with only
      one priority -- a "simple" priority rule.
      All priority rules in this grammar will be of the simple kind.
    </p>
    <p>
      Within priorities, there can be alternatives,
      and we have seen examples of these.
      The self-defining rule that defined a
      <tt>rule</tt>
      stated that it was one of a set of 5 possible
      rule types (priority rule being among those types).
      In that self-defining rule,
      the different types of rule were alternatives within a single
      priority.
    </p><p>And, as long as we are on the subject of self-defining rules,
      there are three of them in this grammar, of which we have previously
      identified two.
      The definition of
      <tt>&lt;priority rule&gt;</tt>
      is the third and last self-defining rule --
      it is itself a priority rule.
    </p>
    <h3>Symbols</h3>
    <p>We've used
      <tt>symbol</tt>,
      <tt>&lt;symbol name&gt;</tt>,
      and
      <tt>&lt;single symbol&gt;</tt>
      a few times.
      It's time to see how they are defined:
    </p>
    <blockquote>
      <pre>
&lt;single symbol&gt; ::=
    symbol
  | &lt;character class&gt;
symbol ::= &lt;symbol name&gt;
&lt;symbol name&gt; ::= &lt;bare name&gt;
&lt;symbol name&gt; ::= &lt;bracketed name&gt;
&lt;bare name&gt; ~ [\w]+
&lt;bracketed name&gt; ~ '&lt;' &lt;bracketed name string&gt; '&gt;'
&lt;bracketed name string&gt; ~ [\s\w]+
</pre></blockquote>
    <p>At this point,
      <tt>symbol</tt>
      and
      <tt>&lt;symbol name&gt;</tt>
      are
      essentially the same thing,
      although someday there may be a way to specify symbols
      other than by name.
      <tt>&lt;single symbol&gt;</tt>
      means any expression guaranteed
      to produce a single symbol.
      <tt>symbol</tt>
      is obviously one;
      a character class is the other.
    </p>
    <p>The rules above contain our first mention
      of character classes and,
      by coincidence,
      our first use of character classes.
      Character classes are enclosed in square brackets,
      and look exactly like Perl character classes.
      In fact, they are implemented as Perl character classes, memoized for
      efficiency.
    </p>
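    <p>
      The idea of memoizing a character class can be sketched in a few lines of plain Perl.
      This is an illustration of the technique only, not Marpa's actual implementation;
      the hash <tt>%regex_by_class</tt> and the sub <tt>class_regex()</tt>
      are hypothetical names.
    </p>
    <blockquote>
      <pre>
use 5.010;
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical cache: one compiled regex per character class string.
my %regex_by_class;

sub class_regex {
    my ($class) = @_;    # for example '[\w]' or '[01]'
    # Compile the class on first use, and reuse the compiled regex afterwards.
    return $regex_by_class{$class} //= qr/\A$class\z/;
}

my $first  = class_regex('[\w]');
my $second = class_regex('[\w]');    # served from the cache

print 'a' =~ $first ? "match\n" : "no match\n";
print refaddr($first) == refaddr($second)
    ? "same compiled regex\n"
    : "recompiled\n";
</pre></blockquote>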
    <p>Now that we know what a symbol can be,
      let's look at how right hand sides are built up:
    </p>
    <blockquote>
      <pre>
rhs ::= &lt;rhs primary&gt;+
&lt;rhs primary&gt; ::= &lt;single symbol&gt;
&lt;rhs primary&gt; ::= &lt;single quoted string&gt;
&lt;rhs primary&gt; ::= &lt;parenthesized rhs primary list&gt;
&lt;parenthesized rhs primary list&gt;
  ::= ('(') &lt;rhs primary list&gt; (')')
&lt;rhs primary list&gt; ::= &lt;rhs primary&gt;+
</pre></blockquote>
    <p>A right hand side (RHS) is a sequence of one or more RHS "primaries".
      A RHS primary can be a single symbol, a string in single quotes,
      or a sublist of one or more RHS primaries in parentheses.
    </p><h3>Adverbs</h3>
    <blockquote>
      <pre>
&lt;adverb list&gt; ::= &lt;adverb item&gt;*
&lt;adverb item&gt; ::=
      action
    | &lt;left association&gt;
    | &lt;right association&gt;
    | &lt;group association&gt;
    | &lt;separator specification&gt;
    | &lt;proper specification&gt;
</pre></blockquote>
    <p>Adverb lists are lists of zero or more adverbs, which can be of one of six kinds.
      Of these six, four do not occur in this grammar:
    </p><blockquote>
      <pre>
&lt;left association&gt; ::= ('assoc' '=&gt;' 'left')
&lt;right association&gt; ::= ('assoc' '=&gt;' 'right')
&lt;group association&gt; ::= ('assoc' '=&gt;' 'group')
action ::= ('action' '=&gt;') &lt;action name&gt;
&lt;action name&gt; ::= &lt;bare name&gt;
</pre></blockquote>
    <p>
      Three of the unused
      adverbs have to do with the associativity (right/left/group)
      of priorities.
      Since all our "prioritized" rules are trivial (have only one priority),
      this grammar does not use them.
      (Their use is described in
      <a href="https://metacpan.org/module/Marpa::R2::BNF">
        the documentation of Marpa's BNF interface</a>.)
      We also will not see
      <tt>action</tt>
      adverbs in this grammar,
      for reasons explained below.
    </p>
    <p>
      Here are the two adverbs that we do see:
    </p><blockquote>
      <pre>
&lt;separator specification&gt;
  ::= ('separator' '=&gt;') &lt;single symbol&gt;
&lt;proper specification&gt; ::= ('proper' '=&gt;') boolean
boolean ~ [01]
</pre></blockquote>
    <p>These adverbs are used for quantified rules.
      One specifies a "separator" that can go between items of the series.
      The other specifies whether separation is "proper" or not.
      (When
      <tt>proper</tt>
      is 0, a separator is allowed after the last item of a series.
      In that case,
      the separator does not always separate two items,
      and in that sense the separator is not "proper".)
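    </p>
    <p>As a rough illustration of what <tt>proper</tt> changes, here is a sketch in plain Perl rather than in Marpa.
      It is not how Marpa implements quantified rules:
      the helper name <tt>is_series()</tt>, the choice of digits as items,
      and the choice of a comma as separator are all assumptions made for the example.
    </p>

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa: is $input a series of
# digit items separated by commas?
sub is_series {
    my ( $input, $proper ) = @_;

    # With proper separation no separator may follow the last item;
    # otherwise a single trailing separator is tolerated.
    my $trailer = $proper ? q{} : q{,?};
    return $input =~ /\A\d+(?:,\d+)*$trailer\z/ ? 1 : 0;
}
```

    <p>Under these assumptions,
      <tt>is_series('1,2,3,', 0)</tt> returns 1 because the trailing comma is tolerated,
      while <tt>is_series('1,2,3,', 1)</tt> returns 0.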
    </p><h3>Discarded tokens</h3>
    <blockquote>
      <pre>
:discard ~ whitespace
whitespace ~ [\s]+
</pre></blockquote>
    <p>The two rules say that sequences of whitespace are recognized
      as tokens, then discarded.
      Perl-style comments are handled in the same way:
    </p><blockquote>
      <pre>
# allow comments
:discard ~ &lt;hash comment&gt;
&lt;hash comment&gt; ~ &lt;terminated hash comment&gt;
  | &lt;unterminated final hash comment&gt;
&lt;terminated hash comment&gt;
  ~ '#' &lt;hash comment body&gt; &lt;vertical space char&gt;
&lt;unterminated final hash comment&gt;
  ~ '#' &lt;hash comment body&gt;
&lt;hash comment body&gt; ~ &lt;hash comment char&gt;*
&lt;vertical space char&gt; ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
&lt;hash comment char&gt; ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
</pre></blockquote>
    <p>"Unterminated final hash comments" deal with the special
      case of hash comments at the end of a file,
      when that file is not properly
      terminated with a newline.
      The
      <tt>&lt;unterminated final hash comment&gt;</tt>
      symbol is an example of how
      a long angle bracketed symbol name can make things clearer.
      Without the long name, it might not be evident what that rule and symbol
      were trying to accomplish.
    </p>
    <p>
      Long symbol names
      have a cleaner look than comments,
      but they are not a panacea.
      They have
      the special advantage that the description goes wherever the symbol
      name goes.
      But when the description is too long,
      that advantage becomes a disadvantage.
    </p><h3>Strings</h3>
    <p>In the next snippet, defining single-quoted strings,
      the description is clearly too long for the symbol name,
      so much of it does go into a comment.
    </p><blockquote>
      <pre>
# In single quotes strings and character classes
# no escaping or internal newlines, and disallow empty string
&lt;single quoted string&gt;
  ~ ['] &lt;string without single quote or vertical space&gt; [']
&lt;string without single quote or vertical space&gt;
  ~ [^'\x{0A}\x{0B}\x{0C}\x{0D}\x{0085}\x{2028}\x{2029}]+
</pre></blockquote>
    <p>Note that two Unicode vertical whitespace codepoints,
      U+2028 and U+2029, are included.
      The implementation only supports 7-bit ASCII,
      so for the moment these accomplish nothing.
      But when Unicode support is added, the grammar won't need to be changed.
    </p><h3>Character classes</h3>
    <p>Finally, there are character classes:
    </p><blockquote>
      <pre>
&lt;character class&gt; ~ '[' &lt;cc string&gt; ']'
&lt;cc string&gt; ~ &lt;cc character&gt;+
&lt;cc character&gt; ~ &lt;escaped cc character&gt;
  | &lt;safe cc character&gt;
&lt;escaped cc character&gt; ~ '\' &lt;horizontal character&gt;

# hex 5d is right square bracket
&lt;safe cc character&gt;
  ~ [^\x{5d}\x{0A}\x{0B}\x{0C}\x{0D}\x{0085}\x{2028}\x{2029}]

# a horizontal character is any character that is not vertical space
&lt;horizontal character&gt; ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
</pre></blockquote>
    <p>These are Perl character classes,
      and are passed unaltered to Perl for interpretation.
      Marpa needs to recognize that they start and end with square brackets.
      And it also must recognize enough of their internals
      to deal with escaped square brackets,
      which makes
      them the most
      complicated lexeme of the grammar.
    </p>
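    <p>To make the shape of this lexeme concrete, here is an approximation written as an ordinary Perl regex.
      This is only a sketch of the rules above, not Marpa's implementation:
      the names <tt>$CC_LEXEME</tt> and <tt>find_character_class()</tt> are invented for the example.
    </p>

```perl
use strict;
use warnings;

# Hypothetical stand-in for the <character class> lexeme.
# First alternative: a backslash followed by any horizontal character.
# Second alternative: any "safe" character, that is, anything except
# a right square bracket or vertical space.
my $CC_LEXEME = qr{
    \[
    (?:
        \\ [^\x{0A}\x{0B}\x{0C}\x{0D}\x{2028}\x{2029}]
      | [^\x{5d}\x{0A}\x{0B}\x{0C}\x{0D}\x{0085}\x{2028}\x{2029}]
    )+
    \]
}x;

# Return the character class at the start of $input, or undef if none.
sub find_character_class {
    my ($input) = @_;
    return $input =~ /\A($CC_LEXEME)/ ? $1 : undef;
}
```

    <p>With this sketch,
      <tt>find_character_class('[^\]x] rest')</tt> returns <tt>[^\]x]</tt>,
      because the escaped right bracket does not end the class.
      An empty <tt>[]</tt> and a class containing a newline both return <tt>undef</tt>.
    </p>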
    <h3>Longest tokens matching</h3>
    <p>When there is a choice of lexicals, Marpa follows a longest tokens matching strategy.
      The effect is usually that it does what you mean.
      (In part this is because longest token match
      is the usual default for regular expressions and
      lexical analyzers, so that programmers are trained by their tools
      to really mean a longest match
      whenever they specify a match.)
    </p><p>Most of what longest-token-match does is obvious.
      For example,
      in the rule "<tt>weeknights ~ 'Mon' | 'Tue' | 'Wed' | 'Thu'</tt>",
      it recognizes that
      <tt>weeknights</tt>
      is one symbol and not two symbols like
      "<tt>wee</tt>" and "<tt>knights</tt>".
      Longest tokens match
      means that if you have a grammar where you specify both
      <tt>++</tt>
      and
      <tt>+</tt>
      as operators, Marpa will always prefer
      the longer operator:
      <tt>++</tt>.
    </p><p>Because this is Marpa, there is a slight difference from the traditional longest
      <b>token</b>
      matching.
      Note that in Marpa's matching strategy, "tokens" is plural.
      If more than one possibility has the same length, Marpa will try them all.
      This plays a role in our meta-grammar.
      For example,
      <tt>separator</tt>
      is a keyword.
      But it is also a valid symbol name.
      Marpa allows it to be both, and figures out which is meant at the
      structural level, based on context.
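    </p>
    <p>The idea can be sketched in a few lines of plain Perl.
      This is an illustration under stated assumptions, not Marpa's lexer:
      the lexeme table, its names and the function <tt>longest_tokens()</tt> are all hypothetical.
    </p>

```perl
use strict;
use warnings;

# Hypothetical lexeme table; names and patterns are invented for this sketch.
my @LEXEMES = (
    [ 'op_plus'      => qr/\+/ ],
    [ 'op_plusplus'  => qr/\+\+/ ],
    [ 'kw_separator' => qr/separator/ ],
    [ 'symbol_name'  => qr/\w+/ ],
);

# Return the longest match length at $pos, followed by the names of
# every lexeme that ties for that length.
sub longest_tokens {
    my ( $input, $pos ) = @_;
    my $best_length = 0;
    my @best;
    for my $lexeme (@LEXEMES) {
        my ( $name, $pattern ) = @{$lexeme};
        next if substr( $input, $pos ) !~ /\A($pattern)/;
        my $length = length $1;
        next if $length < $best_length;
        @best = () if $length > $best_length;    # a longer match wins outright
        $best_length = $length;
        push @best, $name;                       # equal lengths are all kept
    }
    return ( $best_length, @best );
}
```

    <p>In this sketch, <tt>longest_tokens('++x', 0)</tt> returns
      <tt>(2, 'op_plusplus')</tt>, while
      <tt>longest_tokens('separator =&gt; x', 0)</tt> returns
      <tt>(9, 'kw_separator', 'symbol_name')</tt>:
      both candidates survive, and choosing between them is left to the parser.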
    </p><h3>Semantics</h3>
    <p>
      In practice, a grammar is usually tied tightly to a single semantics.
      This is an exception.
      The Scanless interface's meta-grammar is also the grammar for Marpa's
      BNF interface, and the BNF interface has a different semantics.
    </p>
    <p>
      For most grammars in Marpa's BNF or Scanless interface,
      the semantics would be specified using
      <tt>action</tt>
      adverbs.
      Examples of the normal method of specifying
      semantics are in the documentation for
      <a href="https://metacpan.org/module/Marpa::R2::BNF">
        the BNF interface</a>
      and for
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.035_003/pod/Scanless.pod">
        the Scanless interface</a>.
    </p>
    <p>
      In this special-case grammar,
      there are no
      <tt>action</tt>
      adverbs.
      Marpa waits until it knows which interface the grammar will be used for.
      At that point the internals use the symbol names to map rules to
      actions on a "just in time"
      basis.
    </p>
    <h3>Comments</h3>
    <p>
      Comments on this post can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>
]]>
        
    </content>
</entry>

<entry>
    <title>Smart whitespace and the Ruby Slippers</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2012/12/smart-whitespace-and-the-ruby-slippers.html" />
    <id>tag:blogs.perl.org,2012:/users/jeffrey_kegler//63.4096</id>

    <published>2012-12-03T03:41:44Z</published>
    <updated>2012-12-03T03:51:02Z</updated>

    <summary>Scannerless parsing [ This is cross-posted from the new home of the Ocean of Awareness blog. ] I&apos;ve been working on a &quot;scannerless&quot; Marpa interface. &quot;Scannerless&quot; means that the user does not need to write a separate lexer -- the lexer (scanner) is included in the parser. One of my working examples is the synopsis from the main Marpa::R2 POD page, rewritten to do its own lexing: :start ::= Expression Expression ::= Number || Expression &apos;*&apos; Expression action =&gt; do_multiply || Expression &apos;+&apos; Expression action =&gt; do_add Number ~ digits &apos;.&apos; digits action =&gt; do_literal Number ~ digits action =&gt; do_literal digits ~ [\d]+ Here the notation is that of my last post, as documented here. New for the scannerless parser are the :start pseudo-symbol, which indicates the start rule; rules with a tilde (&quot;~&quot;) to separate LHS from RHS: these indicate rules whose whitespace is to be left as-is single-quoted strings, to tell Marpa which character to look for; and square-bracketed character classes, to tell Marpa to look for a class of characters. Their interpretation is done by Perl, and therefore the allowed classes are exactly those accepted by your version of Perl. Valid strings in this language are...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<h3>Scannerless parsing</h3>
<p>[ This is cross-posted from the
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/12/whitespace.html">new home of the Ocean of Awareness blog</a>. ]
</p>
    <p>I've been working
      on a "scannerless" Marpa interface.
      "Scannerless" means that the user does not need to write
      a separate lexer --
      the lexer (scanner) is included in the parser.
      One of my working examples is
      the synopsis from
      <a href="https://metacpan.org/module/Marpa::R2">
        the main Marpa::R2 POD page</a>,
      rewritten to do its own lexing:
    </p>
    <blockquote>
      <pre>
    <tt>
:start ::= Expression
Expression ::=
       Number
    || Expression '*' Expression action => do_multiply
    || Expression '+' Expression action => do_add
Number ~ digits '.' digits action => do_literal
Number ~ digits action => do_literal
digits ~ [\d]+
    </tt>
      </pre>
    </blockquote>
    <p>Here the notation is that of
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/11/iterative.html">
        my last post</a>,
      as
      <a href="https://metacpan.org/module/Marpa::R2::BNF">
        documented here</a>.
      New for the scannerless parser are
    </p><ul>
      <li>
        the
        <tt>:start</tt>
        pseudo-symbol, which indicates the start rule;
      </li><li>
        rules with a tilde ("<tt>~</tt>") to separate
        LHS from RHS: these indicate rules whose
        whitespace is to be left as-is;
      </li><li>single-quoted strings, to tell Marpa which
        character to look for; and
      </li><li>square-bracketed character classes, to
        tell Marpa to look for a class of characters.
        Their interpretation is done by Perl,
        and therefore the allowed classes are exactly those
        accepted by your version of Perl.
      </li></ul>
    <p>
      Valid strings in this language are "<tt>15329 + 42 * 290 * 711</tt>",
      "<tt>42*3+7</tt>",
      "<tt>3*3+4* 4</tt>",
      along with all their whitespace variants.
    </p>
    <p>My recent posts have been tutorial.
      My work on scannerless parsing is
      not quite ready for a tutorial presentation,
      so this post will be conceptual.
      It is about an interesting issue that arises in
      scannerless parsing,
      one which Perl 6 also had to solve,
      and which Marpa solves in a new and different way.
      That issue is whitespace.
    </p>
    <h3>Dealing with whitespace</h3>
    <p>For the statements with a declaration operator of
      <tt>::=</tt>,
      whitespace is handled automatically by Marpa.
      Valid strings in the above language are
      "<tt>42*3+7</tt>",
      "<tt>42 * 3 + 7</tt>" and
      "<tt>42 * 3+7</tt>",
      all of which yield 133 as the answer.
      The trick is to, on one hand, allow whitespace to be optional
      and, on the other hand, recognize that strings like "<tt>42</tt>"
      must be a single number.
      That is, the parser should not recognize optional whitespace
      between the two digits and decide that
      "<tt>42</tt>"
      is actually two numbers:
      "<tt>4</tt>" and
      "<tt>2</tt>".
    </p>
    <p>
      The Perl 6 project has already taken on scannerless parsing.
      My methods for dealing with whitespace are based on theirs.
      Central to
      their solution is "smart whitespace".
      ("Smart whitespace" is my term --
      the
      <a href="http://perlcabal.org/syn/S05.html">
        Perl 6 doc</a>
      is more matter-of-fact.)
      Smart whitespace is whitespace which is optional, except between
      word characters.
      Stated another way, smart whitespace is either explicit whitespace,
      or a word boundary.
      In the case of "<tt>42</tt>",
      "<tt>4</tt>" and
      "<tt>2</tt>" are both word characters, so there is no
      word boundary between them, and therefore no smart whitespace.
    </p>
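    <p>The definition translates almost word for word into a Perl regex.
      The following is a sketch for illustration, not code from Marpa or Perl 6;
      the names <tt>$SMART_WS</tt> and <tt>two_numbers()</tt> are invented here.
    </p>

```perl
use strict;
use warnings;

# Smart whitespace: explicit whitespace, or else a word boundary.
my $SMART_WS = qr/(?:\s+|\b)/;

# Hypothetical helper: split $input into two numbers separated by
# smart whitespace.  Returns an empty list if that is not possible.
sub two_numbers {
    my ($input) = @_;
    return $input =~ /\A(\d+)${SMART_WS}(\d+)\z/ ? ( $1, $2 ) : ();
}
```

    <p>Here <tt>two_numbers('4 2')</tt> returns <tt>('4', '2')</tt>,
      but <tt>two_numbers('42')</tt> returns the empty list:
      there is no word boundary between the two digits,
      so the regex cannot split them.
    </p>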
    <h3>Implementing smart whitespace</h3>
    <p>
      Left parsers (like that which Perl 6 uses)
      often know very little about the context of the parse.
      But left parsers do know the current "character transition" --
      what the previous character was,
      and what the current character is.
      In a left parser, finding word boundaries for the
      purpose of detecting smart whitespace fits in
      nicely with the way it works in general.
    </p>
    <p>Marpa, of course,
      also knows the previous and current characters.
      It is certainly possible for
      Marpa to check every transition for a word boundary.
      But in Marpa's case, this check would
      be an additional overhead, handling just one special case.
      It'd be nice if we could look for word boundaries in a cool Marpa-ish way,
      preferably one with efficiency advantages.
    </p>
    <h3>Out come the Ruby Slippers</h3>
    <p>"Ruby Slippers" parsing, as a reminder, is new with Marpa,
      despite seeming a very obvious concept.
      It amounts to adjusting the input to the parser based on what
      the parser wants.
      This can be seen as assuring the parser that whatever it wishes
      for will happen, the same power that was conferred on Dorothy
      in
      <em>Wizard of Oz</em>
      by a happy choice of footwear.
    </p>
    <p>
      To make the Ruby Slippers work in this case,
      we make a word boundary a special kind of virtual token,
      and we define smart whitespace to be one of two things:
    </p><ul>
      <li>
        A sequence of one or more characters of
        real, physical whitespace.
      </li><li>
        A virtual word-boundary token.
      </li></ul><p>
      We then proceed normally with the parse,
      until there's a problem.
      When the parser reports a problem,
      we ask it if it is looking for one
      of the virtual word boundary tokens.
      If so, we give it one and continue.
      Why does life have to be difficult?
    </p>
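    <p>Here is a toy version of that loop in plain Perl.
      It is a sketch only and does not use Marpa:
      the "parser" is a hypothetical four-state cycle
      (<tt>number</tt>, <tt>ws</tt>, <tt>op</tt>, <tt>ws</tt>),
      and the function name <tt>parse_with_ruby_slippers()</tt> is invented for the example.
      Because the toy lexer reads digits greedily,
      a virtual token is only ever granted between a number and an operator,
      which is always a word boundary.
    </p>

```perl
use strict;
use warnings;

# What the toy parser wants next; it cycles through this list.
my @EXPECTED = qw(number ws op ws);

# Return an array ref of accepted token types, or undef on failure.
sub parse_with_ruby_slippers {
    my ($input) = @_;
    my $state = 0;
    my @accepted;
    pos($input) = 0;
    while ( pos($input) < length $input ) {
        my $type;
        if    ( $input =~ /\G\d+/gc )  { $type = 'number' }
        elsif ( $input =~ /\G[*+]/gc ) { $type = 'op' }
        elsif ( $input =~ /\G\s+/gc )  { $type = 'ws' }
        else                           { return }    # unlexable character

        my $wanted = $EXPECTED[ $state % @EXPECTED ];
        if ( $type ne $wanted ) {

            # The parser has a problem.  Is it wishing for whitespace?
            return if $wanted ne 'ws';
            push @accepted, 'virtual-ws';    # grant the wish and continue
            $state++;
            return if $type ne $EXPECTED[ $state % @EXPECTED ];
        }
        push @accepted, $type;
        $state++;
    }

    # The input may only end right after a number.
    return $state % @EXPECTED == 1 ? \@accepted : undef;
}
```

    <p>For the input <tt>42*3</tt> the accepted sequence is
      <tt>number virtual-ws op virtual-ws number</tt>;
      for <tt>42 * 3</tt> it is <tt>number ws op ws number</tt>,
      with no virtual tokens needed.
      The input <tt>42 3</tt> fails, because at that point the toy parser wants an operator,
      not whitespace.
    </p>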
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>
]]>
        
    </content>
</entry>

<entry>
    <title>Marpa::R2 is now in full release</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2012/11/marpar2-is-now-in-full-release.html" />
    <id>tag:blogs.perl.org,2012:/users/jeffrey_kegler//63.4067</id>

    <published>2012-11-26T04:41:34Z</published>
    <updated>2012-12-30T01:27:46Z</updated>

    <summary>[ This is cross-posted from the new home of the Ocean of Awareness blog. ] Announcing Marpa::R2 Marpa::R2 is now in full, official release. For those new to this blog, Marpa::R2 is an efficient, practical general BNF parser, targeted at applications too complex for regular expressions. Marpa::R2 is based on the Marpa parsing algorithm. New, but squarely based on the published literature, the Marpa algorithm parses every class of grammar in practical use today in linear time. Marpa::R2 is the successor to Marpa::XS and installs and runs on Windows. has better error reporting. is faster. has a cleaner, simpler interface. Marpa::XS remains available and, since changes to it are now on a &quot;bug fix only&quot; basis, should be quite stable. While Marpa::R2&apos;s interface will have a familiar look to users of Marpa::XS, it is not fully compatible: changes are documented here. Those who have been following this blog may have noticed that a new BNF interface has been added to Marpa::R2. This is growing -- I am currently adding scannerless parsing to it, which means that applications will be able to run Marpa::R2 without a lexer. Because the BNF interface is new and still under very active development, it is...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[[ This is <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/11/announce_r2.html">cross-posted from the new home of the Ocean of Awareness blog</a>. ]
  <h3>Announcing Marpa::R2</h3>
    <p>
      <a href="https://metacpan.org/release/Marpa-R2">
        Marpa::R2</a>
      is now in full, official release.
      For those new to this blog, Marpa::R2 is an efficient, practical general
      BNF parser, targeted at applications too complex for
      regular expressions.
      Marpa::R2 is based on
      <a href="http://jeffreykegler.github.com/Marpa-web-site/">
        the Marpa parsing algorithm</a>.
      New, but squarely based on the published literature,
      the Marpa algorithm
      parses every class of grammar in practical use today
      in linear time.
    </p>
    <p>Marpa::R2 is the successor to Marpa::XS and
    </p>
    <ul>
      <li>
        <p>installs and runs on Windows.
        </p>
      </li>
      <li>
        <p>has better error reporting.
        </p>
      </li>
      <li>
        <p>is faster.
        </p>
      </li>
      <li>
        <p>
          has a cleaner, simpler interface.
        </p>
      </li>
    </ul>
    <p>
    <a href="https://metacpan.org/module/Marpa::XS">Marpa::XS</a>
    remains available and,
      since changes to it are now on a "bug fix only" basis,
      should be quite stable.
      While Marpa::R2's interface will have a familiar look
      to users of Marpa::XS, it is not fully compatible:
      <a href="https://metacpan.org/module/Marpa::R2::Changes">
      changes are documented here</a>.
    </p>
    <p>
      Those who have been following this blog may have noticed
      that
      <a href="https://metacpan.org/module/Marpa::R2::BNF">
      a new BNF interface</a>
      has been added to Marpa::R2.
      This is growing --
      I am currently adding scannerless parsing to it,
      which means that applications will be able to run
      Marpa::R2 without a lexer.
      Because the BNF interface is new
      and still under very active development,
      it is being kept in beta status for the time being.
    </p>
    <h3>Comments</h3>
    <p>
      The Windows port of Marpa was the work of Jean-Damien Durand,
      who utilized Alberto Sim&otilde;es'
      <a href="http://search.cpan.org/dist/Config-AutoConf/">
        Config::AutoConf</a>.
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>]]>
        
    </content>
</entry>

<entry>
    <title>A Marpa tutorial: iterative parser development</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2012/11/a-marpa-tutorial-iterative-parser-development.html" />
    <id>tag:blogs.perl.org,2012:/users/jeffrey_kegler//63.4058</id>

    <published>2012-11-19T00:42:26Z</published>
    <updated>2012-11-19T00:47:24Z</updated>

    <summary>[ This is cross-posted from the new home of the Ocean of Awareness blog. ] Developing a parser iteratively This post describes a manageable way to write a complex parser, a little bit at a time, testing as you go. This tutorial will &quot;iterate&quot; a parser through one development step. As the first iteration step, we will use the example parser from the previous tutorial in this series, which parsed a Perl subset. You may recall that the topic of that previous tutorial was pattern search. Pattern search and iterative parser development are essentially the same thing, and the same approach can be used for both. Each development stage of our Perl parser will do a pattern search for the Perl subset it parses. We can use the accuracy of this pattern search to check our progress. The subset we are attempting to parse is our &quot;search target&quot;. When our &quot;searches&quot; succeed in finding all instances of the target, we have successfully written a parser for that subset, and can move on to the next step of the iteration. What we need to do This tutorial is the latest of a series, each of which describes one self-contained example of...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>[ This is cross-posted from the <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/11/iterative.html">new home of the Ocean of Awareness blog</a>. ]
</p>
<h3>Developing a parser iteratively</h3>
    <p>This post describes a manageable way
      to write a complex parser,
      a little bit at a time, testing as you go.
      This tutorial will "iterate" a parser
      through one development step.
      As the first iteration step,
      we will use the example parser from
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/11/pattern_search.html">
        the previous tutorial in this series</a>,
      which parsed a Perl subset.
    </p>
    <p>
      You may recall that the topic of that previous tutorial was pattern search.
      Pattern search and iterative parser development are
      essentially the same thing,
      and the same approach can be used for both.
      Each development stage of our Perl parser will do a pattern search
      for the Perl subset it parses.
      We can use the accuracy of this pattern search
      to check our progress.
      The subset we are attempting to parse is our "search target".
      When our "searches" succeed in finding all instances
      of the target,
      we have successfully written a parser for that subset,
      and can move on to the next step of the iteration.
    </p>
    <h3>What we need to do</h3>
    <p>
      This tutorial is the latest of
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/metapages/annotated.html#TUTORIAL">
        a series</a>,
      each of which describes one self-contained example of a Marpa-based parser.
      In this tutorial we use the example from
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/11/pattern_search.html">
        the previous tutorial</a>
      as the first iteration step
      in the iterative development of a Perl parser.
      For the iteration step in this example, we will add two features.
    </p><ul>
      <li><p>The previous iteration step was more of a recognizer than a parser.
          In particular, its grammar was too simplified to support a semantics,
          even for the Perl subset it recognized.
          We will fix that.
        </p></li><li>Having amplified the grammar, we will add a semantics,
        simple, but quite powerful enough to use in checking our progress
        in developing the parser.
      </li></ul>
    <h3>The grammar</h3>
    <p>
    Here is our grammar from the previous post:
    </p><blockquote>
      <pre>
    <tt>
start ::= prefix target
prefix ::= any_token*
target ::= expression
expression ::=
       number | scalar | scalar postfix_op
    || op_lparen expression op_rparen assoc =&gt; group
    || unop expression
    || expression binop expression
    </tt>
    </pre>
    </blockquote><p>
    <a href="https://metacpan.org/module/Marpa::R2::BNF">
      The format is documented here</a>.
      These eight lines were enough to describe arithmetic expressions sufficiently well
      for a recognizer, as well as to provide the "scaffolding" for the unanchored search.
      Nice compression, but now that we are talking about supporting a Perl semantics,
      we will need more.
    </p><p>Adding the appropriate grammar is a matter of turning to the
      <a href="http://perldoc.perl.org/perlop.html#Operator-Precedence-and-Associativity">
        appropriate section of the
        <tt>perlop</tt>
        man page</a>
      and copying it.
      I needed to change the format and name the operators,
      but the process was pretty much rote, as you can see:
    </p><blockquote>
      <pre>
    <tt>
my $perl_grammar = Marpa::R2::Grammar-&gt;new(
    {   start          =&gt; 'start',
        actions        =&gt; 'main',
        default_action =&gt; 'do_what_I_mean',
        rules          =&gt; [ &lt;&lt;'END_OF_RULES' ]
start ::= prefix target action =&gt; do_arg1
prefix ::= any_token* action =&gt; do_undef
target ::= expression action =&gt; do_target
expression ::=
     number
   | scalar
   | op_lparen expression op_rparen assoc =&gt; group
  || op_predecrement expression
   | op_preincrement expression
   | expression op_postincrement
   | expression op_postdecrement
  || expression op_starstar expression assoc =&gt; right
  || op_uminus expression
   | op_uplus expression
   | op_bang expression
   | op_tilde expression
  || expression op_star expression
   | expression op_slash expression
   | expression op_percent expression
   | expression kw_x expression
  || expression op_plus expression
   | expression op_minus expression
  || expression op_ltlt expression
   | expression op_gtgt expression
  || expression op_ampersand expression
  || expression op_vbar expression
   | expression op_caret expression
  || expression op_equal expression assoc =&gt; right
  || expression op_comma expression
END_OF_RULES
    }
);
    </tt>
    </pre>
    </blockquote>
    <h3>The lexer</h3>
    <p>
      The lexer is table-driven.
      I've used this same approach to lexing in every post
      in this tutorial series.
      Those interested in
      an explanation of how the lexer works can
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/dsl.html">
        find one in the first tutorial</a>.
      Having broken out the operators, I had to rewrite
      the lexing table,
      but that was even more rote than rewriting
      the grammar.
      I won't repeat the
      lexer table here --
      it can be found in
      <a href="https://gist.github.com/4093504">the Github gist</a>.
    </p>
    <h3>Adding the semantics</h3>
    <p>Our semantics will create a syntax tree.
      Here is that logic.
      (Note that the first argument to these semantic closures
      is a per-parse "object",
      which we don't use here.)
    </p><blockquote>
      <pre>
    <tt>
sub do_undef       { undef; }
sub do_arg1        { $_[2]; }
sub do_what_I_mean { shift; return $_[0] if scalar @_ == 1; return \@_ }

sub do_target {
    my $origin = ( Marpa::R2::Context::location() )[0];
    return if $origin != $ORIGIN;
    return $_[1];
} ## end sub do_target
    </tt>
    </pre>
    </blockquote>
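    <p>To see the kind of tree that <tt>do_what_I_mean()</tt> builds,
      it can be called by hand.
      This is a sketch under assumptions:
      it takes token values to be plain strings,
      uses an empty hash as a stand-in for the per-parse object,
      and adds a hypothetical <tt>show()</tt> helper for printing.
    </p>

```perl
use strict;
use warnings;

# Copied from the semantics above.
sub do_what_I_mean { shift; return $_[0] if scalar @_ == 1; return \@_ }

# Hypothetical helper: render a tree as a parenthesized string.
sub show {
    my ($node) = @_;
    return $node if not ref $node;
    return '(' . join( ' ', map { show($_) } @{$node} ) . ')';
}

# Stand-in for the per-parse object, which the closure discards.
my $per_parse = {};

# One child: its value is passed through unchanged.
my $leaf = do_what_I_mean( $per_parse, '1' );

# Several children: an array ref, so nested calls build a tree.
my $sum  = do_what_I_mean( $per_parse, $leaf, '+', '2' );
my $tree = do_what_I_mean( $per_parse, $sum,  '+', '3' );
```

    <p>With these inputs, <tt>show($tree)</tt> returns <tt>((1 + 2) + 3)</tt>:
      rules with a single child collapse to that child,
      and every other rule becomes one level of nesting.
    </p>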
    <p>
      There is some special logic in the
      <tt>do_target()</tt>
      method,
      involving the "origin", or starting location of the target.
      Perl arithmetic expressions,
      when they are the target of an unanchored search,
      are ambiguous.
      For example, in the string "<tt>abc 1 + 2 + 3 xyz</tt>",
      there are two targets ending at the same position:
      "<tt>2 + 3</tt>" and "<tt>1 + 2 + 3</tt>".
      We are interested only in the longest of these,
      whose start location is indicated by the
      <tt>$ORIGIN</tt>
      variable.
    </p><p>The next logic will be familiar from our
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/11/pattern_search.html">
        pattern search tutorial</a>.
      It repeatedly looks for non-overlapping occurrences of
      <tt>target</tt>,
      starting from the end and going back to the beginning of the input.
    </p><blockquote>
      <pre>
    <tt>
my $end_of_search;
my @results = ();
RESULTS: while (1) {
    my ( $origin, $end ) =
        $self-&gt;last_completed_range( 'target', $end_of_search );
    last RESULTS if not defined $origin;
    push @results, [ $origin, $end ];
    $end_of_search = $origin;
} ## end RESULTS: while (1)
    </tt>
    </pre>
    </blockquote>
    <p>This final code sample is the logic
      that unites pattern search with iterative
      parser development.
      It is a loop through
      <tt>@results</tt>
      that prints the original text
      and, depending on a flag,
      its syntax tree.
    </p>
    <p>
      Near the top of the loop,
      the "<tt>$recce-&gt;set( { end =&gt; $end } )</tt>"
      call sets the end of parse location to the current
      result.
      At the bottom of the loop,
      we call
      "<tt>$recce-&gt;reset_evaluation()</tt>".
      This is necessary to allow us to evaluate the
      input stream again, but with a new
      <tt>$end</tt>
      location.
    </p>
    <blockquote>
      <pre>
    <tt>
RESULT: for my $result ( reverse @results ) {
    my ( $origin, $end ) = @{$result};

    <big><b>... Print out the original text ...</b></big>

    $recce-&gt;set( { end =&gt; $end } );
    my $value;
    VALUE: while ( not defined $value ) {
        local $main::ORIGIN = $origin;
        my $value_ref = $recce-&gt;value();
        last VALUE if not defined $value_ref;
        $value = ${$value_ref};
    } ## end VALUE: while ( not defined $value )
    if ( not defined $value ) {
        say 'No parse'
            or die "say() failed: $ERRNO";
        next RESULT;
    }
    say Data::Dumper::Dumper($value)
        or die "say() failed: $ERRNO"
        if not $quiet_flag;
    $recce-&gt;reset_evaluation();
} ## end RESULT: for my $result ( reverse @results )
    </tt>
    </pre>
    </blockquote>
    <p>The
      <tt>VALUE</tt>
      sub-loop is
      where the
      <tt>$ORIGIN</tt>
      variable
      is set.
      In the semantics,
      <tt>do_target()</tt>
      checks this.
      In the case of an ambiguous parse,
      <tt>do_target()</tt>
      turns any target which does not
      cover the full span from
      <tt>$origin</tt>
      to
      <tt>$end</tt>
      into a Perl
      <tt>undef</tt>,
      which will
      eventually become
      the value of its parse.
      The logic in the
      <tt>VALUE</tt>
      loop
      ignores parses whose value is a Perl <tt>undef</tt>,
      so that only the longest target for each
      <tt>$end</tt>
      location is printed.
    </p>
    <h3>Code and comments</h3>
    <p>The example in this post is available as
      <a href="https://gist.github.com/4093504">a GitHub gist</a>.
      It was run with
      <a href="https://metacpan.org/release/JKEGL/Marpa-R2-2.024000/">
        Marpa::R2 2.024000</a>,
      as of this writing the latest full release.
      Its main test, which is included in the gist,
      used displays from the
      <a href="http://perldoc.perl.org/perlop.html">perlop man page</a>.
    </p>
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>
]]>
        
    </content>
</entry>

<entry>
    <title>A Marpa tutorial: pattern searches</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2012/11/a-marpa-tutorial-pattern-searches.html" />
    <id>tag:blogs.perl.org,2012:/users/jeffrey_kegler//63.4038</id>

    <published>2012-11-12T05:07:57Z</published>
    <updated>2012-11-12T05:12:54Z</updated>

    <summary><![CDATA[[ This is cross-posted from the new home of the Ocean of Awareness blog. ] Pattern searches We use regular expressions for pattern searching these days. But what if your search target is not a regular expression? In this post I will show how to use Marpa to search text files for arbitrary context-free expressions. This tutorial builds on earlier tutorials. It is possible to simply dive into it, but it may be easier to start with two of my earlier posts, here and here. The grammar I will use arithmetic expressions as the example of a search target. Even the arithmetic subset of Perl expressions is quite complex, but in this case we can get the job done with eight lines of grammar and a lexer driven by a table of just over a dozen lines. Here is the grammar: start ::= prefix target prefix ::= any_token* target ::= expression expression ::= number | scalar | scalar postfix_op || op_lparen expression op_rparen assoc =&gt; group || unop expression || expression binop expression This grammar uses Marpa::R2's BNF interface. It takes considerable advantage of the fact that we are not parsing these expressions, but recognizing them. Because of this, we...]]></summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[[ This is cross-posted from <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/11/pattern_search.html">the new home of the Ocean of Awareness blog</a>. ]
 <h3>Pattern searches</h3>
    <p>
      We use regular expressions for pattern searching these days.
      But what if your search target is not a regular expression?
      In this post I will show how to use Marpa to search text files for
      arbitrary context-free expressions.
    </p>
    <p>
      This tutorial builds on earlier tutorials.
      It is possible to simply dive into it,
      but it may be easier
      to start with two of my earlier posts,
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/dsl.html">here</a>
      and
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/error.html">here</a>.
    </p>
    <h3>The grammar</h3>
    <p>
      I will use arithmetic expressions as
      the example of a search target.
      Even the arithmetic subset of Perl expressions is quite complex,
      but in this case we can get the job done
      with eight lines of grammar and a lexer driven
      by a table of just over a dozen lines.
      Here is the grammar:
    </p>
    <blockquote>
      <pre>
    <tt>
start ::= prefix target
prefix ::= any_token*
target ::= expression
expression ::=
       number | scalar | scalar postfix_op
    || op_lparen expression op_rparen assoc =&gt; group
    || unop expression
    || expression binop expression
    </tt>
    </pre>
    </blockquote>
    <p>
      This grammar uses
      <a href="https://metacpan.org/module/Marpa::R2::BNF">
        Marpa::R2's BNF interface</a>.
      It takes considerable advantage of the fact that we are not
      <b>parsing</b>
      these expressions, but
      <b>recognizing</b>
      them.
      Because of this, we don't have to specify whether expressions left- or right-associate.
      We can also ignore what operators mean and group them according to syntax only
      -- binary, prefix unary and postfix unary.
      Similarly, we can ignore the precedence within these large groups.
      This leaves us with numbers, scalars,
      parentheses,
      and binary, prefix unary and postfix unary operators.
      (To keep this example simple, we restrict the primaries
      to numeric constants and Perl scalars.)
    </p>
    <p>
      What we are searching for is defined by the
      <tt>target</tt>
      symbol.
      For
      <tt>target</tt>
      you could substitute
      the start symbol of
      any context-free grammar,
      and the structure of this example will still work.
      To turn a parser for
      <tt>target</tt>
      into a pattern searcher, we add a new start
      symbol (unimaginatively named "<tt>start</tt>")
      and two rules that
      allow the target to have a
      <tt>prefix</tt>.
    </p>
    <h3>Ambiguous parsing</h3>
    <p>To do an anchorless pattern search,
      this example will use ambiguous parsing.
      This grammar always has at least one parse going,
      representing the prefix for
      the zero or more targets
      that our parser
      expects to find in the future.
      The prefix will never end, because
      any token (as indicated by a token
      named, literally,
      <tt>any_token</tt>)
      extends it.
    </p>
    <p>
      If we are in the process of recognizing a
      <tt>target</tt>,
      we will have one or more other parses going.
      I say "one or more" because the search method
      described in this post
      allows <tt>target</tt> to be ambiguous.
      But arithmetic expressions,
      the target pattern used in this example,
      are not ambiguous.
      So our example will have
      at most two parses active at any point:
      one for the prefix and another for the target.
    </p>
    <p>
      Ambiguous parsing has a serious potential downside --
      it is not necessarily linear
      and therefore not necessarily efficient.
      But Marpa can parse many classes of ambiguous grammar in linear time.
      Grammars like the one in this post --
      a prefix and an unambiguous search target --
      fall into one of the linearly parseable classes.
      Keeping the prefix going requires a tiny constant overhead per token.
    </p>
    <h3>The lexer table</h3>
    <p>
      The lexer is driven by a table of pairs: token name and regex.
    </p><blockquote>
      <pre>
<tt>
my @lexer_table = (
    [ number     =&gt; qr/(?:\d+(?:\.\d*)?|\.\d+)/xms ],
    [ scalar     =&gt; qr/ [\$] \w+ \b/xms ],
    [ postfix_op =&gt; qr/ [-][-] | [+][+] /xms ],
    [ unop       =&gt; qr/ [-][-] | [+][+] /xms ],
    [   binop =&gt; qr/
          [*][*] | [&gt;][&gt;] | [&lt;][&lt;]
        | [*] | [\/] | [%] | [x] \b
        | [+] | [-] | [&amp;] | [|] | [=] | [,]
    /xms
    ],
    [   unop =&gt; qr/ [-] | [+] | [!] | [~] /xms
    ],
    [ op_lparen =&gt; qr/[(]/xms ],
    [ op_rparen =&gt; qr/[)]/xms ],
);
</tt>
</pre>
    </blockquote>
    <p>
      Order is significant here.
      In particular,
      two-character operators are checked first.
      This guarantees that
      two consecutive minus signs
      will be seen as a
      decrement operator, and not as a double negation.
    </p>
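    <p>
      The effect of ordering can be seen in isolation
      with a small sketch.
      The <tt>first_match()</tt> helper and the two cut-down tables
      below are invented for this illustration and are not part of the gist.
      The sketch also leaves the parser out of the picture,
      so it shows only which table entry gets the first chance at the input.
    </p>
    <blockquote>
      <pre>
<tt>
use 5.010;
use strict;
use warnings;

# Hypothetical helper: return the name of the first table entry
# whose regex matches at the start of $input, plus the matched text.
sub first_match {
    my ( $input, @table ) = @_;
    for my $entry (@table) {
        my ( $name, $regex ) = @{$entry};
        return ( $name, $1 ) if $input =~ m/\A($regex)/xms;
    }
    return;
}

# Cut-down tables, assumed for illustration only
my @two_char_first = (
    [ postfix_op =&gt; qr/ [-][-] | [+][+] /xms ],
    [ binop      =&gt; qr/ [+] | [-] /xms ],
);
my @one_char_first = reverse @two_char_first;

say join q{ }, first_match( '--', @two_char_first );    # postfix_op --
say join q{ }, first_match( '--', @one_char_first );    # binop -
</tt>
</pre>
    </blockquote>
    <p>
      With the two-character entry first, both minus signs are consumed
      as one operator.
      With the order reversed, the lexer would claim a single minus sign
      before the two-character pattern was ever tried.
    </p>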
    <h3>Ambiguous lexing</h3>
    <p>The very careful reader may have noticed that
      <tt>any_token</tt>
      is not in the lexing table.
      The main loop is written so that every token is read as an
      <tt>any_token</tt>.
      If no token from the lexing table is accepted,
      the next character in the input stream
      is read as an
      <tt>any_token</tt>.
      If a token from the lexing table
      <b>is</b>
      accepted,
      then it gets read twice,
      once as an
      <tt>any_token</tt>,
      and once as the token type taken from the lexing table
      entry.
    </p>
    <p>Ambiguous lexing is a familiar technique to
      the Natural Language Processing community.
      English, in particular, is a language that abounds
      in lexemes that can play multiple roles.
      The word "sort", for example, can easily be
      a noun, a verb or an adjective.
    </p>
    <h3>The Ruby Slippers</h3>
    <p>The main loop will also be a simple case of the use
      of the Ruby Slippers.
      For those unfamiliar,
      the "Ruby Slippers" parsing technique handles difficult lexing
      and parsing problems by asking the parser, at the problem point,
      what it is looking for,
      and providing it.
      This seems a fairly obvious approach,
      but the Ruby Slippers are new with Marpa --
      traditional parsers could not easily
      determine where they were in a parse.
    </p>
    <p>
      One way to use the Ruby Slippers is to ask the parser in
      advance what it is looking for.
      The code that follows uses another method.
      Instead of determining in advance what tokens to read,
      it simply feeds tokens to the parser.
    </p>
    <p>
      Token rejection is a "soft" error -- it costs
      little to try, and little to retry.
      The following code can
      efficiently determine which entry in the lexing table is appropriate,
      simply by trying each of them in order.
      If the
      <tt>alternative()</tt>
      method returns a Perl
      <tt>undef</tt>,
      indicating that a token was rejected,
      then the main loop will try later entries in the lexing table.
    </p>
    <p>
      When a token is accepted,
      the main loop can safely assume that it is on the right track.
      Marpa is 100% accurate about
      which tokens can and cannot result in a successful parse.
    </p>
    <h3>The main loop</h3>
    <p>
      The main loop iterates through input looking for tokens.
      Whitespace is skipped.
      Comments are not skipped.
      Finding arithmetic expressions in
      strings and/or comments can be useful.
      We will assume that is the case here.
    </p>
    <blockquote>
      <pre>
<tt>
my $length = length $string;
pos $string = $positions[-1];
TOKEN: while ( pos $string &lt; $length ) {
    next TOKEN if $string =~ m/\G\s+/gcxms;    # skip whitespace
    my $position = pos $string;
    FIND_ALTERNATIVE: {
        TOKEN_TYPE: for my $t (@lexer_table) {
            my ( $token_name, $regex ) = @{$t};
            next TOKEN_TYPE if not $string =~ m/\G($regex)/gcxms;
            if ( not defined $recce-&gt;alternative($token_name) ) {
                pos $string = $position;       # reset position for matching
                next TOKEN_TYPE;
            }
            $recce-&gt;alternative('any_token');
            last FIND_ALTERNATIVE;
        } ## end TOKEN_TYPE: for my $t (@lexer_table)
        ## Nothing in the lexer table matched
        ## Just read the current character as an 'any_token'
        pos $string = $position + 1;
        $recce-&gt;alternative('any_token');
    } ## end FIND_ALTERNATIVE:
    $recce-&gt;earleme_complete();
    my $latest_earley_set_ID = $recce-&gt;latest_earley_set();
    $positions[$latest_earley_set_ID] = pos $string;
} ## end TOKEN: while ( pos $string &lt; $length )
</tt>
</pre>
    </blockquote>
    <p>
      The
      <tt>earleme_complete()</tt>
      method tells Marpa that all the alternatives
      at one location have been entered,
      and that the parse should now move on to the next location.
      (Marpa's idea of location is called an "earleme", in honor of the great
      parsing theorist, Jay Earley.)
    </p>
    <h3>How to parse without really trying</h3>
    <p>
    At this point, I want to draw the reader's attention to the code
    that deals with special cases for the minus sign.
    Specifically, to the fact that there is no such code.
    The more familiar you are with PPI and/or
      <tt>perly.y</tt>,
      the more remarkable this will seem.
      </p>
      <p>
      To take one example, PPI correctly realizes that the minus
      sign in
      "<tt>1+2-3</tt>" is a binary operator.
      However, PPI fails on "<tt>(1+2)-3</tt>" --
      it thinks the minus sign is part of the number "-3".
      Why don't the authors of PPI just look at the Perl
      interpreter and copy the logic there?
      Take a glance at <tt>perly.y</tt>
      and <tt>toke.c</tt> 
      and you will know the answer to that question.
      </p>
      <p>What is PPI's problem here?
      The problem is that,
      without knowing where you are in the expression,
      you cannot tell whether a minus sign is a unary
      operator or a binary operator.
      And the parse engines for PPI and for Perl itself,
      while quite different in many respects,
      share a property common to traditional parsers --
      in determining context
      they offer the lexer, respectively,
      little and no help.
      </p>
      <p>
      In the code in this example,
      Marpa's <tt>alternative()</tt> method is, by accepting
      and rejecting tokens, guiding the lexer to the right choice.
      Because of Perl's grammar, a minus sign at a given position
      cannot be both a unary operator and a binary operator.
      And Marpa is 100% accurate in its knowledge of which
      tokens are possible.
      So Marpa's
      <tt>alternative()</tt> method
      always knows whether a minus sign can be
      a unary or binary operator and accepts
      or rejects the token accordingly.
    </p>
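    <p>
      Here is a rough sketch of this guidance,
      under heavy simplifying assumptions.
      <tt>Mock_Recce</tt> is a hypothetical stand-in for a Marpa recognizer,
      not Marpa's API.
      It knows only whether an operand or an operator is expected next,
      and its <tt>alternative()</tt> method imitates the behavior described above
      by returning a Perl <tt>undef</tt> for a token it cannot use.
      The <tt>classify()</tt> helper and the three-entry table
      are likewise made up for the illustration.
    </p>
    <blockquote>
      <pre>
<tt>
use 5.010;
use strict;
use warnings;

# Mock_Recce is a hypothetical stand-in for a Marpa recognizer.
# It tracks only whether an operand or an operator is expected next.
package Mock_Recce;

sub new {
    my ($class) = @_;
    return bless { expecting =&gt; 'operand' }, $class;
}

# Imitate the rejection behavior: return undef for an unusable token.
sub alternative {
    my ( $self, $token_name ) = @_;
    my %ok =
        $self-&gt;{expecting} eq 'operand'
        ? ( number =&gt; 'operator', unop =&gt; 'operand' )
        : ( binop =&gt; 'operand' );
    my $next = $ok{$token_name};
    return if not defined $next;
    $self-&gt;{expecting} = $next;
    return 1;
}

package main;

my @lexer_table = (
    [ number =&gt; qr/\d+/xms ],
    [ binop  =&gt; qr/ [+] | [-] /xms ],
    [ unop   =&gt; qr/ [-] | [+] /xms ],
);

# Classify each token of $string by trying table entries in order.
sub classify {
    my ($string) = @_;
    my $recce = Mock_Recce-&gt;new();
    my @seen;
    pos $string = 0;
    TOKEN: while ( pos $string &lt; length $string ) {
        next TOKEN if $string =~ m/\G\s+/gcxms;    # skip whitespace
        my $position = pos $string;
        for my $t (@lexer_table) {
            my ( $token_name, $regex ) = @{$t};
            next if not $string =~ m/\G($regex)/gcxms;
            my $text = $1;
            if ( $recce-&gt;alternative($token_name) ) {
                push @seen, "$token_name($text)";
                next TOKEN;
            }
            pos $string = $position;    # rejected: rewind and retry
        }
        die 'No acceptable token at position ', $position;
    }
    return join q{ }, @seen;
}

say classify('1 - -3');    # number(1) binop(-) unop(-) number(3)
</tt>
</pre>
    </blockquote>
    <p>
      Nothing in <tt>classify()</tt> mentions the minus sign.
      The second minus sign is first offered as a <tt>binop</tt>,
      is rejected because an operand is expected at that point,
      and is then accepted as a <tt>unop</tt>.
    </p>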
    <p>
      This is the Ruby Slippers in action --
      a very simple solution to what for the Perl
      interpreter and PPI
      is a very complicated problem.
      When I developed the Ruby Slippers technique,
      my most serious problem 
      was convincing myself that something
      so simple could really work.
    </p>
    <h3>Finding the targets</h3>
    <p>
      Once the parse is complete, it remains to find
      and print the "targets" found
      by the search.
      In
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/error.html">
      a previous post</a>,
      I showed how, 
      given a symbol name,
      to find the last occurrence of the symbol in a Marpa parse.
      That routine needed to be modified to allow repeated searches,
      but the change was straightforward.
      The code is in the
      <a href="https://gist.github.com/4057239">
      gist</a>,
      and the ideas behind it were explained
      in
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/error.html">
      the previous post</a>,
      so I won't repeat them here.
    </p>
    <h3>Code and comments</h3>
    <p>The example in this post is available as
    <a href="https://gist.github.com/4057239">
      a GitHub gist</a>.
      It was run with
      <a href="https://metacpan.org/release/JKEGL/Marpa-R2-2.024000/">
      Marpa::R2 2.024000</a>,
      as of this writing the latest full release.
      My main test, which is included in the gist,
      used displays from the
      <a href="http://perldoc.perl.org/perlop.html">perlop man page</a>.
    </p>
    <p>
      Comments on this post
      can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>]]>
        
    </content>
</entry>

<entry>
    <title>A grammar that exemplifies, describes and parses itself</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2012/11/a-grammar-that-exemplifies-describes-and-parses-itself.html" />
    <id>tag:blogs.perl.org,2012:/users/jeffrey_kegler//63.4018</id>

    <published>2012-11-05T01:41:15Z</published>
    <updated>2012-11-05T01:45:59Z</updated>

    <summary>[ This is cross-posted from the Ocean of Awareness blog. ] I&apos;ve written a grammar in Marpa&apos;s new BNF interface, to parse Marpa&apos;s new BNF interface. In the 70&apos;s, when I learned parsing theory, this was a very fashionable thing to do, perhaps because yacc had done it, in Appendix B of the original 1975 paper. By 1979, Hofstadter&apos;s book Godel-Escher-Bach (GEB) was out, and the next year it took the Pulitzer for General Nonfiction. Self-description, recursion, self-reference, self-embedding, you (preferably autologically) name it, these things were all the rage. Reading code that is at once both self-example and self-description still holds a certain magic for me. Regular expressions cannot describe themselves. Recursive descent parsers are hand-written in another general-purpose language, so there can be no concise self-description. Ironically, yacc actually cannot parse its own description language. (&quot;Ironically&quot; is the word used in the paper.) Like almost all useful grammars, yacc&apos;s description language goes beyond the capabilities of yacc&apos;s LALR parser, and a lexer hack is needed to make the code in Appendix B work. Marpa is a general BNF parser and requires no special hacks to parse the following efficiently: rules ::= rule+ action =&gt; do_rules rule ::= empty_rule...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parser" label="parser" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>[ This is cross-posted from
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/self_parse.html">
the Ocean of Awareness blog</a>. ]
</p>
<p>I've written a grammar in Marpa's new BNF interface,
      to parse Marpa's new BNF interface.
      In the 70's, when I learned parsing theory,
      this was a very fashionable thing to do, perhaps because
      yacc had done it,
      in Appendix B of
      <a href="http://dinosaur.compilertools.net/yacc/">
        the original 1975 paper</a>.
      By 1979, Hofstadter's book Godel-Escher-Bach (GEB) was out,
      and the next year it took the Pulitzer for
      General Nonfiction.
      Self-description, recursion, self-reference, self-embedding,
      you
      (preferably
      <a href="http://en.wikipedia.org/wiki/Autological_word">autologically</a>)
      name it,
      these things were all the rage.
    </p>
    <p>Reading code
    that is at once both self-example and self-description
    still holds a certain magic for me.
      Regular expressions cannot describe themselves.
      Recursive descent parsers are hand-written
      in another general-purpose language,
      so there can be no concise self-description.
      Ironically, yacc actually cannot parse its own description language.
      ("Ironically" is the word used in the paper.)
      Like almost all useful grammars, yacc's description language
      goes beyond the capabilities of yacc's LALR parser,
      and a lexer hack is needed to make the code in Appendix B work.
    </p>
    <p>Marpa is a general BNF parser and requires no special hacks
    to parse the following efficiently:
    </p>
    <blockquote>
      <pre>
rules ::= rule+ action => do_rules
rule ::= empty_rule | priority_rule | quantified_rule
priority_rule ::= lhs op_declare priorities
  action => do_priority_rule
empty_rule ::= lhs op_declare adverb_list
  action => do_empty_rule
quantified_rule ::= lhs op_declare name quantifier adverb_list
    action => do_quantified_rule
priorities ::= alternatives+
    separator => op_tighter proper => 1
    action => do_discard_separators
alternatives ::= alternative+
    separator => op_eq_pri proper => 1
    action => do_discard_separators
alternative ::= rhs adverb_list action => do_alternative
adverb_list ::= adverb_item* action => do_adverb_list
adverb_item ::=
      action
    | left_association | right_association | group_association
    | separator_specification | proper_specification

action ::= kw_action op_arrow name action => do_action
left_association ::= kw_assoc op_arrow kw_left
  action => do_left_association
right_association ::= kw_assoc op_arrow kw_right
  action => do_right_association
group_association ::= kw_assoc op_arrow kw_group
  action => do_group_association
separator_specification ::= kw_separator op_arrow name
  action => do_separator_specification
proper_specification ::= kw_proper op_arrow boolean
  action => do_proper_specification

lhs ::= name action => do_lhs
rhs ::= names
quantifier ::= op_star | op_plus
names ::= name+ action => do_array
name ::= bare_name | reserved_word | quoted_name
name ::= bracketed_name action => do_bracketed_name

reserved_word ::= kw_action | kw_assoc | kw_separator | kw_proper
  | kw_left | kw_right | kw_group
</pre>
    </blockquote>
    <p>
    The conventions are standard or transparent.
    The "<tt>::=</tt>" symbol separates the left and right hand sides of rules.
    The "<tt>|</tt>" symbol separates alternative right hand sides.
    The "<tt>*</tt>" and
    "<tt>+</tt>" are quantifiers, similar to those in regular expressions,
    and indicate, respectively, zero or more repetitions and one or more repetitions
    of the preceding symbol.
    Adverbs take the form "<tt>keyword => value</tt>",
    and indicate semantics or the style of sequence separation.
    Full documentation can be found
    <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.023_010/pod/BNF.pod">
    here</a>.
    </p>
    <p>
      Self-parsing compiler compilers ruled the earth
      in the age of bellbottoms.
      Self-parsing has lasted better, but not by much.
      When some years ago I wrote a self-describing language as an interface to
      Marpa, it seemed to confuse people.
      They wondered what Marpa did --
      parsing your own description did not seem to be
      about <b>doing</b> anything.
      These days my examples feature a lot of calculators.
      ("Ironically", Hofstadter seems to have had the same problem with
      GEB -- he felt that
      people did not understand what his book was saying --
      even those who liked it.)
    </p>
    <p>
      But ideas from Larry Wall and Peter Stuifzand
      have re-ignited my interest in self-parsing.
      And this time the self-parsing parser was written
      with a specific purpose.
      I plan to enhance this language.
      I have found that the convenience of this interface
      more than compensates for the circular
      dependency issues.
      The BNF source in this post is
      <a href="https://metacpan.org/source/JKEGL/Marpa-R2-2.023_010/lib/Marpa/R2/meta/Stuifzand.bnf">
      the source</a>
      for its own parser,
      and I plan to use it
      to produce improved versions
      of itself.
    </p>
    <h3>Comments</h3>
    <p>
      Comments on this post can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>]]>
        
    </content>
</entry>

<entry>
    <title>A Marpa DSL tutorial: Error reporting made easy</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/jeffrey_kegler/2012/10/a-marpa-dsl-tutorial-error-reporting-made-easy.html" />
    <id>tag:blogs.perl.org,2012:/users/jeffrey_kegler//63.4005</id>

    <published>2012-10-30T17:22:44Z</published>
    <updated>2012-10-31T23:31:18Z</updated>

    <summary>[ This is cross-posted from the new home of the Ocean of Awareness blog. ] Using Marpa&apos;s facilities for error reporting, a quickly written domain-specific language can, as of its first draft, have error reporting whose helpfulness and precision exceed that of carefully hand-crafted production compilers. This post will show how, with an example. Two techniques will be used. First and most basic, Marpa&apos;s knowledge of the point at which the parse can no longer proceed is 100% accurate and immediate. This is not the case with yacc-derived parsers, and is not the case with most recursive descent parsers. However, even Marpa&apos;s 100% accuracy in pinpointing the problem location is only accuracy in the technical sense -- it cannot take into account what the programmer intended. A second technique allows the programmer to double-check his intentions against what the parser has actually seen. Marpa can tell the programmer exactly how it thinks the input parsed, up to the point at which it could no longer proceed. The Marpa parser can report the answer to questions like &quot;What was the last statement you successfully parsed?&quot; &quot;What was the last expression you successfully parsed?&quot; &quot;What was the last arithmetic expression you successfully...</summary>
    <author>
        <name>Jeffrey Kegler</name>
        <uri>http://www.jeffreykegler.com</uri>
    </author>
    
    <category term="dsl" label="DSL" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="marpa" label="Marpa" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="parsing" label="parsing" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/jeffrey_kegler/">
        <![CDATA[<p>[ This is cross-posted from
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/error.html">
the new home of the Ocean of Awareness blog</a>. ]</p>
  <p>Using
      Marpa's facilities for error reporting,
      a quickly written domain-specific language can,
      as of its first draft,
      have error reporting whose helpfulness and precision exceed
      that of carefully hand-crafted production compilers.
      This post will show how, with an example.
    </p><p>
      Two techniques will be used.
      First and most basic,
      Marpa's knowledge of the point
      at which the parse
      can no longer proceed is 100% accurate and immediate.
      This is not the case with yacc-derived parsers,
      and is not the case with most recursive descent parsers.
    </p>
    <p>
      However, even Marpa's 100% accuracy in pinpointing
      the problem location is only accuracy
      in the technical sense -- it cannot take into account what the
      programmer intended.
      A second technique allows the programmer to double-check his
      intentions against what the parser has actually seen.
      Marpa can tell the programmer exactly how it thinks
      the input parsed, up to the point at which it could no
      longer proceed.
      The Marpa parser can report the answer to questions like
    </p><blockquote><p>
        "What was the last statement you successfully parsed?"<br>
        "What was the last expression you successfully parsed?"<br>
        "What was the last arithmetic expression you successfully parsed?"<br>
        "Where did the last successfully parsed block start?  End?"<br>
      </p>
    </blockquote>
    <h3>The language</h3>
    <p>
      To focus on the logic of the error reporting,
      I looked for a language that was error-prone,
      but extremely simple.
      For this purpose,
      prefix arithmetic is like a gift from the dakinis.
      It is almost trivial in concept,
      and almost impossible to get right when it is more than a few
      characters long.
      Two valid strings in this language are
      <q>say + 1 2</q>
      and
      <q>+++ 1 2 3 + + 1 2 4</q>.
      Their results are, in order, 3 and 13.
    </p><p>
      I restricted the calculator to addition, because even with one
      operator, prefix notation is more than confusing enough to serve our purposes.
      I have included an optional
      <tt>say</tt>
      keyword, in order
      to illustrate rejection of a token by type.
      In pure prefix arithmetic, either all tokens are valid or none are.
      The
      <tt>say</tt>
      keyword is only valid as the first token.
    </p><h3>The grammar</h3><p>
      The full code for this post is in
      <a href="https://gist.github.com/3974816">
        a GitHub gist</a>.
      It was run using
      <a href="https://metacpan.org/release/JKEGL/Marpa-R2-2.023_008">
        a release candidate for the full release of Marpa::R2</a>.
      Here is the grammar.
    </p><blockquote><pre><tt>
my $prefix_grammar = Marpa::R2::Grammar-&gt;new(
    {   start          =&gt; 'Script',
        actions        =&gt; 'My_Actions',
        default_action =&gt; 'do_arg0',
        rules          =&gt; [ &lt;&lt;'END_OF_RULES' ]
Script ::=
     Expression
   | kw_say Expression action =&gt; do_arg1
Expression ::=
     Number
   | op_add Expression Expression action =&gt; do_add
END_OF_RULES
    }
);
</tt></pre></blockquote>
    <p>The rules are specified in another DSL,
      of the kind I've used
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/dsl.html">
        in previous posts</a>.
      This one is incorporated in Marpa::R2 itself,
      and is
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.023_008/pod/BNF.pod">
        documented here</a>.
      Here are its features relevant to this example:
    </p><dl>
      <dt><strong><tt>::=</tt></strong></dt>
      <dd>A BNF rule in LHS
        <tt>::=</tt>
        RHS form</dd>
      <dt><strong><tt>|</tt></strong></dt>
      <dd>Separates alternative RHS's at the
        <strong>same</strong>
        precedence level</dd>
      <dt><strong><tt>=&gt;</tt></strong></dt>
      <dd><tt>keyword =&gt; value</tt>, where
        <tt>keyword</tt>
        is the name of an adverb.</dd>
    </dl>
    <p>The
      "<tt>action =&gt; do_add</tt>"
      adverb indicates that the semantics for the alternative
      are in the Perl closure named
      <tt>do_add</tt>.
    </p><p>The rest of the grammar's definition will be familiar to Marpa users.
      <tt>Script</tt>
      is the start symbol,
      the Perl closures implementing semantics are to be found in the
      <tt>My_Actions</tt>
      package,
      and where no semantics are explicitly specified,
      the Perl closure
      <tt>do_arg0</tt>
      is the default.
    </p><h3>The semantics</h3>
    <p>The semantics for this example are easy.
    </p><blockquote><pre><tt>
sub My_Actions::do_add  { shift; return $_[1] + $_[2] }
sub My_Actions::do_arg0 { shift; return shift; }
sub My_Actions::do_arg1 { shift; return $_[1]; }
</tt></pre></blockquote>
    <p>
      The first argument to a Marpa semantic closure is a "per-parse variable",
      which is not used in this application.
      The other arguments are the values of the child nodes,
      as determined recursively and in lexical order.
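    </p>
    <p>
      To see the argument convention at work without running a parser,
      the closures can be called by hand.
      The sketch below evaluates
      <q>say + 1 2</q>
      bottom-up, the way a tree walk would.
      It assumes, for illustration only, that the values of the
      <tt>op_add</tt> and <tt>kw_say</tt> tokens are their literal text,
      and it uses an empty hash as a stand-in for the per-parse variable.
    </p>
    <blockquote><pre><tt>
use 5.010;
use strict;
use warnings;

sub My_Actions::do_add  { shift; return $_[1] + $_[2] }
sub My_Actions::do_arg0 { shift; return shift; }
sub My_Actions::do_arg1 { shift; return $_[1]; }

# Stand-in for the per-parse variable, which these closures discard
my $ppv = {};

# Expression ::= Number (default action, do_arg0)
my $one = My_Actions::do_arg0( $ppv, 1 );
my $two = My_Actions::do_arg0( $ppv, 2 );

# Expression ::= op_add Expression Expression
my $sum = My_Actions::do_add( $ppv, '+', $one, $two );

# Script ::= kw_say Expression
my $result = My_Actions::do_arg1( $ppv, 'say', $sum );

say $result;    # 3
</tt></pre></blockquote>
    <p>
      In each call the leading <tt>shift</tt> throws away the per-parse variable,
      so that <tt>$_[0]</tt> is the first child,
      <tt>$_[1]</tt> the second, and so on.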
    </p><h3>The lexing table</h3>
    <p>
      In this post,
      I am skipping around in the code --
      <a href="https://gist.github.com/3974816">
        the full code is in the gist</a>.
      But lexical analysis is of particular interest to new
      Marpa users.
      The lexer I use for this example is overkill --
      table-driven and using Perl's progressive matching
      capabilities, it is capable of serving a much more
      complex language.
      (I talked about lexing more
      <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/dsl.html">
        in a previous example</a>.)
      Here is the lexing table:
    </p>
    <blockquote>
      <pre>
my @terminals = (
    [ Number =&gt; qr/\d+/xms,   'Number' ],
    [ op_add =&gt; qr/[+]/xms,   'Addition operator' ],
    [ kw_say =&gt; qr/say\b/xms, qq{"say" keyword} ],
);
</pre></blockquote>
    <p>The lexing table is an array of 3-element arrays.
      Each sub-array contains the symbol name, a regular expression
      that is used to recognize it, and a "long name",
      a human-readable name more appropriate for error messages
      than the symbol name.
      For some languages,
      the order of the entries in our lexing table may be significant,
      although in the case of this language it makes no difference.
    </p>
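    <p>As a rough sketch of how such a table can drive Perl's progressive matching
      (the <tt>\G</tt> anchor together with the <tt>/gc</tt> flags),
      the following standalone snippet tokenizes a string without involving Marpa at all.
      The <tt>tokenize</tt> helper is hypothetical and is not part of the gist;
      it simply collects the tokens instead of passing them to a recognizer.
    </p>

```perl
use strict;
use warnings;

my @terminals = (
    [ Number => qr/\d+/xms,   'Number' ],
    [ op_add => qr/[+]/xms,   'Addition operator' ],
    [ kw_say => qr/say\b/xms, qq{"say" keyword} ],
);

# Hypothetical helper: returns a list of [ symbol name, matched text ] pairs.
sub tokenize {
    my ($string) = @_;
    my @tokens = ();
    my $length = length $string;
    pos($string) = 0;
    TOKEN: while ( pos($string) < $length ) {
        next TOKEN if $string =~ m/\G\s+/gcxms;    # skip whitespace
        for my $t (@terminals) {
            my ( $token_name, $regex ) = @{$t};

            # With /c, a failed match leaves pos($string) where it was,
            # so the next table entry is tried at the same position.
            next if not $string =~ m/\G($regex)/gcxms;
            push @tokens, [ $token_name, $1 ];
            next TOKEN;
        }
        die 'No valid token was found at position ', pos($string), "\n";
    }
    return @tokens;
}

my @tokens = tokenize('say + 1 2');
print join( q{ }, map { sprintf '%s(%s)', @{$_} } @tokens ), "\n";
```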
    <h3>Types of parsing error</h3>
    <p>Before plunging into the error-handling code,
      I will describe the forms parsing errors take,
      and show the messages that this example
      DSL's error handling produces.
    </p>
    <h4>No valid token</h4>
    <p>The lexer may reach a point in the input
      where it does not find one of the allowed tokens.
      An example in this language would be an input containing
      an exclamation point.
      There is no need to talk much about this kind of error,
      which has always been relatively easy to diagnose,
      pinpoint and, usually, to fix.
    </p>
    <h4>The parser rejects a token</h4>
    <p>
      In some cases the lexer finds a token,
      but it is not one
      that the parser will accept at that point,
      so the parser rejects the token.
      An example for this language would be the input
      "<tt>+ 1 say 2</tt>", which causes the following diagnostic:
    </p>
    <blockquote>
      <pre>
Last expression successfully parsed was:  1
A problem occurred here:  say 2
Parser rejected token ""say" keyword"
</pre>
    </blockquote>
    <p>Marpa successfully determined that
      "<tt>1</tt>" is a valid expression of the language, but
      "<tt>+ 1</tt>" is not.
    </p>
    <h4>The parser becomes exhausted</h4>
    <p>
      In other cases, the parser may "dead end" -- reach a point
      where no more input can be accepted.
      One example is with the input
      "<tt>+ 1 2 3 + + 1 2 4</tt>".
      This causes the following diagnostic:
    </p><blockquote><pre>
Last expression successfully parsed was: + 1 2
The parse became exhausted here: " 3 + + 1 2 4"
</pre></blockquote>
    <p>The parser has completed a prefix expression.
      Unlike infix and postfix expressions, once a prefix
      expression has been
      allowed to end
      there is no way to "extend" or "restart" it.
      The parse is "exhausted".
    </p><p>A second example of an exhausted parse
      occurs with the input
      "<tt>1 + 2 +3  4 + 5 + 6 + 7</tt>".
      Here is the diagnostic:
    </p><blockquote><pre>
Last expression successfully parsed was: 1
The parse became exhausted here: " + 2 +3  4 + 5 + 6 + 7"
</pre></blockquote>
    <h4>The input is fully accepted, but there is no parse</h4><p>
      Finally, it may happen that lexer and parser read and accept
      the entire input, but do not find a valid parse in it.
      For example, if the input is
      "<tt>+++</tt>", the diagnostic will be:
    </p><blockquote><pre>
No expression was successfully parsed
No parse was found, after reading the entire input
</pre></blockquote>
    <p>The input was a good start for a prefix expression,
      but no numbers were ever found,
      and our DSL reports that it never recognized any
      prefix expressions.
    </p><p>A more complicated case is this input:
      "<tt>++1 2++</tt>".
      Here is what our DSL tells us:
    </p><blockquote><pre>
Last expression successfully parsed was: +1 2
No parse was found, after reading the entire input
</pre></blockquote>
    <p>
      Our DSL did find a good expression, and tells us where it was.
      If there is more than one good expression, our DSL tells us
      the most recent.
      With input "<tt>++1 2++3 4++</tt>",
      the diagnostic becomes
    </p><blockquote><pre>
Last expression successfully parsed was: +3 4
No parse was found, after reading the entire input
</pre></blockquote>
    <p>In fact, if we thought it would be helpful,
      our DSL could show all the expressions found,
      or the last
      <i>N</i>
      expressions for some
      <i>N</i>.
      This is a simple language with nothing but expressions
      involving a single operator.
      More interesting languages will have statements and blocks,
      and layers of subexpressions.
      The logic below can be straightforwardly modified to show us
      as much about these as we think will be helpful.
    </p><h3>Parsing the DSL</h3>
    <blockquote><pre>
sub my_parser {
    my ( $grammar, $string ) = @_;
    my @positions = (0);
    my $recce = Marpa::R2::Recognizer-&gt;new( { grammar =&gt; $grammar } );

    my $self = bless {
        grammar   =&gt; $grammar,
        input     =&gt; \$string,
        recce     =&gt; $recce,
        positions =&gt; \@positions
        },
        'My_Error';

    my $length = length $string;
    pos $string = $positions[-1];

    <big><b>... "Reading the tokens" goes here ...</b></big>

    my $value_ref = $recce-&gt;value;
    if ( not defined $value_ref ) {
        die $self-&gt;show_last_expression(), "\n",
            "No parse was found, after reading the entire input\n";
    }
    return ${$value_ref};
} ## end sub my_parser
</pre></blockquote>
    <p>The above closure takes a grammar and an input string, and either produces a parse
      value,
      or a diagnostic telling us exactly why it could not.
      For truly helpful diagnostics, I find it necessary to be able to quote
      the input exactly.
      The
      <tt>@positions</tt>
      array will be used to map the locations that the Marpa
      parser uses back to positions in the original input string.
      Marpa location 0 is always before any input symbol, so it is initialized
      to string position 0.
    </p>
    <p>
      The
      <tt>$self</tt>
      object is a convenience.
      It collects the information the error handler needs,
      and allows an elegant syntax for the error-handling calls.
    </p>
    <p>The loop for reading tokens will be described below.
      After it, but before the
      <tt>return</tt>,
      is our first error check.
      "No parse" errors show up after all the tokens have been read,
      when the
      <tt>$recce-&gt;value()</tt>
      call returns a Perl
      <tt>undef</tt>.
      In that case,
      we produce the message we showed above.
      The tricky details are hidden in the
      <tt>show_last_expression()</tt>
      method,
      which we will come to.
    </p><h3>Reading the tokens</h3>
    <blockquote>
      <pre>
TOKEN: while ( pos $string &lt; $length ) {
    next TOKEN if $string =~ m/\G\s+/gcxms;    # skip whitespace
    if ( $recce-&gt;exhausted() ) {
	die $self-&gt;show_last_expression(), "\n",
	    q{The parse became exhausted here: "},
	    $self-&gt;show_position( $positions[-1] ), qq{"\n},
	    ;
    } ## end if ( $recce-&gt;exhausted() )

    <big><b>...  "Looping through the lexing table" goes here ...</b></big>

    die 'A problem occurred here: ',
	$self-&gt;show_position( $positions[-1] ), "\n",
	q{No valid token was found};
} ## end TOKEN: while ( pos $string &lt; $length )
</pre>
    </blockquote>
    <p>This loop implements part of our progressive matching
      within
      <tt>$string</tt>,
      and contains two of our four error checks.
      The
      <tt>exhausted()</tt>
      method checks whether the parse is
      exhausted,
      and again the hard work is done by the
      <tt>show_last_expression()</tt>
      method.
    </p><p>If we get through the lexing table without finding a token,
      we produce an invalid token message
      and report the position using the
      <tt>show_position()</tt>
      method.
      For invalid tokens, position should be all that
      the user needs to know.
      Position is also reported in the case of an exhausted parse.
      Implementation of the
      <tt>show_position()</tt>
      method presents
      no difficulties -- the code can be found in the gist.
    </p><h3>Looping through the lexing table</h3>
    <blockquote><pre>
TOKEN_TYPE: for my $t (@terminals) {
    my ( $token_name, $regex, $long_name ) = @{$t};
    next TOKEN_TYPE if not $string =~ m/\G($regex)/gcxms;
    if ( defined $recce-&gt;read( $token_name, $1 ) ) {
	my $latest_earley_set_ID = $recce-&gt;latest_earley_set();
	$positions[$latest_earley_set_ID] = pos $string;
	next TOKEN;
    }
    die $self-&gt;show_last_expression(), "\n",
	'A problem occurred here: ',
	$self-&gt;show_position( $positions[-1] ), "\n",
	qq{Parser rejected token "$long_name"\n};
} ## end TOKEN_TYPE: for my $t (@terminals)
</pre></blockquote>
    <p>
      Our innermost loop is through the lexing table,
      checking each table entry against the input string.
      If a match is found, the Marpa recognizer's
      <tt>read()</tt>
      method is called.
      This may fail due to our fourth and last type of error:
      a rejected token.
      Again,
      <tt>show_position()</tt>
      reports position
      and
      <tt>show_last_expression()</tt>
      does the interesting stuff.
    </p><h3>Showing the last expression</h3>
    <blockquote><pre><tt>
sub My_Error::show_last_expression {
    my ($self) = @_;
    my $last_expression =
        $self-&gt;input_slice( $self-&gt;last_completed_range('Expression') );
    return
        defined $last_expression
        ? "Last expression successfully parsed was: $last_expression"
        : 'No expression was successfully parsed';
} ## end sub My_Error::show_last_expression
</tt></pre></blockquote>
    <p>At its top level,
      <tt>show_last_expression()</tt>
      finds the parse locations of the last completed
      <tt>Expression</tt>
      symbol,
      using the
      <tt>last_completed_range()</tt>
      method.
      (In Marpa,
      as in other Earley parsers, a symbol or rule that has been recognized
      from start to finish is said to be "completed".)
      The parse locations are passed to the
      <tt>input_slice()</tt>
      method,
      which translates them into the corresponding substring of the input
      string.
    </p><blockquote>
      <pre>
sub My_Error::input_slice {
    my ( $self, $start, $end ) = @_;
    my $positions = $self-&gt;{positions};
    return if not defined $start;
    my $start_position = $positions-&gt;[$start];
    my $length         = $positions-&gt;[$end] - $start_position;
    return substr ${ $self-&gt;{input} }, $start_position, $length;
} ## end sub My_Error::input_slice
</pre>
    </blockquote>
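    <p>To see the mapping at work without running a parse,
      here is a small standalone sketch.
      The <tt>@positions</tt> array is filled in by hand for the input
      "<tt>+ 1 2 3</tt>",
      on the assumption that each token read produces one Earley set,
      as in the reading loop;
      the object is a plain blessed hash built only for illustration.
    </p>

```perl
use strict;
use warnings;

sub My_Error::input_slice {
    my ( $self, $start, $end ) = @_;
    my $positions = $self->{positions};
    return if not defined $start;
    my $start_position = $positions->[$start];
    my $length         = $positions->[$end] - $start_position;
    return substr ${ $self->{input} }, $start_position, $length;
}

my $string = '+ 1 2 3';

# Hand-built for illustration: $positions[$n] is the string offset just
# after the $n'th token, and Earley set 0 maps to offset 0.
my @positions = ( 0, 1, 3, 5, 7 );
my $self = bless { input => \$string, positions => \@positions }, 'My_Error';

# Earley sets 0 through 3 cover the first three tokens.
print q{"}, $self->input_slice( 0, 3 ), qq{"\n};

# Whitespace skipped before a token belongs to that token's slice.
print q{"}, $self->input_slice( 1, 2 ), qq{"\n};

# With no range at all, the result is undefined.
print defined $self->input_slice() ? "defined\n" : "undefined\n";
```

    <p>Because skipped whitespace is counted with the token that follows it,
      a slice can begin with a space, as some of the diagnostics shown earlier do.
    </p>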
    <h3>Finding the last successful parse of a symbol</h3>
    <p>The
      <tt>last_completed_range()</tt>
      method does the
      complicated part of the error handling -- finding the last
      successfully recognized ("completed")
      occurrence of a symbol.
      The
      <tt>last_completed_range()</tt>
      method does not use
      any internals, but it certainly gets technical
      in its use of the external methods.
      It or something like it
      is a prime candidate to be folded into the Marpa
      interface someday.
    </p><p>
      Successful recognitions of a symbol are called,
      again following standard Earley parsing terminology,
      "completions".
      Completions are recorded by rule,
      so the first thing that must be done is to turn the
      symbol name into a list of those rules which have
      that symbol on their left hand side.
      These are called the
      <tt>@sought_rules</tt>.
      We also need to initialize the loop by
      recording the last parse location ("latest Earley set").
      <tt>$earley_set</tt>
      will be our loop variable.
    </p>
    <blockquote><pre>
sub My_Error::last_completed_range {
    my ( $self, $symbol_name ) = @_;
    my $grammar      = $self-&gt;{grammar};
    my $recce        = $self-&gt;{recce};
    my @sought_rules = ();
    for my $rule_id ( $grammar-&gt;rule_ids() ) {
        my ($lhs) = $grammar-&gt;rule($rule_id);
        push @sought_rules, $rule_id if $lhs eq $symbol_name;
    }
    die "Looking for completion of non-existent rule lhs: $symbol_name"
        if not scalar @sought_rules;
    my $latest_earley_set = $recce-&gt;latest_earley_set();
    my $earley_set        = $latest_earley_set;

    <big><b>... "Traversing the Earley sets" goes here ...</b></big>

    return if $earley_set &lt; 0;
    return ( $first_origin, $earley_set );
} ## end sub My_Error::last_completed_range
</pre></blockquote>
    <p>
      Once we have traversed the Earley sets, we need only return
      the appropriate value.
      If the Earley set number fell below 0, we never found any completions
      of the "sought rules",
      a circumstance which we report with a bare
      <tt>return</tt>
      statement.
      Otherwise,
      <tt>$first_origin</tt>
      and
      <tt>$earley_set</tt>
      will be set to the first and last parse locations of the completion,
      and we return them.
    </p>
    <h3>Traversing the Earley sets</h3>
    <p>This is our final code sample, and the buck stops here.
      Marpa::R2 introduced more detailed user access to the progress reporting
      information, and
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.023_008/pod/Progress.pod">
        that interface</a>
      is used here.
    </p>
    <p>We traverse the Earley sets in reverse order,
      beginning with the latest and going back, if necessary, to Earley set 0.
      For each Earley set, there are "progress items", reports of the progress
      as of that Earley set.
      Of these, we are only interested in completions,
      which have a "dot position" of -1.
      (Those interested in a fuller explanation of "dot positions",
      progress items, and
      progress reports, can look in
      <a href="https://metacpan.org/module/JKEGL/Marpa-R2-2.023_008/pod/Progress.pod">
        the documentation for progress reports</a>.)
      Of the completions, we are interested only in those for one of
      the
      <tt>@sought_rules</tt>.
    </p>
    <p>
      For any given set of sought rules, more than one might end at
      a given Earley set.
      Usually we are most interested in the longest of these,
      and this logic assumes that we are only interested in the
      longest completion.
      We check if the start of the completion (its "origin") is prior to
      our current match, and if so it becomes our new
      <tt>$first_origin</tt>.
    </p>
    <p><tt>$first_origin</tt>
      was initialized to a non-existent Earley set,
      higher in number than any actual one.
      Once out of the loop through the progress items, we check if
      <tt>$first_origin</tt>
      is still at its initialized value.
      If so, we need to iterate backward one more Earley set.
      If not, we are done, and
      <tt>$first_origin</tt>
      and
      <tt>$earley_set</tt>
      contain the information that we were looking for -- the start and end
      locations of the most recent longest completion of one of the
      <tt>@sought_rules</tt>.
    </p><blockquote>
      <pre>
# Start from an impossible origin, one past the latest Earley set.
my $first_origin = $latest_earley_set + 1;
EARLEY_SET: while ( $earley_set &gt;= 0 ) {
    my $report_items = $recce-&gt;progress($earley_set);
    ITEM: for my $report_item ( @{$report_items} ) {
        my ( $rule_id, $dot_position, $origin ) = @{$report_item};
        next ITEM if $dot_position != -1;    # completions only
        next ITEM if not scalar grep { $_ == $rule_id } @sought_rules;
        next ITEM if $origin &gt;= $first_origin;    # keep the earliest origin
        $first_origin = $origin;
    } ## end ITEM: for my $report_item ( @{$report_items} )
    # Stop at the first Earley set in which a sought completion was found.
    last EARLEY_SET if $first_origin &lt;= $latest_earley_set;
    $earley_set--;
} ## end EARLEY_SET: while ( $earley_set &gt;= 0 )
</pre>
    </blockquote>
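    <p>The traversal can be tried out without a recognizer.
      The sketch below runs the same loop over hand-written progress items;
      the <tt>%progress</tt> table, the rule IDs in it,
      and the standalone <tt>last_completed_range</tt> function
      are all invented for illustration.
      In the real code the items come from
      <tt>$recce-&gt;progress()</tt>.
    </p>

```perl
use strict;
use warnings;

# Invented data: Earley set ID => list of [ rule ID, dot position, origin ].
# A dot position of -1 marks a completion.
my %progress = (
    0 => [ [ 0, 0,  0 ] ],
    1 => [ [ 2, 1,  0 ] ],
    2 => [ [ 1, -1, 1 ], [ 2, 2, 0 ] ],
    3 => [ [ 1, -1, 2 ], [ 2, -1, 0 ], [ 0, -1, 0 ] ],
    4 => [ [ 2, 1,  3 ] ],
);

# Standalone variant of the method: the progress table and the sought
# rule IDs are passed in directly instead of coming from Marpa objects.
sub last_completed_range {
    my ( $progress, $latest_earley_set, @sought_rules ) = @_;
    my $earley_set   = $latest_earley_set;
    my $first_origin = $latest_earley_set + 1;
    EARLEY_SET: while ( $earley_set >= 0 ) {
        ITEM: for my $report_item ( @{ $progress->{$earley_set} || [] } ) {
            my ( $rule_id, $dot_position, $origin ) = @{$report_item};
            next ITEM if $dot_position != -1;
            next ITEM if not scalar grep { $_ == $rule_id } @sought_rules;
            next ITEM if $origin >= $first_origin;
            $first_origin = $origin;
        }
        last EARLEY_SET if $first_origin <= $latest_earley_set;
        $earley_set--;
    }
    return if $earley_set < 0;
    return ( $first_origin, $earley_set );
}

# Rules 1 and 2 play the role of the @sought_rules.
my @range = last_completed_range( \%progress, 4, 1, 2 );
print "Longest most recent completion: @range\n";
```

    <p>Earley set 4 holds no completions, so the loop steps back to set 3.
      There, two sought rules complete, and the one with the earlier origin,
      the longer of the two, is the one reported.
    </p>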
    <h3>Comments</h3>
    <p>
      Comments on this post can be sent to the Marpa Google Group:
      <code>marpa-parser@googlegroups.com</code>
    </p>
]]>
        
    </content>
</entry>

</feed>
