Perl and Parsing 10: "Use" the Easier Way

In the previous post in this series, I described the way the Traditional Perl Parser (TPP) parses Perl's use statement. This post will describe an IMHO attractive alternative approach, using my own Marpa::XS. Marpa::XS has a prototype Perl parser as part of its test suite.

The TPP Way

If that previous post is not fresh in your mind, it may be useful at this point to look back at it. Summarizing, the TPP uses the "Ruby Slippers" method: It first simplifies the grammar so that, while convenient for the parser, the grammar falls short of describing the actual language. The lexer then fakes the input to the parser so that the parser's "wishful thinking" comes true.

In the TPP, the parser requires all use statements take this form:

use_statement ::= USE WORD WORD LIST

Here the WORD's may be either versions or names of modules -- the semantics looks at the internal representations to sort this out once the parsing is finished.

The Marpa::Perl Way

Marpa is a lot better than TPP's LALR parse engine at the Ruby Slippers thing. Marpa will tell its lexers exactly what it wants at every point, whereas the TPP's lexer can expect no help from the LALR parse engine and is forced to do its own parsing of use statement. But in Marpa::Perl, I decided NOT to use the Ruby Slippers to parse Perl's use statement. Marpa is good at the Ruby Slippers, but it also allows even better and easier ways to tackle this problem.

Marpa can parse far more grammars than the TPP's parse engine. Marpa can even parse ambigious grammars. The use statement comes in three forms, and it is easy to create a rule for each one.

use_statement ::= USE NAME VERSION LIST
use_statement ::= USE VERSION
use_statement ::= USE NAME LIST

This is a very natural way to describe the use statement -- so natural that in fact, it's almost exactly the way the Perl documentation DOES describe it. The one difference is that the Perl documentation lists separate forms for use statements with and without LIST's of arguments. This is useful for clarity, but LIST's can be empty. Separate rules in the parser's BNF dedicated to module requests without the LIST's would create a useless and inefficient ambiguity.

The three rules above demonstrate ambiguity of a helpful kind. Consider

use Fatal 5;
The TPP parses this as a request to use the (so far non-existent) version 5 or higher of the Fatal. According to both TPP's and Marpa::Perl's BNF, however, it could also be parsed as a request to Fatal to wrap a Perl subroutine named 5. In fact, that's exactly how the TPP will interpret
use Fatal +5;

Marpa::Perl solves the ambiquity by using Marpa's new ranking method, which allows every rule to have an integer rank. Marpa::Perl ranks the rules in the order I listed them above, from highest to lowest. The long form (both name and version) of the module request ranks highest, meaning that Marpa::Perl parses a use statement that way whenever possible. When a use statement cannot possibly be the long form of a module request, Marpa::Perl tries to parse it as a perl version request. As a last resort, Marpa::Perl interprets a use statement as the short form (name only) of a module request.

By using this rule ranking, Marpa::Perl's lexer can avoid the issue of determining what is a version number and what is a name. Marpa allows tokens to be ambiguous. Marpa::Perl's lexer can tell the parser that v5.0 is BOTH a name and a version, and let the parser sort things out. LALR is, to say the least, not tolerant of ambiguities, so the TPP's lexer must tackle the issue of identifying versions head-on.

Summarizing ...

When it comes to parsing Perl's use statement, I believe I have shown that Marpa can do it more easily. Parsing the use statement in Marpa required the three BNF rules shown, plus arranging to have the lexer recognize that numbers and v-strings can be either a version or a name, which is done with a simple loop over a fixed list. Look at the corresponding code in toke.c's tokenize_use() and op.c's Perl_utilize(), and you'll see some of the most difficult logic that you're likely to find anywhere.

Perhaps more significant is what Marpa::Perl could do that the TPP does not and could not do. Imagine, for example, that you've written a module to make available the regular expression logic as implemented in the classic AT&T UNIX version 8, affectionately remembered as v8 UNIX. Imagine further that you decide, unhappily, to name this module, v8.pm. In that case you can simply forget about loading it with TPP's use statement, which will insist that v8 is a version number, even in a case like this:

use v8 0 qw(re);
The use statement just shown can only have one parse. A human being can quickly determine what that one parse must be. Why have we come to believe this is too much to expect of our parsing algorithms?

Notes

  1. "Traditional Perl Parser": In this series, I call the Perl interpreter's parser, the one whose implementation centers on the files toke.c and perly.y, the Traditional Perl Parser, or TPP. Most others just call it the Perl parser, but in this series I discuss its alternatives (Marpa::Perl and PPI) a lot. This little bit of special terminology saves confusion.

  2. "names of modules": Actually, these can also be pragma names, a fact that, for brevity's sake, I will not mention again.

  3. "last resort": Actually, the TPP accepts some undocumented syntaxes. This gets into strange territory. The TPP parses

    use v2 Fatal;
    
    as a request to use version 2.0 or higher of the Fatal module. You could take the point of view that the TPP should reject statements like this as syntax errors, in which case TPP's behavior is a bug. For the purposes of this blog post, I ignore this issue.

  4. "a simple loop over a fixed list": For the code see pperl/Marpa/Perl.pm in the latest developer's version of Marpa::XS . The code discussed begins at line 1200.

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.