Partial parsing and error reporting

By Jeffrey Kegler on July 22, 2012 7:48 PM

"One of the secrets to mathematical problem solving is that one needs to place a high value on partial progress, as being a crucial stepping stone to fully solving the problem." -- Terence Tao

Once an error is found, a traditional parser is pretty much lost. This is true for even state of the art compilers and interpreters. One small typo results in many screens of useless diagnostics. If you're an old hand, you scroll back over these, knowing that, for the better quality compilers, the first few lines often have something to do with the real problem. With Marpa, we can do better than this.

Robust error reporting is equivalent to the problem of finding code interspersed among arbitary text. For example, consider this mixture of text and Perl fragments.

Note: line:column figures include preceding whitepace
The next line is a perl fragment
{42;{1,2,3;4}}
Code block from 3:5 to 3:13
Code block from 2:33 to 3:14
The next line is a perl fragment
{42;{1,2,3,4}}
Hash from 7:5 to 7:13
Code block from 7:5 to 7:13
Code block from 6:33 to 7:14
The next line is a perl fragment
{42;{;1,2,3;4}}
Code block from 12:5 to 12:14
Code block from 11:33 to 12:15
The next line is a perl fragment
{42;+{1,2,3,4}}
Hash from 16:6 to 16:14
Code block from 15:33 to 16:15

I have written the prototype of a utility (ucurly.pl) that finds Perl fragments scattered among other material, and parses them. To test the accuracy of the parse, I have it look for one of Perl's parsing ambiguities --- anonymous hash constructors that could also be code blocks. For the example above, ucurly.pl finds what there is to be found:

Perl fragment: {42;{1,2,3;4}}
Code block at 3:5 3:13 {1,2,3;4}
Code block at 2:33 3:14 {42;{1,2,3;4}}
Perl fragment: {42;{1,2,3,4}}
Ambiguous Hash at 7:5 7:13 {1,2,3,4}
Ambiguous Code block at 7:5 7:13 {1,2,3,4}
Code block at 6:33 7:14 {42;{1,2,3,4}}
Perl fragment: {42;{;1,2,3;4}}
Code block at 12:5 12:14 {;1,2,3;4}
Code block at 11:33 12:15 {42;{;1,2,3;4}}
Perl fragment: {42;+{1,2,3,4}}
Hash at 16:6 16:14 {1,2,3,4}
Code block at 15:33 16:15 {42;+{1,2,3,4}}
perl tokens = 62; all tokens=267; 23.22%

My Marpa-based Perl parser, as it currently stands, is a partial Perl parser. On select problems, it is 100%. For example, it understands all of the output from Data::Dumper. And it scored perfectly in the example in this blog post. It will parse perhaps 90% of a typical, complex Perl program.

At this point in the Marpa-based Perl parser's evolution, rather than add Perl syntax, I decided to use it to investigate partial parsing. Some readers will have observed that "90% coverage" is a synonym with "useless" in current parsing practice. That's because traditional parsing algorithms are close to binary. Traditional parsers only understand input if it is correct or nearly so.

When partial parsing is possible, things are more interesting. For example, I turned ucurly.pl, my ambiguity-finding utility, loose on CGI.pm, a complex bit of Perl code by anyone's standard. My utility failed to parse a lot of correct Perl code in CGI.pm, producing a lot of false negatives. But this did not prevent it from singling out line 461:


      @result = map {ref $_ ? $_ : $self->_decode_utf8($_) } @result;

This line is ambiguous -- the first argument to map could be either an anonymous hash constructor, or a code block. My guess, and I believe, the Perl's parser's guess, is that it's a code block. But it'd be nice to have a utility that spots these, so that all doubt can be removed:

      @result = map { ; ref $_ ? $_ : $self->_decode_utf8($_) } @result;

2 comments

Tagged as:

Earley, Marpa, parsing. parser

2 Comments

jnareb.myopenid.com | July 23, 2012 3:24 AM

I think that it is because of 'map' signature that Perl parser prefers code block interpretation over hashref interpretation.

Jeffrey Kegler replied to comment from jnareb.myopenid.com | July 23, 2012 5:47 AM

@jnareb: The description of map in perlfunc has a discussion of this, including suggestions for helping perl guess what is meant. ("Guess" and "help" are the words perlfunc uses.)

Perl was designed by a linguist, and guessing with help from context is how natural languages work. But as you move from quick-and-dirty scripts toward production code, a guessed-at syntax becomes less desirable.

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.

More info »

Ocean of Awareness