Perl and Parsing 8: The Where and Why of Rejection
Why Perl is Just Not That Into Your Syntax
In a previous post, I noted that Perl often cannot precisely locate syntax errors in its scripts. Still less can it identify the exact problem. In this post, I will demonstrate an experimental utility which does pinpoint Perl syntax errors, precisely indicating where and what the problem is.
Here's my example from the previous post.
my $lyric =
'Sloopy wears a red dress, yeah
As old as the hills
but when sloopy wears that red dress, yeah
you know it gives me the chills
Sloopy when I see you walking,
walking down the street
I say don\'t worry sloopy, girl
You belong to me'~!@$%^&*()_+;
print "$lyric\n";
And here is Perl's output for the error:
Bareword found where operator expected at bomb2.pl line 9, near ")_"
(Missing operator before _?)
syntax error at bomb2.pl line 9, near "You belong to me'~"
(Might be a runaway multi-line '' string starting on line 2)
Execution of bomb2.pl aborted due to compilation errors.
As I said in that previous post, perl clearly has very little idea where things went wrong -- it's guessing.
Pinpointing the Error
At this point, let me give away the ending. The point of failure is the very first special symbol: the tilde. When I ran my fingers from left to right across the top of my keyboard, I was hoping to produce a more complicated example. But perhaps it is just as well I did not.
Here, from my experimental Marpa-based utility, is what Perl is looking for when it encounters the tilde. The dot in each rule indicates how far the parse has already progressed.
line -> label sideff . SEMI
sideff -> expr . IF expr
sideff -> expr . UNLESS expr
sideff -> expr . WHILE expr
sideff -> expr . UNTIL iexpr
sideff -> expr . FOR expr
sideff -> expr . WHEN expr
or_expr -> or_expr . OROP and_expr
or_expr -> or_expr . DOROP and_expr
and_expr -> and_expr . ANDOP argexpr
argexpr -> argexpr . COMMA
argexpr -> argexpr . COMMA term
term_listop -> term_cond . ASSIGNOP term_listop
term_assign -> term_cond . ASSIGNOP term_assign
term_cond -> term_dotdot . QUESTION term_cond COLON term_cond
term_dotdot -> term_oror . DOTDOT term_oror
term_oror -> term_oror . OROR term_andand
term_oror -> term_oror . DORDOR term_andand
term_andand -> term_andand . ANDAND term_bitorop
term_bitorop -> term_bitorop . BITOROP term_bitandop
term_bitandop -> term_bitandop . BITANDOP term_eqop
term_eqop -> term_relop . EQOP term_relop
term_relop -> term_uniop . RELOP term_uniop
term_shiftop -> term_shiftop . SHIFTOP term_addop
term_addop -> term_addop . ADDOP term_mulop
term_mulop -> term_mulop . MULOP term_matchop
term_matchop -> term_matchop . MATCHOP term_uminus
term_powop -> term_increment . POWOP term_powop
term_increment -> term_arrow . POSTINC
term_increment -> term_arrow . POSTDEC
term_arrow -> term_arrow . ARROW method LPAREN listexprcom RPAREN
term_arrow -> term_arrow . ARROW method
subscripted -> term_hi . ARROW LSQUARE expr RSQUARE
subscripted -> term_hi . ARROW LCURLY expr SEMI RCURLY
subscripted -> term_hi . ARROW LPAREN RPAREN
subscripted -> term_hi . ARROW LPAREN expr RPAREN
The names of the symbols are based on those in perly.y. Operators are not shown symbolically, but are indicated with the name in caps: "POSTINC" instead of "++". Terms are suffixed with their precedence: "term_assign" is the symbol for terms with the same precedence as the assignment operator. "term_hi" is the symbol for terms at the highest precedence level.
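To make the dot notation concrete, here is a minimal sketch of how the expected symbols can be read off dotted rules like the ones above. It is not code from my utility: the helper name expected_symbol is made up for this illustration, the rules are a small subset of the listing, and the last rule is a hypothetical completed rule added only to show that a dot at the end produces no expectation.

```perl
use strict;
use warnings;

# Return the symbol immediately after the dot in a dotted rule,
# or undef when the dot is at the end (a completed rule).
# expected_symbol is a hypothetical helper, not part of any module.
sub expected_symbol {
    my ($dotted_rule) = @_;
    my ( $lhs, $rhs ) = split /\s*->\s*/, $dotted_rule, 2;
    my @symbols = split ' ', $rhs;
    for my $i ( 0 .. $#symbols ) {
        next unless $symbols[$i] eq '.';
        return $symbols[ $i + 1 ];    # undef if the dot is the last item
    }
    return;
}

my @rules = (
    'line -> label sideff . SEMI',
    'sideff -> expr . IF expr',
    'term_increment -> term_arrow . POSTINC',
    'term_increment -> term_arrow . POSTDEC',
    'argexpr -> argexpr . COMMA',
    'argexpr -> argexpr . COMMA term',
    'example_lhs -> example_rhs .',    # hypothetical completed rule
);

# Collect each expected symbol once, in order of first appearance.
my ( %seen, @expected );
for my $rule (@rules) {
    my $symbol = expected_symbol($rule);
    next unless defined $symbol;
    push @expected, $symbol unless $seen{$symbol}++;
}

print join( ' ', @expected ), "\n";
```

For this subset the script prints SEMI IF POSTINC POSTDEC COMMA. Each of those is a token the parser would accept next, and anything else is rejected at that position.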
A tilde, when it is a single-character Perl operator, is always a prefix unary operator. Tildes also form part of several multi-character operators, but that is not the case here. In summary, the rules above show that Perl is looking for one of the following when it encounters the tilde:
- A semicolon that would end the statement. (Statements are called "lines" internally by perly.y.)
- The postfix conditional keywords.
- A lot of binary operators.
- A couple of postfix unary operators. These, if found, will produce a compile-time complaint about incrementing a constant.
No prefix unary operator is in the above list, and the parse fails here.
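The same behavior can be checked in stock Perl, without my utility. The snippet below is only an illustration: it compiles two small strings with eval, one that uses the tilde as a prefix operator and one that puts it directly after a complete term, as in the lyric example. The variable names are arbitrary, and the exact wording of the error message may differ between Perl versions.

```perl
use strict;
use warnings;

# Prefix use: "~" applies to the term that follows it (bitwise negation).
my $prefix_ok = eval q{ my $bits = ~0; 1 };

# Tilde directly after a complete term, as in the lyric example.
# This string fails to compile, so eval returns undef and sets $@.
my $postfix_ok  = eval q{ my $text = 'abc'~; 1 };
my $postfix_err = $@;

print "prefix tilde compiles\n" if $prefix_ok;
print "tilde after a term is rejected: $postfix_err" if !$postfix_ok;
```

The second eval fails at compile time because, after a complete term, the parser accepts only the kinds of token listed above, and a prefix operator is not among them.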
About the Utility Used in This Post
Finding the exact point of failure and the exact reasons would seem like something that you'd want in a parser. But in fact, production languages have tended to be like Perl -- they try to indicate the general area of a syntax problem and to make a good guess as to its nature. But they leave it to the programmer to figure out exactly where they failed and why.
Marpa, then, is unusual, in that for any grammar you can write in BNF, and any input, it will either produce a parse, or a precise characterization of the failure. Marpa::XS::Perl is still experimental and under development. As I tackle tasks (like preparing this post, for example) I add the necessary capabilities. An example of what my utility cannot yet do is deal with floating point constants. (They're not hard, I just haven't encountered them yet in a test case.)
My original intent with Marpa::XS::Perl was to use it for snippets, and for academic and toy examples, and it cannot yet deal with production Perl code. My purpose so far has been to demonstrate that Marpa could be the basis of a practical Perl parsing utility.
Notes
Note 1: Of course, in one sense, the exact nature of the problem depends on what the person writing the script intended, and on this my utility has not a clue. In this post, "finding the exact problem" means finding the exact location of a parse failure, and finding exactly what perl was looking for when perl did not find what perl wanted to find.
In determining the "exact location of parse failure", I also avoid mind-reading. I use a definition taken from the parsing literature: In a rejected token stream, the point of failure is the first token which made a successful parse impossible. In other words, if you encounter a token which cannot possibly be part of a successful parse, given the input you've already read, that token is the point of failure. Looking at it from the opposite point of view, if you can find some additional input that makes the parse succeed, you have not yet found a point of failure.
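This definition can be sketched on a toy language. The example below assumes a language of balanced parentheses, which has nothing to do with Perl's grammar and was chosen only because it is easy to tell whether a prefix can still be completed. The function name point_of_failure is hypothetical.

```perl
use strict;
use warnings;

# Toy language (an assumption for illustration): balanced parentheses.
# Returns the 0-based index of the point of failure, that is, the first
# token after which no further input could make the parse succeed.
# Returns undef if the tokens read so far can still be completed.
sub point_of_failure {
    my @tokens = @_;
    my $depth  = 0;
    for my $i ( 0 .. $#tokens ) {
        my $token = $tokens[$i];
        if    ( $token eq '(' )                { $depth++ }
        elsif ( $token eq ')' and $depth > 0 ) { $depth-- }
        else                                   { return $i }
    }
    return;
}

# Incomplete, but adding one ")" would make it succeed: no point of failure.
my $unclosed = point_of_failure(qw/ ( ( ) /);

# The third token closes a parenthesis that was never opened.
my $stray = point_of_failure(qw/ ( ) ) ( /);

print defined $unclosed ? "failure at $unclosed\n" : "no point of failure yet\n";
print "failure at token $stray\n";
```

The first input is not a complete parse, but it has no point of failure, because more input can rescue it. The second input fails at token 2, and nothing that follows can change that.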
Note 2: To be precise, the output in this post was automatically generated by my utility, then edited for readability. Specifically, the edits removed those lines which were for rules with the dot at the end, and removed rule numbers and token numbers from the beginnings of the lines. Since rules with the dot at the end are completed, they do not generate any expectations for future tokens, and are irrelevant here. Similarly, in this context the internal rule numbers and location numbers would be clutter. While I made these readability edits by hand, they were rote and could easily have been automated.
Note 3: Unlike in textbook BNF, the BNF in perly.y does not have a separate symbol for terms of each precedence. The BNF in perly.y is wildly ambiguous, unlike the Perl language. perly.y uses a tie-breaking technique, in combination with the BNF, to assign precedence. While some use of this kind of tie-breaking is standard in yacc, for the Perl parser, Larry used it far more boldly than had been the practice before or, for that matter, has been since. This is a very important aspect of Perl parsing, one on which I've been planning to post.
So, we should change the famous quote:
"Only Perl can parse Perl."
To:
"Even Perl can't parse Perl."
@dly: Actually, to get pedantic (which I tend to do), this post is about error detection and not about parsing. Parsing (in the strict sense) is examining a string according to a grammar and giving a binary answer: "Yes, the string matches the grammar" or "No, it does not". So in the examples in this post, Perl certainly is successfully parsing -- it was supposed to say "no" and it did correctly say "no".
I should perhaps have emphasized more that the inability to say where the problem occurs is a property of LALR parsing, rather than a peculiarity of Perl. The old LALR-based Portable C compiler (which was standard when I was a neophyte and more sensitive to these things) was often quite vague about why it would not compile my C programs.