Reporting mismatched delimiters
[ This is cross-posted by invitation, from its home on the Ocean of Awareness blog. ]
In many contexts, programs need to identify non-overlapping pieces of a text. One very direct way to do this is to use a pair of delimiters. One delimiter of the pair marks the start and the other marks the end. Delimiters can take many forms: quote marks, parentheses, curly braces, square brackets, XML tags, and HTML tags are all delimiters in this sense.
It is easy to mismatch delimiters. Traditional parsers are often poor at reporting these errors: hopeless after the first mismatch, and for that matter none too precise about the first one. This post outlines a scalable method for the accurate reporting of mismatched delimiters. I will illustrate the method with a simple but usable tool -- a utility which reports mismatched brackets.
The example script
The example script, bracket.pl, reports mismatched brackets in the set:
() {} []
They are expected to nest without overlaps. Other text is treated as filler. bracket.pl is not smart about things like strings or comments. This does have the advantage of making bracket.pl mostly language-agnostic.
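To make "other text is treated as filler" concrete, here is a minimal Python sketch of that kind of scan. It is not taken from bracket.pl (which is a Perl script built on Marpa); the names `CLOSER_FOR`, `BRACKETS` and `bracket_tokens` are hypothetical, chosen only for this illustration.

```python
# Hypothetical helper, not part of bracket.pl: find brackets, skip filler.
CLOSER_FOR = {"(": ")", "{": "}", "[": "]"}
BRACKETS = set(CLOSER_FOR) | set(CLOSER_FOR.values())

def bracket_tokens(text):
    """Yield (bracket, line, column) for each bracket, with 1-based positions."""
    line, column = 1, 0
    for ch in text:
        if ch == "\n":
            line, column = line + 1, 0
            continue
        column += 1
        if ch in BRACKETS:
            yield ch, line, column

tokens = list(bracket_tokens("f(a[0]) {\n  return x;\n}"))

# A bracket inside a string literal is picked up like any other bracket,
# which is the price of ignoring strings and comments.
in_string = list(bracket_tokens('")"'))
```

Because the scan knows nothing about the surrounding language, the same few lines work on C, Perl, JSON or prose, at the cost of false reports for brackets inside strings and comments.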
Because it's intended primarily to be read as an illustration of the technique, bracket.pl's grammar is a basic one. The grammar that bracket.pl uses is so simple that an emulator of bracket.pl could be written using recursive descent. I hope the reader who goes on to look into the details will see that this technique scales to more complex situations, in a way that a solution based on a traditional parser will not.
Error reports
The description of how the method works will make more sense after we've looked at some examples of the diagnostics bracket.pl produces. To be truly useful, bracket.pl must report mismatches that span many lines, and it can do this. But single-line examples are easier to follow, so every example in this post fits on one line. Consider the string '((([))'. bracket.pl's diagnostics are:
    * Line 1, column 1: Opening '(' never closed, problem detected at end of string
    ((([))
    ^
    ====================
    * Line 1, column 4: Missing close ], problem detected at line 1, column 5
    ((([))
       ^^
In the next example bracket.pl realizes that it cannot accept the ')' at column 16, without first closing the set of curly braces started at column 5. It identifies the problem, along with both of the locations involved.
    * Line 1, column 5: Missing close }, problem detected at line 1, column 16
    [({({x[]x{}x()x)})]
        ^          ^
So far, so good. But an important advantage of bracket.pl has yet to be seen. Most compilers, once they report a first mismatched delimiter, produce error messages that are unreliable -- so unreliable that they are useless in practice. bracket.pl repairs a mismatched bracket before continuing, so that it can do a reasonable job of analyzing the text that follows. Consider the text '({]-[(}-[{)'. The output of bracket.pl is:
    * Line 1, column 1: Missing close ), problem detected at line 1, column 3
    ({]-[(}-[{)
    ^ ^
    ====================
    * Line 1, column 2: Missing close }, problem detected at line 1, column 3
    ({]-[(}-[{)
     ^^
    ====================
    * Line 1, column 3: Missing open [
    ({]-[(}-[{)
      ^
    ====================
    * Line 1, column 5: Missing close ], problem detected at line 1, column 7
    ({]-[(}-[{)
        ^ ^
    ====================
    * Line 1, column 6: Missing close ), problem detected at line 1, column 7
    ({]-[(}-[{)
         ^^
    ====================
    * Line 1, column 7: Missing open {
    ({]-[(}-[{)
          ^
    ====================
    * Line 1, column 9: Missing close ], problem detected at line 1, column 11
    ({]-[(}-[{)
            ^ ^
    ====================
    * Line 1, column 10: Missing close }, problem detected at line 1, column 11
    ({]-[(}-[{)
             ^^
    ====================
    * Line 1, column 11: Missing open (
    ({]-[(}-[{)
              ^
Each time, bracket.pl corrects itself, and accurately reports the next set of problems.
A difficult error report
To be 100% accurate, bracket.pl would have to guess the programmer's intent. This is, of course, not possible. Let's look at a text where bracket.pl's guesses are not so good: {{]}. Here we will assume the closing square bracket is a typo for a closing curly brace, so that the intended text was {{}}. Here's the result:
    * Line 1, column 1: Missing close }, problem detected at line 1, column 3
    {{]}
    ^ ^
    ====================
    * Line 1, column 2: Missing close }, problem detected at line 1, column 3
    {{]}
     ^^
    ====================
    * Line 1, column 3: Missing open [
    {{]}
      ^
    ====================
    * Line 1, column 4: Missing open {
    {{]}
       ^
Instead of one error, bracket.pl finds four.
But even in this case, the method is fairly good, especially when compared with current practice. The problem is at line 1, column 3, and the first three messages all identify this as one of their potential problem locations. It is reasonable to believe that a programmer, especially once he becomes used to this kind of mismatch reporting, will quickly find the first mismatch and fix it. For this difficult case, bracket.pl may not be much better than the state of the art, but it is certainly no worse.
How it works
For full details of the workings of bracket.pl there is the code, which is heavily commented. This section provides a conceptual overview.
bracket.pl uses two features of Marpa: left-eideticism and the Ruby Slippers. By left-eidetic, I mean that Marpa knows everything there is to know about the parse at, and to the left of, the current position. As a consequence, Marpa also knows exactly which of its input symbols can lead to a successful parse, and is able to stop as soon as it knows that the parse cannot succeed.
In the Ruby Slippers technique, we arrange for parsing to stop whenever we encounter an input which would cause parsing to fail. The application then asks Marpa, "OK. What input would allow the parse to continue?" The application takes Marpa's answer to this question, and uses it to concoct an input that Marpa will accept.
In this case, bracket.pl creates a virtual token which fixes the mismatch of brackets. Whatever the missing bracket may be, bracket.pl invents a bracket of that kind, and adds it to the virtual input. This done, parsing and error detection can proceed as if there were no problem. Of course, the error which made the Ruby Slippers token necessary is recorded, and those records are the source of the error reports we saw above.
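The shape of that loop can be sketched in a few lines of Python. This is only an emulation under stated assumptions: `ToyRecognizer` is a hypothetical stand-in that answers "what would you accept next?" from a plain stack, where Marpa answers it from its parse tables, and none of the names below are Marpa's API.

```python
CLOSER_FOR = {"(": ")", "{": "}", "[": "]"}
OPENER_FOR = {close: open_ for open_, close in CLOSER_FOR.items()}

class ToyRecognizer:
    """Hypothetical stand-in for a parser that knows what it can accept next."""

    def __init__(self):
        self._open = []  # opening brackets seen and not yet closed

    def expected_closer(self):
        """The closing bracket acceptable right now, or None if nothing is open."""
        return CLOSER_FOR[self._open[-1]] if self._open else None

    def read(self, token):
        """Consume the token if acceptable; otherwise return False, state unchanged."""
        if token in CLOSER_FOR:
            self._open.append(token)
            return True
        if token == self.expected_closer():
            self._open.pop()
            return True
        return False

def parse_with_ruby_slippers(text):
    """Return the invented tokens as (position, token) pairs, 1-based."""
    recce = ToyRecognizer()
    virtual = []
    for position, ch in enumerate(text, start=1):
        if ch not in CLOSER_FOR and ch not in OPENER_FOR:
            continue  # filler
        while not recce.read(ch):
            # Rejected: ask what would be accepted, and invent exactly that.
            wanted = recce.expected_closer() or OPENER_FOR[ch]
            recce.read(wanted)
            virtual.append((position, wanted))
    # At end of input, close whatever is still open.
    while recce.expected_closer() is not None:
        wanted = recce.expected_closer()
        recce.read(wanted)
        virtual.append((len(text) + 1, wanted))
    return virtual

invented = parse_with_ruby_slippers("{{]}")
```

For '{{]}' the sketch invents four tokens, one for each of the four reports shown earlier: two closing curly braces and an opening square bracket at position 3, then an opening curly brace at position 4.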
To make its error messages as informative as possible in the case of missing closing brackets, bracket.pl needs to report the exact location of the opening bracket. Left-eideticism again comes in handy here. Once the virtual closing bracket is supplied to Marpa, bracket.pl asks, "That bracketed text that I just closed -- where did it begin?" The Marpa parser tracks the start location of all symbol and rule instances, so it is able to provide the application with the exact location of the starting bracket.
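In a stack-based emulation, the counterpart of Marpa's start-location tracking is simply to store each opening bracket together with its line and column. The Python sketch below is an assumption-laden illustration, not bracket.pl's code: `report_missing_closers` is a hypothetical name, and it reports only missing closing brackets, silently ignoring closers with no opener and brackets left open at the end.

```python
CLOSER_FOR = {"(": ")", "{": "}", "[": "]"}
OPENER_FOR = {close: open_ for open_, close in CLOSER_FOR.items()}

def report_missing_closers(text):
    """Return (open_line, open_col, missing_closer, found_line, found_col) tuples."""
    reports = []
    stack = []  # (opening bracket, line, column), innermost last
    line, column = 1, 0
    for ch in text:
        if ch == "\n":
            line, column = line + 1, 0
            continue
        column += 1
        if ch in CLOSER_FOR:
            stack.append((ch, line, column))
        elif ch in OPENER_FOR:
            # Close inner brackets until this closer fits, remembering
            # where each abandoned bracket was opened.
            while stack and stack[-1][0] != OPENER_FOR[ch]:
                open_, open_line, open_col = stack.pop()
                reports.append((open_line, open_col, CLOSER_FOR[open_], line, column))
            if stack:
                stack.pop()  # the matching opener
    return reports

single = report_missing_closers("[({({x[]x{}x()x)})]")
multi = report_missing_closers("if (x) {\n  y[0;\n}")
```

On the single-line example from earlier, the sketch yields one tuple pairing line 1, column 5 with line 1, column 16, the same two locations bracket.pl reports. The second call shows the same bookkeeping across lines: the '[' opened on line 2 is reported against the '}' on line 3.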
When bracket.pl encounters a problem at a point where there are unclosed opening brackets, it has two choices. It can be optimistic or it can be pessimistic. "Optimistic" means it can hope that something later in the input will close the opening bracket. "Pessimistic" means it can decide that "all bets are off" and use Ruby Slippers tokens to close all the currently active open brackets.
bracket.pl uses the pessimistic strategy. While the optimistic strategy sounds better, in practice the pessimistic one seems to provide better diagnostics. The pessimistic strategy does report some fixable problems as errors. But the optimistic one can introduce spurious fixes. These hide the real errors, and it is worse to miss errors than it is to overreport them. Even when the pessimistic strategy overreports, its first error message will always accurately identify the first problem location.
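The difference between the two strategies can be seen in a small Python emulation. Everything here is an assumption made for illustration: the pessimistic branch is inferred from the diagnostics shown in this post rather than taken from bracket.pl's source, the optimistic branch is one plausible reading of "hope that something later will close it", and `diagnose` and its report tuples are hypothetical. The sketch handles single-line text only.

```python
CLOSER_FOR = {"(": ")", "{": "}", "[": "]"}
OPENER_FOR = {close: open_ for open_, close in CLOSER_FOR.items()}

def diagnose(text, pessimistic=True):
    """Return reports for a single-line text, sorted by column."""
    reports = []
    stack = []  # (opening bracket, column) pairs still waiting for a closer
    for column, ch in enumerate(text, start=1):
        if ch in CLOSER_FOR:
            stack.append((ch, column))
            continue
        if ch not in OPENER_FOR:
            continue  # filler
        opener = OPENER_FOR[ch]
        has_match = any(open_ == opener for open_, _ in stack)
        if pessimistic or has_match:
            # Invent closers until this token fits or nothing is left open.
            while stack and stack[-1][0] != opener:
                open_, open_col = stack.pop()
                reports.append(("missing close", open_col, CLOSER_FOR[open_], column))
        if stack and stack[-1][0] == opener:
            stack.pop()
        else:
            # No usable opener: invent one for this closer. In optimistic
            # mode the brackets already open are left alone.
            reports.append(("missing open", column, opener))
    for open_, open_col in stack:
        reports.append(("never closed", open_col, open_))
    return sorted(reports, key=lambda report: report[1])

pessimistic_reports = diagnose("{{]}")
optimistic_reports = diagnose("{{]}", pessimistic=False)
```

On '{{]}' the pessimistic call produces the same four reports shown in the difficult example above, while the optimistic call produces two: a '{' at column 1 that is never closed and a missing '[' at column 3. The sketch only shows how the two choices diverge on one input; it makes no claim about which reads better on real code.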
While bracket.pl is already usable, I think of it as a prototype. Beyond that, the problem of matching delimiters is in fact very general, and I believe these techniques may have very wide application.
For more
The example script of this post is a GitHub gist. For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.