Marpa v. Parse::RecDescent: some numbers

By Jeffrey Kegler on April 11, 2012 9:08 PM

The application

In a recent blog post, Flavio Poletti described an unusual language, one which he'd written a parser for, using the very traditional method of recursive descent. Specifically he'd used Perl's Parse::RecDescent module. Several people asked me how Marpa would do on this language. This post contains a comparison, including a benchmark.

The reader who wants an exact description should look at Flavio's post and code. I call it a Dyck-Hollerith language because it combines Hollerith constants (strings preceded by a count), with balanced parentheses (what is called a Dyck language by mathematicians). Here's Flavio's example:


A2(A2(S3(Hey)S13(Hello, World!))S5(Ciao!))

When I did a Marpa versus Perl regexes comparison, I was very careful to choose an application which showed Marpa in a good light. Here I did not pick the application, and it is almost as if it was designed to show recursive descent in a good light.

Recursive descent tries to parse by treating each rule as a subroutine call. This is a very natural way for a programmer to look at a grammar. But, alas, it only works in certain cases, and most of those are already handled by regular expressions. However, Dyck languages, including balanced parentheses, are one of the boundary cases: languages beyond the abilities of regular expressions, but within the capabilities of recursive descent.

Hollerith constants highlight the other strength of recursive descent. Since recursive descent's basic idea is so intuitive, it is easy to add custom hacks to it. And Hollerith constants are beyond the capabilities of even general BNF parsing. Earley's, Marpa, CYK, GLR, you name it, none of them can handle Hollerith constants without some kind of custom hack. You might say that Hollerith constants force a parser to wrestle in the mud. And when it comes to wrestling in the mud, recursive descent is hard to beat.

The benchmark

Length	Seconds
Length	Marpa::XS	Parse::RecDescent
1000	2.938	13.616
2000	7.067	62.083
3000	13.953	132.549
10000	121.654	1373.171

For Parse::RecDescent, the code used in the benchmark is Flavio's, with light modifications such as the one that allows inputs of varying lengths. The Marpa::XS parser used in this benchmark is on github as a gist.

Since Marpa::XS has its parse engine in optimized C, while Parse::RecDescent is pure Perl, one would hope that Marpa::XS would put better numbers on the board, and it did. Marpa::XS has some startup overhead, but settles in at roughly 10 times faster than the Parse::RecDescent implementation.

To test inputs of varying lengths, I took the example above and created arrays of it in the Dyck-Hollerith language. The reported length is the length of the top-level array. The example is 42 characters long, so a reported length of 1000 means the input string is 6+1000*42+1=42007 characters long. (The 6 is the length of the "A1000(" prefix of the enclosing array and the 1 is for its closing parenthesis.)

Conclusions

Benchmarks are like traffic accidents. They have a draw and an excitement, one that the wiser and nobler sort know is improper and to be resisted. But you'll often catch even them slowing down to look.

Other factors than speed may determine the choice of a parser, particularly if large inputs do not occur. As noted, many programmers prefer the recursive descent approach of turning grammar rules into subroutines calls: for them, it is the way things SHOULD work. And given recursive descent's hacker friendliness, it is the way things often DO work.

But I think those accustomed to recursive descent may want to look at Marpa. Marpa has all the same context that recursive descent has, and allows the same tricks and more. In this benchmark, Marpa did NOT haul out the Ruby Slippers -- it limited itself to the context that recursive descent has.

Marpa is faster in this benchmark because it gives the application the best of both worlds -- custom hacks, *AND* a parse engine written in optimized C. Recursive descent is good at custom hacks, and for this grammar its underlying algorithm has the same big-O complexity. But by its nature, Parse::RecDescent is pure Perl, so Marpa::XS clocks a steady factor-of-ten advantage. And, in fact, of the time Marpa::XS does use, only about 20% is used by its parse engine. Overhead and lexing in Marpa::XS are in pure Perl, and they take 80% of the time.

4 comments

Tagged as:

Marpa Earley parsing parser, Parse::RecDescent, recursive descent

4 Comments

jnareb.myopenid.com | April 12, 2012 3:44 AM

How would benchmark look like if you used pure Perl version of Marpa parser (Marpa::PP)? Or is PP version left behind?

It would be nice to see plot of Marpa and Parse::RecDescent performance in loglog scale to check if they have the same O(n^k) complexity (performance).

BTW it is nice to have another example...

Jeffrey Kegler replied to comment from jnareb.myopenid.com | April 12, 2012 9:14 AM

@jnareb: User interest seems to be mostly in Marpa::XS.

Chris Dolan | April 12, 2012 7:55 PM

This is quite interesting, but reading the gist, it looks to me like you cheated by ignoring the $count. That is, your implementation would incorrectly succeed at parsing the following bad input:

A4(A3(A2(S255(foo))))

Have I misinterpreted?

Jeffrey Kegler replied to comment from Chris Dolan | April 12, 2012 8:57 PM

@Chris: Actually my parser rejects the string you mention, which seems to be your preferred behavior.

The Dyck-Hollerith language has redundant information, and a question of interpretation arises when the two structures (counts and parentheses), say different things. For each of arrays and strings, we have three choices. First, use the count. Second, use the parentheses. Third, use both and consider it an error if they are inconsistent. Still other variations might have warnings, etc., etc., etc.

Once you settle on an interpetation as "right", Marpa could implement it. Flavio didn't deal with the question so I assumed perfect input, XML-style, which seemed adequate for a quick & dirty attempt. Guessing at the motives behind the language, perhaps its high redundancy was an attempt to help out the parser, in which case I was proceeding in the spirit of things. I used the counts for the strings, and parentheses for the arrays. If the inconsistency bothers you, you can add a line to my implementation which checks the array count and die's if it's a mismatch.

This raises another point in Marpa's favor. For a production parser of very long strings in this language, error-handling would indeed be an issue. Whichever method of handling inconsistencies you prefer, Marpa could deal with it. Many of the ways of dealing with inconsistency make parsing wildly ambiguous, which would present severe problems for Parse::RecDescent.

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.

More info »

Ocean of Awareness