Marpa v. Perl regexes: some numbers

In this post, I pit Marpa against the Perl regex engine. The example I will use is unanchored searching for balanced parentheses. I have claimed that many problems now tackled with regexes are better solved with a more powerful parser, like Marpa. I believe the numbers in this post back up that claim.

To be very clear, I am NOT claiming that Marpa should or can replace regexes in general. For each character, all an RE (regular expression) engine needs to do is to compute a transition from one "state" to another state based on that character -- essentially a simple lookup. It's the sort of thing a modern C compiler should optimize into a series of machine instructions that you can count on the fingers of one hand.

Marpa is much more powerful than a regular expression engine, and to deliver this power Marpa makes a list of all the possibilities for each token. Tricks are used to compress these per-token lists, and Marpa's code to process them is heavily optimized. But even so, Marpa's processing requires more than a handful of machine instructions.
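To make the contrast concrete, here is a toy sketch in Python (illustrative only -- Perl's engine is written in C, and this is not its code) of an RE engine's inner loop: one table lookup per input character. The toy machine below accepts the language a*b.

```python
# Illustrative sketch (not Perl's actual engine): a DFA's inner loop
# is a single table lookup per input character.  This toy machine
# accepts the language a*b.
DFA = {
    0: {'a': 0, 'b': 1},  # state 0: loop on 'a', move to 1 on 'b'
    1: {},                # state 1: accepting, no transitions out
}
ACCEPTING = {1}

def dfa_match(s):
    state = 0
    for ch in s:
        trans = DFA[state]
        if ch not in trans:   # no transition for this character: reject
            return False
        state = trans[ch]     # the "handful of instructions" step
    return state in ACCEPTING
```

Each character costs one dictionary lookup and one assignment; a C implementation of the same loop compiles down to the few machine instructions described above.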

In this context, I think some of the numbers below may surprise people. RE's beat everything else as long as they stick to their game. But these days they are often stretched beyond their limits, often without an appreciation of how quickly their efficiency deteriorates when outside those limits.

Unanchored searches for balanced parens

I chose the problem solved by Regexp::Common::balanced -- unanchored searches for balanced parens. "Unanchored" here means that the search is not anchored to the beginning of the string, or to any other fixed point in it. An unanchored search must not only identify the matching string, but also determine where in the searched string the target is found.

When it comes to unanchored matches, most users want the "first longest" match. That is, they want the first match but, when one match is completely contained in another one, they want the longest match. This is the problem in its hardest form. It is simple to find the match which ends first. "First longest" needs the match which STARTS first. "First longest" is the problem that Regexp::Common::balanced addresses. For this benchmark, I programmed Marpa to return exactly the same results that Regexp::Common::balanced returns.
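As an illustration of what "first longest" means, here is a naive Python sketch (my own illustration -- it is neither Marpa's algorithm nor the one Regexp::Common::balanced uses) that finds the first balanced-paren group. Because a match that starts earlier contains any match inside it, scanning start positions left to right yields the "first longest" match:

```python
# Naive unanchored "first longest" search for a single balanced-paren
# group.  Illustration only: this is neither Marpa's algorithm nor the
# one Regexp::Common::balanced uses.
def first_longest_balanced(s):
    for start, ch in enumerate(s):
        if ch != '(':
            continue                    # a group can only start at '('
        depth = 0
        for end in range(start, len(s)):
            if s[end] == '(':
                depth += 1
            elif s[end] == ')':
                depth -= 1
            if depth == 0:              # found the matching close paren
                return start, s[start:end + 1]
    return None                         # no balanced group anywhere
```

Note that this retries from every start position, so in the worst case (say, a long run of left parens) it is O(n^2) -- exactly the kind of behavior the benchmark below measures.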

The examples I will use in this post are sets of parens of varying length. All the examples will have a prefix, a balanced paren "target", and a short, unbalanced, trailer. If the string is of length N, the prefix consists of N-8 left parens. The target is always this string: "(()())". The trailer is always two left parens: "((". Here, with spacing added for clarity, is the test string for length 20:

(((((((((((( (()()) ((

The Results

Executions per second (higher is faster):

  Parens in     libmarpa    Marpa::XS    Regexp::Common   tchrist's   Marpa::PP
  test string   (pure C)    (mixed C     ::balanced       regex       (Pure Perl)
                            and Perl)
          10     4524.89       111.71          3173.30     33429.33       47.39
         100     1180.64        58.96            62.09       197.25       15.35
         500      252.40        19.50             2.43         7.58        4.09
        1000      117.16        10.28             0.53         1.84        2.14
        2000       56.07         5.47             0.12         0.34        1.08
        3000       36.35         3.72             0.05         0.13        0.74

The above results are in executions per second -- a higher number means a faster algorithm. These numbers are what happens when regexes are pushed beyond their limits. Regexp::Common::balanced goes quadratic on these examples, while all versions of Marpa stay linear. (Here "linear" and "quadratic" refer to the running time per execution; since the results above are reported in executions per second, take the reciprocal of each number to get that time.)
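The linear-versus-quadratic claim can be sanity-checked directly from the table. Per-execution time is the reciprocal of the rate, and when the input doubles, a linear algorithm's time should roughly double while a quadratic one's should roughly quadruple. A quick check in Python, using the N=1000 and N=2000 rates from the table:

```python
# Rates (executions/second) at N=1000 and N=2000, from the table above.
rates = {
    'libmarpa': (117.16, 56.07),
    'Regexp::Common::balanced': (0.53, 0.12),
}
for name, (r1000, r2000) in rates.items():
    # Time per execution is 1/rate; this is the growth factor for a
    # doubled input: ~2x suggests linear, ~4x suggests quadratic.
    growth = (1 / r2000) / (1 / r1000)
    print(f'{name}: time grew {growth:.1f}x')
```

libmarpa comes out near 2.1x (linear), while Regexp::Common::balanced comes out near 4.4x, consistent with quadratic behavior.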

libmarpa, the pure C version of Marpa, is faster than Regexp::Common::balanced on even the shortest example I tested. Marpa::XS, the XS (mixed C and Perl) version, catches up with Regexp::Common::balanced when the length of the test strings gets a little past 100 characters. You would expect that Marpa::PP, which is implemented entirely in Perl, would not have a chance against the Perl regex engine, which is implemented in C. But somewhere in the low 100's Marpa::PP also catches up, and by the time we are testing strings of 500 characters, Marpa::PP is running nearly twice as fast.

As the length of the test strings increases, Marpa's relative advantage grows. At 3000, Marpa::XS is over 74 times as fast as Regexp::Common::balanced, and libmarpa is well over 700 times as fast. Even Marpa::PP is nearly 15 times as fast, and steadily improving its advantage.

[ In the comments, Tom Christiansen shares a Perl regex which is faster than Regexp::Common::balanced. I've included results for it in the table above. For discussion of it, see the comments. ]

And I hardly even cheated!

In this comparison, I tried to be "fair". "Fair" can be hard to define in a benchmark.

The Choice of Tests

The test strings, and the problem (unanchored searching for balanced parens), were chosen to highlight the limits of Perl regexes. On the other hand, this problem is typical of the things programmers want to do, as well as of the sort of thing they ask Perl regexes to do.

Official versus Developer's Versions

For Marpa::XS and Marpa::PP, I insisted on using official distributions out on CPAN. My benchmarking script is available as a gist on github. These benchmarks were done using only the DOCUMENTED interfaces of Marpa::XS 0.020000 and Marpa::PP 0.010000. Both versions have several useful capabilities that are not in the documented interface, and, especially with the shorter test strings, both Marpa::XS and Marpa::PP pay a real price for my decision not to use them. But I wanted to demonstrate the kinds of speeds that real users can get, using what is actually on CPAN now, as it is documented TODAY.

The libmarpa numbers, on the other hand, are for a developer's version. The libmarpa library has never had a separate release, and is not yet documented. A stable libmarpa is part of Marpa::XS and the latest code is in Marpa::R2. The version used for these tests is the one in Marpa::R2 and my benchmarking code is in the Marpa::R2 github repository.

Precompilation

Parsers, like Marpa and yacc, are designed for repeated use. Regular expression engines, on the other hand, are often used in "one-shot" applications. When the application is viewed as a regex, it seems fair to include any precompilation in the benchmark times. If the application is thought of as parsing, it seems fair to allow the algorithms to see the grammar first, without the clock running, and optimize the heck out of it. Which is fair here? The choice would make a real difference. Marpa does a lot of precomputation, more than any other parsing algorithm I know of.

I decided that this test would be about pitting Marpa against regexes, on the regexes' own turf and using their rules. In all the tests above, for every repetition, Marpa was forced to redo all its precomputations while the clock was running. For shorter tests in particular, this put Marpa at a real disadvantage. But it makes the results clearly relevant to the use of Marpa inside a regex engine.

Reading the input string

Both libmarpa and Regexp::Common::balanced have a big advantage over Marpa::XS and Marpa::PP. Marpa::XS and Marpa::PP have to use Perl to convert the input string into their internal format. For Perl's regex engine and libmarpa, this is done in C. I decided to require both algorithms to do their string manipulation "with the clock running", very much to the disadvantage of Marpa::XS and Marpa::PP.

This disadvantage could be called unfair, since the choice of language for string manipulation is about interface and convenience, and does not really have anything to do with the speed of the underlying algorithms. But I felt that, when string conversion times were included, run times were more realistic -- more like what would be encountered in the actual applications that regexes are asked to deal with. This handicap makes Marpa::XS's performance all the more impressive.

A faster regex engine

I hope this post will encourage use of Marpa::XS. That is why I carefully limited the benchmark code for Marpa::XS to documented features of an already available, stable beta release.

I also want to suggest the possibility of using libmarpa to extend the Perl regex engine. It is possible to efficiently identify, for any regex, the presence of features that are problematic for an RE-based recognition engine. A regex implementation could check for these and select a recognition engine accordingly -- the traditional RE-based engine for simpler regexes or, when it seems beneficial, a Marpa-based recognition engine.

Notes

  1. "regular expression": In the pure mathematical sense, a regular expression (RE) is a state machine that recognizes only patterns that use concatenation (ab), repetition (a*), alternation a|b), or some combination of these. Perl regexes are sometimes regular expressions or their equivalent. More often they are not. For example, any Perl regex which captures substrings is not equivalent to a regular expression.
  2. "token": In the example in this post, "token" can be taken as a synonym for character. RE's typically work with the individual characters of character strings. In this post, so does Marpa. Typically, general purpose parsers like Marpa let a lower level gather one or more individual characters into "tokens".
  3. "paren": Saying "parenthesis" gets tedious. I often abbreviate it to "paren".
  4. "quadratic": By quadratic, I mean O(n^2) where n is the length of the test string. That Regexp::Common::balanced goes quadratic is my conjecture -- I have no proof. But many of the simpler approaches to unanchored search are quadratic, and the benchmark numbers suggest this is the case here.

17 Comments

I would really like to see some Marpa tutorials with some example solutions to some popular parsing problems.

The existing POD tutorials make a lot of assumptions that make them difficult to consume if you're not familiar with parsing theory.

This is a completely unrealistic test. Do you realize what Regexp::Common::balanced is doing, and why it makes no sense to do that anymore?

macbook# perl -MRegexp::Common -le 'print $RE{balanced}{-parens=>"()"}'
(?^:(?^:(?:\((?:(?>[^\(\)]+)|(??{$Regexp::Common::balanced [0]}))*\))))

Isn’t that nasty? That was how you had to do it prior to v5.10. These days if you simply use plain old Perl instead of that old beast, it runs ~34x faster on the Camel, and you get the same results, too. The pattern you really, really, really want to be using is the far simpler and super-faster qr/(\((?:[^()]++|(?-1))*+\))/. Watch it beat the living snot out of Regexp::Common right here:

macbook# /usr/bin/time perl -MRegexp::Common -00 -nle 'print $1 while /($RE{balanced}{-parens=>"()"})/g' < camel4.pod | wc -l
1.87 real 1.83 user 0.01 sys
18361

macbook# /usr/bin/time perl -00 -nle 'print $1 while /(\((?:[^()]++|(?-1))*+\))/g' < camel4.pod | wc -l
0.06 real 0.05 user 0.00 sys
18361

That’s not really enough granularity to know what is going on. So here’s a tenfold copy to get better accuracy:

macbook# /usr/bin/time perl -MRegexp::Common -00 -nle 'print $1 while /($RE{balanced}{-parens=>"()"})/g' < /tmp/camel4x10 | wc -l
18.63 real 18.02 user 0.11 sys
183610

macbook# /usr/bin/time perl -00 -nle 'print $1 while /(\((?:[^()]++|(?-1))*+\))/g' < /tmp/camel4x10 | wc -l
0.55 real 0.51 user 0.02 sys
183610


There — see what I mean?

--tom

You make it sound like this is of practical concern. I'd like to see where and how your concern arises, because I am not seeing it in real data. I ask because I've been doing this for some time now, and I have never had a catastrophic backtracking failure leading to unacceptable performance. My test corpus has been the PubMed Open Access collection, which contains nearly 200,000 fulltext biomedical papers. You’d think that if something ugly were going to show up, it would do so across those, and it hasn’t.

Going back to the pod for the 4th edition of the Camel, that single file holds 17,029 paragraphs, 69,446 lines, 381,037 words, 2,536,522 graphemes, 2,536,527 characters, and 2,555,505 bytes. It also holds 8,856 open parentheses and 8,853 close parentheses.

It makes a difference whether you process it by line, by paragraph, or by the whole file. Here are the timings for R::C::balanced when the file is processed linewise, by paragraph, or all at once:

line: 5.43 real, 5.32 user, 0.02 sys
para: 1.87 real, 1.83 user, 0.01 sys
file: 18.55 real, 9.36 user, 8.44 sys

But here are the incredibly more reasonable timings using my version:

line: 0.09 real, 0.08 user, 0.00 sys
para: 0.08 real, 0.05 user, 0.00 sys
file: 0.15 real, 0.12 user, 0.01 sys
That's not just 3x faster; it's 127x faster. That's two full orders of magnitude faster. I don't know about you, but I pay attention to little things like a couple of orders of magnitude on real world data.

A few other things to notice. One is the dramatic difference between all three timings in the R::C::balanced versions. Yes, the paragraph mode takes less time than the line mode in both, but that's because it can't find a mate more often. And look how with R::C::balanced this is a multiplicative factor of about 3:1, while in the native pattern it is far less magnified.

There's also something wicked going on with the system time in R::C::balanced's full-file test that does not happen with the native pattern. And yes, this is repeatable; it's not just there during one run and disappears once the file is in the buffer cache. What's going on is that the R::C::balanced version is being a memory pig, and the other one isn't.

Anyway, my point is that when running against real-world strings that range from several hundred K for each of the almost 200,000 PubMed papers up to 2.5 megabytes for the full Camel4 text, I see absolutely no catastrophe when using the native regex. There is no exponential failure that leaves a process sitting there dead in the water. Its real-time performance is completely acceptable, parsing out thousands of parentheticals in a mere tenth of a second against that entire 2.5 megabyte string, and doing much better than that on lesser files.

That's not some hypothetical constructed pathologic case where we worry ourselves to death by chasing asymptotes that never actually arise in real life. That's using real world data, and its performance on that real world data is certainly plenty good enough for me.

Isn't it good enough for you?

Jeffrey, while your posts are often above my head, I keep being impressed at the ability and versatility that you demonstrate with Marpa.

I have been wondering, now and seeing this example, finally must ask: would it be possible for Marpa to replace portions of the perl chain? Is MarpaPerl a possibility in the future? Could it pass a large fraction of the perl test suite?

It appears that you have shown that Marpa could improve the speed and handle ambiguity (your posts on `use`, for example), and others have me wondering if this is more than just an academic question at this point.

I wonder about a regex meta-plugin that combines re2 for (mathematically) regular expressions with Marpa for the additional features in Perl’s pattern syntax.

It looks like you added my regex to your table but didn't put a space between the columns, so everything runs together and people can't tell which number is which.

You might also care to explain in your article why my pattern is (well, or seems to be; see abutment issue) beating your C version at the beginning of the chart.

My pattern works quite well for what I use it for: natural language processing. I'm not parsing gigabyte Lisp programs. Knowing what works in your domain is important.

I believe you are unfair to Perl's patterns when you pretend that they are not cut out for the sort of work that I am putting them to, because my own tests prove they very much are.

I dunno about "short" strings: the PubMed papers are several hundred thousand bytes long, which is already out of what I call a "short" string, and Camel4 is over two and a half million bytes long.

Perhaps we have different notions of "short". I can perhaps imagine a highly specialized niche world in which those would be "short", but not a normal one.

@Jeffrey Kegler: Could you include a log-log plot of the results, i.e. logarithmic x axis with size and logarithmic y axis with time (perhaps with more data points than in the table)? From the slope of the lines one can easily read whether, for a given form of input, performance is linear or quadratic.

A bit of math:

y = x^n
log(y) = n * log(x)

I've been following Marpa's development for quite some time.

I recently built a grammar for simple SQL-like boolean expressions using Regexp::Grammars, and while I found it pretty easy to do, and generally performant, I would, during development, occasionally stumble across a grammar change that would make everything go exponential.

So my interest would be in seeing a comparison of a grammar built with Marpa and Regexp::Grammars---with attention not only to speed of parsing, but also the amount of code necessary to get a fully working solution, since I was easily able to create the code to have the grammar spit out a boolean value that represented whether the statement was true or not.

This might be an example that wouldn't get tchrist so hot under the collar. ;)

Sadly, no---they were iterations in the process of building the parser, and for obvious reasons they were ones I didn't find worth holding onto.

Some, perhaps all, of them were issues of mine---this was the first time I'd tried to build a parser using a formal grammar, and it's been a long time since I read an introductory text on them, so it took me a while to sort out things like how to handle parenthesized expressions and such.

Now that I've got a working grammar, though, I do intend to try converting it to Marpa.


About Jeffrey Kegler

I blog about Marpa, my parsing algorithm, and other things of interest to techies.