Marpa v. Perl regexes: some numbers

In this post, I pit Marpa against the Perl regex engine. The example I will use is unanchored searching for balanced parentheses. I have claimed that many problems now tackled with regexes are better solved with a more powerful parser, like Marpa. I believe the numbers in this post back up that claim.

To be very clear, I am NOT claiming that Marpa should or can replace regexes in general. For each character, all an RE (regular expression) engine needs to do is to compute a transition from one "state" to another state based on that character -- essentially a simple lookup. It's the sort of thing a modern C compiler should optimize into a series of machine instructions that you can count on the fingers of one hand.
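
To make that concrete, here is a toy state machine of the kind an RE engine builds, written out in Perl rather than C. Each input character costs exactly one table lookup. This is only my illustration of the point, not how the Perl regex engine is actually implemented:

    use strict;
    use warnings;

    # Toy DFA for the regular expression /^a*b$/.  One hash lookup per
    # character -- an illustration only, not the Perl engine's actual code.
    my %delta = (
        0 => { a => 0, b => 1 },   # start state
        1 => {},                   # accepting state: any further input fails
    );
    my %accepting = ( 1 => 1 );

    sub dfa_match {
        my ($input) = @_;
        my $state = 0;
        for my $char ( split //, $input ) {
            $state = $delta{$state}{$char};   # the per-character lookup
            return 0 if not defined $state;
        }
        return $accepting{$state} ? 1 : 0;
    }

    print dfa_match('aaab') ? "match\n" : "no match\n";   # match
    print dfa_match('abba') ? "match\n" : "no match\n";   # no match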

Marpa is much more powerful than a regular expression engine, and to deliver this power Marpa makes a list of all the possibilities for each token. Tricks are used to compress these per-token lists, and Marpa's code to process them is heavily optimized. But even so, Marpa's processing requires more than a handful of machine instructions.

In this context, I think some of the numbers below may surprise people. RE's beat everything else as long as they stick to their game. But these days they are often stretched beyond their limits, frequently without an appreciation of how quickly their efficiency deteriorates once outside those limits.

Unanchored searches for balanced parens

I chose the problem solved by Regexp::Common::balanced -- unanchored searches for balanced parens. "Unanchored" here means that the search is not anchored to the beginning, or to any other specific point, of the string. An unanchored search needs not only to identify the matching string, but also to determine where in the searched string the target starts.

When it comes to unanchored matches, most users want the "first longest" match. That is, they want the first match but, when one match is completely contained in another one, they want the longest match. This is the problem in its hardest form. It is simple to find the match which ends first. "First longest" needs the match which STARTS first. "First longest" is the problem that Regexp::Common::balanced addresses. For this benchmark, I programmed Marpa to return exactly the same results that Regexp::Common::balanced returns.
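
A small concrete example of what "first longest" means: in the string below, the balanced group "()" at offset 3 is contained inside "(())" at offset 2, so the wanted first match is "(())". The pattern used here is simply the compact 5.10+ recursive regex quoted in the comments below, chosen because it is short; Regexp::Common::balanced should return the same two matches on this input.

    use strict;
    use warnings;

    my $str = 'ab(())cd(()())';
    while ( $str =~ /(\((?:[^()]++|(?-1))*+\))/g ) {
        printf "matched '%s' starting at offset %d\n", $1, $-[1];
    }
    # matched '(())' starting at offset 2
    # matched '(()())' starting at offset 8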

The examples I will use in this post are sets of parens of varying length. All the examples will have a prefix, a balanced paren "target", and a short, unbalanced, trailer. If the string is of length N, the prefix consists of N-8 left parens. The target is always this string: "(()())". The trailer is always two left parens: "((". Here, with spacing added for clarity, is the test string for length 20:

(((((((((((( (()()) ((
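
For the record, here is how such a test string can be built. The helper name is mine; it simply follows the description above:

    # N-8 left parens, the fixed target "(()())", then the trailer "((".
    sub test_string {
        my ($n) = @_;
        return ( '(' x ( $n - 8 ) ) . '(()())' . '((';
    }
    print test_string(20), "\n";   # the length-20 example shown above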

The Results

Number of       Executions per second
parens in       ---------------------------------------------------------------------
test string     libmarpa     Marpa::XS    Regexp::Common::   tchrist's    Marpa::PP
                (pure C)     (mixed C     balanced           regex        (Pure Perl)
                             and Perl)

      10         4524.89       111.71       3173.30          33429.33       47.39
     100         1180.64        58.96         62.09            197.25       15.35
     500          252.40        19.50          2.43              7.58        4.09
    1000          117.16        10.28          0.53              1.84        2.14
    2000           56.07         5.47          0.12              0.34        1.08
    3000           36.35         3.72          0.05              0.13        0.74

The above results are in executions per second -- a higher number means a faster algorithm. These numbers are what happens when regexes are pushed beyond their limits. Regexp::Common::balanced goes quadratic on these examples, while all versions of Marpa stay linear. (Here linear and quadratic describe how the time per execution grows with the length of the string. The results above are reported in executions per second, so you need to take the reciprocal to get the time.)
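
A quick back-of-the-envelope check of that claim, using numbers straight from the table: when the test string doubles in length, a linear algorithm should take roughly twice as long per execution, and a quadratic one roughly four times as long. Taking the 1000- and 2000-paren rows (the snippet below just does the arithmetic; the figures are from the table above):

    # Slowdown factor when the test string doubles from 1000 to 2000 parens,
    # computed as (executions/sec at 1000) / (executions/sec at 2000).
    printf "Regexp::Common::balanced: %.1fx\n", 0.53   / 0.12;   # ~4.4x -- quadratic-ish
    printf "Marpa::XS:                %.1fx\n", 10.28  / 5.47;   # ~1.9x -- linear-ish
    printf "libmarpa:                 %.1fx\n", 117.16 / 56.07;  # ~2.1x -- linear-ish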

libmarpa, the pure C version of Marpa, is faster than Regexp::Common::balanced on even the shortest example I tested. Marpa::XS, the XS (mixed C and Perl) version, catches up with Regexp::Common::balanced when the length of the test strings gets a little past 100 characters. You would expect that Marpa::PP, which is implemented entirely in Perl, would not have a chance against the Perl regex engine, which is implemented in C. But somewhere between 100 and 500 characters Marpa::PP also catches up, and by the time we are testing strings of 500 characters, Marpa::PP is running about 1.7 times as fast, and at 1000 characters roughly 4 times as fast.

As the length of the test strings increases, Marpa's relative advantage grows. At 3000, Marpa::XS is over 74 times as fast as Regexp::Common::balanced, and libmarpa is well over 700 times as fast. Even Marpa::PP is nearly 15 times as fast, and steadily improving its advantage.

[ In the comments, Tom Christiansen shares a Perl regex which is faster than Regexp::Common::balanced. I've included results for it in the table above. For discussion of it, see the comments. ]

And I hardly even cheated!

In this comparison, I tried to be "fair". "Fair" can be hard to define in a benchmark.

The Choice of Tests

The test strings and the problem (unanchored searching for balanced parens) were chosen to highlight the limits of Perl regexes. On the other hand, this problem is typical of the things programmers want to do, and of the sort of thing they ask Perl regexes to do.

Official versus Developer's Versions

For Marpa::XS and Marpa::PP, I insisted on using official distributions out on CPAN. My benchmarking script is available as a gist on github. These benchmarks were done using only the DOCUMENTED interfaces of Marpa::XS 0.020000 and Marpa::PP 0.010000. Both these versions had several useful capabilities which are not in the documented interface, and, especially with the shorter test strings, both Marpa::XS and Marpa::PP pay a real price for my decision not to use them. But I wanted to demonstrate the kinds of speeds that real users can get, using what is actually on CPAN now, as it is documented TODAY.

The libmarpa numbers, on the other hand, are for a developer's version. The libmarpa library has never had a separate release, and is not yet documented. A stable libmarpa is part of Marpa::XS and the latest code is in Marpa::R2. The version used for these tests is the one in Marpa::R2 and my benchmarking code is in the Marpa::R2 github repository.

Precompilation

Parsers, like Marpa and yacc, are designed for repeated use. Regular expression engines, on the other hand, are often used in "one-shot" applications. When the application is viewed as a regex, it seems fair to include any precompilation in the benchmark times. If the application is thought of as parsing, it seems fair to allow the algorithms to see the grammar first, without the clock running, and optimize the heck out of it. Which is fair here? The choice would make a real difference. Marpa does a lot of precomputation, more than any other parsing algorithm I know of.

I decided that this test would be about pitting Marpa against regexes, on the regexes' own turf and using their rules. In all the tests above, for every repetition, Marpa was forced to redo all its precomputations while the clock was running. For shorter tests in particular, this put Marpa at a real disadvantage. But it makes the results clearly relevant to the use of Marpa inside a regex engine.
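
For readers who have not seen the Marpa::XS interface, here is a rough sketch of where that precomputation lives. The grammar below is purely illustrative -- a simple anchored grammar for a single balanced group, not the unanchored-search grammar used in the benchmark -- and it assumes the documented Marpa::XS grammar interface (Marpa::XS::Grammar->new and precompute):

    use strict;
    use warnings;
    use Marpa::XS;

    # Illustrative only: a grammar for one balanced paren group.
    my $grammar = Marpa::XS::Grammar->new(
        {   start => 'group',
            rules => [
                { lhs => 'group',  rhs => [qw(lparen groups rparen)] },
                { lhs => 'groups', rhs => ['group'], min => 0 },   # sequence rule
            ],
        }
    );

    # This call is the precomputation discussed above.  In the benchmark it
    # was redone inside the timed loop, on every repetition.
    $grammar->precompute();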

Reading the input string

Both libmarpa and Regexp::Common::balanced have a big advantage over Marpa::XS and Marpa::PP. Marpa::XS and Marpa::PP have to use Perl to convert the input string into their internal format. For Perl's regex engine and libmarpa, this is done in C. I decided to require all of the implementations to do their string manipulation "with the clock running", very much to the disadvantage of Marpa::XS and Marpa::PP.

This disadvantage could be called unfair, since the choice of language for string manipulation is about interface and convenience, and does not really have anything to do with the speed of the underlying algorithms. But I felt that, when string conversion times were included, run times were more realistic -- more like what would be encountered in the actual applications that regexes are asked to deal with. This handicap makes Marpa::XS's performance all the more impressive.
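
Here, continuing the illustrative grammar from the earlier sketch, is the Perl-level tokenization being discussed -- the per-character loop that Marpa::XS and Marpa::PP pay for in Perl, while the regex engine and libmarpa do the equivalent work in C. Again, this assumes the documented recognizer interface (Marpa::XS::Recognizer->new, read, and value):

    use strict;
    use warnings;
    use Marpa::XS;

    my $grammar = Marpa::XS::Grammar->new(
        {   start => 'group',
            rules => [
                { lhs => 'group',  rhs => [qw(lparen groups rparen)] },
                { lhs => 'groups', rhs => ['group'], min => 0 },
            ],
        }
    );
    $grammar->precompute();

    my $recce = Marpa::XS::Recognizer->new( { grammar => $grammar } );

    # This loop is the string-to-token conversion: for Marpa::XS and
    # Marpa::PP it runs in Perl, one method call per input character.
    for my $char ( split //, '(()())' ) {
        $recce->read( $char eq '(' ? 'lparen' : 'rparen', $char );
    }

    my $value_ref = $recce->value();
    print defined $value_ref ? "parse ok\n" : "no parse\n";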

A faster regex engine

I hope this post will encourage use of Marpa::XS. That is why I carefully limited the benchmark code for Marpa::XS to documented features of an already available, stable beta release.

I also want to suggest the possibility of using libmarpa to extend the Perl regex engine. It is possible to efficiently identify, for any regex, the presence of features that are problematic for an RE-based recognition engine. A regex implementation could check for these and select a recognition engine accordingly -- the traditional RE-based engine for simpler regexes or, when it seems beneficial, a Marpa-based recognition engine.
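
To make the dispatch idea concrete, here is a deliberately naive sketch of my own. It scans the pattern text for constructs -- recursion and backreferences -- that take a pattern outside what a pure RE-based engine handles well, and picks an engine accordingly. A real implementation would inspect the compiled pattern rather than its source text, and would not be fooled by escapes or character classes the way this toy can be:

    use strict;
    use warnings;

    sub choose_engine {
        my ($pattern) = @_;
        my $has_recursion = $pattern =~ /\(\?(?:R|[-+]?\d+|&\w+)\)/;   # (?R), (?-1), (?&name), ...
        my $has_backref   = $pattern =~ /\\(?:[1-9]|g|k)/;             # \1, \g{...}, \k<...>
        return ( $has_recursion || $has_backref ) ? 'general parser' : 'RE engine';
    }

    print choose_engine('a*b+c'),                     "\n";   # RE engine
    print choose_engine('(\((?:[^()]++|(?-1))*+\))'), "\n";   # general parser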

Notes

  1. "regular expression": In the pure mathematical sense, a regular expression (RE) is a state machine that recognizes only patterns that use concatenation (ab), repetition (a*), alternation a|b), or some combination of these. Perl regexes are sometimes regular expressions or their equivalent. More often they are not. For example, any Perl regex which captures substrings is not equivalent to a regular expression.
  2. "token": In the example in this post, "token" can be taken as a synonym for character. RE's typically work with the individual characters of character strings. In this post, so does Marpa. Typically, general purpose parsers like Marpa let a lower level gather one or more individual characters into "tokens".
  3. "paren": Saying "parenthesis" gets tedious. I often abbreviate it to "paren".
  4. "quadratic": By quadratic, I mean O(n^2) where n is the length of the test string. That Regexp::Common::balanced goes quadratic is my conjecture -- I have no proof. But many of the simpler approaches to unanchored search are quadratic, and the benchmark numbers suggest this is the case here.

17 Comments

I would really like to see some Marpa tutorials with some example solutions to some popular parsing problems.

The existing POD tutorials make a lot of assumptions that make them difficult to consume if you're not familiar with parsing theory.

This is a completely unrealistic test. Do you realize what Regexp::Common::balanced is doing, and why it makes no sense to do that anymore?

macbook# perl -MRegexp::Common -le 'print $RE{balanced}{-parens=>"()"}'
(?^:(?^:(?:\((?:(?>[^\(\)]+)|(??{$Regexp::Common::balanced [0]}))*\))))

Isn’t that nasty? That was how you had to do it prior to v5.10. These days if you simply use plain old Perl instead of that old beast, it runs ~34x faster on the Camel, and you get the same results, too. The pattern you really, really, really want to be using is the far simpler and super-faster qr/(\((?:[^()]++|(?-1))*+\))/. Watch it beat the living snot out of Regexp::Common right here:

macbook# /usr/bin/time perl -MRegexp::Common -00 -nle 'print $1 while /($RE{balanced}{-parens=>"()"})/g' < camel4.pod | wc -l
1.87 real 1.83 user 0.01 sys
18361

macbook# /usr/bin/time perl -00 -nle 'print $1 while /(\((?:[^()]++|(?-1))*+\))/g' < camel4.pod | wc -l
0.06 real 0.05 user 0.00 sys
18361

That’s not really enough granularity to know what is going on. So here’s a tenfold copy to get better accuracy:

macbook# /usr/bin/time perl -MRegexp::Common -00 -nle 'print $1 while /($RE{balanced}{-parens=>"()"})/g' < /tmp/camel4x10 | wc -l
18.63 real 18.02 user 0.11 sys
183610

macbook# /usr/bin/time perl -00 -nle 'print $1 while /(\((?:[^()]++|(?-1))*+\))/g' < /tmp/camel4x10 | wc -l
0.55 real 0.51 user 0.02 sys
183610


There — see what I mean?

--tom

You make it sound like this is of practical concern. I'd like to see where and how your concern arises, because I am not seeing it in real data. I ask because I've been doing this for some time now, and I never have a catastrophic backtracking failure leading to unacceptable performance. My test corpus has been the PubMed Open Access collection, which contains nearly 200,000 fulltext biomedical papers. You’d think that if something ugly were going to show up, it would do so across those, and it hasn’t.

Going back to the pod for the 4th edition of the Camel, that single file holds 17,029 paragraphs, 69,446 lines, 381,037 words, 2,536,522 graphemes, 2,536,527 characters, and 2,555,505 bytes. It also holds 8,856 open parentheses and 8,853 close parentheses.

It makes a difference whether you process it by line, by paragraph, or by the whole file. Here are the timings for R::C::balanced when the file is processed linewise, by paragraph, or all at once:

line: 5.43 real, 5.32 user, 0.02 sys
para: 1.87 real, 1.83 user, 0.01 sys
file: 18.55 real, 9.36 user, 8.44 sys

But here are the incredibly more reasonable timings using my version:

line: 0.09 real, 0.08 user, 0.00 sys
para: 0.08 real, 0.05 user, 0.00 sys
file: 0.15 real, 0.12 user, 0.01 sys

That's not just 3x faster; it's 127x faster. That's two full orders of magnitude faster. I don't know about you, but I pay attention to little things like a couple of orders of magnitude on real world data.

A few other things to notice. One is the dramatic differences among the three R::C::balanced timings. Yes, the paragraph mode takes less time than the line mode in both, but that's because it can't find a mate more often. And look how with R::C::balanced this is a multiplicative factor of about 3:1, while in the native pattern it is far less magnified.

There's also something wicked going on with the system time in R::C::balanced's full-file test that does not happen with the native pattern. And yes, this is repeatable; it's not just there during one run and gone once the file is in the buffer cache. What's going on is that the R::C::balanced version is being a memory pig, and the other one isn't.

Anyway, my point is that when running against real-world strings, ranging from several hundred K for each of the almost 200,000 PubMed papers up to 2.5 megabytes for the full Camel4 text, I see absolutely no catastrophe when using the native regex. There is no exponential failure that leaves a process sitting there dead in the water. Its real-time performance is completely acceptable, parsing out thousands of parentheticals in a mere tenth of a second against that entire 2.5 megabyte string, and doing much better than that on lesser files.

That's not some hypothetical constructed pathologic case where we worry ourselves to death by chasing asymptotes that never actually arise in real life. That's using real world data, and its performance on that real world data is certainly plenty good enough for me.

Isn't it good enough for you?

Jeffrey, while your posts are often above my head, I keep being impressed at the ability and versatility that you demonstrate with Marpa.

I have been wondering, now and seeing this example, finally must ask: would it be possible for Marpa to replace portions of the perl chain? Is MarpaPerl a possibility in the future? Could it pass a large fraction of the perl test suite?

It appears that you have shown that Marpa could improve the speed and the handling of ambiguity (your posts on `use`, for example), among other things, and that has me wondering if this is more than just an academic question at this point.

I wonder about a regex meta-plugin that combines re2 for (mathematically) regular expressions with Marpa for the additional features in Perl’s pattern syntax.

It looks like you added my regex to your table but didn't put a space between columns, so it runs together and people can't tell what is what.

You might also care to explain in your article why my pattern is (well, or seems to be; see abutment issue) beating your C version at the beginning of the chart.

My pattern works quite well for what I use it for: natural language processing. I'm not parsing gigabyte lisp programs. Knowing what works in your domain is important.

I believe you are unfair to Perl's patterns when you pretend that they are not cut out for the sort of work that I am putting them to, because my own tests prove they very much are.

I dunno about "short" strings: the PubMed papers are several hundred thousand bytes long, which is already outside what I call a "short" string, and Camel4 is over two and a half million bytes long.

Perhaps we have different notions of "short". I can perhaps imagine a highly specialized niche world in which those would be "short", but not a normal one.

@Jeffrey Kegler: Could you include a log-log plot of the results, i.e. a logarithmic x axis for size and a logarithmic y axis for time (perhaps with more data points than in the table)? From the slope of the lines one can easily read whether, for a given form of input, performance is linear or quadratic.

A bit of math:


y = x^n
log(y) = n * log(x)

I've been following Marpa's development for quite some time.

I recently built a grammar for simple SQL-like boolean expressions using Regexp::Grammars, and while I found it pretty easy to do, and generally performant, I would, during development, occasionally stumble across a grammar change that would make everything go exponential.

So my interest would be in seeing a comparison of a grammar built with Marpa and one built with Regexp::Grammars---with attention not only to speed of parsing, but also to the amount of code necessary to get a fully working solution, since I was easily able to create the code to have the grammar spit out a boolean value that represented whether the statement was true or not.

This might be an example that wouldn't get tchrist so hot under the collar. ;)

Sadly, no---they were iterations in the process of building the parser, and for obvious reasons they were ones I didn't find worth holding onto.

Some, perhaps all, of them were issues of mine---this was the first time I'd tried to build a parser using a formal grammar, and it's been a long time since I read an introductory text on them, so it took me a while to sort out things like how to handle parenthesized expressions and such.

Now that I've got a working grammar, though, I do intend to try converting it to Marpa.

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.