Never matching: everybody is doing it wrong

Well, not actually wrong, just slow. But the exaggeration makes a punchier headline, you’ll admit.

This comes up when an interface takes a pattern to match things against. Sometimes you have some reason to want this match to always fail, so you want to pass a pattern which will never match. The customary way of doing this is to pass qr/(?!)/. There is a problem with that, though.

I’m not talking here about the fact that if possible, you really don’t want to pass an actual qr object. We’ve already covered that. It was a surprising enough discovery that I’ll take this opportunity to signal-boost that while we’re here, but this article is not about that.

Instead, this one is about the fact that /(?!)/ is O(n). Yes! Exactly! I was just looking at an example of this pattern in some code I’m working on and had a scales-off-my-eyes moment. Indeed this is borne out by some quick benchmarking; something like this:

use Benchmark::Dumb 'cmpthese'; # check this out if you haven't
my $iter = 100;
my $len = 100000;
$::a = 'a' x $len;
cmpthese 0, {
  floating => q[die if $::a =~ /(?!)/;] x $iter,
  anchored => q[die if $::a =~ /^(?!)/;] x $iter,
};

This gives me results like this:

         Rate/s Precision/s floating anchored
floating 5.5948      0.0019       --   -99.9%
anchored 4165.1         3.9 74346.2%       --

Note the disparity in rate: the anchored version is orders of magnitude faster.

Confused about what is going on? The answer is that a pattern does not have to match at the start of the string, so if a pattern fails to match there, the regexp engine will skip one character and try again from that point. To return false, the regexp engine must exhaust the entire string in this way, attempting to match the pattern at every offset – unless the pattern is anchored, which the engine knows cannot succeed anywhere other than the start of the string. Thus, while both patterns always fail, it takes floating as long to fail as the input string is long, i.e. O(n), whereas anchored always fails in the same amount of time for any input string, i.e. O(1).

In fact, this goes the other way too. Because floating is simpler, for very short strings (i.e. small n), it is faster. If I turn the $len parameter down to 10, I get results like this:

          Rate/s Precision/s anchored floating
anchored  3740.8         3.1       --    -3.3%
floating 3867.43       0.023     3.4%       --

For a $len of 0, I even get results like this:

          Rate/s Precision/s anchored floating
anchored  3730.2         3.1       --   -10.0%
floating 4146.88       0.028    11.2%       --

But just turning the length up to 12 is enough to consistently get me results like this:

         Rate/s Precision/s    floating anchored
floating 3913.9         3.4          --    -5.5%
anchored 4141.9         3.8 5.82+-0.13%       --

So you might squeeze out a smidgen more performance by using floating in cases where you know that input strings cannot be long and that they will often be empty. But the breakeven point on my machine is at a length of just 11, where anchored mostly wins by a few percent but floating occasionally beats it by hundredths of a percent (in Dumbbench precision parlance). And even in the empty-string case, the difference in absolute terms is so marginal that overall I feel this is not worth even keeping in mind.

This leaves a conveniently simple conclusion:

Use /^(?!)/ as the always-fail pattern. Don’t use /(?!)/.

Leave a comment

About Aristotle

user-pic Waxing philosophical