Never matching: everybody is doing it wrong
Well, not actually wrong, just slow. But the exaggeration makes a punchier headline, you’ll admit.
This comes up when an interface takes a pattern to match things against. Sometimes you have some reason to want this match to always fail, so you want to pass a pattern which will never match. The customary way of doing this is to pass qr/(?!)/
. There is a problem with that, though.
I’m not talking here about the fact that if possible, you really don’t want to pass an actual qr
object. We’ve already covered that. It was a surprising enough discovery that I’ll take this opportunity to signal-boost that while we’re here, but this article is not about that.
Instead, this one is about the fact that /(?!)/
is O(n). Yes! Exactly! I was just looking at an example of this pattern in some code I’m working on and had a scales-off-my-eyes moment. Indeed this is borne out by some quick benchmarking; something like this:
use Benchmark::Dumb 'cmpthese'; # check this out if you haven't
my $iter = 100;
my $len = 100000;
$::a = 'a' x $len;
cmpthese 0, {
floating => q[die if $::a =~ /(?!)/;] x $iter,
anchored => q[die if $::a =~ /^(?!)/;] x $iter,
};
This gives me results like this:
Rate/s Precision/s floating anchored floating 5.5948 0.0019 -- -99.9% anchored 4165.1 3.9 74346.2% --
Note the disparity in rate: the anchored
version is orders of magnitude faster.
Confused about what is going on? The answer is that a pattern does not have to match at the start of the string, so if a pattern fails to match there, the regexp engine will skip one character and try again from that point. To return false, the regexp engine must exhaust the entire string in this way, attempting to match the pattern at every offset – unless the pattern is anchored, which the engine knows cannot succeed anywhere other than the start of the string. Thus, while both patterns always fail, it takes floating
as long to fail as the input string is long, i.e. O(n), whereas anchored
always fails in the same amount of time for any input string, i.e. O(1).
In fact, this goes the other way too. Because floating
is simpler, for very short strings (i.e. small n), it is faster. If I turn the $len
parameter down to 10, I get results like this:
Rate/s Precision/s anchored floating anchored 3740.8 3.1 -- -3.3% floating 3867.43 0.023 3.4% --
For a $len
of 0, I even get results like this:
Rate/s Precision/s anchored floating anchored 3730.2 3.1 -- -10.0% floating 4146.88 0.028 11.2% --
But just turning the length up to 12 is enough to consistently get me results like this:
Rate/s Precision/s floating anchored floating 3913.9 3.4 -- -5.5% anchored 4141.9 3.8 5.82+-0.13% --
So you might squeeze out a smidgen more performance by using floating
in cases where you know that input strings cannot be long and that they will often be empty. But the breakeven point on my machine is at a length of just 11, where anchored
mostly wins by a few percent but floating
occasionally beats it by hundredths of a percent (in Dumbbench precision parlance). And even in the empty-string case, the difference in absolute terms is so marginal that overall I feel this is not worth even keeping in mind.
This leaves a conveniently simple conclusion:
Use /^(?!)/
as the always-fail pattern. Don’t use /(?!)/
.
Leave a comment