Match Anything, Quickly

Sometimes I want to filter a set of strings, but the details of the filter are not known beforehand. In particular, I may want a null filter, which simply accepts anything.

This looks like a job for a regular expression, but I can think of at least two implementations. One is to pass around regular expression objects. The second is to wrap a match (m//) in a subroutine reference, and pass that around. Given the use of regular expressions, there are a number of possibilities for a regular expression that matches any string.

I wondered whether one of the alternatives I was choosing among was faster than another, so I decided to Benchmark them. Both implementations applied the regular expression to a global variable. In practice this would probably be a localized $_, but my read of the Benchmark module says that it also localizes $_, but leaves it undef.

Note that the empty pattern is not benchmarked, because it is equivalent to the last successfully-matched pattern, if any. The sub { 1 } was included because if we're dealing in code references, the null filter simply needs to return a true value.

Here are the results, obtained with Perl 5.36.0, unthreaded. The script that generated them is on GitHub

ImplementationRate
sub { 1 }294117647.06/sec
sub { m/ .? /smx }21645021.65/sec
sub { m/ .{0} /smx }21598272.14/sec
sub { m/ (*ACCEPT) /smx }20964360.59/sec
sub { m/ (?) /smx }20876826.72/sec
sub { m/ \A /smx }20746887.97/sec
sub { m/ (?:) /smx }20618556.70/sec
sub { m/ ^ /smx }20618556.70/sec
qr/ (?) /smx2344665.89/sec
qr/ (?:) /smx2344116.27/sec
qr/ ^ /smx2336448.60/sec
qr/ \A /smx2315350.78/sec
qr/ .? /smx2208968.41/sec
qr/ .{0} /smx2180074.12/sec
qr/ (*ACCEPT) /smx1717327.84/sec

Somewhat to my surprise, the subroutine-reference implementation was an order of magnitude faster than the regular-expression-reference implementation. I expected that, Regexps being first-class objects, it would be pretty much equivalent to m/ ... / wrapped in a subroutine -- maybe even a little faster.

A little messing around with perl -MO=Concise got me the following:

$ perl -MO=Concise -e '$_ =~ m/foo/;'
5  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter v ->2
2     <;> nextstate(main 1 -e:1) v:{ ->3
4     </> match(/"foo"/) vKS ->5
-        <1> ex-rv2sv sK/1 ->4
3           <$> gvsv(*_) s ->4
-e syntax OK
$ perl -MO=Concise -e '$_ =~ qr/foo/;'
7  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter v ->2
2     <;> nextstate(main 1 -e:1) v:{ ->3
6     </> match() vKS ->7
-        <1> ex-rv2sv sK/1 ->4
3           <$> gvsv(*_) s ->4
5        <|> regcomp(other->6) sK ->6
4           </> qr(/"foo"/) s ->5
-e syntax OK

The salient difference, to my eye, was the presence of the regcomp operator in the second case. perldoc-search on this led me eventually to perlreapi which says, in part,



"precomp" "prelen"


Used for optimisations. "precomp" holds a copy of the pattern that was compiled and "prelen" its length. When a new pattern is to be compiled (such as inside a loop) the internal "regcomp" operator checks if the last compiled "REGEXP"'s "precomp" and "prelen" are equivalent to the new one, and if so uses the old pattern instead of compiling a new one.

The relevant snippet from "Perl_pp_regcomp":



if (!re || !re->precomp || re->prelen != (I32)len ||
memNE(re->precomp, t, len))
/* Compile a new pattern */


So I assume that the speed difference might be reduced if the filter was called in a tight enough loop. But if so, the Benchmark loop is not tight enough, and it's pretty tight. On the other hand, maybe the Benchmark loop is tight enough, and the extra time is spent determining that a recompilation is not needed. But it will take deeper knowledge of Perl internals than I possess to sort this out.

6 Comments

Type::Tiny's grep method should be pretty fast to filter a list of strings in most cases.

use Types::Standard -types;
use Types::Common::String -types;

# null filter:
@filtered = Any->grep( @strings );

# filters which do things:
@filtered = NonEmptyStr->grep( @strings );
@filtered = LowerCaseStr->grep( @strings );
@filtered = StrLen->of( 4, 10 )->grep( @strings );

I ran into the same thing while developing Language::FormulaEngine::Parser. After some testing, I found that eval-ing the function that contained the regexes yielded a huge speedup vs. trying to iterate qr// objects. I thought it had more to do with the loop vs. inlining the series of regex comparisons though. I never considered that a single qr// match might be slower than a single inline regex match.

Hi, Cincinnati perlmongers had a chat about this and we realized that you forgot /smx on the sub-eval'd version. Fixing that, it still shows that eval'd sub is faster than regexref, but reveals that (*ACCEPT) is the slowest way to "match anything" in both situations.

We also found out that qr/$re/o fixes most of the performance problem of using qr//, and are currently debating whether /o can cause bugs when used this way.

Actually, the /o flag needed to be on the match itself to speed it up, not the qr// expression.

The original code wasn't a valid test because the subroutine versions of the test cases were missing the /smx flags shown in the output, so they were all failing to match a space character immediately. They weren't matching successfully, so it wasn't a fair comparison.

Here's the test cases for an apples-to-apples test using /o to compile the match regex once and code references on both sides to minimize differences:

foreach my $re ( qw{ (*ACCEPT) (?) (?:) .? .{0} \A ^ } ) {
    my $code = eval "sub { \$MATCH =~ m/ $re /smx }";
    my $code2 = eval "do { my \$regex = qr/ $re /smx; " .
        "sub { \$MATCH =~ /\$regex/o }};";
    push @cases,
        "sub { m/ $re /smx }", $code,
        "qr/ $re /smx", $code2;
}

The test cases above do show the qr// versions running a little slower, but they're all in the same ballpark overall:

sub { m/ ^ /smx }            14306151.65/sec
sub { m/ \A /smx }           13477088.95/sec
sub { m/ (?:) /smx }         13315579.23/sec
qr/ ^ /smx                   11848341.23/sec
sub { m/ (?) /smx }          11668611.44/sec
qr/ (?) /smx                 11655011.66/sec
qr/ \A /smx                  11627906.98/sec
qr/ (?:) /smx                11198208.29/sec
sub { m/ .? /smx }           11160714.29/sec
sub { m/ .{0} /smx }         10615711.25/sec
qr/ .{0} /smx                 9624639.08/sec
qr/ .? /smx                   9074410.16/sec
sub { m/ (*ACCEPT) /smx }     4442470.01/sec
qr/ (*ACCEPT) /smx            4140786.75/sec

However, removing the /o flag from the match on $regex causes the match regex to be recompiled, which makes the qr versions 2-3 times slower, but not an order of magnitude slower:

sub { m/ (?) /smx }          11534025.37/sec
sub { m/ ^ /smx }            11389521.64/sec
sub { m/ \A /smx }           11248593.93/sec
sub { m/ .? /smx }           10706638.12/sec
sub { m/ (?:) /smx }         10449320.79/sec
sub { m/ .{0} /smx }          9009009.01/sec
sub { m/ (*ACCEPT) /smx }     4460303.30/sec
qr/ \A /smx                   4004805.77/sec
qr/ ^ /smx                    3909304.14/sec
qr/ (?) /smx                  3877471.89/sec
qr/ (?:) /smx                 3871467.29/sec
qr/ .? /smx                   3692762.19/sec
qr/ .{0} /smx                 3608805.49/sec
qr/ (*ACCEPT) /smx            2558199.03/sec

Easy mistake to make! I never noticed it until I added extra code to print the results of the matches and discovered that the subs never matched!

Your basic point still stands though -- without using /o when calling a qr// regex, it is several times slower than using the anonymous subroutine...

Leave a comment

About Tom Wyant

user-pic I blog about Perl.