Converting glob patterns to efficient regexes in Perl and JavaScript

In Glob Matching Can Be Simple And Fast Too Russ Cox benchmarked several glob implementations by matching the string a100 against the pattern (a*)nb. (Here exponentiation refers to string repetition, as in 'a' x 100 matched against ('a*' x $n) . 'b' in Perl syntax.) What he found was that some implementations returned results instantly whereas others got extremely slow as soon as n grew past 5, taking seconds, minutes and even hours to finish.

This problem is caused by excessive backtracking. It's possible to implement globbing without nested backtracking (and the linked article explains a simple algorithm to do so), but a naive recursive implementation will suffer from this issue. It affects some shells, FTP servers, and programming languages, including Perl: File::Glob uses code from BSD libc (which is affected, unlike glibc). A patch was written and the next File::Glob release will include a fixed algorithm.

But what really caught my eye was the way Python implements glob matching: It translates each glob pattern to a regex, then simply invokes the regex engine. This approach also suffers from exponential backtracking because Python's regex engine uses an exponential-time algorithm. As it turns out I do something very similar in some of my JavaScript code to support user-specified wildcards.

The question is: Does converting the pattern to a regex suffer from the same problem in JavaScript and Perl, and can we avoid excessive backtracking somehow?

At first blush it seems like Perl is not affected:

$ perl -e '("a" x 100) =~ /\A(?:a.*?){6}b\z/s'

returns instantly. But this is because a specific regex optimization kicks in: Perl first checks whether the fixed substring 'b' appears in the target string, and because it doesn't, the match fails immediately. We can defeat this optimization by switching to:

$ perl -e '("a" x 100) =~ /\A(?:a.*?){6}[bc]\z/s'

or

$ perl -e '("a" x 100) =~ /\A(?:a.*?){6}b.*?a\z/s'

(Warning: This takes 1½ minutes to finish on my system; increasing the 6 to 7 or 8 is not recommended.)

So this approach seems vulnerable:

#!/usr/bin/env perl
use strict;
use warnings;

sub glob2re {
    my ($pat) = @_;
    $pat =~ s{(\W)}{
        $1 eq '?' ? '.' :
        $1 eq '*' ? '.*?' :
        '\\' . $1
    }eg;
    return qr/\A$pat\z/s;
}

('a' x 100) =~ glob2re(('a*' x 7) . 'b*a');

The above code indeed takes a long time to finish.

What about JavaScript? Here's the equivalent code:

function glob2re(pat) {
    pat = pat.replace(/\W/g, function (m0) {
        return (
            m0 === '?' ? '[\\s\\S]' :
            m0 === '*' ? '[\\s\\S]*?' :
            '\\' + m0
        );
    });
    return new RegExp('^' + pat + '$');
}

glob2re('a*'.repeat(7) + 'b').test('a'.repeat(100));

(Note the extra suckiness: JavaScript has no /s flag so you need something like [\s\S] to match any character.)

I've only tried this in Firefox but it also takes a long time to finish.

Clearly the naive approach does not work well. Can we fix it?

The algorithm presented by Russ Cox works by limiting the stack of saved backtracking states to at most 1 entry (yeah, that's not much of a "stack" anymore). As soon as a new instance of * starts being processed, all previous backtracking information is forgotten.

In Perl we can get the same effect by using the (*PRUNE) control verb:

#!/usr/bin/env perl
use strict;
use warnings;

sub glob2re {
    my ($pat) = @_;
    $pat =~ s{(\W)}{
        $1 eq '?' ? '.' :
        $1 eq '*' ? '(*PRUNE).*?' :
        '\\' . $1
    }eg;
    return qr/\A$pat\z/s;
}

('a' x 100) =~ glob2re(('a*' x 70) . 'b*a');

With this tiny change (adding (*PRUNE) to the generated regex), even 70 wildcards in a single pattern pose no problem: The program finishes instantly.

Again, what about JavaScript? Here the situation is a bit more complicated because JavaScript doesn't support control verbs. Normally this wouldn't be much of a problem because we could just turn foo*bar*baz into /foo(?>.*?bar)(?>.*?baz)/ instead, using (?>...) (an independent subexpression) to limit backtracking. Unfortunately JavaScript doesn't support (?>...) either. But wait! We can combine capturing groups, positive look-ahead, and backreferences to simulate (?>foo): Just use (?=(foo))\1 instead. Well, what we really want is a backreference to the last thing we captured. Perl again makes this easy with a relative backreference (\g{-1}) but in JavaScript we're forced to use absolute numbering instead. We can still do it because we control the whole regex, we just have to do a bit more manual work:

function glob2re(pat) {
    function tr(pat) {
        return pat.replace(/\W/g, function (m0) {
            return (
                m0 === '?' ? '[\\s\\S]' :
                '\\' + m0
            );
        });
    }

    var n = 1;
    pat = pat.replace(/\W[^*]*/g, function (m0, mp, ms) {
        if (m0.charAt(0) !== '*') {
            return tr(m0);
        }
        var eos = mp + m0.length === ms.length ? '$' : '';
        return '(?=([\\s\\S]*?' + tr(m0.substr(1)) + eos + '))\\' + n++;
    });
    return new RegExp('^' + pat + '$');
}

glob2re('a*'.repeat(70) + 'b').test('a'.repeat(100));

The code is starting to look a bit crazy and the regexes it generates are even worse, but it does work: Even 70 wildcards finish instantly.

Conclusion: Yes, converting glob patterns to efficient regexes is possible. It's even trivial in Perl. In JavaScript we have to jump through some annoying hoops but in the end we still get a regex that does what we want.

4 Comments

Interesting. One question: why did you translate '*' to '.*?' rather than just '.*'?

Because of (*PRUNE).

The .* requires backtracking to match: it gobbles up everything first, then relies on backtracking to give up enough of the matched string to make the match pass. (That’s why they say “death to dot star”.) But (*PRUNE) throws away any accumulated backtracking points, so once the whole string is consumed, there is nowhere to go, and the match fails.

use Test::More 'no_plan';
# equivalent to glob_match('foobarfoo', '*foo*bar*') sub match { 'foobarfoo' =~ m/\A $_[0] foo $_[0] bar $_[0] \z/x }
ok match($_), $_ for '.*', '.*?', '(*PRUNE) .*', '(*PRUNE) .*?';

The (*PRUNE) .* variant is the only one that fails because the first .* matches the whole string, then backtracking gives up just enough matches to find the foo at the end, then (*PRUNE) forgets the rest of the backtracking states, then the following .* bar has nowhere to match… and now there is nowhere to backtrack to. So the match fails.

The variant with just .* (i.e. without (*PRUNE)) works in that case, because after backtracking to the foo at the end of the string, the .* bar again fails, but backtracking then continues giving up more of the string until it finds the foo at the start of the string, after which all of foo .* bar .* \z matches. So that one manages to match – but it requires a lot of backtracking to get there.

That, again, is why they say “death to dot star”.

The variant with just .*? is better in this case, because it tries to consume as little of the string as possible, so .*? foo finds the foo at the start of the string, and then everything proceeds nicely. But it can still backtrack. And the problem with backtracking is that if you have a lot of quantifiers in the pattern, it will take exponential time to confirm match failures. The backtracking non-greedy version has that pitfall even if it seems to work better here than the backtracking greedy variant.

So the variants without (*PRUNE) are bad because they can blow up in backtracking – which was the point of this article.

The (*PRUNE) .*? variant works by not requiring backtracking to match and preventing exponential backtracking to confirm a failure when the pattern cannot possibly match.

(Technically, what happens is that .*? tries consuming nothing, then backtracks into consuming one more character, then backtracks into consuming one more character, etc., whereas .* tries consuming everything, then backtracks into consuming one less, character, then backtracks into consuming one less character, etc. So what (*PRUNE) throws away is effectively just the one backtracking state that would try to consume one more character. Because of what globs can match, the whole glob will always match if you match any literal atom as far to the left of the string as possible, so the first match allowed by any non-greedy quantifier is a suitable choice.)

The short-for-me/long-for-you version of the answer would be “read the Friedl book”. 😊

Confirming no exponential backtracking with all the JS implementations:

$ d8
V8 version 6.0.0 (candidate)

$ jjs -version
nashorn 9-internal

$ js --version
JavaScript-C55.0a1

$ rhino
Rhino 1.7 release 3 2017 02 13

$ seed --version
Seed 3.8.1

$ perl -mJE -E'say JE->VERSION'
0.066

(Nashorn and Rhino require a String.prototype.repeat polyfill to run.)

Leave a comment

About mauke

user-pic I blog about Perl.