Converting glob patterns to regular expressions

Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ? matches any character and * matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.

You might start with a naive replacement:

s/\?/./g;   # ? -> .
s/\*/.*/g;  # * -> .*

Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (, +, |, etc.

Let's revise it:

s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;

Now we match and replace every non-word character. If it's ? or *, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta would do.

But what if the input is something like a***b? This would turn into a.*.*.*b, which when run on a long target string without bs by a backtracking engine can be very inefficient (barring extra optimizations). A missing b would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.* could subdivide the string amongst themselves before giving up. In general this takes O(nk) time (where n is the length of the target string and k is the number of stars in the pattern).

We can do better than that by realizing ** is equivalent to *, which means that any sequence of stars is equivalent to a single *, and preprocessing the pattern:

tr,*,,s;  # ***...* -> *

This still doesn't fix everything, though: *?*?* doesn't contain any repeated *s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *? is equivalent to ?*, we can move all the ?s to the front:

# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**"  (after 1 iteration)
# "??***"  (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;
# "...*"

However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : '\\' . $2
}xeg;

That is, we turn each run of ? or * characters into .{N} (if there was no *) or .{N,} (if there was at least one *) where N is the number of ?s in the run.

Given an input of *?*?*, this would generate .{2,} ("match 2 or more of any character").

And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    \\ (.)     # backslash escape
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : quotemeta $+
}xeg;

10 Comments

Minor tweak: if you are matching file names like File::Glob does, you need to exclude '/' from the characters matched by the wildcards. Unless you are allowing '**' to match truly anything, in which case you need to distinguish it from '*'.

Thank you for the post.

A * might be inside [...], which would be like Perl's character classes. In that case, you want to leave it alone. :)

But I like the spread-out version better. :-) I find it much easier to understand what’s going on there.

And I typically find it easier to reason about stars and pluses than counted repetitions, even if it means I have to count dots.

So I think I would have written this like so:

1 while s/[*]+([?]+)[*]*/$1*/g;
 
s[
    ([?][*]+)
    |([?])
    |([*]+)
    |(\W)
]{
    ; $1 ? '.+'
    : $2 ? '.'
    : $3 ? '.*'
    : quotemeta $4
}xge;

That is, first, shuffle around question marks while collapsing wildcards as soon as possible. Then recognise particular wildcard sequences and translate them.

Note how easy it is to modify this to generate counted repetitions and how relatively simpler the generator expression for them ends up:

s[
    (?:([?]+)([*]*))
    |([*]+)
    |(\W)
]{
    ; $1 ? ( '.{' . ( length $1 ) . ( $2 ? ',' : '' ) . '}' )
    : $3 ? '.*'
    : quotemeta $4
}xge;

(Note that I didn’t use ([?]*)([*]*) because that would match the empty string. But it’s silly (i.e. much less readable) to translate * to .{0,} instead of .* anyway.)

Or you could use a hash for the convertion:
my %convert = (
  '*' => '.*',
  '?' => '.',
);

s/(\W)/$convert{$1}||$1/eg;

You could have a look at glob_to_regex() in Text::Glob.

I haven't checked it, but if it doesn't handle the cases discussed here, you could submit a PR tomorrow, as part of your CPAN Day celebration :-)

And while you're there, you could add the github repo to the dist's metadata, so it will appear in the sidebar on MetaCPAN.

I wrote this once, for one of my own scripts:


# Given a glob pattern, return a regex that has the following behavior when
# matched against Cwd::realpath canolicalized path strings:
#  - matches successfully if and only if the path would have been matched by
#    the original glob
#  - $1, $2, ... will contain what would have been matched by groups of
#    consecutive wildcards (*, ?, [...]) in the original glob
#  - $::REGMARK will contain the numeric index of the "brace expansion"
#    alternative in the original glob that would have first matched the path
# If  tree => 1  is passed, a nested array with the individual parts is
# returned instead.

sub glob_to_regex {
    my ($pattern, %args) = @_;
    
    $pattern =~ s/(\\. | \* | \? | \[ [^\]]* \])/quotemeta $1/xeg;
    
    my @alts;
    foreach (bsd_glob($pattern, GLOB_QUOTE | GLOB_BRACE |
                                GLOB_TILDE | GLOB_NOCHECK)) {
        my $pattern = $_;
        my @parts;
        while ($pattern =~ /((?: \\. | . )*?) (?: (\z)
                            | (\*) | (\?) | \[ (!)? (\])? ([^\]]*) \] )/gx) {
            my $wildcard = $3 ? '[^/]*' :
                           $4 ? '[^/]?' :
                           $7 ? '(?!/)['.($5 ? '^' : '')
                                        .($6 ? '\]' : '').$7.']' : '';
            if (length $1) { push @parts, [$1, $wildcard] }
            elsif (!@parts) { push @parts, ['', $wildcard] }
            else { $parts[$#parts][1] .= $wildcard }
        }
        push @alts, $args{tree} ? \@parts :
                    join('', map { (quotemeta($_->[0] =~ s/\\(.)/$1/rg)) .
                                   (length $_->[1] ? "($_->[1])" : '') } @parts)
                    . '(*MARK:'.@alts.')';
    } 
    
    return $args{tree} ? @alts : qr/^(?|@{[ join '|', @alts ]})$/
}

It knows about the special significance that directory separators have in glob patterns, and handles brace expansion (by letting bsd_glob expand the pattern without expanding wildcards). Also handles character class wildcards, besides the simple ? and * wildcards.

Leave a comment

About mauke

user-pic I blog about Perl.