Converting glob patterns to regular expressions

By mauke on August 14, 2015 1:42 PM

Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ? matches any character and * matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.

You might start with a naive replacement:

s/\?/./g;   # ? -> .
s/\*/.*/g;  # * -> .*

Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (, +, |, etc.

Let's revise it:

s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;

Now we match and replace every non-word character. If it's ? or *, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta would do.

But what if the input is something like a***b? This would turn into a.*.*.*b, which when run on a long target string without bs by a backtracking engine can be very inefficient (barring extra optimizations). A missing b would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.* could subdivide the string amongst themselves before giving up. In general this takes O(n^k) time (where n is the length of the target string and k is the number of stars in the pattern).

We can do better than that by realizing ** is equivalent to *, which means that any sequence of stars is equivalent to a single *, and preprocessing the pattern:

tr,*,,s;  # ***...* -> *

This still doesn't fix everything, though: *?*?* doesn't contain any repeated *s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *? is equivalent to ?*, we can move all the ?s to the front:

# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**"  (after 1 iteration)
# "??***"  (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;
# "...*"

However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : '\\' . $2
}xeg;

That is, we turn each run of ? or * characters into .{N} (if there was no *) or .{N,} (if there was at least one *) where N is the number of ?s in the run.

Given an input of *?*?*, this would generate .{2,} ("match 2 or more of any character").

And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    \\ (.)     # backslash escape
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : quotemeta $+
}xeg;

10 comments

Tagged as:

glob, regex

10 Comments

Tom Wyant | August 14, 2015 3:42 PM | Reply

Minor tweak: if you are matching file names like File::Glob does, you need to exclude '/' from the characters matched by the wildcards. Unless you are allowing '**' to match truly anything, in which case you need to distinguish it from '*'.

Thank you for the post.

brian d foy | August 14, 2015 4:03 PM | Reply

A * might be inside [...], which would be like Perl's character classes. In that case, you want to leave it alone. :)

Aristotle | August 15, 2015 9:06 AM | Reply

But I like the spread-out version better. :-) I find it much easier to understand what’s going on there.

And I typically find it easier to reason about stars and pluses than counted repetitions, even if it means I have to count dots.

So I think I would have written this like so:

1 while s/[*]+([?]+)[*]*/$1*/g;
 
s[
    ([?][*]+)
    |([?])
    |([*]+)
    |(\W)
]{
    ; $1 ? '.+'
    : $2 ? '.'
    : $3 ? '.*'
    : quotemeta $4
}xge;

That is, first, shuffle around question marks while collapsing wildcards as soon as possible. Then recognise particular wildcard sequences and translate them.

Note how easy it is to modify this to generate counted repetitions and how relatively simpler the generator expression for them ends up:

s[
    (?:([?]+)([*]*))
    |([*]+)
    |(\W)
]{
    ; $1 ? ( '.{' . ( length $1 ) . ( $2 ? ',' : '' ) . '}' )
    : $3 ? '.*'
    : quotemeta $4
}xge;

(Note that I didn’t use ([?]*)([*]*) because that would match the empty string. But it’s silly (i.e. much less readable) to translate * to .{0,} instead of .* anyway.)

Shawn H Corey | August 15, 2015 2:34 PM | Reply

Or you could use a hash for the convertion:

my %convert = (

  '*' => '.*',

  '?' => '.',

);

s/(\W)/$convert{$1}||$1/eg;

Neil Bowers | August 15, 2015 9:03 PM | Reply

You could have a look at glob_to_regex() in Text::Glob.

I haven't checked it, but if it doesn't handle the cases discussed here, you could submit a PR tomorrow, as part of your CPAN Day celebration :-)

And while you're there, you could add the github repo to the dist's metadata, so it will appear in the sidebar on MetaCPAN.

smls | August 16, 2015 11:40 PM | Reply

I wrote this once, for one of my own scripts:


# Given a glob pattern, return a regex that has the following behavior when
# matched against Cwd::realpath canolicalized path strings:
#  - matches successfully if and only if the path would have been matched by
#    the original glob
#  - $1, $2, ... will contain what would have been matched by groups of
#    consecutive wildcards (*, ?, [...]) in the original glob
#  - $::REGMARK will contain the numeric index of the "brace expansion"
#    alternative in the original glob that would have first matched the path
# If  tree => 1  is passed, a nested array with the individual parts is
# returned instead.

sub glob_to_regex {

    my ($pattern, %args) = @_;

    

    $pattern =~ s/(\\. | \* | \? | \[ [^\]]* \])/quotemeta $1/xeg;

    

    my @alts;

    foreach (bsd_glob($pattern, GLOB_QUOTE | GLOB_BRACE |

                                GLOB_TILDE | GLOB_NOCHECK)) {

        my $pattern = $_;

        my @parts;

        while ($pattern =~ /((?: \\. | . )*?) (?: (\z)

                            | (\*) | (\?) | \[ (!)? (\])? ([^\]]*) \] )/gx) {

            my $wildcard = $3 ? '[^/]*' :

                           $4 ? '[^/]?' :

                           $7 ? '(?!/)['.($5 ? '^' : '')

                                        .($6 ? '\]' : '').$7.']' : '';

            if (length $1) { push @parts, [$1, $wildcard] }

            elsif (!@parts) { push @parts, ['', $wildcard] }

            else { $parts[$#parts][1] .= $wildcard }

        }

        push @alts, $args{tree} ? \@parts :

                    join('', map { (quotemeta($_->[0] =~ s/\\(.)/$1/rg)) .

                                   (length $_->[1] ? "($_->[1])" : '') } @parts)

                    . '(*MARK:'.@alts.')';

    } 

    

    return $args{tree} ? @alts : qr/^(?|@{[ join '|', @alts ]})$/

}

It knows about the special significance that directory separators have in glob patterns, and handles brace expansion (by letting bsd_glob expand the pattern without expanding wildcards). Also handles character class wildcards, besides the simple ? and * wildcards.

mauke replied to comment from Tom Wyant | September 5, 2015 2:33 PM | Reply

The original inspiration for this came from matching general text, not file names. That's why / isn't treated specially.

(In one place I use this for matching some HTTP header values, in which case . is actually [!#$%&'*+\-.^`|~\w].)

mauke replied to comment from brian d foy | September 5, 2015 2:34 PM | Reply

I'm not doing full-blown shell filename expansion, so I simply don't support [...] here.

mauke replied to comment from Aristotle | September 5, 2015 2:36 PM | Reply

Eh, it's generated code. I don't care much about how readable it is. :-)

I don't think your version handles the last case with backslash escapes.

mauke replied to comment from Neil Bowers | September 5, 2015 2:44 PM | Reply

Text::Glob doesn't treat repeated * specially, no. But it supports many other features (wildcards don't match a leading dot, curlies, etc) plus it uses a rather C-like approach to converting the pattern (iterating over single characters, state machine) so it would take more than 10 seconds of looking at it to make any major changes to the algorithm.

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About mauke

I blog about Perl.

More info »

mauke