August 2015 Archives

Converting glob patterns to regular expressions

Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ? matches any character and * matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.

You might start with a naive replacement:

s/\?/./g;   # ? -> .
s/\*/.*/g;  # * -> .*

Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (, +, |, etc.

Let's revise it:

s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;

Now we match and replace every non-word character. If it's ? or *, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta would do.

But what if the input is something like a***b? This would turn into a.*.*.*b, which when run on a long target string without bs by a backtracking engine can be very inefficient (barring extra optimizations). A missing b would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.* could subdivide the string amongst themselves before giving up. In general this takes O(nk) time (where n is the length of the target string and k is the number of stars in the pattern).

We can do better than that by realizing ** is equivalent to *, which means that any sequence of stars is equivalent to a single *, and preprocessing the pattern:

tr,*,,s;  # ***...* -> *

This still doesn't fix everything, though: *?*?* doesn't contain any repeated *s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *? is equivalent to ?*, we can move all the ?s to the front:

# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**"  (after 1 iteration)
# "??***"  (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
    $1 eq '?' ? '.' :
    $1 eq '*' ? '.*' :
    '\\' . $1
}eg;
# "...*"

However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : '\\' . $2
}xeg;

That is, we turn each run of ? or * characters into .{N} (if there was no *) or .{N,} (if there was at least one *) where N is the number of ?s in the run.

Given an input of *?*?*, this would generate .{2,} ("match 2 or more of any character").

And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:

s{
    ( [?*]+ )  # a run of ? or * characters
|
    \\ (.)     # backslash escape
|
    (\W)       # any other non-word character
}{
    defined $1
        ? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
        : quotemeta $+
}xeg;

About mauke

user-pic I blog about Perl.