Converting glob patterns to regular expressions
Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ?
matches any character and *
matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.
You might start with a naive replacement:
s/\?/./g; # ? -> .
s/\*/.*/g; # * -> .*
Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (
, +
, |
, etc.
Let's revise it:
s{(\W)}{
$1 eq '?' ? '.' :
$1 eq '*' ? '.*' :
'\\' . $1
}eg;
Now we match and replace every non-word character. If it's ?
or *
, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta
would do.
But what if the input is something like a***b
? This would turn into a.*.*.*b
, which when run on a long target string without b
s by a backtracking engine can be very inefficient (barring extra optimizations). A missing b
would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.*
could subdivide the string amongst themselves before giving up. In general this takes O(nk) time (where n is the length of the target string and k is the number of stars in the pattern).
We can do better than that by realizing **
is equivalent to *
, which means that any sequence of stars is equivalent to a single *
, and preprocessing the pattern:
tr,*,,s; # ***...* -> *
This still doesn't fix everything, though: *?*?*
doesn't contain any repeated *
s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *?
is equivalent to ?*
, we can move all the ?
s to the front:
# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**" (after 1 iteration)
# "??***" (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
$1 eq '?' ? '.' :
$1 eq '*' ? '.*' :
'\\' . $1
}eg;
# "...*"
However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:
s{
( [?*]+ ) # a run of ? or * characters
|
(\W) # any other non-word character
}{
defined $1
? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
: '\\' . $2
}xeg;
That is, we turn each run of ?
or *
characters into .{N}
(if there was no *
) or .{N,}
(if there was at least one *
) where N
is the number of ?
s in the run.
Given an input of *?*?*
, this would generate .{2,}
("match 2 or more of any character").
And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:
s{
( [?*]+ ) # a run of ? or * characters
|
\\ (.) # backslash escape
|
(\W) # any other non-word character
}{
defined $1
? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
: quotemeta $+
}xeg;