Converting glob patterns to regular expressions
Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ? matches any character and * matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.
You might start with a naive replacement:
s/\?/./g; # ? -> .
s/\*/.*/g; # * -> .*
Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (, +, |, etc.
Let's revise it:
s{(\W)}{
$1 eq '?' ? '.' :
$1 eq '*' ? '.*' :
'\\' . $1
}eg;
Now we match and replace every non-word character. If it's ? or *, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta would do.
But what if the input is something like a***b? This would turn into a.*.*.*b, which when run on a long target string without bs by a backtracking engine can be very inefficient (barring extra optimizations). A missing b would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.* could subdivide the string amongst themselves before giving up. In general this takes O(nk) time (where n is the length of the target string and k is the number of stars in the pattern).
We can do better than that by realizing ** is equivalent to *, which means that any sequence of stars is equivalent to a single *, and preprocessing the pattern:
tr,*,,s; # ***...* -> *
This still doesn't fix everything, though: *?*?* doesn't contain any repeated *s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *? is equivalent to ?*, we can move all the ?s to the front:
# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**" (after 1 iteration)
# "??***" (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
$1 eq '?' ? '.' :
$1 eq '*' ? '.*' :
'\\' . $1
}eg;
# "...*"
However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:
s{
( [?*]+ ) # a run of ? or * characters
|
(\W) # any other non-word character
}{
defined $1
? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
: '\\' . $2
}xeg;
That is, we turn each run of ? or * characters into .{N} (if there was no *) or .{N,} (if there was at least one *) where N is the number of ?s in the run.
Given an input of *?*?*, this would generate .{2,} ("match 2 or more of any character").
And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:
s{
( [?*]+ ) # a run of ? or * characters
|
\\ (.) # backslash escape
|
(\W) # any other non-word character
}{
defined $1
? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
: quotemeta $+
}xeg;
I blog about Perl.