Converting glob patterns to regular expressions
Let's say you have a glob pattern with shell-style wildcards from a config file or user input, where ?
matches any character and *
matches any string (0 or more characters). You want to convert it to a regex, maybe because you just want to match it (and Perl already supports regexes) or because you want to embed it as part of a bigger regex.
You might start with a naive replacement:
s/\?/./g; # ? -> .
s/\*/.*/g; # * -> .*
Unfortunately this is broken: It leaves all other characters untouched, including those that have a special meaning in regexes, such as (
, +
, |
, etc.
Let's revise it:
s{(\W)}{
$1 eq '?' ? '.' :
$1 eq '*' ? '.*' :
'\\' . $1
}eg;
Now we match and replace every non-word character. If it's ?
or *
, we turn it into its regex equivalent; otherwise we backslash-escape it just like quotemeta
would do.
But what if the input is something like a***b
? This would turn into a.*.*.*b
, which when run on a long target string without b
s by a backtracking engine can be very inefficient (barring extra optimizations). A missing b
would make the match fail at the end, which would cause the engine to go through all possible ways .*.*.*
could subdivide the string amongst themselves before giving up. In general this takes O(nk) time (where n is the length of the target string and k is the number of stars in the pattern).
We can do better than that by realizing **
is equivalent to *
, which means that any sequence of stars is equivalent to a single *
, and preprocessing the pattern:
tr,*,,s; # ***...* -> *
This still doesn't fix everything, though: *?*?*
doesn't contain any repeated *
s but still allows for exponential backtracking. One way to work around this is to normalize the pattern even further: Because *?
is equivalent to ?*
, we can move all the ?
s to the front:
# "*?*?*"
1 while s/\*\?/?*/g;
# "?*?**" (after 1 iteration)
# "??***" (after 2 iterations)
tr,*,,s;
# "??*"
s{(\W)}{
$1 eq '?' ? '.' :
$1 eq '*' ? '.*' :
'\\' . $1
}eg;
# "...*"
However, I don't like that the transformation is spread out over two regex substitutions and one transliteration, when there is a way to do it all in a single substitution:
s{
( [?*]+ ) # a run of ? or * characters
|
(\W) # any other non-word character
}{
defined $1
? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
: '\\' . $2
}xeg;
That is, we turn each run of ?
or *
characters into .{N}
(if there was no *
) or .{N,}
(if there was at least one *
) where N
is the number of ?
s in the run.
Given an input of *?*?*
, this would generate .{2,}
("match 2 or more of any character").
And finally, if we wanted the user to be able to escape characters with a backslash to match them literally:
s{
( [?*]+ ) # a run of ? or * characters
|
\\ (.) # backslash escape
|
(\W) # any other non-word character
}{
defined $1
? '.{' . ($1 =~ tr,?,,) . (index($1, '*') >= 0 ? ',' : '') . '}'
: quotemeta $+
}xeg;
Minor tweak: if you are matching file names like File::Glob does, you need to exclude '/' from the characters matched by the wildcards. Unless you are allowing '**' to match truly anything, in which case you need to distinguish it from '*'.
Thank you for the post.
A
*
might be inside[...]
, which would be like Perl's character classes. In that case, you want to leave it alone. :)But I like the spread-out version better. :-) I find it much easier to understand what’s going on there.
And I typically find it easier to reason about stars and pluses than counted repetitions, even if it means I have to count dots.
So I think I would have written this like so:
That is, first, shuffle around question marks while collapsing wildcards as soon as possible. Then recognise particular wildcard sequences and translate them.
Note how easy it is to modify this to generate counted repetitions and how relatively simpler the generator expression for them ends up:
(Note that I didn’t use
([?]*)([*]*)
because that would match the empty string. But it’s silly (i.e. much less readable) to translate*
to.{0,}
instead of.*
anyway.)Or you could use a hash for the convertion:
s/(\W)/$convert{$1}||$1/eg;
You could have a look at glob_to_regex() in Text::Glob.
I haven't checked it, but if it doesn't handle the cases discussed here, you could submit a PR tomorrow, as part of your CPAN Day celebration :-)
And while you're there, you could add the github repo to the dist's metadata, so it will appear in the sidebar on MetaCPAN.
I wrote this once, for one of my own scripts:
It knows about the special significance that directory separators have in glob patterns, and handles brace expansion (by letting bsd_glob expand the pattern without expanding wildcards). Also handles character class wildcards, besides the simple ? and * wildcards.
The original inspiration for this came from matching general text, not file names. That's why
/
isn't treated specially.(In one place I use this for matching some HTTP header values, in which case
.
is actually[!#$%&'*+\-.^`|~\w]
.)I'm not doing full-blown shell filename expansion, so I simply don't support
[...]
here.Eh, it's generated code. I don't care much about how readable it is. :-)
I don't think your version handles the last case with backslash escapes.
Text::Glob doesn't treat repeated
*
specially, no. But it supports many other features (wildcards don't match a leading dot, curlies, etc) plus it uses a rather C-like approach to converting the pattern (iterating over single characters, state machine) so it would take more than 10 seconds of looking at it to make any major changes to the algorithm.