Applying user-supplied regexes

My first post... not the snappiest title, but here goes...

Multiple Grep and Capture

I have a requirement for a module which performs the following:

Given a client-supplied list of regex patterns, which may or may not capture part of their output
And a search text
When I run my function
Then return a list of tuples, one for each regexp that matches the search text. Each 'matching' tuple consists of the regexp and an array of the captured elements of the text.

Important to note that:
* we don't know up front which regexes will be supplied, only that they should be valid
* we don't assume there are any capture groups, but if there are, we would like to hold onto them

First attempt...

for my $re (@regexps) {
    my @captures = $search_text =~ /$re/;
    push @result, [ $re, \@captures ] if $&;
}

We can't just do if (@captures) because some regexps will be non-capturing.

This worked in every unit and integration test I could throw at the problem, even end-to-end with many regexp examples. The trouble started when the code was finally integrated into the larger application (a mod_perl-based web service), where it was seen to fail badly.

Reading the docs, it appears that the $& variable is not guaranteed to be reset if there is no match. It is only updated (along with all the other capture variables) when there is an overall match, so after a failed match it can still hold the text of an earlier successful one.
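To make that concrete, here is a minimal sketch (the variable names are mine, purely for illustration) showing $& keeping its old value after a failed match. It also shows a second trap: $& can be false after a perfectly good match, for example when the matched text is '0'.

```perl
use strict;
use warnings;

my $text = 'hello world';

# A successful match sets $& to the matched text.
my $ok_first           = $text =~ /world/;
my $seen_after_success = $&;

# A failed match leaves $& untouched, so it still holds the earlier text.
my $ok_second          = $text =~ /absent/;
my $seen_after_failure = $&;

# The return value of the match itself is the reliable signal.
printf "second match succeeded? %s, but \$& is still '%s'\n",
    ( $ok_second ? 'yes' : 'no' ), $seen_after_failure;

# $& can also be false after a genuine match: here the matched text is '0'.
my $ok_zero      = '0' =~ /0/;
my $zero_is_true = $& ? 1 : 0;
```

Either way, testing the result of the match rather than $& avoids both problems.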

Also, to my shame, I hadn't realised how poorly-performing the $& variable could be.

...so I iterated on the code

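for my $re (@regexps) {
    if ($search_text =~ /$re/) {
        # Only $1..$9 are inspected; groups that did not match are dropped.
        my @captures = grep { defined $_ } ( $1, $2, $3, $4, $5, $6, $7, $8, $9 );
        # The enclosing if already proves the match, so $& is no longer needed.
        push @result, [ $re, \@captures ];
    }
}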

(We only expect to match a few cases on each iteration.) I was surprised there wasn't a special array holding all the captures from a regexp result.

Discussing the situation with colleagues, we started to probe what is returned in various cases:

For a non-capturing, non-matching regexp:
* "0" =~ /1/ => [ ] (an empty list in list context)

For a capturing, matching regexp:
* "1" =~ /(1)/ => [ '1' ]

For a non-capturing, matching regexp:
* "1" =~ /1/ => [ 1 ]

Interesting... two cases can return essentially the same array.
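A small sketch of those three probes, with each match forced into list context by assigning to an array (the array names are just for illustration):

```perl
use strict;
use warnings;

my @no_match      = ( '0' =~ /1/ );      # fails: empty list
my @with_group    = ( '1' =~ /(1)/ );    # succeeds: the captured text
my @without_group = ( '1' =~ /1/ );      # succeeds: the single value 1

printf "no match: %d element(s)\n", scalar @no_match;
printf "with group: [%s]\n",        join( ', ', @with_group );
printf "without group: [%s]\n",     join( ', ', @without_group );
```

Going by content alone, @with_group and @without_group look the same here, which is exactly the ambiguity.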

I'm not a power user by any means, so I approached Perl Monks to ask how to tell a native integer apart from its string representation, although I was fairly sure it would be easy enough via Perl internals.

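Through that thread, I was given alternative suggestions for the regexp problem. The main pointer was to the capture position arrays @- and @+, which I hadn't encountered before. After a successful match, $-[0] and $+[0] hold the start and end offsets of the whole match, and $-[n] and $+[n] hold the offsets of capture group n.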

So a better rewrite of this would be:

for my $re (@regexps) {
    if ($search_text =~ /$re/) {
        my @captures;
        # $#- is the index of the last capture group that matched.
        for my $match ( 1 .. $#- ) {
            push @captures, substr( $search_text, $-[$match], $+[$match]-$-[$match] );
        }
        push @result, [ $re, \@captures ];
    }
}
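Wrapped up as a function, the same idea might look like the sketch below. The name match_all and the sample text are my own inventions for illustration. I have also added a guard for groups that took no part in the match (as in /(a)|(b)/), since their offsets are undefined.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [ $re, \@captures ] pair per matching pattern.
sub match_all {
    my ( $search_text, @regexps ) = @_;
    my @result;
    for my $re (@regexps) {
        next unless $search_text =~ /$re/;
        my @captures;
        for my $n ( 1 .. $#- ) {
            # A group that took no part in the match has an undefined offset.
            push @captures, defined( $-[$n] )
                ? substr( $search_text, $-[$n], $+[$n] - $-[$n] )
                : undef;
        }
        push @result, [ $re, \@captures ];
    }
    return @result;
}

my @found = match_all(
    'order 12-34 for world',
    qr/(\d+)-(\d+)/, qr/world/, qr/absent/,
);
for my $pair (@found) {
    my ( $re, $captures ) = @$pair;
    printf "%s captured %d value(s)\n", $re, scalar @$captures;
}
```

The non-capturing pattern still shows up in the result, just with an empty capture array, and the pattern that doesn't match is skipped altogether.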

What I learned was that it's useful to re-read perlretut occasionally, both as a reminder of the power of the regexp engine and as a guide to best practice.

1 Comment

Here's a slightly simpler approach. As you note, if a regex has no captures, matching against it in list context produces a list containing the single element 1. This means that list assignment of the match result will have a truth status that accurately reflects whether the match succeeded. But then the problem you've discovered is that you have a spurious excess value in your capture array.

The simplest way to get rid of that is to splice off any captures from your array that weren't actually part of the match:

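for my $re (@regexps) {
    # List assignment is true only if the match returned at least one value.
    my @captures = $search_text =~ $re or next;
    # Drop everything beyond the last group that actually matched.
    splice @captures, $#-;
    push @result, [ $re, \@captures ];
}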

The documentation for the @- variable says: "One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression." So this is exactly what you want: a list context match (so you don't need to explicitly loop over the captures), but only as many captures as actually matched.
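As a rough check of how the splice behaves in each case, here is a small sketch. The helper name captures_for is hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: an array ref of captures on a match, undef otherwise.
sub captures_for {
    my ( $re, $search_text ) = @_;
    my @captures = $search_text =~ $re or return;
    splice @captures, $#-;    # trim anything past the last group that matched
    return \@captures;
}

my $plain   = captures_for( qr/1/,       '1' );    # matched, no groups
my $grouped = captures_for( qr/(1)/,     '1' );    # matched, one group
my $failed  = captures_for( qr/1/,       '0' );    # no match
my $partial = captures_for( qr/(a)|(b)/, 'a' );    # second group never matched

printf "plain: %d, grouped: %d, partial: %d, failed: %s\n",
    scalar @$plain, scalar @$grouped, scalar @$partial,
    ( defined $failed ? 'defined' : 'undef' );
```

With no groups, $#- is 0, so the splice removes the spurious 1 and leaves an empty array. With groups, it only trims trailing entries for groups that did not take part in the match.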


About ashleyhindmarsh

Here, I blog mostly about Perl. I trade in Perl and Java. I like Moo(se) and testing stuff. I worry I'm not perllish enough at times.