A regex for my book

I wrote the following on Facebook:

I have a very good reason for including this regular expression in my book. It's a tiny part of a much longer one I once wrote.

(?x-ism:(?-xism:(?!\.(?![0-9]))(?:(?:(?i)(?:[+-]?)(?:
(?=[.]?[0123456789])(?:[0123456789]*)(?:(?:[.])(?:
[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+
))|))\b|\b(?-xism:[[:upper:]][[:alnum:]_]*)\b))(?:\s*
(?-xism:(?:\*\*|[-+*/%]))\s*(?-xism:(?!\.(?![0-9]))(?:(?:
(?i)(?:[+-]?)(?:(?=[.]?[0123456789])(?:[0123456789]*)(?:
(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:
[0123456789]+))|))\b|\b(?-xism:[[:upper:]][[:alnum:]_]*)
\b)))*)

Many people then piled on and criticized it without asking why I was going to include it. I should have known better than to make such a cryptic post and then head to bed, so here's the explanation.

First, here's a sampling of a few anonymized comments, each followed by my response.

and what is, in your deluded opinion, the good reason for including it in a beginner book?

I've explained repeatedly that although the title of the book is "Beginning Perl", the book is really "Working Perl", and there's a darned good reason for that regex.

You're writing regexes like that now, in Dec 2011, four years after we got named sub expressions? I guess you aren't using subroutines in your book either?

No, I'm not writing regular expressions like that now. I never said I was (I really didn't understand where this comment was coming from).

nor /x nor comments in the code

Actually, there was an /x, but I needed the regex to fit on Facebook.

And what's with mixing [0123456789] and [0-9] and Posix classes?

That was a valid point. One was a [0-9] in my code, but the rest was from Regexp::Common.

[0123456789] instead of \d ? that's not a good ad for the book :)

That's a great ad for the book. The \d escape matches a hell of a lot more than 0 through 9 because it matches any Unicode digit, not just the ASCII digits.
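Here's a minimal sketch of the difference. The test character (ARABIC-INDIC DIGIT THREE) is my own pick for illustration, and the /a modifier, which restricts \d to ASCII, assumes Perl 5.14 or later; it's included only for comparison.

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT THREE (U+0663): a digit according to Unicode, but not ASCII.
my $arabic_three = "\x{0663}";

my $matches_backslash_d = $arabic_three =~ /^\d$/    ? 1 : 0;  # Unicode-aware \d
my $matches_ascii_class = $arabic_three =~ /^[0-9]$/ ? 1 : 0;  # explicit ASCII range
my $matches_d_with_a    = $arabic_three =~ /^\d$/a   ? 1 : 0;  # /a needs Perl 5.14+

printf "\\d: %d, [0-9]: %d, \\d with /a: %d\n",
    $matches_backslash_d, $matches_ascii_class, $matches_d_with_a;
# prints: \d: 1, [0-9]: 0, \d with /a: 0
```

If the input is meant to be plain ASCII numbers, spelling out the digits (or using /a where available) keeps the pattern from accepting more than intended.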

This is why we can't have nice things (said about Perl)

No, this is a perfect example of why we can have nice things (said about Perl). I didn't write that regular expression directly. It was for code which needed to match valid Prolog math expressions. Take a gander at this (it's a touch simpler than the original regex):

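use Test::Most;        # also turns on strict and warnings
use Regexp::Common;

# Small pieces that can each be tested on their own.
my $num = $RE{num}{real};                  # real numbers, from Regexp::Common
my $var = qr/[[:upper:]][[:alnum:]_]*/;    # an upper-case letter, then alphanumerics or underscores
my $op  = qr{[-+*/%]};                     # single-character math operators

# Compose the pieces into larger patterns.
my $simple_math_term  = qr/$num|$var/;
my $simple_expression = qr/
    $simple_math_term
    (?:
        \s*
        $op
        \s*
        $simple_math_term
    )*
/x;

# Expression => whether it should match.
my %is_valid = (
    '2 + 3'                  => 1,
    ' + 2 - 3'               => 0,
    'Var'                    => 1,
    '-3.2e5 % SomeVar / Var' => 1,
    'not_a_var + 2'          => 0,
);

while ( my ( $expr, $good ) = each %is_valid ) {
    if ($good) {
        like $expr, qr/^$simple_expression$/, "Should be valid: $expr";
    }
    else {
        unlike $expr, qr/^$simple_expression$/, "Should not be valid: $expr";
    }
}

# Print the composed regex as a diagnostic.
diag $simple_expression;

done_testing;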

Is that perfect? No, but it shows how easy it is to take a hard problem and break it down into small ones. The resulting regular expression may be very ugly, but that's the entire point. Hard problems become easy when you learn how to break them down. More importantly, you never need to see the big, bad regex and you can easily test the smaller regexes you're composing to form the larger one.
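To make the "test the smaller regexes" point concrete, here's a minimal sketch that tests two of the pieces on their own. It reuses $var and $op from the example above and swaps Test::Most for the core Test::More so it runs without extra modules; the test strings are made up for illustration.

```perl
use strict;
use warnings;
use Test::More;

# The same building blocks as in the example above.
my $var = qr/[[:upper:]][[:alnum:]_]*/;
my $op  = qr{[-+*/%]};

# Each piece gets its own anchored tests before it is composed into anything larger.
like   'SomeVar',   qr/^$var$/, 'variable: starts with an upper-case letter';
like   'Var_2',     qr/^$var$/, 'variable: digits and underscores allowed after the first character';
unlike 'not_a_var', qr/^$var$/, 'not a variable: starts with a lower-case letter';

like   $_,   qr/^$op$/, "operator: $_" for qw( + - * / % );
unlike '**', qr/^$op$/, 'this simplified operator pattern has no two-character operators';

done_testing;
```

Once the pieces pass on their own, a failure in the composed expression points at the composition rather than at the parts.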

For those who want to see the original regex I wrote, read the code for AI::Prolog::PreProcessor::Math. Or you can check out the tests where I verify, amongst other things, that I can successfully transform this:

Answer is 9 / (3 + (4+7) % ModValue) + 2 / (3+7).

Into this:

is(Answer, plus(div(9, mod(plus(3, plus(4, 7)), ModValue)), div(2, plus(3, 7)))).

By composing regexes from smaller ones, that problem became trivial.
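One detail that makes this kind of composition safe: a qr// object carries its own non-capturing group when it's interpolated into a larger pattern (you can see the (?-xism: ... ) wrappers in the big regex at the top), so an alternation inside a piece stays inside that piece. Here's a small sketch of that behavior; $term, $composed, and $pasted are hypothetical names used only for this illustration.

```perl
use strict;
use warnings;

my $term     = qr/a|b/;          # stand-in for a two-way alternation such as $num|$var
my $composed = qr/^${term}c$/;   # interpolating the qr// object keeps its group
my $pasted   = qr/^a|bc$/;       # what plain text substitution would have produced

# The stringified form shows the wrapper: (?^:a|b) on Perl 5.14+, (?-xism:a|b) on older Perls.
print "stringified: $term\n";

for my $input (qw( ac bc a )) {
    printf "%s: composed=%d pasted=%d\n", $input,
        ( $input =~ $composed ? 1 : 0 ),
        ( $input =~ $pasted   ? 1 : 0 );
}
```

Only the lone 'a' tells the two apart: the composed pattern rejects it, while the pasted one accepts it because its alternation splits the whole pattern into ^a and bc$.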

5 Comments

At least that regex is great for t-shirts. :)

I like Working Perl as a title, or something along that idea: Everyday Perl, Work-a-day Perl, and other things that might connote that it's just the stuff that you'll probably use.

I once made a regexp to match every town/suburb name in Australia. As well as being mind-bogglingly long and complicated, it was also blazingly fast.

I believe the regexes for simple_math_term and simple_expression should be enclosed in "(?: )". If you do that whenever a regex piece consists of more than one atom, the next higher level can use it as if it were an atom: PIECE MODIFIER (e.g. $simple_math_term+), and it will be parsed the way you'd expect. The way it currently is, I believe the $simple_expression regex has a bug in it: it matches NUM or VAR (OP NUM or VAR)* instead of (NUM or VAR) (OP (NUM or VAR))*.


About Ovid

Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/