Regexp::Assemble - Weekly Travelling in CPAN | Moments on Perl or other Programming Issues [blogs.perl.org]

Regexp::Assemble - Weekly Travelling in CPAN

By C.-Y. Fung on March 7, 2023 8:33 PM

It has been on my mind for quite a while. The idea was originally suggested on Twitter by Mohammad Anwar, the maintainer of "The Weekly Challenge": the community should re-publish the CPAN Weekly, a newsletter that existed before I joined the hacker community. Our plan was to start the newsletter in December 2022, but that collided with the Advent Calendar, so it was not a good time. After many twists and turns, and a busy stretch of job hunting in late 2022 and early 2023 (settled now), I will try to act as a tour guide and visit some CPAN modules (or distributions) with you in a casual manner.

Destination: Regexp::Assemble

Date of Latest Release: Jun 20, 2017
Distribution: Regexp-Assemble
Module version: 0.38
Main Contributors: David Landgren and Ron Savage (RSAVAGE)

Regexp::Assemble is used for combining regular expressions.
use feature 'say';
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add('cat', 'rat');
say $ra->re;
say $ra->as_string;
# (?^:[cr]at)
# [cr]at

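The two methods of the module you will probably use most frequently, as_string and re, have subtle differences. as_string returns the pattern as a plain string, so flags such as /i given at match time still apply; re returns a compiled regular expression whose (?^:...) wrapper fixes the flags: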

my $r = Regexp::Assemble->new;
# Note: 19 is spelled IXX in this list (the standard form is XIX); the output below reflects that.
my @roman = qw/I II III IV V
               VI VII VIII IX X
               XI XII XIII XIV XV
               XVI XVII XVIII IXX XX/;
$r->add(@roman);
say "\$r->re:", "\n", $r->re;
my $rx = $r->as_string;
say "\$r->as_string:", "\n", $r->as_string;
say "Matched." if "vii" =~ /$rx/i;
say "This won't be printed." if "vii" =~ $r->re;

# $r->re:
# (?^:(?:X(?:V(?:I(?:I?I)?)?|I(?:I?I|V)?|X)?|I(?:I?I|X?X|V)?|V(?:I(?:I?I)?)?))
# $r->as_string:
# (?:X(?:V(?:I(?:I?I)?)?|I(?:I?I|V)?|X)?|I(?:I?I|X?X|V)?|V(?:I(?:I?I)?)?)
# Matched.
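
The difference between the two return values can be reproduced without the module. The following is only a sketch under stated assumptions: the names $as_string_like and $re_like are hypothetical stand-ins for what as_string and re hand back (a plain string and a compiled qr// object), not values produced by Regexp::Assemble itself.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical stand-ins: a plain pattern string and a compiled regex.
my $as_string_like = '[cr]at';     # like the string from as_string
my $re_like        = qr/[cr]at/;   # like the object from re

# A plain string is compiled at match time, so the /i flag applies to it.
my $string_hit = 'CAT' =~ /^$as_string_like$/i ? 1 : 0;

# A qr// object stringifies to (?^:...), which resets the flags inside
# the group, so the outer /i does not reach the pattern.
my $compiled_hit = 'CAT' =~ /^$re_like$/i ? 1 : 0;

say 'string pattern with /i:   ', $string_hit   ? 'matched' : 'no match';
say 'compiled pattern with /i: ', $compiled_hit ? 'matched' : 'no match';
```

This is why "vii" matches /$rx/i in the snippet above but does not match $r->re.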

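Its main purpose is performance: many patterns are merged into a single regular expression. It does not simplify the patterns themselves, though:
my $rs0 = Regexp::Assemble->new;
$rs0->add('[0-9]+');
$rs0->add('[0-9a-f]+');
$rs0->add('[0-9A-F]+');
say $rs0->re;

# (?^:(?:[0-9A-F]+|[0-9a-f]+|[0-9]+))
# though, for an unanchored match, you can write [0-9A-Fa-f]+ equivalently.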

It can be quite readable if your words are "similar" enough:
my $rday = Regexp::Assemble->new;
$rday->add('Wednesday');
$rday->add('Wed');
$rday->add('We');
$rday->add('W');
$rday->add('wednesday');
$rday->add('WednesdaY');
$rday->add('Wednesdy');
$rday->add('Wednseday');
$rday->add('Wedsenady');
say $rday->re;
# (?^:(?:W(?:e(?:d(?:n(?:esd(?:a[Yy]|y)|seday)|senady)?)?)?|wednesday))

Note that you should not put delimiters such as m/.../ around the expressions; they are taken as part of the pattern:
my $rs = Regexp::Assemble->new;
$rs->add('m/[0-9]+/');
$rs->add('m/[0-9a-f]+/');
$rs->add('m/[0-9A-F]+/');
say $rs->re;
say "Great!?" if "A9F" =~ $rs->re;
say "Mixed feelings." if "m/A9F/" =~ $rs->re;

# (?^:m\/(?:[0-9A-F]+|[0-9a-f]+|[0-9]+)\/)
# Mixed feelings.

There are some other regular expression combinators on CPAN; one of these is Regexp::Trie.
It is generally faster than Regexp::Assemble but has fewer features.
Let's have a performance check:

# MATCHING FIRST 20 ROMAN NUMERALS
# -- FROM Weekly Travelling in CPAN, Mar 07 2023
# The first author of the module, David Landgren,
# has written a general example on matching Roman numerals:
# https://github.com/ronsavage/Regexp-Assemble/blob/master/examples/roman
# The idea of the check script here is modified from that.
use List::Util qw/shuffle sample any/;
use Regexp::Assemble;
use Regexp::Trie;
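use feature 'say';

# The same list as in the as_string/re example above, repeated so the script runs on its own.
my @roman = qw/I II III IV V
               VI VII VIII IX X
               XI XII XIII XIV XV
               XVI XVII XVIII IXX XX/;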

sub repr {
    return sample int 4*rand(), 
        shuffle('I' x (int 4*rand()), 'V', 'X');
}

my $size = 1000;

sub c0 {
    my $count = 0;
    for (1..$size) {
        my $letters = repr();
        $count++ if any {$letters =~ /^$_$/} @roman;
    }
    return $count;
}

my $ra = Regexp::Assemble->new;
$ra->anchor_line;
$ra->add(@roman);
my $ra_computed = $ra->re;

sub c1 {
    my $count = 0;
    for (1..$size) {
        $count++ if repr() =~ $ra_computed;
    }
    return $count;
}

my $rt = Regexp::Trie->new;
$rt->add($_) for @roman;
my $rt_computed = $rt->regexp;

sub c2 {
    my $count = 0;
    for (1..$size) {
        $count++ if repr() =~ /^$rt_computed$/;
    }
    return $count;
}

say c0()/$size;
say c1()/$size;
say c2()/$size;

use Benchmark q/cmpthese/;
cmpthese(10_000, {
    RAW => sub {c0}, 
    Assemble => sub {c1},
    Trie => sub {c2},  
});


=pod
0.695
0.691
0.698
           Rate      RAW Assemble     Trie
RAW      43.5/s       --     -92%     -93%
Assemble  550/s    1163%       --     -18%
Trie      668/s    1436%      22%       --

See https://metacpan.org/pod/Regexp::Assemble and https://github.com/ronsavage/Regexp-Assemble for more details, features, and some caveats.

THE HIGHLIGHTED PERL MODULE OF WEEK 10 OF 2023: Regexp::Assemble

Tagged as:

cpan

5 Comments

demerphq | March 8, 2023 3:58 AM

Fwiw, the regex engine automatically performs the trie optimization, and has done since Perl 5.10. These days using these modules will probably slow things down, not speed them up. If you find evidence to the contrary please file a bug at https://github.com/Perl/perl5/issues

In the tests I did, using an anchored pattern of the Roman numerals joined by "|", with the longest numerals first, produced the best results: 75% faster than the patterns produced by Regexp::Assemble or Regexp::Trie.
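
The approach described in this comment can be sketched with core Perl only. This is a minimal illustration, not the commenter's actual test code: the variable names $alternation and $plain_rx are made up here, the tie-break in the sort is an arbitrary choice, and the word list is the one from the post (including its spelling IXX).

```perl
use strict;
use warnings;
use feature 'say';

# The 20 strings used in the post.
my @roman = qw/I II III IV V VI VII VIII IX X
               XI XII XIII XIV XV XVI XVII XVIII IXX XX/;

# Longest alternatives first (ties broken alphabetically), joined with "|".
my $alternation = join '|',
    sort { length($b) <=> length($a) || $a cmp $b } @roman;

# Anchor the whole group and compile it once, outside any loop.
my $plain_rx = qr/^(?:$alternation)$/;

say $plain_rx;
say 'XVIII matches' if 'XVIII' =~ $plain_rx;
say 'IIII matches'  if 'IIII'  =~ $plain_rx;   # IIII is not in the list
```

Since the regex engine builds a trie for such alternations on its own, the pattern needs no module to pre-factor it; whether it is actually faster on a given setup is something to measure, for example by adding it as another entry to the cmpthese call in the post.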

Gábor Szabó - גאבור סבו | March 10, 2023 11:14 PM

Nice, but IMHO it would be better if you (also) published these on DEV.to so people outside of the Perl community will also see them.

demerphq replied to comment from Gábor Szabó - גאבור סבו | March 11, 2023 5:09 AM

Were you talking to me there?

Sebastian Schleussner | March 15, 2023 5:40 PM

I agree with @demerphq! I complemented the benchmark with a case that applies the regexp
$rx = sprintf qr{^(?:%s)$}, join '|', @roman;
and ran it on seven different setups, i.a. Perl 5.10.1 on i586, Perl 5.26.1 on x64, and Perl 5.32.1 on armv5el.

In all the cases, this plain-vanilla precompiled RX was fastest, though by how much, differed: 8–25 times faster than the nail-curlingly slow loop through the array, 18%–60% faster than ::Assemble and ::Trie. Only on the i586 was ::Trie faster (by a hair's breadth 1%!) than ::Assemble.

C.-Y. Fung | March 15, 2023 8:16 PM

Dear demerphq and Sebastian,

Thanks for pointing out the optimization done by the Perl compiler! I will post a corrected performance comparison next week.

About C.-Y. Fung

This blog is inactive and replaced by https://e7-87-83.github.io/coding/blog.html ; but I post highly Perl-related posts here.