Regexp::Assemble - Weekly Travelling in CPAN

This has been on my mind for quite a while. The idea originally came up on Twitter, when Mohammad Anwar, the maintainer of "The Weekly Challenge", suggested that the community should re-publish the CPAN Weekly, a newsletter that existed before I joined the hacker community. Our plan was to start the newsletter in December 2022, but that collided with the Advent Calendar, so it was not a good time. After many twists and turns, and after being busy with job hunting in late 2022 and early 2023 (settled now), I will now try to act as a tour guide and visit some CPAN modules (or distributions) with you in a casual manner.

Destination: Regexp::Assemble

Date of Latest Release: Jun 20, 2017
Distribution: Regexp-Assemble
Module version: 0.38
Main Contributors: David Landgren and Ron Savage (RSAVAGE)

Regexp::Assemble is used for combining regular expressions.

use feature 'say';
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add('cat', 'rat');
say $ra->re;
say $ra->as_string;
# (?^:[cr]at)
# [cr]at

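The two methods of the module you will probably use most frequently, as_string and re, differ subtly: as_string returns the pattern as a plain string, while re returns a compiled regular expression object, so a flag such as /i can only be added afterwards when you work with the string: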
my $r = Regexp::Assemble->new;
my @roman = qw/I II III IV V
               VI VII VIII IX X
               XI XII XIII XIV XV
               XVI XVII XVIII IXX XX/;
$r->add(@roman);
say "\$r->re:", "\n", $r->re;
my $rx = $r->as_string;
say "\$r->as_string:", "\n", $r->as_string;
say "Matched." if "vii" =~ /$rx/i;
say "This won't be printed." if "vii" =~ $r->re;

# $r->re:
# (?^:(?:X(?:V(?:I(?:I?I)?)?|I(?:I?I|V)?|X)?|I(?:I?I|X?X|V)?|V(?:I(?:I?I)?)?))
# $r->as_string:
# (?:X(?:V(?:I(?:I?I)?)?|I(?:I?I|V)?|X)?|I(?:I?I|X?X|V)?|V(?:I(?:I?I)?)?)
# Matched.
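
The same behaviour can be reproduced without the module. The following is a minimal sketch in core Perl only, where a short hand-written pattern stands in for the module's output (the pattern and the variable names are illustrative assumptions, not something Regexp::Assemble produced):

```perl
use strict;
use warnings;
use feature 'say';

# Hand-written stand-in for a pattern string such as as_string returns (assumption).
my $string = '(?:I(?:I?I|V)?|V)';

# Compiled once without /i, comparable to the object that re returns.
my $object = qr/$string/;

# A plain string is compiled where it is used, so the /i given here applies to it.
my $string_hit = "vii" =~ /$string/i ? 1 : 0;

# A qr// object carries its own flags; it was compiled case-sensitively.
my $object_hit = "vii" =~ $object ? 1 : 0;

say "string with /i: $string_hit";
say "qr object:      $object_hit";
```

If case-insensitive matching is needed together with the compiled object, the flag has to be decided before the pattern is compiled rather than at the point of matching.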

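The module is mainly meant for performance: a single assembled pattern is matched once, instead of trying many separate patterns one after another.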

my $rs0 = Regexp::Assemble->new;
$rs0->add('[0-9]+');
$rs0->add('[0-9a-f]+');
$rs0->add('[0-9A-F]+');
say $rs0->re;

# (?^:(?:[0-9A-F]+|[0-9a-f]+|[0-9]+))
# though you can write [0-9A-Fa-f]+ equivalently.

It can be quite readable if your words are "similar" enough:

my $rday = Regexp::Assemble->new;
$rday->add('Wednesday');
$rday->add('Wed');
$rday->add('We');
$rday->add('W');
$rday->add('wednesday');
$rday->add('WednesdaY');
$rday->add('Wednesdy');
$rday->add('Wednseday');
$rday->add('Wedsenady');
say $rday->re;
# (?^:(?:W(?:e(?:d(?:n(?:esd(?:a[Yy]|y)|seday)|senady)?)?)?|wednesday))

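Note that you should not wrap the expressions in slashes (or m/.../). The delimiters are not part of the pattern, so they end up being matched as literal characters: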

my $rs = Regexp::Assemble->new;
$rs->add('m/[0-9]+/');
$rs->add('m/[0-9a-f]+/');
$rs->add('m/[0-9A-F]+/');
say $rs->re;
say "Great!?" if "A9F" =~ $rs->re;
say "Mixed feelings." if "m/A9F/" =~ $rs->re;

# (?^:m\/(?:[0-9A-F]+|[0-9a-f]+|[0-9]+)\/)
# Mixed feelings.
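
The strings handed to add are pattern source, not Perl match expressions, which is why the m and the slashes are taken literally. A small sketch with core Perl only shows the same effect when such a string is interpolated into qr// (the variable names here are illustrative assumptions):

```perl
use strict;
use warnings;
use feature 'say';

my $with_delims = 'm/[0-9]+/';   # delimiters wrongly included in the pattern source
my $bare        = '[0-9]+';      # the pattern source on its own

# The delimiters are literal characters, so a plain number does not match ...
my $plain_hit   = "123" =~ qr{$with_delims} ? 1 : 0;
# ... but a string that really contains "m/" and "/" does.
my $literal_hit = "m/123/" =~ qr{$with_delims} ? 1 : 0;
# Without the delimiters the pattern behaves as intended.
my $bare_hit    = "123" =~ qr{$bare} ? 1 : 0;

say "with delimiters vs 123:    $plain_hit";
say "with delimiters vs m/123/: $literal_hit";
say "bare pattern vs 123:       $bare_hit";
```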

There are some other regular expression combinators on CPAN; one of these is Regexp::Trie.
It is generally faster than Regexp::Assemble but has fewer features.
Let's have a performance check:

# MATCHING FIRST 20 ROMAN NUMERALS
# -- FROM Weekly Travelling in CPAN, Mar 07 2023
# The first author of the module, David Landgren, 
# has written a general example on matching Roman Numerals:
# https://github.com/ronsavage/Regexp-Assemble
# /blob/master/examples/roman
# The idea of the check script here is modified from that.

use List::Util qw/shuffle sample any/;
use Regexp::Assemble;
use Regexp::Trie;
use feature 'say';

sub repr {
    return sample int 4*rand(), shuffle('I' x (int 4*rand()), 'V', 'X');
}

my $size = 1000;

sub c0 {
    my $count = 0;
    for (1..$size) {
        my $letters = repr();
        $count++ if any {$letters =~ /^$_$/} @roman;
    }
    return $count;
}

my $ra = Regexp::Assemble->new;
$ra->anchor_line;
$ra->add(@roman);
my $ra_computed = $ra->re;

sub c1 {
    my $count = 0;
    for (1..$size) {
        $count++ if repr() =~ $ra_computed;
    }
    return $count;
}

my $rt = Regexp::Trie->new;
$rt->add($_) for @roman;
my $rt_computed = $rt->regexp;

sub c2 {
    my $count = 0;
    for (1..$size) {
        $count++ if repr() =~ /^$rt_computed$/;
    }
    return $count;
}

say c0()/$size;
say c1()/$size;
say c2()/$size;

use Benchmark q/cmpthese/;
cmpthese(10_000, {
    RAW      => sub {c0},
    Assemble => sub {c1},
    Trie     => sub {c2},
});

=pod

0.695
0.691
0.698
           Rate      RAW Assemble     Trie
RAW      43.5/s       --     -92%     -93%
Assemble  550/s    1163%       --     -18%
Trie      668/s    1436%      22%       --

See https://metacpan.org/pod/Regexp::Assemble and https://github.com/ronsavage/Regexp-Assemble for more details, features, and some caveats.

THE HIGHLIGHTED PERL MODULE OF WEEK 10 OF 2023: Regexp::Assemble

5 Comments

Fwiw, the regex engine automatically performs the trie optimization, and has done since Perl 5.10. These days using these modules will probably slow things down, not speed them up. If you find evidence to the contrary please file a bug at https://github.com/Perl/perl5/issues

In the tests I did, an anchored pattern of the roman numerals joined by "|", with the longest numerals first, produced the best results: 75% faster than the patterns produced by Regexp::Assemble or Regexp::Trie.

Nice, but IMHO it would be better if you (also) published these on DEV.to so people outside of the Perl community will also see them.

Were you talking to me there?

I agree with @demerphq! I complemented the benchmark with a case that applies the regexp
$rx = sprintf qr{^(?:%s)$}, join '|', @roman;
and ran it on seven different setups, i.a. Perl 5.10.1 on i586, Perl 5.26.1 on x64, and Perl 5.32.1 on armv5el.

In all the cases, this plain-vanilla precompiled RX was fastest, though by how much, differed: 8–25 times faster than the nail-curlingly slow loop through the array, 18%–60% faster than ::Assemble and ::Trie. Only on the i586 was ::Trie faster (by a hair's breadth 1%!) than ::Assemble.


About C.-Y. Fung

This blog is inactive and replaced by https://e7-87-83.github.io/coding/blog.html; but I post highly Perl-related posts here.