TIMTOWTDI - matching regular expressions with events

In the past I had a task, run hundreds thousands of regular experssions
on millions text documents, and it should run fast (run time should
be measured by seconds).
Perl regexp alternation may be good option for that, because it put all
regexps on one trie structure and passes them at once, but I needed more
than simple match. I had to know what regexp was matched and exact position
of match, and much more. Trying to optimize matching process, I found that
my options to interact with regexp engine during matching process are very limited.
I decide to write my own regexp engine, I thought it shouldn't be too
complicated (when I found I'm wrong it was too late :-) ).
Basic idea is that every matching event will have it's own handler and
will run immediately as this event happens. Matching handler will get
relevant information from regexp engine and send back orders how to continue
matching process.
I created Regexp::SAR (Simple API for Regexp) module. Most code written in C/XS.
Some example how it looks like:

    #get third match and stop
    my $sar3 = new Regexp::SAR;
    my $matchedStr3;
    my $matchCount = 0;
    $sar3->addRegexp('\w+', sub {
                                my ($from, $to) = @_;
                                if ($matchCount == 3) {
                                    $matchedStr3 = substr($string3, $from, $to - $from);
                                else {
    $sar3->match('aa11 bb22 cc33 dd44');
    # $matchCount is 3, $matchedStr3 is 'cc33'

Currently there is no too much features in Regexp::SAR module compared
to perl's regexp engine. Many things can be done differently. One example is
that there is no need for '^' character (begin of string), you should
use "matchAt" which allow you run match at specific position and do
not continue, e.g.: matchAt($string, 0);.
I have some ideas what features should be added. First is "multithreaded"
matching. Theoretically it can already run by using "matchAt".
It's possible to create some thread pool where every thread run match at
specific position. Matching from second character could be run in parallel
with matching from first character.
Another idea is to implement anchors. Anchor is event thrown when previous
and next characters has different character type. It also can be implemented
with current module, e.g. anchor for ALPHA followed by DIGIT character:

    my $sar = new Regexp::SAR;
    my $alphaPos;
    $sar->addRegexp('\a', sub { $alphaPos = $_[0] });
    $sar->addRegexp('\d', sub {
                                 my $digitPos = $_[0];
                                 if (defined $alphaPos) {
                                    my $dist = $digitPos - $alphaPos;
                                    if ($dist == 1) {
                                        #PROCESS ANCHOR EVENT
    $sar->match('aa bb2cc dd');

It can be done also in perl's regexp, e.g.:
'\b' can be implemented as: $str =~ /\w\s/;.
and here is main difference between perl's regexp and SAR.
In perl you need to put all your regexp complexity in one string.
If you need to match some string followed somewhere by another string you
create regexp string like /abc.*?def/. In SAR you create two regexp handlers
for 'abc' and 'def' and match them in your way. It gives you more flexibility
for complicated regular expressions.

Another idea is matching on stream, when matching string passed for match
with chunks, e.g.:

    my $sar = new Regexp::SAR;
    $sar->addRegexp('abcdef', sub { print "Matched\n" });
    $sar->matchStream("11111 abc");
    $sar->matchStream("def 22222");

If you have another idea for new feature, please, send me.

Pinkhas Nisanov

Leave a comment

About Pinkhas Nisanov

user-pic I blog about Perl.