Actions from Pinkhas NisanovMovable Type Pro 4.382014-12-19T02:36:31Zhttps://blogs.perl.org/mt/mt-cp.cgi?__mode=feed&_type=actions&blog_id=0&id=3735Posted TIMTOWTDI - matching regular expressions with events to Pinkhas Nisanovtag:blogs.perl.org,2014:/users/pinkhas_nisanov//2559.65942014-12-19T02:36:31Z2014-12-19T03:04:33ZIn the past I had a task, run hundreds thousands of regular experssionson millions text documents, and it should run fast (run time shouldbe measured by seconds).Perl regexp alternation may be good option for that, because it put allregexps on...Pinkhas Nisanovhttps://blogs.perl.org/mt/mt-cp.cgi?__mode=view&blog_id=2559&id=3735
on millions text documents, and it should run fast (run time should be measured by seconds). Perl regexp alternation may be good option for that, because it put all regexps on one trie structure and passes them at once, but I needed more than simple match. I had to know what regexp was matched and exact position of match, and much more. Trying to optimize matching process, I found that my options to interact with regexp engine during matching process are very limited. I decide to write my own regexp engine, I thought it shouldn't be too complicated (when I found I'm wrong it was too late :-) ). Basic idea is that every matching event will have it's own handler and will run immediately as this event happens. Matching handler will get relevant information from regexp engine and send back orders how to continue matching process. I created Regexp::SAR (Simple API for Regexp) module. Most code written in C/XS. Some example how it looks like:
#get third match and stop my $sar3 = new Regexp::SAR; my $matchedStr3; my $matchCount = 0; $sar3->addRegexp('\w+', sub { my ($from, $to) = @_; ++$matchCount; if ($matchCount == 3) { $matchedStr3 = substr($string3, $from, $to - $from); $sar3->stopMatch(); } else { $sar3->continueFrom($to); } }); $sar3->match('aa11 bb22 cc33 dd44'); # $matchCount is 3, $matchedStr3 is 'cc33'
Currently there is no too much features in Regexp::SAR module compared to perl's regexp engine. Many things can be done differently. One example is that there is no need for '^' character (begin of string), you should use "matchAt" which allow you run match at specific position and do not continue, e.g.: matchAt($string, 0);. I have some ideas what features should be added. First is "multithreaded" matching. Theoretically it can already run by using "matchAt". It's possible to create some thread pool where every thread run match at specific position. Matching from second character could be run in parallel with matching from first character. Another idea is to implement anchors. Anchor is event thrown when previous and next characters has different character type. It also can be implemented with current module, e.g. anchor for ALPHA followed by DIGIT character:
my $sar = new Regexp::SAR; my $alphaPos; $sar->addRegexp('\a', sub { $alphaPos = $_[0] }); $sar->addRegexp('\d', sub { my $digitPos = $_[0]; if (defined $alphaPos) { my $dist = $digitPos - $alphaPos; if ($dist == 1) { #PROCESS ANCHOR EVENT } } }); $sar->match('aa bb2cc dd');
It can be done also in perl's regexp, e.g.: '\b' can be implemented as: $str =~ /\w\s/;. and here is main difference between perl's regexp and SAR. In perl you need to put all your regexp complexity in one string. If you need to match some string followed somewhere by another string you create regexp string like /abc.*?def/. In SAR you create two regexp handlers for 'abc' and 'def' and match them in your way. It gives you more flexibility for complicated regular expressions.
Another idea is matching on stream, when matching string passed for match with chunks, e.g.:
my $sar = new Regexp::SAR; $sar->addRegexp('abcdef', sub { print "Matched\n" }); $sar->startStream(); $sar->matchStream("11111 abc"); $sar->matchStream("def 22222"); $sar->stopStream();
If you have another idea for new feature, please, send me.