Text Processing Part 2: More Speed

In my previous post, Text Processing: Divide and Conquer, I took a text processing problem, profiled it, and then developed a few possible solutions. I benchmarked these options and now use the fastest solution… that I tested for. Two comments posted on that article gave insight into different and faster ways to solve the problem.

Back to the regular expression solution

Initially I just had an array of patterns that I fed through qr/$_/ixms, and it was slow. I had not considered using alternation because I thought it would be too slow. Perl 5.10 fixed this kind of problem, but I was so used to how alternation performed before that I had not considered it since. With this new information in hand, I created a new benchmark set to compare performance. Here are the old numbers:

./method_bench.pl
         Rate method4 method5 method3 method1 method2
method4 585/s      --     -0%    -35%    -40%    -40%
method5 586/s      0%      --    -35%    -40%    -40%
method3 898/s     53%     53%      --     -8%     -8%
method2 972/s     66%     66%      8%      0%      --
method1 972/s     66%     66%      8%      --     -0%
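
To make the difference between the two regex approaches concrete, here is a minimal sketch. The pattern list and sub names are made up for illustration and are not the ones from my script; the point is only the contrast between looping over an array of compiled patterns and joining them into one alternation.

```perl
use strict;
use warnings;

# Hypothetical patterns; the real list from the script is not shown in this post.
my @patterns = ('error', 'warn(?:ing)?', 'fatal');

# Array-of-patterns approach: compile each pattern, then try them one by one.
my @compiled = map { qr/$_/ixms } @patterns;

sub matches_any_loop {
    my ($line) = @_;
    for my $re (@compiled) {
        return 1 if $line =~ $re;
    }
    return 0;
}

# Alternation approach: wrap each pattern in a non-capturing group,
# join them with '|', and compile the result once.
my $alternation = join '|', map { "(?:$_)" } @patterns;
my $one_regex   = qr/$alternation/ixms;

sub matches_any_alternation {
    my ($line) = @_;
    return $line =~ $one_regex ? 1 : 0;
}

for my $line ('A WARNING here', 'all quiet') {
    printf "%-16s loop=%d alternation=%d\n",
        $line, matches_any_loop($line), matches_any_alternation($line);
}
```

Both subs should agree on every input; only the amount of work the regex engine does per line differs.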

Method1 was copied into the new test as a baseline. I purposely added a solution that I knew would be slower, ‘regex_overhead’, just to see what would happen. ‘regex_assem’ shows no performance difference whether or not I use $regex3->re().

Running the new benchmark script, I get the following numbers:

./comments_bench.pl 
                 Rate   regex_assem        method1 regex_overhead      one_regex
regex_assem    2.38/s            --           -21%           -86%           -88%
method1        3.00/s           26%             --           -83%           -85%
regex_overhead 17.5/s          635%           482%             --           -13%
one_regex      20.0/s          742%           567%            15%             --
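
Comparison tables in this format can be produced with cmpthese from the core Benchmark module. The sketch below is not the comments_bench.pl script itself; the patterns, sample lines, and sub names are assumptions chosen only to show the shape of such a benchmark, and the rates it prints will differ from the numbers above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical patterns and sample data, made up for illustration.
my @patterns = ('error', 'warn(?:ing)?', 'fatal', 'timeout');
my @lines;
for my $n (1 .. 500) {
    push @lines, "line $n is fine", "line $n has a WARNING", "line $n timeout";
}

my @compiled  = map { qr/$_/ixms } @patterns;
my $joined    = join '|', map { "(?:$_)" } @patterns;
my $one_regex = qr/$joined/ixms;

# Count matching lines by trying each compiled pattern in turn.
sub count_with_loop {
    my $count = 0;
    LINE: for my $line (@lines) {
        for my $re (@compiled) {
            if ($line =~ $re) {
                $count++;
                next LINE;
            }
        }
    }
    return $count;
}

# Count matching lines with a single alternation regex.
sub count_with_alternation {
    my $count = 0;
    for my $line (@lines) {
        $count++ if $line =~ $one_regex;
    }
    return $count;
}

# Run each candidate 200 times and print a rate comparison table.
cmpthese(200, {
    pattern_loop => \&count_with_loop,
    one_regex    => \&count_with_alternation,
});
```

Passing a negative count such as -5 instead of 200 tells Benchmark to run each candidate for at least that many CPU seconds, which tends to give steadier rates on small inputs.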

These results match up with Aaron Crane’s comment about a major speed increase. Even the test with extra overhead is several times faster than the method1 I had. So I went back to the script I wrote and changed it to use regular expressions with alternation. I tested the two versions against a test data set of a few hundred files totaling 36 MB. The old script took 37 seconds to process the test data while the new version takes only 9 seconds, about 25% of the baseline. Bumping the test data up to 135 MB, the old way takes 105 seconds and the new way 17 seconds, or about 16% of the baseline.

What can I say? The Perl community came through for me. I wrote an article about how I solved a problem, and two people came along and gave me advice that led to a faster solution and, since speed was the goal, a better solution. Thank you.

About Kimmel

I like writing Perl code, and since most of it is open source I might as well talk about it too. @KirkKimmel on Twitter