Text Processing Part 2: More Speed
In my previous post, Text Processing: Divide and Conquer, I took a text processing problem, profiled it, and developed a few possible solutions. I benchmarked those options and adopted the fastest solution… of the ones I had tested. Two comments posted on that article gave insight into different, faster ways to solve this problem.
Back to the regular expression solution
Initially I just had an array of patterns that I compiled with qr/$_/ixms, and it was slow. I had not considered using alternation because I assumed it would be too slow. Perl 5.10 fixed this kind of problem (with trie-based matching of alternations), but I was so used to how it performed before that I had not revisited it since. With this new information in hand I created a new benchmark set to compare performance. Here are the old numbers:
./method_bench.pl
          Rate method4 method5 method3 method1 method2
method4  585/s      --     -0%    -35%    -40%    -40%
method5  586/s      0%      --    -35%    -40%    -40%
method3  898/s     53%     53%      --     -8%     -8%
method2  972/s     66%     66%      8%      0%      --
method1  972/s     66%     66%      8%      --     -0%
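For context, here is a minimal sketch of the two approaches being compared. The word list and the match helpers are hypothetical stand-ins for the real patterns, not the actual script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pattern list -- stand-ins for the real patterns.
my @words = qw(error warning fatal panic);

# Old approach: compile each pattern separately and loop over them.
my @patterns = map { qr/$_/ixms } @words;

sub match_loop {
    my ($line) = @_;
    for my $re (@patterns) {
        return 1 if $line =~ $re;
    }
    return 0;
}

# New approach: join the patterns into one alternation, compiled once.
# Perl 5.10's trie optimization makes this kind of regex fast.
my $alternation = do {
    my $joined = join '|', @words;
    qr/$joined/ixms;
};

sub match_alternation {
    my ($line) = @_;
    return $line =~ $alternation ? 1 : 0;
}

print match_loop("A fatal mistake"), "\n";         # 1
print match_alternation("A fatal mistake"), "\n";  # 1
print match_alternation("all is well"), "\n";      # 0
```

The key difference is that the loop pays per-pattern match overhead on every line, while the alternation is a single match against one compiled regex.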
Method1 was copied into the new test as a baseline. I purposely added a solution I knew would be slower, 'regex_overhead', just to see what would happen. 'regex_assem' shows no performance difference whether I use $regex3->re() or not.
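The 'regex_assem' variant uses Regexp::Assemble from CPAN. A sketch with hypothetical patterns (not the actual benchmark code) showing why the ->re() call makes no difference: the assembled object is overloaded to work directly in a match, so both forms run the same regex:

```perl
use strict;
use warnings;
use Regexp::Assemble;    # CPAN module, not core

my $ra = Regexp::Assemble->new( flags => 'ixms' );
$ra->add($_) for qw(error warning fatal);

# Either form matches identically:
my $re = $ra->re;                           # explicit assembled qr//
print "explicit\n"   if 'Fatal error' =~ $re;
print "overloaded\n" if 'Fatal error' =~ $ra;  # object used directly
```

Regexp::Assemble merges the patterns into an optimized single regex, but as the benchmark shows, on Perl 5.10+ a plain join-with-'|' alternation already gets the trie optimization and came out faster here.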
Running the new benchmark script, I get the following numbers:
./comments_bench.pl
                 Rate regex_assem method1 regex_overhead one_regex
regex_assem    2.38/s          --    -21%           -86%      -88%
method1        3.00/s         26%      --           -83%      -85%
regex_overhead 17.5/s        635%    482%             --      -13%
one_regex      20.0/s        742%    567%            15%        --
These results match up with Aaron Crane's comment about a major speed increase. Even the variant with extra overhead is several times faster than my original method1. So I went back to the script I wrote and changed it to use a single regular expression with alternation. I tested the two versions against a data set of a few hundred files totaling 36 megabytes: the old script took 37 seconds to process the test data, while the new version takes only 9 seconds, about 24% of the baseline. Bumping the test data up to 135 MB, the old way takes 105 seconds and the new way 17 seconds, about 16% of the baseline.
What can I say? The Perl community came through for me. I wrote an article about how I solved a problem, and two people came along with advice that led to a faster solution; since speed was the goal, that makes it a better solution. Thank you.