Text Processing: Divide and Conquer
Another day, another generic text-processing problem that many developers have had to solve before: I have a list of patterns and need to find out whether they exist in a group of files. If I did not need to do complex post-processing, I could just use the command line like so
grep -ri -f patterns files/
and be done with it; alas, the world is a cruel place :) So I cooked up a simple program that slurps each file in, does the comparison, and tracks the necessary stats. The main problem with this approach is that it is too slow: I had over 4,000 patterns to match against every file. I ran the script under perl -d:NYTProf to confirm my theory. Here is the basic report I got from nytprofhtml:
Calls    P    F  Exclusive  Inclusive  Subroutine
1125978  1    1  386s       386s       main::CORE:match (opcode)
1130211  2    1  1.09s      1.09s      main::CORE:regcomp (opcode)
41804    6    2  378ms      638ms      File::Spec::Unix::canonpath (recurses: max depth 1, inclusive time 159ms)
23226    20   7  131ms      518ms      Path::Class::Dir::stringify
250764   6    1  87.5ms     308ms      File::Spec::Unix::CORE:subst (opcode)
32515    6    2  85.6ms     435ms      File::Spec::Unix::catdir
1236     2    2  56.0ms     447ms      File::Spec::Unix::abs2rel
24462    3    2  49.0ms     49.0ms     File::Spec::Unix::catpath
5968     10   5  43.1ms     325ms      Path::Class::File::stringify
31924    10   3  42.1ms     42.1ms     Path::Class::Entity::_spec
4903     1    1  27.5ms     146ms      File::Spec::Unix::catfile
120      2    2  26.1ms     29.4ms     utf8::SWASHNEW
980      974  5  19.3ms     20.3ms     Encode::utf8::decode_xs (xsub)
6988     6    3  17.0ms     19.2ms     File::Spec::Unix::splitpath
1840     4    1  15.4ms     655ms      Path::Class::Rule::test (recurses: max depth 2, inclusive time 674ms)
1,125,978 matches in 3 megabytes of sample text files took 6 minutes and 26 seconds. This is way too slow, so let us optimize.
I started by looking at the patterns file, and the majority of the patterns are string literals, which is good news. With a little preprocessing I now have all the string literals in a hash whose keys are the first letters of the patterns; each value is a hash whose keys are the patterns themselves and whose values are coderefs that do whatever processing is needed. Now, instead of having to run every literal match every time, I can test the first letter of a token and use that to limit the matching scope. The best case is 'x', which has only 1 pattern to test; the worst is 's', with over 700 patterns.
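The preprocessing code is not shown in this post, so here is a minimal sketch of the data structure described above; the pattern list and the coderef bodies are made-up stand-ins for the real 4,000-pattern file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical literal patterns; the real file held over 4,000.
my @patterns = qw(scalar substr sprintf xyzzy);

# Bucket patterns by first letter: $dispatch{'s'}{'scalar'} = coderef.
my %dispatch;
for my $pat (@patterns) {
    my $first = substr $pat, 0, 1;
    $dispatch{$first}{$pat} = sub { print "matched $pat\n" };
}

# Only patterns sharing the token's first letter are ever tested.
sub check_token {
    my ($token) = @_;
    my $bucket = $dispatch{ substr $token, 0, 1 } or return;
    for my $pat ( keys %$bucket ) {
        $bucket->{$pat}->() if $token =~ /\b\Q$pat\E\b/;
    }
}

check_token('xyzzy');    # best case: only the one 'x' pattern is tried
```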
index instead of m//
Now you might be asking yourself: why didn't I just use index, since I have string literals? index is designed for fast searching of literal substrings. Let's take a look at a simple example which loads 4,000 patterns and matches them against the entire text of Bram Stoker's Dracula, a 16,248-line, 836 KB file.
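The sample program itself did not survive the page extraction, so this is a reconstruction of its likely shape; the pattern list and text are tiny inline stand-ins for the 4,000 patterns and the Dracula e-text.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-ins: in the real run, @patterns held 4,000 literals and
# $text was the slurped 836 KB Dracula file.
my @patterns = qw(blood castle window);
my $text = "The castle window looked out over the castle walls.";

my %count;
for my $pat (@patterns) {
    # \b keeps a pattern from matching inside a longer word.
    $count{$pat}++ while $text =~ /\b\Q$pat\E\b/g;
}

printf "%-6s %d\n", $_, $count{$_} for sort keys %count;
```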
This sample program takes 2 minutes, 10 seconds to run, with the bulk of the time spent doing the 4,000 match operations. Now let's try the same thing with index instead.
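The index variant was also lost in extraction; a comparable sketch, again with inline stand-ins, looks like this:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @patterns = qw(blood castle window);
my $text = "The castle window looked out over the castle walls.";

my %count;
for my $pat (@patterns) {
    # index does a plain substring scan; no regex engine involved.
    my $pos = 0;
    while ( ( $pos = index $text, $pat, $pos ) > -1 ) {
        $count{$pat}++;
        $pos += length $pat;
    }
}

printf "%-6s %d\n", $_, $count{$_} for sort keys %count;
```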
This program processes the entire Dracula text in 4.1 seconds.
The reason I didn't use index is that some of the literal patterns match as substrings of other words, so I get a bunch of false positives. That does not happen with m// because I used \b to match on word boundaries. This was a precautionary measure: with that many input patterns I had to assume there would be some overlap, and it turns out there is. "A gentle look at simple string operations" is a good read if this is new material, and chapter 9 of "Mastering Algorithms with Perl" goes into depth on Boyer-Moore, which both Perl and grep use.
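To make the trade-off concrete, here is a contrived example (the words are my own, not from the original pattern set) where index reports a false positive that a \b-anchored match avoids:

```perl
use strict;
use warnings;

my $pat  = 'cat';
my $text = 'the cattle grazed';

# index happily finds 'cat' inside 'cattle': a false positive.
my $via_index = index( $text, $pat ) > -1 ? 1 : 0;

# \b requires word boundaries on both sides, so there is no match.
my $via_regex = $text =~ /\b\Q$pat\E\b/ ? 1 : 0;

print "index: $via_index, regex: $via_regex\n";    # index: 1, regex: 0
```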
I need tokens
This is what I had:
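The listing referred to here is missing from the extracted page; judging from the surrounding description it was a slurp-and-match loop roughly along these lines (a sketch, with an inline stand-in for the slurped file and made-up patterns):

```perl
use strict;
use warnings;

# Stand-in for the slurped source file contents.
my $content = "connect to the database then disconnect cleanly";

my %stats;
for my $pat (qw(connect disconnect)) {
    # Match against the whole file body; no tokens involved yet.
    $stats{$pat}++ while $content =~ /\b\Q$pat\E\b/g;
}
```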
That was fine, but now, with the need to test the first letter of each word, I need tokens. One of the advantages of making this change is that I can now normalize the text, specifically removing duplicate tokens so fewer match attempts need to be made. List::MoreUtils makes this a breeze.
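The tokenizing code is another casualty of extraction; a sketch of the pipeline described below, with an inline stand-in for the file contents (the lowercasing step is my assumption, mirroring the case-insensitive grep at the top of the post):

```perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use List::Util qw(reduce);

# Stand-in for the slurped file contents.
my $content = "Stop! stop, the cab stopped at the old Stop sign";

$content =~ s/[^\w\s]//g;            # strip characters we do not care about
$content = lc $content;              # normalize case (my assumption)
my @tokens = grep { length > 1 }     # drop empty and one-letter tokens
             split /\s+/, $content;
@tokens = uniq @tokens;              # remove duplicate tokens

# Neither List::MoreUtils nor List::Util has shortest/longest-string
# helpers, but reduce makes them trivial:
my $shortest = reduce { length $a < length $b ? $a : $b } @tokens;
my $longest  = reduce { length $a > length $b ? $a : $b } @tokens;
```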
I read in the whole source file and strip out all the characters I do not care about. Then I split it into tokens using whitespace as the separator and filter out any empty or one-letter tokens. uniq from List::MoreUtils returns the list with the duplicate values removed. I was surprised that neither List::MoreUtils nor List::Util has a function for getting the shortest or longest string in a list. Then again, it is a trivial task to implement them using reduce, as shown above.
Going back to the initial 3 MB of text files, this updated script takes under 3 seconds to execute: less than 1% of the baseline time. I can now run this script 127 times in the time it took the first version to complete one run.
The more I used this little application on larger and larger datasets, the more I wondered whether I could squeeze a little more speed out of it. I started by increasing my sample data to 26 megabytes to simulate a more typical workload. Then I took a look at that foreach loop to see if I could speed things up, and created a simple benchmark script to compare five variations on a solution and time them.
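The benchmark script is not shown, but the comparison chart below has the shape of output from the core Benchmark module's cmpthese. A harness along these lines would produce it; the two method bodies here are placeholder loop variations of my own, not the original five.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Placeholder workload: 1,000 tokens checked against 50 patterns.
my @tokens   = map { "token$_" } 1 .. 1000;
my %patterns = map { ( "token$_" => 1 ) } 1 .. 50;

# Run each variation 200 times and print a rate-comparison chart.
my $chart = cmpthese( 200, {
    method1 => sub { my $hits = grep { exists $patterns{$_} } @tokens },
    method2 => sub {
        my $hits = 0;
        for my $t (@tokens) { $hits++ if exists $patterns{$t} }
    },
} );
```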
          Rate method5 method4 method3 method2 method1
method5  667/s      --     -0%    -34%    -40%    -41%
method4  670/s      0%      --    -34%    -40%    -41%
method3 1010/s     51%     51%      --     -9%    -11%
method2 1115/s     67%     66%     10%      --     -1%
method1 1130/s     69%     69%     12%      1%      --
Method1 turned out to be the fastest, but what does this translate to in real work? Method5 takes 33 seconds to process the 26 MB of text while method1 takes only 22 seconds, a 33.33% reduction in execution time. For much larger datasets I am looking into doing things in parallel; for now, however, the overhead of forking is greater than the amount of work that needs to be done on a per-file basis.