Text::Extract::Word, MsOffice::Word::Surgeon - Weekly Travelling in CPAN
Distribution: Text::Extract::Word
Module version: 0.02
Main Contributors: Stuart Watt (SNKWATT)
License: The Artistic License 2.0
Distribution: MsOffice::Word::Surgeon
Module version: 2.01
Main Contributors: Laurent Dami (DAMI)
License: The Artistic License 2.0
Notice
Thanks to demerphq and Sebastian Schleussner's comments, we should visit a better and easy way to do a regex matching in Perl now. It is not Regexp::Assemble introduced two weeks before - the Perl compiler after version 5.10 has already done the optimization for us! But we have to use the correct syntax. To see what happens, the bottom of this post gives a comparison of regexes for Roman numerals again; we give two more players in the race, /^($r_str_combine)$/ [given my $r_str_combine = join "|", @roman;] and $rx = sprintf qr/^(?:%s)$/, join "|", @roman. We can see the former has almost the same performance as Regexp::Assemble, while the latter is usually the fastest. We do not need to import any modules and get a fast result!
Today let us have a short tour on two modules which can perform a similar function: text searching in MS Office Word documents.
Well, we know there are two common formats of MS Office Word document: .docx and the earlier .doc.
Text::Extract::Word deals with .doc. You can use the function get_all_text to get all text (I'm a verbose tour guide...), or use its object oriented interface which get the text in different location of a document:
# taken from synopsis of the module: my $file = Text::Extract::Word->new("test1.doc"); my $text = $file->get_text(); my $body = $file->get_body(); my $footnotes = $file->get_footnotes(); my $headers = $file->get_headers(); my $annotations = $file->get_annotations(); my $bookmarks = $file->get_bookmarks();
MsOffice::Word::Surgeon deals with .docx. Besides functionalities to extract text, you can also replace text by regular expression, and write a new .docx file.
Here comes a use case of the two modules. I was maintaining a collection of curricula vitae and database of candidate details, but due to a coding bug, some of the CV were missing or overwritten by others' CV. The CVs are in .doc, .docx or .pdf format, and have filenames as CV[number].[extension]. I use Text::Extract::Word and MsOffice::Word::Surgeon to check the MS Office documents.
Note that in the database, since I am physically located in Hong Kong, each candidate name is stored with Chinese characters and English alphabets. And names of some people consist of only two Chinese characters (some, like me, have 3; some people have 4(two characters for the surname, two characters for the given name)), so I chose to take the first two English words in the candidate name field and checked whether the two words are in the corresponding document. ID of each suspicious record will be printed.
Script one:
use utf8; use DBI; use Text::Extract::Word qw(get_all_text); use v5.30.0; for my $num (1..999) { search_cv($num) if -e "CV".$num.".doc" || -e "CV".$num.".DOC"; } sub search_cv { my $cv_id = $_[0]; my $filename = "CV".$cv_id.".doc"; my $dsn = "DBI:mysql:database=cvdb;host=127.0.0.1"; my $dbh = DBI->connect($dsn, 'mydatabaseadmin', 'mypassword', { mysql_enable_utf8 => 1 }); my $fullname; my $first_two_eng; my $sth = $dbh->prepare("SELECT name FROM candidate WHERE id=".$cv_id); $sth->execute; while (my $ref = $sth->fetchrow_hashref) { $fullname = $ref->{'name'}; } $fullname !~ s/[^[:ascii:]]//g; my $engname = $fullname; my $second_space = index($engname, " ", index($engname, " ")+1 ); my $first_two_eng = ($second_space != -1) ? (substr $engname, 0, $second_space) : (substr $engname, 0); my $found; my $text = get_all_text($filename); $found = index($text, $first_two_eng); if ($found != -1) { # say "found: ". $cv_id; } else { say "SUSPICIOUS: ". $cv_id; } }
Script two
use utf8; use DBI; use MsOffice::Word::Surgeon; use v5.30.0; for my $num (1..999) { search_cv($num) if -e "CV".$num.".docx"; } sub search_cv { my $cv_id = $_[0]; my $filename = "CV".$cv_id.".docx"; my $dsn = "DBI:mysql:database=cvdb;host=127.0.0.1"; my $dbh = DBI->connect($dsn, 'mydatabaseadmin', 'mypassword', { mysql_enable_utf8 => 1 }); my $fullname; my $first_two_eng; my $sth = $dbh->prepare("SELECT name FROM candidate WHERE id=".$cv_id); $sth->execute; while (my $ref = $sth->fetchrow_hashref) { $fullname = $ref->{'name'}; } $fullname !~ s/[^[:ascii:]]//g; my $engname = $fullname; my $second_space = index($engname, " ", index($engname, " ")+1 ); my $first_two_eng = ($second_space != -1) ? (substr $engname, 0, $second_space) : (substr $engname, 0); my $found; my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename); my $text = $surgeon->document->plain_text; $found = index($text, $first_two_eng); if ($found != -1) { # say "found: ". $cv_id; } else { say "SUSPICIOUS: ". $cv_id; } }The above is just a very straightforward use case of these two modules. You may explore their POD and use them to suit your need!
# For PDF text search, I will introduce modules later.
Text::Extract::Word, MsOffice::Word::Surgeon
Comparison of Regexes for Roman Numerals
use v5.30.0; use List::Util qw/shuffle sample any/; use Regexp::Assemble; use Regexp::Trie; use feature 'say'; my @roman = qw/I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI XVII XVIII IXX XX/; sub repr { return sample int 4*rand(), shuffle('I' x (int 4*rand()), 'V', 'X'); } my $size = 1000; sub c0 { my $count = 0; for (1..$size) { my $letters = repr(); $count++ if any {$letters =~ /^$_$/} @roman; } return $count; } my $ra = Regexp::Assemble->new; $ra->anchor_line; $ra->add(@roman); my $ra_computed = $ra->re; sub c1 { my $count = 0; for (1..$size) { $count++ if repr() =~ $ra_computed; } return $count; } my $rt = Regexp::Trie->new; $rt->add($_) for @roman; my $rt_computed = $rt->regexp; sub c2 { my $count = 0; for (1..$size) { $count++ if repr() =~ /^$rt_computed$/; } return $count; } my $r_str_combine = join "|", @roman; sub cn { my $count = 0; for (1..$size) { $count++ if repr() =~ /^($r_str_combine)$/; } return $count; } my $rx = sprintf qr/^(?:%s)$/, join "|", @roman; sub cx { my $count = 0; for (1..$size) { $count++ if repr() =~ $rx; } return $count; } say c0()/$size; say c1()/$size; say c2()/$size; say cn()/$size; say cx()/$size; use Benchmark q/cmpthese/; cmpthese(10_000, { RAW => sub {c0}, Assemble => sub {c1}, Trie => sub {c2}, naive => sub {cn}, QR => sub {cx} });Result:
0.705 0.691 0.68 0.681 0.708 Rate RAW Trie naive Assemble QR RAW 42.7/s -- -94% -94% -94% -95% Trie 669/s 1468% -- -6% -7% -23% naive 711/s 1565% 6% -- -2% -18% Assemble 724/s 1595% 8% 2% -- -17% QR 867/s 1932% 30% 22% 20% --
Leave a comment