<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>The Incredible Journey</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/" />
    <link rel="self" type="application/atom+xml" href="http://blogs.perl.org/users/ben_bullock/atom.xml" />
    <id>tag:blogs.perl.org,2009-11-03:/users/ben_bullock//392</id>
    <updated>2013-03-16T13:44:27Z</updated>
    <subtitle>A strange game. The only winning move is not to play.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 4.38</generator>

<entry>
    <title>Text::Fuzzy now with transpositions</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/2013/03/textfuzzy-now-with-transpositions.html" />
    <id>tag:blogs.perl.org,2013:/users/ben_bullock//392.4441</id>

    <published>2013-03-16T13:30:17Z</published>
    <updated>2013-03-16T13:44:27Z</updated>

    <summary>The Text::Fuzzy module for approximate string matches, especially over lists, now also handles the Damerau-Levenshtein edit distance, thanks to a patch from Nick Logan (UGEXE). This is an edit distance where the difference between &quot;tarp&quot; and &quot;trap&quot; is one rather...</summary>
    <author>
        <name>Ben Bullock</name>
        
    </author>
    
    <category term="textfuzzyeditdistance" label="Text::Fuzzy edit-distance" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/ben_bullock/">
        <![CDATA[<p>The <a href="https://metacpan.org/release/Text-Fuzzy">Text::Fuzzy</a> module for approximate string matches, especially over lists, now also handles the Damerau-Levenshtein edit distance, thanks to a patch from Nick Logan (UGEXE). This is an edit distance where the difference between "tarp" and "trap" is one rather than two. Please download <a href="https://metacpan.org/source/BKB/Text-Fuzzy-0.10_01">this developer release</a> to try it out.<br />
</p>]]>
        <![CDATA[<p>In other news, searches across Unicode lists are now slightly (about 10-25%) faster. This is due to a filter which rejects impossible matches on the basis of the intersection of the alphabets of the strings. <a href="https://metacpan.org/source/BKB/Text-Fuzzy-0.10_01/examples">Examples are in the CPAN distribution</a>, and <a href="https://github.com/benkasminbullock/Text-Fuzzy/tree/master/benchmarks">benchmarking code is in the github repository</a>.</p>
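
<p>To illustrate the difference between the two edit distances, here is a minimal sketch in Python rather than the module's own C code. It implements the restricted form of the Damerau-Levenshtein distance (often called optimal string alignment), which may not be exactly the variant used in Text::Fuzzy. The function name "edit_distance" and its "transpositions" flag are invented for this illustration.</p>

```python
def edit_distance(a, b, transpositions=False):
    """Edit distance between a and b using the full dynamic programming matrix.

    With transpositions=True, swapping two adjacent characters counts as
    one edit (restricted Damerau-Levenshtein, an assumption of this sketch).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # The last two characters are swapped: one transposition.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

<p>With transpositions switched off, "tarp" and "trap" are two edits apart; with them switched on, the distance is one.</p>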

<p>The transposition code should still be considered experimental at this stage, although it is passing all the tests from its parent module, <a href="https://metacpan.org/release/Text-Levenshtein-Damerau-XS">Text::Levenshtein::Damerau::XS</a>.<br />
</p>]]>
    </content>
</entry>

<entry>
    <title>CPAN::Nearest is now Text::Fuzzy</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/2013/03/cpannearest-is-now-textfuzzy.html" />
    <id>tag:blogs.perl.org,2013:/users/ben_bullock//392.4425</id>

    <published>2013-03-13T14:21:41Z</published>
    <updated>2013-03-14T06:34:13Z</updated>

    <summary>The module CPAN::Nearest for finding the closest module to a misspelt name is now integrated into a more general module, Text::Fuzzy....</summary>
    <author>
        <name>Ben Bullock</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/ben_bullock/">
        <![CDATA[<p>The module <a href="https://metacpan.org/release/CPAN-Nearest">CPAN::Nearest</a> for finding the closest module to a misspelt name is now integrated into a more general module, <a href="https://metacpan.org/release/Text-Fuzzy">Text::Fuzzy</a>.<br />
</p>]]>
        <![CDATA[<p>The CPAN::Nearest module will continue to be maintained, but the algorithm it uses has been put into Text::Fuzzy so that it can be applied to a wider variety of searches. The basic notion of CPAN::Nearest and Text::Fuzzy is to speed up an edit-distance based search for the case where the user is looking for the nearest match over a list of entries. These modules use a variety of tricks to improve the speed of calculation of the Levenshtein edit distance beyond the very slow "dynamic programming" algorithm shown in textbooks.</p>

<p>Text::Fuzzy is applicable to the general case of misspellings, and comes with two example scripts, <a href="http://www.lemoda.net/perl/text-fuzzy-spellchecker/index.html">a spell-checker based on /usr/dict/words</a>, and <a href="http://www.lemoda.net/perl/perl-mod-speling/index.html">a cgi script</a> which does something like <a href="http://httpd.apache.org/docs/2.2/mod/mod_speling.html">mod_speling</a>. </p>

<p>Text::Fuzzy also offers edit distance based searches across Unicode strings, treating each character as a single entity. The Unicode searches are not quite as souped up as the ASCII searches, because some of the tricks are more difficult to apply.</p>

<p><b>Update</b><br />
I've added another example, a <a href="http://www.lemoda.net/perl/extract-kana/index.html">Unicode-based fuzzy search over a dictionary</a>. I also found an error in the documentation of Text::Fuzzy, which is now corrected with version 0.09.</p>

<p><b>Update 2</b><br />
On closer examination, the speed of the new and old modules is actually roughly the same, so I have removed the statements about speed differences. I didn't remember adding anything to Text::Fuzzy beyond what was in CPAN::Nearest, so the differences were probably due to a configuration error.<br />
</p>]]>
    </content>
</entry>

<entry>
    <title>Perl module for identifying Chinese IP addresses</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/2013/02/perl-module-for-identifying-chinese-ip-addresses.html" />
    <id>tag:blogs.perl.org,2013:/users/ben_bullock//392.4303</id>

    <published>2013-02-10T12:16:07Z</published>
    <updated>2013-02-10T12:23:42Z</updated>

    <summary>This new Perl module IP::China provides a lookup for internet addresses which rapidly tells whether they are from China. It is based on a binary search of a database of about 2600 ranges, from the MaxMind GeoLite IP database....</summary>
    <author>
        <name>Ben Bullock</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/ben_bullock/">
        <![CDATA[<p>This <a href="https://metacpan.org/release/IP-China">new Perl module IP::China</a> provides a lookup for internet addresses which rapidly tells whether they are from China. It is based on a binary search of a database of about 2600 ranges, from <a href="http://dev.maxmind.com/geoip/geolite">the MaxMind GeoLite IP database</a>.</p>]]>
        <![CDATA[<p>This module is written in C and uses a binary search of the database, which is compiled into the code. Thus it should be extremely fast. The total size of the binary of the module is less than 30,000 bytes when compiled, so memory use should also be very small.<br />
</p>]]>
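        <![CDATA[<p>The lookup idea can be sketched in a few lines of Python. This is only an illustration, not the module's C code: the ranges below are made-up entries using reserved documentation addresses, not data from the GeoLite database, and the names "RANGES", "STARTS" and "in_ranges" are invented for the example.</p>

```python
import bisect
import ipaddress


def to_int(address):
    """Convert a dotted IPv4 address into an integer."""
    return int(ipaddress.IPv4Address(address))


# Hypothetical table of (first, last) addresses, sorted and non-overlapping.
RANGES = [
    (to_int("192.0.2.0"), to_int("192.0.2.255")),
    (to_int("198.51.100.0"), to_int("198.51.100.255")),
    (to_int("203.0.113.0"), to_int("203.0.113.127")),
]
STARTS = [first for first, _ in RANGES]


def in_ranges(address):
    """Return True if the address falls inside one of the ranges."""
    value = to_int(address)
    # Binary search for the last range which starts at or before the address.
    index = bisect.bisect_right(STARTS, value) - 1
    if index == -1:
        return False
    return RANGES[index][1] >= value
```

<p>A binary search over about 2600 sorted ranges needs at most twelve comparisons per lookup, since two to the power of twelve is 4096.</p>]]>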
    </content>
</entry>

<entry>
    <title>Upgraded module Lingua::JA::FindDates</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/2013/02/upgraded-module-linguajafinddates.html" />
    <id>tag:blogs.perl.org,2013:/users/ben_bullock//392.4277</id>

    <published>2013-02-08T07:37:36Z</published>
    <updated>2013-02-08T08:10:02Z</updated>

    <summary>I have upgraded a module, Lingua::JA::FindDates, for scanning text to find dates in Japanese format....</summary>
    <author>
        <name>Ben Bullock</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/ben_bullock/">
        <![CDATA[<p>I have upgraded a module, <a href="http://search.cpan.org/perldoc?Lingua::JA::FindDates">Lingua::JA::FindDates</a>, for scanning text to find dates in Japanese format.<br />
</p>]]>
        <![CDATA[<p>This module is for parsing Japanese text, extracting dates, and then, for example, converting them into another language. It is meant to be a component of a translation or data extraction system.</p>

<p>The code is basically a very big set of regular expressions. I have improved the internals of the code using qr// (the quote-like operator which compiles a regular expression) and /x (the extended regular expression flag, which allows whitespace and comments in a pattern) to make the regular expressions more readable. There is also one bug fix.</p>]]>
    </content>
</entry>

<entry>
    <title>New module to read a MATLAB .mat file into Perl</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/2012/12/new-module-to-read-a-matlab-mat-file-into-perl.html" />
    <id>tag:blogs.perl.org,2012:/users/ben_bullock//392.4134</id>

    <published>2012-12-17T03:43:30Z</published>
    <updated>2012-12-17T03:54:04Z</updated>

    <summary>Data::MATFile is a Perl module to read the MATLAB MAT-File format....</summary>
    <author>
        <name>Ben Bullock</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/ben_bullock/">
        <![CDATA[<p><a href="https://metacpan.org/release/Data-MATFile">Data::MATFile</a> is a Perl module to read the MATLAB MAT-File format.</p>]]>
        <![CDATA[<p>The module can be found at <a href="https://metacpan.org/release/Data-MATFile">https://metacpan.org/release/Data-MATFile</a> and <a href="http://search.cpan.org/~bkb/Data-MATFile-0.01/">http://search.cpan.org/~bkb/Data-MATFile-0.01/</a>. This is a preliminary release for testing and evaluation. It is also <a href="http://prepan.org/module/44xKQxfiW3">on PrePAN</a> and <a href="https://github.com/benkasminbullock/Data-MATFile">github</a>.</p>

<p>The module includes a script "mat2json" which can be used to convert a MAT-File into JSON. People wishing to evaluate the module for their use may find this a convenient aid:</p>

<p>mat2json "some.mat" | json_xs</p>]]>
    </content>
</entry>

<entry>
    <title>Notice - WWW::Imgur end of life</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/2012/12/notice---wwwimgur-end-of-life.html" />
    <id>tag:blogs.perl.org,2012:/users/ben_bullock//392.4126</id>

    <published>2012-12-14T02:18:43Z</published>
    <updated>2012-12-15T03:20:27Z</updated>

    <summary>The CPAN module WWW::Imgur for uploading images to the imgur.com website via Perl is now ceasing development. Due to a change in Imgur&apos;s terms and conditions, the current maintainer will no longer be using the imgur.com website for the purpose...</summary>
    <author>
        <name>Ben Bullock</name>
        
    </author>
    
    <category term="wwwimgurendoflifecpanmodule" label="www;;imgur end-of-life cpan module" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/ben_bullock/">
        <![CDATA[<p>The CPAN module <a href="http://metacpan.org/module/WWW::Imgur">WWW::Imgur</a> for uploading images to the <a href="http://imgur.com">imgur.com</a> website via Perl is now ceasing development. Due to a change in Imgur's terms and conditions, the current maintainer will no longer be using the imgur.com website for the purpose WWW::Imgur was created for, and thus the module will not be altered for the version 3 API.</p>]]>
        <![CDATA[<p>People who want to continue using the WWW::Imgur module can either continue using it with the version 2 API until that no longer functions, or take over the development of the module and upgrade it to version 3.</p>

<p><b>Edit:</b></p>

<p>This module should probably be mothballed. Anyone who wants to make an equivalent module should create one which is specific to a particular version of the Imgur API. For example,</p>

<p>WWW::Imgur::API3</p>

<p>or</p>

<p>WWW::Imgur3</p>

<p>etc.</p>]]>
    </content>
</entry>

<entry>
    <title>testanything.org</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/2012/07/testanythingorg.html" />
    <id>tag:blogs.perl.org,2012:/users/ben_bullock//392.3542</id>

    <published>2012-07-13T06:07:02Z</published>
    <updated>2012-07-13T08:35:33Z</updated>

    <summary>There seems to be a problem with spam on the Perl-related website &quot;testanything.org&quot;: http://testanything.org/wiki/index.php/Special:Recentchanges This has been going on for about a year I think. Almost all of the pages on that wiki are spam pages, as you will find...</summary>
    <author>
        <name>Ben Bullock</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/ben_bullock/">
        <![CDATA[<p>There seems to be a problem with spam on the Perl-related website "testanything.org":</p>

<p><a href="http://testanything.org/wiki/index.php/Special:Recentchanges">http://testanything.org/wiki/index.php/Special:Recentchanges</a></p>

<p>This has been going on for about a year I think. Almost all of the pages on that wiki are spam pages, as you will find out if you try the "random page" feature:</p>

<p><a href="http://testanything.org/wiki/index.php/Special:Random">http://testanything.org/wiki/index.php/Special:Random</a><br />
</p>]]>
        <![CDATA[<p><br />
Since people still seem to think that the site is valid, I wrote to the person who owns the site, but didn't get a reply.</p>

<p>If this is not a concern, then never mind, but just in case I am the only person who has noticed this, I would like to draw attention to it.</p>]]>
    </content>
</entry>

<entry>
    <title>Find a misspelt module name</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/ben_bullock/2012/06/find-a-misspelt-module-name.html" />
    <id>tag:blogs.perl.org,2012:/users/ben_bullock//392.3420</id>

    <published>2012-06-22T06:52:32Z</published>
    <updated>2012-07-16T23:52:23Z</updated>

    <summary>I often make typing mistakes. The other day I upgraded to Perl 5.14. I decided to not use the old libraries of Perl 5.10.1 and Perl 5.12.3 which ./Configure suggested, since sometimes these don&apos;t work properly. So I had to...</summary>
    <author>
        <name>Ben Bullock</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/ben_bullock/">
        <![CDATA[<p>I often make typing mistakes. The other day I upgraded to Perl 5.14. I decided to not use the old libraries of Perl 5.10.1 and Perl 5.12.3 which ./Configure suggested, since sometimes these don't work properly. So I had to reinstall all Perl modules. It turned out that I needed quite a few. My bad typing caused several problems. For example, I typed "cpanm CGI::Compress<b>:</b>Gzip" instead of Compress<b>::</b>Gzip and "cpanm Lingua::Stop<b>w</b>ords" instead of Stop<b>W</b>ords.</p>]]>
        <![CDATA[<p>Because I'm a terrible typist, I know that Git is very good at finding mistakes. If I type "git cofig", git gives me a message saying <br />
<pre>
git: 'cofig' is not a git command. See 'git --help'.

Did you mean this?
	config
</pre><br />
I wondered if it would be possible to do the same thing for misspelt CPAN module names. So as an experiment, I started a module, <a href='http://metacpan.org/release/CPAN-Nearest'>CPAN::Nearest</a>, to test the proposition. The good news is that it works:<br />
<pre>
$ ./nearest-module CGI::Compress:Gzip
Closest to 'CGI::Compress:Gzip' is 'CGI::Compress::Gzip'.
$ ./nearest-module Lingua::Stopwords
Closest to 'Lingua::Stopwords' is 'Lingua::StopWords'.
</pre><br />
Unfortunately, the volume of names of modules in the package list means it takes a second or two to run:<br />
<pre>
$ time ./nearest-module Lingua::Stopwords
Closest to 'Lingua::Stopwords' is 'Lingua::StopWords'.

real	0m0.803s
user	0m0.771s
sys	0m0.012s
</pre><br />
The slowness is not because the file is big. For an uncompressed package file, reading it takes only 0.05 seconds. The bottleneck is the edit distance search.</p>

<p>Perhaps if the speed can be improved, this module might be useful for people writing module installation scripts. Even better, maybe someone can work out how to make it run faster, perhaps by using a better algorithm, or perhaps by caching the results of common spelling mistakes.</p>

<p><a href='https://github.com/benkasminbullock/nearest-module'>The module is on github</a> as well as CPAN, so please have a go.</p>

<p><b>Update:</b></p>

<p>An improvement on the Levenshtein edit distance search is to have a "cutoff" in the algorithm so that it gives up on a candidate once it finds that the edit distance is bigger than a maximum. This improvement roughly doubled the speed of the search. I also instituted a "maximum possible distance" cutoff of 10 edits, which is the furthest I think a useful match could be, and if nothing is found within that distance the module returns undef. The speed of version 0.03 is as follows:<br />
<pre>
$ time ./nearest-module Lingua::Stopwords
Closest to 'Lingua::Stopwords' is 'Lingua::StopWords'.

real	0m0.393s
user	0m0.392s
sys	0m0.001s
</pre><br />
This is compared to version 0.02 above (version 0.01 had a bug so results were meaningless).</p>
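
<p>The cutoff idea can be sketched as follows. This is an illustration in Python, not the module's C code, and the names "bounded_distance" and "nearest" are invented for the example. The sketch also tightens the limit each time a closer candidate is found, which is my assumption about how such a search would naturally be written rather than a description of CPAN::Nearest.</p>

```python
def bounded_distance(a, b, max_distance):
    """Levenshtein distance between a and b, or None if it exceeds max_distance."""
    # The difference in lengths is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_distance:
        return None
    # Full (len(a) + 1) x (len(b) + 1) matrix, as in the textbook algorithm.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0] = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        d[i][0] = i
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        # Row minima never decrease, so once a whole row is over the
        # limit the final distance must be over it as well: give up.
        if min(d[i]) > max_distance:
            return None
    distance = d[len(a)][len(b)]
    return None if distance > max_distance else distance


def nearest(term, candidates, max_distance=10):
    """Return the closest candidate within max_distance edits, or None."""
    best = None
    limit = max_distance
    for candidate in candidates:
        distance = bounded_distance(term, candidate, limit)
        if distance is None:
            continue
        if best is None or limit > distance:
            best = candidate
            limit = distance  # later candidates must do at least this well
    return best
```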

<p><b>Update part two: a reduced version of the edit distance calculating matrix</b></p>

<p>The original algorithm used a (length string 1) x (length string 2) matrix, but it is possible to reduce this to a 2 x (length string 2) sized matrix. I didn't think this would make much difference to the speed, but it does. Here is the speed for version 0.04 of CPAN::Nearest:<br />
<pre>
$ time ./nearest-module Lingua::Stopwords
Closest to 'Lingua::Stopwords' is 'Lingua::StopWords'.

real	0m0.186s
user	0m0.177s
sys	0m0.008s
</pre><br />
So it's more than twice as fast as the previous version, and more than four times as fast as the original.</p>
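
<p>The reduction works because each row of the matrix depends only on the row before it. A minimal sketch in Python (not the module's C code; the function name "two_row_distance" is invented for this illustration):</p>

```python
def two_row_distance(a, b):
    """Levenshtein distance using two rows of length len(b) + 1."""
    previous = list(range(len(b) + 1))   # row for the empty prefix of a
    current = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current[0] = i
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            current[j] = min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            )
        # The row just filled becomes the "previous" row for the next
        # character of a; the old row is reused as scratch space.
        previous, current = current, previous
    return previous[len(b)]
```

<p>Memory use falls from (length string 1) x (length string 2) cells to 2 x (length string 2) cells, and the two small rows are more likely to stay in the processor's cache, which is one plausible reason for the speedup.</p>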

<p><b>Update 3: why is the uncompression so slow?</b></p>

<p>gzip -d on the file only takes 0.15 seconds:</p>

<pre>
$ time gzip -d --keep 02packages.details.txt.gz

real	0m0.152s
user	0m0.067s
sys	0m0.022s
</pre>
<p>Yet the C routines I used take two seconds to read the file. There must be something wrong: since the file is not even being written to disc, it doesn't make sense for it to be slower than the above.</p>

<p><b>Update 4: culprit is gzgets</b></p>

<p>The culprit for the extremely slow performance was the zlib routine "gzgets", which is incredibly slow. With gzgets replaced by gzread, reading the 02packages file takes only about 0.07 seconds more than reading from an uncompressed file. The following timing is for reading from the compressed file:<br />
<pre>
$ time ./nearest-module Lingua::Stopwords
Closest to 'Lingua::Stopwords' is 'Lingua::StopWords'.

real	0m0.259s
user	0m0.258s
sys	0m0.002s
</pre><br />
The new version is on CPAN as version 0.05.</p>

<p>There is <a href='http://stackoverflow.com/questions/2832485/zlib-gzgets-extremely-slow'>more information about gzgets' performance problems on this discussion on stackoverflow.com</a> and <a href='http://mail.madler.net/pipermail/zlib-devel_madler.net/2010-January/thread.html#1177'>on the zlib mailing list</a>. The problem seems to have gone away in recent versions of zlib, but since end-users of the module may not have updated, CPAN::Nearest will stick with not using gzgets.</p>

<p><b>Update 5</b></p>

<p>I found another speedup, which roughly halves the time for a realistic search:<br />
<pre>
Search term: 'Lingo::Jingo::Mojo'
New time: 0.14 Old time: 0.35 Speedup factor: 2.5

Search term: 'Lingua::Stopwords'
New time: 0.17 Old time: 0.24 Speedup factor: 1.4118

Search term: 'Bust::A::Move'
New time: 0.13 Old time: 0.27 Speedup factor: 2.0769

Search term: 'thequickbrownfoxjumpedoverthelazydogTHEQUICKBROWNFOXJUMPEDOVERTHELAZYDOG:'
New time: 0.64 Old time: 0.64 Speedup factor:  1

Search term: 'thequickbrownfoxjumpedoverthelazydogTHEQUICKBROWNFOX'
New time: 0.43 Old time: 0.54 Speedup factor: 1.2558

Search term: 'thequickbrownfoxjumpedoverthelazydog'
New time: 0.29 Old time: 0.47 Speedup factor: 1.6207
</pre><br />
All times are in seconds, and are the outputs of the script "time-performance.pl" in the distribution. The speedup relies on a trick of checking all the characters in each entry of the module list against an alphabet generated from the search term, and rejecting entries which have too few characters in common with the search term to be within the required edit distance. This saves having to calculate the edit distance for many cases.</p>
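
<p>A minimal sketch of this kind of filter in Python (not the module's C code) follows. It relies on the fact that every character of an entry which does not occur anywhere in the search term must be deleted or substituted, so the number of such characters is a lower bound on the edit distance. Exactly how CPAN::Nearest counts the characters is not shown here; the function names and the counting rule are assumptions made for the illustration.</p>

```python
def alphabet_rejects(term_alphabet, candidate, max_distance):
    """True if candidate cannot be within max_distance edits of the term."""
    misses = 0
    for character in candidate:
        if character not in term_alphabet:
            # This character needs at least one edit of its own.
            misses += 1
            if misses > max_distance:
                return True
    return False


def filter_candidates(term, candidates, max_distance):
    """Keep only the candidates which survive the alphabet check."""
    # Build the alphabet once per search term, not once per candidate.
    term_alphabet = frozenset(term)
    return [
        candidate for candidate in candidates
        if not alphabet_rejects(term_alphabet, candidate, max_distance)
    ]
```

<p>Only the survivors need the full edit distance calculation. The check is a single pass over each entry, which is much cheaper than filling in the matrix.</p>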

<p>By experimenting, I found that there is a cutoff point of about forty-five unique characters in the search term, beyond which the alphabet method doesn't provide any speedup, and in fact starts to slow the algorithm down, so the method is switched off if there are more than a certain number of unique characters in the search term. I haven't counted which CPAN module has the maximum number of characters, but anyway this module may be generalized to other edit distances in future.</p>

<p><b>Update 6</b><br />
Since much of the above time is spent on decompression, let's compare the speeds when searching the uncompressed file:<br />
<pre>
Search term: 'Lingo::Jingo::Mojo'
New time: 0.07 Old time: 0.28 Speedup factor:  4

Search term: 'Lingua::Stopwords'
New time: 0.09 Old time: 0.17 Speedup factor: 1.8889

Search term: 'Bust::A::Move'
New time: 0.05 Old time: 0.19 Speedup factor: 3.8

Search term: 'thequickbrownfoxjumpedoverthelazydogTHEQUICKBROWNFOXJUMPEDOVERTHELAZYDOG:'
New time: 0.57 Old time: 0.57 Speedup factor:  1

Search term: 'thequickbrownfoxjumpedoverthelazydogTHEQUICKBROWNFOX'
New time: 0.35 Old time: 0.47 Speedup factor: 1.3429

Search term: 'thequickbrownfoxjumpedoverthelazydog'
New time: 0.22 Old time: 0.39 Speedup factor: 1.7727
</pre>
For the realistic search terms, the alphabet filter gives a speedup of up to a factor of four, and the total search time is now less than a tenth of a second.</p>
]]>
    </content>
</entry>

</feed>
