<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Kirk Kimmel</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/" />
    <link rel="self" type="application/atom+xml" href="http://blogs.perl.org/users/kirk_kimmel/atom.xml" />
    <id>tag:blogs.perl.org,2009-11-03:/users/kirk_kimmel//922</id>
    <updated>2013-06-05T16:02:30Z</updated>
    <subtitle>Solving normal problems with Perl.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 4.38</generator>

<entry>
    <title>Web Services Part 2: Using Joyent</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2013/06/web-services-part-2-using-joyent.html" />
    <id>tag:blogs.perl.org,2013:/users/kirk_kimmel//922.4745</id>

    <published>2013-06-05T14:54:54Z</published>
    <updated>2013-06-05T16:02:30Z</updated>

    <summary>I recently started trying out different cloud providers to find one that meets my Perl needs. I see many uses for cloud computing in the form of on-demand, ubiquitous computing nodes that I can launch with a range of...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>I recently started trying out different cloud providers to find one that meets my Perl needs. I see many uses for cloud computing in the form of on-demand, ubiquitous computing nodes that I can launch with a range of "hardware" specifications. This translates directly into saving money: each hardware plan has its own price, so you pick what you think you need and erase it when you are done. This article focuses on using Joyent to run Perl applications, not on using Perl to interact with the Joyent API; that is a future article already in the works.</p>]]>
        <![CDATA[<p><img alt="sky-clouds-fly.gif" src="http://blogs.perl.org/users/kirk_kimmel/sky-clouds-fly.gif" width="499" height="300" class="mt-image-center" style="text-align: center; display: block; margin: 0 auto 20px;" /></p>

<h2>Joyent</h2>

<p>Six months ago I went to create my first instance and was disappointed. The only Linux distributions on offer were Debian and CentOS, both of which ship old versions of Perl. The Perl programs I plan to use, and the ones I will write, target new versions of Perl. I didn't even think about this until I had a running Debian instance and a program errored out on me. I had used the s///r option that was added in Perl 5.14 and Debian has 5.10, what a bother. I installed perlbrew and Perl 5.16.3, then all the CPAN modules I needed. By the time this was all over I had a newer Perl, but dependency problems from Debian's older packages reared their head again.</p>
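
<p>For anyone who has not met it, the /r flag makes a substitution return the modified copy instead of editing the variable in place. A minimal sketch with made-up variable names; it needs Perl 5.14 or newer, which is exactly why it fails on a 5.10 system perl:</p>

```perl
use strict;
use warnings;
use 5.014;    # the /r flag is a syntax error on older perls such as 5.10

my $original = 'perl 5.10';

# With /r the substitution returns a new string and leaves $original alone.
my $updated = $original =~ s/5\.10/5.16/r;

# /r also works inside map, since each call returns the new value
# without touching the (read-only) list elements.
my @trimmed = map { s/^\s+|\s+$//gr } ( '  foo ', 'bar  ' );

say $original;
say $updated;
say join ',', @trimmed;
```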

<p>Now the Debian users will tell me to switch to testing or unstable or add these other repositories. That is all fine and dandy, but when the dependency is gtk3 and installing it breaks existing applications because gtk2 and gtk3 are not designed to be installed side by side, this poses a whole new kind of problem. I wanted to test some scripts I wrote using Gtk3::WebKit for headless automation. Setting that aside, I spent a month trying out different applications and testing the network. There were no power outages and no network issues at all.</p>

<p><b>Recently Joyent has added two versions of Ubuntu and two of Fedora.</b> The newest of all the options available now is Fedora 18 and I recommend going with that. I used Fedora 17 (Beefy Miracle) extensively to test another cloud hosting provider and having Perl 5.14.4 as the system perl was nice. I did not have to install perlbrew and build another perl to get all the new things. Fedora 18 ships with 5.16.2.</p>

<p>The billing system is missing all kinds of features. You cannot view your bill based on current usage, which is a feature they plan to add. You cannot load a prebuilt image onto Joyent because they have no import capabilities. The closest you can get is to build up an instance and then save it for later, for a continual usage fee. Some of the download links on their site point to documentation that does not exist.</p>

<h2>Would I recommend it?</h2>

<p>If you are planning on setting up a server and leaving it running for long periods of time, Joyent is a good pick. When you want to spin up instances quickly and frequently, the lack of import features for operating system images makes Joyent a bad fit, hence my recommendation for longer jobs. A longer job only has to be set up once, and that time is a small fraction of the whole. The missing features I mentioned above would be really nice to have but do not stop me from using Joyent effectively. Using the API with Fedora 18 would make spinning up instances faster and easier, but the problem of out-of-date packages still persists. Being able to load a custom external image is the way around that problem.</p>

<p>Joyent: use it, but understand it still needs some polish.</p>

<p><img alt="lakitu.jpg" src="http://blogs.perl.org/users/kirk_kimmel/lakitu.jpg" width="545" height="538" class="mt-image-center" style="text-align: center; display: block; margin: 0 auto 20px;" /></p>]]>
    </content>
</entry>

<entry>
    <title>Web Services Part 1: YouTube playlists</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2013/06/web-services-part-1-youtube-playlists.html" />
    <id>tag:blogs.perl.org,2013:/users/kirk_kimmel//922.4744</id>

    <published>2013-06-05T03:12:12Z</published>
    <updated>2013-06-05T04:47:17Z</updated>

    <summary>It started off with just a simple problem; a friend recommended some songs from various artists on YouTube. I go and listen to the songs, like them, and start looking for more from the same artists. Browsing around I find...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>It started off with just a simple problem; a friend recommended some songs from various artists on YouTube. I go and listen to the songs, like them, and start looking for more from the same artists. Browsing around I find some interesting playlists but notice some of them have the same songs and I think "I'll just download the playlists and remove duplicates with a quick script".</p>]]>
        <![CDATA[<p>
<img alt="Research-Meme-Blog-Content.png" src="http://blogs.perl.org/users/kirk_kimmel/Research-Meme-Blog-Content.png" width="400" height="330" class="mt-image-center" style="text-align: center; display: block; margin: 0 auto 20px;" />
</p>
<p>A little search-fu later and I am disappointed. I find broken links, closed source pay applications (yeah, like I am going to pay for a closed source application to access a public API), and a stack of how-to blog posts consisting of nothing more than 'Download this program for Windows and click here, here, and here.' I hit up a few friends on IRC and they all strongly recommended youtube-dl, a Python script that has command line options for getting playlists. Let me be clear about what I mean by strongly recommended: I was told how they had used it for years with no problems, how it was the go-to script for downloading different items from YouTube, and so on. With praise that high my expectations are set pretty high too, meaning I expect it to work for the problem I have.</p>
<h2>youtube-dl the disappointment</h2>
<p>I start by going to <a href="https://github.com/rg3/youtube-dl">github.com/rg3/youtube-dl</a> to pull down the latest version and I notice there are 193 Issues and 8 Pull Requests currently open. I clone the repository, set it up and play with it. I build up the following command to fetch just the playlist from youtube.</p>
<p>./youtube-dl https://www.youtube.com/playlist?list=FLNj0f88cIMwioXOKJE89Lpw --simulate --get-url &gt; music_vids.txt</p>
<p>Now youtube-dl defaults to trying to download the videos when I only want the urls, hence the simulate flag. The above combination did not work nor did the various combinations I tried out, including adding --ignore-errors. Instead I ran into the following problems:</p>
<ul>
<li>youtube-dl is slow. It took 2 minutes 17 seconds to get 36 video urls from a list of over a hundred before failing out. I expect this API call should only take a few seconds.</li>
<li>youtube-dl fails when a video in the playlist has been taken down due to DMCA violations. The application just exits with the error message and does not process the rest of the list.</li>
<li>youtube-dl fails when a video in the playlist has been deleted by the author. Fails and exits.</li>
<li>youtube-dl fails when a video in the playlist has been blocked based on your geo-location. Fails and exits.</li>
<li>The urls youtube-dl returns are for the CDN locations not the canonical urls. CDN urls are not guaranteed to be valid in the long term. At the very minimum this should be noted in the documentation.</li>
</ul>
<p>I checked the 8 pull requests for fixes and nothing, same for the issue queue. By now I have spent about twenty minutes on research and trying things out. At this point I think it will probably be faster for me to write a program to get the playlists than to try and fix this application or keep looking for something else.</p>
<p><img alt="99_problems.png" src="http://blogs.perl.org/users/kirk_kimmel/99_problems.png" width="651" height="241" class="mt-image-none" style="" /></p>

<h2>Build it yourself</h2>
<p>The first step is to see what modules are on CPAN. Now I am going to go into a little more depth about how I research this to show some of the potential problems with our Perl ecosystem, the key one being signal to noise ratio. Finding quality in the mass of everything is not a problem unique to Perl but the Perl community has come up with some interesting solutions. I open up search.cpan.org in one tab and metacpan.org in another. I do this because the search capabilities of both are different, amongst other things. Metacpan's search is great if the keyword you are looking for happens to be part of the module name. </p>
<h2>youtube-playlists more disappointment</h2>
<p>On metacpan when you put "youtube" in the search field the first auto-complete is youtube-playlists, a script that is part of the WWW::YouTube::Download module. It was last updated May 5, 2013, and has zero bug reports, two five star reviews and zero failed test reports. The synopsis shows example usage that looks exactly like what I want. I install WWW::YouTube::Download with no problems and run 'youtube-playlists FLNj0f88cIMwioXOKJE89Lpw', and it returns only 25 results for a playlist with more than one hundred entries. Looking at the POD there are no CLI options to change the result count, and looking at the code I see it only pulls down the first page of results as XML with no mechanism for handling pagination. I consider hacking away at this program, but there must be a YouTube module that handles most of the work instead of just pulling in raw XML feeds and doing it all yourself. I am also surprised this program did not show up in the general search engines, only in the CPAN ones.</p>
<h2>WWW::YouTube</h2>
<p>WWW::YouTube has not been updated since July 28, 2008 and has one review which is one star. The positive points are there are zero bugs in RT and zero failed test reports. The single review raises multiple issues that I agree with based on reading the POD for the different modules WWW::YouTube provides. Therefore WWW::YouTube is not a good fit.</p>
<h2 id="webserviceyoutube">WebService::YouTube</h2>
<p>Last updated Jan. 20, 2009, it has 4 open bug reports and zero failed test reports. Glancing at the POD I notice the following message: <b>This module support only Legacy API, does not support YouTube Data API based on Google data protocol.</b> Now to me in this instance Legacy reads as deprecated so why would I build something on that?</p>
<h2 id="webservicegdatayoutube">WebService::GData::YouTube</h2>
<p>WebService::GData::YouTube was the next viable option. It was last updated Nov 13, 2011, and has 3 bugs, 1 five star review, six failed test reports and 515 passed reports. A quick search for 'playlist' in the POD reveals get_user_playlist_by_id() - Retrieve the videos in a playlist by passing the playlist id. That is exactly what I am looking for, and it installs with no errors. Here is my short program.</p>
<p><script src="http://gist-it.appspot.com/github/kimmel/youtube-playlists/blob/master/playlist.pl"></script></p>
<p>I call get_user_playlist_by_id() in my while loop so it keeps fetching blocks of 25 videos until the playlist is done. Videos that have been blocked or taken down can be skipped because their 'duration' field does not exist. The urls returned as part of $video have query parameters that are not needed, so I use URI::URL to pull out the host and path. The program outputs the playlist position number followed by the canonical YouTube url.</p>
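
<p>The shape of that loop, stripped of the YouTube specifics, looks roughly like the sketch below. It is not the program from the gist: fetch_page() is a hypothetical stand-in for the real module call and is fed from canned data so the sketch runs offline, and the 'id' and 'duration' field names are assumptions made for the illustration.</p>

```perl
use strict;
use warnings;

# Canned data standing in for a remote playlist of 60 entries. Every
# fourth entry lacks a 'duration' key, mimicking a removed video.
my @PLAYLIST =
  map { +{ id => "video$_", ( $_ % 4 ? ( duration => 200 ) : () ) } } 1 .. 60;

# Hypothetical fetcher: returns up to $per_page entries starting at the
# 1-based $start index, or an empty list once the playlist is exhausted.
sub fetch_page {
    my ( $start, $per_page ) = @_;
    return if $start > @PLAYLIST;
    my $end = $start + $per_page - 1;
    $end = @PLAYLIST if $end > @PLAYLIST;
    return @PLAYLIST[ $start - 1 .. $end - 1 ];
}

# Keep requesting pages until one comes back empty, skipping entries
# that have no duration. Positions count every entry, playable or not.
sub playable_entries {
    my ($per_page) = @_;
    my ( $start, @found ) = (1);
    while ( my @page = fetch_page( $start, $per_page ) ) {
        for my $i ( 0 .. $#page ) {
            next unless exists $page[$i]{duration};
            push @found, [ $start + $i, $page[$i]{id} ];
        }
        $start += @page;
    }
    return @found;
}

# Print the playlist position followed by the entry id.
printf "%d %s\n", @$_ for playable_entries(25);
```

<p>Advancing $start by the number of entries actually returned, rather than by the page size, keeps the loop correct when the last page is short.</p>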
<h2 id="other-modules">Other Modules</h2>
<p>I also took a quick look at other modules to see what else there was. WWW::YouTube::Info and WWW::YouTube::Info::Simple do not have any playlist functionality. </p>
<img alt="homer_too_much.gif" src="http://blogs.perl.org/users/kirk_kimmel/homer_too_much.gif" width="300" height="225" class="mt-image-center" style="text-align: center; display: block; margin: 0 auto 20px;" />

<h2 id="the-take-away">The take away</h2>
<p>I would like to think I selected the best module available but there could still be something better on CPAN or out in the wild. It is always a balance between time spent doing research and building it yourself if you can easily conceptualize the problem. I think far too often people get sucked into the research phase which can be quite fun. I use the bug count, last release date, pass/fail test reports, CPAN Rating, and the list of previous versions to determine if a module is mature enough for a deeper look to solve the problem I have.</p>
<p>There is no single search engine that encapsulates everything, so just using one generalized option like Google is not enough. Searching CPAN directly is always worthwhile. The next stops would have been GitHub, then Bitbucket and SourceForge.</p>]]>
    </content>
</entry>

<entry>
    <title>Parallel Forking and Process Management</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2013/05/parallel-forking-and-process-management.html" />
    <id>tag:blogs.perl.org,2013:/users/kirk_kimmel//922.4680</id>

    <published>2013-05-15T13:45:33Z</published>
    <updated>2013-05-15T17:01:36Z</updated>

    <summary>Stop me if you have heard this one before. You have a list of files you need to process in a text file with one item per line. Handling this is fairly simple: you read a line in and process...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>Stop me if you have heard this one before. You have a list of files you need to process in a text file with one item per line. Handling this is fairly simple: you read a line in and process it, over and over again, until you have processed the whole list. This works great, but if that list is 40,000 items long and each item takes up to 30 seconds to run, it suddenly takes a very long time to finish. In this case processing each item is just a system call to another CLI application with no shared resources, which allows items to be processed in parallel with no fuss. For this task I am using Parallel::ForkManager and here are the important bits:</p>]]>
        <![CDATA[<p><script src="https://gist.github.com/kimmel/5584103.js"></script></p>

<p>Now the above is basically the first example in the POD and not all that interesting. The "fun" part was discovering that some of the processes were hanging. Digging in, I found that the other application I am calling was hanging because the file it was trying to read was for some reason unavailable. This kind of failure is acceptable: just log what didn't work and move on. To clean up the stale processes I used Proc::ProcessTable to check for jobs that were running too long and kill them, running that check once for every batch of jobs the size of the process limit configured for Parallel::ForkManager. The code looks like this now:</p>
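
<p>As a rough illustration of the same two ideas, a cap on concurrent children and a way to stop the ones that hang, here is a sketch that uses only core Perl. It deliberately swaps techniques: plain fork and wait instead of Parallel::ForkManager, and a per-child alarm instead of scanning the process table with Proc::ProcessTable. The run_jobs() name, the limits and the sample commands are all made up for the example.</p>

```perl
use strict;
use warnings;
use POSIX qw(WIFSIGNALED WTERMSIG);

# Run each command (an array ref) with at most $max children alive at
# once. A child that runs longer than $timeout seconds is ended by its
# own SIGALRM, because a pending alarm survives exec.
sub run_jobs {
    my ( $max, $timeout, @commands ) = @_;
    my ( %running, %result );

    # Block until one child finishes, then record how it ended.
    my $reap = sub {
        my $pid = wait();
        die "no children left to reap" if $pid == -1;
        my $index = delete $running{$pid};
        return unless defined $index;
        $result{$index} =
            WIFSIGNALED($?)
          ? 'killed by signal ' . WTERMSIG($?)
          : 'exit ' . ( $? >> 8 );
    };

    for my $index ( 0 .. $#commands ) {
        $reap->() while scalar( keys %running ) >= $max;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {
            alarm $timeout;    # child only: time limit for this job
            exec @{ $commands[$index] } or exit 127;
        }
        $running{$pid} = $index;
    }
    $reap->() while %running;
    return \%result;
}

my $perl   = $^X;    # the running perl, used only as a stand-in command
my $report = run_jobs(
    2, 1,
    [ $perl, '-e', 'exit 0' ],
    [ $perl, '-e', 'sleep 30' ],    # pretends to hang
    [ $perl, '-e', 'exit 3' ],
);
print "job $_: $report->{$_}\n" for 0 .. 2;
```

<p>The alarm approach only works when the called program leaves SIGALRM at its default action; a program that catches or ignores it would still need an external check like the Proc::ProcessTable one.</p>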

<p><script src="https://gist.github.com/kimmel/5584109.js"></script></p>

<p>I tweak the $process_count based on the number of CPU cores since these processes are CPU heavy but RAM light. I had multiple versions of this application running on different server sizes for the last week to gather performance metrics for the future. I want to squeeze as much work out of a system as I can every second it runs. Yes, I do love optimizing code and fast results, but this is purely practical. Above I mentioned a list of 40,000 items. Since I started using this little application, that source file has grown to a few hundred thousand items, since processing time is not terribly long.</p>

<p>This is how I feel now.</p>

<p><img alt="yeah_science.jpg" src="http://blogs.perl.org/users/kirk_kimmel/yeah_science.jpg" width="599" height="397" class="mt-image-center" style="text-align: center; display: block; margin: 0 auto 20px;" /></p>]]>
    </content>
</entry>

<entry>
    <title>Download a mailman archive</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2013/03/download-a-mailman-archive.html" />
    <id>tag:blogs.perl.org,2013:/users/kirk_kimmel//922.4406</id>

    <published>2013-03-09T10:58:20Z</published>
    <updated>2013-03-09T11:23:00Z</updated>

    <summary>Oracle is closing down the opensolaris.org site on March 24th, which is inconvenient for the rest of us. I wanted to grab the mailman archives for the mailing lists so I fired up a search engine and looked for any...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    <category term="scraper" label="scraper" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>Oracle is closing down the opensolaris.org site on March 24th, which is inconvenient for the rest of us. I wanted to grab the mailman archives for the mailing lists so I fired up a search engine and looked for any existing open source projects to do this. After trying two different scripts that did not quite work right I realized it would just be faster for me to write what I need.</p>

<p>I started by fetching the listinfo page, which has links to all the lists archived, and took a look at the data. Based on the page structure the easiest method was to iterate over all the links in the page and only go deeper if a link led to a mailing list's main page. On the mailing list page I follow the link to the archives page, then scan all the links in that page for .gz files and download them. <a href="http://metacpan.org/module/WWW::Mechanize">WWW::Mechanize</a> provides a save_content() function which handles saving the files locally with minimal effort. That was all it took.</p>
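
<p>The last step, picking the archive files out of a page's links, comes down to a filter. A minimal sketch that assumes the hrefs have already been collected into a plain list (the real program gets them from WWW::Mechanize); archive_links() and the sample links are invented for the example:</p>

```perl
use strict;
use warnings;

# Return only the links that point at gzip-compressed archive files.
sub archive_links {
    my @hrefs = @_;
    return grep { /\.gz\z/i } @hrefs;
}

# Invented sample of what an archives page might link to.
my @hrefs = (
    '2012-December.txt.gz',
    '2012-December/thread.html',
    '2013-January.txt.gz',
    'listinfo.html',
);

print "$_\n" for archive_links(@hrefs);
```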

<h2>Optimization</h2>

<p>The most time consuming part of this whole process is fetching 8,000+ months of archives so I made sure to cache each page and file as I went along and use gzip as much as the service supports. I achieved both of these steps just by using <a href="http://metacpan.org/module/WWW::Mechanize::Cached::GZip">WWW::Mechanize::Cached::GZip</a> and CHI for the caching object. Here is the full program in all its shortness.</p>

<p><script src="https://gist-it.appspot.com/https://github.com/kimmel/at-mailman-scraper/blob/master/scrape_archive.pl"> </script></p>]]>
        
    </content>
</entry>

<entry>
    <title>Finding files faster</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2013/02/finding-files-faster.html" />
    <id>tag:blogs.perl.org,2013:/users/kirk_kimmel//922.4247</id>

    <published>2013-02-01T21:02:32Z</published>
    <updated>2013-02-02T01:35:00Z</updated>

    <summary>A little while back I wrote a pair of applications that used Path::Class::Rule to do the file finding. I selected this module because I like the interface for building up rules. I started to run into speed issues as the...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>A little while back I wrote a pair of applications that used <a href="https://metacpan.org/module/Path::Class::Rule">Path::Class::Rule</a> to do the file finding. I selected this module because I like the interface for building up rules. I started to run into speed issues as the source directory grew larger and larger. Along comes rjbs's <a href="http://rjbs.manxome.org/rubric/entry/1981">the speed of Perl file finders</a> article and his speed chart backs up my findings that more files equals a marked increase in time.</p>

<p>This is where I found out about <a href="https://metacpan.org/module/Path::Iterator::Rule">Path::Iterator::Rule</a> which was just released by David Golden. It works the same as Path::Class::Rule but returns strings instead of objects, which gives a massive performance boost. Path::Iterator::Rule is a drop-in replacement for Path::Class::Rule, so updating my programs required very minor changes.</p>

<p>With Path::Class::Rule my application took an average of 66 seconds per run. Now the Path::Iterator::Rule version <b>only takes 5 seconds</b> with the same input. That is a full minute saved on each run, and it feels good. </p>

<p>I am reminded of a quote from Brad Frost's article <a href="http://bradfrostweb.com/blog/post/performance-as-design/">Performance As Design</a> in which he states <b>"Good performance is good design"</b> and while the article context is web development, I think it applies to any kind of application. </p>]]>
        
    </content>
</entry>

<entry>
    <title>Text Processing Part 2: More Speed</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/11/text-processing-part-2-more-speed.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.4055</id>

    <published>2012-11-18T17:04:25Z</published>
    <updated>2012-11-18T17:07:50Z</updated>

    <summary>In my previous post Text Processing: Divide and Conquer I took a text processing problem, profiled it, then developed a few possible solutions. I benchmarked these options and now use the fastest solution… that I tested for. Two comments were...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>In my previous post <a href="http://blogs.perl.org/users/kirk_kimmel/2012/09/text-processing-divide-and-conquer.html">Text Processing: Divide and Conquer</a> I took a text processing problem, profiled it, then developed a few possible solutions. I benchmarked these options and now use the fastest solution… that I tested for. Two comments posted on that article gave insight into different and faster ways to solve this problem.</p>]]>
        <![CDATA[<h2 id="back-to-the-regular-expression-solution">Back to the regular expression solution</h2>
<p>Initially I just had an array of patterns that I fed through qr/$_/ixms and it was slow. I had not considered using alternation because I thought it was going to be too slow. Perl 5.10 fixed this kind of problem, but I was so used to how it performed before that I had not considered it since. With this new information in hand I created a new benchmark set to compare performance. Here are the old numbers:</p>
<pre>
./method_bench.pl
         Rate method4 method5 method3 method1 method2
method4 585/s      --     -0%    -35%    -40%    -40%
method5 586/s      0%      --    -35%    -40%    -40%
method3 898/s     53%     53%      --     -8%     -8%
method2 972/s     66%     66%      8%      0%      --
method1 972/s     66%     66%      8%      --     -0%
</pre>

<p>Method1 was copied into the new test as a baseline. I purposely added a solution that I knew would be slower, ‘regex_overhead’, just to see what would happen. ‘regex_assem’ shows no performance difference whether I use $regex3-&gt;re() or not.</p>
<script src="https://gist.github.com/4104227.js?file=comments_bench.pl"></script>

<p>Running the above script I get the following numbers:</p>
<pre>
./comments_bench.pl 
                 Rate   regex_assem        method1 regex_overhead      one_regex
regex_assem    2.38/s            --           -21%           -86%           -88%
method1        3.00/s           26%             --           -83%           -85%
regex_overhead 17.5/s          635%           482%             --           -13%
one_regex      20.0/s          742%           567%            15%             --
</pre>

<p>These results match up with Aaron Crane’s comment about a major speed increase. Even the test with extra overhead is multiple times faster than the method1 I had. So now I go back to this script I wrote and change it to use regular expressions with alternation. I tested the two versions I now had against a test data set of a few hundred files totaling 36 megabytes. The old script took 37 seconds to process the test data while the new version only takes 9 seconds, 25% of the baseline. Bumping up the test data to 135 megabytes, the old way takes 105 seconds and the new way 17 seconds, or 16% of the baseline.</p>
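<p>For reference, the alternation idea looks something like the sketch below. It is not the code from the benchmark gist: the search terms are hypothetical, and it assumes the patterns are literal strings, which is why they go through quotemeta before being joined.</p>

```perl
use strict;
use warnings;

# Hypothetical search terms; quotemeta keeps any regex metacharacters
# in them literal.
my @terms = ( 'TODO', 'FIXME', 'XXX', 'a.b' );

# One compiled regex with alternation instead of one regex per term.
my $alternation = join '|', map { quotemeta($_) } @terms;
my $one_regex   = qr/(?:$alternation)/i;

# Returns 1 when any of the terms appears in the line, 0 otherwise.
sub line_matches {
    my ($line) = @_;
    return $line =~ $one_regex ? 1 : 0;
}

print line_matches($_), " $_\n"
  for 'a fixme here', 'nothing to see', 'axb', 'a.b';
```

<p>Compiling the pattern once with qr// outside the loop means each line is scanned a single time, instead of once per term.</p>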
<p>What can I say? The Perl community came through for me. I wrote an article about how I solved a problem, and two people came along and gave me advice that led to a faster solution and, since speed was the goal, a better solution. Thank you.</p>]]>
    </content>
</entry>

<entry>
    <title>Using Padre for the first time</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/11/using-padre-for-the-first-time.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.4053</id>

    <published>2012-11-17T17:02:00Z</published>
    <updated>2012-11-17T17:04:23Z</updated>

    <summary>Recently I have been doing some in-depth research with regards to development tools of all kinds. Currently I am working my way through the various IDEs available in both the open and closed source worlds. This is what spurred me...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>Recently I have been doing some in-depth research with regards to development tools of all kinds. Currently I am working my way through the various IDEs available in both the open and closed source worlds. This is what spurred me into giving Padre another shot. The last time I tried to install it there was a dependency problem and it was not worth solving. So that is my first step: install Padre.</p>]]>
        <![CDATA[<p>Padre installed perfectly in my openSUSE 12.2, Perl v5.16 development environment. I immediately started the application and loaded a script I wrote a few weeks ago. I have never used Padre, seen screenshots of it or really been interested in it; that is my perspective going into this. The first thing I did was go to the Window menu and see what helper windows are available. I started the CPAN Explorer and ran a search. It was fast, and I assume it is reading the local module list file that cpan uses. Then I clicked on Recent and hit Refresh and the most recent CPAN modules showed up, just like the metacpan recent page.</p>

<p>That feature alone makes it better than most of the IDEs out there for Perl development, but would we expect less? It is written in Perl by Perl developers and they have a good grasp of what information developers want. Next I wrote a little code and then opened the Regex Editor to give it a shot. I like the list of quick references on the right for things like character classes. Setting the ixmsg flags is done via check boxes at the top of the window. Inputting some test data and a simple regexp, I ran into a problem.<br />
  <br />
The regexp I used was \d{2,3}, something simple. When no match is found, the 'Matched text' area shows a 'No match' message in red. When the regexp matches a substring of the sample data, we see all of the sample text again in the matched text area but the actual match is not highlighted. I expected it to highlight the matches in this area as a visual aid for finding them quicker. I tried substitution and the replacements were not highlighted either.</p>

<p>Moving on I started looking for the must have features that all good code editors have.</p>

<ul>
<li>cross platform</li>
<li>full Unicode and UTF-8 support</li>
<li>visible line numbers</li>
<li>current line &amp; column number</li>
<li>syntax highlighting</li>
<li>brace matching</li>
<li>auto indentation</li>
<li>view spaces and end of line characters</li>
<li>Different files as tabs</li>
<li>multiple level undo/redo</li>
<li>code folding</li>
<li>block commenting</li>
<li>new line conversion</li>
<li>Perl integration (syntax check, debugging)</li>
<li>text zoom</li>
<li>double click word selection</li>
<li>triple click line selection</li>
<li>find in files</li>
<li>regular expression search and replace (Only in the Replace option. The normal find does not support regexp)</li>
<li>multiple instances</li>
<li>generalized autocomplete</li>
<li>session state preservation (only if you choose to save it)</li>
</ul>

<p><br />
I found almost everything I expected in Padre and the rest via plugins. Padre::Plugin::PerlTidy added perltidy support after I restarted Padre and enabled the plugin via the Plugin Manager. For Padre::Plugin::PerlCritic I followed the same routine and it by default does nothing if you do not have a minimal .perlcriticrc file. With a perlcriticrc file in hand I tried running it again and now I get output. The problem though is how the output is displayed. First off this plugin does not honor the verbose setting from the config file. No matter what I set it to the Padre output is the same. The second problem is the lack of colorized output or other indicator of the severity level of a warning message.</p>

<p>Next I went to Tools, Preferences to change a few options around. When I saved the changes the File menu bar disappeared, which is no good. I started up a test Windows XP environment, installed DWIM Perl 5.14.2.1 (v7) and tried the same process again and it did not mess up. So after having to restart Padre a bunch of times after making configuration changes I started using it to update a few scripts I wrote last month. Four segmentation faults later and I am done with it. I have no interest in fiddling with it to figure out the problem, I tried it out for research and now I know what it can and cannot do. </p>

<p>Would I recommend Padre to other developers? No. </p>

<p>Padre uses the Scintilla library which is where it gets the bulk of its features from. SciTE is an editor developed by the Scintilla developers that does most of what Padre does without crashing constantly, is updated more frequently and is cross platform as well. The extra features like CPAN Explorer, refactoring, and other Perl specific goodies I already have via command line applications. I started out excited to try Padre in the hopes it could make things a little faster in the development cycle. Now I just regard it as another buggy IDE. And please do not feed me a junk line about it works on my Linux distribution, I do not care. Pretend I am a normal end user for a moment. If I install a piece of software and it crashes repeatedly I simply switch to another application. I am not compelled to try and fix it or figure out what is going on, I am not invested in it as a first time user.</p>]]>
    </content>
</entry>

<entry>
    <title>Testing for HTTP compression</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/10/testing-for-http-compression.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.3972</id>

    <published>2012-10-20T10:25:13Z</published>
    <updated>2012-10-20T19:10:44Z</updated>

    <summary>How do I determine what the content-encoding of a web page is? A simple question which after doing a little searching did not turn up a simple answer. A stackoverflow question led me to the solution but did not answer...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>How do I determine what the content-encoding of a web page is? A simple question which after doing a little searching did not turn up a simple answer. A stackoverflow question led me to the solution but did not answer the question directly, so here I am writing it up. We will need to install these modules first:</p>

<p><code><br />
cpan Compress::Zlib LWP::UserAgent<br />
</code><br />
</p>]]>
        <![CDATA[<p><script src="https://gist.github.com/3922897.js?file=gistfile1.pl"></script></p>

<p>This is a quick test to see what formats you can accept. 'gzip, x-gzip, deflate, x-bzip2' is the output on my system. Now let's fetch a web page and see what we get.</p>

<p><script src="https://gist.github.com/3924279.js?file=gistfile1.pl"></script></p>

<p>Looking through all the headers we can see that content-encoding is gzip for reddit. Now instead of dumping all the headers we could simply look at $response->header('content-encoding') and have what we need. </p>
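
<p>For readers who want to try the single-header check without installing anything, here is a minimal sketch. It swaps in the core HTTP::Tiny module instead of LWP::UserAgent, so the response is a plain hash rather than an object. The <code>content_encoding</code> helper and the "identity" fallback are my own assumptions for illustration, not part of the gists above.</p>

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Pull the Content-Encoding out of an HTTP::Tiny response hash.
# HTTP::Tiny lower-cases header names; a missing header is reported
# here as "identity" (an assumption made for this sketch).
sub content_encoding {
    my ($response) = @_;
    return $response->{headers}{'content-encoding'} // 'identity';
}

# Only touch the network when a URL is passed on the command line.
if (@ARGV) {
    my $http     = HTTP::Tiny->new( timeout => 10 );
    my $response = $http->get( $ARGV[0],
        { headers => { 'Accept-Encoding' => 'gzip, deflate' } } );
    print "$ARGV[0]: ", content_encoding($response), "\n";
}
```

<p>Run it with a URL as the only argument to print that page's encoding. Note that HTTP::Tiny does not decompress the body for you, so this is only useful for inspecting the header.</p>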

<p>There is an interesting tidbit in the full headers of reddit:</p>

<pre>
======(  $response->{_headers}  [ 'delivery_formats.pl', line 19 ]======

    bless({
      "client-date"         => "Sat, 20 Oct 2012 18:27:48 GMT",
      "client-peer"         => "165.254.27.97:80",
      "client-response-num" => 1,
      "connection"          => "close",
      "content-encoding"    => "gzip",
      "content-length"      => 8346,
      "content-type"        => "application/json; charset=UTF-8",
      "date"                => "Sat, 20 Oct 2012 18:27:48 GMT",
      "server"              => "'; DROP TABLE servertypes; --",
      "set-cookie"          => "reddit_first=%7B%22firsttime%22%3A%20%22first%22%7D; Domain=reddit.com; expires=Thu, 31 Dec 2037 23:59:59 GMT; Path=/",
      "vary"                => "Accept-Encoding",
    }, "HTTP::Headers")
</pre>

<p>Let's see, this is coming from server <b>"'; DROP TABLE servertypes; --"</b>. A SQL injection as the server name; it makes me smile. Obviously the reddit developers have read that xkcd comic before. To protect your application against such an "attack" I would recommend reading <a href="http://www.bobby-tables.com">bobby-tables.com</a>, which is a guide to preventing SQL injection.</p>]]>
    </content>
</entry>

<entry>
    <title>Text Processing: Divide and Conquer</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/09/text-processing-divide-and-conquer.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.3867</id>

    <published>2012-09-22T23:45:41Z</published>
    <updated>2012-09-22T23:43:15Z</updated>

    <summary>Another day another generic text processing problem that many developers have had to solve before. I have a list of patterns and need to find if they exist in a group of files. If I did not need to do...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>Another day another generic text processing problem that many developers have had to solve before. I have a list of patterns and need to find if they exist in a group of files. If I did not need to do complex post processing then I could just use the command line like so</p>

<p><code><br />
grep -ri -f patterns files/<br />
</code></p>]]>
        <![CDATA[<p>and be done with it, alas the world is a cruel place :) So I cooked up a simple program that slurps a file in, does the comparison, and tracks the necessary stats. The main problem with this approach is that it is too slow: I had over 4,000 patterns to match against every file. I used <b>perl -d:NYTProf</b> to confirm my theory. Here is the basic report I got from nytprofhtml:</p>

<pre>
Calls	P	F	Exclusive Time	Inclusive Time	Subroutine
1125978	1	1	386s	386s	main::CORE:match (opcode)
1130211	2	1	1.09s	1.09s	main::CORE:regcomp (opcode)
41804	6	2	378ms	638ms	File::Spec::Unix::canonpath (recurses: max depth 1, inclusive time 159ms)
23226	20	7	131ms	518ms	Path::Class::Dir::stringify
250764	6	1	87.5ms	308ms	File::Spec::Unix::CORE:subst (opcode)
32515	6	2	85.6ms	435ms	File::Spec::Unix::catdir
1236	2	2	56.0ms	447ms	File::Spec::Unix::abs2rel
24462	3	2	49.0ms	49.0ms	File::Spec::Unix::catpath
5968	10	5	43.1ms	325ms	Path::Class::File::stringify
31924	10	3	42.1ms	42.1ms	Path::Class::Entity::_spec
4903	1	1	27.5ms	146ms	File::Spec::Unix::catfile
120	2	2	26.1ms	29.4ms	utf8::SWASHNEW
980	974	5	19.3ms	20.3ms	Encode::utf8::decode_xs (xsub)
6988	6	3	17.0ms	19.2ms	File::Spec::Unix::splitpath
1840	4	1	15.4ms	655ms	Path::Class::Rule::test (recurses: max depth 2, inclusive time 674ms)
</pre>

<p>1,125,978 matches in <b>3 megabytes</b> of sample text files took 6 minutes and 26 seconds. This is way too slow, so let us optimize. </p>

<h2>Divide</h2>

<p>I started by looking at the patterns file. The majority of the patterns are string literals, which is good news. With a little preprocessing I now have all the string literals in a hash keyed by the first letter of the pattern. Each value is another hash whose keys are the patterns and whose values are coderefs that do whatever processing is needed. Now, instead of running all of the literal matches every time, I can test the first letter and use that to limit the matching scope. The best case is 'x', which has only 1 pattern to test, and the worst is 's', with over 700 patterns.</p>
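
<p>To make that data structure concrete, here is a minimal sketch of the two-level hash. The pattern list, the hit-counting handlers, and the <code>check_token</code> helper are hypothetical stand-ins for illustration; the real processing done by my coderefs is not shown in this post.</p>

```perl
use strict;
use warnings;

# Hypothetical literal patterns; the handlers just count hits.
my @literals = qw( stake sunrise shadow garlic xenon );
my %hits;

# Outer key: first letter. Inner hash: pattern => coderef.
my %bucket;
for my $pattern (@literals) {
    my $first = lc substr( $pattern, 0, 1 );
    $bucket{$first}{$pattern} = sub { $hits{$pattern}++ };
}

# Try a token against only the patterns that share its first letter.
# Returns how many handlers fired.
sub check_token {
    my ($token) = @_;
    $token = lc $token;
    my $candidates = $bucket{ substr( $token, 0, 1 ) } or return 0;
    my $fired = 0;
    for my $pattern ( keys %{$candidates} ) {
        next unless $token eq $pattern;
        $candidates->{$pattern}->();
        $fired++;
    }
    return $fired;
}

check_token($_) for qw( Garlic at sunrise beats a stake at sunrise );
print "$_: $hits{$_}\n" for sort keys %hits;
```

<p>A token starting with 'x' only ever looks at one candidate here, while a token starting with 's' looks at three, which is the same imbalance as in my real pattern file on a much smaller scale.</p>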

<h2>index instead of m//</h2>

<p>Now you might be asking yourself why I didn't just use <b>index</b>, since I have string literals. Index is designed for fast string searching with literal matches. Let's take a look at a simple example which loads 4,000 patterns and matches them against the entire text of Bram Stoker's Dracula, which is a 16,248 line, 836 KB file.</p>

<p><script src="https://gist.github.com/3689246.js?file=gistfile1.pl"></script></p>

<p>That is the sample program, and it takes 2 minutes, 10 seconds to run. The bulk of the time is spent doing 4,000 match operations. Now let's try the same thing with index instead.</p>

<p><script src="https://gist.github.com/3689276.js?file=gistfile1.pl"></script></p>

<p>This program processes the entire Dracula text in 4.1 seconds. </p>

<p>The reason I didn't use index is that some of the literal patterns match as substrings of other words so I get a bunch of false positives. This does not happen with m// because I used \b to match up word boundaries. This was a precautionary measure because with that many input patterns I had to assume there would be some overlap and it turns out there is. A gentle look at <a href="http://codeidol.com/community/perl/avoid-using-regular-expressions-for-simple-string-/14220/">simple string operations</a> is a good read if this is new material and chapter 9 of "Mastering Algorithms with Perl" goes into depth about Boyer-Moore which Perl and grep both use.</p>
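
<p>The false positive problem is easy to reproduce. This is a small sketch with a made-up sentence and a made-up literal, not data from my pattern file:</p>

```perl
use strict;
use warnings;

my $text    = 'The count will concatenate the category list.';
my $literal = 'cat';

# index() reports the first substring hit, even inside a longer word.
my $pos = index $text, $literal;

# \b anchors the literal to word boundaries; \Q...\E keeps any regex
# metacharacters in the pattern literal.
my $whole_word = $text =~ m/\b\Q$literal\E\b/ ? 1 : 0;

print "index position: $pos\n";
print "whole-word match: $whole_word\n";
```

<p>index finds 'cat' inside 'concatenate' and reports a hit, while the \b version reports no match because 'cat' never appears as a word of its own.</p>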

<h2>I need tokens</h2>

<p>This is what I had:</p>

<p><script src="https://gist.github.com/3688579.js?file=gistfile1.pl"></script></p>

<p>Which was fine, but now with the need to test the first letter of each word, I need tokens. One of the advantages to making this change is I can now do <a href="https://en.wikipedia.org/wiki/Text_normalization">text normalization</a> on it, specifically removing duplicate tokens so fewer match attempts need to be made. <a href="https://metacpan.org/module/List::MoreUtils">List::MoreUtils</a> makes this a breeze.</p>

<p><script src="https://gist.github.com/3688004.js?file=gistfile1.pl"></script></p>

<p>I read in the whole source file and strip out all the characters I do not care about. Then I split it into tokens using whitespace as the separator. Next I filter out any empty or one letter tokens. uniq from List::MoreUtils returns a list with no duplicate values. I was surprised that between List::MoreUtils and <a href="https://metacpan.org/module/List::Util">List::Util</a> there was no function for getting the shortest or longest string in a list. Then again it is a trivial task to implement them using reduce as shown above.</p>
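
<p>For anyone reading without the gist loaded, the steps described here can be sketched roughly as follows. This is an approximation under my own assumptions: the character class being stripped is a guess at "characters I do not care about", and a %seen hash stands in for uniq so the sketch only needs core modules.</p>

```perl
use strict;
use warnings;
use List::Util qw( reduce );

# Turn raw text into unique lowercase tokens of two or more characters.
sub tokens {
    my ($text) = @_;
    ( my $clean = lc $text ) =~ s/[^a-z\s]+/ /g;    # drop unwanted characters
    my %seen;                                       # stand-in for uniq
    return grep { length($_) > 1 && !$seen{$_}++ } split ' ', $clean;
}

# reduce keeps whichever of each pair wins the length comparison;
# on a tie the earlier token is kept.
sub shortest { return reduce { length($b) >= length($a) ? $a : $b } @_ }
sub longest  { return reduce { length($a) >= length($b) ? $a : $b } @_ }

my @tokens = tokens('The Count, the count; a bat & the castle!');
print "tokens:   @tokens\n";
print "shortest: ", shortest(@tokens), "\n";
print "longest:  ", longest(@tokens),  "\n";
```

<p>Splitting on a single space string uses Perl's awk-style split, which skips leading whitespace and never returns empty fields, so the length filter only has to deal with one letter tokens.</p>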

<h2>Conquer</h2>

<p>Going back to the initial 3 MB of text files, this updated script takes <b>under 3 seconds</b> to execute; that is less than 1% of the baseline time. I can now run this script 127 times in the time it took for the first version to complete one run. </p>

<h2>More optimization</h2>

<p>The more I used this little application on larger and larger datasets the more I wondered if I could get a little more speed out of it. I started by increasing my sample data size to 26 megabytes to simulate a more normal workload. Then I took a look at that foreach loop to see if I could speed things up. I created a simple benchmark script to compare five variations on a solution and then time them.</p>

<p><script src="https://gist.github.com/3749364.js?file=gistfile1.pl"></script></p>

<pre>
          Rate method5 method4 method3 method2 method1
method5  667/s      --     -0%    -34%    -40%    -41%
method4  670/s      0%      --    -34%    -40%    -41%
method3 1010/s     51%     51%      --     -9%    -11%
method2 1115/s     67%     66%     10%      --     -1%
method1 1130/s     69%     69%     12%      1%      --
</pre>

<p>Method1 turned out to be the fastest, but what does this translate to in real work? Method5 takes 33 seconds to process 26 MB of text while method1 takes only 22 seconds, a 33.33% reduction in execution time. For much larger datasets I am looking into doing things in parallel; however, the overhead of forking is greater than the amount of work that needs to be done on a per file basis.</p>]]>
    </content>
</entry>

<entry>
    <title>3 features I would like to see in Perl</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/09/3-features-i-would-like-to-see-in-perl.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.3830</id>

    <published>2012-09-14T19:22:29Z</published>
    <updated>2012-09-15T14:02:21Z</updated>

    <summary>A few days ago I read Features Perl 5 Needs in 2012 by chromatic and while I thought the ideas were nice the only one I really cared about was a replacement for XS. I have tried with XS and...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>A few days ago I read <a href="http://www.modernperlbooks.com/mt/2012/09/features-perl-5-needs-in-2012.html">Features Perl 5 Needs in 2012</a> by chromatic and while I thought the ideas were nice the only one I really cared about was a replacement for XS. I have tried with XS and FFI to bring in new libraries and it is just so painful. Marcus believes that <a href="http://marcus.nordaaker.com/perl-needs-modern-garbage-collection/">Perl garbage collection</a> needs a serious overhaul and I agree. Improved gc in a language is one of those things that helps ease development pain, specifically scalability issues. Java is always improving its gc to better meet new performance requirements. <a href="http://blog.twoshortplanks.com/2012/09/13/another-feature-perl-5-needs-in-2012/">Structured core exceptions</a> is another improvement that would be very nice. This got me thinking what would I like to see in the next version of Perl. I have never contributed to Perl core, never written a Perl book, and I do not follow p5p or Perl6 development. I have looked a little at the Perl source but nothing serious. This makes me an outsider to Perl development so I have a different viewpoint on what I should get out of Perl.</p>]]>
        <![CDATA[<p>The number one thing Perl needs is speed. It needs to be faster to stay competitive. This is one of the reasons that people say Perl is dead: no news about large scale performance improvements. The Python community has PyPy and Psyco as performance options. Facebook developed HipHop for PHP, which gives a 3-5x speed boost with software like Drupal. The Perl community has <a href="https://www.metacpan.org/module/B::C">B::C</a>, which can improve program startup time, and <a href="https://www.metacpan.org/module/B::CC">B::CC</a>, which may improve runtime speeds. Reini Urban has covered <a href="http://blogs.perl.org/users/rurban/2012/09/native-pbc-in-parrot-revived.html">this material</a> on both the Perl 5 and Perl 6 front.</p>

<p>Now I don't want to get off on a rant here ;)</p>

<p>I read articles like <a href="https://lwn.net/Articles/487216/">Perl 5.16 and beyond</a> and the whole focus is new syntactic sugar and how adding a MOP to Perl will make Moose faster. The only mention of something becoming faster is by adding all new features to Perl, and that only makes certain things faster. What about the rest of us? I try to keep up the good tenets of functional programming in my applications and I have never had the need to use Moose in an application. I have literally dozens of Perl programs on my computer that I use in a day to day fashion, some I wrote and some I didn't, and none of them use Moose, Mouse, Moo, etc. I want improvements to Perl that make my existing applications faster.</p>

<p>C was initially developed between 1969 and 1973, making it about 40 years old. Java 1 was released at the beginning of 1996, making it 16 years old. Perl was first released in 1987, making it 25 years old. GCC and clang for C and the Oracle JVM keep getting faster: if you test newer versions against older versions with the same code, the newer compilers and JVMs tend to perform much better. The same cannot be said for Perl. <a href="http://speed.perlformance.net/">Perl::Formance</a> is an ongoing project that tracks the performance of Perl versions against different benchmark programs. The general trend is that Perl performance has plateaued and gets slightly worse in some cases. </p>

<p>The performance and optimization stagnation Perl is in does nothing to help expand its market share and improve its image. The reality we all already know is that there is a finite number of programmers in the world, and Perl is vying for position against the constant influx of new languages as well as competing against existing languages. Some may consider the community staying the same size a good thing, but my point of view on the matter is the community is either growing or dying. CPAN may seem magical to some but it is just a group of developers who choose to build libraries in and for Perl. For CPAN to stay vibrant it needs to add modules that meet the ever expanding needs of the programming world. An example of an underrepresented segment is graph database support. The PHP and Java communities have us beat badly in that area. Why? One of the reasons is that both of those languages have tools that allow them to scale far beyond the base implementation.</p>

<p>I can take a PHP application that is slower than its Perl equivalent, and without changing the code compile it with HipHop and now it is faster than the Perl version. Does this work the same with B::C or B::CC? No. This is the end of my rant. </p>

<p>The top features I want to see in Perl:</p>

<p>1. A faster perl across the board<br />
2. Better gc which would probably make applications faster<br />
3. XS, FFI improvements to allow faster support of new libraries in the wild</p>

<p>Everything else for me is unimportant. People still use C because it is fast not because they are constantly adding new features to the language. On a side note I just saw a tweet from chromatic stating he has an idea of how to do better exceptions in core and stay backwards compatible.</p>]]>
    </content>
</entry>

<entry>
    <title>Q: When not to use Regexp? A: HTML parsing</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/08/q-when-not-to-use-regexp-a-html-parsing.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.3764</id>

    <published>2012-08-29T17:51:48Z</published>
    <updated>2012-08-29T17:52:47Z</updated>

    <summary>It always starts out as something simple and innocent and then the Internet ruins it. So I am giving a data mining talk at Ohio LinuxFest 2012 and surprise, surprise there is going to be a nice helping of Perl....</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>It always starts out as something simple and innocent and then the Internet ruins it.</p>

<p>So I am giving a data mining talk at <a href="http://ohiolinux.org/">Ohio LinuxFest 2012</a> and surprise, surprise there is going to be a nice helping of Perl. So I am on the internet doing research looking for some simple scrapers and collectors to mention in my talk. I always prefer to give multiple examples for any problem since programming does not have a one size fits all model. To make a long story short I found a bunch of different social media scrapers. The problem I found with most of them was the same. Things like this Ruby example:</p>

<p><script src="https://gist.github.com/3482179.js?file=gistfile1.rb"></script></p>

<p> another Ruby example (the comment is from the original source):</p>

<p><script src="https://gist.github.com/3482190.js?file=gistfile1.rb"></script></p>

<p>The Python ones I found were a little more deceptive. Here is what I found on the surface:</p>

<p><script src="https://gist.github.com/3482211.js?file=gistfile1.py"></script></p>

<p>So I see BeautifulSoup included and I am thinking that must be like <a href="https://www.metacpan.org/module/HTML::Parser">HTML::Parser</a> right? Wrong. Instead I find these:</p>

<p><script src="https://gist.github.com/3482220.js?file=gistfile1.py"></script>        </p>

<p>What about <a href="http://docs.python.org/library/htmlparser.html">HTMLParser</a>? More of the same:</p>

<p><script src="https://gist.github.com/3482230.js?file=gistfile1.py"></script></p>

<p>The one that probably wasted the most brain cells was this Perl one</p>

<p><script src="https://gist.github.com/3482317.js?file=gistfile1.pl"></script></p>

<p>Wow, just wow. The untold hours of work to build the above expressions and in some cases knowing that it will break no matter what. </p>

<p>HTML parsing with regexp is the <a href="http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html">cthulhu way</a> and yet people still do it even though good parsers exist and have already solved this problem. HTML 4.01 was published in 1999 and the first public draft of HTML5 in 2008. HTML::Parser released version 2.14 in 1998 and HTML::TreeBuilder released version 0.50 in 1998 as well. </p>

<p>I shouldn't be surprised considering the empowerment I feel when using regular expressions to solve problems. Then we see things like this `perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' [number]` which can determine if a number is prime or not. Neil already wrote up a <a href="http://montreal.pm.org/tech/neil_kandalgaonkar.shtml">walk through</a> of the pattern, if you're interested.</p>
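
<p>If the one-liner is hard to read, here is the same pattern wrapped in a function. The <code>is_prime</code> name is mine; the regex is unchanged from the one-liner.</p>

```perl
use strict;
use warnings;

# Unary primality test: write $n as a string of $n ones, then ask the
# regex whether that string is empty, a single 1, or a group of two or
# more ones repeated at least twice (in other words, a composite).
sub is_prime {
    my ($n) = @_;
    return ( ( 1 x $n ) !~ /^1?$|^(11+?)\1+$/ ) ? 1 : 0;
}

print join( ' ', grep { is_prime($_) } 0 .. 30 ), "\n";
```

<p>It is a fun trick and nothing more: the backtracking cost grows quickly with the size of the number, which is rather the point of this article.</p>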

<p>Concurrently, I am also thinking "I am not the only person who has had this problem. I bet someone already solved it on the Internet." A little searching and I had found a half dozen examples of using actual parsing libraries to work with HTML. Let me repeat the important part there: 'I am not the only person who has had this problem.' This is one of those ideas that should be impressed upon programmers regularly. Tunnel vision while focused on a project happens and it only makes the final solution worse, not better. Another version of this is the 'Not Invented Here' syndrome that seems to invade programmers' minds and make them think they can not only do the task better but that it will be quicker for them to just rebuild it from scratch. If this happens to you, take a step back and really assess the amount of time needed to write something new from scratch.</p>

<p>With all that said I would remind people that <a href="http://htmlparsing.com/">htmlparsing.com</a> is an expanding community resource that can help people understand how easy it is to use a parser and not waste brain time on a regexp solution.</p>

<pre>
    use 5.010;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://news.ycombinator.com/');

    # print every link on the page that points at perl.org
    foreach my $link ( $mech->links() ) {
        if ( $link->url() =~ m/perl[.]org/xms ) {
            say $link->url();
        }
    }
</pre>

<p>That is all it takes to get started. I would like to note that PHP and Python both have a good parsing library in <a href="https://code.google.com/p/html5lib/">html5lib</a>.</p>]]>
        
    </content>
</entry>

<entry>
    <title>A NYTprof encoding hiccup</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/08/a-nytprof-encoding-hiccup.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.3718</id>

    <published>2012-08-21T13:00:29Z</published>
    <updated>2012-08-21T14:43:42Z</updated>

    <summary>While using Devel::NYTProf on a new application I started getting this message fid 33 has no src saved for /usr/lib/perl5/5.14.2/autodie.pm (NYTP_FIDf_HAS_SRC not set but src available!) Now my first thought was this has something to do with either the newest...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>While using Devel::NYTProf on a new application I started getting this message <br />
<code><br />
fid 33 has no src saved for /usr/lib/perl5/5.14.2/autodie.pm (NYTP_FIDf_HAS_SRC not set but src available!)<br />
</code><br />
Now my first thought was this has something to do with either the newest version of autodie or utf8::all. So I checked to make sure all the modules I was using were up to date and tested again, still there. Then I wrote a really short program to recreate this error and for some reason I couldn't. Going back and forth between the two files I finally noticed what was different and I was shocked by what it was.</p>

<p>This script generates the above message with `perl -d:NYTProf`</p>

<p><script src="https://gist.github.com/3415273.js?file=gistfile1.pl"></script></p>

<p>and this one does not.</p>

<p><script src="https://gist.github.com/3415297.js?file=gistfile1.pl"></script></p>

<p>Just switching the order of autodie and utf8::all fixed this. My mind was blown. I could not think of another time I had seen the order of use statements do something like this. So I did some searching to see if this was known or not. <a href="https://rt.cpan.org/Public/Bug/Display.html?id=70211">Bug 70211</a> says it is 'use open qw( :encoding(UTF-8) :std );' followed by a use module statement that causes the issue. So my "fix" is not really a fix, but the message does not appear to cause any actual problems with NYTProf, and I am keeping an eye on it.</p>]]>
        
    </content>
</entry>

<entry>
    <title>DIY personal analytics</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/08/diy-personal-analytics.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.3712</id>

    <published>2012-08-20T04:03:05Z</published>
    <updated>2012-08-26T00:55:23Z</updated>

    <summary><![CDATA[How many times a day do you reach for &lt;ctrl&gt; + r when using the shell? What about the history command? !! anyone? Do we as programmers evolve and stop making the same mistakes? Do we really optimize our workflows?...]]></summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
        <category term="io-all" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>How many times a day do you reach for &lt;ctrl&gt; + r when using the shell? What about the history command? !! anyone?</p>

<p>Do we as programmers evolve and stop making the same mistakes? Do we really optimize our workflows? This is where the idea of personal analytics comes in. I am going to see what I can learn from looking at my bash history for the last few years. Here are the relevant settings in my .bashrc file:</p>

<p><script src="https://gist.github.com/3401826.js?file=gistfile1.sh"></script></p>

<p>shopt is a bash builtin that shows and changes shell options. The histappend option tells bash to append the history collected to the filename specified in HISTFILE instead of overwriting the file. cmdhist tells bash to save all lines of a multiple-line command in the same history entry.</p>

<p>HISTFILE allows me to tell bash where and what to name history files. For example 2012-08-20.hist is today's bash history file. </p>

<p>Setting the prompt with history -a will append the current session history to an existing history file. This means your history file is updated as each command is executed on the command line, no waiting until the session is closed to write the history. I have found this helps with many terminals open at once and the need to share commands between them. The `history -n` part appends the history lines not already read from the history file to the current session history. These are lines appended to the history file since the beginning of the current bash session.</p>

<p>Getting basic statistics like how often I use a command or the last time I used it is easy. `ack -c ^perlcritic *` for example tells me how many times I called perlcritic and the filenames contain the date. The inherent problem is all the commands in the history files that are not helpful for whatever the task is at hand. So I wrote a simple filter script <a href="https://gist.github.com/3402769">clean_history.pl</a> which creates a clean file and a file containing the junk removed. The initial files are not modified which allows this script to be used in an iterative manner.</p>

<p>Let's take a look at the code with most of the patterns removed just to show how short this program really is.</p>

<p><script src="https://gist.github.com/3402778.js?file=gistfile1.pl"></script></p>

<p>Now I have these cleaned up history files I can look at. I wrote this script last December and have been using it since to help identify commands that I use a great deal that are either long or hard to remember. For those commands I started making aliases and functions in my .bashrc so I wouldn't have to remember all the intricate details and to optimize my work flow with less typing and more getting things done. So far I have added over a dozen aliases/functions based on what my history has been telling me. Here are two Perl related ones:</p>

<pre><code
># perl cleanup before git commit
pcl () {
  make clean
  rm Makefile.old
  cover --delete
  rm -r nytprof*
  rm smallprof.out
  rm -r fatlib*
  rm fatpacker.trace packlists
}

# build that perl coverage report
pcr () {
  perl Makefile.PL
  make
  cover -delete
  cover -summary -report html -test
}</code></pre>

<p>I still dig through the clean files, add new patterns to the cleaner script and look at what I could automate away. This not only makes it easier for me to get things done but it is also a benchmark on the evolution of my tool usage.</p>
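
<p>One way to spot candidates for a new alias in the cleaned files is to tally the first word of every line. This is only a sketch: the <code>command_counts</code> helper and the sample history lines are made up for illustration and are not part of clean_history.pl.</p>

```perl
use strict;
use warnings;

# Count how often each command (the first word on a line) appears.
sub command_counts {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        my ($command) = $line =~ m/^\s*(\S+)/ or next;    # skip blank lines
        $count{$command}++;
    }
    return \%count;
}

# Hypothetical history lines standing in for a cleaned .hist file.
my $counts = command_counts(
    'perlcritic lib/',
    'git status',
    'perlcritic t/',
    'prove -l t/',
);

# Most used first, ties broken alphabetically.
for my $cmd ( sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %{$counts} ) {
    print "$counts->{$cmd}\t$cmd\n";
}
```

<p>Feeding it the lines of the clean files instead of the sample list gives a quick ranking of what I type most.</p>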

<p>I hope others try this out on their history files, see what interesting goodies they can dig up, and possibly share them with us.</p>]]>
        
    </content>
</entry>

<entry>
    <title>An overview of spell checking modules</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/08/an-overview-of-spell-checking-modules.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.3710</id>

    <published>2012-08-18T14:55:25Z</published>
    <updated>2012-08-19T21:10:35Z</updated>

    <summary>Spell checking is one of those problems that is already solved... sorta. Like all problems it really depends on context. Take Jon Bentley&apos;s Programming pearls: a spelling checker where he examines the problem space and the differences between a spell...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>Spell checking is one of those problems that is already solved... sorta. </p>

<p>Like all problems it really depends on context. Take Jon Bentley's <a href="https://dl.acm.org/citation.cfm?id=315102&dl=ACM&coll=DL&CFID=143274770&CFTOKEN=10731946">Programming pearls: a spelling checker</a> where he examines the problem space and the differences between a spell checker and a spelling corrector. I start by searching the keyword 'spell' across all of CPAN.</p>

<p><code>wget http://www.cpan.org/modules/01modules.index.html<br />
ack -i spell 01modules.index.html<br />
</code><br />
The above covered all 22,442 distribution names but not the submodule names. A few metacpan searches later and I was able to compile the following list.</p>

<p>Direct checkers - modules that actually do the spell checking<br />
<ul><br />
<li><a href="https://www.metacpan.org/module/Lingua::Ispell">Lingua::Ispell</a> A module encapsulating access to the Ispell program via IPC::Open2</li><br />
<li><a href="https://metacpan.org/module/Meta::Tool::Aspell">Meta::Tool::Aspell</a> run aspell for you. Meta is a class library of about 250 classes and is abandonware.</li><br />
<li><a href="https://metacpan.org/module/Text::Aspell">Text::Aspell</a> Perl interface to the GNU Aspell library</li><br />
<li><a href="https://metacpan.org/module/Text::Hunspell">Text::Hunspell</a> Perl interface to the Hunspell library</li><br />
<li><a href="https://metacpan.org/module/Text::Ispell">Text::Ispell</a> A wrapper module for Ispell. The ispell cli is called via IPC::Open2.</li><br />
</ul></p>

<p>Indirect - relies on another module to do the actual checking<br />
<ul><br />
<li><a href="https://metacpan.org/module/Search::Tools::SpellCheck">Search::Tools::SpellCheck</a> Uses Text::Aspell to offer spelling suggestions</li><br />
<li><a href="https://metacpan.org/module/Text::SpellChecker">Text::SpellChecker</a> OO interface for spell-checking a block of text. Uses either Text::Aspell or Text::Hunspell</li><br />
</ul></p>

<p>POD only checkers<br />
<ul><br />
<li><a href="https://metacpan.org/module/Pod::Spell::CommonMistakes">Pod::Spell::CommonMistakes</a> Catches common typos in POD by using Pod::Spell to format the text and then comparing it against a custom wordlist from Pod::Spell::CommonMistakes::WordList. No system spell checker is required.</li><br />
<li><a href="https://metacpan.org/module/Pod::Spelling">Pod::Spelling</a> Send POD to a spelling checker using either Lingua::Ispell or Text::Aspell. A test library is provided via Test::Pod::Spelling</li><br />
<li><a href="https://metacpan.org/module/Test::Spelling">Test::Spelling</a> check for spelling errors in POD files. Pod::Spell is used for parsing and an open3 call is made to either 'spell', 'aspell', 'ispell', or 'hunspell' for spell checking.</li><br />
</ul></p>

<p>XML<br />
<ul><br />
<li><a href="https://www.metacpan.org/module/Apache::AxKit::Language::SpellCheck">Apache::AxKit::Language::SpellCheck</a> is an XML Text Spell Checker for the Apache AxKit. Checking is done via Text::Aspell</li><br />
<li><a href="https://www.metacpan.org/module/xml_spellcheck">xml_spellcheck</a> is a cli application for spell checking XML files. It makes a system call to 'aspell -c' directly.</li><br />
</ul></p>

<p>Spell checking as a test<br />
<ul><br />
<li>Dist::Zilla::Plugin::PodSpellingTests is DEPRECATED! The old name of the PodSpelling plugin</li><br />
<li><a href="https://www.metacpan.org/module/Dist::Zilla::Plugin::SpellingCommonMistakesTests">Dist::Zilla::Plugin::SpellingCommonMistakesTests</a> Generates a Test::Pod::Spelling::CommonMistakes release test</li><br />
<li><a href="https://metacpan.org/module/Test::Pod::Spelling::CommonMistakes">Test::Pod::Spelling::CommonMistakes</a> Checks POD for common spelling mistakes using Pod::Spell::CommonMistakes.</li><br />
<li><a href="https://www.metacpan.org/module/Dist::Zilla::Plugin::Test::PodSpelling">Dist::Zilla::Plugin::Test::PodSpelling</a> Generates a Test::Spelling author test</li><br />
<li><a href="https://metacpan.org/module/Perl::Critic::Policy::Documentation::PodSpelling">Perl::Critic::Policy::Documentation::PodSpelling</a> Spell check the POD. Aspell is used via an open command.</li><br />
</ul></p>

<p>Checks spelling via remote service/application<br />
<ul><br />
<li><a href="https://metacpan.org/module/Bing::Search::Source::Spell">Bing::Search::Source::Spell</a> uses Bing to spell check text.</li><br />
<li><a href="https://metacpan.org/module/Lingua::AtD">Lingua::AtD</a> Provides an OO wrapper for After the Deadline grammar and spelling service.</li><br />
<li><a href="https://www.metacpan.org/module/Lingua::MSWordSpell">Lingua::MSWordSpell</a> Uses Microsoft Word's Spellchecker over OLE automation instead of something like ispell</li><br />
<li><a href="https://metacpan.org/module/Net::Google::Spelling">Net::Google::Spelling</a> simple OOP-ish interface to the Google SOAP API for spelling suggestions. This appears abandoned based on last update date, number of open bugs and the fact it has more failed test reports than passes.</li><br />
<li><a href="https://metacpan.org/module/WebService::KoreanSpeller">WebService::KoreanSpeller</a> A Korean spell checker</li><br />
</ul></p>

<p>Everything else<br />
<ul><br />
<li><a href="https://www.metacpan.org/module/Gtk2::Spell">Gtk2::Spell</a> Perl bindings to GtkSpell, used in concert with Gtk2::TextView.</li><br />
<li><a href="https://www.metacpan.org/module/Lingua::Jspell">Lingua::Jspell</a> Perl interface to the Jspell morphological analyzer.</li><br />
<li><a href="https://www.metacpan.org/module/Lingua::Spelling::Alternative">Lingua::Spelling::Alternative</a> Use affix files generated by the ispell tools to return alternative spellings of a given word</li><br />
<li><a href="https://metacpan.org/module/Pod::Spell">Pod::Spell</a> a formatter for spell checking Pod, no actual checking capabilities built in.</li><br />
<li><a href="https://metacpan.org/module/Text::SpellChecker::GUI">Text::SpellChecker::GUI</a> Implements a user interface to Text::SpellChecker</li><br />
<li><a href="https://metacpan.org/module/Tie::Ispell">Tie::Ispell</a> Ties a hash with an Ispell dictionary</li><br />
<li><a href="https://metacpan.org/module/tkispell">tkispell</a> Perl/Tk user interface for Ispell</li><br />
</ul></p>

<p>While many of these modules are actively developed and useful, most do not fit my requirements for this project. I want to spell check any kind of UTF-8 encoded text without needing an Internet connection or a closed source program. The first two groups, the direct and indirect spell checkers, appear to meet these requirements.</p>

<p>So which one to use? Let's take a look under the hood. GNU Ispell gives spelling suggestions if a word is not found in its dictionary. When searching for possible corrections to present, it uses a <a href="https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance">Damerau–Levenshtein distance</a> of 1, so each candidate differs from the misspelled word by a single insertion, deletion, substitution, or transposition of two adjacent characters.</p>
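
<p>To make "a distance of 1" concrete, here is a minimal pure-Perl sketch of the restricted Damerau–Levenshtein (optimal string alignment) distance. It only illustrates the metric and is not Ispell's actual implementation; the <code>osa_distance</code> name and the tiny word list are made up for this example.</p>

<pre><code>use strict;
use warnings;
use List::Util qw(min);

# Restricted Damerau-Levenshtein (optimal string alignment) distance:
# the number of insertions, deletions, substitutions, and adjacent
# transpositions needed to turn $source into $target.
sub osa_distance {
    my ( $source, $target ) = @_;
    my @s = split //, $source;
    my @t = split //, $target;
    my $m = scalar @s;
    my $n = scalar @t;

    # $d[$i][$j] holds the distance between the first $i characters
    # of $source and the first $j characters of $target.
    my @d;
    $d[$_][0] = $_ for 0 .. $m;
    $d[0][$_] = $_ for 0 .. $n;

    for my $i ( 1 .. $m ) {
        for my $j ( 1 .. $n ) {
            my $cost = $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            $d[$i][$j] = min(
                $d[ $i - 1 ][$j] + 1,              # deletion
                $d[$i][ $j - 1 ] + 1,              # insertion
                $d[ $i - 1 ][ $j - 1 ] + $cost,    # substitution
            );

            # Two adjacent characters swapped counts as one edit.
            if (    $i > 1
                and $j > 1
                and $s[ $i - 1 ] eq $t[ $j - 2 ]
                and $s[ $i - 2 ] eq $t[ $j - 1 ] )
            {
                $d[$i][$j] = min( $d[$i][$j], $d[ $i - 2 ][ $j - 2 ] + 1 );
            }
        }
    }
    return $d[$m][$n];
}

# Hypothetical word list standing in for a real dictionary.
my @wordlist   = qw(the ten tea tech hat);
my @candidates = grep { osa_distance( 'teh', $_ ) == 1 } @wordlist;
print "$_\n" for @candidates;
</code></pre>

<p>Every word in the list except "hat" is one edit away from "teh", which hints at how quickly the pool of suggestions grows even at a distance of 1.</p>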

<p>GNU Aspell is an Ispell replacement that can handle utf-8 by default, has 70 supported language dictionaries, and supports using multiple dictionaries at once.</p>

<p>Hunspell is an advanced spell checker based on MySpell that supports both dictionaries and rules. It is currently used by LibreOffice/OpenOffice and has dictionaries for 99 languages.</p>

<p>It looks like Aspell and Hunspell are the front runners. I rejected Meta::Tool::Aspell because it has not been updated since 2002 and the distribution it belongs to contains many other modules that are not needed for this problem. That leaves Text::Aspell, Text::Hunspell, Search::Tools::SpellCheck, and Text::SpellChecker. Before trying these modules out, I decided to run the command line versions of aspell and hunspell against some test data and compare the output.</p>

<p>Long story short, they both generate too much noise because of how they function. If a word is in the dictionary, or can be matched to one that is, everything works out. Anything else is reported as a misspelling. Since I am dealing with text of any kind, from source code files to memos and email, that means things like company names, places, people's names, animals, plants, email addresses, and technical terms can easily be flagged as incorrect.</p>
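
<p>A toy example shows where the noise comes from. The hash below is only a stand-in for a real dictionary (it is not Aspell or Hunspell), and the <code>flagged_words</code> name and the sample sentence are invented for this sketch.</p>

<pre><code>use strict;
use warnings;

# Stand-in dictionary: a handful of ordinary English words.
my %dictionary = map { $_ => 1 } qw(please send the report to at before friday);

# Return every whitespace-separated token that is not in the dictionary.
sub flagged_words {
    my ($text) = @_;
    return grep { !$dictionary{ lc $_ } } split ' ', $text;
}

my @noise = flagged_words(
    'Please send the report to kirk@example.com at Joyent before Friday');
print "$_\n" for @noise;
</code></pre>

<p>Nothing in the sentence is misspelled, yet the email address and the company name are both reported, simply because no general-purpose dictionary contains them.</p>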

<p>I wrote a few sample programs to try and work around this problem. I will cover the results of my research in a future post.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Backing up Berlios.de</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/kirk_kimmel/2012/08/backing-up-berliosde.html" />
    <id>tag:blogs.perl.org,2012:/users/kirk_kimmel//922.3705</id>

    <published>2012-08-17T04:16:57Z</published>
    <updated>2012-08-17T13:35:09Z</updated>

    <summary>Last year it was announced that www.berlios.de was going to be shut down. People were asking if someone was going to back it up to save all those open source projects. I decided to give it a shot and I...</summary>
    <author>
        <name>Kimmel</name>
        
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/kirk_kimmel/">
        <![CDATA[<p>Last year it was announced that <a href="http://www.berlios.de/">www.berlios.de</a> was going to be shut down. People were asking if someone was going to back it up to save all those open source projects. I decided to give it a shot and was able to back up all of the Berlios projects. While I was working on uploading everything to a new host (I was looking at GitHub), it was announced that the site had been saved, so I set the project aside.<br />
  <br />
Digging around, I found this code and decided to post it so that people who are building data-mining tools have another real-world example. <a href="https://github.com/kimmel/backup-berlios.de">github.com/kimmel/backup-berlios.de</a> contains two scripts, a shared library, and a data file. </p>

<p>01_fetch_project_list.pl builds a list of all the projects on Berlios and writes it to a file. <br />
02_download_repos.pl takes that data file and downloads everything it can. </p>

<p>I broke the whole process into two scripts for a variety of reasons, mainly so I could resume the project downloads by having it premapped and then skipping anything that already existed.</p>
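
<p>The resume step boils down to filtering the project list against what is already on disk. This is only a sketch of the idea, not the code from the repository; the <code>pending_projects</code> name and the <code>NAME.tar.gz</code> naming scheme are assumptions made for the example.</p>

<pre><code>use strict;
use warnings;
use File::Spec;

# Return the projects whose archive is not yet present in $dir.
# The "NAME.tar.gz" file naming is a hypothetical choice for this sketch.
sub pending_projects {
    my ( $dir, @projects ) = @_;
    return grep {
        my $archive = File::Spec->catfile( $dir, "$_.tar.gz" );
        !-e $archive;
    } @projects;
}
</code></pre>

<p>Calling <code>pending_projects( 'downloads', @all_projects )</code> at the start of each run would then return only the work that is still outstanding.</p>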

<p>When downloading a large number of web pages, two key optimizations can be made: caching and compression. <a href="https://www.metacpan.org/module/WWW::Mechanize::Cached::GZip">WWW::Mechanize::Cached::GZip</a> works out great because it requests a gzip-compressed response, which is ideal for web pages, and automatically caches the results for later. When downloading the projects themselves I was simply fetching archive files, so WWW::Mechanize::* would have been overkill in terms of features. <a href="https://www.metacpan.org/module/LWP::UserAgent">LWP::UserAgent</a> was perfect for this simple task. </p>
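
<p>For plain archive downloads, a few lines of LWP::UserAgent are enough. The sketch below is not taken from the repository; the <code>fetch_archive</code> name, the 30 second timeout, and the URL in the usage comment are placeholders.</p>

<pre><code>use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 30 );

# Stream the archive at $url straight into the file $path.
# Returns true on success and warns with the HTTP status otherwise.
sub fetch_archive {
    my ( $url, $path ) = @_;
    my $response = $ua->get( $url, ':content_file' => $path );
    warn "Failed to fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->is_success;
}

# Usage (hypothetical URL):
# fetch_archive( 'http://example.org/project.tar.gz', 'project.tar.gz' );
</code></pre>

<p>Passing <code>:content_file</code> writes the response body to disk as it arrives, so a large archive never has to fit in memory.</p>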

<p>I didn't need to worry about Unicode since Berlios uses only ASCII characters for project names. For logging, the log filename needed to contain YYYY-MM-DD and rotate automatically. A combination of <a href="https://metacpan.org/module/Log::Dispatch::File::Stamped">Log::Dispatch::File::Stamped</a> and <a href="https://metacpan.org/module/Log::Dispatch::Screen">Log::Dispatch::Screen</a> fulfilled these requirements. I also considered doing the work in parallel but decided against it. I prefer to collect data as a nice web citizen instead of constantly slamming a server with as many requests as possible.</p>]]>
        
    </content>
</entry>

</feed>
