<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>stas</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/" />
    <link rel="self" type="application/atom+xml" href="http://blogs.perl.org/users/stas/atom.xml" />
    <id>tag:blogs.perl.org,2009-11-03:/users/stas//1361</id>
    <updated>2013-05-12T23:41:00Z</updated>
    <subtitle>say v74.65.80.72;</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 4.38</generator>

<entry>
    <title>Ludic Perl</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2013/05/ludic-perl.html" />
    <id>tag:blogs.perl.org,2013:/users/stas//1361.4671</id>

    <published>2013-05-12T23:38:29Z</published>
    <updated>2013-05-12T23:41:00Z</updated>

    <summary>Yep, indeed, contributing to the Perl community can be a very ludic activity (not to be confused with luddite!). I tried to list every Perl-related web resource where participants are encouraged to build up some kind of score. Most have...</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="chart" label="chart" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="community" label="community" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="gamification" label="gamification" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="perl" label="Perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="rank" label="rank" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<p>Yep, indeed, contributing to the Perl community can be a very <a href="https://duckduckgo.com/?q=define%3Aludic">ludic</a> activity (not to be confused with <a href="https://duckduckgo.com/?q=define+luddite">luddite</a>!). I tried to list every Perl-related web resource where participants are encouraged to build up some kind of score. Most have charts where participants compete for the highest rank, while some have an absolute goal (like 100% test coverage). The list is in no particular order. Feel free to post the resources I forgot or am unaware of in the comments!</p>
]]>
        <![CDATA[<ul>
<li><a href="https://github.com/languages/Perl">GitHub » Explore » Languages » Perl</a> - weekly/monthly/overall chart with the most starred/forked Perl projects</li>
<li><a href="http://www.github-meets-cpan.com/">GitHub Meets CPAN</a> - CPAN authors get ranked by their GitHub relevance</li>
<li><a href="http://cpantesters.org/">CPAN Testers Report</a> - beat the machine making every test pass on every platform <code>:)</code></li>
<li><a href="http://cpants.cpanauthors.org/ranking">CPANTS game</a> - rates CPAN distributions by compliance with some sane standards</li>
<li><a href="http://questhub.io/perl/players">Play Perl</a> - social TODO list for the Perl community</li>
<li><a href="http://stats.cpantesters.org/testers.html">CPAN Testers Statistics</a> - smoke others' modules on your system(s), try to beat BINGOS</li>
<li><a href="https://metacpan.org/favorite/leaderboard">MetaCPAN Leaderboard</a> - buried somewhat deep inside MetaCPAN, the chart of the Top 100 CPAN modules</li>
<li><a href="http://www.enlightenedperl.org/ironman.html">Iron Man Perl Blogging Challenge</a> - blog about Perl, earn badges</li>
<li><a href="http://pjcj.sytes.net/cover/latest/">CPAN Coverage Report</a> - an interesting initiative by Paul Johnson, the author of Devel::Cover (last updated: 23-Aug-2011)</li>
<li><a href="http://changes.cpanhq.org/hof">CPAN::Changes Kwalitee Service Hall of Fame</a> - not sure about it, but this appears to be the predecessor of the CPANTS game (seems to be offline, <a href="http://web.archive.org/web/20120719110302/http://changes.cpanhq.org/hof">see via The Wayback Machine</a>)</li>
<li><a href="https://metacpan.org/module/Perl::Achievements">Perl::Achievements</a> - a cool idea by Yanick Champoux</li>
</ul>
]]>
    </content>
</entry>

<entry>
    <title>Web Scraping with Modern Perl (Part 2 - Speed Edition)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2013/02/web-scraping-with-modern-perl-part-2---speed-edition.html" />
    <id>tag:blogs.perl.org,2013:/users/stas//1361.4346</id>

    <published>2013-02-18T17:38:24Z</published>
    <updated>2013-02-18T17:41:21Z</updated>

    <summary>tl;dr Grab the gist with the complete, working source code. Benchmark it against the one featured on the previous article: $ \time perl mojo-crawler.pl 23.08user 0.07system 2:11.99elapsed 17%CPU (0avgtext+0avgdata 84064maxresident)k 16inputs+0outputs (0major+6356minor)pagefaults 0swaps $ \time perl yada-crawler.pl 8.83user 0.30system 0:12.60elapsed...</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="anyevent" label="AnyEvent" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="asynchronous" label="asynchronous" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="crawler" label="crawler" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="curl" label="curl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="libcurl" label="libcurl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="modernperl" label="Modern Perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="scaper" label="scaper" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="spider" label="spider" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="xpath" label="XPath" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<h2>tl;dr</h2>

<p>Grab the <a href="https://gist.github.com/creaktive/4607326">gist</a> with the complete, working source code.
Benchmark it against <a href="https://gist.github.com/creaktive/4347600">the one</a> featured in the <a href="http://blogs.perl.org/users/stas/2013/01/web-scraping-with-modern-perl-part-1.html">previous article</a>:</p>

<pre><code>$ \time perl mojo-crawler.pl
23.08user 0.07system 2:11.99elapsed 17%CPU (0avgtext+0avgdata 84064maxresident)k
16inputs+0outputs (0major+6356minor)pagefaults 0swaps

$ \time perl yada-crawler.pl
8.83user 0.30system 0:12.60elapsed 72%CPU (0avgtext+0avgdata 131008maxresident)k
0inputs+0outputs (0major+8607minor)pagefaults 0swaps
</code></pre>

<p>How can it be <strong>10x faster</strong> while consuming less than <strong>half of the CPU</strong> resources?!</p>
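
<p>To put numbers on that claim, here is a minimal sketch that recomputes the speedup and the CPU share from the two <code>time</code> reports above. The figures are copied from the benchmark output; the <code>cpu_share</code> helper is a hypothetical name of mine, not part of any module.</p>

<pre><code>#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

# Figures copied from the two `time` reports above (in seconds)
my %run = (
    'mojo-crawler.pl' =&gt; { user =&gt; 23.08, system =&gt; 0.07, elapsed =&gt; 2 * 60 + 11.99 },
    'yada-crawler.pl' =&gt; { user =&gt;  8.83, system =&gt; 0.30, elapsed =&gt; 12.60 },
);

# Hypothetical helper: fraction of the wall-clock time spent on the CPU
sub cpu_share {
    my ($r) = @_;
    return ($r-&gt;{user} + $r-&gt;{system}) / $r-&gt;{elapsed};
}

for my $name (sort keys %run) {
    printf "%-16s %5.1f%% CPU\n", $name, 100 * cpu_share($run{$name});
}

my ($mojo, $yada) = @run{qw(mojo-crawler.pl yada-crawler.pl)};
printf "speedup: %.1fx\n", $mojo-&gt;{elapsed} / $yada-&gt;{elapsed};
printf "CPU time ratio: %.2f\n",
    ($yada-&gt;{user} + $yada-&gt;{system}) / ($mojo-&gt;{user} + $mojo-&gt;{system});
</code></pre>

<p>The CPU share is simply <code>(user + system) / elapsed</code>, which is what <code>time</code> reports in its <code>%CPU</code> column, give or take rounding.</p>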
]]>
        <![CDATA[<h2>Perl as a glue</h2>

<p>Sorry, I cheated a bit on the <code>mojo-crawler.pl</code> benchmark results.
It implicitly uses <a href="http://cpan.me/EV">EV</a>, <em>a high performance full-featured event loop</em> library, when it is present.
And it is not required for <a href="http://mojolicio.us">Mojolicious</a> to work properly.
Let's disable it:</p>

<pre><code>$ MOJO_REACTOR=Mojo::Reactor::Poll time perl mojo-crawler.pl
113.99user 13.37system 2:08.46elapsed 99%CPU (0avgtext+0avgdata 83808maxresident)k
2912inputs+0outputs (18major+5789minor)pagefaults 0swaps
</code></pre>

<p>The elapsed time is the same with or without EV, but now the pure-Perl crawler hogs the CPU!</p>

<p>Why? EV provides an interface to <a href="http://software.schmorp.de/pkg/libev.html">libev</a>, which clearly does a better connection polling job than the 100% interpreted code. The bridge between Perl and the compiled library is called <a href="http://perldoc.perl.org/perlxs.html">XS</a>:</p>

<blockquote>
  <p>XS is an interface description file format used to create an extension interface between Perl and C code (or a C library) which one wishes to use with Perl.</p>
</blockquote>

<p>Actually, <a href="https://metacpan.org">CPAN</a> is full of <em>high performance</em> XS-based modules for many tasks:</p>

<ul>
<li><a href="http://cpan.me/Net::Curl">Net::Curl</a> - Perl interface for <a href="http://curl.haxx.se/libcurl/c/">libcurl</a>;</li>
<li><a href="http://cpan.me/HTTP::Parser::XS">HTTP::Parser::XS</a> - a fast, primitive HTTP request parser;</li>
<li><a href="http://cpan.me/JSON::XS">JSON::XS</a> - JSON serialising/deserialising, done correctly and fast;</li>
<li><a href="http://cpan.me/Unicode::CheckUTF8">Unicode::CheckUTF8</a> - an XS wrapper around some Unicode Consortium code to check if a string is valid UTF-8;</li>
<li><a href="http://cpan.me/XML::LibXML">XML::LibXML</a> - Perl Binding for <a href="http://xmlsoft.org/">libxml2</a>.</li>
</ul>

<p>Thus, an efficient and fast web crawler/scraper could be constructed with those <em>"bare-metal"</em> building blocks <code>;)</code></p>

<h2>Show me the code!</h2>

<pre><code>#!/usr/bin/env perl
use 5.016;
use common::sense;
use utf8::all;

# Use fast binary libraries
use EV;
use Web::Scraper::LibXML;
use YADA 0.039;

YADA-&gt;new(
    common_opts =&gt; {
        # Available opts @ http://curl.haxx.se/libcurl/c/curl_easy_setopt.html
        encoding        =&gt; '',
        followlocation  =&gt; 1,
        maxredirs       =&gt; 5,
    }, http_response =&gt; 1, max =&gt; 4,
)-&gt;append([qw[
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
]] =&gt; sub {
    my ($self) = @_;
    return  if $self-&gt;has_error
        or not $self-&gt;response-&gt;is_success
        or not $self-&gt;response-&gt;content_is_html;

    # Declare the scraper once and then reuse it
    state $scraper = scraper {
        process q(html title), title =&gt; q(text);
        process q(a), q(links[]) =&gt; q(@href);
    };

    # Employ amazing Perl (en|de)coding powers to handle HTML charsets
    my $doc = $scraper-&gt;scrape(
        $self-&gt;response-&gt;decoded_content,
        $self-&gt;final_url,
    );

    printf qq(%-64s %s\n), $self-&gt;final_url, $doc-&gt;{title};

    # Enqueue links from the parsed page
    $self-&gt;queue-&gt;prepend([
        grep {
            $_-&gt;can(q(host)) and $_-&gt;scheme =~ m{^https?$}x
            and $_-&gt;host eq $self-&gt;initial_url-&gt;host
            and (grep { length } $_-&gt;path_segments) &lt;= 3
        } @{$doc-&gt;{links} // []}
    ] =&gt; __SUB__);
})-&gt;wait;
</code></pre>

<h2>Now what?!</h2>

<p>The example above has half the lines of code of the <a href="https://gist.github.com/creaktive/4347600">previous one</a>.
This comes at a cost of installing a bunch of external dependencies from the CPAN:</p>

<pre><code>$ cpanm AnyEvent::Net::Curl::Queued EV HTML::TreeBuilder::LibXML Web::Scraper utf8::all
</code></pre>

<p>Despite the <code>use 5.016</code> pragma, this code works fine on Perl 5.10 if you get rid of the <code>__SUB__</code> reference.</p>
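
<p>For reference, a minimal sketch of that substitution, using a toy countdown instead of the crawler callback. On Perl 5.16+ an anonymous sub can refer to itself via <code>__SUB__</code>; older Perls need a lexical variable declared <em>before</em> the assignment. The <code>$countdown</code> name is purely illustrative.</p>

<pre><code>#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

# Perl 5.10-compatible recursion: declare the variable first,
# so that the closure can see it
my $countdown;
$countdown = sub {
    my ($n) = @_;
    return if $n &lt; 0;
    say $n;
    $countdown-&gt;($n - 1);    # instead of __SUB__-&gt;($n - 1)
};

$countdown-&gt;(3);

# The closure references itself; break the cycle to avoid a leak
undef $countdown;
</code></pre>

<p>The self-referencing closure is exactly the kind of circular reference that <code>__SUB__</code> was introduced to avoid, hence the explicit <code>undef</code> at the end.</p>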

<p>So, which approach is the better one? Obviously, it depends. There is no silver bullet: <strong>web crawling is ultimately I/O-bound</strong>! However, specialized and well-tested libraries are what keeps it I/O-bound rather than CPU-bound.
For instance, trimming the <code>::LibXML</code> part from the <code>use Web::Scraper::LibXML</code> statement considerably slows down our tiny crawler, because the HTML parsing will then consume more CPU cycles than the connection polling.</p>

<p>As an edge case, let's see how the venerable <a href="https://www.gnu.org/software/wget/">GNU Wget</a> tool (see also <a href="http://cpan.me/yada">yada</a>, which comes bundled together with the <a href="http://cpan.me/AnyEvent::Net::Curl::Queued">AE::N::C::Queued</a> distribution) behaves:</p>

<pre><code>$ "time" wget -r --follow-tags a http://sysd.org/
0.23user 0.41system 1:10.20elapsed 0%CPU (0avgtext+0avgdata 23920maxresident)k
0inputs+40704outputs (0major+4323minor)pagefaults 0swaps
</code></pre>

<p>Despite its clear disadvantage of using a single connection, it is almost completely I/O-bound, since its URL extraction code doesn't require complete parsing of the HTML.</p>
]]>
    </content>
</entry>

<entry>
    <title>Put a fancy CPU/RAM usage chart in tmux status bar</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2013/01/put-a-fancy-cpuram-usage-chart-in-tmux-status-bar.html" />
    <id>tag:blogs.perl.org,2013:/users/stas//1361.4246</id>

    <published>2013-02-01T02:17:55Z</published>
    <updated>2013-02-01T02:19:59Z</updated>

    <summary>Prologue So, once upon a time I had a crazy idea: to put an almost complete resource meter into the tmux status bar. You know, the clock is so boring. Let&apos;s add a battery indicator there. And the load numbers....</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="ansi" label="ANSI" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="ascii" label="ASCII" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="chart" label="chart" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="cpu" label="cpu" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="memory" label="memory" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="monitor" label="monitor" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="spark" label="spark" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="tmux" label="tmux" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<h2>Prologue</h2>

<p>So, once upon a time I had a crazy idea: to put an almost complete resource meter into the <a href="http://tmux.sourceforge.net/">tmux</a> status bar. You know, the clock is so boring. Let's add a battery indicator there. And the load numbers. And the memory usage...</p>

<p>Needless to say, this resulted in an unbearable user experience:</p>

<pre><code>a:2.96G c:4.37G f:5.41G i:2.98G l:0.65/1.73/1.41 23:47
</code></pre>

<p>Actually, the data is OK, the "gauges" work fine on every Unix I tested them on. If only it was a bit fancier...</p>

<h2>Puke rainbows!</h2>

<p>Then I discovered <a href="https://github.com/Goles/Battery">Battery</a>. And then, <a href="http://zachholman.com/spark/">Spark</a>. I just couldn't resist, so I revamped my messy Perl usage data parser to output this gorgeous ANSI art scrolling chart:</p>

<p><img src="http://i.imgur.com/JGvgK6B.png?1" alt="Screenshot" title="" /></p>

<p>It was tested on Mac OS X 10.8.2, Ubuntu 12.04, Ubuntu 11.10, Debian 6.0.6 and works fine with the default system Perl; there are no external dependencies at all.</p>

<p>Liked it? Go ahead, grab your copy and follow the installation instructions: <a href="https://github.com/creaktive/rainbarf">https://github.com/creaktive/rainbarf</a></p>
]]>
        

    </content>
</entry>

<entry>
    <title>Web Scraping with Modern Perl (Part 1)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2013/01/web-scraping-with-modern-perl-part-1.html" />
    <id>tag:blogs.perl.org,2013:/users/stas//1361.4222</id>

    <published>2013-01-21T20:52:25Z</published>
    <updated>2013-02-18T17:51:30Z</updated>

    <summary>tl;dr Grab the gist with the complete, working source code. I often hear the question: &quot;so, you&apos;re Perl guy, could you show me how to make a web crawler/spider/scraper, then?&quot; I hope this post series will become my ultimate answer...</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="asynchronous" label="asynchronous" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="crawler" label="crawler" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="cssselector" label="CSS selector" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="modernperl" label="Modern Perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="mojolicious" label="Mojolicious" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="scaper" label="scaper" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="spider" label="spider" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<h2>tl;dr</h2>

<p>Grab the <a href="https://gist.github.com/4347600">gist</a> with the complete, working source code.</p>

<p>I often hear the question: <em>"so, you're a Perl guy, could you show me how to make a web crawler/spider/scraper, then?"</em>
I hope this post series will become my ultimate answer :)</p>

<p>First of all, I compiled a small list of features that people expect of crawlers nowadays:</p>

<ol>
<li>capability of concurrent, persistent connections;</li>
<li>usage of CSS selectors to process HTML;</li>
<li>easily modifiable source instead of a flexible OOP inheritance structure;</li>
<li>LESS DEPENDENCIES!</li>
</ol>
]]>
        <![CDATA[<p>Well, sorry <a href="http://cpan.me/WWW::Mechanize">WWW::Mechanize</a>, it's not your turn. Instead, the first example will be based on <a href="http://mojolicio.us/perldoc/Mojo/UserAgent">Mojo::UserAgent</a> from the <a href="http://mojolicio.us/">Mojolicious</a> framework. Being event-driven and low on dependencies (none except <a href="http://mojolicio.us/">Mojolicious</a> itself) makes it especially attractive for Perl newcomers with some <a href="http://jquery.com/">jQuery</a> literacy.</p>

<h2>Boilerplate</h2>

<p>Let's call our project <code>mojo-crawler.pl</code>. Here's how it begins:</p>

<pre><code>#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;

# FIFO queue
my @urls = map { Mojo::URL-&gt;new($_) } qw(
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
    http://sysd.org/page/4/
    http://sysd.org/page/5/
    http://sysd.org/page/6/
);

# Limit parallel connections to 4
my $max_conn = 4;

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
    -&gt;new(max_redirects =&gt; 5)
    -&gt;detect_proxy;

# Keep track of active connections
my $active = 0;
</code></pre>

<p>Note that I'm using my very own server as a guinea pig.
Consider that my oath on the safety of the parallelization method I chose.</p>

<h2>Event loop</h2>

<p>Here we keep a constant-sized pool of active connections, populating it with URLs from our FIFO queue. The anonymous <code>sub {}</code> fires every time the event loop is idle (0-second timer):</p>

<pre><code>Mojo::IOLoop-&gt;recurring(
    0 =&gt; sub {
        for ($active + 1 .. $max_conn) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop-&gt;stop)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua-&gt;get($url =&gt; \&amp;get_callback);
        }
    }
);
</code></pre>
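
<p>To see the refill logic in isolation, here is a network-free sketch that assumes nothing but core Perl. Each pass of the <code>while</code> loop stands in for one idle tick of the event loop, and the "downloads" are simulated with made-up durations; the <code>page</code><em>N</em> names and tick counts are invented for illustration.</p>

<pre><code>#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

# FIFO queue of [name =&gt; fake duration in ticks]
my @queue    = map { [ "page$_" =&gt; 1 + $_ % 3 ] } 1 .. 6;
my $max_conn = 4;    # pool size
my %active;          # name =&gt; remaining ticks
my $peak     = 0;

# Each iteration plays the role of one idle event loop tick
while (@queue or %active) {

    # Refill the pool, same bounds as in the recurring timer
    for (keys(%active) + 1 .. $max_conn) {
        my $job = shift @queue or last;
        $active{ $job-&gt;[0] } = $job-&gt;[1];
    }

    my $n = keys %active;
    $peak = $n if $n &gt; $peak;

    # Simulate progress; a finished job frees its slot
    for my $name (sort keys %active) {
        next if --$active{$name} &gt; 0;
        delete $active{$name};
        say "done: $name";
    }
}
say "peak concurrency: $peak";
</code></pre>

<p>However the durations are shuffled, the number of jobs in flight never exceeds <code>$max_conn</code>, and a freed slot is refilled on the very next tick.</p>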

<p>Now, start the event loop unless it is already started somewhere else.
In <em>this</em> code, it won't be started anywhere else.
But who knows how deep the Copy&amp;Paste will bury it in future?!</p>

<pre><code># Start event loop if necessary
Mojo::IOLoop-&gt;start unless Mojo::IOLoop-&gt;is_running;
</code></pre>

<h2>Download callback</h2>

<p>Every completed download ends here. Even the failed ones.
Thus, when the download is complete, we decrease the <code>$active</code> counter to free a connection slot:</p>

<pre><code>sub get_callback {
    my (undef, $tx) = @_;

    # Deactivate
    --$active;

    # Parse only OK HTML responses
    return
        if not $tx-&gt;res-&gt;is_status_class(200)
        or $tx-&gt;res-&gt;headers-&gt;content_type !~ m{^text/html\b}ix;

    # Request URL
    my $url = $tx-&gt;req-&gt;url;

    say $url;
    parse_html($url, $tx);

    return;
}

# Not implemented yet!
sub parse_html { return }
</code></pre>

<p>Fear not, the <code>parse_html()</code> is implemented right below!</p>

<h2>Take one</h2>

<p>Let's make sure this code actually does what it should, while it is still low on the line count:</p>

<pre><code>$ perl mojo-crawler.pl 
http://sysd.org/page/2/
http://sysd.org/page/4/
http://sysd.org/page/3/
http://sysd.org/page/5/
http://sysd.org/
http://sysd.org/page/6/
$
</code></pre>

<p>Fine, the downloads were completed in the order of the time it took to download each resource. Oh, and <a href="http://sysd.org/page/1">http://sysd.org/page/1</a> simply redirects to <a href="http://sysd.org/">http://sysd.org/</a>.</p>

<h2>Adding some depth</h2>

<p>The most difficult part of making web crawlers isn't making them start; it's making them <strong>stop</strong>. Our complete <code>parse_html()</code> also takes care of feeding the URL queue with the URLs extracted from <code>&lt;a href="..."&gt;</code> links. Plus, it performs some trivial checks on:</p>

<ol>
<li>Protocols (only HTTP and HTTPS);</li>
<li>Path depth (don't go deeper than /a/b/c);</li>
<li>URL revisiting (don't download the same resource over and over);</li>
<li>Cross-domain links (not allowed).</li>
</ol>

<p>And, to show we've been there, let's print the title of the page:</p>

<pre><code>sub parse_html {
    my ($url, $tx) = @_;

    say $tx-&gt;res-&gt;dom-&gt;at('html title')-&gt;text;

    # Extract and enqueue URLs
    for my $e ($tx-&gt;res-&gt;dom('a[href]')-&gt;each) {

        # Validate href attribute
        my $link = Mojo::URL-&gt;new($e-&gt;{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link-&gt;to_abs($tx-&gt;req-&gt;url)-&gt;fragment(undef);
        next unless grep { $link-&gt;protocol eq $_ } qw(http https);

        # Don't go deeper than /a/b/c
        next if @{$link-&gt;path-&gt;parts} &gt; 3;

        # Access every link only once
        state $uniq = {};
        ++$uniq-&gt;{$url-&gt;to_string};
        next if ++$uniq-&gt;{$link-&gt;to_string} &gt; 1;

        # Don't visit other hosts
        next if $link-&gt;host ne $url-&gt;host;

        push @urls, $link;
        say " -&gt; $link";
    }
    say '';

    return;
}
</code></pre>

<h2>Take two</h2>

<p>This time, it will be a lot slower, as every internal link is followed and downloaded.
The crawler will print the accessed URL, the title of the page, and the extracted non-visited links:</p>

<pre><code>$ "time" perl mojo-crawler.pl 
http://sysd.org/
sysd.org
 -&gt; http://sysd.org/tag/benchmark/
 -&gt; http://sysd.org/tag/command-line-interface/
 -&gt; http://sysd.org/tag/console/
 -&gt; http://sysd.org/tag/overhead/
 -&gt; http://sysd.org/tag/terminal/
 -&gt; http://sysd.org/tag/teste/
 -&gt; http://sysd.org/tag/tty/
 -&gt; http://sysd.org/tag/velocidade/
 -&gt; http://sysd.org/tag/browser/
 -&gt; http://sysd.org/tag/deprecation/
 -&gt; http://sysd.org/tag/ie/
 -&gt; http://sysd.org/tag/microsoft/
 -&gt; http://sysd.org/tag/navegador/
 -&gt; http://sysd.org/tag/webdesign/
 -&gt; http://sysd.org/tag/webdev/
 -&gt; http://sysd.org/tag/api/
 -&gt; http://sysd.org/tag/hack-2/
 -&gt; http://sysd.org/tag/integration/
 -&gt; http://sysd.org/tag/rest/
...
27.73user 0.88system 3:48.46elapsed 12%CPU (0avgtext+0avgdata 98272maxresident)k
0inputs+8outputs (0major+6749minor)pagefaults 0swaps
$
</code></pre>

<p>A very important final note: although this tiny crawler operates through the recursive traversal of links, it is implemented iteratively. Thus, it is very light on memory consumption. In fact, the only structure that hogs the RAM is the <code>$uniq</code> hashref. <a href="http://perldoc.perl.org/functions/tie.html">tie</a> it to any kind of persistent storage if that concerns you. The FIFO queue <code>@urls</code> <strong>could</strong> grow a lot if the crawled site has dynamically-generated link lists (or even broken pagination).
So, keeping it in memory rather than in some kind of key/value database is a bit reckless.</p>
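
<p>A minimal sketch of the <code>tie</code> approach, assuming the <code>SDBM_File</code> module that ships with Perl is good enough as the backing store. The file location and the sample URLs are placeholders; a real crawler would use a fixed path instead of a temporary directory.</p>

<pre><code>#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

use Fcntl;                      # O_RDWR, O_CREAT
use File::Temp qw(tempdir);
use SDBM_File;                  # ships with Perl

# Assumption: a throwaway directory, removed at exit
my $file = tempdir(CLEANUP =&gt; 1) . '/visited';

tie my %visited, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
    or die "Couldn't tie SDBM file '$file': $!";

# Drop-in replacement for the in-memory hashref
my $uniq = \%visited;

for my $link (qw(http://sysd.org/ http://sysd.org/page/2/ http://sysd.org/)) {
    next if ++$uniq-&gt;{$link} &gt; 1;
    say "new: $link";
}

untie %visited;
</code></pre>

<p>Keep in mind that SDBM limits the size of each key/value pair (to about 1 KB), which is plenty for a counter but may be too small for very long URLs; any other <code>tie</code>-compatible storage can be swapped in the same way.</p>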

<h2>Conclusion</h2>

<p>Despite this being a <em>toy spider</em>, I believe it is good enough to solve 80% of web crawling/scraping problems. The remaining 20% would require much more code, tests and infrastructure (A.K.A. the <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>).
Please, don't reinvent the wheel, check out the <a href="http://commoncrawl.org/">CommonCrawl</a> project first!
And keep checking <a href="http://blogs.perl.org/users/stas/">my Perl blog</a> for more on that 80% focused web-crawling ;)</p>

<h2>Acknowledgements</h2>

<ul>
<li>This post was inspired by the article <a href="http://blog.hartleybrody.com/web-scraping/">I Don't Need No Stinking API: Web Scraping For Fun and Profit</a>;</li>
<li><a href="http://blogs.perl.org/users/joel_berger/">Joel Berger</a> and <a href="https://twitter.com/kraih">Sebastian Riedel</a> gave me invaluable tips on writing concise Mojolicious code recipes.</li>
</ul>

<h2>Continued</h2>

<p><a href="http://blogs.perl.org/users/stas/2013/02/web-scraping-with-modern-perl-part-2---speed-edition.html">Web Scraping with Modern Perl (Part 2 - Speed Edition)</a></p>
]]>
    </content>
</entry>

<entry>
    <title>Merry XS-mas!</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2012/12/merry-xs-mas.html" />
    <id>tag:blogs.perl.org,2012:/users/stas//1361.4151</id>

    <published>2012-12-24T12:34:10Z</published>
    <updated>2012-12-24T13:29:38Z</updated>

    <summary>Some time ago, I&apos;ve acknowledged the LWP::Protocol::Net::Curl existence here. Lots of things changed since then due to the feedback received, so thank you all, guys! Today, I am proud to announce the reach of the stable milestone with the version...</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="curl" label="curl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="libcurl" label="libcurl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="lwp" label="LWP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="milestone" label="milestone" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="xs" label="XS" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<p>Some time ago, I <a href="http://blogs.perl.org/users/stas/2012/11/libcurl-as-lwp-backend-or-all-your-protocol-are-belong-to-us.html">announced LWP::Protocol::Net::Curl</a> here. Lots of things <a href="https://metacpan.org/diff/file/?target=SYP/LWP-Protocol-Net-Curl-0.011/lib/LWP/Protocol/Net/Curl.pm&amp;source=SYP/LWP-Protocol-Net-Curl-0.007/lib/LWP/Protocol/Net/Curl.pm">changed since then</a> due to the feedback received, so thank you all, guys! Today, I am proud to announce that it has reached the <strong>stable</strong> milestone with version <em>0.011</em>.</p>
]]>
        <![CDATA[<p>Which doesn't actually mean that plugging <a href="http://curl.haxx.se/libcurl/">libcurl</a> in will make your LWP-based code harder, faster, better, stronger (at least I guarantee HTTPS over SOCKS5 is <em>less</em> hard than the traditional way). Your mileage may vary, albeit the general <a href="http://cpantesters.org/distro/L/LWP-Protocol-Net-Curl.html">CPAN Testers Reports</a> are fine.</p>

<p>After all, the purpose of this module boils down to "having one single XS dependency for handling multiple protocols and formats". So, give it a try and tell me what you think!</p>

<h2>One more thing</h2>

<p>Does anyone know how to reach <a href="https://metacpan.org/author/SPARKY">Przemysław Iskra</a>, the author of Net::Curl? I've <a href="https://github.com/sparky/perl-Net-Curl/pull/1">discovered, reported and provided a patch</a> for one extremely slow memory leak, the source of the infamous <em>"Attempt to free unreferenced scalar: SV 0xdeadbeef during global destruction."</em> warning. It would be very nice to see it upstreamed to the CPAN <code>:)</code></p>
]]>
    </content>
</entry>

<entry>
    <title>CPAN module recommendation system</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2012/12/cpan-module-recommendation-system.html" />
    <id>tag:blogs.perl.org,2012:/users/stas//1361.4121</id>

    <published>2012-12-11T16:45:39Z</published>
    <updated>2012-12-11T16:51:05Z</updated>

    <summary>A little confession/reasoning/backstory: I love CPAN surfing. You know, watching the latest releases, browsing module dependencies and other modules by the same authors... And favoriting the interesting stuff. Stats show that I&apos;m not supposed to be the only one. If...</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="collaborativefiltering" label="collaborative filtering" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="cpan" label="CPAN" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="experiment" label="experiment" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="hivemind" label="hive mind" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="metacpan" label="MetaCPAN" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="recommendation" label="recommendation" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<p>A little confession/reasoning/backstory: I love <em>CPAN surfing</em>. You know, watching the latest releases, browsing module dependencies and other modules by the same authors... And favoriting the interesting stuff. <a href="https://metacpan.org/favorite/leaderboard">Stats show</a> that I'm probably not the only one. If so, wouldn't it be nice to provide a crowdsourced recommendation for CPAN modules? Think: <em>"People who favorite Mojolicious often favorite: AnyEvent, Data::Printer, Devel::NYTProf, Dist::Zilla..."</em>. Plus, given the user's favorites, own releases &amp; own release dependencies, a custom-tailored module suggestion list could be built for any PAUSE ID registered on <a href="https://metacpan.org/">MetaCPAN</a>. Enter the <a href="http://cpan-u.sysd.org/">CPAN::U experiment</a>.</p>
]]>
        <![CDATA[<p>It uses <a href="http://www.computer.org/csdl/mags/ic/2003/01/w1076-abs.html">Amazon's item-to-item algorithm</a> to group CPAN modules with the greatest <a href="https://coderwall.com/p/284hja">cosine similarity</a> between the feature vectors of their "likers".</p>
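
<p>To give a feel for the similarity measure, here is a minimal sketch that treats every module as a binary vector of its "likers". The sample data and the <code>cosine</code> helper are made up for illustration; the actual fetch/process script may weigh things differently.</p>

<pre><code>#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

# Made-up sample data: module =&gt; set of users who favorited it
my %likers = (
    'Mojolicious' =&gt; { map { $_ =&gt; 1 } qw(alice bob carol dave) },
    'AnyEvent'    =&gt; { map { $_ =&gt; 1 } qw(alice bob erin) },
    'Dist::Zilla' =&gt; { map { $_ =&gt; 1 } qw(frank) },
);

# Cosine similarity of two binary vectors:
# (number of common likers) / sqrt(|X| * |Y|)
sub cosine {
    my ($x, $y) = @_;
    return 0 unless keys %$x and keys %$y;
    my $common = grep { exists $y-&gt;{$_} } keys %$x;
    return $common / sqrt(keys(%$x) * keys(%$y));
}

# Rank every other module against the target
my $target = 'Mojolicious';
for my $other (sort grep { $_ ne $target } keys %likers) {
    printf "%-12s %.3f\n", $other, cosine($likers{$target}, $likers{$other});
}
</code></pre>

<p>Modules sharing no likers at all score 0, while two modules favorited by exactly the same people score 1.</p>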

<p><code>tl;dr</code>: a poor man's <a href="https://en.wikipedia.org/wiki/Netflix_Prize">Cinematch</a>.</p>

<p>Not a big deal, but it addresses, at least partially, an issue raised in the recent <a href="http://blogs.perl.org/users/steven_haryanto/2012/11/categorizing-cpan-modules.html">Categorizing CPAN modules</a> post (which was the actual inspiration for wrapping up a public release of some quick &amp; dirty code written <strong>one day prior</strong> to that post's publication). And it is fun to explore, after all!</p>

<p>The next logical step is to tweak <a href="https://github.com/creaktive/metacpan-web">my fork of metacpan-web</a> to incorporate the recommendation API. But first, the API needs to be tested. That's the main purpose of the <a href="http://cpan-u.sysd.org/">CPAN::U experiment</a>: to be a crash test dummy for the further "return to the source". And this is why I'm kindly asking for your help. There are too many unanswered questions (which can be answered directly on the <a href="http://cpan-u.sysd.org/#disqus_thread">project's landing page</a>):</p>

<ul>
<li>Query your PAUSE ID and/or a few not-so-ubiquitous modules you know. Rate the results, from 0 for "complete nonsense" to 5 for "the module I was long searching for".</li>
<li>Is it slow? Does it crash? Does it work at all in your browser? (I suck at frontend, sorry 'bout the Bootstrap thingie)</li>
<li>Are you aware of any collaborative filtering algorithms more appropriate for this task?</li>
<li>Could it be implemented as a part of the MetaCPAN API, at all? (currently, there is a <a href="https://github.com/creaktive/CPAN-U/blob/master/_attachments/bin/metacpan-recommendation.pl">Perl fetch/process script</a> which populates a CouchDB database which is queried via Ajax)</li>
</ul>

<p>As always, pull requests are welcome!</p>
]]>
    </content>
</entry>

<entry>
    <title>TMTOWTDI, plus benchmarking</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2012/12/tmtowtdi-plus-benchmarking.html" />
    <id>tag:blogs.perl.org,2012:/users/stas//1361.4100</id>

    <published>2012-12-04T16:46:36Z</published>
    <updated>2012-12-16T14:02:10Z</updated>

    <summary>There&apos;s a very common Perl idiom for getting &quot;top N elements&quot; from an array: @top = (sort @a) [0 .. $n - 1]. Mostly, it&apos;s good enough for anything one would dare to store in RAM. Then, there is Sort::Key::Top,...</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="algorithm" label="algorithm" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="array" label="array" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="benchmark" label="benchmark" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="perl" label="Perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="selectionalgorithm" label="selection algorithm" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="sortkeytop" label="Sort::Key::Top" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="tmtowtdi" label="TMTOWTDI" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="topelements" label="top elements" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<p>There's a very common Perl idiom for getting "top N elements" from an array: <code>@top = (sort @a) [0 .. $n - 1]</code>. Mostly, it's good enough for anything one would dare to store in RAM.</p>

<p>Then, there is <a href="https://metacpan.org/module/Sort::Key::Top">Sort::Key::Top</a>, which allows you to write <code>@top = top $n =&gt; @a</code>. Just more syntactic sugar?</p>

<p>Not even close! While the docs don't state it boldly, it is:</p>

<ol>
<li>an XS-based module</li>
<li>an implementation of the <a href="http://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_general_selection_algorithm">partition-based general selection algorithm</a> (also known as <em>quickselect</em>)</li>
</ol>
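
<p>For those who haven't met quickselect before, here is a rough pure-Perl sketch of the idea. It is <strong>not</strong> how Sort::Key::Top is implemented (that one is XS); it is a simplified, not-in-place variant with a three-way partition, and <code>top_n</code> is a name I made up. The point is that after partitioning around a pivot, only one side needs further work, so most of the array is never fully sorted.</p>

<pre><code>#!/usr/bin/env perl
use strict;
use warnings;

# return the $n largest numbers of @$list, in no particular order
sub top_n {
    my ($n, $list) = @_;
    return ()     if $n &lt;= 0;
    return @$list if $n &gt;= @$list;

    # partition around a random pivot
    my $pivot = $list-&gt;[ int rand @$list ];
    my (@greater, @equal, @less);
    for my $x (@$list) {
        if    ($x &gt; $pivot) { push @greater, $x }
        elsif ($x &lt; $pivot) { push @less,    $x }
        else                { push @equal,   $x }
    }

    # enough larger elements: the answer is entirely among them
    return top_n($n, \@greater) if $n &lt;= @greater;

    # otherwise keep all larger ones and fill up with pivot-equal ones...
    my @top = (@greater, @equal);
    return @top[0 .. $n - 1] if $n &lt;= @top;

    # ...and take whatever is still missing from the smaller ones
    return (@top, top_n($n - @top, \@less));
}

my @numbers = (5, 3, 9, 1, 7, 9, 2);
print join(' ', sort { $b &lt;=&gt; $a } top_n(3, \@numbers)), "\n";
</code></pre>

<p>The final <code>sort</code> only touches the three selected elements, which is where the savings over sorting the whole array come from.</p>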

<p>So, expect it to be fast. How fast?</p>
]]>
        <![CDATA[<p>Here are the results of benchmarking the selection of the 10 longest words from the system dictionary (235886 words in total):</p>

<pre><code>               Rate   quicksort    pureperl quickselect
quicksort   0.329/s          --        -75%        -89%
pureperl     1.30/s        295%          --        -57%
quickselect  3.04/s        825%        134%          --
</code></pre>

<p>Note that the <em>quicksort</em> is there only to verify this claim from the <a href="http://en.wikipedia.org/wiki/Selection_algorithm">Wikipedia article</a>:</p>

<blockquote>
  <p>However, if done properly, a Java implementation is typically a magnitude (10x) faster than the quicksort algorithm.</p>
</blockquote>

<p>Yup, it seems so.
BTW, the 10 longest words are:</p>

<pre><code>"scientificophilosophical"
"tetraiodophenolphthalein"
"formaldehydesulphoxylate"
"thyroparathyroidectomize"
"pathologicopsychological"
"formaldehydesulphoxylic"
"hematospectrophotometer"
"thymolsulphonephthalein"
"phenolsulphonephthalein"
"epididymodeferentectomy"
</code></pre>

<h2>Benchmark code:</h2>

<pre><code>#!/usr/bin/env perl
use 5.010000;
use autodie;
use strict;
use warnings qw(all);

use Carp qw(croak);
use Benchmark qw(cmpthese);

use Sort::Key::Top qw(rnkeytopsort);

my %words;
open my $fh, q(&lt;), q(/usr/share/dict/words);
while (&lt;$fh&gt;) {
    chomp;
    $words{$_} = length;
}
close $fh;

say q(words in hash: ) . scalar keys %words;

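my $top_n = 10;
my $code = {
    # Sort::Key::Top: top N keys by numeric key, descending, returned sorted
    quickselect =&gt; sub { rnkeytopsort { $words{$_} } $top_n =&gt; keys %words },
    # full sort, forcing the quicksort implementation via the sort pragma
    quicksort =&gt; sub {
        use sort qw(_quicksort stable);
        (
            sort { $words{$b} &lt;=&gt; $words{$a} }
            keys %words
        ) [0 .. $top_n - 1];
    },
    # full sort with Perl's default sort implementation, then slice
    pureperl =&gt; sub {
        (
            sort { $words{$b} &lt;=&gt; $words{$a} }
            keys %words
        ) [0 .. $top_n - 1];
    },
};
# sanity check: both approaches must return the same words in the same order
croak qq(something went VERY wrong)
    unless [$code-&gt;{quickselect}-&gt;()] ~~ [$code-&gt;{pureperl}-&gt;()];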

cmpthese(100 =&gt; $code);
</code></pre>
]]>
    </content>
</entry>

<entry>
    <title>Google Refine + Perl</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2012/11/google-refine-perl.html" />
    <id>tag:blogs.perl.org,2012:/users/stas//1361.4069</id>

    <published>2012-11-26T17:18:26Z</published>
    <updated>2012-12-16T14:06:44Z</updated>

    <summary> (repost from http://sysd.org/google-refine-perl-english/; it&apos;s more contextual here) Google Refine is awesome. If you&apos;re unaware of what it is, access their official page and watch at least the first screencast. You&apos;ll see it can be helpful for several ETL-related tasks. Currently,...</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="api" label="API" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="geodna" label="GeoDNA" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="googlerefine" label="Google Refine" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="hack" label="hack" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="integration" label="integration" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="interface" label="interface" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="mojolicious" label="Mojolicious" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="perl" label="Perl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="rest" label="REST" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<p>
(repost from <a href="http://sysd.org/google-refine-perl-english/">http://sysd.org/google-refine-perl-english/</a>; it's more contextual here)
</p>

<p>
<a href="https://code.google.com/p/google-refine/">Google Refine</a> is awesome. If you're unaware of what it is, access their official page and watch at least the first screencast. You'll see it can be helpful for several <a href="https://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a>-related tasks.
</p>

<p>
Currently, I use it a lot, especially for simple (but boring) tasks, like loading a CSV, trimming out some outliers and saving as JSON to be imported into <a href="http://www.mongodb.org/">MongoDB</a>. Nothing a Perl one-liner couldn't do.
</p>

<p>
However, the reverse doesn't hold: Perl one-liners are a lot more flexible than Google Refine. Now, what if we could combine both?
</p>
]]>
        <![CDATA[<ol>
    <li>Google Refine <a href="https://code.google.com/p/google-refine/wiki/FetchingURLsFromWebServices">can be easily integrated with any RESTful API</a>.</li>
    <li>Perl can turn one-liners into <a href="http://mojolicio.us/perldoc/ojo">RESTful web services</a>.</li>
    <li>PROFIT!!!</li>
</ol>

<p>
As a practical example, I'll use some georeferenced data I was working on. Let's suppose I have to deduplicate records, and one of the "duplicate" rules is their proximity on the map. Google Refine is far from a full-featured <a href="https://en.wikipedia.org/wiki/GIS">GIS</a>, and is unable to handle a two-dimensional coordinate system. Enter <a href="http://www.geodna.org/docs/google-maps.html">GeoDNA</a>: an algorithm that reduces a coordinate pair to a single dimension. As its FAQ says,
</p>

<blockquote>GeoDNA is a way to represent a latitude/longitude coordinate pair as a string. That sounds simple enough, but it's a special string format: the longer it is, the more accurate it is. More importantly, each string uniquely defines a region of the earth's surface, so in general, GeoDNA codes with similar prefixes are located near each other. This can be used to perform proximity searching <strong>using only string comparisons</strong> (like the SQL "LIKE" operator).</blockquote>

<p>
Another interesting property of GeoDNA is that when sorting a set of records by their GeoDNA code, close locations are likely to appear in adjacent rows (sometimes, close locations will have very different prefixes, but similar prefixes <strong>always</strong> represent close locations).
</p>
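
<p>
Outside of Google Refine, the same prefix trick can be sketched in a few lines of Perl. This is only an illustration under my own assumptions: the codes below are made-up placeholder strings rather than real Geo::DNA output, and <code>group_by_prefix</code> is a hypothetical helper. Records whose codes share the first <code>$len</code> characters end up in the same bucket and become deduplication candidates.
</p>

<pre><code>#!/usr/bin/env perl
use strict;
use warnings;

# made-up placeholder codes, NOT real Geo::DNA output
my %geocode = (
    'Store A' =&gt; 'wtcagcatgtacgg',
    'Store B' =&gt; 'wtcagcatgtacct',
    'Store C' =&gt; 'wtcagcatcggtaa',
    'Store D' =&gt; 'etgaccgtaacgtt',
);

# group record names by the first $len characters of their code
sub group_by_prefix {
    my ($len, $codes) = @_;
    my %group;
    for my $name (sort keys %$codes) {
        push @{ $group{ substr $codes-&gt;{$name}, 0, $len } }, $name;
    }
    return \%group;
}

# buckets with more than one record are candidates for deduplication
my $groups = group_by_prefix(12, \%geocode);
for my $prefix (sort keys %$groups) {
    my @names = @{ $groups-&gt;{$prefix} };
    print "$prefix: @names\n" if @names &gt; 1;
}
</code></pre>

<p>
A shorter prefix length widens the area covered by each bucket. Keep in mind the caveat above: two close locations may still land in different buckets.
</p>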

<p>
To incorporate GeoDNA into Google Refine, we'll use the <em>Add column by fetching URLs</em> option, clicking on the header of any column (it doesn't matter which one, as we'll use two of them anyway):
</p>

<p><a href="http://blogs.perl.org/users/stas/assets_c/2012/11/Screen-Shot-2012-04-30-at-6.58.58-PM-1043.html" onclick="window.open('http://blogs.perl.org/users/stas/assets_c/2012/11/Screen-Shot-2012-04-30-at-6.58.58-PM-1043.html','popup','width=500,height=250,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blogs.perl.org/users/stas/assets_c/2012/11/Screen-Shot-2012-04-30-at-6.58.58-PM-thumb-500x250-1043.png" width="500" height="250" alt="Edit column > Add column by fetching URLs..." title="Edit column > Add column by fetching URLs..." class="mt-image-center" style="text-align: center; display: block; margin: 0 auto 20px;" /></a></p>

<p>
As the expression, we'll paste the following code (make sure the latitude/longitude column names match your data):
</p>

<pre><code>'http://127.0.0.1:3000/?lat='+
row.cells['latitude'].value
+'&amp;lon='+
row.cells['longitude'].value
</code></pre>

<p>
The throttle delay can be set to zero, as our web service is local. The final configuration should look like this (don't push the OK button yet):
</p>

<p><a href="http://blogs.perl.org/users/stas/assets_c/2012/11/cities-Google-Refine-1042.html" onclick="window.open('http://blogs.perl.org/users/stas/assets_c/2012/11/cities-Google-Refine-1042.html','popup','width=702,height=541,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blogs.perl.org/users/stas/assets_c/2012/11/cities-Google-Refine-thumb-500x385-1042.png" width="500" height="385" alt="Add column by fetching URLs... dialog" title="Add column by fetching URLs... dialog" class="mt-image-center" style="text-align: center; display: block; margin: 0 auto 20px;" /></a></p>

<p>
Now, check that you have the <a href="https://metacpan.org/module/Mojolicious">Mojolicious</a> and <a href="https://metacpan.org/module/Geo::DNA">Geo::DNA</a> Perl modules (install them via CPAN, if not) and paste into your terminal:
</p>

<pre><code>perl -MGeo::DNA -Mojo -E 'a("/"=&gt;sub{my$s=shift;$s-&gt;render(json=&gt;{geocode=&gt;Geo::DNA::encode_geo_dna($s-&gt;param("lat"),$s-&gt;param("lon"))})})-&gt;start' daemon
</code></pre>

<p>
If you prefer a "human-readable" version, paste the following code into <code>geocode-webservice.pl</code>:
</p>

<pre><code>#!/usr/bin/env perl
use Geo::DNA qw(encode_geo_dna);
use Mojolicious::Lite;
any '/' =&gt; sub {
    my $self = shift;
    $self-&gt;render(json =&gt; {
        geocode =&gt; encode_geo_dna(
            $self-&gt;param('lat'),
            $self-&gt;param('lon'),
        ),
    });
};
app-&gt;start;
</code></pre>

<p>
Once you start the web service, it will report <em>Server available at http://127.0.0.1:3000</em>. Now, click OK in the Google Refine dialog and wait. Even with no throttle delay, it can be a bit slow; still, this hack saved me a lot of time ;)
</p>
]]>
    </content>
</entry>

<entry>
    <title>libcurl as LWP backend (or &quot;all your protocol are belong to us&quot;)</title>
    <link rel="alternate" type="text/html" href="http://blogs.perl.org/users/stas/2012/11/libcurl-as-lwp-backend-or-all-your-protocol-are-belong-to-us.html" />
    <id>tag:blogs.perl.org,2012:/users/stas//1361.4051</id>

    <published>2012-11-16T16:12:55Z</published>
    <updated>2012-12-16T14:09:09Z</updated>

    <summary>Suppose you are planning to scrape a few thousand pages using WWW::Mechanize. Over HTTPS. Via SOCKS5 tunnel. On an aged CentOS box (think Perl v5.8). With no root privileges. Bonus points if it uses HTTP compression. Better prepare for...</summary>
    <author>
        <name>stas</name>
        <uri>http://github.com/creaktive</uri>
    </author>
    
    <category term="cpan" label="CPAN" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="curl" label="curl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dropin" label="dropin" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="glue" label="glue" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="libcurl" label="libcurl" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="lwp" label="LWP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="mechanize" label="Mechanize" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="scraper" label="Scraper" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blogs.perl.org/users/stas/">
        <![CDATA[<p>Suppose you are planning to scrape a few thousand pages using WWW::Mechanize.</p>

<p>Over HTTPS.
Via SOCKS5 tunnel.
On an aged CentOS box (think Perl v5.8).
With no root privileges.
Bonus points if it uses HTTP compression.
Better prepare for some serious yak shaving.</p>

<p>If only WWW::Mechanize were written on top of <a href="http://curl.haxx.se/">libcurl</a>, instead of LWP::UserAgent!
(spoiler: I doubt it could ever happen; libcurl is all about <em>manipulexity</em>; <em>whipuptitude</em> is beyond its scope)
How cool would it be to support all <strong>those</strong> features <em>out of the box</em>?</p>

<pre><code>$ curl -V
curl 7.28.0 (x86_64-apple-darwin12.2.0) libcurl/7.28.0 OpenSSL/1.0.1c zlib/1.2.7 c-ares/1.7.5 libidn/1.25 libssh2/1.2.7
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP
</code></pre>

<p>Now, what about this?</p>

<pre><code>$ PERL5OPT=-MLWP::Protocol::Net::Curl=verbose,1 mech-dump https://google.com
</code></pre>
]]>
        <![CDATA[<p>Or, in your script:</p>

<pre><code>#!/usr/bin/env perl
use common::sense;
use LWP::Protocol::Net::Curl;
use WWW::Mechanize;
...
</code></pre>

<p>You could even use Perl as a glue between libcurl and libxml:</p>

<pre><code>#!/usr/bin/env perl
use common::sense;
use Data::Printer;
use LWP::Protocol::Net::Curl encoding =&gt; ''; # enables Content-Encoding: deflate, gzip
use Web::Scraper::LibXML;
my $scraper = scraper {
    process "a[href]", "urls[]" =&gt; '@href';
    result 'urls';
};
my $links = $scraper-&gt;scrape(URI-&gt;new('http://www.cpan.org/'));
p $links;
</code></pre>

<p><a href="https://metacpan.org/module/LWP::Protocol::Net::Curl">LWP::Protocol::Net::Curl</a> is a work in progress, but how complete is it?</p>

<ul>
<li>Passes <code>libwww-perl-6.04/t/</code>;</li>
<li>Passes <code>WWW-Mechanize-1.72/t/</code> (minor caveats);</li>
<li><code>PERL5OPT=-MLWP::Protocol::Net::Curl lwp-(download|dump|mirror|request)</code> work;</li>
<li>Compatible with <a href="https://metacpan.org/module/Net::Google::DataAPI">Net::Google::DataAPI</a> (achievement unlocked ;);</li>
<li>Smoke tested on a bunch of our own crawlers.</li>
</ul>

<p>Unfortunately, no <a href="http://www.cpantesters.org/distro/L/LWP-Protocol-Net-Curl.html">CPAN Testers Reports</a> are available for the latest release, which fixed a major bug in <code>:content_file</code> handling. Many other bugs may still lurk around, so keep an eye on the project's <a href="https://github.com/creaktive/LWP-Protocol-Net-Curl">GitHub repo</a>!</p>
]]>
    </content>
</entry>

</feed>
