libcurl as LWP backend (or "all your protocol are belong to us")

Suppose you are planning to scrape a few thousand pages using WWW::Mechanize.

Over HTTPS. Via SOCKS5 tunnel. On an aged CentOS box (think Perl v5.8). With no root privileges. Bonus points if it uses HTTP compression. Better prepare for some serious yak shaving.

If only WWW::Mechanize were written on top of libcurl instead of LWP::UserAgent! (Spoiler: I doubt it could ever happen; libcurl is all about manipulexity, and whipuptitude is beyond its scope.) How cool would it be to support all those features out of the box?

$ curl -V
curl 7.28.0 (x86_64-apple-darwin12.2.0) libcurl/7.28.0 OpenSSL/1.0.1c zlib/1.2.7 c-ares/1.7.5 libidn/1.25 libssh2/1.2.7
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP

Now, what about this?

$ PERL5OPT=-MLWP::Protocol::Net::Curl=verbose,1 mech-dump https://google.com

Or, in your script:

#!/usr/bin/env perl
use common::sense;
use LWP::Protocol::Net::Curl;
use WWW::Mechanize;
...
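
Both forms hand the same import list to the module. Per perlrun, -MModule=verbose,1 is shorthand for "use Module split(/,/, q{verbose,1})", so the one-liner above amounts to "use LWP::Protocol::Net::Curl verbose => 1;". Here is a minimal sketch of that expansion. Demo::Options is a hypothetical stand-in package, not part of any distribution, used only so the snippet runs without libcurl installed:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for LWP::Protocol::Net::Curl: it only records
# the key/value list that import() receives, so we can inspect it.
package Demo::Options;
our %seen;
sub import {
    my ($class, %args) = @_;
    %seen = %args;
    return;
}

package main;

# What perl does for -MDemo::Options=verbose,1 (see perlrun, -M switch):
# the text after "=" is split on commas and the list is passed to import().
Demo::Options->import(split /,/, q{verbose,1});

printf "%s => %s\n", $_, $Demo::Options::seen{$_}
    for sort keys %Demo::Options::seen;
```

One consequence of the comma splitting: split drops trailing empty fields, so an empty-string value cannot be passed through -M this way and is easier to set in a use line inside the script.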

You could even use Perl as glue between libcurl and libxml:

#!/usr/bin/env perl
use common::sense;
use Data::Printer;
use LWP::Protocol::Net::Curl encoding => ''; # empty string: accept any encoding libcurl supports (deflate, gzip)
use URI;
use Web::Scraper::LibXML;
my $scraper = scraper {
    process "a[href]", "urls[]" => '@href';
    result 'urls';
};
my $links = $scraper->scrape(URI->new('http://www.cpan.org/'));
p $links;

LWP::Protocol::Net::Curl is a work in progress, but how complete is it?

  • Passes libwww-perl-6.04/t/;
  • Passes WWW-Mechanize-1.72/t/ (with minor caveats);
  • lwp-download, lwp-dump, lwp-mirror and lwp-request all work with PERL5OPT=-MLWP::Protocol::Net::Curl;
  • Compatible with Net::Google::DataAPI (achievement unlocked ;)
  • Smoke-tested on a bunch of our own crawlers.

Unfortunately, no CPAN Testers reports are available yet for the latest release, which fixed a major bug in :content_file handling. Many other bugs may still be lurking, so keep an eye on the project's GitHub repo!

3 Comments

Can I ask a simple question here? How do I do the equivalent of curl -3 (--sslv3) or -1 (--tlsv1) with the LWP::Protocol::Net::Curl? One of my internet banking sites times out if contacted using just 'curl', but responds if using 'curl -3' or 'curl -1'.

Thanks, that works :)



About stas

Just another lazy, impatient and arrogant IT guy.