HTTP_PROXY
environment variable in order to access different hosts (for instance, "yay proxy" for the external hosts and "nay proxy" for the internal ones). Does that sound familiar? If so, a pure-Perl tool I've written might help you :)
The problem
Corporate proxies are meant to steer GUI browser users via Proxy Auto-Config. In a nutshell, a browser like Internet Explorer downloads a special routing file from a virtual host served by the proxy itself. This file consists of JavaScript code that usually contains a humongous if/else if/else clause mapping the requested hostname to the address of the proxy host capable of contacting it.
Now, CLI clients don't usually implement JavaScript, and therefore cannot decide by themselves which proxy to use.
Enter dePAC. It uses a portable, lightweight (albeit limited) JavaScript engine to parse the PAC file. Then it creates a relay proxy that forwards requests along the routes assigned by the PAC logic.
depac is usually started at the beginning of a login session, and through the use of environment variables its relay proxy can be located and automatically used by every user agent that supports the HTTP_PROXY variable.
(This technique is somewhat similar to what ssh-agent does. In fact, half of the previous paragraph was stolen from the ssh-agent manual page :)
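To illustrate the idea, here is a minimal sketch (not part of dePAC itself) of a Perl client that honors those environment variables via LWP::UserAgent's env_proxy; once eval $(depac) has run in the session, such a client goes through the relay transparently (the URL below is just a placeholder):

#!/usr/bin/env perl
# Minimal sketch: any client honoring http_proxy/https_proxy picks up
# the dePAC relay automatically; here, LWP::UserAgent via env_proxy.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->env_proxy;    # read http_proxy/https_proxy/no_proxy from %ENV

my $res = $ua->get('https://example.com/');
print $res->status_line, "\n";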
The biggest advantage of depac over similar solutions like pac4cli is that it does not require a system-wide installation. Both the JavaScript engine and the relay proxy are implemented in pure Perl and have no dependencies beyond Perl v5.10 itself (which is omnipresent anyway).
$ curl -o ~/bin/depac https://raw.githubusercontent.com/creaktive/dePAC/master/depac
$ chmod +x ~/bin/depac
$ echo 'eval $(~/bin/depac)' >> ~/.profile
Or you can use wget
and call it with perl
(feel free to mix):
$ wget -O ~/depac https://raw.githubusercontent.com/creaktive/dePAC/master/depac
$ echo 'eval $(perl ~/depac)' >> ~/.profile
You can also use depac
in an ad-hoc fashion, without a shared instance running
in the background:
$ depac -- wget -r -np https://something.com
Kiev reminded me of the place of my birth, Novosibirsk.
The huge Dnipro river, the Soviet architecture, the dishes...
OK, it's not the same, but, you know, after all, Ukraine is more similar to Russia than Brazil or the Netherlands are ;)
Also, I've had my thesis disproved: the distance between Ukrainian & Russian is bigger than between Spanish & Portuguese. And, as a bonus, it seems that my wife's 2 years of learning Russian have enabled her to handle simple conversations with the local people :D
I don't usually interact with people.
Not because I don't enjoy talking to people, but rather because I'm an introvert.
Well, there is a CPAN module for that!
Actually, not yet, but the Hallway++ concept by Matt S. Trout comes quite close.
I've had many enlightening talks with many interesting people, despite the fact that I suck at starting conversations :D
Also, we of the data-munging crowd usually suck at design. At least, at graphic design. Let him among you that never made a crappy logo/favicon cast the first stone. A skilled designer can make a huge difference for our awesome projects. Now, you too can make a difference by saving a designer's life!
- tie()ing scalars/hashrefs/arrayrefs: the Tie::Trace module wraps this debugging feature in a user-friendly way (see the sketch right after this list);
- grep/sed/awk/join-like Unix commands for structured XML data;
:)
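A minimal sketch of the Tie::Trace idea (the exact interface may differ slightly; check the module's POD):

#!/usr/bin/env perl
# Watch a hash: every store operation is reported, which is handy
# for tracking down "who keeps clobbering this value?" bugs.
use strict;
use warnings;
use Tie::Trace qw(watch);

my %config;
watch %config;

$config{retries} = 3;      # warns something like: %config => {retries} => 3 at ...
$config{retries} = 'oops'; # ...and here is the culprit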
\o/
BTW, all the slides (even mine) can be accessed from here: http://act.yapc.eu/ye2013/slides
:)
Grab the gist with the complete, working source code. Benchmark it against the one featured in the previous article:
$ \time perl mojo-crawler.pl
23.08user 0.07system 2:11.99elapsed 17%CPU (0avgtext+0avgdata 84064maxresident)k
16inputs+0outputs (0major+6356minor)pagefaults 0swaps
$ \time perl yada-crawler.pl
8.83user 0.30system 0:12.60elapsed 72%CPU (0avgtext+0avgdata 131008maxresident)k
0inputs+0outputs (0major+8607minor)pagefaults 0swaps
How can it be 10x faster while consuming less than half the CPU resources?!
Perl as a glue
Sorry, I cheated a bit on the mojo-crawler.pl benchmark results.
It implicitly uses EV, a high-performance, full-featured event loop library, whenever it is present.
And EV is not required for Mojolicious to work properly.
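You can check which reactor backend Mojo::IOLoop picked on your system (a quick sketch, independent of the crawler code):

#!/usr/bin/env perl
# Prints Mojo::Reactor::EV when EV is installed, Mojo::Reactor::Poll otherwise
use strict;
use warnings;
use Mojo::IOLoop;

print ref(Mojo::IOLoop->singleton->reactor), "\n";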
Let's disable it:
$ MOJO_REACTOR=Mojo::Reactor::Poll time perl mojo-crawler.pl
113.99user 13.37system 2:08.46elapsed 99%CPU (0avgtext+0avgdata 83808maxresident)k
2912inputs+0outputs (18major+5789minor)pagefaults 0swaps
The elapsed time is the same with or without EV, but now the pure-Perl crawler hogs the CPU!
Why? EV provides an interface to libev, which clearly does a better connection polling job than the 100% interpreted code. The bridge between Perl and the compiled library is called XS:
XS is an interface description file format used to create an extension interface between Perl and C code (or a C library) which one wishes to use with Perl.
Actually, CPAN is full of high-performance XS-based modules for such tasks: EV wraps the libev event loop, Net::Curl (used by YADA, a.k.a. AnyEvent::Net::Curl::Queued) wraps libcurl, and HTML::TreeBuilder::LibXML (used by Web::Scraper::LibXML) is backed by libxml2.
Thus, an efficient and fast web crawler/scraper could be constructed with those "bare-metal" building blocks ;)
#!/usr/bin/env perl
use 5.016;
use common::sense;
use utf8::all;
# Use fast binary libraries
use EV;
use Web::Scraper::LibXML;
use YADA 0.039;
YADA->new(
    common_opts => {
        # Available opts @ http://curl.haxx.se/libcurl/c/curl_easy_setopt.html
        encoding        => '',
        followlocation  => 1,
        maxredirs       => 5,
    },
    http_response => 1,
    max => 4,
)->append([qw[
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
]] => sub {
    my ($self) = @_;
    return if $self->has_error
        or not $self->response->is_success
        or not $self->response->content_is_html;

    # Declare the scraper once and then reuse it
    state $scraper = scraper {
        process q(html title), title => q(text);
        process q(a), q(links[]) => q(@href);
    };

    # Employ amazing Perl (en|de)coding powers to handle HTML charsets
    my $doc = $scraper->scrape(
        $self->response->decoded_content,
        $self->final_url,
    );
    printf qq(%-64s %s\n), $self->final_url, $doc->{title};

    # Enqueue links from the parsed page
    $self->queue->prepend([
        grep {
            $_->can(q(host)) and $_->scheme =~ m{^https?$}x
                and $_->host eq $self->initial_url->host
                and (grep { length } $_->path_segments) <= 3
        } @{$doc->{links} // []}
    ] => __SUB__);
})->wait;
The example above has half the lines of code of the previous one. This comes at the cost of installing a bunch of external dependencies from CPAN:
$ cpanm AnyEvent::Net::Curl::Queued EV HTML::TreeBuilder::LibXML Web::Scraper utf8::all
Despite the use 5.016
pragma, this code works fine on Perl 5.10 if you get rid of the __SUB__
reference.
So, which approach is the better one? Obviously, it depends. There is no silver bullet: web crawling is ultimately I/O-bound! However, specialized and well-tested libraries guarantee that it actually stays I/O-bound.
For instance, trimming the ::LibXML part from the use Web::Scraper::LibXML statement considerably slows down our tiny crawler, because HTML parsing will then consume more CPU cycles than the connection polling.
As an edge case, let's see how the venerable GNU Wget tool (see also yada, which comes bundled with the AE::N::C::Queued distribution) behaves:
$ "time" wget -r --follow-tags a http://sysd.org/
0.23user 0.41system 1:10.20elapsed 0%CPU (0avgtext+0avgdata 23920maxresident)k
0inputs+40704outputs (0major+4323minor)pagefaults 0swaps
Despite its clear disadvantage of using a single connection, it is almost completely I/O-bound, since its URL extraction code doesn't require complete parsing of the HTML.
So, once upon a time I had a crazy idea: to put an almost complete resource meter into the tmux status bar. You know, the clock is so boring. Let's add a battery indicator there. And the load numbers. And the memory usage...
Needless to say, this resulted in an unbearable user experience:
a:2.96G c:4.37G f:5.41G i:2.98G l:0.65/1.73/1.41 23:47
Actually, the data is OK; the "gauges" work fine on every Unix I tested them on. If only it was a bit fancier...
Then I discovered Battery. And then, Spark. I just couldn't help myself, so I revamped my messy Perl usage-data parser to output this gorgeous ANSI-art scrolling chart:
It was tested on Mac OS X 10.8.2, Ubuntu 12.04, Ubuntu 11.10, Debian 6.0.6 and works fine with the default system Perl; there are no external dependencies at all.
Liked it? Go ahead, grab your copy and follow the installation instructions: https://github.com/creaktive/rainbarf
Grab the gist with the complete, working source code.
I often hear the question: "so, you're Perl guy, could you show me how to make a web crawler/spider/scraper, then?" I hope this post series will become my ultimate answer :)
First of all, I compiled a small list of features that people expect of crawlers nowadays:
Let's call our project mojo-crawler.pl
. Here's how it begins:
#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);
use Mojo::UserAgent;
# FIFO queue
my @urls = map { Mojo::URL->new($_) } qw(
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
    http://sysd.org/page/4/
    http://sysd.org/page/5/
    http://sysd.org/page/6/
);
# Limit parallel connections to 4
my $max_conn = 4;
# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
    ->new(max_redirects => 5)
    ->detect_proxy;
# Keep track of active connections
my $active = 0;
Note that I'm using my very own server as a guinea pig. Consider that a testament to my confidence in the safety of the parallelization method I chose.
Here we keep a constant-sized pool of active connections, populating it with URLs from our FIFO queue. The anonymous sub {}
fires every time the event loop is idle (0-second timer):
Mojo::IOLoop->recurring(
    0 => sub {
        for ($active + 1 .. $max_conn) {
            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);
Now, start the event loop unless it is already started somewhere else. In this code, it won't be started anywhere else. But who knows how deep the copy & paste will bury it in the future?!
# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
Every completed download ends here. Even the failed ones.
Thus, when the download is complete, we decrease the $active
counter to free a connection slot:
sub get_callback {
    my (undef, $tx) = @_;

    # Deactivate
    --$active;

    # Parse only OK HTML responses
    return
        if not $tx->res->is_status_class(200)
        or $tx->res->headers->content_type !~ m{^text/html\b}ix;

    # Request URL
    my $url = $tx->req->url;
    say $url;
    parse_html($url, $tx);

    return;
}
# Not implemented yet!
sub parse_html { return }
Fear not, parse_html() is properly implemented further below!
Meanwhile, let's make sure this code actually does what it should, while it is still low on the line count:
$ perl mojo-crawler.pl
http://sysd.org/page/2/
http://sysd.org/page/4/
http://sysd.org/page/3/
http://sysd.org/page/5/
http://sysd.org/
http://sysd.org/page/6/
$
Fine, the downloads were completed in the order of the time it took to download each resource. Oh, and http://sysd.org/page/1 simply redirects to http://sysd.org/.
The most difficult part of making web crawlers isn't making them start; it's making them stop. Our complete parse_html() also takes care of feeding the URL queue with the URLs extracted from <a href="..."> links. Plus, it performs a few trivial checks on every extracted link:
- it must be a valid, absolute http/https URL;
- it must not point deeper than three path segments (/a/b/c);
- it must not have been visited before;
- it must belong to the same host as the current page.
And, to show we've been there, let's print the title of the page:
sub parse_html {
    my ($url, $tx) = @_;

    say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {
        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless grep { $link->protocol eq $_ } qw(http https);

        # Don't go deeper than /a/b/c
        next if @{$link->path->parts} > 3;

        # Access every link only once;
        # also mark the current page as visited (it may have been reached via a redirect)
        state $uniq = {};
        ++$uniq->{$url->to_string};
        next if ++$uniq->{$link->to_string} > 1;

        # Don't visit other hosts
        next if $link->host ne $url->host;

        push @urls, $link;
        say " -> $link";
    }
    say '';
    return;
}
This time, it will be a lot slower, as every internal link is followed and downloaded. The crawler will print the accessed URL, the title of the page, and the extracted non-visited links:
$ "time" perl mojo-crawler.pl
http://sysd.org/
sysd.org
-> http://sysd.org/tag/benchmark/
-> http://sysd.org/tag/command-line-interface/
-> http://sysd.org/tag/console/
-> http://sysd.org/tag/overhead/
-> http://sysd.org/tag/terminal/
-> http://sysd.org/tag/teste/
-> http://sysd.org/tag/tty/
-> http://sysd.org/tag/velocidade/
-> http://sysd.org/tag/browser/
-> http://sysd.org/tag/deprecation/
-> http://sysd.org/tag/ie/
-> http://sysd.org/tag/microsoft/
-> http://sysd.org/tag/navegador/
-> http://sysd.org/tag/webdesign/
-> http://sysd.org/tag/webdev/
-> http://sysd.org/tag/api/
-> http://sysd.org/tag/hack-2/
-> http://sysd.org/tag/integration/
-> http://sysd.org/tag/rest/
...
27.73user 0.88system 3:48.46elapsed 12%CPU (0avgtext+0avgdata 98272maxresident)k
0inputs+8outputs (0major+6749minor)pagefaults 0swaps
$
A very important final note: although this tiny crawler operates through the recursive traversal of links, it is implemented in an iterative way. Thus, it is very light on memory consumption. In fact, the only structure that hogs the RAM is the $uniq hashref; tie() it to some kind of persistent storage if that concerns you. The FIFO queue @urls could also grow a lot if the crawled site has dynamically-generated link lists (or even broken pagination), so not backing it with some kind of key/value database is a bit reckless.
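For instance, a minimal sketch of persisting the visited-links set with the DB_File module (commonly available with the system Perl); this assumes the state $uniq hashref is swapped for a tied package-level hash, and the file name is arbitrary:

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# The "visited" set now survives restarts and lives mostly on disk
tie my %uniq, 'DB_File', 'visited.db', O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "Cannot tie visited.db: $!";

# ...then use $uniq{$link->to_string}++ instead of the hashref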
Despite this being a toy spider, I believe it is good enough to solve 80% of web crawling/scraping problems. The remaining 20% would require much more code, tests and infrastructure (a.k.a. the Pareto principle). Please don't reinvent the wheel: check out the CommonCrawl project first! And keep checking my Perl blog for more on that 80%-focused web crawling ;)
After all, the purpose of this module boils down to "having one single XS dependency for handling multiple protocols and formats". So, give it a try and tell me what you think!
Does anyone know how to reach Przemysław Iskra, the author of Net::Curl? I've discovered, reported and provided a patch for one extremely slow memory leak, the source of the infamous "Attempt to free unreferenced scalar: SV 0xdeadbeef during global destruction." warning. It would be very nice to see it upstreamed to CPAN :)
tl;dr
: a poor man's Cinematch.
Not a big deal, but it addresses, at least partially, an issue raised in the recent Categorizing CPAN modules post (which was the actual inspiration for wrapping up a public release of some quick & dirty code written one day prior to that post's publication). And it is fun to explore, after all!
The next logical step is to tweak my fork of metacpan-web to incorporate the recommendation API. But first, the API needs to be tested. That's the main purpose of the CPAN::U experiment: to be a crash-test dummy for the eventual "return to the source". And this is why I'm kindly asking for your help. There are too many unanswered questions (which can be answered directly on the project's landing page):
As always, pull requests are welcome!
@top = (sort @a)[0 .. $n - 1]. Mostly, it's good enough for anything one would dare to store in RAM.
Then there is Sort::Key::Top, which allows you to write @top = top $n => @a. Yet another piece of syntactic sugar?
Not even close! While the docs don't state it boldly, it is an XS implementation of a quickselect-style partial sort.
So, expect it to be fast. How fast?
Here are the results of my attempt at benchmarking the selection of the 10 longest words from the system dictionary (235886 words in total):
                 Rate  quicksort   pureperl  quickselect
quicksort     0.329/s         --       -75%         -89%
pureperl       1.30/s       295%         --         -57%
quickselect    3.04/s       825%       134%           --
Note that the quicksort is there only to verify this claim from the Wikipedia article:
However, if done properly, a Java implementation is typically a magnitude (10x) faster than the quicksort algorithm.
Yup, seems like so. BTW, the 10 longest words are:
"scientificophilosophical"
"tetraiodophenolphthalein"
"formaldehydesulphoxylate"
"thyroparathyroidectomize"
"pathologicopsychological"
"formaldehydesulphoxylic"
"hematospectrophotometer"
"thymolsulphonephthalein"
"phenolsulphonephthalein"
"epididymodeferentectomy"
#!/usr/bin/env perl
use 5.010000;
use autodie;
use strict;
use warnings qw(all);
use Carp qw(croak);
use Benchmark qw(cmpthese);
use Sort::Key::Top qw(rnkeytopsort);
my %words;
open my $fh, q(<), q(/usr/share/dict/words);
while (<$fh>) {
    chomp;
    $words{$_} = length;
}
close $fh;
say q(words in hash: ) . scalar keys %words;
my $top_n = 10;
my $code = {
    quickselect => sub { rnkeytopsort { $words{$_} } $top_n => keys %words },
    quicksort   => sub {
        use sort qw(_quicksort stable);
        (
            sort { $words{$b} <=> $words{$a} }
                keys %words
        )[0 .. $top_n - 1];
    },
    pureperl    => sub {
        (
            sort { $words{$b} <=> $words{$a} }
                keys %words
        )[0 .. $top_n - 1];
    },
};

croak qq(something went VERY wrong)
    unless [$code->{quickselect}->()] ~~ [$code->{pureperl}->()];

cmpthese(100 => $code);
Google Refine is awesome. If you're unaware of what it is, visit its official page and watch at least the first screencast. You'll see it can be helpful for several ETL-related tasks.
Currently, I use it a lot, especially for simple (but boring) tasks, like loading a CSV, trimming out some outliers and saving the result as JSON to be imported into MongoDB. Nothing a Perl one-liner couldn't do.
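Something along these lines (just a sketch; the column layout, the outlier rule and the file names are made up):
$ perl -MJSON::PP -F, -lanE 'next if $. == 1 or $F[2] > 1000; say encode_json({ id => $F[0], name => $F[1], value => 0 + $F[2] })' input.csv > output.json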
However, the opposite is not true: Perl one-liners are a lot more flexible than Google Refine. Now, what if we could merge both?
As a practical example, I'll use some georeferenced data I was working on. Let's suppose I have to deduplicate registers, and one of the "duplicate" rules is their proximity on the map. Google Refine is far from a full-featured GIS, and is unable to handle a two-dimensional coordinate system. Enter GeoDNA: an algorithm to lower geospatial dimensions. As its FAQ says,
GeoDNA is a way to represent a latitude/longitude coordinate pair as a string. That sounds simple enough, but it's a special string format: the longer it is, the more accurate it is. More importantly, each string uniquely defines a region of the earth's surface, so in general, GeoDNA codes with similar prefixes are located near each other. This can be used to perform proximity searching using only string comparisons (like the SQL "LIKE" operator).
Another interesting property of GeoDNA is that when ordering a set of records by their GeoDNA code, close locations are likely to appear in adjacent rows (sometimes close locations will have very different prefixes, but similar prefixes always represent close locations).
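To get a feel for it, here is a minimal sketch using the same encode_geo_dna() function as the webservice below (the coordinates are arbitrary examples):

#!/usr/bin/env perl
use strict;
use warnings;
use Geo::DNA qw(encode_geo_dna);

# Two nearby points...
my $near_a = encode_geo_dna(-23.5505, -46.6333);
my $near_b = encode_geo_dna(-23.5510, -46.6340);
# ...and a distant one
my $far    = encode_geo_dna(52.3676, 4.9041);

print "$_\n" for $near_a, $near_b, $far;
# The first two codes share a long common prefix, so sorting
# records by their GeoDNA string puts them in adjacent rows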
To incorporate GeoDNA into Google Refine, we'll use the Add column by fetching URLs option, clicking on the header of any column (it doesn't matter which one, as we'll reference two of them explicitly anyway):
As the expression, we'll paste the following code (here, pay attention to the correct latitude/longitude column names):
'http://127.0.0.1:3000/?lat='+
row.cells['latitude'].value
+'&lon='+
row.cells['longitude'].value
Throttle delay can be zeroed, as our webservice is local. The final configuration should look like this (don't push the OK button, yet):
Now, check if you have the Mojolicious and Geo::DNA Perl modules (install them from CPAN if not) and paste this into your terminal:
perl -MGeo::DNA -Mojo -E 'a("/"=>sub{my$s=shift;$s->render(json=>{geocode=>Geo::DNA::encode_geo_dna($s->param("lat"),$s->param("lon"))})})->start' daemon
If you prefer a "human-readable" version, paste the following code into geocode-webservice.pl
:
#!/usr/bin/env perl
use Geo::DNA qw(encode_geo_dna);
use Mojolicious::Lite;
any '/' => sub {
    my $self = shift;
    $self->render(json => {
        geocode => encode_geo_dna(
            $self->param('lat'),
            $self->param('lon'),
        ),
    });
};
app->start;
Once you start the webservice, it will report Server available at http://127.0.0.1:3000. Now, click OK in the Google Refine dialog and wait. Even without the delay, it could be a bit slow; but even then, this hack saved me a lot of time ;)
Over HTTPS. Via SOCKS5 tunnel. On an aged CentOS box (think Perl v5.8). With no root privileges. Bonus points if it uses HTTP compression. Better prepare for some serious yak shaving.
If only WWW::Mechanize was written on top of libcurl instead of LWP::UserAgent! (Spoiler: I doubt it could ever happen; libcurl is all about manipulexity, whipuptitude is beyond its scope.) How cool would it be to have all those features supported out of the box?
$ curl -V
curl 7.28.0 (x86_64-apple-darwin12.2.0) libcurl/7.28.0 OpenSSL/1.0.1c zlib/1.2.7 c-ares/1.7.5 libidn/1.25 libssh2/1.2.7
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP
Now, what about this?
$ PERL5OPT=-MLWP::Protocol::Net::Curl=verbose,1 mech-dump https://google.com
Or, in your script:
#!/usr/bin/env perl
use common::sense;
use LWP::Protocol::Net::Curl;
use WWW::Mechanize;
...
You could even use Perl as a glue between libcurl and libxml:
#!/usr/bin/env perl
use common::sense;
use Data::Printer;
use LWP::Protocol::Net::Curl encoding => ''; # enables Content-Encoding: deflate, gzip
use Web::Scraper::LibXML;
my $scraper = scraper {
    process "a[href]", "urls[]" => '@href';
    result 'urls';
};
my $links = $scraper->scrape(URI->new('http://www.cpan.org/'));
p $links;
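Going back to the SOCKS5 tunnel from the opening paragraph: since the import list maps straight to libcurl options (as encoding does above), something like the following sketch should, presumably, route everything through the tunnel (the proxy address is a placeholder; check the module's POD for the exact option spelling):

#!/usr/bin/env perl
use common::sense;
use LWP::Protocol::Net::Curl proxy => 'socks5://127.0.0.1:1080';
use WWW::Mechanize;

# Every LWP/Mechanize request now goes through libcurl,
# which handles the SOCKS5 proxying and HTTPS by itself
my $mech = WWW::Mechanize->new;
$mech->get('https://google.com/');
say $mech->title;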
LWP::Protocol::Net::Curl is a work in progress, but how complete is it?
- it passes the libwww-perl-6.04/t/ test suite;
- it passes the WWW-Mechanize-1.72/t/ test suite (minor caveats);
- PERL5OPT=-MLWP::Protocol::Net::Curl lwp-(download|dump|mirror|request) work.
Unfortunately, no CPAN Testers reports are available for the latest release, which fixed a major bug with proper :content_file handling. Many other bugs may lurk around, so keep an eye on the project's GitHub repo!