November 2012 Archives

Google Refine + Perl

(repost from http://sysd.org/google-refine-perl-english/; it's more contextual here)

Google Refine is awesome. If you're unaware of what it is, visit the official page and watch at least the first screencast. You'll see it can be helpful for several ETL-related tasks.

Currently, I use it a lot, especially for simple (but boring) tasks, like loading a CSV, trimming out some outliers and saving the result as JSON to be imported into MongoDB. Nothing a Perl one-liner couldn't do.
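For a rough idea of what that kind of task looks like in plain Perl, here is a minimal sketch (a short script rather than a true one-liner). It assumes a naive CSV without quoted or embedded commas; the helper name csv_to_json_lines, the sample data and the 0..120 range are all made up for illustration. It prints one JSON document per line, the newline-delimited layout that mongoimport reads by default.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP;    # core module since Perl 5.14

# Hypothetical helper: convert naive CSV text (no quoted fields, no embedded
# commas) into JSON documents, one per row, skipping rows whose value in
# $field falls outside [$min, $max].
sub csv_to_json_lines {
    my ($csv, $field, $min, $max) = @_;
    my ($header, @rows) = grep { length } split /\r?\n/, $csv;
    my @names = split /,/, $header;
    my $json  = JSON::PP->new->canonical;    # sorted keys, stable output
    my @out;
    for my $row (@rows) {
        my %doc;
        @doc{@names} = split /,/, $row;
        next if $doc{$field} < $min || $doc{$field} > $max;    # drop outliers
        $doc{$field} += 0;                                     # emit as a JSON number
        push @out, $json->encode(\%doc);
    }
    return @out;
}

# Made-up sample data: the second row is an obvious outlier.
my $csv = <<'CSV';
name,age
alice,34
bob,431
carol,29
CSV

print "$_\n" for csv_to_json_lines($csv, 'age', 0, 120);
```

For real-world CSV (quoting, escapes, odd encodings), a proper parser such as Text::CSV would be the safer choice; the split above is only good enough for a sketch.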

However, the opposite is not true: Perl one-liners are a lot more flexible than Google Refine. Now, what if we could merge both?

libcurl as LWP backend (or "all your protocol are belong to us")

Suppose you are planning to scrape a few thousand pages using WWW::Mechanize.

Over HTTPS. Via SOCKS5 tunnel. On an aged CentOS box (think Perl v5.8). With no root privileges. Bonus points if it uses HTTP compression. Better prepare for some serious yak shaving.

If only WWW::Mechanize were written on top of libcurl, instead of LWP::UserAgent! (spoiler: I doubt it could ever happen; libcurl is all about manipulexity; whipuptitude is beyond its scope) How cool would it be to support all those features out of the box?

$ curl -V
curl 7.28.0 (x86_64-apple-darwin12.2.0) libcurl/7.28.0 OpenSSL/1.0.1c zlib/1.2.7 c-ares/1.7.5 libidn/1.25 libssh2/1.2.7
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP

Now, what about this?

$ PERL5OPT=-MLWP::Protocol::Net::Curl=verbose,1 mech-dump https://google.com
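The PERL5OPT trick works because perl turns -MModule=a,b into "use Module split(/,/, q{a,b})", so the command above loads LWP::Protocol::Net::Curl with verbose => 1 before mech-dump even starts. A roughly equivalent script could look like the sketch below. It assumes LWP::Protocol::Net::Curl and WWW::Mechanize are installed and the network is reachable, and it only approximates mech-dump, whose default mode lists the forms on the page.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Loading this module registers libcurl as the LWP protocol backend;
# verbose => 1 mirrors the "=verbose,1" part of the PERL5OPT value.
use LWP::Protocol::Net::Curl verbose => 1;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('https://google.com');    # the request now goes through libcurl
$mech->dump_forms;                   # roughly what mech-dump prints by default
```

The rest of the script is ordinary WWW::Mechanize code; only the extra "use" line changes which library does the talking.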

About stas

Just another lazy, impatient and arrogant IT guy.