Web Scraping with Perl & PhantomJS

PhantomJS is a 'headless' WebKit browser, mainly intended for use as a web testing framework, and is controlled by a JavaScript API. The 'headless' aspect also makes it extremely useful for scraping JavaScript-heavy websites.

The problem with PhantomJS (up until the v1.8 release on 23 December 2012) was that, if you were unfamiliar with JavaScript, CoffeeScript or Node.js (if you were using the Casper.js fork), it wasn't very easy to understand or control. Since the v1.8 release, PhantomJS supports WebDriver, which basically means you can control it from pretty much any language you like (although Perl isn't explicitly mentioned).

Since I like Perl, I decided to give it a go after trying WWW::Mechanize::Firefox + MozRepl, which is great, but doesn't work if you're going double-headless and are running it on a GUI-less server.

I was previously using Mojo::UserAgent as the scraping agent for this project, but it was ridiculously simple to plug in Selenium::Remote::Driver to perform the GET request and feed the fully rendered HTML back into the awesome Mojo::DOM parser for easy manipulation of the data. (I found out about Wight, which offers more native support for PhantomJS, after working on the project, but the below still applies if you just want to use the PhantomJS API.)

All you need to do to get PhantomJS up & running for your scraper is:

1. Install it

2. Run the command `phantomjs --webdriver=9134 &` to send PhantomJS into the background as a proxy for your requests

3. Combine with Mojolicious:

#!/usr/bin/env perl
use Modern::Perl;
use Mojo::DOM;
use Mojo::URL;
use Selenium::Remote::Driver;

my $url = 'http://www.google.co.uk';

# fetch the fully rendered HTML of the web page via PhantomJS
my $html = _fetch_page($url);

# store the URL as a Mojo::URL object (useful for making links absolute etc)
my $mojo_uri = Mojo::URL->new($url);

# check for success of request
if ($html) {
    # parse the rendered HTML and print the page title
    my $dom = Mojo::DOM->new($html);
    say $dom->at('title')->text;
}

sub _fetch_page {
    my $url = shift;

    # connect to the PhantomJS WebDriver server started in step 2
    my $driver = Selenium::Remote::Driver->new(
        remote_server_addr => 'localhost',
        port               => 9134,
        browser_name       => 'chrome',
        platform           => 'VISTA',
    );
    $driver->get($url);

    # return the page source as a plain string; the caller parses it
    my $html = $driver->get_page_source();
    $driver->quit();
    return $html;
}
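
The script stores the URL in a Mojo::URL object but never uses it. One thing it is handy for is resolving relative links found in the rendered page. Here is a minimal sketch, assuming a hard-coded HTML string in place of the page source returned by PhantomJS; the `absolute_links` helper and the sample markup are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::URL;

# Hypothetical stand-in for the HTML returned by $driver->get_page_source()
my $html = '<html><body><a href="/about">About</a> '
         . '<a href="http://example.com/x">X</a></body></html>';
my $base = Mojo::URL->new('http://www.google.co.uk/');

# Return every link in $html as an absolute URL string, resolved against $base
sub absolute_links {
    my ($html, $base) = @_;
    my $dom = Mojo::DOM->new($html);
    return $dom->find('a[href]')
               ->map(sub { Mojo::URL->new($_->attr('href'))->to_abs($base)->to_string })
               ->each;
}

say for absolute_links($html, $base);
```

Links that are already absolute pass through unchanged, so the same helper can be run over every anchor on the page.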

It's also stupidly easy to walk through a document's DOM, or even serve up a screengrab of the web page:

# Mojolicious controller action; the file also needs
# `use MIME::Base64;` and `use Selenium::Remote::Driver;`
sub screengrab {
    my $self = shift;
    my $url  = $self->param('url');
    my $driver = Selenium::Remote::Driver->new(
        remote_server_addr => 'localhost',
        port               => 9134,
        browser_name       => 'chrome',
        platform           => 'VISTA',
    );
    $driver->get($url);

    # screenshot() returns the PNG as a base64-encoded string
    my $png_base64 = $driver->screenshot();
    $driver->quit();
    $self->render( data => MIME::Base64::decode_base64($png_base64), format => 'png' );
}
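
Outside a Mojolicious app, the same base64 string can be decoded and written straight to disk. This is a rough sketch, assuming PhantomJS is listening on port 9134 as in step 2; `save_png`, `grab_to_file` and the script name in the usage comment are hypothetical names for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Decode a base64-encoded PNG and write it to $path as raw bytes
sub save_png {
    my ($png_base64, $path) = @_;
    open my $fh, '>:raw', $path or die "Cannot open $path: $!";
    print {$fh} decode_base64($png_base64);
    close $fh or die "Cannot close $path: $!";
    return $path;
}

# Fetch $url through PhantomJS and save a screenshot of it to $path
sub grab_to_file {
    my ($url, $path) = @_;
    require Selenium::Remote::Driver;    # loaded lazily, only needed here
    my $driver = Selenium::Remote::Driver->new(
        remote_server_addr => 'localhost',
        port               => 9134,
        browser_name       => 'chrome',
        platform           => 'VISTA',
    );
    $driver->get($url);
    my $png_base64 = $driver->screenshot();
    $driver->quit();
    return save_png($png_base64, $path);
}

# Usage (hypothetical script name): perl screengrab.pl http://www.google.co.uk page.png
grab_to_file(@ARGV) if @ARGV == 2;
```

Opening the file with the `:raw` layer matters here, since the decoded PNG is binary data and should not go through any newline or encoding translation.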

5 Comments

The above seems like a lot of extra work compared to WWW::WebKit, a module for driving a headless WebKit browser. It uses Gtk3::WebKit, a native Perl binding set for the WebKit browser. WebKit support also exists in the Gtk2 Perl bindings, which pre-date PhantomJS by more than a year.

I am going to suggest the same as Kimmel: it's overkill when you have WebKit bindings in Perl without the limitations of Phantom.js. You could open as many parallel pages as you wish.

Here is a small utility that I wrote against Gtk3::WebKit (there was no WWW::WebKit at the time). Just for example:
https://github.com/luben/webshot

Best regards

Sounds totally cool, I'll check it out.

You might also look into WWW::HtmlUnit, which provides Perl bindings for the Java library. I've been using that at work for a long time now and like it quite a bit.

Thanks for writing this article - I do a lot of QA testing with Selenium, but wasn't aware of Phantom.js.

Also the comments were helpful too - can't believe there is an HTMLUnit Perl binding ... I didn't think to look for it and all this time I have been writing the Java code directly :(

About Rob Hammond

I blog mostly about SEO, but sometimes about Perl.