Scrappy - The All Powerful Web Spidering, Scraping, Creeping Crawling Framework

Hello all, it has been a long while since I’ve blogged here, partly because my account was mysteriously disabled, and I have only recently had that rectified.

Anyways, it has truly been an arduous journey learning Moose while re-writing Scrappy, The All Powerful Web Spidering, Scraping, Creeping Crawling Framework, but it has proven both exciting and rewarding.

In the beginning, the task was only supposed to involve upgrading Scrappy 0.6* to use Moose, but something in the Moose documentation kept resonating in my head: “With Moose, you can concentrate on the logical structure of your classes, focusing on what rather than how”.

… So the question became (as I sat down to re-write my little darling): how do I want Scrappy to function as a whole? I knew that I’d always wanted Scrappy to be compartmentalized. I knew that I wanted to do away with the DSL syntax and focus on a better OO design (hence Moose). I knew that it needed a plugin system so that, in the future, a community of adopters could take it to the next level. I ultimately concluded that these wants and needs would require me to completely re-write this module.

Dagnabit, … but wait, Moose has made the impossible (or rather, the implausible) doable, and the unbearable pleasant. Scrappy (the module) has been completely re-written and is fast becoming a robust library. Scrappy now includes classes for access control, event logging, session handling, URL matching, web request and response handling, proxy management, web scraping, downloading, plugin loading, and much more. If you haven’t looked at it lately, please take a moment to do so.
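To give a feel for that compartmentalization, here is a quick sketch of reaching a few of those component classes from a Scrappy instance. The queue accessor appears in the crawl example below; get(), session and logger are my best guesses at the current accessor names, so treat them as assumptions rather than documented API:

#!/usr/bin/perl
use strict;
use warnings;
use Scrappy;

my $scrappy = Scrappy->new;

# fetch a page (the web request/response classes do the heavy lifting)
$scrappy->get('http://search.cpan.org/recent');

# each concern lives behind its own accessor, e.g. the URL queue used
# in the crawl example below
$scrappy->queue->add('http://search.cpan.org/');

# assumed accessors for other components mentioned above
my $session = $scrappy->session;  # session handling
my $logger  = $scrappy->logger;   # event logging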

The following is a brief introduction to Scrappy:

# The elevator pitch, crawl a website with ease ...

Scrappy->new->crawl('http://search.cpan.org/recent',
    '/recent' => {
        '#cpansearch li a' => sub {
            my ($self, $item) = @_;
            print $item->{href}, "\n";
            $self->queue->add($item->{href});
        }
    },
    '/~:author/:package/' => {
        'body' => sub {
            my ($self, $item, $params) = @_;
            # ...
        }
    }
);

Scrappy ships with a command-line utility used to execute system- and user-defined “action” classes under the Scrappy::Action namespace (modeled after the Mojolicious CLI app). The Scrappy::Action namespace currently ships with two functional core classes: Download (used to download webpages from the internet) and Generate (used to generate boiler-plate code for scrapers).
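Since the CLI dispatches to classes under the Scrappy::Action namespace, a user-defined action presumably follows the same pattern as the core ones. The following is only a sketch; the role name and the convention that public methods become sub-commands are assumptions on my part, not documented API:

package Scrappy::Action::Hello;
use Moose;

# assumption: actions consume a shared role the way the core Download
# and Generate classes do
with 'Scrappy::Action';

# assumption: the CLI would map `scrappy hello world ...` to this method
sub world {
    my ($self, @args) = @_;
    print "hello, world: @args\n";
}

1;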

The Scrappy::Action::Download class, still very experimental, can be used to download a webpage including all images, style sheets (as well as the images referenced within them) and script files, as follows:

$ scrappy download page http://blogs.perl.org/

The Scrappy::Action::Generate class, not as experimental, can be used to generate boiler-plate code for scripts, projects, and project classes as follows:

$ scrappy generate script scr8p.pl

or …

$ scrappy generate project Scr8p::Something

or … to add a new class to an existing project …

$ scrappy generate class Scr8p::Something::Special

The generated project class Scr8p::Something::Special would look a bit like the following:

package Scr8p::Something::Special;
use Moose;
with 'Scrappy::Project::Document';

sub title {
    # return the page title
    return shift->scraper->select('title')->data->[0]->{text};
}

1;

… which might make you wonder wtf this class is doing, so let me explain. Upon initialization of your project class “Scr8p::Something”, all classes in the same namespace are automatically imported, which includes Scr8p::Something::Special. In the script file that initializes your project class, there should also be routes defined. Routes (as with most modern web frameworks) map URLs to classes. Imagine that the “/” route is mapped to “special”, which is shorthand for “Scr8p::Something::Special”. When a URL matching “/” is passed to the parse_document method of your project class, the parser class “Scr8p::Something::Special” will be used to parse that webpage and return a data structure based on the class’s methods and their return values. As project management is still highly experimental, I will not go into further detail.
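Pieced together from the description above, the script that drives the project might look roughly like this. To be clear, this is a sketch: parse_document is named above, but the route() method and the shape of the returned data are my assumptions:

#!/usr/bin/perl
use strict;
use warnings;
use Scr8p::Something;

my $project = Scr8p::Something->new;

# assumption: routes map URL patterns to parser classes by shorthand,
# so 'special' stands for Scr8p::Something::Special
$project->route('/' => 'special');

# parse_document picks the parser class whose route matches the URL and
# returns a data structure built from its methods, e.g. something like
# { title => '...' } given the title() method shown earlier
my $data = $project->parse_document('http://blogs.perl.org/');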

Finally, I would like to quickly showcase the Scrappy::Scraper::Parser module (based on Web::Scraper). This module allows you to extract data from webpages with deadly precision.

Consider the following example:

my $parser = Scrappy::Scraper::Parser->new;
$parser->html($html);

# get all links in all table rows with this CSS selector
my $links = $parser->scrape('table tr a');

# select all links in the 2nd table row of all tables with an XPath selector
$links = $parser->scrape('//table/tr[2]/a');

# precision scraping
# select all links in the 2nd table row ONLY with CSS selectors and focus()
$links = $parser->select('table tr')->focus(2)->scrape('a');

The example above uses three different techniques to extract links from a webpage, based on the level of precision needed. The last technique, focus(), which is specific to Scrappy::Scraper::Parser, allows you to zero in on a specific block of HTML so as not to accidentally pick up links from nested HTML tags.
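To make that concrete, here is a small illustrative sketch. The HTML and URLs are invented, and I am assuming focus() indexes the matched blocks zero-based, consistent with the ->data->[0] access shown earlier:

use strict;
use warnings;
use Scrappy::Scraper::Parser;

# two tables; we only want the links from the second one
my $html = q{
    <table id="nav">
        <tr><td><a href="/home">home</a></td></tr>
    </table>
    <table id="news">
        <tr><td><a href="/story-1">one</a> <a href="/story-2">two</a></td></tr>
    </table>
};

my $parser = Scrappy::Scraper::Parser->new;
$parser->html($html);

# the broad selector matches links in BOTH tables
my $all = $parser->scrape('table tr a');  # /home, /story-1, /story-2

# select the rows first, focus on the second matched block, then scrape
# within that block only
my $news = $parser->select('table tr')->focus(1)->scrape('a');  # /story-1, /story-2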

I apologize for this lengthy blog post; it’s my first in many months. I hope all interested will fork, watch, and contribute to this library on GitHub.

— Al Newkirk

2 Comments

Hi Al,

I noticed that Parallel::ForkManager is a dependency of Scrappy. Is it capable of doing requests in parallel out of the box?

-Steve

Hi Al,

Thanks for this good blog post.
I am looking for more documentation on the Scrappy module. Is there a book, tutorial, or example available?

Please help me out if you have any.

Thanks,
Anand
