Scrappy - Automated Full-Service Web Spider

For the past two or three months (on and off) I have been developing a web spider framework that brings together many web scraping concepts and methodologies in one library (package). That framework is Scrappy, whose name is a play on "scraper happy" (or "happy scraper"), and it is not to be confused with Scrapy, the Python scraper framework.

In the interest of time I will just highlight some of the features that I think are innovative and add value.

Scrappy can be used in a DSL or OOP style (even interchangeably), e.g.
[oop]
use Scrappy;
my $spidy = Scrappy->new;
$spidy->crawl($url, { 'a' => sub { print 'I found this link', shift->href; } });

or

[dsl]
use Scrappy qw/:syntax/;
crawl $url, { 'a' => sub { print 'I found this link', shift->href; } };

Pick a User-Agent, Any User-Agent, e.g.
use Scrappy qw/:syntax/;
user_agent random_ua;

or

user_agent random_ua 'firefox'; # firefox only
user_agent random_ua 'firefox', 'linux'; # firefox on linux only

XPath or Grab and Zoom
Most spider writers know that if you need to grab an element with pinpoint accuracy, especially when the HTML is not marked up with many classes or IDs for CSS selectors to target, then XPath is your best friend. Sometimes, though, you may not want a mix of XPath and CSS selectors throughout your code, so instead you can use the grab-and-zoom technique to get closer to that particular element.

my $block = grab 'div', ':all'; #grab
my $zoom = grab 'span', ':all', $block->html; #zoom

Scrappy has URL pattern matching (like web apps), e.g.
get 'http://localhost/tags/websites';

if (match '/tags/:tag') {
    print param('tag');
}

More and More Variables Automatically
Scrappy has a convenient stash object for sharing data across various actions and forked processes, and a param method for accessing the current page's query string or matched URL patterns, e.g.
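Something like the following rough sketch (the key name, and the exact stash calling convention shown here, key/value pairs read back by key, are my assumptions for illustration):

use Scrappy qw/:syntax/;

get 'http://localhost/tags/websites';

if (match '/tags/:tag') {
    # param exposes matched URL pattern values (and the querystring)
    # assumption: stash accepts key/value pairs and reads a value back by key
    stash current_tag => param('tag');
    print stash('current_tag'), "\n";
}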

Scrappy is Automated Out-of-the-Box
Scrappy scripts can be written manually, with you specifying which pages to crawl and scrape, or you can use the crawl method to automatically crawl the pages in the queue and apply specific actions to specific pages. The crawlers method provides the same level of automation but forks the action-processing operations across multiple processes. In short, you get a site crawler and the ability to execute multiple processes.
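Roughly, reusing the crawl interface from the first example (the img selector and its src accessor are illustrative, and I am assuming here that crawlers accepts the same arguments as crawl, per the description above):

use Scrappy qw/:syntax/;

# single-process: crawl pages in the queue, applying actions to matched elements
crawl 'http://localhost/', {
    'a'   => sub { print 'found link ',  shift->href, "\n" },
    'img' => sub { print 'found image ', shift->src,  "\n" },
};

# same automation, but action processing is forked into multiple processes
crawlers 'http://localhost/', {
    'a' => sub { print 'found link ', shift->href, "\n" },
};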

Session Handling
Session handling can easily be turned on; by default, all cookies encountered are stored in the session file, organized by domain.
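Purely as an illustration (the session helper and its filename argument are my assumptions, not a confirmed part of the API):

use Scrappy qw/:syntax/;

# hypothetical: turn on session handling backed by a file;
# cookies encountered are then stored in it by domain
session 'scrappy_session.yml';

get 'http://localhost/login';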

Pausing
Pausing lets you set a range, in seconds, for your spider to wait between requests, in an attempt to mimic human interaction.
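Something along these lines (the pause helper and its range arguments are assumptions on my part):

use Scrappy qw/:syntax/;

# hypothetical: wait between 5 and 20 seconds between requests
pause 5, 20;

get 'http://localhost/page/1';
get 'http://localhost/page/2';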

Proxy Support
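No details here, but purely as a sketch (the proxy helper and its argument style are assumptions), routing requests through a proxy might look like:

use Scrappy qw/:syntax/;

# hypothetical: send all HTTP requests through a local proxy
proxy 'http', 'http://127.0.0.1:8080/';

get 'http://localhost/';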

Get It On CPAN:
http://search.cpan.org/dist/Scrappy/lib/Scrappy.pm

Follow Scrappy on GitHub:
http://github.com/alnewkirk/Scrappy

Also, a previous article written on Scrappy:
http://ana.im/press/2010/09/scrappy/
