App::scrape - Simple HTML scraping from the command line

By Max Maischein on February 14, 2011 9:11 PM under programming

Inspired by a demonstration of Mojolicious on the command line, I replicated the relevant functionality as a stand-alone program, tentatively named App::scrape. It's currently based on HTML::TreeBuilder::XPath, the ever-useful HTML::Selector::XPath, and LWP::Simple.

My long-term plan is to turn the program back into a module that can switch out HTML::TreeBuilder for a different engine with the same query capabilities. In particular, I would like to support WWW::Scripter and WWW::Mechanize::Firefox as backends, and also XML::LibXML in HTML mode.

Ideally, the code lives on as a module that gets used by other modules again. WWW::Mechanize has no convenient scraping support. Web::Scraper has no convenient navigation support and doesn't lend itself to data-driven approaches due to the DSL-config-style it uses. Having a common way to do queries against a DOM tree would be a nice thing if you want to extract data from the interwebz.

I plan on releasing the whole thing onto CPAN once I've got the API worked out and moved the meat of the program back into a module. What also remains to be done is abstracting out what a DOM engine needs to provide to fetch a resource and then query it.

I guess some examples would also help:


> perl -w bin\scrape.pl http://perl.org title
The Perl Programming Language - www.perl.org

> perl -w bin\scrape.pl http://perl.org //a/@href
/
/
/learn.html
...

> perl -w bin\scrape.pl http://perl.org --uri=1 //a/@href
http://perl.org/
http://perl.org/
http://perl.org/learn.html
...

> perl -w bin\scrape.pl http://perl.org --uri=1 //a/@href a
http://perl.org/
http://perl.org/        Home
http://perl.org/learn.html      Learn
...

> perl -w bin\scrape.pl http://perl.org --sep=";" //a/@href a
/;
/;Home
/learn.html;Learn
...
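
For readers who want to see how the three modules named above could fit together, here is a minimal sketch. It is not the actual scrape.pl source: the helper names to_xpath, scrape_columns and format_rows are made up for illustration, and the rule "a query starting with / is XPath, anything else is a CSS selector" is an assumption based on the examples. It does not implement --uri.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Assumption: a query starting with "/" is already XPath;
# anything else is treated as a CSS selector and translated.
sub to_xpath {
    my ($query) = @_;
    return $query =~ m{^/} ? $query : selector_to_xpath($query);
}

# Run every query against the HTML and return one array reference
# of string values per query (one "column" per query).
sub scrape_columns {
    my ($html, @queries) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my @columns = map { [ $tree->findvalues(to_xpath($_)) ] } @queries;
    $tree->delete;    # free the parse tree
    return @columns;
}

# Combine the columns row by row, joining the cells with $sep.
# Columns that are too short are padded with empty strings.
sub format_rows {
    my ($sep, @columns) = @_;
    my $last = -1;
    for my $col (@columns) {
        $last = $#$col if $#$col > $last;
    }
    my @rows;
    for my $i (0 .. $last) {
        push @rows, join $sep,
            map { my $v = $_->[$i]; defined $v ? $v : '' } @columns;
    }
    return @rows;
}

# Only touch the network when a URL and at least one query are given.
if (@ARGV >= 2) {
    my ($url, @queries) = @ARGV;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    print "$_\n" for format_rows("\t", scrape_columns($html, @queries));
}
```

Saved under a hypothetical name such as sketch.pl, it would be called like the examples above, for instance `perl -w sketch.pl http://perl.org //a/@href a`. Note that pairing columns purely by position only lines up when every query matches the same number of nodes.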

Tagged as: web scraping

2 Comments

mpeters.myopenid.com | February 15, 2011 2:38 AM | Reply

If you could add CSS-style queries, that would be wonderful. They are usually shorter than XPath and are familiar to more people as CSS invades everything. Obviously XPath can do more, so having both as an option seems like a good idea.

Max Maischein | February 15, 2011 4:37 PM | Reply

It already does that (see the second parameter in the last example), but I'll try to make that more obvious in the documentation.

About Max Maischein

I'm the Treasurer for the Frankfurt Perlmongers e.V. I have organized Perl events including 9 German Perl Workshops and one YAPC::Europe.