App::scrape - Simple HTML scraping from the command line

Inspired by a demonstration of Mojolicious on the command line, I replicated the relevant functionality as a stand-alone program, tentatively named App::scrape. It's currently based on HTML::TreeBuilder::XPath, the ever-useful HTML::Selector::XPath, and LWP::Simple.
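
To make the division of labour between these modules concrete, here is a minimal sketch of how they could be combined. It is not the actual scrape.pl source: the inline sample page, the `to_xpath` helper, and the rule "anything starting with a slash is XPath, everything else is a CSS selector" are assumptions made for illustration. The real program would fetch the HTML first, for instance with `get($url)` from LWP::Simple.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Inline sample page (an assumption for illustration). A real run would
# fetch the HTML instead, e.g. with LWP::Simple's get($url).
my $html = <<'HTML';
<html><head><title>Sample page</title></head>
<body><a href="/">Home</a> <a href="/learn.html">Learn</a></body></html>
HTML

# Parse the page into a tree that understands XPath queries.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Hypothetical dispatch rule: queries starting with "/" are passed through
# as XPath, everything else is translated from a CSS selector to XPath.
sub to_xpath {
    my ($query) = @_;
    return $query =~ m{^/} ? $query : selector_to_xpath($query);
}

# Run one CSS query and one XPath query and print the matching values.
for my $query ('title', '//a/@href') {
    my @values = $tree->findvalues(to_xpath($query));
    print "$query: $_\n" for @values;
}
```

Translating CSS selectors to XPath up front means the tree only ever has to answer XPath queries, which is one reason a different engine with XPath support could plausibly be swapped in later.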

My long-term plan is to turn the program back into a module that can swap HTML::TreeBuilder for a different engine with the same query capabilities. In particular, I would like to support WWW::Scripter and WWW::Mechanize::Firefox as backends, and also XML::LibXML in HTML mode.

Ideally, the code lives on as a module that gets used by other modules again. WWW::Mechanize has no convenient scraping support. Web::Scraper has no convenient navigation support and doesn't lend itself to data-driven approaches due to the DSL-config-style it uses. Having a common way to do queries against a DOM tree would be a nice thing if you want to extract data from the interwebz.

I plan on releasing the whole thing onto CPAN once I've got the API worked out and moved the meat of the program back into a module. What also remains to be done is abstracting out what a DOM engine needs to provide to fetch a resource and then query it.

I guess some examples would also help:

> perl -w bin\scrape.pl http://perl.org title
The Perl Programming Language - www.perl.org

> perl -w bin\scrape.pl http://perl.org //a/@href
/
/
/learn.html
...

> perl -w bin\scrape.pl http://perl.org --uri=1 //a/@href
http://perl.org/
http://perl.org/
http://perl.org/learn.html
...

> perl -w bin\scrape.pl http://perl.org --uri=1 //a/@href a
http://perl.org/
http://perl.org/ Home
http://perl.org/learn.html Learn
...

> perl -w bin\scrape.pl http://perl.org --sep=";" //a/@href a
/;
/;Home
/learn.html;Learn
...
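
The multi-column output in the last examples can be thought of as zipping the result lists of the individual queries row by row. The following is only a sketch of that idea, not the code in scrape.pl: `format_rows` and `absolutize` are hypothetical helpers, the tab default for the separator is an assumption, and the input lists are copied from the example output above rather than scraped.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: combine one result list per query into output rows.
# Missing cells (shorter columns) are rendered as empty strings.
sub format_rows {
    my (%args) = @_;
    my @columns = @{ $args{columns} };
    my $sep = defined $args{sep} ? $args{sep} : "\t";    # default is an assumption

    # The number of rows is the length of the longest column.
    my $rows = 0;
    for my $col (@columns) {
        $rows = @$col if @$col > $rows;
    }

    my @lines;
    for my $i (0 .. $rows - 1) {
        push @lines, join $sep, map { defined $_->[$i] ? $_->[$i] : '' } @columns;
    }
    return @lines;
}

# Hypothetical helper: resolve relative links against the page URL,
# roughly what --uri=1 appears to do in the examples.
sub absolutize {
    my ($base, @links) = @_;
    return map { URI->new_abs($_, $base)->as_string } @links;
}

# Values taken from the example output above.
my @hrefs = ('/', '/', '/learn.html');
my @texts = ('', 'Home', 'Learn');

# Roughly the shape of: --sep=";" //a/@href a
print "$_\n" for format_rows(columns => [ \@hrefs, \@texts ], sep => ';');

# Roughly the shape of: --uri=1 //a/@href a
my @absolute = absolutize('http://perl.org', @hrefs);
print "$_\n" for format_rows(columns => [ \@absolute, \@texts ], sep => ' ');
```

Keeping each query's results in its own list until the very end makes it easy to post-process a single column, such as turning links into absolute URLs, before the rows are joined.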

2 Comments

If you can add CSS style queries that would be wonderful. They are usually shorter than XPath and are familiar to more people as CSS invades everything. Obviously XPath can do more, so having both as an option seems like a good idea.

About Max Maischein

I'm the author of various CPAN modules. I'm also one of the admins of perlmonks.org.