App::scrape - Simple HTML scraping from the command line
Inspired by a demonstration of Mojolicious on the command line, I replicated the relevant functionality as a stand-alone program, tentatively named App::scrape. It's currently based on HTML::TreeBuilder::XPath, the ever-useful HTML::Selector::XPath, and LWP::Simple.
My long-term plan is to turn the program back into a module that can swap out HTML::TreeBuilder for a different engine with the same query capabilities. I would especially like to support WWW::Scripter and WWW::Mechanize::Firefox as backends, but also XML::LibXML in HTML mode.
Ideally, the code lives on as a module that other modules build on in turn. WWW::Mechanize has no convenient scraping support. Web::Scraper has no convenient navigation support and doesn't lend itself to data-driven approaches because of the DSL-config style it uses. A common way to run queries against a DOM tree would be a nice thing to have if you want to extract data from the interwebz.
I plan on releasing the whole thing onto CPAN once I've worked out the API and moved the meat of the program back into a module. What also remains to be done is abstracting out what a DOM engine needs to provide to fetch a resource and then query it.
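To give an idea of what the core of such a module looks like, here is a rough sketch of the fetch-and-query step built on the same three modules the program currently uses. This is not the actual scrape.pl code, just a minimal illustration; the CSS-or-XPath detection heuristic is my own simplification:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    # Sketch: fetch a page, turn a CSS selector into XPath, print the matches
    my ($url, $selector) = @ARGV;

    my $html = get($url)
        or die "Couldn't fetch $url";

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    # Crude heuristic: anything starting with a slash is taken as raw XPath,
    # everything else is treated as a CSS selector and converted
    my $xpath = $selector =~ m!^/! ? $selector : selector_to_xpath($selector);

    print "$_\n" for $tree->findvalues($xpath);

    $tree->delete; # free the parse tree

A pluggable backend would replace the LWP::Simple fetch and the HTML::TreeBuilder::XPath parse with whatever WWW::Scripter, WWW::Mechanize::Firefox or XML::LibXML provide, as long as the result supports the same findnodes/findvalues-style queries.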
I guess some examples would also help:
> perl -w bin\scrape.pl http://perl.org title
The Perl Programming Language - www.perl.org
> perl -w bin\scrape.pl http://perl.org //a/@href
> perl -w bin\scrape.pl http://perl.org --uri=1 //a/@href
> perl -w bin\scrape.pl http://perl.org --uri=1 //a/@href a
> perl -w bin\scrape.pl http://perl.org --sep=";" //a/@href a