App::scrape - Simple HTML scraping from the command line
Inspired by a demonstration of Mojolicious on the command line, I replicated the relevant functionality as a stand-alone program, tentatively named App::scrape. It's currently based on HTML::TreeBuilder::XPath, the ever-useful HTML::Selector::XPath, and LWP::Simple.
My long-term plan is to turn the program back into a module that can swap HTML::TreeBuilder for a different engine with the same query capabilities. In particular, I would like to support WWW::Scripter and WWW::Mechanize::Firefox as backends, and also XML::LibXML in HTML mode.
Ideally, the code lives on as a module that gets used by other modules again. WWW::Mechanize has no convenient scraping support. Web::Scraper has no convenient navigation support and doesn't lend itself to data-driven approaches due to the DSL-config-style it uses. Having a common way to do queries against a DOM tree would be a nice thing if you want to extract data from the interwebz.
I plan on releasing the whole thing onto CPAN once I've got the API worked out and moved the meat of the program back into a module. What also remains to be done is abstracting out what a DOM engine needs to provide to fetch a resource and then query it.
I guess some examples would also help:
> perl -w bin\scrape.pl http://perl.org title
The Perl Programming Language - www.perl.org
> perl -w bin\scrape.pl http://perl.org //a/@href
/
/
/learn.html
...
> perl -w bin\scrape.pl http://perl.org --uri=1 //a/@href
http://perl.org/
http://perl.org/
http://perl.org/learn.html
...
> perl -w bin\scrape.pl http://perl.org --uri=1 //a/@href a
http://perl.org/
http://perl.org/ Home
http://perl.org/learn.html Learn
...
> perl -w bin\scrape.pl http://perl.org --sep=";" //a/@href a
/;
/;Home
/learn.html;Learn
...
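For readers wondering how XPath and CSS queries can sit side by side on one command line, here is a minimal sketch of the idea using HTML::TreeBuilder::XPath and HTML::Selector::XPath. It is not the actual code of scrape.pl: the helpers `to_xpath` and `scrape_rows` are hypothetical names, and the rule "anything starting with `/` or `./` is XPath, everything else is CSS" is an assumption about how such a dispatch could work. The HTML is inlined so the sketch does not need network access.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath 'selector_to_xpath';

# Hypothetical helper: pass XPath queries through unchanged and
# translate everything else from a CSS selector to XPath.
# The "starts with / or ./" test is an assumed heuristic.
sub to_xpath {
    my ($query) = @_;
    return $query =~ m{^\.?/} ? $query : selector_to_xpath($query);
}

# Hypothetical helper: run every query against the document and
# return one row (array reference) per match index, one column per query.
sub scrape_rows {
    my ($html, @queries) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

    # One list of string values per query
    my @columns = map { [ $tree->findvalues( to_xpath($_) ) ] } @queries;
    $tree->delete;

    # The longest column determines the number of rows
    my $rows = 0;
    for my $col (@columns) {
        $rows = @$col if @$col > $rows;
    }

    # Missing cells become empty strings
    return map {
        my $i = $_;
        [ map { $_->[$i] // '' } @columns ]
    } 0 .. $rows - 1;
}

my $html = '<html><head><title>Demo</title></head><body>'
         . '<a href="/">Home</a> <a href="/learn.html">Learn</a>'
         . '</body></html>';

# Same shape as the --sep=";" example: one XPath query, one CSS query
print join(";", @$_), "\n" for scrape_rows($html, '//a/@href', 'a');
```

Pairing the columns by match index is the simplest thing that could work; it assumes that the queries return their results in the same document order and in matching numbers.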
If you could add CSS-style queries, that would be wonderful. They are usually shorter than XPath and are familiar to more people, as CSS invades everything. Obviously XPath can do more, so having both as an option seems like a good idea.
It already does that (see the second parameter in the last example), but I'll try to make that more obvious in the documentation.