[FAILED] 10x faster than LibXML
Unfortunately, the idea contained a fatal flaw. See the following post for an explanation.
Once upon a time I faced a huge pile of HTML files which I had to analyze. Say, there were about 1 000 000 of them. Say, 100 GB of data.
Most of you would say “It’s not that much!”. And you are right. It’s not.
But then I decided to estimate the time required to process that pile of files. I quickly put together the XPaths for what I needed and got a prototype in Web::Scraper. And there I go: ~0.94s per file, I/O overhead not included. That came to more than 11 days on my laptop. Phew!
Well, thanks to CPAN and you guys, we have a drop-in replacement that switches to LibXML: Web::Scraper::LibXML. With it I got ~0.17s per file, which came to only about two days. That was still too much for me, so I decided to try something more exotic.
I googled up a list of available HTML C/C++ parsers, went through their benchmarks, documentation and requirements, and decided that I liked the pugixml library.
Yep, I decided to go with an XML parser which can only parse valid XML, not HTML. To make that work, I agreed to run the HTML content through HTML Tidy first (there is HTML::Tidy, an interface to the htmltidy library).
So I wrote a simple XS wrapper around pugixml and passed the input through HTML::Tidy before parsing.
I got 0.03765s per file, about 11 hours in total. That is 4.5 times faster than what I had before, and 25 times faster than the initial HTML::TreeBuilder-based implementation.
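These totals are easy to sanity-check from the per-file timings. A quick back-of-the-envelope calculation (sketched in Python; the corpus size of 1 000 000 files is the rough figure from above):

```python
SECONDS_PER_DAY = 24 * 60 * 60
files = 1_000_000  # approximate corpus size from the post

# Reported per-file timings in seconds, I/O overhead not included
treebuilder = 0.94     # Web::Scraper (HTML::TreeBuilder)
libxml      = 0.17     # Web::Scraper::LibXML
pugixml     = 0.03765  # pugixml + HTML::Tidy via XS

print(treebuilder * files / SECONDS_PER_DAY)  # ~10.9 days of pure parsing
print(libxml * files / SECONDS_PER_DAY)       # ~2.0 days
print(pugixml * files / 3600)                 # ~10.5 hours
print(libxml / pugixml)                       # ~4.5x faster than LibXML
print(treebuilder / pugixml)                  # ~25x faster than TreeBuilder
```

The pure-CPU figures line up with the totals quoted in the post once you allow for I/O and the extra queries mentioned below.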
And please take into account that I was too lazy to strip out some additional queries I had put into that XS before benchmarking; those made parsing more complex (and more time-consuming). Synthetic benchmarks show that the pugixml + HTML::Tidy tandem is ten(!) times faster than LibXML.
Later, after two days or so, I decided that, even setting aside how simple the XS wrapper was to write, I would like to have such a thing in Perl land, and, in the long run, a drop-in replacement for Web::Scraper that uses PugiXML.
Please don't get me wrong: I'm not trying to say that HTML::TreeBuilder or XML::LibXML are bad. They do their job and do it perfectly. But for the moment when somebody in Perl land needs something that is as fast as possible... I just want it to be there, on CPAN.
And there are some cons to go with these pros of writing/using this thing. Here are some of them:
pugixml is not fully W3C conformant, and neither is its XPath 1.0 implementation.
And don't be fooled by the synthetic benchmarks I made. The final implementation might not be that fast: say, "only" 5-6 times faster than LibXML.
However, from my point of view, the implementation is worth the effort.
For example, if a Perlish PugiXML had been used for processing RSS feeds, I believe TheOldReader would not have suffered so much from the hit of 100 000 new subscribers in a single day when Google announced the shutdown of Reader. And I think most of you can name a use case for such a speedy HTML/XML parser off the top of your head.
So I drafted a rough plan for implementing a Perl interface to pugixml and discussed it with the author of the original library, Arseny Kapoulkine. The feedback I received was encouraging.
And now...
I'm looking for your feedback. Would you use such a thing if it were on CPAN?
I'm looking for your advice. Did I miss something? Do you have any additional concerns?
Finally, I'm looking for participants. If you would like to join the development, I'd be happy to talk to you. Mostly because I like to meet new people; partly because I'm lazy and would like to share the burden; and also, importantly, because I would like to have co-maintainer(s) to make sure that what we've built stays supported for a while.
So don't hesitate to write a comment or drop an email to yko@cpan.org
Thanks
Yes - please put this on CPAN
As long as the docs mention the limitations you have discussed here, it's a really worthwhile addition.
Thanks
Hi
Sure, put it on CPAN.
When you write the docs, please include a 'See Also' section, mentioning the other modules (and programs) you've looked at.
Be sure to mention Marpa::R2, which has its own HTML parser:
Marpa::R2
I wonder how many of these speed benefits could be retained if PugiXML were used as an alternative parser for XML::LibXML. That is, turn PugiXML parsing events into XML::LibXML::Node objects.
Toby, thank you for your interest.
I was thinking about the same thing, in the context of writing a drop-in replacement for Web::Scraper similar to Web::Scraper::LibXML, for about 3 minutes, and then rejected the idea.
Again, this doesn't mean the idea is bad. It's brilliant! It did come to my head, after all. The problem is that at the moment I see a lot of effort and little outcome at the end of it.
However, I totally encourage you, and anyone else who would like to, to participate in PugiXML development just to get an idea of its internals, and then, together or individually, decide whether it's possible, whether it's worth the effort, and what the best approach to the goal would be.
Think about XML::LibXML::Node->parentNode and XML::LibXML::Node->nextSibling. Or, if we talk about HTML::TreeBuilder, think about HTML::Element->parent and HTML::Element->content.
If you want to represent part of the DOM as an object of a different type, you have to populate the whole DOM.
And even if we managed to build something like that and retained 50% of the speed benefit the pugixml library provides, imagine the amount of work to be done and the support burden that immediately descends on the implementor's shoulders.
I don't think a speed benefit like 3x faster is worth such effort. And I think nobody would want such an unstable solution.
The main idea is to provide an interface and build a drop-in replacement that covers the basic filters of Web::Scraper, like:
process '//foo' => foo => 'TEXT';
process '//a' => url => '@href';
process '//span' => html => 'raw';
But when it comes to filters and callbacks, we would provide a PugiXML::Node instance instead of an HTML::Element/XML::LibXML::Node.
Yes, this is a significant limitation: existing filters/callbacks that expect an instance of HTML::Element would have to be modified in order to be used with Web::Scraper::PugiXML.
From my point of view it's worth it, because the main idea of PugiXML is to solve a CPU bottleneck, and putting effort into building an interface to a fast library only to slow things down by tying it to a slower object model runs contrary to that purpose.
yko
P.S.: At the moment of writing, the simplest way I see to provide a Whatever::Node is to stringify the node the user wants to operate on, parse that string with the appropriate parser, and return an instance of the desired class. I think that's how it would be done in Web::Scraper::PugiXML.
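The stringify-and-re-parse idea above can be sketched with an analogy. The sketch below is Python rather than Perl, and uses two standard-library parsers as stand-ins: ElementTree plays the role of the fast parser (PugiXML), minidom the role of the object model the caller's code expects (HTML::Element / XML::LibXML::Node). All names and the input document are illustrative, not from the proposed module.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

html = "<html><body><a href='http://example.com'>link</a></body></html>"

# 1. Parse and query with the fast parser.
fast_tree = ET.fromstring(html)
node = fast_tree.find(".//a")        # the node the user's callback wants

# 2. Stringify just that node...
fragment = ET.tostring(node, encoding="unicode")

# 3. ...and re-parse it with the parser whose object model the caller
#    expects, returning an instance of the desired class.
dom_node = minidom.parseString(fragment).documentElement

print(dom_node.getAttribute("href"))  # -> http://example.com
```

The point of the trick is that only the selected subtree pays the double-parse cost, not the whole document.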
I have recently started contributing to Web::Query, a jQuery-like module (as close as possible) for HTML scraping and manipulation.
Because it's based on HTML::TreeBuilder/HTML::Element, it's very slow, and some XPath queries are incredibly slow.
So I would definitely use PugiXML!
We need PugiXML, and an HTML::TreeBuilder::PugiXML as a drop-in replacement for HTML::TreeBuilder (like HTML::TreeBuilder::LibXML, which, if fully implemented, would also help a lot! :) ).
Thank you!!
- cafe