[FAILED] 10x faster than LibXML

Unfortunately, the idea contained a fatal flaw. See the following post for explanations.

Once upon a time I faced a huge pile of HTML files which I had to analyze. Say, there were about 1 000 000 of them. Say, 100 GB of data.
Most of you would say “It’s not that much!” And you are right. It’s not.

But then I decided to estimate the time required to process that pile of files. I quickly put together XPaths for what I needed and got a prototype in Web::Scraper. And here we go: ~0.94s per file, I/O overhead not included. That came to about 11 days on my laptop. Phew!

Well, thanks to CPAN and you guys, we have a drop-in replacement that switches to LibXML: Web::Scraper::LibXML. With it I got ~0.17s per file, which was only about two days. That was still too much for me, so I decided to try something more exotic.

I googled a list of available C/C++ HTML parsers, went through their benchmarks, documentation and requirements, and decided that I liked the pugixml library.

Yep, I decided to go with an XML parser that by itself can only parse well-formed XML, not HTML. To make that work, I agreed to run the HTML content through HTML Tidy first (there is HTML::Tidy, an interface to the htmltidy library).

So I wrote a simple XS wrapper around pugixml and passed the input through HTML::Tidy before parsing.

I got 0.03765s per file, roughly 11 hours in total. That was 4.5 times faster than what I had before, and 25 times faster than the initial HTML::TreeBuilder-based implementation.

And please take into account that I was too lazy to strip out some additional queries I had put into that XS code before benchmarking. Those things made parsing more complex (and time-consuming). Synthetic benchmarks show that the PugiXML+HTML::Tidy tandem is ten(!) times faster than LibXML.
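For anyone who wants to check the estimates, here is a small sketch of the arithmetic behind them. It assumes a round 1 000 000 files and a single process, and uses only the per-file timings quoted above; the function names are mine, chosen for illustration, and Python is used just because it is short. The results land close to the rounded figures in the text rather than matching them exactly.

```python
FILES = 1_000_000  # assumed round corpus size ("about 1 000 000" files)

# Per-file parse times in seconds as quoted in the post (I/O not included).
PER_FILE_SECONDS = {
    "Web::Scraper (HTML::TreeBuilder)": 0.94,
    "Web::Scraper::LibXML": 0.17,
    "pugixml + HTML::Tidy (XS prototype)": 0.03765,
}


def total_hours(seconds_per_file, files=FILES):
    """Hours needed to parse the whole corpus in one process."""
    return seconds_per_file * files / 3600


def speedup(slow_seconds, fast_seconds):
    """How many times faster the second timing is than the first."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    for name, seconds in PER_FILE_SECONDS.items():
        hours = total_hours(seconds)
        print(f"{name}: {hours:.1f} h ({hours / 24:.2f} days)")

    pugi = PER_FILE_SECONDS["pugixml + HTML::Tidy (XS prototype)"]
    for name, seconds in PER_FILE_SECONDS.items():
        print(f"{name}: {speedup(seconds, pugi):.1f}x the prototype's per-file time")
```

The same two functions can be reused to see what a different per-file timing would mean for the whole pile before committing to a full run.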

Two days or so later, I decided that, however simple it is to write an XS wrapper, I would like to have such a thing in Perl land. And, in the long run, to get a drop-in replacement for Web::Scraper that uses PugiXML.

Please don't get me wrong. I'm not trying to say that HTML::TreeBuilder or XML::LibXML are bad. They do their job and do it perfectly. But when somebody in Perl land needs something that is as fast as possible... I just want it to be there, on CPAN.

There are some cons alongside these pros of writing/using this thing. Here are some of them:

pugixml is not fully W3C conformant, and neither is its XPath 1.0 implementation.

And don’t be fooled by the synthetic benchmarks I made. The final implementation might not be that fast. Say, “only” 5-6 times faster than LibXML.

However, from my point of view, the implementation is worth the effort.
For example, if a Perlish PugiXML had been used for processing RSS feeds, I believe that TheOldReader would not have suffered that much from being hit by 100 000 new subscribers in a single day when Google announced the shutdown of Reader. And I think most of you can come up with a use case for such a speedy HTML/XML parser off the top of your head.

So I drafted a rough plan for implementing a Perl interface to pugixml and discussed it with the author of the original library, Arseny Kapoulkine. The feedback I received was encouraging.

And now...

I’m looking for your feedback. Would you use such a thing if it were on CPAN?

I’m looking for your advice. Did I miss something? Do you have any additional concerns?

Finally, I’m looking for participants. If you would like to join development, I’d be happy to talk to you. Mostly because I like to meet new people, partly because I’m lazy and would like to share the burden. And also, importantly, because I would like to have a co-maintainer (or several) to make sure that what we’ve done stays supported for a while.

So don’t hesitate to write a comment or drop an email to yko@cpan.org.

Thanks

5 Comments

Yes - please put this on CPAN

As long as the docs mention the limitations you have discussed here then it's a really worthwhile addition.

Thanks

Hi

Sure, put it on CPAN.

When you write the docs, please include a 'See Also' section, mentioning the other modules (and programs) you've looked at.

Be sure to mention Marpa::R2, which has its own HTML parser:

Marpa::R2

I wonder how many of these speed benefits could be retained if PugiXML were used as an alternative parser for XML::LibXML. That is, turn PugiXML parsing events into XML::LibXML::Node objects.

I have recently started contributing to Web::Query, a jQuery-like (as close as possible) library for HTML scraping and manipulation.

Because it's based on HTML::TreeBuilder/HTML::Element, it's very slow, and some XPath queries are incredibly slow.

So I would definitely use PugiXML!

We need PugiXML and HTML::TreeBuilder::PugiXML as a drop-in replacement for HTML::TreeBuilder (like HTML::TreeBuilder::LibXML which, if fully implemented, would also help a lot! :) ).

Thank you!!

- cafe


About yko

I blog about Perl.