yko

It's time to admit I've failed

2013-05-03T18:36:09Z

Two days ago I was so excited! I had an idea for how to make the Perl world a bit better, faster and simpler. Of course, I didn’t spread such exciting news until I had checked, double-checked and benchmarked, until I was absolutely sure I’d found The Holy Grail.

Well, see the title. It hurts. All my benchmarks contained a terrible mistake. And the +20%, or maybe even +100%, speed boost that a PugiXML interface could provide isn’t worth all the buzz I created.

I apologize.

Yet the Perl interface to PugiXML that I described in my previous post could be (optimistically) twice as fast as LibXML. In some cases. But I’m so disappointed by my failure that I just don’t think it’s worth it.

Another lesson learned.

When HTML parsing feels too slow, please use something LibXML-based, like HTML::TreeBuilder::LibXML or just XML::LibXML. Just make sure you are using the load_html() family instead of load_xml(), and enable recover() mode as is done in HTML::TreeBuilder::LibXML.
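To make that advice concrete, here is a minimal sketch of parsing sloppy HTML with XML::LibXML. The sample markup and the variable names are made up for illustration. As far as I understand the XML::LibXML docs, recover => 2 asks the parser to recover from errors without reporting them.

```perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately sloppy markup: no <html>/<body>, unclosed <p> and <b>.
my $html = '<title>Demo</title><p>First paragraph<p>Second <b>paragraph';

# load_html() uses libxml2's HTML parser rather than the strict XML one.
# recover => 2 means "try to recover, and do so silently".
my $doc = XML::LibXML->load_html(
    string  => $html,
    recover => 2,
);

# Regular XPath queries work on the recovered tree.
my $title      = $doc->findvalue('//title');
my @paragraphs = $doc->findnodes('//p');

print "title: $title\n";
print "paragraphs found: ", scalar(@paragraphs), "\n";
```

Feeding the same string to load_xml() should fail, because the input is not well-formed XML.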

For those who are still interested, the code of the prototype is published at https://github.com/yko/pugixml-perl
At some point I may decide to continue development. Unfortunately, it won’t be as lightning fast as initially expected.

[FAILED] 10x faster than LibXML

2013-05-01T15:45:03Z

Unfortunately, the idea contained a fatal flaw. See the following post for explanations.

Once upon a time I faced a huge pile of HTML files which I had to analyze. Say, there were about 1 000 000 of them. Say, 100 GB of data.
Most of you would say “It’s not that much!”. And you are right. It’s not.

But then I decided to estimate the time required to process that pile of files. I quickly put together the XPaths for what I needed and got a prototype in Web::Scraper. And here we go: ~0.94s per file, I/O overhead not included. That added up to about 11 days on my laptop. Phew!

Well, thanks to CPAN and you guys, we have a drop-in replacement that switches to LibXML: Web::Scraper::LibXML. With it I got ~0.17s per file, which was only about two days. That was still too much for me, and I decided to try something more exotic.

I googled a list of available C/C++ HTML parsers, went through their benchmarks, documentation and requirements, and decided that I liked the pugixml library.

Yep, I decided to go with an XML parser which by itself can only parse valid XML, not HTML. To accomplish my goals I agreed to process the HTML content with HTML Tidy first (there is HTML::Tidy, an interface to the htmltidy library).

So I wrote a simple XS wrapper around pugixml and passed the input through HTML::Tidy before parsing.

I got 0.03765s per file, about 10.5 hours in total. That was 4.5 times faster than what I had before, and 25 times faster than the initial HTML::TreeBuilder-based implementation.
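For anyone who wants to redo the arithmetic, here is a small sketch. It only multiplies the per-file timings quoted above by the rough file count. The names @timings and total_hours are invented for this example, and none of it is part of the prototype.

```perl
use strict;
use warnings;

# Rough number of files in the pile.
my $files = 1_000_000;

# Per-file timings quoted in this post, in seconds.
my @timings = (
    [ 'Web::Scraper (HTML::TreeBuilder)' => 0.94 ],
    [ 'Web::Scraper::LibXML'             => 0.17 ],
    [ 'pugixml + HTML::Tidy prototype'   => 0.03765 ],
);

# Total wall-clock hours for the whole pile at a given per-file cost.
sub total_hours {
    my ($seconds_per_file) = @_;
    return $seconds_per_file * $files / 3600;
}

for my $t (@timings) {
    my ($name, $seconds) = @$t;
    my $hours = total_hours($seconds);
    printf "%-34s %7.1f h (%4.1f days)\n", $name, $hours, $hours / 24;
}

# Relative speedups of the prototype.
printf "vs LibXML:      %.1fx\n", 0.17 / 0.03765;
printf "vs TreeBuilder: %.1fx\n", 0.94 / 0.03765;
```

The totals should land close to the figures given in the text. Keep in mind they are only as good as the timings that go in.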

And please take into account that I was too lazy to strip out some additional queries I had put into that XS before benchmarking. Those things made parsing more complex (and time-consuming). Synthetic benchmarks show that the PugiXML + HTML::Tidy tandem is ten(!) times faster than LibXML.

Later, in two days or so, I decided that, even disregarding the simplicity of writing an XS wrapper, I would like to have such a thing in Perl land. And, in the long run, to get a drop-in replacement for Web::Scraper that uses PugiXML.

Please don't get me wrong. I'm not trying to say that HTML::TreeBuilder or XML::LibXML are bad. They do their job and do it perfectly. But at the moment when somebody in Perl land needs something that is as fast as possible... I just want it to be there, on CPAN.

And there are some cons behind these pros for writing/using this thing. Here are some of them:

pugixml is not fully W3C conformant, and neither is its XPath 1.0 implementation.

And don’t be fooled by the synthetic benchmarks I made. The final implementation might not be that fast. Say, “only” 5-6 times faster than LibXML.

However, from my point of view, the implementation is worth the effort.
For example, if a perlish PugiXML had been used for processing RSS feeds, I believe that TheOldReader would not have suffered that much from being hit by 100 000 new subscribers in a single day when Google announced the shutdown of Reader. And I think most of you can name a use case for such a speedy HTML/XML parser off the top of your head.

So I drafted a rough plan for implementing a Perl interface to pugixml and discussed it with the author of the original library, Arseny Kapoulkine. The feedback I received was encouraging.

And now...

I’m looking for your feedback. Would you use such a thing if it were on CPAN?

I’m looking for your advice. Did I miss something? Do you have any additional concerns?

Finally, I’m looking for participants. If you would like to join development, I’d be happy to talk to you. Mostly because I like to meet new people, and partly because I’m lazy and would like to share that burden. And also, and this is important, because I would like to have a co-maintainer or two in order to make sure what we’ve done stays supported for a while.

So don’t hesitate to write a comment or drop an email to yko@cpan.org.

Thanks

Kiev.pm organizational changes

2011-08-15T15:14:06Z

Dear Perl community!

I'm proud to say that today Sergey Gulko officially announced me as the leader of the Kiev.pm group. Thank you, Sergey, and my appreciation to all members of Kiev.pm for approving me for this assignment and for the great support during the debates and the planning stage. Guys, you are great! Sergey, I wish you the very best of luck in all spheres of your personal and professional life.

The Kiev.pm community has existed since June 1, 2007 and until today was led by Sergey Gulko. Our community has about 200 registered members. During the past few years we organized two Perl workshops called ‘Perl Mova’ together with Moscow.pm. This gave us an opportunity to meet Jonathan Worthington, Andrew Shitov, Alex Kapranoff and many other great people, to collaborate, and to get to know each other better inside the community.

I am happy that being a Perl monger brought me to know such seasoned professionals as Oleg Alistratov, Denis Zhdanov and Viacheslav Tykhanovskyi. I am learning a lot from them and I hope to learn more in the future. I am absolutely sure that Kiev.pm can be proud of its young and talented members, such as Sergey Zasenko and Maxim Vuets. Their energy, enthusiasm and expectations are the best engine for our community.
And I’m sure this is only the beginning of Kiev.pm growth.

I want to help Perl people collaborate, meet each other and spread their knowledge within the local and worldwide Perl community, as well as to keep Kiev.pm healthy and prosperous. I consider this to be my main task as pm group leader. I believe that the combination of my communication and organizational skills and, of course, the support of the Kiev Perl Mongers community will help us reach all the goals we have set.

If you want to get in touch with a Kiev.pm representative about local events, meetings or any other issues, feel free to contact me via email at yko@cpan.org.

Hello World in Plack 2: InteractiveDebugger

2011-07-15T18:21:30Z

Another tool for debugging your Plack application is InteractiveDebugger.

You can dive into your code, explore your stack and see all variables in each frame.
And execute arbitrary code at some level!

Just try it:
plackup idebug-demo.psgi
firefox http://127.0.0.1:5000/
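If you don't have a demo file at hand, something along these lines should do. This is only a guess at what a minimal idebug-demo.psgi could look like, not necessarily the exact file I used. The app and its variable names are made up. The middleware is Plack::Middleware::InteractiveDebugger from CPAN.

```perl
# idebug-demo.psgi (hypothetical contents)
use strict;
use warnings;
use Plack::Builder;

# A tiny app that always dies, so there is something to debug.
my $app = sub {
    my $env  = shift;
    my $name = 'World';    # a lexical to look at in the debugger
    die "Hello $name, something broke\n";
};

builder {
    # Catches the exception and serves the interactive debugger page.
    enable 'InteractiveDebugger';
    $app;
};
```

Since the debugger can run arbitrary code inside your process, keep it to development machines.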

The D command shows all variables in the frame, and you can dump any variable using Data::Dumper:
perl> Data::Dumper::Dumper $self;
It would be great to have a Perl debugger-like alias 'x' for this command. Maybe it will appear in the future.

Hello World in Plack

2011-07-12T08:59:03Z

I've been playing with the PSGI spec and Plack for the last few days and found the Debug middleware just amazing:

You should just try it:

plackup debug-demo.psgi
firefox http://127.0.0.1:5000
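In case you need a file to start from, here is a sketch of what a minimal debug-demo.psgi might look like. It is an assumption for illustration, not necessarily the original demo. It uses Plack::Middleware::Debug with a handful of its standard panels.

```perl
# debug-demo.psgi (hypothetical contents)
use strict;
use warnings;
use Plack::Builder;

# The Debug middleware injects its toolbar into text/html responses,
# so the app returns a complete HTML page with a closing </body> tag.
my $app = sub {
    my $env = shift;
    return [
        200,
        [ 'Content-Type' => 'text/html' ],
        [ '<html><body><h1>Hello World</h1></body></html>' ],
    ];
};

builder {
    # Pick the panels to show in the toolbar.
    enable 'Debug', panels => [qw(Environment Response Timer Memory)];
    $app;
};
```

Dropping the panels argument should fall back to the middleware's default set.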

You can see all environment variables, generation time and memory usage before and after the response. Find your memory leaks before they find you.