An example using Mojo::DOM for rewriting HTML
Recently on Stack Overflow, I answered a question that I thought worthy of a highlight here on the blog. In this forum we all know that one should never parse HTML with a regex, but even if we agree on that, there are still many options available afterward. The question as posed was: given some HTML, remove all <style> tags and their contents. It was later amended to say that the asker also needed to remove <style> tags with attributes (the nail in the regex coffin) and <link> tags pointing to stylesheets.
While you could use an XML parser or an HTML tokenizer, personally I like using the Mojo::DOM parser. It is a Document Object Model interface to your HTML and it supports CSS3 selectors, making it really flexible when you need it. The original problem is solved as easily as:
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;
my $content = <<'END';
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END
my $dom = Mojo::DOM->new( $content );
$dom->find('style')->pluck('remove');
print $dom;
The pluck method is a little confusing, but it's really just shorthand for calling a method on each object in the result. The analogous line could be
$dom->find('style')->each(sub{ $_->remove });
which is a little more understandable but less cute. To really understand it, note that the call to find returns an instance of Mojo::Collection, a container object that has the array-filtering methods you would expect, but whose elements are still tied to the original DOM object. Thus when we remove the resultant tags in the collection, they are gone from the DOM object!
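To make the collection behaviour concrete, here is a minimal sketch (the tiny HTML string and the variable name $styles are mine, purely for illustration). It uses each rather than pluck because, as far as I know, pluck was later deprecated in Mojolicious in favor of map('remove'); check the Mojo::Collection documentation for your installed version before relying on either.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Illustrative input only, not the HTML from the original question
my $dom = Mojo::DOM->new(
  '<html><head><style>p { color: red }</style></head>'
  . '<body><p>hi</p></body></html>'
);

# find returns a collection object, not a plain Perl list
my $styles = $dom->find('style');
print 'class of result: ', ref($styles), "\n";
print 'matches before removal: ', $styles->size, "\n";

# Each element of the collection is a Mojo::DOM node that still
# belongs to the tree held by $dom, so removing it edits $dom itself
$styles->each(sub { $_->remove });

print 'matches after removal: ', $dom->find('style')->size, "\n";
print "$dom\n";
```

Running find again on $dom after the removal is the simplest way I know to confirm that the change landed in the original document and not in a copy.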
Now let's say that the $content variable also contained these lines
<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
<link rel="icon" href="somefile.jpg">
where you want to remove the first one, and not the second. You can do this in one of two ways.
$dom->find('link')->each( sub{ $_->remove if ($_->{rel} // '') eq 'stylesheet' } ); # // '' avoids an "uninitialized" warning for link tags without rel
This mechanism uses the object methods (and Mojo::DOM exposes attributes as hash keys) to remove only the link tags which have rel=stylesheet. However, since Mojo::DOM has full CSS3 selector support, you can instead use a selector to find only those elements:
$dom->find('link[rel=stylesheet]')->pluck('remove');
CSS3 selectors can be joined with a comma to find all tags matching either selector, so we can simply use the line
$dom->find('style, link[rel=stylesheet]')->pluck('remove');
and get rid of all your offensive stylesheets in one fell swoop!
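Putting the pieces together, a complete script might look like the sketch below. The heredoc is a trimmed variant of the example above with the two link tags added (I dropped the $url_path prefix from the href, and gave the style tag an attribute to match the amended question); each is used in place of pluck so the script does not depend on which Mojolicious release is installed, which is my choice rather than part of the original answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Illustrative document: one stylesheet link, one icon link, one style block
my $content = <<'END';
<html>
<head>
<title> Example </title>
<link rel="stylesheet" type="text/css" href="gridsorting.css">
<link rel="icon" href="somefile.jpg">
<style type="text/css">
p { color: red; }
</style>
</head>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END

my $dom = Mojo::DOM->new($content);

# The comma joins two selectors: every <style> element, plus only
# those <link> elements whose rel attribute is exactly "stylesheet"
$dom->find('style, link[rel=stylesheet]')->each(sub { $_->remove });

print $dom;
```

The icon link should survive because its rel value does not match the attribute selector, while the style tag is matched by tag name regardless of its attributes.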
I’m continually surprised and delighted that the best parts of Mojolicious aren’t the web app stuff. Mojo::DOM has made HTML and XML munging much easier, and it does it without a lot of external dependencies.
@brian, I have learned a lot about the web app side of Mojolicious and IMO it's as good for that task as Mojo::DOM is for HTML munging!
Mojo::DOM is a very useful module to have around, particularly with the CSS3 support as you mention. Perhaps only fair to mention that Perl has had excellent support for HTML mangling for some time now - using TreeBuilder, for example:
or the second example:
Not quite as compact as the Mojolicious approach - as has been mentioned before, sometimes reinventing the wheel provides a smoother ride. Seeing HTML parsing regex code in the wild and suggesting Mojo::DOM / TreeBuilder equivalents seems like a worthwhile effort to me, whichever module you choose, so keep up those stackoverflow replies.
Mojo::DOM performance also seems pretty good - even with some filler content in the page it seems to be consistently about 25-30% faster than the HTML::TreeBuilder equivalent, not bad considering it’s up against XS for at least part of the HTML::Tree stack:
https://gist.github.com/3837205
One slight downside to the regex reimplementation in Mojo::DOM::HTML is that it doesn’t seem to cover all the edge cases that HTML::TreeBuilder can handle. See https://gist.github.com/3837258 for example, which I believe is a valid HTML4 comment, although it seems this comment syntax has been disallowed in HTML5, which presumably is what Mojo::DOM is aimed at.
@Tom, thanks for the well reasoned comments. Of course Perl has many solutions to this problem (I do allude to that fact towards the beginning of the post). My point wasn’t to ignore this, but to highlight the new Perl module on the block (Mojo::DOM) as well as the new mechanism/standard (CSS3 selectors) and how they can make your life easier.
The old kids are still there because they are tried-and-true and that is never to be mocked, but then again, most people (at least on this board) already know about them.
It is really interesting to see that Mojo::DOM stacks up against the XS-based modules. You are right about the target of HTML5, but I wonder if we should file a bug on Mojolicious; it seems like the old allowed comment syntax could be added without too much difficulty.
Edit:
Actually it seems that HTML::TreeBuilder has a problem with that example too; notice that the </p> tag is missing! Also, I have filed a bug on the HTML4 style comments and submitted a pull request with tests.
FYI, HTML::HTML5::Parser handles that weird comment with ease. :-)
heh, thanks Joel - that was a quick fix indeed!
Toby - thanks for the link, I’d not tried HTML::HTML5::Parser before. Always useful to have an alternative parser that can handle “real-world” data.
P.S. my patch has been merged into Mojo::DOM; it's scheduled to be included in the 3.45 release.
Cheers!
Hi, I’m curious about Mojo::DOM. Let’s say you want a generic method of traversing a page and returning a hash of headings with their related subheadings and text, using Mojo::DOM. The sub text may not necessarily be a descendant of the heading, and from what I can tell, Mojo::DOM can only descend, not ascend. How does one construct this? I’ve read the docs and tried a number of things but I’m not getting far :(
Hi, I’m not at all sure what you are attempting to do and the comment system of this site does not make it very conducive to technical discussion. I would recommend asking a question on StackOverflow and include the Perl and Mojolicious tags (at least).