An example using Mojo::DOM for rewriting HTML

Recently on Stack Overflow, I answered a question that I thought worthy of a highlight here on the blog. In this forum we all know that one should never parse HTML with a regex, but even if we agree on that, there are still many options available afterward. The question as posed was: given some HTML, remove all <style> tags and their contents. It was later amended to say that the asker also needed to remove <style> tags with attributes (the nail in the regex coffin) and <link> tags pointing to stylesheets.

While you could use an XML parser or an HTML tokenizer, personally I like using the Mojo::DOM parser. This is a Document-Object Model interface to your HTML and it supports CSS3 selectors, making it really flexible when you need it. The original problem is solved as easily as:

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

my $content = <<'END';
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END

my $dom = Mojo::DOM->new( $content );
$dom->find('style')->pluck('remove');

print $dom;

The pluck method is a little confusing, but it's really just shorthand for calling a method on each resulting object. The analogous line could be

$dom->find('style')->each(sub{ $_->remove });

which is a little more understandable but less cute. To really understand what is going on, note that the call to find returns an instance of Mojo::Collection, a container object that has the array-filtering methods you would expect, but whose elements still refer back to the original DOM object. Thus when we remove the resulting tags via the collection, they are gone from the DOM object!
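To make that concrete, here is a minimal sketch on a tiny made-up document (the HTML string is my own, not from the original question). It uses each rather than pluck, on the assumption that each is available across Mojolicious versions; as far as I know, newer releases dropped pluck in favor of map('remove'), so check the Mojo::Collection docs for the version you have installed.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

# A small, made-up document with one <style> block.
my $dom = Mojo::DOM->new('<div><style>p { color: red }</style><p>hi</p></div>');

# find returns a collection of element objects, not plain strings.
my $styles = $dom->find('style');
print ref($styles), "\n";               # class of the returned container
print "matches: ", $styles->size, "\n"; # how many <style> elements were found

# Each element in the collection is a handle into the same tree as $dom,
# so removing it here changes what $dom renders.
$styles->each(sub { $_->remove });

print "$dom\n";
```

Printing $dom at the end should show the document without its <style> block, even though remove was only ever called on members of the collection.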

Now let's say that the $content variable also contained these lines

<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
<link rel="icon" href="somefile.jpg">

where you want to remove the first one, and not the second. You can do this in one of two ways.

$dom->find('link')->each( sub{ $_->remove if $_->{rel} eq 'stylesheet' } );

This mechanism uses the object methods (Mojo::DOM exposes attributes as hash keys) to remove only the link tags which have rel=stylesheet. However, since Mojo::DOM has full CSS3 selector support, you can instead use a selector to find only those elements:

$dom->find('link[rel=stylesheet]')->pluck('remove');
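One caveat about the attribute check shown earlier: a <link> tag with no rel attribute at all gives undef for $_->{rel}, and comparing that with eq triggers an "uninitialized value" warning under use warnings. A minimal sketch of a guarded version follows; the third <link> tag and its href are made up for illustration, and the // operator assumes Perl 5.10 or newer.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

# The third <link> is hypothetical: it has no rel attribute at all.
my $html = <<'END';
<head>
<link rel="stylesheet" type="text/css" href="gridsorting.css">
<link rel="icon" href="somefile.jpg">
<link href="no-rel-attribute.css">
</head>
END

my $dom = Mojo::DOM->new($html);

$dom->find('link')->each(sub {
  # Fall back to an empty string so the comparison never sees undef.
  my $rel = $_->{rel} // '';
  $_->remove if $rel eq 'stylesheet';
});

# Only the stylesheet link should be gone; the other two remain.
print "links left: ", $dom->find('link')->size, "\n";
print "$dom\n";
```

The selector form link[rel=stylesheet] sidesteps the issue entirely, since elements without a rel attribute simply never match.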

CSS3 selector statements can be joined with a comma to find all tags matching either selector, so we can simply include the line

$dom->find('style, link[rel=stylesheet]')->pluck('remove');

and get rid of all your offensive stylesheets in one fell swoop!
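Putting it all together, the whole job might look like the sketch below. It reuses the sample HTML from above with the two <link> lines moved into the <head>, and again uses each instead of pluck on the assumption that it behaves the same across Mojolicious versions.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

# Single-quoted heredoc, so $url_path is left as literal text.
my $content = <<'END';
<html>
<head> <title> Example </title>
<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
<link rel="icon" href="somefile.jpg">
</head>
<style>
p { color: red; }
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END

my $dom = Mojo::DOM->new($content);

# One comma-joined selector catches both <style> blocks and stylesheet links.
$dom->find('style, link[rel=stylesheet]')->each(sub { $_->remove });

print $dom;
```

The icon link is untouched because its rel value does not match the selector.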

9 Comments

I’m continually surprised and delighted that the best parts of Mojolicious aren’t the web app stuff. Mojo::DOM has made HTML and XML munging much easier, and it does it without a lot of external dependencies.

Mojo::DOM is a very useful module to have around, particularly with the CSS3 support as you mention. Perhaps only fair to mention that Perl has had excellent support for HTML mangling for some time now - using TreeBuilder, for example:

 my $dom = HTML::TreeBuilder->new_from_content( $content );
 $_->delete for $dom->look_down(_tag => 'style');
 print $dom->as_HTML;

or the second example:

 $_->delete for $dom->look_down(_tag => 'style'), 
                $dom->look_down(_tag => 'link', rel => 'stylesheet');

Not quite as compact as the Mojolicious approach - as has been mentioned before, sometimes reinventing the wheel provides a smoother ride. Seeing HTML parsing regex code in the wild and suggesting Mojo::DOM / TreeBuilder equivalents seems like a worthwhile effort to me, whichever module you choose, so keep up those stackoverflow replies.

Mojo::DOM performance also seems pretty good - even with some filler content in the page it seems to be consistently about 25-30% faster than the HTML::TreeBuilder equivalent, not bad considering it’s up against XS for at least part of the HTML::Tree stack:

https://gist.github.com/3837205

One slight downside to the regex reimplementation in Mojo::DOM::HTML is that it doesn’t seem to cover all the edge cases that HTML::TreeBuilder can handle. See https://gist.github.com/3837258 for example, which I believe is a valid HTML4 comment, although it seems this comment syntax has been disallowed in HTML5, which presumably is what Mojo::DOM is aimed at.

FYI, HTML::HTML5::Parser handles that weird comment with ease. :-)

heh, thanks Joel - that was a quick fix indeed!

Toby - thanks for the link, I’d not tried HTML::HTML5::Parser before. Always useful to have an alternative parser that can handle “real-world” data.

Hi, I’m curious about Mojo::DOM. Let’s say you want a generic method of traversing a page and returning a hash of headings with their related subheadings and text, using Mojo::DOM. The sub text may not necessarily be a descendant of the heading, and from what I can tell, Mojo::DOM can only descend, not ascend. How does one construct this? I’ve read the docs and tried a number of things but I’m not getting far :(

About Joel Berger

As I delve into the deeper Perl magic I like to share what I can.