An example using Mojo::DOM for rewriting HTML

By Joel Berger on October 4, 2012 3:25 PM under Mojolicious

Recently on Stack Overflow I answered a question that I thought was worth highlighting here on the blog. In this forum we all know that one should never parse HTML with a regex, but even if we agree on that, there are still many options left to choose from. The question as posed was: given some HTML, remove all <style> tags and their contents. It was later amended to say that the asker also needed to remove <style> tags with attributes (the nail in the regex coffin) and <link> tags pointing to stylesheets.

While you could use an XML parser or an HTML tokenizer, personally I like using the Mojo::DOM parser. This is a Document-Object Model interface to your HTML and it supports CSS3 selectors, making it really flexible when you need it. The original problem is solved as easily as:

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

my $content = <<'END';
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END

my $dom = Mojo::DOM->new( $content );
$dom->find('style')->pluck('remove');

print $dom;

The pluck method is a little confusing, but it's really just shorthand for calling a method on each object in the result. The analogous line could be

$dom->find('style')->each(sub{ $_->remove });

which is a little more understandable but less cute. Further, it helps to understand that the call to find returns an instance of Mojo::Collection, a container object that has the array-filtering methods you would expect, but whose elements still point back into the original DOM object. Thus when we remove the resulting tags through the collection, they are gone from the DOM object!
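
To make that connection concrete, here is a minimal sketch. The HTML snippet and variable names are made up for illustration. It counts the matches in the collection, removes them, and then asks the original DOM object again:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

# Hypothetical input, just for this illustration
my $dom = Mojo::DOM->new(
  '<div><style>p { color: red }</style><p>hi</p><style>a { }</style></div>'
);

# find returns a Mojo::Collection of Mojo::DOM objects
my $styles = $dom->find('style');
my $before = $styles->size;

# Removing each element through the collection edits the original tree
$styles->each(sub { $_->remove });

# Ask the original object again; the style tags are no longer found
my $after = $dom->find('style')->size;

print "style tags before: $before, after: $after\n";
print "$dom\n";
```

The collection itself is just a list of handles; it is the elements inside it that know where they live in the tree.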

Now let's say that the $content variable also contained these lines

<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
<link rel="icon" href="somefile.jpg">

where you want to remove the first one, and not the second. You can do this in one of two ways.

$dom->find('link')->each( sub{ $_->remove if ($_->{rel} // '') eq 'stylesheet' } );

This mechanism uses the object methods (Mojo::DOM exposes attributes as hash keys) to remove only the link tags which have rel=stylesheet. However, you can also use CSS3 selectors to find only those elements in the first place, and since Mojo::DOM has full CSS3 selector support you can do

$dom->find('link[rel=stylesheet]')->pluck('remove');
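
As a quick check that the selector really is that precise, here is a small sketch with made-up link tags. It also tries the ~= form of the attribute selector, which matches one word in a space-separated value. I am assuming you might meet something like rel="alternate stylesheet" in the wild, which the plain = form would skip.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

# Hypothetical link tags, for illustration only
my $dom = Mojo::DOM->new(<<'END');
<link rel="stylesheet" href="main.css">
<link rel="alternate stylesheet" href="dark.css">
<link rel="icon" href="somefile.jpg">
END

# [rel=stylesheet] matches only when the whole attribute value is "stylesheet"
my $exact = $dom->find('link[rel=stylesheet]')->size;

# [rel~="stylesheet"] matches "stylesheet" as one word among several
my $word = $dom->find('link[rel~="stylesheet"]')->size;

print "exact: $exact, word: $word\n";
```

Which form you want depends on your input; for the markup in the original question the plain = form is enough.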

CSS3 selector statements can be joined with a comma to find all tags matching either selector, so we can simply include the line

$dom->find('style, link[rel=stylesheet]')->pluck('remove');

and get rid of all your offending stylesheets in one fell swoop!
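
One caveat for anyone reading this later: as far as I know, newer Mojolicious releases dropped pluck from Mojo::Collection, with map taking over the same "call this method on every element" role. The sketch below sticks to each, which should behave the same either way, and leaves the map spelling in a comment as an assumption to check against the documentation for your installed version. The page content is a trimmed-down, made-up variant of the example above.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;

# Made-up page for illustration
my $content = <<'END';
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="gridsorting.css">
<link rel="icon" href="somefile.jpg">
<style type="text/css">p { color: red }</style>
</head>
<body><p>hi I'm a paragraph.</p></body>
</html>
END

my $dom = Mojo::DOM->new($content);

# One selector list covers both kinds of stylesheet
$dom->find('style, link[rel=stylesheet]')->each(sub { $_->remove });

# Assumed equivalent on newer releases (not verified here):
#   $dom->find('style, link[rel=stylesheet]')->map('remove');

# Only the icon link should be left among style/link tags
my $left = $dom->find('style, link')->size;

print "remaining style/link tags: $left\n";
print $dom;
```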

9 Comments

brian d foy | October 4, 2012 2:23 PM | Reply

I’m continually surprised and delighted that the best parts of Mojolicious aren’t the web app stuff. Mojo::DOM has made HTML and XML munging much easier, and it does it without a lot of external dependencies.

Joel Berger | October 4, 2012 3:36 PM | Reply

@brian, I have learned a lot about the web app side of Mojolicious and IMO it's as good for that task as Mojo::DOM is for HTML munging!

tommolesworth.myopenid.com | October 4, 2012 7:12 PM | Reply

Mojo::DOM is a very useful module to have around, particularly with the CSS3 support as you mention. Perhaps only fair to mention that Perl has had excellent support for HTML mangling for some time now - using TreeBuilder, for example:

 my $dom = HTML::TreeBuilder->new_from_content( $content );
 $_->delete for $dom->look_down(_tag => 'style');
 print $dom->as_HTML;

or the second example:

 $_->delete for $dom->look_down(_tag => 'style'), 
                $dom->look_down(_tag => 'link', rel => 'stylesheet');

Not quite as compact as the Mojolicious approach - as has been mentioned before, sometimes reinventing the wheel provides a smoother ride. Seeing HTML parsing regex code in the wild and suggesting Mojo::DOM / TreeBuilder equivalents seems like a worthwhile effort to me, whichever module you choose, so keep up those stackoverflow replies.

Mojo::DOM performance also seems pretty good - even with some filler content in the page it seems to be consistently about 25-30% faster than the HTML::TreeBuilder equivalent, not bad considering it’s up against XS for at least part of the HTML::Tree stack:

https://gist.github.com/3837205

One slight downside to the regex reimplementation in Mojo::DOM::HTML is that it doesn’t seem to cover all the edge cases that HTML::TreeBuilder can handle - https://gist.github.com/3837258 for example, which I believe is a valid HTML4 comment although it seems this comment syntax has been disallowed in HTML5, which presumably is what Mojo::DOM is aimed at.

Joel Berger | October 4, 2012 9:44 PM | Reply

@Tom, thanks for the well reasoned comments. Of course Perl has many solutions to this problem (I do allude to that fact towards the beginning of the post). My point wasn’t to ignore this, but to highlight the new Perl module on the block (Mojo::DOM) as well as the new mechanism/standard (CSS3 selectors) and how they can make your life easier.

The old kids are still there because they are tried-and-true and that is never to be mocked, but then again, most people (at least on this board) already know about them.

It is really interesting to see that Mojo::DOM stacks up against the XS based modules. You are right about the target of HTML5, but I wonder if we should file a bug on Mojolicious; it seems like the old allowed comment syntax could be added without too much difficulty.

Edit:

Actually it seems that HTML::TreeBuilder has a problem with that example too; notice that the </p> tag is missing! Also, I have filed a bug on the HTML4 style comments and submitted a pull request with tests.

Toby Inkster replied to comment from tommolesworth.myopenid.com | October 5, 2012 4:55 AM | Reply

FYI, HTML::HTML5::Parser handles that weird comment with ease. :-)

tommolesworth.myopenid.com | October 5, 2012 6:58 AM | Reply

heh, thanks Joel - that was a quick fix indeed!

Toby - thanks for the link, I’d not tried HTML::HTML5::Parser before. Always useful to have an alternative parser that can handle “real-world” data.

Joel Berger | October 5, 2012 5:21 PM | Reply

P.S. My patch has been merged into Mojo::DOM; it's scheduled to be included in the 3.45 release.

Cheers!

https://www.google.com/accounts/o8/id?id=AItOawlsH6xT98cO8R6i25T3JKnyf3im9QgVvvA | May 6, 2015 4:17 AM | Reply

Hi, I’m curious about Mojo::DOM. Let’s say you want a generic method of traversing a page, and returning a hash of headers and related sub headings and text from a page using Mojo::DOM. The sub text may not necessarily be a descendant of the heading and from what I can tell, Mojo::DOM can only descend not ascend. How does one construct this? I’ve read the docs and tried a number of things but I’m not getting far :(

Joel Berger replied to comment from https://www.google.com/accounts/o8/id?id=AItOawlsH6xT98cO8R6i25T3JKnyf3im9QgVvvA | May 13, 2015 8:31 AM | Reply

Hi, I’m not at all sure what you are attempting to do and the comment system of this site does not make it very conducive to technical discussion. I would recommend asking a question on StackOverflow and include the Perl and Mojolicious tags (at least).

