CSS selector goodness in Mojo::DOM

By tempire on February 17, 2011 3:29 AM

Now that we've seen how easy Mojo::DOM makes parsing HTML, let's take a closer look at the CSS selector goodness it provides.

Here's a fairly verbose HTML sample for us to work with:

First, we initialize and parse the file:
use File::Slurp 'slurp';
use Mojo::DOM;
my $dom = Mojo::DOM->new->parse(scalar slurp 'some.html');

Getting all the articles' contents, of course, is easy:
$dom->find('li a');
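If you don't have some.html handy, here is a minimal sketch that parses an inline string instead and prints the text of each match. The markup is a hypothetical stand-in for the sample, and it assumes a recent Mojolicious release where find returns a Mojo::Collection.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup standing in for some.html (not the original sample).
my $html = <<'HTML';
<ul>
  <li><a href="#lasers">Laser Beam Eyes</a></li>
  <li><a href="http://external.example.net/robots">Robot Uprising</a></li>
</ul>
<p><a href="/about">About</a></p>
HTML

my $dom = Mojo::DOM->new($html);

# find returns a collection; map('text') calls ->text on every matched element.
my @titles = $dom->find('li a')->map('text')->each;

print "$_\n" for @titles;
```

Only the links nested inside an li are picked up; the "About" link sits in a p and is skipped.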

But we can do better than that. Let's say we want only the article titles that have page anchors:
$dom->find('a[href^=#]');

Nah, let's get the article titles that link to external URLs:
$dom->find('a[href^=http]');

How about only article titles that link to .net domains?
$dom->find('a[href*=.net/]');
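To see the attribute operators side by side, here's a small sketch against made-up links (none of these URLs come from the sample). The values are quoted here, which is the safer spelling if the same selectors are ever reused in a browser.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical links, invented for illustration only.
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a href="#lasers">Laser Beam Eyes</a></li>
  <li><a href="http://external.example.net/robots">Robot Uprising</a></li>
  <li><a href="http://blog.example.com/ninjas">Ninja Basics</a></li>
</ul>
HTML

# ^= matches a prefix of the attribute value, *= matches a substring anywhere.
my @anchored = $dom->find('a[href^="#"]')->map('text')->each;
my @external = $dom->find('a[href^="http"]')->map('text')->each;
my @dot_net  = $dom->find('a[href*=".net/"]')->map('text')->each;

print "anchored: @anchored\n";
print "external: @external\n";
print "dot net:  @dot_net\n";
```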

We can also get the page anchors themselves:
$dom->find('div.article a[name]');

It could be that some articles have no text content; let's single those out:
$dom->find('div.article p:empty');

Or, if we want only the articles with text content:
$dom->find('div.article p:not(:empty)');

Let's get the articles that are only snippets (class name ends with 'snippet'):
$dom->find('div.article p[class$=snippet]');
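The last three selectors are easy to try on a tiny document. This is a sketch with hypothetical article markup; note that a paragraph containing only whitespace still has a text node, so it would not count as :empty.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical articles: one with text, one empty, one snippet.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><p class="full">Lasers are great.</p></div>
<div class="article"><p class="full"></p></div>
<div class="article"><p class="news-snippet">Robots, briefly.</p></div>
HTML

# :empty matches elements without child nodes; :not(...) inverts the match.
my $empty     = $dom->find('div.article p:empty')->size;
my $non_empty = $dom->find('div.article p:not(:empty)')->size;

# $= matches a suffix of the attribute value.
my @snippets = $dom->find('div.article p[class$="snippet"]')->map('text')->each;

print "empty: $empty, non-empty: $non_empty\n";
print "snippet: $_\n" for @snippets;
```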

There's an advertisement in the markup; let's look at the article immediately following it:
$dom->find('a.advertisement + div.article');
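The + combinator only matches the very next sibling element, which a short sketch makes visible. The ids and the ad link below are hypothetical, added just so the result can be identified.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: an ad wedged between articles.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article" id="first"><p>One</p></div>
<a class="advertisement" href="http://ads.example.com/">Buy stuff</a>
<div class="article" id="second"><p>Two</p></div>
<div class="article" id="third"><p>Three</p></div>
HTML

# Only the article directly after the advertisement matches,
# not the one before it and not later siblings.
my @ids = $dom->find('a.advertisement + div.article')->map(attr => 'id')->each;

print "following the ad: @ids\n";
```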

If you're looking to be particularly awesome, with Mojolicious 1.1, you can use all of these selectors from the command line.

Mojo::DOM currently implements all the selectors from jQuery that make contextual sense; if you run into a use case for something that's not implemented, pop into #mojo or the mailing list and make your case. Join the revolution!

As always, it's one-step easy to install:
sudo -s 'curl -L cpanmin.us | perl - Mojolicious'

Mojo::DOM docs

Tagged as:

css parse dom

2 Comments

cyan | January 17, 2013 3:44 PM

Great article!! Thanks!

If I wanted to extract the url itself, how would I do it? For instance, I would like the output to be like below:

Laser Beam Eyes: http://external.url.com/

tempire replied to comment from cyan | January 17, 2013 6:48 PM

You can access the attribute via ->attrs:
http://mojolicio.us/perldoc/Mojo/DOM#attrs
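For the output asked about above, a minimal sketch might look like this. It assumes a recent Mojolicious release, where the single-attribute accessor is spelled attr (the reply calls it attrs), and the markup is a hypothetical stand-in built around the title and URL from the question.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup using the title and URL from the question.
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a href="#sharks">Sharks</a></li>
  <li><a href="http://external.url.com/">Laser Beam Eyes</a></li>
</ul>
HTML

# For each external link, join the link text and its href attribute.
my @lines = $dom->find('a[href^="http"]')
  ->map(sub { $_->text . ': ' . $_->attr('href') })
  ->each;

print "$_\n" for @lines;
```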

About tempire

I do not like the status quo. There is always a better way; the question is whether you care enough to find it.
