CSS selector goodness in Mojo::DOM

Now that we've seen how easy Mojo::DOM makes parsing HTML, let's take a closer look at the CSS selector goodness it provides.

Here's a fairly verbose HTML sample for us to work with:

First, we initialize and parse the file:
use File::Slurp 'slurp';
use Mojo::DOM;
my $dom = Mojo::DOM->new->parse(scalar slurp 'some.html');

Getting all the articles' contents, of course, is easy:
$dom->find('li a');
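To see what find actually hands back, here is a minimal, self-contained sketch. The markup is a hypothetical stand-in for the sample file (it assumes each article title is a link inside a list item), so adjust the selector to your real document:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup standing in for some.html
my $html = <<'HTML';
<ul>
  <li><a href="#lasers">Laser Beam Eyes</a></li>
  <li><a href="http://example.net/sharks">Shark Week</a></li>
</ul>
HTML

my $dom = Mojo::DOM->new($html);

# find() returns a Mojo::Collection; map('text') calls ->text on each
# element, and each() in list context returns the results as a list
my @titles = $dom->find('li a')->map('text')->each;
print "$_\n" for @titles;
```

The collection can also be walked directly with each and a callback if you need the elements themselves rather than their text.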

But we can do better than that. Let's say we want only the article titles that have page anchors:
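A minimal sketch of such a selector, assuming that "has a page anchor" means the title link's href starts with "#"; both that reading and the markup below are assumptions, not the post's sample file:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: one in-page link, one external link
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a href="#lasers">Laser Beam Eyes</a></li>
  <li><a href="http://external.url.com/">Shark Week</a></li>
</ul>
HTML

# [href^="#"] keeps only links whose href begins with "#"
my @anchored = $dom->find('li a[href^="#"]')->map('text')->each;
print "$_\n" for @anchored;
```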

Nah, let's get the article titles that link to external URLs:
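One way this might look, assuming an external link is simply one whose href starts with "http" (relative and protocol-relative links are not covered) and using hypothetical markup. The sketch also prints each title next to its URL via attr:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: one external link, one in-page link
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a href="http://external.url.com/">Laser Beam Eyes</a></li>
  <li><a href="#sharks">Shark Week</a></li>
</ul>
HTML

# [href^="http"] matches both http:// and https:// links;
# inside the map callback, $_ is the current element
my @lines = $dom->find('li a[href^="http"]')
  ->map(sub { $_->text . ': ' . $_->attr('href') })->each;
print "$_\n" for @lines;
```

The text method gives the link's title and attr('href') gives the URL itself, so each printed line has the form "title: url".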

How about only article titles that link to .net domains?
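A rough sketch, again on hypothetical markup. The substring selector [href*=".net"] is an approximation, since it would also match ".net" appearing in a path, so the example follows it with a stricter check on the host using Mojo::URL:

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::URL;

# Hypothetical markup: one .net link, one .com link
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a href="http://example.net/lasers">Laser Beam Eyes</a></li>
  <li><a href="http://example.com/sharks">Shark Week</a></li>
</ul>
HTML

# Coarse filter in the selector, then confirm the host really ends in .net
my @net = $dom->find('li a[href*=".net"]')
  ->grep(sub { (Mojo::URL->new($_->attr('href'))->host // '') =~ /\.net\z/ })
  ->map('text')->each;
print "$_\n" for @net;
```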

We can also get the page anchors themselves:
$dom->find('div.article a[name]');

It could be that some articles have no text content; let's single those out:
$dom->find('div.article p:empty');

Or, if we want only the articles with text content:
$dom->find('div.article p:not(:empty)');
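The two selectors above can be tried side by side. This is a sketch on hypothetical markup, assuming each article is a div.article holding a single paragraph:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: the first article has an empty paragraph
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><h2>Laser Beam Eyes</h2><p></p></div>
<div class="article"><h2>Shark Week</h2><p>Sharks are friends.</p></div>
HTML

# :empty matches elements with no child nodes at all
my $empty = $dom->find('div.article p:empty')->size;

# :not(:empty) inverts the match
my @filled = $dom->find('div.article p:not(:empty)')->map('text')->each;

print "empty paragraphs: $empty\n";
print "with content: $_\n" for @filled;
```

Keep in mind that a paragraph containing only whitespace has a text node and therefore does not count as :empty.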

Let's get the articles that are only snippets (class name ends with 'snippet'):
$dom->find('div.article p[class$=snippet]');
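A small sketch of how the ends-with match behaves, using hypothetical class names. The attribute selector compares against the whole class attribute string, not against individual class names:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup with three different class attributes
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><p class="long-snippet">Teaser one</p></div>
<div class="article"><p class="full">Whole story</p></div>
<div class="article"><p class="snippet featured">Teaser two</p></div>
HTML

# Single quotes keep Perl from interpolating $= inside the selector
my @snippets = $dom->find('div.article p[class$="snippet"]')->map('text')->each;
print "$_\n" for @snippets;
```

Only "Teaser one" is printed: "snippet featured" ends with "featured", so that paragraph is skipped even though it carries a snippet class.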

There's an advertisement in the markup; let's look at the article immediately following it:
$dom->find('a.advertisement + div.article');
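To illustrate the adjacent sibling combinator, here is a sketch on hypothetical markup (the id attributes exist only to make the result easy to read). It also shows the general sibling combinator ~ for comparison:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: an advertisement sits between the articles
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article" id="first"><p>One</p></div>
<a class="advertisement" href="http://example.com/ad">Buy now</a>
<div class="article" id="second"><p>Two</p></div>
<div class="article" id="third"><p>Three</p></div>
HTML

# "+" matches only the element directly after the advertisement
my @next = $dom->find('a.advertisement + div.article')->map(attr => 'id')->each;

# "~" matches every following sibling that fits the selector
my @later = $dom->find('a.advertisement ~ div.article')->map(attr => 'id')->each;

print "next: @next\n";
print "later: @later\n";
```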

If you're looking to be particularly awesome, with Mojolicious 1.1, you can use all of these selectors from the command line.

Mojo::DOM currently implements all the selectors from jQuery that make contextual sense; if you run into a use case for something that's not implemented, pop into #mojo or the mailing list and make your case. Join the revolution!

As always, it's one-step easy to install:
sudo -s 'curl -L cpanmin.us | perl - Mojolicious'

Mojo::DOM docs


Great article!! Thanks!

If I wanted to extract the url itself, how would I do it? For instance, I would like the output to be like below:

Laser Beam Eyes: http://external.url.com/

About tempire

I do not like the status quo. There is always a better way; the question is whether you care enough to find it.