Using Mojo::DOM

Mojolicious is already well known as a web framework, but I am finding more and more (after being told so by our own brian d foy) that its DOM parser, Mojo::DOM, is worth the price of admission on its own. Today I was poking around StackOverflow and ended up answering a question using nothing more than a few well-crafted DOM calls. Here is my (slightly reworded) response. It makes for a nice example of using simple CSS3 selectors to simplify HTML parsing.

The question goes something like this: Let's say we have some HTML that contains the times a shop is open. How can we get this information in an HTML5/CSS3 (i.e. modern) way? Mojo::DOM.

#!/usr/bin/env perl

use strict;
use warnings;

use 5.10.0;
use Mojo::DOM;

my $dom = Mojo::DOM->new(<<'HTML');
<div class="box notranslate" id="venueHours">
<h5 class="translate">Hours</h5>
<div class="status closed">Currently closed</div>
<div class="hours">
  <div class="timespan">
    <div class="openTime">
      <div class="days">Mon,Tue,Wed,Thu,Sat</div>
      <span class="hours"> 10:00 AM–6:00 PM</span>
    </div>
  </div>
  <div class="timespan">
    <div class="openTime">
      <div class="days">Fri</div>
      <span class="hours"> 10:00 AM–9:00 PM</span>
    </div>
  </div>
  <div class="timespan">
    <div class="openTime">
      <div class="days">Sun</div>
      <span class="hours"> 10:00 AM–5:00 PM</span>
    </div>
  </div>
</div>
</div>
HTML

We can use find to get a collection of results, call each to turn that collection into a list, and then apply the text method to every element by hand.

say "div days:";
say $_->text for $dom->find('div.days')->each;

say "\nspan hours:";
say $_->text for $dom->find('span.hours')->each;

Or, equivalently, we can let Mojo do the map for us!

say "div days:";
say for $dom->find('div.days')->map(sub{$_->text})->each;

say "\nspan hours:";
say for $dom->find('span.hours')->map(sub{$_->text})->each;

Both forms give the output:

div days:
Mon,Tue,Wed,Thu,Sat
Fri
Sun

span hours:
 10:00 AM–6:00 PM
 10:00 AM–9:00 PM
 10:00 AM–5:00 PM
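
As a further sketch, assuming a reasonably recent Mojolicious: Mojo::Collection's map also accepts a method name in place of a callback, and join turns the result into a single string. The two-entry HTML below is a shortened stand-in for the document above, not the original page.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use 5.10.0;
use Mojo::DOM;

# Shortened stand-in for the venue HTML used in the post
my $dom = Mojo::DOM->new(<<'HTML');
<div class="openTime">
  <div class="days">Fri</div>
</div>
<div class="openTime">
  <div class="days">Sun</div>
</div>
HTML

# map('text') calls the text method on every element of the collection
my $days = $dom->find('div.days')->map('text');

# join stringifies the collection using the given separator
say $days->join("\n");
```

Whether this reads better than the explicit sub is a matter of taste; it does the same work as the map(sub{$_->text}) form.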

But say we want the times that correspond to each set of days? We can use the children of each openTime div:

say "Open Times:";
say for $dom->find('div.openTime')
            ->map(sub{$_->children->each})
            ->map(sub{$_->text})
            ->each;

Output:

Open Times:
Mon,Tue,Wed,Thu,Sat
 10:00 AM–6:00 PM
Fri
 10:00 AM–9:00 PM
Sun
 10:00 AM–5:00 PM
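
If the goal is a lookup from each day to its hours rather than a flat list, one possible sketch is to walk the openTime divs and use at to pull out the two children by selector. The hash name %hours_for, the shortened HTML, and the plain hyphens in the times are my own stand-ins for illustration; the whitespace trimming is done by hand because I am not assuming text does it.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use 5.10.0;
use Mojo::DOM;

# Shortened stand-in for the venue HTML (plain hyphens instead of en dashes)
my $dom = Mojo::DOM->new(<<'HTML');
<div class="hours">
  <div class="openTime">
    <div class="days">Mon,Tue</div>
    <span class="hours"> 10:00 AM-6:00 PM</span>
  </div>
  <div class="openTime">
    <div class="days">Fri</div>
    <span class="hours"> 10:00 AM-9:00 PM</span>
  </div>
</div>
HTML

my %hours_for;    # hypothetical name: day => opening hours

for my $open ($dom->find('div.openTime')->each) {
  # at returns the first match below this element, or undef if there is none
  my $days  = $open->at('div.days')   or next;
  my $hours = $open->at('span.hours') or next;

  # Strip surrounding whitespace ourselves rather than rely on text doing it
  (my $span = $hours->text) =~ s/^\s+|\s+$//g;

  # One hash entry per day in the comma-separated list
  $hours_for{$_} = $span for split /\s*,\s*/, $days->text;
}

say "$_: $hours_for{$_}" for sort keys %hours_for;
```

Using span.hours rather than a bare .hours matters here, since the outer wrapper div also carries the hours class.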

I know it may not seem very impressive to people who do this all the time, but to a relative web outsider, the intuitiveness of this interface makes parsing out the HTML a breeze!

4 Comments

I can appreciate brevity, but is this really necessary?

say for $dom->find('div.openTime')
  ->map(sub{$_->children->map(sub{$_->text})})->each;

It looks to me like a weird kind of golfing, and I’ll be thankful not to see things like that in my production code.

That’s not unique to Mojo::DOM. Similar things are possible with most DOM implementations. For example, using XML::LibXML with XML::LibXML::QuerySelector…

say for $dom
  -> querySelectorAll('.openTime > *')
  -> map(sub { $_->textContent });

Even without XML::LibXML::QuerySelector, it’s not too difficult with pure XML::LibXML…

say for $dom
  -> findnodes('//*[@class="openTime"]/*[@class]')
  -> map(sub { $_->textContent });


About Joel Berger

As I delve into the deeper Perl magic I like to share what I can.