Web::Scraper - Weekly Travelling in CPAN | Moments on Perl or other Programming Issues [blogs.perl.org]

Web::Scraper - Weekly Travelling in CPAN

By C.-Y. Fung on April 7, 2023 10:07 PM

Date of Latest Release: Oct 20, 2014
Distribution: Web-Scraper
Module version: 0.38
Main Contributors: Tatsuhiko Miyagawa (MIYAGAWA)
License: [perl_5]

The official documentation of Web::Scraper is quite clear. I borrowed its style and comments to write the following script:

use v5.24.0;
use URI;
use Web::Scraper;
 
# First, create your scraper block
my $modules = scraper {
    # Parse all TDs with class "name" inside 'table', store them into
    # an array 'modules'.  We embed another scraper for each TD.
    process 'table td[class="name"]', "modules[]" => scraper {
      # And, in each TD,
      # get the URI of "a" element
      process "a", uri => '@href';
      # get text inside "a" element
      process "a", name => 'TEXT';
    };
};

my $res = $modules->scrape( URI->new("https://metacpan.org/author/CYFUNG") );

# iterate the array 'modules'
for my $module (@{$res->{modules}}) {
    # output:
    # Map-Tube-Hongkong-0.03	https://metacpan.org/dist/Map-Tube-Hongkong
    # Math-Abacus-0.05	https://metacpan.org/dist/Math-Abacus
    # Math-Cryptarithm-v0.20.2	https://metacpan.org/dist/Math-Cryptarithm
    # Math-Permutation	https://metacpan.org/dist/Math-Permutation
    say "$module->{name}\t$module->{uri}";
}

The structure of the returned data may not be obvious to newcomers. In that case, a data-dumping module such as Data::Dumper, Data::Dump or Data::Printer will be your friend.
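As a minimal sketch of that advice, the snippet below dumps a structure shaped like the result of the first script. The `$res` here is a hand-written stand-in (an assumption for illustration, built from two of the output lines shown above), so the snippet runs without network access; in the real script you would pass the value returned by `$modules->scrape(...)` instead, and the `uri` values may then be URI objects rather than plain strings.

```perl
use v5.24;
use warnings;
use Data::Dumper;

# Hand-written stand-in for the $res returned by $modules->scrape(...);
# the two entries are illustrative, not live data.
my $res = {
    modules => [
        { name => 'Math-Abacus-0.05', uri => 'https://metacpan.org/dist/Math-Abacus' },
        { name => 'Math-Permutation', uri => 'https://metacpan.org/dist/Math-Permutation' },
    ],
};

local $Data::Dumper::Sortkeys = 1;    # print hash keys in a stable order
local $Data::Dumper::Indent   = 1;    # use compact, fixed-width indentation

# Dumper returns the structure as a string of Perl code
my $dump = Dumper($res);
print $dump;
```

The dump makes it clear that `modules` is an array reference of hash references, which is why the loop in the script dereferences `$res->{modules}` and then reads `$module->{name}` and `$module->{uri}`.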

Web::Scraper handles scraping tasks gracefully. However, web pages are often not organized in the structure we want. Here is a script I used at work.

# Perl script for scraping list of shopping centres
#      in New Territories, Hong Kong from wikipedia

use URI;
use Web::Scraper;
use Encode;
use v5.24.0;
use warnings;
binmode STDOUT, ':utf8';

my $list_of_shopping_centres_in_NT
 = 'https://zh.m.wikipedia.org/wiki/新界商場列表';

# The web page is like:
# big region A (h3)
# region A.1 (h4)
# table of shopping centres in A.1
# region A.2 (h4)
# table of shopping centres in A.2
# ...
# big region B (h3)
# region B.1 (h4)
# table of shopping centres in B.1
# region B.2 (h4)
# table of shopping centres in B.2
# ...

my $wikipedia = scraper {
    process 'body', "body[]" => scraper {
        # collect the text of every h4 heading (the regions)
        process 'h4', "region[]" => 'TEXT';
        # collect every table that follows an h4 heading, row by row
        process '//h4/following::table', "tbl[]" => scraper {
            process 'tr', "tr[]" => scraper {
                process 'td', "content[]" => 'TEXT';
            };
        };
    };
};

my $res = $wikipedia->scrape( URI->new($list_of_shopping_centres_in_NT) );
my $counter = 0; 

my @regions;
foreach my $v ($res->{'body'}->@*) {
    @regions = $v->{'region'}->@*;
    # remove the word "edit" in Chinese
    s/\x{007f16}\x{008f91}//u foreach @regions;
}

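# The tricky part: tables are matched to regions by position, so every
# table needs an entry in @regions. Two h3 regions have a table but no
# h4 sub-region of their own, so we insert them by hand.
use utf8;
use List::Util qw(first);
# '馬鞍山' (Ma On Shan) is the h4 region just before the h3 region
# '大埔' (Tai Po), which has no h4 sub-regions
my $idx_a = first { $regions[$_] eq '馬鞍山' } 0..$#regions;
splice @regions, $idx_a+1, 0, '大埔';
# '米埔' (Mai Po) is the h4 region just before the h3 region
# '屯門' (Tuen Mun), which has no h4 sub-regions
my $idx_b = first { $regions[$_] eq '米埔' } 0..$#regions;
splice @regions, $idx_b+1, 0, '屯門';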

foreach my $v ($res->{'body'}->@*) {
    foreach my $u ($v->{'tbl'}->@*) {
        foreach my $w ($u->{'tr'}->@*) {
            my $ti = $w->{'content'}->[0];
            my $region = $regions[$counter];
            # output is like:
            # 康盛花園商場	    將軍澳
            # 翠林新城    將軍澳
            # 慧安商場    將軍澳
            # 慧星匯    將軍澳
            # ...
            say $ti, "\t", $region if defined($ti) && defined($region);
        }
        $counter++;
    }
}

Since the above example is not friendly to readers who do not read Chinese, I wrote a similar example that scrapes an English Wikipedia page, the list of schools in the Perth metropolitan area, Western Australia:

# Perl script for scraping list of schools in Perth, Australia

use URI;
use Web::Scraper;
use Encode;
use v5.24.0;
use warnings;
binmode STDOUT, ':utf8';

my $list_of_schools_in_Perth
 = 'https://en.wikipedia.org/wiki/List_of_schools_in_the_Perth_metropolitan_area';

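my $wikipedia = scraper {
    process 'body', "body[]" => scraper {
        # collect the text of every h3 heading (the school types)
        process 'h3', "type[]" => 'TEXT';
        # collect every table that follows an h3 heading, row by row
        process '//h3/following::table', "tbl[]" => scraper {
            process 'tr', "tr[]" => scraper {
                process 'td', "content[]" => 'TEXT';
            };
        };
    };
};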

my $res = $wikipedia->scrape( URI->new($list_of_schools_in_Perth) );
my $counter = 0; 

my @types;
foreach my $v ($res->{'body'}->@*) {
    @types = $v->{'type'}->@*;
    s/\[edit\]//u foreach @types;
}

foreach my $v ($res->{'body'}->@*) {
    foreach my $u ($v->{'tbl'}->@*) {
        foreach my $w ($u->{'tr'}->@*) {
            my $ti = $w->{'content'}->[0];
            my $type = $types[$counter];
            say $ti, "\t", $type if defined($ti) && defined($type);
        }
        $counter++;
    }
}
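
Both scripts above pair the n-th heading with the n-th table purely by position, which is why the Hong Kong script has to patch its heading list when a section has a table but no sub-heading. The following is a minimal sketch of that alignment step on its own, assuming made-up data: the names `North-1`, `South`, `Mall A` and so on are hypothetical and only stand in for the scraped headings and table cells.

```perl
use v5.24;
use warnings;
use List::Util qw(first);

# Hypothetical scraped data, in document order. The third table belongs to
# a section 'South' that has no sub-heading, so @headings is one entry short.
my @headings = ('North-1', 'North-2', 'East-1');
my @tables   = (['Mall A'], ['Mall B', 'Mall C'], ['Mall D'], ['Mall E']);

# Find the heading just before the gap and insert the missing one after it.
my $idx = first { $headings[$_] eq 'North-2' } 0 .. $#headings;
die "anchor heading not found\n" unless defined $idx;
splice @headings, $idx + 1, 0, 'South';

# Now heading $i describes table $i.
my @rows;
for my $i (0 .. $#tables) {
    next unless defined $headings[$i];    # skip tables without a heading
    push @rows, map { "$_\t$headings[$i]" } $tables[$i]->@*;
}
say for @rows;
```

Checking `defined $idx` matters because `first` returns `undef` when the anchor heading is missing, for example after the page has been edited, and a plain truth test would also reject a valid index of 0.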

It is worth noting that more sophisticated or dynamic web scraping tasks can be handled with the well-known Selenium, which has a Perl client, WWW::Selenium.

THE HIGHLIGHTED PERL MODULES OF WEEK 14 OF 2023:
Web::Scraper

Tagged as:

cpan

About C.-Y. Fung

This blog is inactive and replaced by https://e7-87-83.github.io/coding/blog.html ; but I post highly Perl-related posts here.