Web::Scraper - Weekly Travelling in CPAN
Destination: Web::Scraper
Date of Latest Release: Oct 20, 2014
Distribution: Web-Scraper
Module version: 0.38
Main Contributors: Tatsuhiko Miyagawa (MIYAGAWA)
License: [perl_5]
The official documentation of Web::Scraper is quite clear. I borrowed its style and comments to write a script:
use v5.24.0;
use URI;
use Web::Scraper;

# First, create your scraper block
my $modules = scraper {
    # Parse all TDs inside 'table' with classname equal "name", store them into
    # an array 'modules'. We embed other scrapers for each TD.
    process 'table td[class="name"]', "modules[]" => scraper {
        # And, in each TD,
        # get the URI of "a" element
        process "a", uri => '@href';
        # get text inside "a" element
        process "a", name => 'TEXT';
    };
};

my $res = $modules->scrape( URI->new("https://metacpan.org/author/CYFUNG") );

# iterate the array 'modules'
for my $module (@{$res->{modules}}) {
    # output:
    # Map-Tube-Hongkong-0.03 https://metacpan.org/dist/Map-Tube-Hongkong
    # Math-Abacus-0.05 https://metacpan.org/dist/Math-Abacus
    # Math-Cryptarithm-v0.20.2 https://metacpan.org/dist/Math-Cryptarithm
    # Math-Permutation https://metacpan.org/dist/Math-Permutation
    say "$module->{name}\t$module->{uri}";
}
The structure of the scraped result may not be obvious to newcomers. In that case, a data-dumping module such as Data::Dumper, Data::Dump or Data::Printer will be your friend.
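As a minimal sketch of that advice, the script below dumps the result of a scraper with Data::Dumper. To avoid a network request it scrapes an inline HTML string instead of a URI; the HTML fragment and the module names in it are made up for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

# Hypothetical HTML fragment standing in for a fetched page
my $html = <<'HTML';
<table>
  <tr><td class="name"><a href="/dist/Foo-Bar">Foo-Bar-0.01</a></td></tr>
  <tr><td class="name"><a href="/dist/Baz">Baz-1.00</a></td></tr>
</table>
HTML

# Same shape as the scraper above, reduced to the link text
my $modules = scraper {
    process 'table td[class="name"]', "modules[]" => scraper {
        process "a", name => 'TEXT';
    };
};

# scrape() also accepts an HTML string, not only a URI object
my $res = $modules->scrape($html);

# Sorted keys and compact indentation make the dump easier to read
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;
print Dumper($res);
```

The dump shows a hash reference whose 'modules' key holds an array reference of hash references, one per matched TD, each with a 'name' key. That is why the loop in the first script dereferences $res->{modules} and then reads $module->{name}.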
# Perl script for scraping list of shopping centres
# in New Territories, Hong Kong from wikipedia

use URI;
use Web::Scraper;
use Encode;
use v5.24.0;
use warnings;

binmode STDOUT, ':utf8';

my $list_of_shopping_centres_in_NT = 'https://zh.m.wikipedia.org/wiki/新界商場列表';

# The web page is like:
# big region A (h3)
#   region A.1 (h4)
#     table of shopping centres in A.1
#   region A.2 (h4)
#     table of shopping centres in A.2
#   ...
# big region B (h3)
#   region B.1 (h4)
#     table of shopping centres in B.1
#   region B.2 (h4)
#     table of shopping centres in B.2
#   ...

my $wikipedia = scraper {
    process 'body', "body[]" => scraper {
        # every h4 heading is a region name
        process 'h4', "region[]" => 'TEXT';
        # every table that follows an h4 heading
        process '//h4/following::table', "tbl[]" => scraper {
            process 'tr', "tr[]" => scraper {
                process 'td', "content[]" => 'TEXT';
            };
        };
    };
};

my $res = $wikipedia->scrape( URI->new($list_of_shopping_centres_in_NT) );

my $counter = 0;
my @regions;

foreach my $v ($res->{'body'}->@*) {
    @regions = $v->{'region'}->@*;
    # remove the word "edit" in Chinese
    s/\x{007f16}\x{008f91}//u foreach @regions;
}

# The Tricky Part
use utf8;
use List::Util qw(first);

# '馬鞍山' is the region preceding h3 region '大埔', which has no h4 followers
my $idx_a = first { $regions[$_] eq '馬鞍山' } 0..$#regions;
splice @regions, $idx_a+1, 0, '大埔';

# '米埔' is the region preceding h3 region '屯門', which has no h4 followers
my $idx_b = first { $regions[$_] eq '米埔' } 0..$#regions;
splice @regions, $idx_b+1, 0, '屯門';

# the n-th table belongs to the n-th region
foreach my $v ($res->{'body'}->@*) {
    foreach my $u ($v->{'tbl'}->@*) {
        foreach my $w ($u->{'tr'}->@*) {
            my $ti = $w->{'content'}->[0];
            my $region = $regions[$counter];
            # output is like:
            # 康盛花園商場 將軍澳
            # 翠林新城 將軍澳
            # 慧安商場 將軍澳
            # 慧星匯 將軍澳
            # ...
            say $ti, "\t", $region if defined($ti) && defined($region);
        }
        $counter++;
    }
}
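The "tricky part" of the shopping-centre script patches the list of headings so that it lines up one-to-one with the list of tables. Here is the same idea in isolation, as a small sketch with made-up ASCII names ('Alpha', 'Beta', 'Gamma' and 'Delta' are placeholders, not names from the page):

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical heading list: 'Gamma' has a table of its own but no h4
# heading, so it is missing from the scraped list.
my @regions = ('Alpha', 'Beta', 'Delta');

# Find the index of the entry that precedes the missing one ...
my $idx = first { $regions[$_] eq 'Beta' } 0 .. $#regions;

# ... and insert the missing name right after it.
# The defined() guard skips the insert if the anchor entry is not found.
splice @regions, $idx + 1, 0, 'Gamma' if defined $idx;

print join(', ', @regions), "\n";   # prints: Alpha, Beta, Gamma, Delta
```

The defined() guard is an addition of this sketch. Without it, a missing anchor would make first return undef, and the new name would be inserted at index 1 with only a warning.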
Since the shopping-centre example is not friendly to readers who do not read Chinese, I made up another example that scrapes a similarly structured Wikipedia page, the list of schools in the Perth metropolitan area, Western Australia:
# Perl script for scraping list of schools in Perth, Australia

use URI;
use Web::Scraper;
use Encode;
use v5.24.0;
use warnings;

binmode STDOUT, ':utf8';

my $list_of_schools_in_Perth = 'https://en.wikipedia.org/wiki/List_of_schools_in_the_Perth_metropolitan_area';

my $wikipedia = scraper {
    process 'body', "body[]" => scraper {
        # every h3 heading is a school type
        process 'h3', "type[]" => 'TEXT';
        # every table that follows an h3 heading
        process '//h3/following::table', "tbl[]" => scraper {
            process 'tr', "tr[]" => scraper {
                process 'td', "content[]" => 'TEXT';
            };
        };
    };
};

my $res = $wikipedia->scrape( URI->new($list_of_schools_in_Perth) );

my $counter = 0;
my @types;

foreach my $v ($res->{'body'}->@*) {
    @types = $v->{'type'}->@*;
    # remove the "[edit]" link text from the headings
    s/\[edit\]//u foreach @types;
}

# the n-th table belongs to the n-th heading
foreach my $v ($res->{'body'}->@*) {
    foreach my $u ($v->{'tbl'}->@*) {
        foreach my $w ($u->{'tr'}->@*) {
            my $ti = $w->{'content'}->[0];
            my $type = $types[$counter];
            say $ti, "\t", $type if defined($ti) && defined($type);
        }
        $counter++;
    }
}
It is worth noting that more demanding or dynamic web scraping tasks can be handled by the well-known Selenium suite, which has a Perl interface, WWW::Selenium.
THE HIGHLIGHTED PERL MODULES OF WEEK 14 OF 2023:
Web::Scraper