Removing Locale::Country::SubCountry from CPAN
Hi Folks
Reluctantly, I've removed Locale::Country::SubCountry from CPAN, since I have no reasonable means of keeping the data up-to-date.
This data included subcountry names in the native scripts of the corresponding countries.
As a replacement, I spent a lot of time developing WWW::Scraper::Wikipedia::ISO3166. This (gently) scrapes country and subcountry names off Wikipedia, and stores them in an SQLite db.
The names all use (more-or-less) Latin letters (i.e. A-Z, a-z, with diacritics).
This is not my preference, but I've decided to adopt a policy of:
o Hoping ISO3166 is kept up-to-date (which is reasonable).
o Hoping someone, somewhere keeps Wikipedia up-to-date. I can't say whether or not this is reasonable, but I'm just going to assume it is.
I'll document various issues people might have with the Wikipedia/ISO3166 version of such data.
A typical instance is the name of Bolivia:
o Wikipedia/ISO3166 calls it 'Bolivia, Plurinational State of' whereas you probably (and reasonably) expect it to be called 'Bolivia'.
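For instance, something like this minimal DBI sketch would print exactly that long form (the db file, table and column names here are illustrative, not necessarily the shipped schema):

    #!/usr/bin/env perl
    # Read Bolivia's country name out of the SQLite db built by the scraper.
    # The db file name, table name and column names are illustrative only.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=iso3166.sqlite', '', '', {RaiseError => 1});

    my ($name) = $dbh->selectrow_array(
        'SELECT name FROM countries WHERE code2 = ?', undef, 'BO'
    );

    print "$name\n";    # Prints: Bolivia, Plurinational State of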
Of course, Kim Ryan's module Locale::SubCountry is always available, if mine provides data which fails to meet your expectations :-).
Cheers
Ron
Ubuntu (and Debian) includes an iso_3166_2.xml file as part of its iso-codes package. It appears to be manually maintained by tracking news on the ISO website, but it's probably a reasonably reliable source. What's more, the package includes lots of translations.
I'd be happy to take over Locale::Country::SubCountry and port it to run on top of Ubuntu/Debian's data.
If you aren't already aware of it http://geonames.org has an API that *may* provide the information you need, or (admittedly large) exports of all data which could be culled for this module.
1 of 2: Hi Tony
I run Debian, so I'll look into the iso-codes package you mention. Thanx for the offer to take it over. I've found /usr/share/xml/iso-codes/iso_3166_2.xml on my machine, so I'll examine that.

Pause... Nope - it does not use native (Chinese, Arabic, etc) scripts, and contains only slightly more info than I get from Wikipedia. Nice try, though.

My code will run on any OS/Perl installation with DBD::SQLite. I see no point in using XML for this, or in limiting myself to Debian. Of course, I /could/ make a case for limiting everybody to Debian, but I digress :-)).

Also, I will supply scripts to export the data as CSV and HTML. Hmmm - I might even ship those output files too.
2 of 2: Hi Mark
No, I have not heard of geonames.org. Sigh - the internet is damned huge, I can tell you that. Pause... Ack! Chrome says I bookmarked it some time in the distant past. $repeat_comment_on_size_of_internet; Yes, I remember it now. A little bit more complex than Wikipedia to scrape, but a very nice resource, admittedly. Still no native scripts.
3 of 2: $many x $thanx;
Or, as Adam Kennedy thinks I should write:
$thanx x $many;
Obviously, the poor guy has gotten too much Padre on the brain!
Cheers
Ron
The codes for Japan and also the native writings of the regions can be found in the CPAN module
https://metacpan.org/module/Geography::JapanesePrefectures
There is also a module based on the Japanese one for China:
https://metacpan.org/module/Geography::China::Provinces
But I don't know if it has the regional names and codes.
It's easy to get the native writings of country and region names from Wikipedia by simply examining the "interwiki links".
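As a rough sketch (assuming LWP::UserAgent, LWP::Protocol::https and JSON are installed; the article title is just an example), the MediaWiki API's langlinks property hands you those interwiki titles directly:

    #!/usr/bin/env perl
    # Fetch the interwiki (language) links for one English Wikipedia article.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use JSON;

    binmode STDOUT, ':encoding(utf-8)';

    my $ua  = LWP::UserAgent->new(agent => 'subcountry-example/0.01');
    my $url = 'https://en.wikipedia.org/w/api.php?action=query&format=json'
            . '&prop=langlinks&lllimit=500&titles=Bolivia';

    my $response = $ua->get($url);
    die $response->status_line unless $response->is_success;

    my $data = decode_json($response->decoded_content);

    for my $page (values %{ $data->{query}{pages} }) {
        for my $link (@{ $page->{langlinks} || [] }) {
            print "$link->{lang}: $link->{'*'}\n";    # e.g. the Arabic or Chinese title
        }
    }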
The Unicode CLDR contains all the information you need, even in the correct script. Go to http://unicode.org/Public/cldr/latest/ and download core.zip, look in e.g. common/main/ar.xml (for generic Arabic) and you'll find information about languages, scripts, territories, date formats and more.
I have, for a long time, been planning on writing a parser for this huge data source that is the CLDR, using XML::Rabbit, but I never seem to find the tuits. We even need parts of it for $dayjob. I got kinda stuck on the API design, as it contains such a big amount of information. If someone would like to pitch in, get in touch.
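A minimal sketch of pulling the localised territory names out of one of those files (assuming core.zip has been unpacked so that common/main/ar.xml sits on disk, and that XML::LibXML is installed):

    #!/usr/bin/env perl
    # Print every territory code and its Arabic name from the CLDR main data.
    use strict;
    use warnings;
    use XML::LibXML;

    binmode STDOUT, ':encoding(utf-8)';

    my $doc = XML::LibXML->load_xml(location => 'common/main/ar.xml');

    for my $node ($doc->findnodes('//localeDisplayNames/territories/territory')) {
        printf "%s => %s\n", $node->getAttribute('type'), $node->textContent;
    }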
I didn't mean to suggest that the users of Locale::Country::SubCountry need to have a copy of iso_3166_2.xml on their machine - but rather that the maintainer of the distribution should have a copy.
It would be converted to CSV and inlined into the modules as part of "make dist".
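Something like this could do that conversion step (the element and attribute names are from memory of the iso-codes XML format, so check them against the real file before relying on it):

    #!/usr/bin/env perl
    # Convert Debian's iso_3166_2.xml into a CSV file ready for inlining.
    use strict;
    use warnings;
    use XML::LibXML;
    use Text::CSV;

    my $doc = XML::LibXML->load_xml(location => '/usr/share/xml/iso-codes/iso_3166_2.xml');
    my $csv = Text::CSV->new({binary => 1, eol => "\n"});

    open my $fh, '>:encoding(utf-8)', 'subcountries.csv' or die $!;

    for my $entry ($doc->findnodes('//iso_3166_2_entry')) {
        $csv->print($fh, [$entry->getAttribute('code'), $entry->getAttribute('name')]);
    }

    close $fh;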
Hi Ben
But that doesn't give me the names in the native script /of the country I'm examining/, except in 1 case I found from the few I checked.
Pick Egypt, say:
http://en.wikipedia.org/wiki/ISO_3166-2:EG
That list of languages down the left doesn't include Egypt or Arabic, and that's what I was hoping to get.
Cheers
Ron
If you have the name of the subcountry in English in the ISO file, you can scrape the Wikipedia article with that name and then get the interwiki link from that.
This highlights a very sad fact about the state of such basic information: there is no single, complete, up-to-date, authoritative source for country names, divisions, timezones, currency, etc. in a relationally consistent format, translated, easily update-able, exportable, synchronize-able, partition-able, API-able...
Each of the sources we've identified in this thread (not to mention openstreetmap etc) has some kind of issue, but more importantly they all appear to be independent: presumably every project is either listening to the news-wire for changes, or waiting till someone else makes a change and then manually integrating.
It is almost enough to make me start a GitHub for geodata. Imagine a tool with the following usage:
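(The commands below are purely made up; nothing like this exists yet, and the names are only a guess at the sort of interface meant.)

    geodata clone http://example.org/world.geodb
    geodata pull                   # fetch upstream corrections
    geodata log ISO_3166-2:EG      # who changed Egypt's subdivisions, and when?
    geodata push                   # contribute a fix back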
Even better if the .geodb would be a SQLite file which can obviously be used with whatever programming language / interface you want to build on top of it.
Anyone have interest in (and access to funding for) such a thing?
1 of 4: Hi Toby
Good point. I could use or ship iso-codes/iso_3166_2.xml. Sorry for the misunderstanding.
2 of 4: Hi Ben
But for how many subcountry names is the native script version available? And how much work can be put into such a scheme, given the varying formats of the pages in question? The few I checked would be a PITA.
Luckily I have plenty of time available, but also a number of projects I'd like to work on...
3 of 4: Hi Mark
A geodb, eh? Hmmm. I'm heading in that direction too. My module stuffs the Wikipedia data into an SQLite db, and I have scripts which export the data as HTML and CSV.
One pain is that the SQLite web site and Oracle both ship an exe called sqlite3, and the two are incompatible, unless - I assume - the db was created with the matching tool. Perhaps there's a command-line switch which deals with this issue. I didn't check.
Here is my distro's scripts/ dir so far. All access methods are in Import/Export/Etc modules:
copy.config.pl
create.tables.pl
drop.tables.pl
export.as.csv.pl
export.as.html.pl
get.country.page.pl
get.subcountry.page.pl
get.subcountry.pages.pl
populate.countries.pl
populate.subcountries.pl
populate.subcountry.pl
report.statistics.pl
The current cost of the 3166 db from ISO is about 200 Swiss Francs = about 222 Australian dollars. I can afford it, but don't feel like paying for it. And updating is an issue too.
So, yes, I have an interest in it.
As for funding, I'm living off my savings, and will be for months, while I care for my mother (who has Alzheimer's) until I have to put her in a home, so in a sense extra funding is desired but not necessary.
But, as I said in a previous reply, various projects contend for my time. This is good, since the intellectual stimulation is important, but is also a type of complexity, and complexity is always a red flag for me.
4 of 4: Hi Robin
Thanx for the URL. I was not aware of that. Of course, this whole process is a big learning curve for me, but I do realise Unicode is not going away, so I'm absorbing it in stages.
I may well shift my data source over to that file.
As for the API, I think it'd better be the classic one-small-step-at-a-time API.
Ideas/etc very welcome. Perhaps also a more convenient discussion forum would be an idea.
Cheers
Ron