Removing Locale::Country::SubCountry from CPAN

Hi Folks

Reluctantly, I've removed Locale::Country::SubCountry from CPAN, since I have no reasonable means of keeping the data up-to-date.

This data included subcountry names in the native scripts of the corresponding countries.

As a replacement, I spent a lot of time developing WWW::Scraper::Wikipedia::ISO3166. This (gently) scrapes country and subcountry names off Wikipedia, and stores them in an SQLite db.
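To give a feel for the storage side, here is a minimal sketch of keeping country and subcountry names in SQLite. The table and column names are made up for illustration and are not the schema the module actually uses; the scraping step is left out entirely.

```python
import sqlite3


def create_schema(conn):
    """Create two hypothetical tables: countries and their subcountries."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS countries (
            id    INTEGER PRIMARY KEY,
            code2 TEXT NOT NULL UNIQUE,   -- ISO 3166-1 alpha-2 code
            name  TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS subcountries (
            id         INTEGER PRIMARY KEY,
            country_id INTEGER NOT NULL REFERENCES countries(id),
            code       TEXT NOT NULL UNIQUE,  -- ISO 3166-2 code
            name       TEXT NOT NULL
        );
        """
    )


def add_country(conn, code2, name, subcountries):
    """Insert one country plus a list of (code, name) subcountry pairs."""
    cur = conn.execute(
        "INSERT INTO countries (code2, name) VALUES (?, ?)", (code2, name)
    )
    country_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO subcountries (country_id, code, name) VALUES (?, ?, ?)",
        [(country_id, code, sub_name) for code, sub_name in subcountries],
    )
    conn.commit()
    return country_id


def subcountry_names(conn, code2):
    """Return (code, name) pairs for one country, ordered by code."""
    rows = conn.execute(
        """
        SELECT s.code, s.name
        FROM subcountries s
        JOIN countries c ON c.id = s.country_id
        WHERE c.code2 = ?
        ORDER BY s.code
        """,
        (code2,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    add_country(conn, "AU", "Australia",
                [("AU-NSW", "New South Wales"), ("AU-VIC", "Victoria")])
    for code, name in subcountry_names(conn, "AU"):
        print(code, name)
```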

The names all use (more-or-less) Latin letters (i.e. A-Z, a-z, with diacritics).

This is not my preference, but I've decided to adopt a policy of:

o Hoping ISO3166 is kept up-to-date (which is reasonable).

o Hoping someone, somewhere keeps Wikipedia up-to-date. I can't say whether or not this is reasonable, but I'm just going to assume it is.

I'll document various issues people might have with the Wikipedia/ISO3166 version of such data.

A typical instance is the name of Bolivia:

o Wikipedia/ISO3166 calls it 'Bolivia, Plurinational State of' whereas you probably (and reasonably) expect it to be called 'Bolivia'.

Of course, Kim Ryan's module Locale::SubCountry is always available, if mine provides data which fails to meet your expectations :-).

Cheers
Ron

10 Comments

Ubuntu (and Debian) includes an iso_3166_2.xml file as part of its iso-codes package. It appears to be manually maintained by tracking news on the ISO website, but it's probably a reasonably reliable source. What's more, the package includes lots of translations.

I'd be happy to take over Locale::Country::SubCountry and port it to run on top of Ubuntu/Debian's data.

If you aren't already aware of it, http://geonames.org has an API that *may* provide the information you need, or (admittedly large) exports of all data which could be culled for this module.

The codes for Japan and also the native writings of the regions can be found in the CPAN module

https://metacpan.org/module/Geography::JapanesePrefectures

There is also a model based on the Japanese one for China:

https://metacpan.org/module/Geography::China::Provinces

But I don't know if it has the regional names and codes.

It's easy to get the native writings of country and region names from Wikipedia by simply examining the "interwiki links".

The Unicode CLDR contains all the information you need, even in the correct script. Go to http://unicode.org/Public/cldr/latest/ and download core.zip, look in e.g. common/main/ar.xml (for generic Arabic) and you'll find information about languages, scripts, territories, date formats and more.

I have, for a long time, been planning on writing a parser for this huge data source that is the CLDR, using XML::Rabbit, but I never seem to find the tuits. We even need parts of it for $dayjob. I got kinda stuck on the API design, as it contains such a big amount of information. If someone would like to pitch in, get in touch.
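As a rough idea of how small the territory-name part of such a parser could be, here is a sketch that pulls the names out of a locale file. The sample below is a tiny hand-written excerpt in the shape of common/main/ar.xml (localeDisplayNames/territories/territory), not the real file, and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Hand-written excerpt shaped like a CLDR locale file; the real ar.xml is far larger.
SAMPLE_LDML = """
<ldml>
  <localeDisplayNames>
    <territories>
      <territory type="AU">أستراليا</territory>
      <territory type="EG">مصر</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def territory_names(ldml_text):
    """Map territory code -> display name for one locale document.

    Entries carrying an 'alt' attribute (short or variant forms) are
    skipped so that each code keeps its default name.
    """
    root = ET.fromstring(ldml_text.strip())
    names = {}
    for node in root.findall("./localeDisplayNames/territories/territory"):
        if node.get("alt") is None:
            names[node.get("type")] = node.text
    return names


if __name__ == "__main__":
    for code, name in sorted(territory_names(SAMPLE_LDML).items()):
        print(code, name)
```

To run it against the real data, read the file's text from the unpacked core.zip and pass it in instead of SAMPLE_LDML.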

I didn't mean to suggest that the users of Locale::Country::SubCountry need to have a copy of iso_3166_2.xml on their machine - but rather that the maintainer of the distribution should have a copy.

It would be converted to CSV and inlined into the modules as part of "make dist".
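A sketch of what that conversion step might look like follows. The element and attribute names in the sample (iso_3166_country, iso_3166_subset, iso_3166_2_entry, code, name, type) are my assumption about the layout of iso_3166_2.xml, so check them against the actual file before relying on this.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hand-written sample; the element names are an assumed layout, not verified.
SAMPLE_XML = """
<iso_3166_2_entries>
  <iso_3166_country code="AD">
    <iso_3166_subset type="Parish">
      <iso_3166_2_entry code="AD-07" name="Andorra la Vella"/>
      <iso_3166_2_entry code="AD-02" name="Canillo"/>
    </iso_3166_subset>
  </iso_3166_country>
</iso_3166_2_entries>
"""


def subdivisions_to_csv(xml_text):
    """Flatten country/subset/entry nesting into CSV rows, one per entry."""
    root = ET.fromstring(xml_text.strip())
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["country", "type", "code", "name"])
    for country in root.iter("iso_3166_country"):
        for subset in country.iter("iso_3166_subset"):
            for entry in subset.iter("iso_3166_2_entry"):
                writer.writerow([
                    country.get("code"),
                    subset.get("type"),
                    entry.get("code"),
                    entry.get("name"),
                ])
    return out.getvalue()


if __name__ == "__main__":
    print(subdivisions_to_csv(SAMPLE_XML), end="")
```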

Pick Egypt, say: http://en.wikipedia.org/wiki/ISO_3166-2:EG

That list of languages down the left doesn't include Egypt or Arabic, and that's what I was hoping to get.


If you have the name of the subcountry in English in the ISO file, you can scrape the Wikipedia article with that name and then get the interwiki link from that.
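Rather than scraping the HTML, the same links are available through the MediaWiki API (action=query with prop=langlinks). Here is a minimal sketch that builds the request URL and picks the native title out of a response. The sample response is hand-written in the default JSON shape rather than captured from the live site, and the function names are mine.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

# Hand-written response in the default JSON format; the page id is a placeholder.
SAMPLE_RESPONSE = json.dumps({
    "query": {
        "pages": {
            "1": {
                "title": "Cairo Governorate",
                "langlinks": [{"lang": "ar", "*": "محافظة القاهرة"}],
            }
        }
    }
})


def langlinks_url(title, lang):
    """Build a query URL asking for one article's link in one language."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": lang,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def native_title(response_text, lang):
    """Return the linked article title for lang, or None if there is none."""
    data = json.loads(response_text)
    for page in data.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            if link.get("lang") == lang:
                return link.get("*")
    return None


if __name__ == "__main__":
    print(langlinks_url("Cairo Governorate", "ar"))
    print(native_title(SAMPLE_RESPONSE, "ar"))
```

Fetching the URL (with a polite User-Agent and some delay between requests) is left out so the sketch runs offline.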

This highlights a very sad fact about the state of such basic information: there is no single, complete, up-to-date, authoritative source for country names, divisions, timezones, currency, etc. in a relationally consistent format, translated, easily update-able, exportable, synchronize-able, partition-able, API-able...

Each of the sources we've identified in this thread (not to mention OpenStreetMap etc.) has some kind of issue, but more importantly they all appear to be independent: presumably every project is either listening to the news-wire for changes, or waiting till someone else makes a change and then manually integrating.

It is almost enough to make me start a GitHub for geodata. Imagine a tool with the following usage:

    usage: geodb [--db DATABASE] COMMAND
        --db        your local database (defaults to .geodb)
        COMMAND     (required)
            init    initialize a geo database
            search  run a query against a geo database
            list    display some pre-defined search results
            show    formatted information of objects 
            new     insert a record into a geo database
            update  update a geo database
            export  dump a search query as xml, json, blah
            pull    fetch and merge from a remote geo database
            push    send updates to a remote geo database

    $ geodb init
    Initializing geodb in /home/mark/.geodb

    $ geodb pull geo://geodb.com/geodb
    Fetched 13 updates.
    ...

    $ geodb list currencies
    adp Andorran Peseta
    aed United Arab Emirates Dirham
    ...

    $ geodb show AUS
    Country: Australia
    2-code: AU
    3-code: AUS
    Timezones:
    Australia/Brisbane
    Australia/Darwin

Even better if the .geodb were an SQLite file, which could then be used from whatever programming language or interface you want to build on top of it.
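To make that concrete, here is a small sketch of how a "show" command could be answered straight from an SQLite file. The geodb tool does not exist, so the tables, columns and function names are all hypothetical, and the demo uses an in-memory database instead of a real .geodb file.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical geodb layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE countries (
            code3 TEXT PRIMARY KEY,
            code2 TEXT NOT NULL,
            name  TEXT NOT NULL
        );
        CREATE TABLE timezones (
            code3 TEXT NOT NULL REFERENCES countries(code3),
            zone  TEXT NOT NULL
        );
        INSERT INTO countries VALUES ('AUS', 'AU', 'Australia');
        INSERT INTO timezones VALUES ('AUS', 'Australia/Brisbane');
        INSERT INTO timezones VALUES ('AUS', 'Australia/Darwin');
        """
    )
    return conn


def show(conn, code3):
    """Format one country record as text, or return None if it is unknown."""
    row = conn.execute(
        "SELECT name, code2, code3 FROM countries WHERE code3 = ?", (code3,)
    ).fetchone()
    if row is None:
        return None
    name, code2, code3 = row
    zones = [
        zone
        for (zone,) in conn.execute(
            "SELECT zone FROM timezones WHERE code3 = ? ORDER BY zone", (code3,)
        )
    ]
    lines = [
        f"Country: {name}",
        f"2-code: {code2}",
        f"3-code: {code3}",
        "Timezones:",
    ] + zones
    return "\n".join(lines)


if __name__ == "__main__":
    print(show(build_demo_db(), "AUS"))
```

The same file could be opened from Perl via DBI, or from anything else with an SQLite binding, which is really the point.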

Anyone have interest in (and access to funding for) such a thing?


About Ron Savage

I try to write all code in Perl, but find I end up writing in bash, CSS, HTML, JS, and SQL, and doing database design, just to get anything done...