Google Refine + Perl

(repost from http://sysd.org/google-refine-perl-english/; it's more contextual here)

Google Refine is awesome. If you're unaware of what it is, access their official page and watch at least the first screencast. You'll see it can be helpful for several ETL-related tasks.

Currently, I use it a lot, specially for simple (but boring) tasks, like loading a CSV, trimming out some outliers and saving as JSON to be imported into MongoDB. Nothing a Perl one-liner couldn't do.

However, the opposite is not true: Perl one-liners are a lot more flexible than Google Refine. Now, what if we could merge both?

  1. Google Refine could be easily integrated with any RESTful API.
  2. Perl transforms one-liners into RESTful webservices.
  3. PROFIT!!!

As a practical example, I'll use some georeferenced data I was working at. Let's suppose I have to deduplicate registers, and one of "duplicate" rules is their proximity on the map. Google Refine is far from a full-featured GIS, and is unable to handle bidimensional coordinate system. Enter the GeoDNA: an algorithm to lower geospatial dimensions. As it's FAQ says,

GeoDNA is a way to represent a latitude/longitude coordinate pair as a string. That sounds simple enough, but it's a special string format: the longer it is, the more accurate it is. More importantly, each string uniquely defines a region of the earth's surface, so in general, GeoDNA codes with similar prefixes are located near each other. This can be used to perform proximity searching using only string comparisons (like the SQL "LIKE" operator).

Another interesting property of GeoDNA is that when ordening a set of records by their GeoDNA code, close locations are likely to appear in adjacent rows (sometimes, close locations will share very different prefixes, but similar prefixes always represent close locations).

To incorporate GeoDNA into Google Refine, we'll use the Add column by fetching URLs option, clicking on the header of any column (which column it will be doesn't matter as we'll use two of them, anyway):

Edit column > Add column by fetching URLs...

As the expression, we'll paste the following code (here, pay attention to the correct latitude/longitude column names):

'http://127.0.0.1:3000/?lat='+
row.cells['latitude'].value
+'&lon='+
row.cells['longitude'].value

Throttle delay can be zeroed, as our webservice is local. The final configuration should look like this (don't push the OK button, yet):

Add column by fetching URLs... dialog

Now, check if you have Mojolicious and Geo::DNA Perl modules (install them via CPAN, if not) and paste into your terminal:

perl -MGeo::DNA -Mojo -E 'a("/"=>sub{my$s=shift;$s->render(json=>{geocode=>Geo::DNA::encode_geo_dna($s->param("lat"),$s->param("lon"))})})->start' daemon

If you prefer a "human-readable" version, paste the following code into geocode-webservice.pl:

#!/usr/bin/env perl
use Geo::DNA qw(encode_geo_dna);
use Mojolicious::Lite;
any '/' => sub {
    my $self = shift;
    $self->render(json => {
        geocode => encode_geo_dna(
            $self->param('lat'),
            $self->param('lon'),
        ),
    });
};
app->start;

Once you started a webservice, it will report Server available at http://127.0.0.1:3000. Now, click OK on Google Refine dialog and wait. Even without delay, it could be a bit slow; however, even then this hack saved me a lot of time ;)

Leave a comment

About stas

user-pic Just another lazy, impatient and arrogant IT guy.