DBIx::Class::FilterColumn: making transformation easier

Over a year ago I was tasked with creating a data warehouse for sports data. Having known absolutely nothing about data warehousing/ETL, my first sport ended up quite the mess; scrapers would extract and transform at the same time then stuff it into a database where it most likely needed additional transformations. At the time, additional transformations meant writing a script to iterate over every row and change whatever column to whatever regex I had constructed. Sometime later after i'd have generated a report i'd find something wrong, often missing data due to a bad transform regex, which meant re-scraping websites (and often times purchasing another membership).

Naturally I started saving the pages to be scraped, and then just scraped the files themselves. But I was still left loading the data into the db with DBIx::Class (by choice) and then running some sort of transformation on every column. My scripts folder got messy quickly, filled with various 20-30 line transformation scripts to modify a column or two's data. That is all fine and dandy, except as the data grew the need for the transformation to just 'happen' automatically became more apparent.

Welcome to DBIx::Class::FilterColumn. For this specific example, I was getting spreads where certain letter strings and non alphanumeric characters needed to be represented numerically. To clarify, I wanted a decimal number (negative or positive), but would often times get spreads such as:

pk # means 0

-5-05 # means -5

ev-05 # means 0

½ # means .5

Not only does DBIx::Class::FilterColumn handle this for us, but if we wanted to spit the data back out its easy to replace and substituted characters back (we still lose anything we flat out removed, but for the scope of this post we'll stop here).


sub odds_to_storage { 

    return undef unless $_[1];
    $_[1]=~s~\275~.5~; # convert ½ to .5
    $_[1]=~s~-\d\d$|ev$|u\d+$|\+\d+$~~; # strip off vig
    ($_[1] eq 'pk') ? 0 : $_[1]; # pj -> 0

}

sub odds_from_storage { 

    $_[1]=~s~\.0~~; # strip off needless .0
    $_[1]=~s~.5~\275~; # convert .5 to ½
    ($_[1] eq '0') ? 'pk' : $_[1] } # 0 -> pk
}

foreach my $col (__PACKAGE__->columns) {

    __PACKAGE__->filter_column(
        $col => {
            filter_to_storage => 'odds_to_storage',
            filter_from_storage => 'odds_from_storage',
        }) if __PACKAGE__->column_info($col)->{data_type} eq 'decimal';
}

1 Comment

Why not use proved existing generic open source ETL solutions such as Talend Data Integration or Pentaho Data Integration?
The early Talend versions even had Perl support for your transformation rules.

Leave a comment

About ugexe

user-pic Code requires [some order of magnitude] higher cleverness to read than to write, so if you are writing code at maximum cleverness, then you have already excluded yourself from being clever enough to understand it.