distributed data analysis and reporting and perl, ...

so I was asked a question about our users and our products
and such. wanting a quantitative answer,
- I quickly (less than 5 minutes) wrote up a map-reduce (+ reduce) job,
- submitted it to our array of servers,
- and in less than 10 seconds, I had the answer.

actual job code (names changed):


<%map>
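# one input record per call; returning nothing skips the record
my $data = shift;

return unless $data->{category} eq 'CATEGORY';
return unless $data->{design};
return unless $data->{type} eq 'TYPE';

# emit the key "design status" with a count of 1
return ("$data->{design} $data->{status}", 1);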
</%map>
<%reduce>
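# $key is "design status"; $iter walks the values emitted by map
my $key  = shift;
my $iter = shift;

# every value is 1, so counting the values gives the total
my $total = 0;
$iter->foreach(sub { $total++ });

# re-key by design, so the next reduce sees every status of one design
my ($design, $status) = split /\s/, $key;

return ($design, { status => $status, total => $total });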
</%reduce>
<%reduce>
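# $key is the design; each value is a { status, total } hash from the first reduce
my $key  = shift;
my $iter = shift;

my %totals;
$iter->foreach(sub {
    my $r = shift;
    $totals{ $r->{status} } = $r->{total};
    $totals{all} += $r->{total};    # running total across all statuses
});

return ($key, \%totals);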
</%reduce>
<%final>
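# $key is the design; $tot is the totals hash from the second reduce
my $key = shift;
my $tot = shift;

my $all = $tot->{all};

# one line per status (including "all"): design, status, count, percent of total
for my $status (keys %$tot) {
    my $pct = $all ? sprintf('%.2f', 100 * $tot->{$status} / $all) : 0;
    print "$key\t$status\t$tot->{$status}\t%$pct\n";
}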
</%final>
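
to make the data flow concrete, here is a rough standalone sketch that walks the same four steps in plain perl on a handful of made-up records. it is not how the framework runs things (no servers, no iterators, just hashes in one process), and the sample designs and statuses are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical sample records: field names match the job, values are made up
my @records = (
    { category => 'CATEGORY', type => 'TYPE', design => 'blue', status => 'active'  },
    { category => 'CATEGORY', type => 'TYPE', design => 'blue', status => 'active'  },
    { category => 'CATEGORY', type => 'TYPE', design => 'blue', status => 'expired' },
    { category => 'CATEGORY', type => 'TYPE', design => 'red',  status => 'active'  },
    { category => 'OTHER',    type => 'TYPE', design => 'red',  status => 'active'  }, # wrong category
    { category => 'CATEGORY', type => 'TYPE', design => '',     status => 'active'  }, # no design
);

# map: keep matching records, emit ("design status", 1), grouped by key
my %by_pair;
for my $data (@records) {
    next unless $data->{category} eq 'CATEGORY';
    next unless $data->{design};
    next unless $data->{type} eq 'TYPE';
    push @{ $by_pair{"$data->{design} $data->{status}"} }, 1;
}

# first reduce: count per "design status", then re-key by design
my %by_design;
for my $key (sort keys %by_pair) {
    my $total = scalar @{ $by_pair{$key} };
    my ($design, $status) = split /\s/, $key;
    push @{ $by_design{$design} }, { status => $status, total => $total };
}

# second reduce: one hash of per-status totals (plus "all") per design
my %totals_for;
for my $design (keys %by_design) {
    my %totals;
    for my $r (@{ $by_design{$design} }) {
        $totals{ $r->{status} } = $r->{total};
        $totals{all} += $r->{total};
    }
    $totals_for{$design} = \%totals;
}

# final: design, status, count, percent of that design's total
my @lines;
for my $design (sort keys %totals_for) {
    my $tot = $totals_for{$design};
    my $all = $tot->{all};
    for my $status (sort keys %$tot) {
        my $pct = $all ? sprintf('%.2f', 100 * $tot->{$status} / $all) : 0;
        push @lines, "$design\t$status\t$tot->{$status}\t%$pct";
    }
}
print "$_\n" for @lines;
```

with these records, blue ends up with 2 active and 1 expired out of 3, and red with 1 active out of 1; the two records that fail the filters never reach the reduce steps.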

we are a perl + mason shop, so one of the design goals was to keep
the learning curve easy and comfortable for our junior staff, which is
why a job is written as mason-style blocks of ordinary perl.

since it is common with some other m/r frameworks to run a job and
then use the results as input for another job (and so on...), I wanted to be able
to specify multiple stages, from start to end, all together. that is why the
job above has one map block, two reduce blocks, and a final block in a single file.
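
the idea of chaining stages can be sketched in a few lines of single-process perl. this is only an illustration, not the framework's code: `run_stages` and `group_by_key` are hypothetical names, and a reducer here gets a plain array ref of values rather than an iterator with `foreach`.

```perl
use strict;
use warnings;

# group a list of [key, value] pairs into { key => [values] }
sub group_by_key {
    my %grouped;
    push @{ $grouped{ $_->[0] } }, $_->[1] for @_;
    return \%grouped;
}

# hypothetical driver: one map sub, then any number of reduce subs, run in order
sub run_stages {
    my ($records, $map, @reduces) = @_;

    # map stage: one call per record, an empty return means "skip"
    my @pairs;
    for my $rec (@$records) {
        my @kv = $map->($rec);
        push @pairs, \@kv if @kv;
    }

    # each reduce sees the grouped output of the stage before it
    for my $reduce (@reduces) {
        my $grouped = group_by_key(@pairs);
        @pairs = ();
        for my $key (sort keys %$grouped) {
            my @kv = $reduce->($key, $grouped->{$key});
            push @pairs, \@kv if @kv;
        }
    }
    return @pairs;
}

# made-up records: count per "design status", then sum the counts per design
my @records = (
    { design => 'blue', status => 'active'  },
    { design => 'blue', status => 'active'  },
    { design => 'blue', status => 'expired' },
    { design => 'red',  status => 'active'  },
);

my @out = run_stages(
    \@records,
    sub { my $d = shift; return ("$d->{design} $d->{status}", 1) },
    sub { my ($key, $vals) = @_; my ($design) = split /\s/, $key; return ($design, scalar @$vals) },
    sub { my ($key, $vals) = @_; my $sum = 0; $sum += $_ for @$vals; return ($key, $sum) },
);

print "$_->[0]\t$_->[1]\n" for @out;
```

the point of the sketch is the loop over `@reduces`: adding another stage means adding another sub to the list, with no separate job to submit and no intermediate results to pick up by hand.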

did I mention that we plan to open-source this?
oh yeah, that....

About jaw

I am this tall.