distributed data analysis and reporting in perl, ...
so I was asked a question about our users and our products,
and, wanting a quantitative answer,
- I quickly (in under 5 minutes) wrote up a map-reduce (+ reduce) job,
- submitted it to our array of servers,
- and in under 10 seconds, I had the answer.
actual job code (names changed):
<%map>
  my $data = shift;

  # keep only records in the category/type we care about
  return unless $data->{category} eq 'CATEGORY';
  return unless $data->{design};
  return unless $data->{type} eq 'TYPE';

  # emit a combined "design status" key with a count of 1
  return ("$data->{design} $data->{status}", 1);
</%map>
<%reduce>
  my $key  = shift;
  my $iter = shift;

  # count how many records were emitted for this "design status" key
  my $total = 0;
  $iter->foreach(sub { $total++ });

  # re-key by design alone, carrying status and count along
  my ($design, $status) = split /\s/, $key;
  return ($design, { status => $status, total => $total });
</%reduce>
<%reduce>
  my $key  = shift;
  my $iter = shift;

  # fold the per-status counts for one design into a single hash,
  # tracking a grand total under 'all'
  my %totals;
  $iter->foreach(sub {
      my $r = shift;
      $totals{ $r->{status} } = $r->{total};
      $totals{all} += $r->{total};
  });
  return ($key, \%totals);
</%reduce>
<%final>
  my $key = shift;
  my $tot = shift;

  # print one line per status: design, status, count, percentage
  my $all = $tot->{all};
  for my $status (keys %$tot) {
      my $pct = $all ? sprintf('%.2f', 100 * $tot->{$status} / $all) : 0;
      print "$key\t$status\t$tot->{$status}\t%$pct\n";
  }
</%final>
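if you want to poke at the data flow without the framework, here's a plain-perl sketch that runs the same four stages in memory over a few made-up records (the field values are hypothetical, and the stage wiring here is just my simulation, not the framework's actual plumbing):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical sample records; the fourth one fails the category filter
my @records = (
    { category => 'CATEGORY', type => 'TYPE', design => 'A', status => 'live' },
    { category => 'CATEGORY', type => 'TYPE', design => 'A', status => 'live' },
    { category => 'CATEGORY', type => 'TYPE', design => 'A', status => 'dead' },
    { category => 'OTHER',    type => 'TYPE', design => 'A', status => 'live' },
);

# map stage: emit ("design status", 1) for matching records
my %stage1;
for my $data (@records) {
    next unless $data->{category} eq 'CATEGORY';
    next unless $data->{design};
    next unless $data->{type} eq 'TYPE';
    push @{ $stage1{"$data->{design} $data->{status}"} }, 1;
}

# first reduce: count per "design status" key, then re-key by design
my %stage2;
for my $key (keys %stage1) {
    my $total = scalar @{ $stage1{$key} };
    my ($design, $status) = split /\s/, $key;
    push @{ $stage2{$design} }, { status => $status, total => $total };
}

# second reduce: fold per-status counts into one hash per design
my %result;
for my $design (keys %stage2) {
    my %totals;
    for my $r (@{ $stage2{$design} }) {
        $totals{ $r->{status} } = $r->{total};
        $totals{all} += $r->{total};
    }
    $result{$design} = \%totals;
}

# final: one tab-separated line per design/status with a percentage
for my $design (sort keys %result) {
    my $tot = $result{$design};
    my $all = $tot->{all};
    for my $status (sort keys %$tot) {
        my $pct = $all ? sprintf('%.2f', 100 * $tot->{$status} / $all) : 0;
        print "$design\t$status\t$tot->{$status}\t%$pct\n";
    }
}
```

with these sample records, design A ends up with 2 live, 1 dead, and a grand total of 3, so the percentages come out to 66.67 / 33.33 / 100.00.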
we are a perl + mason shop, so one of the design goals was to keep
the learning curve gentle for our junior staff.
since it is common with other m/r frameworks to run a job and then
feed its results into another job (and so on), I wanted to be able
to specify multiple stages, from start to finish, all in one place.
did I mention that we plan to open-source this?
oh yeah, that....