TMTOWTDI, plus benchmarking
There's a very common Perl idiom for getting "top N elements" from an array: @top = (sort @a) [0 .. $n - 1]. Mostly, it's good enough for anything one would dare to store in RAM.
Then, there is Sort::Key::Top, which allows you to write @top = top $n => @a. Yet another syntax sugar?
Not even close! While the docs don't state it boldly, it is:
- a XS based module
- partition-based general selection algorithm (also known as quickselect) implementation
So, expect it to be fast. How fast?
Here's the results from my attempt of benchmarking the 10 longest-word selection from system dictionary (total of 235886 words):
Rate quicksort pureperl quickselect
quicksort 0.329/s -- -75% -89%
pureperl 1.30/s 295% -- -57%
quickselect 3.04/s 825% 134% --
Note that the quicksort is there only to verify this claim from the Wikipedia article:
However, if done properly, a Java implementation is typically a magnitude (10x) faster than the quicksort algorithm.
Yup, seems like so. BTW, the 10 longest words are:
"scientificophilosophical"
"tetraiodophenolphthalein"
"formaldehydesulphoxylate"
"thyroparathyroidectomize"
"pathologicopsychological"
"formaldehydesulphoxylic"
"hematospectrophotometer"
"thymolsulphonephthalein"
"phenolsulphonephthalein"
"epididymodeferentectomy"
Benchmark code:
#!/usr/bin/env perl
use 5.010000;
use autodie;
use strict;
use warnings qw(all);
use Carp qw(croak);
use Benchmark qw(cmpthese);
use Sort::Key::Top qw(rnkeytopsort);
my %words;
open my $fh, q(<), q(/usr/share/dict/words);
while (<$fh>) {
chomp;
$words{$_} = length;
}
close $fh;
say q(words in hash: ) . scalar keys %words;
my $top_n = 10;
my $code = {
quickselect => sub { rnkeytopsort { $words{$_} } $top_n => keys %words },
quicksort => sub {
use sort qw(_quicksort stable);
(
sort { $words{$b} <=> $words{$a} }
keys %words
) [0 .. $top_n - 1];
},
pureperl => sub {
(
sort { $words{$b} <=> $words{$a} }
keys %words
) [0 .. $top_n - 1];
},
};
croak qq(something went VERY wrong)
unless [$code->{quickselect}->()] ~~ [$code->{pureperl}->()];
cmpthese(100 => $code);
This is a slow laptop, so I only benchmarked the first 5000 words in /usr/share/dict/words, but the following pure Perl solution seems to be faster than all of the above! (Though only 11% faster than quickselect.)
pureperl2 => sub { my @top; for (keys %words) { if (@top < 10 or $words{$_} > $words{$top[0]}) { @top = sort { $words{$a} <=> $words{$b} } $_, @top; shift @top if @top > 10; } } return reverse @top; },(Granted, it's not quite as pithy.)
Excellent point!
I've forgot completely to point that both quickselect and selection-by-sorting algorithms execute in a linear time.
So, the quickselect/quicksort performance ratio of 10x will be the same while selecting 10 or 1000 elements.
I am not good at algorithm analysis, but benchmarking
pureperl2at$top_n = 1000clearly shows it's non-linearity:words in hash: 235886 s/iter pureperl2 quicksort pureperl quickselect pureperl2 4.29 -- -37% -82% -93% quicksort 2.71 58% -- -71% -88% pureperl 0.781 450% 247% -- -59% quickselect 0.317 1255% 755% 146% --True; it doesn't work great for large
$top_n. But pureperl2 does have a potential optimisation. When inserting$_into@top, it should be possible to take advantage of the fact that@topis already sorted, and insert the new word into the right place without needing to sort it again.(I couldn't get it to work without sacrificing sort stability, but it should be doable.)
Did it using binary search (code stolen from List::BinarySearch):
The outer loop has O(n) complexity, while the binary search is O(log n).
So, pureperl3 code is supposed to be O(n log n), the same as quickselect.
In a benchmark for
$top_n = 1000, it is only 13% slower than quickselect.And it is 5% faster for
$top_n = 10.Very interesting results! Kudos for pointing this out!
I've optimised your
pureperl3slightly more, using the trick of eliminating brace-delimited blocks. Blocks are lexical scopes and thus there is some overhead each time you enter and exit one.pureperl3 => sub { my @top; my $cur; for my $word (keys %words) { next if $words{$word} < $words{$top[0]||'a'}; (@top = $word) unless @top; my ($low, $high) = (0, scalar @top); ( ($cur = ($low + $high) >> 1), ($words{$word} - $words{$top[$cur]} > 0) ? ($low = $cur + 1) : ($high = $cur), ) while $low < $high; splice(@top, $low, 0, $word); shift @top if @top > $top_n; } return reverse @top; },Benchmarking the new pureperl3 on
$top_n = 1000with 483523 words, it's 15% faster than quickselect!Awesome :D
I hope you plan to release this code as Sort::Key::Top::PP and/or integrate it into Sort::Key::Top as a fallback when compiler is unavailable.
OK... Sort-Key-Top-PP 0.002.