TMTOWTDI, plus benchmarking

By stas on December 4, 2012 1:46 PM

There's a very common Perl idiom for getting "top N elements" from an array: @top = (sort @a) [0 .. $n - 1]. Mostly, it's good enough for anything one would dare to store in RAM.

Then, there is Sort::Key::Top, which allows you to write @top = top $n => @a. Yet another syntax sugar?

Not even close! While the docs don't state it boldly, it is:

a XS based module
partition-based general selection algorithm (also known as quickselect) implementation

So, expect it to be fast. How fast?

Here's the results from my attempt of benchmarking the 10 longest-word selection from system dictionary (total of 235886 words):

               Rate   quicksort    pureperl quickselect
quicksort   0.329/s          --        -75%        -89%
pureperl     1.30/s        295%          --        -57%
quickselect  3.04/s        825%        134%          --

Note that the quicksort is there only to verify this claim from the Wikipedia article:

However, if done properly, a Java implementation is typically a magnitude (10x) faster than the quicksort algorithm.

Yup, seems like so. BTW, the 10 longest words are:

"scientificophilosophical"
"tetraiodophenolphthalein"
"formaldehydesulphoxylate"
"thyroparathyroidectomize"
"pathologicopsychological"
"formaldehydesulphoxylic"
"hematospectrophotometer"
"thymolsulphonephthalein"
"phenolsulphonephthalein"
"epididymodeferentectomy"

Benchmark code:

#!/usr/bin/env perl
use 5.010000;
use autodie;
use strict;
use warnings qw(all);

use Carp qw(croak);
use Benchmark qw(cmpthese);

use Sort::Key::Top qw(rnkeytopsort);

my %words;
open my $fh, q(<), q(/usr/share/dict/words);
while (<$fh>) {
    chomp;
    $words{$_} = length;
}
close $fh;

say q(words in hash: ) . scalar keys %words;

my $top_n = 10;
my $code = {
    quickselect => sub { rnkeytopsort { $words{$_} } $top_n => keys %words },
    quicksort => sub {
        use sort qw(_quicksort stable);
        (
            sort { $words{$b} <=> $words{$a} }
            keys %words
        ) [0 .. $top_n - 1];
    },
    pureperl => sub {
        (
            sort { $words{$b} <=> $words{$a} }
            keys %words
        ) [0 .. $top_n - 1];
    },
};
croak qq(something went VERY wrong)
    unless [$code->{quickselect}->()] ~~ [$code->{pureperl}->()];

cmpthese(100 => $code);

9 comments

Tagged as:

algorithm, array, benchmark, Perl, selection algorithm, Sort::Key::Top, TMTOWTDI, top elements

9 Comments

Toby Inkster | December 4, 2012 5:22 PM | Reply

This is a slow laptop, so I only benchmarked the first 5000 words in /usr/share/dict/words, but the following pure Perl solution seems to be faster than all of the above! (Though only 11% faster than quickselect.)

pureperl2 => sub {
	my @top;
	for (keys %words) {
		if (@top < 10 or $words{$_} > $words{$top[0]}) {
			@top = sort { $words{$a} <=> $words{$b} } $_, @top;
			shift @top if @top > 10;
		}
	}
	return reverse @top;
},

Toby Inkster | December 4, 2012 5:25 PM | Reply

(Granted, it's not quite as pithy.)

stas replied to comment from Toby Inkster | December 4, 2012 8:33 PM | Reply

Excellent point!
I've forgot completely to point that both quickselect and selection-by-sorting algorithms execute in a linear time.
So, the quickselect/quicksort performance ratio of 10x will be the same while selecting 10 or 1000 elements.
I am not good at algorithm analysis, but benchmarking pureperl2 at $top_n = 1000 clearly shows it's non-linearity:

words in hash: 235886
            s/iter   pureperl2   quicksort    pureperl quickselect
pureperl2     4.29          --        -37%        -82%        -93%
quicksort     2.71         58%          --        -71%        -88%
pureperl     0.781        450%        247%          --        -59%
quickselect  0.317       1255%        755%        146%          --

Toby Inkster replied to comment from stas | December 5, 2012 4:46 AM | Reply

True; it doesn't work great for large $top_n. But pureperl2 does have a potential optimisation. When inserting $_ into @top, it should be possible to take advantage of the fact that @top is already sorted, and insert the new word into the right place without needing to sort it again.

(I couldn't get it to work without sacrificing sort stability, but it should be doable.)

stas replied to comment from Toby Inkster | December 5, 2012 4:24 PM | Reply

Did it using binary search (code stolen from List::BinarySearch):



    pureperl3 => sub {

        my @top;

        for my $word (keys %words) {

            if (@top  $words{$top[0]}) {

                unless (@top) {

                    @top = ($word);

                } else {

                    my ($low, $high) = (0, scalar @top);

                    while ($low 
                        my $cur = ($low + $high) >> 1;

                        if ($words{$word} - $words{$top[$cur]} > 0) {

                            $low = $cur + 1;

                        } else {

                            $high = $cur;

                        }

                    }

                    splice(@top, $low, 0, $word);

                }

                shift @top if @top > $top_n;

            }

        }

        return reverse @top;

    },

The outer loop has O(n) complexity, while the binary search is O(log n).
So, pureperl3 code is supposed to be O(n log n), the same as quickselect.
In a benchmark for $top_n = 1000, it is only 13% slower than quickselect.
And it is 5% faster for $top_n = 10.
Very interesting results! Kudos for pointing this out!

Toby Inkster | December 6, 2012 11:59 AM | Reply

I've optimised your pureperl3 slightly more, using the trick of eliminating brace-delimited blocks. Blocks are lexical scopes and thus there is some overhead each time you enter and exit one.

pureperl3 => sub {
	my @top;
	my $cur;
	for my $word (keys %words) {
		next if $words{$word} < $words{$top[0]||'a'};
		(@top = $word) unless @top;
		my ($low, $high) = (0, scalar @top);
		(
			($cur = ($low + $high) >> 1),
			($words{$word} - $words{$top[$cur]} > 0)
				? ($low = $cur + 1)
				: ($high = $cur),
		) while $low < $high;
		splice(@top, $low, 0, $word);
		shift @top if @top > $top_n;
	}
	return reverse @top;
},

Toby Inkster replied to comment from Toby Inkster | December 6, 2012 1:02 PM | Reply

Benchmarking the new pureperl3 on $top_n = 1000 with 483523 words, it's 15% faster than quickselect!

stas replied to comment from Toby Inkster | December 7, 2012 2:06 PM | Reply

Awesome :D
I hope you plan to release this code as Sort::Key::Top::PP and/or integrate it into Sort::Key::Top as a fallback when compiler is unavailable.

Toby Inkster | December 11, 2012 11:39 AM | Reply

OK... Sort-Key-Top-PP 0.002.

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About stas

Perl Programmer. Cloud Explorer.

More info »

stas