Binary search versus hash lookup
A long time ago I had a very silly conversation with a developer in which we both agreed that we weren't too concerned about memorizing algorithms and data structures because, hey, we could look them up, right? We don't need that fancy-pants computer-sciency stuff in our jobs because we've never needed it. Of course, that's like saying a driver's license is useless because you've never had one. When you don't have a car, you naturally tend to look at problems and solutions without considering the options a car might provide.
That's why I've been brushing up on my comp-sci lately. I was working through a bunch of sorting algorithms (insertion sort, merge sort, and quicksort) to better understand their characteristics when I started to think about how I might find whether an item is in a list. If it's a large list, I typically use a hash. Most of the time this is fine, but I wanted to better understand what I could do.
My general strategy would be something like this:
my %index_for = map { $list[$_] => $_ } 0 .. $#list;
my $index = $index_for{$some_value};
That seems reasonable, but if that list is sorted, how does it stand up to a binary search?
#!/usr/bin/env perl

use Test::Most;
use Time::HiRes qw/ tv_interval gettimeofday /;

# Run a code block and report how long it took as a TAP diagnostic.
sub timefor(&$) {
    my $start = [gettimeofday];
    $_[0]->();
    explain sprintf "$_[1]: took %s time" => tv_interval($start);
}

# Classic iterative binary search over a sorted arrayref.
# Returns the index of $elem, or undef if it is not present.
sub binary {
    my ( $elem, $list ) = @_;
    my $max = $#$list;
    my $min = 0;
    while ( $max >= $min ) {
        my $index = int( ( $max + $min ) / 2 );
        if    ( $list->[$index] < $elem ) { $min = $index + 1 }
        elsif ( $list->[$index] > $elem ) { $max = $index - 1 }
        else                              { return $index }
    }
    return;
}

my @list = map { $_ * 2 } 1 .. 1000;    # sorted: 2, 4, ..., 2000
my ( $target, $index ) = ( 998, 498 );

timefor {
    is binary( $target, \@list ), $index, "$target is in our list";
}
"Binary search on list";

# assumes no elements are duplicated
my %index_for;

timefor {
    %index_for = map { $list[$_] => $_ } 0 .. $#list;
    is $index_for{$target}, $index, "$target is in our hash";
}
"Hash based search with hash construction";

timefor {
    is $index_for{$target}, $index, "$target is in our hash";
}
"Hash based search without hash construction";

done_testing;
That outputs something like this:
ok 1 - 998 is in our list
# Binary search on list: took 0.000302 time
ok 2 - 998 is in our hash
# Hash based search with hash construction: took 0.001509 time
ok 3 - 998 is in our hash
# Hash based search without hash construction: took 0.000172 time
1..3
If you construct the hash once and throw it away, you're wasting time: you need to use that hash many times before you match the performance of a binary search. Yes, for many of you this is a trivial result and you're not surprised, but I wanted to make sure I really understood what I was looking at. It's interesting that for a list of only 1000 elements, it takes at least 20 to 25 searches through the list before the hash lookup pays off. If we have a million-element list, we need around 10,000 lookups before the hash-based solution pays off. The hash creation is expensive (almost 2 seconds on my box for one million entries).
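To make that break-even arithmetic explicit, here's a back-of-the-envelope sketch of my own; the variable names are mine, and the timings plugged in are just the ones from the single run above, so your numbers will vary:

my ( $t_binary, $t_construct, $t_hash ) = ( 0.000302, 0.001509, 0.000172 );

# n binary searches cost n * $t_binary; the hash costs $t_construct once
# plus n * $t_hash per lookup. Solve n * $t_binary = $t_construct + n * $t_hash:
my $break_even = $t_construct / ( $t_binary - $t_hash );
printf "hash pays off after ~%.0f lookups\n", $break_even;

(On this single sample the formula gives roughly a dozen lookups; individual timings fluctuate quite a bit, which is presumably why repeated runs land in the 20 to 25 range quoted above.)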
Naturally, this doesn't mean you should abandon your hash-based lookups in favor of binary searches. You need to profile your code and figure out where your real bottlenecks are, but it's still something worth thinking about ... something you might be hard-pressed to do if you don't even know basic algorithms.
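If you're wondering how to do that profiling, one common approach (my suggestion; the post doesn't name a tool) is Devel::NYTProf, where yourscript.pl stands in for whatever you're measuring:

perl -d:NYTProf yourscript.pl   # collect a profile while the script runs
nytprofhtml --open              # generate and open the HTML report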
Hi Ovid
But your binary search data is sorted. Perhaps you should be using random data and counting the time to sort it?
Cheers Ron
I have also started reading Skiena's Algorithm Design (I highly recommend the online lectures; just Google for the URL). I just read the chapter on dictionaries. I'm still trying to grasp it, so I might be wrong, but here goes.
From a Big-O point of view, going through the elements in an array is O(n), and hash retrieval in the worst case is also O(n) (the expected case is O(n/m), but for Big O we always take the worst case), so n such lookups would be O(n^2), whereas binary search is O(log n).
From a Perl point of view, things might be different. Do you know if Perl's hash lookup is O(1) or O(n/m)?
It's surprising that the insertion takes so long, as I thought insertion was O(1).
Compiled code can spend all its time in actual computation. Perl code needs a runloop with an op dispatcher to run, which is overhead. Running Perl is expensive.
Implementing an algorithm with lower complexity, but one that cannot offload its inner loop into a single op, is often slower in practice than using an algorithm whose complexity is worse but whose inner loop (or hot path) is run by a single op, such as a hash lookup or the regex engine.
Eventually an O(n) algorithm that loops in Perl will catch up to an in-op-looping O(n²) algorithm, but it takes large – sometimes very large – runtimes to amortize the constant overhead added by looping in Perl.
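To make that concrete, here's a sketch of my own (not from the comment) comparing an O(n) scan whose loop runs in Perl against a hash lookup that happens inside a single C-implemented op:

#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @list    = map { $_ * 2 } 1 .. 1000;
my %in_list = map { $_ => 1 } @list;
my $target  = 998;

cmpthese( -1, {
    # O(n) scan looping in Perl: the runloop dispatches several
    # ops for every element examined.
    scan_in_perl => sub {
        for my $x (@list) { return 1 if $x == $target }
        return 0;
    },
    # Expected O(1), and the entire lookup runs inside one hash op.
    hash_op => sub { exists $in_list{$target} },
});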
Ron: actually, shuffling and sorting the data to make the comparisons "fair" almost defeats the purpose. The idea is that with a given set of starting conditions, knowing your algorithms can easily lead you to make different choices. That being said, I shuffled the 1000 elements and then timed the binary search with a prior sort:
As you can see, that's still better than the hash lookup. Kicking this up to one million elements:
So the sort is still far faster than the hash construction.
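For reference, that sort-then-search timing could be reproduced with something like the following, appended to the script above; the shuffle via List::Util is my assumption, since the original timing output isn't preserved here:

use List::Util qw( shuffle );

my @shuffled = shuffle @list;

timefor {
    # pay for the sort up front, then do the O(log n) search
    my @sorted = sort { $a <=> $b } @shuffled;
    is binary( $target, \@sorted ), $index, "$target is in our sorted list";
}
"Binary search on shuffled list with a prior sort";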
Hmm, I ran the script on my computer (i5 Win7) and got these results:
then I moved the binary search to be the last one and ran it again:
I don't have any conclusion just an observation:
The binary search took twice as long. The hash-based search with construction got twice as fast, and now they are on par.
The hash-based search without construction is now slower (?!) than the one with construction.
Is this a Windows thing?
Have I missed something obvious?
BTW, I ran it several times and the numbers fluctuated a lot, so maybe the benchmark should be run many times and the results averaged.
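Perl's core Benchmark module does exactly that kind of repetition and averaging; a minimal sketch (mine, not the commenter's), appended to the script above, might look like:

use Benchmark qw( timethese );

# Run each case for at least 3 CPU seconds and report averaged rates,
# which smooths out run-to-run fluctuation.
timethese( -3, {
    binary_search => sub { binary( $target, \@list ) },
    hash_lookup   => sub { $index_for{$target} },
});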
I would also run this multiple times for those reasons. And I would run it for different values of $target, including missing values. The conclusions still stand, but this might be more convincing. I added an alternative, withalloc, that assigns into a pre-declared %index_for, since that seemed more realistic, given the described use case. I was curious if that would be slower. I was surprised to see that creating a new hash was faster than assigning to an existing one. Maybe this has something to do with garbage collection?
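A quick way to exercise both present and missing targets, appended to the script above (the extra target values are my own choices):

# 999 and -1 are absent from @list; binary() returns undef for them,
# as does the hash lookup, so the two approaches should agree.
for my $t ( 2, 998, 2000, 999, -1 ) {
    is scalar( binary( $t, \@list ) ), $index_for{$t},
        "binary search and hash agree on $t";
}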