Binary search versus hash lookup

A long time ago I had a very silly conversation with a developer where we both agreed that we weren't too concerned about algorithms and data structure memorization because, hey, we could look them up, right? We don't need to use that fancy-pants computer sciency stuff in our jobs because we've never needed to. Of course, that's like saying because you've never had a driver's license, a driver's license is useless. When you don't have a car, you naturally tend to look at problems and solutions without considering the options a car may provide you.

That's why I've been brushing up on my comp-sci lately. I've been working through a bunch of sorting algorithms (insertion sort, merge sort, and quick sort) to better understand their characteristics when I started to think about how I might find if an item is in a list. If it's a large list, I typically use a hash. Most of the time this is fine, but I wanted to understand better what I could do.

My general strategy would be something like this:

my %index_for = map { $list[$_] => $_ } 0 .. $#list;
my $index = $index_for{$some_value};

That seems reasonable, but if that list is sorted, how does it stand up to a binary search?

#!/usr/bin/env perl

use Test::Most;
use Time::HiRes qw/ tv_interval gettimeofday /;

sub timefor(&$) {
    my $start = [gettimeofday];
    $_[0]->();
    explain sprintf "$_[1]: took %s time" => tv_interval($start);
}

sub binary {
    my ( $elem, $list ) = @_;
    my $max = $#$list;
    my $min = 0;

    while ( $max >= $min ) {
        my $index = int( ( $max + $min ) / 2 );
        if    ( $list->[$index] < $elem ) { $min = $index + 1 }
        elsif ( $list->[$index] > $elem ) { $max = $index - 1 }
        else                              { return $index }
    }

    return;
}
my @list = map { $_ * 2 } 1 .. 1000;
my ( $target, $index ) = ( 998, 498 );
timefor {
    is binary( $target, \@list ), $index, "$target is in our list";
}
"Binary search on list";

# assumes no elements are duplicated
my %index_for;
timefor {
    %index_for = map { $list[$_] => $_ } 0 .. $#list;
    is $index_for{$target}, $index, "$target is in our hash";
}
"Hash based search with hash construction";
timefor {
    is $index_for{$target}, $index, "$target is in our hash";
}
"Hash based search without hash construction";

done_testing;

That outputs something list this:

ok 1 - 998 is in our list
# Binary search on list: took 0.000302 time
ok 2 - 998 is in our hash
# Hash based search with hash construction: took 0.001509 time
ok 3 - 998 is in our hash
# Hash based search without hash construction: took 0.000172 time
1..3

If you construct the hash once and throw it away, you're wasting time. You need to use that hash many times before you match the performance of a binary search. Yes, for many of you this is a trivial result and you're not surprised, but I wanted to make sure that I really understood what I was looking at. It's interesting that for a list of only 1000 elements, it takes at least 20-25 searches through the list before the hash lookup pays off. If we have a million element list, we need around 10,000 lookups before the hash-based solution pays off. The hash creation is expensive (almost 2 seconds on my box for one million entries).

Naturally, this doesn't mean you should abandon your hash-based lookups in favor of binary searches. You need to profile your code and figure out where your real bottlenecks are, but it's still something worth thinking about ... something you might be hard-pressed to do if you don't even know basic algorithms.

6 Comments

Hi Ovid

But your binary search data is sorted. Perhaps you should be using random data and counting the time to sort it?

Cheers Ron

I have also started reading skiena's Algorithm design (I highly recommend the online lectures just google for url). I just read the chapter on Dictionary , I'm still trying to grasp so I might be wrong but here goes .
From a point of Big O , your going through the elements in a array is O(n) and hash retrieval in worst case (though expected is (n/m) for big O we always take the worst case) is also O(n) so its O(n^2) where as binary search is log n.

From a perl point of view things might be different, Do you know if perl hash lookup is O(1) or O(n/m) ?.
Its surprising that the insertion takes long as I thought insertion was O(1) .

Compiled code can spend all its time in actual computation. Perl code needs a runloop with an op dispatcher to run, which is overhead. Running Perl is expensive.

Implementing an algorithm with lower complexity but that cannot offload its inner loop to perl inside some op is often slower in practice than using an algorithm whose complexity is worse but whose inner loop (or hot path) is run by a single op, such as hash lookup or the regex engine.

Eventually an O(n) algorithm that loops in Perl will catch up to an in-op-looping O(n²) algorithm, but it takes large – sometimes very large – runtimes to amortize the constant overhead added by looping in Perl.

Hmm, I ran the script on my computer (i5 Win7) and got these results:

# Binary search on list: took 0.000779 time
# Hash based search with hash construction: took 0.002645 time
# Hash based search without hash construction: took 0.001848 time

then I moved the binary search to be the last one and ran it again:

# Hash based search with hash construction: took 0.001477 time
# Hash based search without hash construction: took 0.002317 time
# Binary search on list: took 0.001897 time

I don't have any conclusion just an observation:

The binary search took 2 times longer. The hash based with construction got 2 times faster and now they are on par.

The hash based without construction is now slower (?!) than the hash based with construction.

Is this a Windows thing?

Have I missed something obvious?

BTW I ran it several times and the numbers were fluctuating a lot so maybe the benchmark should be ran many times and then averaged.

I would also run this multiple times for those reasons. And I would run it for different values of $target, including missing values. The conclusions still stand, but this might be more convincing :

use Benchmark qw/cmpthese :hireswallclock/;
cmpthese -5, {
    binary => sub {
        $target=int rand @list;
        binary($target, \@list);
    },
    with_assign => sub {
        $target=int rand @list;
        %index_for = map { $list[$_] => $_ } 0 .. $#list;
        $index_for{$target};
    },
    with_alloc => sub {
        $target=int rand @list;
        my %index_for = map { $list[$_] => $_ } 0 .. $#list;
        $index_for{$target};
    },
    without => sub {
        $target=int rand @list;
        $index_for{$target};
    },
};

I added an alternative withalloc%indexfor, since that seemed more realistic, given the described use case. I was curious if that would be slower. I was surprised to see that creating a new hash was faster than assigning to an existing one. Maybe this has something to do with garbage collection?

                 Rate with_assign  with_alloc      binary     without
with_assign    1585/s          --        -25%        -99%       -100%
with_alloc     2099/s         32%          --        -99%       -100%
binary       166127/s      10384%       7814%          --        -95%
without     3092012/s     195030%     147192%       1761%          --

Leave a comment

About Ovid

user-pic Have Perl; Will Travel. Freelance Perl/Testing/Agile consultant. Photo by http://www.circle23.com/. Warning: that site is not safe for work. The photographer is a good friend of mine, though, and it's appropriate to credit his work.