Perfect Hashes and faster than memcmp
In my previous post about perlcc next steps I talked briefly about my current project, Perfect::Hash.
# generate c file for readonly lookup
phash keyfile --prefix=phash -nul
# pure-perl usage
use Perfect::Hash;
my @dict = split/\n/,`cat /usr/share/dict/words`;
my $ph = Perfect::Hash->new(\@dict, -minimal);
for (@ARGV) {
my $v = $ph->perfecthash($_);
print ($dict[$v] eq $_ ? "$_ at line ".($v+1)."\n" : "$_ not found\n");
}
Perfect::Hash->new("keyfile", '-urban', ...)->save_c;
# or just:
phash keyfile --urban
cc -O3 -msse4.2 phash.c ... -lz
phash /usr/share/dict/words --cmph-bdz_ph --nul
cc -O3 phash.c ... -lcmph
phash knows about several fast lookup algorithms for any readonly string-based maps (no integer maps yet), and will select the best method to store it as a lookup function in a C file, to be compiled with your project. Similar to gperf, only better. Also planned are exporters for other languages, to be able to store values in their native language-specific format, e.g. XS for perl-specific SVs, AVs or HVs.
We will use phash in order to optimize loading of static data, and for readonly hashes to replace cdb or sqlite databases or serialized caches. gperf was enhanced for libunistring to look up unicode data efficiently, compared to icu, which uses only straight C++ hashmaps.
Searching in constant tables is a remarkably common task, so I am constantly surprised that almost nobody uses optimized data structures and algorithms for it. I only know of google, which also uses consistent hashes to spread the load and keyspace across several servers, of apple for their speech apps, and of libunistring for unicode mappings. cdb e.g. could easily use perfect hashes instead of overly simple DJB hashes. Or mysql, which is mostly only usable as a glorified readonly concurrent hashtable, could be replaced by such tables. The update with regeneration and recompilation of such a table costs less than a typical update into mysql.
Perfect::Hash
Let's create some sample hash tables:
$ phash /usr/share/dict/words -hanovpp -size 20 -false-positives
$ cat phash.c
#include "phash.h"
#include <string.h>
#ifdef _MSC_VER
#define INLINE __inline
#else
#define INLINE inline
#endif
/* FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ */
static INLINE
unsigned phash_hash(unsigned d, const unsigned char *s, const int l) {
int i = 0;
if (!d) d = 0x01000193;
for (; i < l; i++) {
d = (d * 0x01000193) ^ *s++;
}
return d & 0xffffffff;
}
long phash_lookup(const unsigned char* s) {
unsigned char h;
signed char d;
char v;
/* hash indices, direct < 0, indirect > 0 */
static const signed char G[] = {
-16, 1, 1,-15, 0, 0, 0,-14, 1,-13,-12, 1, 3,-10, 0, -8,
0, -7, 0, -2,
};
/* values */
static const unsigned char V[] = {
0, 9, 17, 6, 12, 13, 10, 14, 18, 5, 8, 15, 4, 2, 11, 16,
1, 7, 3, 19,
};
long l = strlen((const char *)s);
h = (unsigned char)(phash_hash(0, s, l) % 20);
d = G[h];
v = d < 0
? V[(unsigned char)-d-1]
: d == 0
? V[h]
: V[(unsigned char)(phash_hash(d, s, l) % 20)];
return v;
}
This creates an interim lookup array G with negative indices for direct lookups into the values array V, and positive indices for a second hash-based lookup into V, where the positive index is the hash seed > 0. So we have typically 1.5 hash function calls per lookup. Generation of such a table is linear in time; intermediate and resulting space is 2n, one array[n] for G and another for the values. But note that this lookup will only work if we know in advance that the key is a member of the perfect hash. That's why we can use the option -false-positives.
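For illustration, here is a minimal sketch of a caller, assuming the generated phash.c and phash.h declare phash_lookup as above:

/* minimal caller sketch for the generated lookup above */
#include <stdio.h>
#include "phash.h"

int main(int argc, char **argv) {
    int i;
    for (i = 1; i < argc; i++) {
        long v = phash_lookup((const unsigned char *)argv[i]);
        /* with -false-positives this v is only meaningful if argv[i]
           really was in the original key set */
        printf("%s => %ld\n", argv[i], v);
    }
    return 0;
}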
In order to use arbitrary keys for the lookup we omit -false-positives, and then we need to add a final check whether the found key is really contained in the stored perfect hash. So we need 3n space, one more array for the keys, plus one more full memcmp per lookup.
We will also add the option -nul to be able to store binary keys with embedded NUL characters. Note that using \0 in keyfiles is quite tricky.
$ phash /usr/share/dict/words -hanovpp -size 20 -nul
$ tail -n30 phash.c
long phash_lookup(const unsigned char* s, int l) {
unsigned char h;
signed char d;
char v;
/* hash indices, direct < 0, indirect > 0 */
static const signed char G[] = {
-16, 1, 1,-15, 0, 0, 0,-14, 1,-13,-12, 1, 3,-10, 0, -8,
0, -7, 0, -2,
};
/* values */
static const unsigned char V[] = {
0, 9, 17, 6, 12, 13, 10, 14, 18, 5, 8, 15, 4, 2, 11, 16,
1, 7, 3, 19,
};
/* keys */
static const char* K[] = {
"A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
"AOL","AOL's","ASCII's","ASL's","ATM's","ATP's",
"AWOL's","AZ's","AZT's","Aachen",
};
h = (unsigned char)(phash_hash(0, s, l) % 20);
d = G[h];
v = d < 0
? V[(unsigned char)-d-1]
: d == 0
? V[h]
: V[(unsigned char)(phash_hash(d, s, l) % 20)];
if (memcmp(K[v],s,l)) v = -1;
return v;
}
The differences are the additional storage of the keys and the final

if (memcmp(K[v],s,l)) v = -1;

Note that this data structure is optimized to work with zero-based integer values. For arbitrary values the check would be a bit different and we would need one more array for the values, as sketched below. gperf creates an array of key + value structs and the values can be of any type.
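A possible shape of such a lookup with arbitrary values, where Values[] and phash_get are illustrative names of mine, not generated code:

/* sketch: returning arbitrary values instead of zero-based indices;
   Values[] and phash_get are illustrative, not generated code */
static const char *Values[] = { "first", "second", /* ... one per key */ };

const char *phash_get(const unsigned char *s, int l) {
    long i = phash_lookup(s, l);   /* index or -1, as above */
    return i < 0 ? NULL : Values[i];
}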
This -hanovpp method - pp stands for pure-perl, hanov for Steve Hanov, who wrote a simple python script at http://stevehanov.ca/blog/index.php?id=119 based on the cmph CHD algorithm - is the default method to create arbitrarily sized perfect hashes, and it needs no external library. It creates pure, independent and pretty fast C code.
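To make the construction concrete, here is a condensed C sketch of the CHD-style builder idea (my illustration, not Perfect::Hash's actual code; the real generator also narrows G to the smallest integer type that fits, as in the signed char array above):

/* condensed sketch of the CHD-style construction behind -hanovpp;
   illustrative only. fnv is the hash shown above, over NUL-terminated keys. */
#include <stdlib.h>
#include <string.h>

static unsigned fnv(unsigned d, const char *s) {
    if (!d) d = 0x01000193;
    while (*s) d = (d * 0x01000193) ^ (unsigned char)*s++;
    return d;
}

typedef struct { int size, b; } ord_t;
static int by_size_desc(const void *x, const void *y) {
    return ((const ord_t *)y)->size - ((const ord_t *)x)->size;
}

/* fills G[n] for n distinct keys; slot i of V is taken once used[i] is set */
void build_g(const char **keys, int n, int *G) {
    int i, b, k, bi, slot = 0;
    int *bsize = calloc(n, sizeof(int)), *fill = calloc(n, sizeof(int));
    int **bkeys = malloc(n * sizeof(int *));
    int *tmp = malloc(n * sizeof(int));
    char *used = calloc(n, 1);
    ord_t *ord = malloc(n * sizeof(ord_t));
    memset(G, 0, n * sizeof(int));      /* 0 = empty bucket */

    /* 1. distribute key indices into buckets by the seedless hash */
    for (i = 0; i < n; i++) bsize[fnv(0, keys[i]) % n]++;
    for (b = 0; b < n; b++) bkeys[b] = malloc(bsize[b] * sizeof(int));
    for (i = 0; i < n; i++) {
        b = fnv(0, keys[i]) % n;
        bkeys[b][fill[b]++] = i;
    }
    /* 2. process buckets largest first */
    for (b = 0; b < n; b++) { ord[b].size = bsize[b]; ord[b].b = b; }
    qsort(ord, n, sizeof(ord_t), by_size_desc);

    /* 3. multi-key buckets: search a seed d mapping all keys to free slots */
    for (bi = 0; bi < n && ord[bi].size > 1; bi++) {
        int m = ord[bi].size;
        unsigned d = 1;
        b = ord[bi].b;
        for (;;) {
            for (k = 0; k < m; k++) {
                int j, s2 = fnv(d, keys[bkeys[b][k]]) % n, dup = used[s2];
                for (j = 0; j < k && !dup; j++) dup = (tmp[j] == s2);
                if (dup) break;
                tmp[k] = s2;
            }
            if (k == m) break;          /* all keys placed collision-free */
            d++;                        /* collision: retry with next seed */
        }
        G[b] = (int)d;                  /* positive seed => indirect lookup */
        for (k = 0; k < m; k++) used[tmp[k]] = 1;
        /* the value for keys[bkeys[b][k]] would be stored in V[tmp[k]] */
    }
    /* 4. single-key buckets take the remaining free slots directly */
    for (; bi < n && ord[bi].size == 1; bi++) {
        while (used[slot]) slot++;
        used[slot] = 1;
        G[ord[bi].b] = -slot - 1;       /* negative => direct index into V */
    }
    /* frees omitted for brevity */
}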
Fast CRC32
The optimized versions -hanov and -urban use the typically faster hardware-accelerated iSCSI CRC32 function from zlib, which uses only 1 CPU cycle per word, but requires zlib on the system, and is only fast if zlib is fast. You cannot beat 1 cycle, but some zlib versions only use the slow software fallback version for CRC32, which is mostly slower than the simple FNV hash function used with -hanovpp.
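For reference, the hardware version boils down to the SSE4.2 CRC32 instruction. A minimal sketch of mine, assuming an x86-64 target compiled with -msse4.2 (not zlib's actual code):

/* sketch: hardware iSCSI CRC32-C via SSE4.2 intrinsics */
#include <nmmintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

uint32_t crc32c_hw(uint32_t crc, const unsigned char *s, size_t len) {
    uint64_t c = crc;
    while (len >= 8) {            /* one 8-byte word per instruction */
        uint64_t w;
        memcpy(&w, s, 8);         /* safe unaligned load */
        c = _mm_crc32_u64(c, w);
        s += 8; len -= 8;
    }
    while (len--)                 /* byte-wise tail */
        c = _mm_crc32_u8((uint32_t)c, *s++);
    return (uint32_t)c;
}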
Note that faster wordsize-optimized FNV variants are "presented" at http://www.sanmayce.com/Fastest_Hash/index.html#TRIADvsCRC (beware of eye-bleeding though).
gperf
gperf might be able to create even faster and better hashes, because it doesn't fully hash each key: it rather creates a custom, simpler hash function for each key set. It just adds some characters at certain indices until you get different values for each key. Most other methods only create custom hash seeds for each keyset. But gperf cannot reliably create stable hash functions and will either time out on certain keysets or create bad hash tables.
The major variants of phash -gperf include the options -pic and -switch. -pic creates a lookup function optimized for shared libraries, and -switch creates a nested switch statement.
-switch and fast memcmp
gperf --switch=2 is by far worse than my optimized version called phash -switch. With gperf you have to manually specify the level of nested switches; with phash -switch the nesting level is chosen automatically. But the biggest trick is a special memcmp optimization for shorter strings.
The old gperf --switch=n
variant:
$ gperf examples/words20 --switch=2
... special hash ...
const char *
in_word_set (str, len)
register const char *str;
register unsigned int len;
{
if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
register int key = hash (str, len);
if (key <= MAX_HASH_VALUE && key >= MIN_HASH_VALUE) {
register const char *resword;
if (key < 16) {
switch (key - 1) {
case 0: resword = "A"; goto compare;
case 2: resword = "A's"; goto compare;
case 3: resword = "AM's"; goto compare;
case 4: resword = "ATM's"; goto compare;
case 5: resword = "Aachen"; goto compare;
case 8: resword = "AZ's"; goto compare;
case 9: resword = "AZT's"; goto compare;
case 10: resword = "AWOL's"; goto compare;
case 13: resword = "AC's"; goto compare;
case 14: resword = "ASL's"; goto compare;
}
} else {
switch (key - 16) {
case 0: resword = "ACTH's"; goto compare;
case 1: resword = "ASCII's"; goto compare;
case 2: resword = "AOL"; goto compare;
case 3: resword = "AI's"; goto compare;
case 4: resword = "AOL's"; goto compare;
case 5: resword = "AIDS's"; goto compare;
case 8: resword = "AB's"; goto compare;
case 9: resword = "ABM's"; goto compare;
case 13: resword = "AA's"; goto compare;
case 14: resword = "ATP's"; goto compare;
}
}
return 0;
compare:
if (*str == *resword && !strcmp (str + 1, resword + 1))
return resword;
}
}
return 0;
}
vs. the new separate phash -switch method:
$ phash examples/words20 -switch
$ cat phash.c
#include "phash.h"
#include <string.h>
long phash_lookup(const unsigned char* s, int l) {
switch (l) {
case 1:
return *(s) == 'A' ? 0 : -1;
case 3:
switch (s[1]) { /* A's, AOL */
case 39:
if (*(short*)s == (short)0x2741 /* A's */
&& *(&s[2]) == 's') return 1;
break;
case 'O':
if (*(short*)s == (short)0x4f41 /* AOL */
&& *(&s[2]) == 'L') return 10;
default:
return -1;
}
return -1;
case 4:
switch (s[1]) { /* AM's, AB's, AZ's, AA's, AI's, AC's */
case 'A':
if (*(int*)s == (int)0x73274141 /* AA's */) return 2;
break;
case 'B':
if (*(int*)s == (int)0x73274241 /* AB's */) return 3;
break;
case 'C':
if (*(int*)s == (int)0x73274341 /* AC's */) return 5;
break;
case 'I':
if (*(int*)s == (int)0x73274941 /* AI's */) return 7;
break;
case 'M':
if (*(int*)s == (int)0x73274d41 /* AM's */) return 9;
break;
case 'Z':
if (*(int*)s == (int)0x73275a41 /* AZ's */) return 17;
default:
return -1;
}
return -1;
case 5:
switch (s[1]) { /* ATP's, AOL's, ASL's, ATM's, ABM's, AZT's */
case 'B':
if (*(int*)s == (int)0x274d4241 /* ABM's */
&& *(&s[4]) == 's') return 4;
break;
case 'O':
if (*(int*)s == (int)0x274c4f41 /* AOL's */
&& *(&s[4]) == 's') return 11;
break;
case 'S':
if (*(int*)s == (int)0x274c5341 /* ASL's */
&& *(&s[4]) == 's') return 13;
break;
case 'T':
if (*(int*)s == (int)0x27505441 /* ATP's */
&& *(&s[4]) == 's') return 15;
if (*(int*)s == (int)0x274d5441 /* ATM's */
&& *(&s[4]) == 's') return 14;
break;
case 'Z':
if (*(int*)s == (int)0x27545a41 /* AZT's */
&& *(&s[4]) == 's') return 18;
default:
return -1;
}
return -1;
case 6:
switch (s[1]) { /* AIDS's, AWOL's, ACTH's, Aachen */
case 'C':
if (*(int*)s == (int)0x48544341 /* ACTH's */
&& *(short*)&s[4] == (short)0x7327 /* 's */) return 6;
break;
case 'I':
if (*(int*)s == (int)0x53444941 /* AIDS's */
&& *(short*)&s[4] == (short)0x7327 /* 's */) return 8;
break;
case 'W':
if (*(int*)s == (int)0x4c4f5741 /* AWOL's */
&& *(short*)&s[4] == (short)0x7327 /* 's */) return 16;
break;
case 'a':
if (*(int*)s == (int)0x68636141 /* Aachen */
&& *(short*)&s[4] == (short)0x6e65 /* en */) return 19;
default:
return -1;
}
return -1;
case 7:
return *(int*)s == (int)0x49435341 /* ASCII's */
&& *(short*)&s[4] == (short)0x2749 /* I's */
&& *(&s[6]) == 's' ? 12 : -1;
}
return -1; /* no key of this length */
}
So we switch here first on the key length, and then on the best discriminating character. The new switch method creates the nesting levels automatically for you: all cases with more than 3 comparisons are automatically changed to a separate switch on the best character. And because we know the length and the keys in advance, we can optimize memcmp away by doing word-size comparisons directly. This is 50% - 2000% faster than doing memcmp on each string. See the implementation of the helper function memcmp_const_str, the idea of which is sketched below. I heard of people doing this on every string shorter than 128 chars. So far I'm using only 36 as cutoff.
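The idea behind such a helper, shown for one constant key (my illustration; eq_const_AOLs is a hypothetical name, the generator emits the comparison inline as seen above):

/* sketch: comparing against a compile-time-known string with
   word-sized loads instead of memcmp; illustrative code only */
#include <stdint.h>
#include <string.h>

static int eq_const_AOLs(const unsigned char *s, int l) {
    uint32_t w;
    if (l != 5) return 0;          /* "AOL's" has length 5 */
    memcpy(&w, s, 4);              /* avoids unaligned-access UB */
    return w == 0x274c4f41u        /* "AOL'" on little-endian */
        && s[4] == 's';
}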
This is not a very interesting example; you need much bigger key sizes to see smarter switch nesting in effect. With this method you can beat even the best hash functions for certain key sets.
For example, with /usr/share/dict/words on debian (size=99171) we got from 0.307571 seconds down to 0.006666 seconds, while the fastest perfect hash needed 0.007391 seconds. See the benchmarks. In this special case the new memcmp optimization made the switch method about 46x faster.
Unfortunately we cannot use this new memcmp optimization for all hash tables. We need to know the string and the length to compare in advance, but the key and its length are parameters of the lookup method. This only works for this static switch table.
Pearson
There exists another special hashing method, called Pearson hashing. This creates a short 256-byte "pearson" table of randomly shuffled bytes and uses this table to mix with all characters of the string to hash. Such byte-wise operations with a small lookup table are perfect for smaller machines or smaller keysets.
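The core of the method is tiny. A minimal sketch, where T is assumed to be a random permutation of 0..255:

/* minimal Pearson hash sketch: T must be a random permutation of 0..255 */
#include <stddef.h>

unsigned char pearson(const unsigned char *s, size_t l,
                      const unsigned char T[256]) {
    unsigned char h = 0;
    size_t i;
    for (i = 0; i < l; i++)
        h = T[h ^ s[i]];    /* one table lookup per input byte */
    return h;
}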
In my benchmarks -pearson wins with ~127 keys, but you need to be a bit lucky to find a good pearson randomization. phash -pearson tries for at most 60 seconds to find the best random table.
Small dictionary with 127 keys
Method *lookup* generate compile c size exesize options
hanovpp 0.001247 0.001904 0.069912 3454 11575
hanov 0.001231 0.000946 0.070709 3200 11629
urban 0.001298 0.001020 0.072471 3200 11629
pearson 0.001178 0.068758 0.072311 12354 19240
pearsonnp 0.001253 0.018204 0.072419 12386 19259
cmph-bdz_ph 0.001271 0.000412 0.066264 335 10009
cmph-bdz 0.001317 0.000403 0.066471 424 10033
cmph-bmz 0.001277 0.000539 0.066814 2431 10577
cmph-chm 0.001325 0.000646 0.067026 4356 11057
cmph-fch 0.001372 0.000511 0.066843 942 10177
cmph-chd_ph 0.001310 0.000361 0.066292 418 10033
cmph-chd 0.001323 0.000343 0.066428 735 10129
switch 0.001190 0.000173 0.131323 12517 17450
gperf 0.001199 0.002180 0.064883 8987 12574
The generated code looks like this:
$ phash examples/words20 -pearson
$ cat phash.c
#include "phash.h"
#include <string.h>
long phash_lookup(const unsigned char* s) {
int l = strlen((const char *)s);
long h = 0;
const char *key = (const char *)s;
static unsigned char phash[] = {
136,176, 67, 19,170, 84, 87,228, 86, 92,100,214,208,201,160,126,
30,192, 78,189,232,244,142,252,253,147, 34,251,156, 38, 62,141,
20, 69, 88, 36,241,193, 77,162, 48,247,181,234,194,125, 98,135,
127, 61,204,180,205,133,132,145, 10,146,229, 18,198, 37, 73,139,
70,131,249, 51,223,235, 97,202,120,165, 16, 7,186, 89,243, 24,
216, 43, 45,182, 0, 94, 2,248, 39,188,255,240,219, 96,220,217,
168,130, 1,190, 85, 75, 44,169, 55,144, 58,224,221,239, 12,159,
211, 90, 47, 72, 33, 8, 26,231, 56, 57,155, 59, 3,254, 66,185,
27, 4, 6,102,101,150, 63,153, 52, 81,197,105,161, 9,140, 11,
207,148,103,199,183, 49,213,143,149,138, 99,117,167, 83,171,175,
218, 53, 80,113,106,250,124,137,109,104, 22,242,107,134,166, 54,
14,110,164,236,187,203,237,154,128, 31, 15, 74,179,230,158,177,
115,210,114, 5,226,196,212, 60,116, 21,157, 32,112,238, 17, 41,
68,184, 93, 23,245,173,129,172, 25, 79,233,178, 28,121, 40,246,
119, 50, 13, 65,163,222,206, 95,225,122, 71,111, 46, 35,195, 82,
42,152, 76, 91,151,118, 29,209,174,227,108, 64,200,215,123,191,
};
/* collisions: keys and values */
static const int Cv_1[] = {0};
static const char *Ck_2[] = {"ABM's","AC's", };
static const int Cv_2[] = {4,5, };
static const int Cv_3[] = {7};
static const int Cv_4[] = {3};
static const int Cv_5[] = {2};
static const int Cv_6[] = {8};
static const int Cv_7[] = {1};
static const int Cv_8[] = {6};
static const int Cv_9[] = {9};
/* collision keys */
static const char *Ck[] = { 0, 0,
(void*)&Ck_2, 0, 0, 0, 0, 0, 0, 0, };
/* collision values and direct values */
static const int *Cv[] = { 0,
(void*)&Cv_1,
(void*)&Cv_2,
(void*)&Cv_3,
(void*)&Cv_4,
(void*)&Cv_5,
(void*)&Cv_6,
(void*)&Cv_7,
(void*)&Cv_8,
(void*)&Cv_9, };
/* size of collisions */
static const unsigned char Cs[] = { 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, };
/* keys */
static const char* K[] = {
"A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
};
unsigned char c;
for (c=*s++; c; c=*s++) {
h = phash[(h ^ c) % 256];
}
h = h % 10;
if (Cs[h] > 1) {
const char **ck = (const char **)Ck[h];
int i = 0;
for (; i < Cs[h]; i++) {
if (!memcmp(ck[i], key, l)) return Cv[h][i];
}
}
else if (Cs[h] == 1) {
h = Cv[h][0];
}
if (memcmp(K[h], key, l)) h = -1;
return h;
}
We see that -pearson did not generate a perfect hash; it rather created a static collision table. But -pearson tries to minimize the collision costs for us by trying random pearson tables for a maximum of 60 seconds and then stores the best variant. In this example we only got one colliding bucket with 2 values.
There are 2 more variants of -pearson: -pearsonnp, "np" for non-perfect, and -pearson8, which is the strict perfect hash variant with an 8-bit table. -pearsonnp will use a larger pearson lookup table to create fewer collisions, and -pearson8 will most likely fail to create a proper hash table. I'm also still experimenting with other pearson variants for bigger tables.
CMPH
The currently supported external variants (which require an external library even at run-time) use the cmph library. They are a bit slower than the other methods, but only a bit. They allow much bigger key sizes, with dictionaries which might even exceed your main memory; there even exists a variant to use tables bigger than the available main memory.
$ phash examples/words20 -cmph-bdz_ph
$ cat phash.c
#include "phash.h"
#include <string.h>
#include "cmph.h"
long phash_lookup(const unsigned char* s) {
static unsigned char *packed_mphf = "\006\000\000\000\000\000\000\000\r\000\000\000\t\000\000"
    "\000\b\177\035\024\033\003\000";
return cmph_search_packed(packed_mphf, (const char*)s, strlen(s));
}
You need to link this file with -lcmph
.
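For reference, that packed buffer is what the cmph C API produces at generation time. A minimal sketch of the build side, assuming libcmph's vector adapter (my illustration, not phash's actual generator code):

/* sketch: building a packed BDZ_PH function with the cmph C API */
#include <cmph.h>
#include <stdlib.h>

char *build_packed_mphf(char **keys, unsigned nkeys, cmph_uint32 *size) {
    cmph_io_adapter_t *source = cmph_io_vector_adapter(keys, nkeys);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ_PH);
    cmph_t *mphf = cmph_new(config);
    cmph_config_destroy(config);

    *size = cmph_packed_size(mphf);
    char *packed = malloc(*size);
    cmph_pack(mphf, packed);       /* this buffer is what phash dumps as
                                      the packed_mphf string above */
    cmph_destroy(mphf);
    cmph_io_vector_adapter_destroy(source);
    return packed;
}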
Summary
Creating perfect hashes is easy,
they are grossly underused,
they are fast,
and in certain cases fast data tricks - word-sized cmp instead of memcmp - beat better algorithms.