Perfect Hashes and faster than memcmp

In my previous post about perlcc next steps I talked shortly about my current project, Perfect::Hash.

# generate c file for readonly lookup
phash keyfile --prefix=phash -nul

# pure-perl usage
use Perfect::Hash;
my @dict = split/\n/,`cat /usr/share/dict/words`;
my $ph = Perfect::Hash->new(\@dict, -minimal);
for (@ARGV) {
  my $v = $ph->perfecthash($_);
  print ($dict[$v] eq $_ ? "$_ at line ".$v+1."\n" : "$_ not found\n");
}

Perfect::Hash->new("keyfile", '-urban', ...)->save_c;
# or just:
phash keyfile --urban
cc -O3 -msse4.2 phash.c ... -lz

phash /usr/share/dict/words --cmph-bdz_ph --nul
cc -O3 phash.c ... -lcmph

phash knows about several fast lookup algorithms for any readonly string-based maps (no integer maps yet), and will select the best method to store it as a lookup method in a C file to compile it with your project. Such as with gperf, only better. Planned are also exporters for other languages, to be able to store values in their native language-specific format. E.g. XS for perl-specific SVs, AVs or HVs.

In order to optimize loading of static data, readonly hashes to replace cdb or sqlite databases or serialized caches we will use phash. gperf was enhanced for libunistring, to lookup unicode data efficiently, compared to icu which uses only straight C++ hashmaps. Searching in constant tables is quite a remarkably common task, so I am constantly suprised why nobody uses optimized data structures and algorithms to use it. I only know of google which also use consistent hashes to spread the load and keyspace across several servers, apple for their speach apps, and libunistring for unicode mappings. cdb e.g. could easily use perfect hashes instead of overly simple DJB hashes. Or mysql which is mostly only usable as glorified readonly concurrent hashtable could be replaced by such tables. The update with regeneration and recompilation for such a table costs less then a typical update into mysql.

Perfect::Hash

Let's create some sample hash tables:

$ phash /usr/share/dict/words -hanovpp -size 20 -false-positives
$ cat phash.c

#include "phash.h"
#include <string.h>

#ifdef _MSC_VER
#define INLINE __inline
#else
#define INLINE inline
#endif

/* FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ */
static INLINE
unsigned phash_hash(unsigned d, const unsigned char *s, const int l) {
    int i = 0;
    if (!d) d = 0x01000193;
    for (; i < l; i++) {
        d = (d * 0x01000193) ^ *s++;
    }
    return d & 0xffffffff;
}

long phash_lookup(const unsigned char* s) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    };
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    };
    long l = strlen(s);
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    return v;
}

This creates an interim lookup array G with negative indices for immediate in the values array V, and with positive indices for a second hash-based lookup into the values array V, where the index is the hash seed > 0. So we have typically 1.5 hash function calls per lookup. Generation of such a table is linear in time, intermediate and resulting space is 2n, one array[n] for G and another for the values. But note that this lookup will only work if we know in advance that the key is a member of the perfect hash. That's why we can use the option -false-positive.

In order to use arbitrary keys for the lookup, we omit -false-positives, we need to add a final check if the found key is really contained in the stored perfect hash. So we need 3n space, one more array for the keys also, and one more full memcmp.

We will also add the option -nul to be able to store binary keys, with embedded NUL characters. Note that using \0 in keyfiles is quite tricky.

$ phash /usr/share/dict/words -hanovpp -size 20 -nul
$ tail -n30 phash.c

long phash_lookup(const unsigned char* s, int l) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    };
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    };
    /* keys */
    static const char* K[] = {
        "A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
        "AOL","AOL's","ASCII's","ASL's","ATM's","ATP's",
        "AWOL's","AZ's","AZT's","Aachen",
    };
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    if (memcmp(K[v],s,l)) v = -1;
    return v;
}

The differences are the additional storage of the keys and the final if (memcmp(K[v],s,l)) v = -1;

Note that this data structure here is optimized to work with zero-based integer values. For arbitrary values the check would be a bit different and we would need one more array for the values. gperf creates an array of key + value structs and the values can be of any type.

This -hanovpp - pp stands for pure-perl, hanov for Steve Hanov, who wrote a simple python script at http://stevehanov.ca/blog/index.php?id=119 based on the cmph CHD algorithm - method is the default method to create arbitrary sized perfect hashes, and needs no external library. It creates pure and independent, pretty fast C code.

Fast CRC32

The optimized versions -hanov and -urban use the typically faster hardware accelerated iSCSI CRC32 function from zlib, which uses only 1 CPU cycle per word, but requires zlib on the system, and is only fast if zlib is fast. You cannot beat 1 cycle, but some zlib versions only use the slow software fallback version for CRC32, which is most slower than the simple FNV hash function used with -hanovpp.

Note that faster wordsize optimized FNV variants are "presented" at http://www.sanmayce.com/Fastest_Hash/index.html#TRIADvsCRC (Beware of eye-bleeding though)

gperf

gperf might be able to create even faster and better hashes, because it doesn't fully hash each key, it rather creates a custom simpler hash function for each key set. It just adds some characters at certain indices until you get different values for each key. Most other methods only create custom hash seeds for each keyset. But gperf can not reliably create stable hash functions and will either time out on certain keysets or will create bad hash tables.

The major variants of phash -gperf include the options -pic and -switch. -pic creates a lookup function optimized for shared libraries, and -switch creates a nested switch statement.

-switch and fast memcmp

gperf --switch=2 is by far worse than my optimized version called phash -switch With gperf you have to manually specify the level of nested switches, with phash -switch the nesting level is created automatically, but the biggest trick is a special memcmp optimization for shorter strings.

The old gperf --switch=n variant:

$ gperf examples/words20 --switch=2

... special hash ...

const char *
in_word_set (str, len)
     register const char *str;
     register unsigned int len;
{
  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
      register int key = hash (str, len);
      if (key <= MAX_HASH_VALUE && key >= MIN_HASH_VALUE) {
          register const char *resword;
          if (key < 16) {
              switch (key - 1) {
                  case 0:  resword = "A"; goto compare;
                  case 2:  resword = "A's"; goto compare;
                  case 3:  resword = "AM's"; goto compare;
                  case 4:  resword = "ATM's"; goto compare;
                  case 5:  resword = "Aachen"; goto compare;
                  case 8:  resword = "AZ's"; goto compare;
                  case 9:  resword = "AZT's"; goto compare;
                  case 10: resword = "AWOL's"; goto compare;
                  case 13: resword = "AC's"; goto compare;
                  case 14: resword = "ASL's"; goto compare;
                }
          } else {
              switch (key - 16) {
                  case 0: resword = "ACTH's"; goto compare;
                  case 1: resword = "ASCII's"; goto compare;
                  case 2: resword = "AOL"; goto compare;
                  case 3: resword = "AI's"; goto compare;
                  case 4: resword = "AOL's"; goto compare;
                  case 5: resword = "AIDS's"; goto compare;
                  case 8: resword = "AB's"; goto compare;
                  case 9: resword = "ABM's"; goto compare;
                  case 13: resword = "AA's"; goto compare;
                  case 14: resword = "ATP's"; goto compare;
              }
          }
          return 0;
        compare:
          if (*str == *resword && !strcmp (str + 1, resword + 1))
            return resword;
        }
    }
  return 0;
}

vs. the new seperate phash -switch method:

$ phash examples/words20 -switch
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s, int l) {
    switch (l) {
      case 1: 
        return *(s) == 'A' ? 0 : -1;
      case 3: 
        switch (s[1]) { /* A's, AOL */
          case 39:
            if (*(short*)s == (short)0x2741 /* A's */
        && *(&s[2]) == 's') return 1;
            break;
          case 'O':
            if (*(short*)s == (short)0x4f41 /* AOL */
        && *(&s[2]) == 'L') return 10;
          default:
            return -1;
        }
        return -1;
      case 4: 
        switch (s[1]) { /* AM's, AB's, AZ's, AA's, AI's, AC's */
          case 'A':
            if (*(int*)s == (int)0x73274141 /* AA's */) return 2;
            break;
          case 'B':
            if (*(int*)s == (int)0x73274241 /* AB's */) return 3;
            break;
          case 'C':
            if (*(int*)s == (int)0x73274341 /* AC's */) return 5;
            break;
          case 'I':
            if (*(int*)s == (int)0x73274941 /* AI's */) return 7;
            break;
          case 'M':
            if (*(int*)s == (int)0x73274d41 /* AM's */) return 9;
            break;
          case 'Z':
            if (*(int*)s == (int)0x73275a41 /* AZ's */) return 17;
          default:
            return -1;
        }
        return -1;
      case 5: 
        switch (s[1]) { /* ATP's, AOL's, ASL's, ATM's, ABM's, AZT's */
          case 'B':
            if (*(int*)s == (int)0x274d4241 /* ABM's */
        && *(&s[4]) == 's') return 4;
            break;
          case 'O':
            if (*(int*)s == (int)0x274c4f41 /* AOL's */
        && *(&s[4]) == 's') return 11;
            break;
          case 'S':
            if (*(int*)s == (int)0x274c5341 /* ASL's */
        && *(&s[4]) == 's') return 13;
            break;
          case 'T':
            if (*(int*)s == (int)0x27505441 /* ATP's */
        && *(&s[4]) == 's') return 15;
            if (*(int*)s == (int)0x274d5441 /* ATM's */
        && *(&s[4]) == 's') return 14;
            break;
          case 'Z':
            if (*(int*)s == (int)0x27545a41 /* AZT's */
        && *(&s[4]) == 's') return 18;
          default:
            return -1;
        }
        return -1;
      case 6: 
        switch (s[1]) { /* AIDS's, AWOL's, ACTH's, Aachen */
          case 'C':
            if (*(int*)s == (int)0x48544341 /* ACTH's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 6;
            break;
          case 'I':
            if (*(int*)s == (int)0x53444941 /* AIDS's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 8;
            break;
          case 'W':
            if (*(int*)s == (int)0x4c4f5741 /* AWOL's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 16;
            break;
          case 'a':
            if (*(int*)s == (int)0x68636141 /* Aachen */
        && *(short*)&s[4] == (short)0x6e65 /* en */) return 19;
          default:
            return -1;
        }
        return -1;
      case 7: 
        return *(int*)s == (int)0x49435341 /* ASCII's */
        && *(short*)&s[4] == (short)0x2749 /* I's */
        && *(&s[6]) == 's' ? 12 : -1;
    }
}

So we switch here first on the key length, and then on the best key. The new switch method creates the nesting level automatically for you, all cases with more than 3 comparisons are automatically changed to a separate switch on the best character. And because we know the length an the keys in advance we can optimize memcmp away by doing word-size comparison directly. This is 50% - 2000% faster then doing memcmp on each string. See the implementation of the helper function memcmp_const_str I heard of people doing this on every string shorter than 128 chars. So far I'm using only 36 as cutoff.

This is not a very interesting example, you need much bigger keysizes to see smarter switch nesting in effect. With this method you can beat even the best hash functions for certain key sets.

For example we got with /usr/share/dict/words on debian with size=99171 from 0.307571 seconds down to 0.006666 seconds, while the fastest perfect hash needed 0.007391 seconds. See the benchmarks. In this special case it was 200x faster with the new memcmp optimization method.

Unfortunately we cannot use this new memcmp optimization method for all hash tables. We need to know one string and the length to compare in advance, and the key and length is a parameter of the lookup method. This only works for this static switch table.

Pearson

There exists another special hashing method, called Pearson hashing. This creates a short 256 bytes "pearson" table of randomized shuffled bytes and uses this table to mix with all characters of the string to hash. Such byte-wise operation with a small lookup table are perfect for smaller machines or smaller keysets.

In my benchmarks -pearson wins with ~127 keys, but you need to be a bit lucky to find a good pearson randomization. phash -pearson tries for max 60 seconds to find the best random table.

Small dictionary with 127 keys

Method       *lookup*  generate compile   c size   exesize  options

hanovpp      0.001247  0.001904 0.069912     3454    11575
hanov        0.001231  0.000946 0.070709     3200    11629
urban        0.001298  0.001020 0.072471     3200    11629
pearson      0.001178  0.068758 0.072311    12354    19240
pearsonnp    0.001253  0.018204 0.072419    12386    19259
cmph-bdz_ph  0.001271  0.000412 0.066264      335    10009
cmph-bdz     0.001317  0.000403 0.066471      424    10033
cmph-bmz     0.001277  0.000539 0.066814     2431    10577
cmph-chm     0.001325  0.000646 0.067026     4356    11057
cmph-fch     0.001372  0.000511 0.066843      942    10177
cmph-chd_ph  0.001310  0.000361 0.066292      418    10033
cmph-chd     0.001323  0.000343 0.066428      735    10129
switch       0.001190  0.000173 0.131323    12517    17450
gperf        0.001199  0.002180 0.064883     8987    12574

The generated code looks like this:

$ phash examples/words20 -pearson
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s) {
    int l = strlen(s);
    long h = 0;
    const char *key = s;
    static unsigned char phash[] = {
        136,176, 67, 19,170, 84, 87,228, 86, 92,100,214,208,201,160,126,
         30,192, 78,189,232,244,142,252,253,147, 34,251,156, 38, 62,141,
         20, 69, 88, 36,241,193, 77,162, 48,247,181,234,194,125, 98,135,
        127, 61,204,180,205,133,132,145, 10,146,229, 18,198, 37, 73,139,
         70,131,249, 51,223,235, 97,202,120,165, 16,  7,186, 89,243, 24,
        216, 43, 45,182,  0, 94,  2,248, 39,188,255,240,219, 96,220,217,
        168,130,  1,190, 85, 75, 44,169, 55,144, 58,224,221,239, 12,159,
        211, 90, 47, 72, 33,  8, 26,231, 56, 57,155, 59,  3,254, 66,185,
         27,  4,  6,102,101,150, 63,153, 52, 81,197,105,161,  9,140, 11,
        207,148,103,199,183, 49,213,143,149,138, 99,117,167, 83,171,175,
        218, 53, 80,113,106,250,124,137,109,104, 22,242,107,134,166, 54,
         14,110,164,236,187,203,237,154,128, 31, 15, 74,179,230,158,177,
        115,210,114,  5,226,196,212, 60,116, 21,157, 32,112,238, 17, 41,
         68,184, 93, 23,245,173,129,172, 25, 79,233,178, 28,121, 40,246,
        119, 50, 13, 65,163,222,206, 95,225,122, 71,111, 46, 35,195, 82,
         42,152, 76, 91,151,118, 29,209,174,227,108, 64,200,215,123,191,

    };
    /* collisions: keys and values */
    static const int   Cv_1[] = {0};
    static const char *Ck_2[] = {"ABM's","AC's", };
    static const int   Cv_2[] = {4,5, };
    static const int   Cv_3[] = {7};
    static const int   Cv_4[] = {3};
    static const int   Cv_5[] = {2};
    static const int   Cv_6[] = {8};
    static const int   Cv_7[] = {1};
    static const int   Cv_8[] = {6};
    static const int   Cv_9[] = {9};
    /* collision keys */
    static const char *Ck[] = { 0, 0, 
        (void*)&Ck_2, 0, 0, 0, 0, 0, 0, 0, };
    /* collision values and direct values */
    static const int *Cv[] = { 0, 
      (void*)&Cv_1, 
      (void*)&Cv_2, 
      (void*)&Cv_3, 
      (void*)&Cv_4, 
      (void*)&Cv_5, 
      (void*)&Cv_6, 
      (void*)&Cv_7, 
      (void*)&Cv_8, 
      (void*)&Cv_9, };
    /* size of collisions */
    static const unsigned char Cs[] = { 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, };
    /* keys */
    static const char* K[] = {
        "A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
    };
    unsigned char c;
    for (c=*s++; c; c=*s++) {
        h = phash[(h ^ c) % 256];
    }
    h = h % 10;
    if (Cs[h] > 1) {
      const char **ck = (const char **)Ck[h];
      int i = 0;
      for (; i < Cs[h]; i++) {
        if (!memcmp(ck[i], key, l)) return Cv[h][i];
      }
    }
    else if (Cs[h] == 1) {
      h = Cv[h][0];
    }
    if (memcmp(K[h], key, l)) h = -1;
    return h;
}

We see that -pearson did not generate a perfect hash, it rather created a static collision table. But -pearson tries to minimize the collision costs for us by trying random pearson tables for a maximum time of 60 seconds and then stores the best variant. In this example here we only got one colliding bucket with 2 values.

There are 2 variants for -pearson, called -pearsonnp, "np" for non-perfect, and -pearson8 which is the strict perfect hash variant and a 8 bit table. -pearsonnp will use a larger pearson lookup table to create less collisions, and -pearson8 will most likely fail to create a proper hash table. I'm also still experimenting with other pearson variants for bigger tables.

CMPH

The currently supported external variants (which require an external library even at run-time) use the cmph library. They are a bit slower than the other methods, but only a bit. They allow much bigger key sizes, with dictionaries which might even exceed your main memory, and there exists even an variant to use tables which exceed the available main memory.

$ phash examples/words20 -cmph-bdz_ph
$ cat phash.c
#include "phash.h"
#include <string.h>
#include "cmph.h"

long phash_lookup(const unsigned char* s) {
    static unsigned char *packed_mphf = "\006\000\000\000\000\000\000\000\r\000\000\000\t\000\000
    \000\b\177\035\024\033\003\000";
    return cmph_search_packed(packed_mphf, (const char*)s, strlen(s));
}

You need to link this file with -lcmph.

Summary

Creating perfect hashes is easy,

they are grossly underused,

they are fast,

and in certain cases fast data tricks - word-sized cmp instead of memcmp - beat better algorithms.

1 Comment

Very nice. Thanks for making useful open source software and writing about it.

Leave a comment

About Reini Urban

user-pic Working at cPanel on B::C (the perl-compiler), p2, types, parrot, B::Generate, cygwin perl and more guts (LLVM, jit, optimizations).