Perfect Hashes and faster than memcmp

In my previous post about perlcc next steps I talked shortly about my current project, Perfect::Hash.

# generate c file for readonly lookup
phash keyfile --prefix=phash -nul

# pure-perl usage
use Perfect::Hash;
my @dict = split/\n/,`cat /usr/share/dict/words`;
my $ph = Perfect::Hash->new(\@dict, -minimal);
for (@ARGV) {
  my $v = $ph->perfecthash($_);
  print ($dict[$v] eq $_ ? "$_ at line ".$v+1."\n" : "$_ not found\n");
}

Perfect::Hash->new("keyfile", '-urban', ...)->save_c;
# or just:
phash keyfile --urban
cc -O3 -msse4.2 phash.c ... -lz

phash /usr/share/dict/words --cmph-bdz_ph --nul
cc -O3 phash.c ... -lcmph

phash knows about several fast lookup algorithms for any readonly string-based maps (no integer maps yet), and will select the best method to store it as a lookup method in a C file to compile it with your project. Such as with gperf, only better. Planned are also exporters for other languages, to be able to store values in their native language-specific format. E.g. XS for perl-specific SVs, AVs or HVs.

In order to optimize loading of static data, readonly hashes to replace cdb or sqlite databases or serialized caches we will use phash. gperf was enhanced for libunistring, to lookup unicode data efficiently, compared to icu which uses only straight C++ hashmaps. Searching in constant tables is quite a remarkably common task, so I am constantly suprised why nobody uses optimized data structures and algorithms to use it. I only know of google which also use consistent hashes to spread the load and keyspace across several servers, apple for their speach apps, and libunistring for unicode mappings. cdb e.g. could easily use perfect hashes instead of overly simple DJB hashes. Or mysql which is mostly only usable as glorified readonly concurrent hashtable could be replaced by such tables. The update with regeneration and recompilation for such a table costs less then a typical update into mysql.

Perfect::Hash

Let's create some sample hash tables:

$ phash /usr/share/dict/words -hanovpp -size 20 -false-positives
$ cat phash.c

#include "phash.h"
#include <string.h>

#ifdef _MSC_VER
#define INLINE __inline
#else
#define INLINE inline
#endif

/* FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ */
static INLINE
unsigned phash_hash(unsigned d, const unsigned char *s, const int l) {
    int i = 0;
    if (!d) d = 0x01000193;
    for (; i < l; i++) {
        d = (d * 0x01000193) ^ *s++;
    }
    return d & 0xffffffff;
}

long phash_lookup(const unsigned char* s) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    };
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    };
    long l = strlen(s);
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    return v;
}

This creates an interim lookup array G with negative indices for immediate in the values array V, and with positive indices for a second hash-based lookup into the values array V, where the index is the hash seed > 0. So we have typically 1.5 hash function calls per lookup. Generation of such a table is linear in time, intermediate and resulting space is 2n, one array[n] for G and another for the values. But note that this lookup will only work if we know in advance that the key is a member of the perfect hash. That's why we can use the option -false-positive.

In order to use arbitrary keys for the lookup, we omit -false-positives, we need to add a final check if the found key is really contained in the stored perfect hash. So we need 3n space, one more array for the keys also, and one more full memcmp.

We will also add the option -nul to be able to store binary keys, with embedded NUL characters. Note that using \0 in keyfiles is quite tricky.

$ phash /usr/share/dict/words -hanovpp -size 20 -nul
$ tail -n30 phash.c

long phash_lookup(const unsigned char* s, int l) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    };
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    };
    /* keys */
    static const char* K[] = {
        "A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
        "AOL","AOL's","ASCII's","ASL's","ATM's","ATP's",
        "AWOL's","AZ's","AZT's","Aachen",
    };
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    if (memcmp(K[v],s,l)) v = -1;
    return v;
}

The differences are the additional storage of the keys and the final if (memcmp(K[v],s,l)) v = -1;

Note that this data structure here is optimized to work with zero-based integer values. For arbitrary values the check would be a bit different and we would need one more array for the values. gperf creates an array of key + value structs and the values can be of any type.

This -hanovpp - pp stands for pure-perl, hanov for Steve Hanov, who wrote a simple python script at http://stevehanov.ca/blog/index.php?id=119 based on the cmph CHD algorithm - method is the default method to create arbitrary sized perfect hashes, and needs no external library. It creates pure and independent, pretty fast C code.

Fast CRC32

The optimized versions -hanov and -urban use the typically faster hardware accelerated iSCSI CRC32 function from zlib, which uses only 1 CPU cycle per word, but requires zlib on the system, and is only fast if zlib is fast. You cannot beat 1 cycle, but some zlib versions only use the slow software fallback version for CRC32, which is most slower than the simple FNV hash function used with -hanovpp.

Note that faster wordsize optimized FNV variants are "presented" at http://www.sanmayce.com/Fastest_Hash/index.html#TRIADvsCRC (Beware of eye-bleeding though)

gperf

gperf might be able to create even faster and better hashes, because it doesn't fully hash each key, it rather creates a custom simpler hash function for each key set. It just adds some characters at certain indices until you get different values for each key. Most other methods only create custom hash seeds for each keyset. But gperf can not reliably create stable hash functions and will either time out on certain keysets or will create bad hash tables.

The major variants of phash -gperf include the options -pic and -switch. -pic creates a lookup function optimized for shared libraries, and -switch creates a nested switch statement.

-switch and fast memcmp

gperf --switch=2 is by far worse than my optimized version called phash -switch With gperf you have to manually specify the level of nested switches, with phash -switch the nesting level is created automatically, but the biggest trick is a special memcmp optimization for shorter strings.

The old gperf --switch=n variant:

$ gperf examples/words20 --switch=2

... special hash ...

const char *
in_word_set (str, len)
     register const char *str;
     register unsigned int len;
{
  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
      register int key = hash (str, len);
      if (key <= MAX_HASH_VALUE && key >= MIN_HASH_VALUE) {
          register const char *resword;
          if (key < 16) {
              switch (key - 1) {
                  case 0:  resword = "A"; goto compare;
                  case 2:  resword = "A's"; goto compare;
                  case 3:  resword = "AM's"; goto compare;
                  case 4:  resword = "ATM's"; goto compare;
                  case 5:  resword = "Aachen"; goto compare;
                  case 8:  resword = "AZ's"; goto compare;
                  case 9:  resword = "AZT's"; goto compare;
                  case 10: resword = "AWOL's"; goto compare;
                  case 13: resword = "AC's"; goto compare;
                  case 14: resword = "ASL's"; goto compare;
                }
          } else {
              switch (key - 16) {
                  case 0: resword = "ACTH's"; goto compare;
                  case 1: resword = "ASCII's"; goto compare;
                  case 2: resword = "AOL"; goto compare;
                  case 3: resword = "AI's"; goto compare;
                  case 4: resword = "AOL's"; goto compare;
                  case 5: resword = "AIDS's"; goto compare;
                  case 8: resword = "AB's"; goto compare;
                  case 9: resword = "ABM's"; goto compare;
                  case 13: resword = "AA's"; goto compare;
                  case 14: resword = "ATP's"; goto compare;
              }
          }
          return 0;
        compare:
          if (*str == *resword && !strcmp (str + 1, resword + 1))
            return resword;
        }
    }
  return 0;
}

vs. the new seperate phash -switch method:

$ phash examples/words20 -switch
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s, int l) {
    switch (l) {
      case 1: 
        return *(s) == 'A' ? 0 : -1;
      case 3: 
        switch (s[1]) { /* A's, AOL */
          case 39:
            if (*(short*)s == (short)0x2741 /* A's */
        && *(&s[2]) == 's') return 1;
            break;
          case 'O':
            if (*(short*)s == (short)0x4f41 /* AOL */
        && *(&s[2]) == 'L') return 10;
          default:
            return -1;
        }
        return -1;
      case 4: 
        switch (s[1]) { /* AM's, AB's, AZ's, AA's, AI's, AC's */
          case 'A':
            if (*(int*)s == (int)0x73274141 /* AA's */) return 2;
            break;
          case 'B':
            if (*(int*)s == (int)0x73274241 /* AB's */) return 3;
            break;
          case 'C':
            if (*(int*)s == (int)0x73274341 /* AC's */) return 5;
            break;
          case 'I':
            if (*(int*)s == (int)0x73274941 /* AI's */) return 7;
            break;
          case 'M':
            if (*(int*)s == (int)0x73274d41 /* AM's */) return 9;
            break;
          case 'Z':
            if (*(int*)s == (int)0x73275a41 /* AZ's */) return 17;
          default:
            return -1;
        }
        return -1;
      case 5: 
        switch (s[1]) { /* ATP's, AOL's, ASL's, ATM's, ABM's, AZT's */
          case 'B':
            if (*(int*)s == (int)0x274d4241 /* ABM's */
        && *(&s[4]) == 's') return 4;
            break;
          case 'O':
            if (*(int*)s == (int)0x274c4f41 /* AOL's */
        && *(&s[4]) == 's') return 11;
            break;
          case 'S':
            if (*(int*)s == (int)0x274c5341 /* ASL's */
        && *(&s[4]) == 's') return 13;
            break;
          case 'T':
            if (*(int*)s == (int)0x27505441 /* ATP's */
        && *(&s[4]) == 's') return 15;
            if (*(int*)s == (int)0x274d5441 /* ATM's */
        && *(&s[4]) == 's') return 14;
            break;
          case 'Z':
            if (*(int*)s == (int)0x27545a41 /* AZT's */
        && *(&s[4]) == 's') return 18;
          default:
            return -1;
        }
        return -1;
      case 6: 
        switch (s[1]) { /* AIDS's, AWOL's, ACTH's, Aachen */
          case 'C':
            if (*(int*)s == (int)0x48544341 /* ACTH's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 6;
            break;
          case 'I':
            if (*(int*)s == (int)0x53444941 /* AIDS's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 8;
            break;
          case 'W':
            if (*(int*)s == (int)0x4c4f5741 /* AWOL's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 16;
            break;
          case 'a':
            if (*(int*)s == (int)0x68636141 /* Aachen */
        && *(short*)&s[4] == (short)0x6e65 /* en */) return 19;
          default:
            return -1;
        }
        return -1;
      case 7: 
        return *(int*)s == (int)0x49435341 /* ASCII's */
        && *(short*)&s[4] == (short)0x2749 /* I's */
        && *(&s[6]) == 's' ? 12 : -1;
    }
}

So we switch here first on the key length, and then on the best key. The new switch method creates the nesting level automatically for you, all cases with more than 3 comparisons are automatically changed to a separate switch on the best character. And because we know the length an the keys in advance we can optimize memcmp away by doing word-size comparison directly. This is 50% - 2000% faster then doing memcmp on each string. See the implementation of the helper function memcmp_const_str I heard of people doing this on every string shorter than 128 chars. So far I'm using only 36 as cutoff.

This is not a very interesting example, you need much bigger keysizes to see smarter switch nesting in effect. With this method you can beat even the best hash functions for certain key sets.

For example we got with /usr/share/dict/words on debian with size=99171 from 0.307571 seconds down to 0.006666 seconds, while the fastest perfect hash needed 0.007391 seconds. See the benchmarks. In this special case it was 200x faster with the new memcmp optimization method.

Unfortunately we cannot use this new memcmp optimization method for all hash tables. We need to know one string and the length to compare in advance, and the key and length is a parameter of the lookup method. This only works for this static switch table.

Pearson

There exists another special hashing method, called Pearson hashing. This creates a short 256 bytes "pearson" table of randomized shuffled bytes and uses this table to mix with all characters of the string to hash. Such byte-wise operation with a small lookup table are perfect for smaller machines or smaller keysets.

In my benchmarks -pearson wins with ~127 keys, but you need to be a bit lucky to find a good pearson randomization. phash -pearson tries for max 60 seconds to find the best random table.

Small dictionary with 127 keys

Method       *lookup*  generate compile   c size   exesize  options

hanovpp      0.001247  0.001904 0.069912     3454    11575
hanov        0.001231  0.000946 0.070709     3200    11629
urban        0.001298  0.001020 0.072471     3200    11629
pearson      0.001178  0.068758 0.072311    12354    19240
pearsonnp    0.001253  0.018204 0.072419    12386    19259
cmph-bdz_ph  0.001271  0.000412 0.066264      335    10009
cmph-bdz     0.001317  0.000403 0.066471      424    10033
cmph-bmz     0.001277  0.000539 0.066814     2431    10577
cmph-chm     0.001325  0.000646 0.067026     4356    11057
cmph-fch     0.001372  0.000511 0.066843      942    10177
cmph-chd_ph  0.001310  0.000361 0.066292      418    10033
cmph-chd     0.001323  0.000343 0.066428      735    10129
switch       0.001190  0.000173 0.131323    12517    17450
gperf        0.001199  0.002180 0.064883     8987    12574

The generated code looks like this:

$ phash examples/words20 -pearson
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s) {
    int l = strlen(s);
    long h = 0;
    const char *key = s;
    static unsigned char phash[] = {
        136,176, 67, 19,170, 84, 87,228, 86, 92,100,214,208,201,160,126,
         30,192, 78,189,232,244,142,252,253,147, 34,251,156, 38, 62,141,
         20, 69, 88, 36,241,193, 77,162, 48,247,181,234,194,125, 98,135,
        127, 61,204,180,205,133,132,145, 10,146,229, 18,198, 37, 73,139,
         70,131,249, 51,223,235, 97,202,120,165, 16,  7,186, 89,243, 24,
        216, 43, 45,182,  0, 94,  2,248, 39,188,255,240,219, 96,220,217,
        168,130,  1,190, 85, 75, 44,169, 55,144, 58,224,221,239, 12,159,
        211, 90, 47, 72, 33,  8, 26,231, 56, 57,155, 59,  3,254, 66,185,
         27,  4,  6,102,101,150, 63,153, 52, 81,197,105,161,  9,140, 11,
        207,148,103,199,183, 49,213,143,149,138, 99,117,167, 83,171,175,
        218, 53, 80,113,106,250,124,137,109,104, 22,242,107,134,166, 54,
         14,110,164,236,187,203,237,154,128, 31, 15, 74,179,230,158,177,
        115,210,114,  5,226,196,212, 60,116, 21,157, 32,112,238, 17, 41,
         68,184, 93, 23,245,173,129,172, 25, 79,233,178, 28,121, 40,246,
        119, 50, 13, 65,163,222,206, 95,225,122, 71,111, 46, 35,195, 82,
         42,152, 76, 91,151,118, 29,209,174,227,108, 64,200,215,123,191,

    };
    /* collisions: keys and values */
    static const int   Cv_1[] = {0};
    static const char *Ck_2[] = {"ABM's","AC's", };
    static const int   Cv_2[] = {4,5, };
    static const int   Cv_3[] = {7};
    static const int   Cv_4[] = {3};
    static const int   Cv_5[] = {2};
    static const int   Cv_6[] = {8};
    static const int   Cv_7[] = {1};
    static const int   Cv_8[] = {6};
    static const int   Cv_9[] = {9};
    /* collision keys */
    static const char *Ck[] = { 0, 0, 
        (void*)&Ck_2, 0, 0, 0, 0, 0, 0, 0, };
    /* collision values and direct values */
    static const int *Cv[] = { 0, 
      (void*)&Cv_1, 
      (void*)&Cv_2, 
      (void*)&Cv_3, 
      (void*)&Cv_4, 
      (void*)&Cv_5, 
      (void*)&Cv_6, 
      (void*)&Cv_7, 
      (void*)&Cv_8, 
      (void*)&Cv_9, };
    /* size of collisions */
    static const unsigned char Cs[] = { 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, };
    /* keys */
    static const char* K[] = {
        "A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
    };
    unsigned char c;
    for (c=*s++; c; c=*s++) {
        h = phash[(h ^ c) % 256];
    }
    h = h % 10;
    if (Cs[h] > 1) {
      const char **ck = (const char **)Ck[h];
      int i = 0;
      for (; i < Cs[h]; i++) {
        if (!memcmp(ck[i], key, l)) return Cv[h][i];
      }
    }
    else if (Cs[h] == 1) {
      h = Cv[h][0];
    }
    if (memcmp(K[h], key, l)) h = -1;
    return h;
}

We see that -pearson did not generate a perfect hash, it rather created a static collision table. But -pearson tries to minimize the collision costs for us by trying random pearson tables for a maximum time of 60 seconds and then stores the best variant. In this example here we only got one colliding bucket with 2 values.

There are 2 variants for -pearson, called -pearsonnp, "np" for non-perfect, and -pearson8 which is the strict perfect hash variant and a 8 bit table. -pearsonnp will use a larger pearson lookup table to create less collisions, and -pearson8 will most likely fail to create a proper hash table. I'm also still experimenting with other pearson variants for bigger tables.

CMPH

The currently supported external variants (which require an external library even at run-time) use the cmph library. They are a bit slower than the other methods, but only a bit. They allow much bigger key sizes, with dictionaries which might even exceed your main memory, and there exists even an variant to use tables which exceed the available main memory.

$ phash examples/words20 -cmph-bdz_ph
$ cat phash.c
#include "phash.h"
#include <string.h>
#include "cmph.h"

long phash_lookup(const unsigned char* s) {
    static unsigned char *packed_mphf = "\006\000\000\000\000\000\000\000\r\000\000\000\t\000\000
    \000\b\177\035\024\033\003\000";
    return cmph_search_packed(packed_mphf, (const char*)s, strlen(s));
}

You need to link this file with -lcmph.

Summary

Creating perfect hashes is easy,

they are grossly underused,

they are fast,

and in certain cases fast data tricks - word-sized cmp instead of memcmp - beat better algorithms.

perlcc next steps

cPanel uses now the new perl compiler B::C with -O3 and --staticxs with 5.14.4 in production, and I'm outlining the next steps.

Our old compiler (with 5.6.2) needed 2:30 hours for each build, produced ~100 binaries of about 30-50MB size. This sounds a lot but it is not. This is about the size a single perl scripts needs in memory, but a single perl script has to find all the dependent files first, load and parse it, and construct the stashes, the nested hashes of namespaces. All this is not needed at all during run-time, hence perlcc compiled binaries do not need it. The new compiler uses 30 minutes for the same binaries on our build servers, but is not yet using a lot of internal optimizations. It does use a lot of statically initialized strings and data, so the run-time is better than with 5.6, even if 5.6 is still much faster than the 5.14 run-time. But we still don't collect arrays into _independent_comalloc ranges as we did with 5.6, we still don't use link-time nor profile guided optimizations and we still do not optimize single modules seperately.

There are not many shared libraries used in such a compiled binary, just libperl and the used XS shared libraries.

Starting such a single binary is very fast, it's just mapped into memory at once, then the dynamic symbols need to be generated, as B::C cannot yet generate them statically, and perl5 is not really helpful in supporting static symbols, but all other data, strings and ops (perl functions) are generated statically. Then the XS modules are initialized (booted), then some pointers pointing into some shared libraries need to updated, as they pointed to some other pointer at compile-time and then the perl program is started.

perlcc -m

The plan for the next year is to support generating shared modules per perl package. perlcc -m compiles a package to a shared library, buildcc generates the Makefile.depend dependencies for each binary, and then you can use selectively more advanced optimizations per package. I.e. B::CC or rperl optimized compilations, which should run 2-20x faster than the current B::C compiled packages.

B::CC is not stable enough yet, so you can only use it with some modules, and it currently benefits greatly from type information, for which upstream maintainers have no interest. With rperl it is even more strict, as those modules are not even more strict, as in real programming languages compared to our simple dynamic scripting language, it explicitly disallows certain kinds of dynamic magic or array autovivification, which slows down run-time so much as benchmarked in my YAPC::EU 2013 talk.

So the plan is to compile to several optimized libraries seperately and link them together later. For binaries which fork or which need to be updated pretty often, shared libraries are preferred, for other simple binaries static libraries are preferred as they are loaded much faster. Shared libraries can be updated much easier, and when you fork a binary its shared libraries are shared, there's no seperate copy in memory. This is also one of the main reasons to change our current perl5 unicode implementation from simple and slow perl data structures, to compiled shared libraries, which are just mapped in when needed and can be shared.

Data::Compile

But first some other steps are needed. Before we will create compiled libraries per perl packages we want to compile only some read-only datastructures, hashes mainly. The name of this module is Data::Compile and will replace our current cdb databases for localization data. cdb maps readonly strings to strings, but cannot store other values than strings and is implemented as normal hashmap. Our localization data would be much easier to use with stored coderefs, similar to pre-compiled html templates, which which is a mix of strings and coderefs. By using certain parts of B::C it is trivial to store data as shared library, which can be dynaloaded on demand. Loading such a compiled datastructure of perl data is of course much faster then loading a btree or colliding hashmap and accessing it at run-time through some magic tie interface. B::C is very efficient in storing perl data statically, and only some GV symbols need to be initialized at boot-time.

Data::Compile will also replace the overly slow serializers, like Sereal, Cpanel::JSON::XS, Data::MessagePack or Storable, which have to convert binary data from and to perl data. Data::Compile freezes perl data natively and just thaws it at instance. So loading the CPAN Metadata file will go from 16 secs to 0.001ms, similar to CPAN::SQLite and cpanm which does those queries at run-time to some server will not be needed anymore. Updating such a compiled CPAN Metadata file is estimated to need 2 secs, which needs about the same time as updating CPAN with CPAN::SQLite. And CPAN::SQLite still has a locking bug, which prevents multiple cpan sessions, so you are easier of with cpanm now.

In order to optimize loading of static data, readonly hashes, to replace cdb or sqlite databases or serialized caches we need another module first:

Perfect::Hash

Currently there's is only gperf to create perfect hashes in C or C++, and then there is also the lesser known bob jenkins perfect executable to create perfect hashes (with some weird header files) and of course the cmph library to create perfect hashes of bigger dictionaries, google size. No single database or datra-structure can be faster to lookup then perfect hashes for readonly data, the fastest single-host readonly databases are cdb and mysql. And perfect hashes beat them by far. Ever wondered why Google lookups are so fast? Well, they distribute hashes across several servers with so-called consistent hashes, which map strings into buckets of n servers, and when they insert and delete entries the remapping (copying to other buckets) is minimized. But the real trick are minimal perfect hashes, which minimize lookup times and storage sizes far beyond normal or sparse hashes or b-trees.

So I created phash to replace gperf in the long term, by using better algorithms to handle any data (gperf fails to work with anagrams or weird keys), to create optimally fast C libraries or optimally small C libraries for fast hash lookups, and even provide backends to output perfect hashes for several popular programming languages, like perl (XS), java, ruby, php or python. As C library you can only store strings as keys, and integers and strings as values. With those other backends you can store all supported values.

Perfect hashes look differently on small 8-bit machines than on fast x86_64 machines, for static libraries or for shared libraries with -fPIC, for <1000 keys or for >1.000.000 keys, for keys with NUL characters or not, for 7-bit keys only, for unicode keys, for case-insensitive key lookup, and much more. Using a high-level language to analyze the keys to generate a good perfect hash (in this case perl) is much easier than fixing and enhancing gperf.

The other main problem is icu which is a collection of glorified but moderately efficient hashmaps of unicode tables and the even worse perl5 implementation of those unicode tables. Encode does it much better by pre-compiling encoding tables into good shared libraries, but unicode has much more tables than just encodings. parrot currently uses icu with a new gperf generated workaround for missing tables in icu (the not-existing whitespace class and missing namealiases which were broken in icu 5.0), and moarvm came up with it's own implementation of those tables to workaround icu deficiencies.

So far I have re-implemented the best algorithms to create and use perfect hashes in pure perl and optimized it with XS functions and I am able to create good C code. I am still testing optimal hash functions and strategies. I need only ~2 seconds to create a perfect hash for 100.000 entries in pure perl, with C libraries this goes down to 0.001 seconds, scaling linearily. The main problem is solved. Compilation with a C compiler and -O3 of such hashes need another second.

E.g. these hashes can be used to replace the constant lookup in ExtUtils::Constants. So I'll look soon into seperate my current ~10 different implementations of perfect hashes into one simple but good enough pure-perl version, which will generate pure-c code, without the need for any external library, and then seperate the others into several packages with external dependencies, like zlib for a fast hardware-assisted crc32 hash function (1 op per word), or libcmph and the various output formatters.

Then Data::Compiled will use Perfect::Hash::XS to store the read-only hashes. And then source code will change to use those hashes instead, and then perlcc -m will be able to compile and link arbitrary perl packages with various type annotations and various compilers, if B::C -O0, -O3, B::CC -O0, -O2 or even rperl.

The next plan is then to create a type inferencer in C, but I'll wait for this until p5p comes to a syntax decision how to annotate return types of functions. If ever.

vec is slow, little-endian and limited

I remembered from the perl docs to use vec to compress arrays of integers to bitvectors. In my case with Perfect::Hash::Urban the size advantage is 3320 byte vs 88 bytes for two integer arrays of size 20 with values [0..19], which compress to 8 bit per entry with vec. This is 300x less bytes. The theoretical maximal simple compression would be 13 byte, i.e. 5 bits per entry * 20 = 100 bits = 13 byte. Stored as Perl PV this would require 68 byte PV overhead + 13 byte for the bitvector.

But vec is only allowing sizes which are a power of 2, i.e. (1,2,4,8.16,32 and with 64 bit machines 64). So it seems to use natural words to store the bits, allowing fast access to each entry. With 64 bits it even throws a portable warning. I have no idea what this warning is for nowadays, other than Storable would not be able to thaw it on 32bit machines which do not support 64bit ints. Currently it hurts performance at every single get and set access at run-time for vec with 64 bits entries.

So I implemented my bitvector to store intermediate tables for my perfect hashes in pure-perl and XS with vec, and it was about 2x faster and 300x smaller. That's approximately what I expected.

# convert 2 int arrays @G and @V to one bitvector $G (a string)
my $bits = 1+length(sprintf "%b",$size); # +1 for negative values
for (2,4,8,16,32,($Config{ptrsize}==8?(64):())) {
  next if $bits > $_;
  $bits = $_; last;
}
my $G = "\0" x int($bits * $size / 4); # calloc it
# $G = @G
for my $i (0..$#G) {
  vec($G, $i, $bits) = $G[$i] if $G[$i];
}
# $G+offset = @V
for my $i (0..$#V) { # put V after G
  vec($G, $i+$size, $bits) = $V[$i] if $V[$i];
}
printf("\$G\[$bits]=\"%s\":%d\n", unpack("h*", $G), length($G))
  if $options{-debug};

and read-access is like this:

my $v = vec($G, $index, $bits); # vec version of $v = $G[$index];

But then I tried perfect hashes on bigger dictionaries, like /usr/share/dict/words which need tables of 32 bits per entry, which are needed for 65535 to 4294967295 entries. Note that with that many entries a real bitvector, i.e. via Bit::Vector might be better than the simple approach with vec, which does not support bit sizes between 16 and 32 or even in 32 - 64. But so far I prefer to work without any external dependencies.

Everything works fine in pure-perl, but I wanted faster arithmetic and hash calculations in XS for bigger dictionaries. Like the CPAN Metadata hashes or the various perl5 unicode tables.

But my XS version did not work with bigger dictionaries, it only worked fine with 8 bit tables, which is for a maximum table size of 64.

My XS implementation of vec looked like this:

static inline
IV vec(char *G, IV index, IV bits) {
  if (bits == 8)
    return *(char*)(G + index) & 255;
  else if (bits == 4)
    return *(char*)(G + (index / 2)) & 15;
  else if (bits == 16) {
    short l = *(short*)((short*)G + index); /* __UINT16_MAX__ */
    return (IV)l;
  }
  else if (bits == 32) {
#if INTSIZE == 4
    int l = *(int*)((int*)G + index); /* __UINT32_MAX__ */
#else
    long l = *(long*)((long*)G + index);
#endif
    return (IV)l;
  }
#ifdef HAVE_QUAD
  else if (bits == 64) {
    long l = *(long*)((long*)G + index);
    return (IV)l;
  }
#endif
}

Every access translates to natural words and index offsets, which translates to a single assembler instruction per size. But I got overflows on most indices, so I looked up the perl5 implementation of vec in doop.c and started getting angry.

vec is really little-endian only, no natural endianness, so perl5 does byte fiddling for all the various sizes, but then the bit size limit makes no sense at all anymore, as the naive implementation supporting little-endian only would be a simple loop independent of the bits, instead of all the if branches to check the supported bit sizes. So perl5 vec is not only unnecessarily slow, and only supports little-endian, not native endian-ness, it also adds arbitrary limits on the supported bit sizes for no good technical reason.

So either use Bit::Vector or pack/unpack instead. vec is good for nothing. But I'll probably use just my fast nvec implementation for natural endian-ness and natural word sizes as shown above. This is a drop-in replacement for vec for bits>=4, but works with cooperative access for XS and Pure-Perl, and is fast.

Statistics for perl hash tables

The following data is from a 64bit perl-5.19.10 (i.e. 32 bit jenkins - corrected from 64 bit siphash - which randomly shuffles the 2 first entries of collisions), running the core test suite with -DH.

number of hash keys:

0   3634661x
1   3263596x
2   4361350x
3   2745134x
4   4199196x
5   2921502x
... and so on linearly falling ...
29  1821569x
30  736893x
31  312185x
32  643500x
33  379865x
34  371403x
35  1050967x
36  1016470x
37  1226918x
38  8558286x
39  1671515x
40  4594200x
41  2859521x
42  627789x
43  686036x
...
67  220743x
68  3382026x
69  1225141x
70  147903x
71  95882x
...
307 8330x
308 8962x
309 7902x
310 9041x
...
513 2413x
514 2713x
515 2708x
516 1945x
...
1098    410x
1099    433x
1100    402x
1101    400x
...
1848    69x
2225    14756x
2226    66x
2283    14877x
2284    63x
2309    14818x
2310    60x
2658    25321x
2659    56x
4239    9221x
4240    53x
5654    23526x
5655    46x
6355    1799x
6356    41x
6768    1641x
6769    38x
7476    2001x
7477    35x
7983    37x
7984    76x
7985    27x
7989    25x
7990    23x
8292    54x
8293    19x
9716    18x
10132   19x
10133   18x
13060   1968x
13061   15x
19745   39503x
19746   13x
20893   3795x
20894   9x
22124   3862x
22125   5x
23647   3865x
23648   1x
25137   328x

hash table sizes:

0   1933x
1   294x
3   432x
7   24128585x
15  14099448x
31  15040566x
63  28189821x
127 19795904x
255 10955641x
511 3851922x
1023    2009591x
2047    961707x
4095    180486x
8191    236082x
16383   141396x
32767   136848x

depth of linear search of collisions:

0   19076114x
1   76051832x
2   19870023x
3   4091878x
4   577090x
5   46417x
6   16933x
7   346x
8   23x

It was produced with this patch and a short script to sum up the occurences of the hash keys, sizes and tried collisions. The tests are still work in progress, see some more in this perl-hash-stats repo

The perl5 testsuite has a key size of median = 20, and avg of 133.2. The most commonly used key sizes are 4, 101 and 2.

Average case (perl core testsuite)

| Hash Function     | collisions| cycles/hash |
|:------------------|----------:|------------:|
| CRC32             | 1.078     | 29.78       |
| SUPERFAST         | 1.081     | 34.72       |
| ONE_AT_A_TIME_HARD| 1.092     | 83.75       |
| SIPHASH           | 1.091     | 154.68      |
| ONE_AT_A_TIME     | 1.098     | 43.62       |
| ONE_AT_A_TIME_OLD | 1.100     | 43.62       |
| MURMUR3           | 1.105     | 34.03       |
| DJB2              | 1.131     | 44.73       |
| CITY              |           | 30.13       |
| SDBM              | -         | 30.57       |

Less collisions are better, less cycles/hash is faster. A hash table lookup consists of one constant hash function (depending only on the length of the key) and then resolving 0-x collisions (in our avg case 0-8).

  • collisions are the number of linked list iterations per hash table usage.
  • cycles/hash is measured with smhasher for 10 byte keys. (see "Small key speed test")
  • SDBM and DJBJ did not produce a workable miniperl with -O2. Needed to patch them. Seeing that a HASH=0, effectively creating a long list of linear collisions in HvARRAY[0], does not work in current perl5, makes me feel bad. Note that seed + len is to prevent from the \0 attack.

So it's a pretty good distribution to test better hash tables. Sorry, no beautiful plots yet, later.

The problem I see with the new 5.18 hash tables is manifold:

  • instead of using a fast and optimized hash function we are using a slow, unoptimized hash function. Instead of using a good and fast function and avoid collision attacks, we try to avoid attacks by using slower hashing. The choosen hash functions are also not the best (faster and less collisions) independently verified for average data.
  • we hash unnecessary bits. needed are typically the last 5-9 bits (size 32-512) The biggest hash table in the core test suite needs 15 bits!
  • we cargo-cult salting by adding the random seed to the end of the key. every hash function combines the key with the salt anyway, and we only need the last bits. The last bits should prefer the key, not the seed. This is wrong. We add the len to the seed to prevent from Ruslan's \0 attack.
  • the problem is only to avoid linear collision exhaustion, O(n/2) search time, but we didn't change the collision strategy, e.g. from linear list to binary sorted array, i.e. O(log n). A sorted array of collision entries is much faster to search and much more cache friendly. That's why open addressing is so popular. Sorting up to 6 collisions is thanksfully trivial (hard-coded insertion-sort), and we only need to fallback to complicated called sort functions on real attacks.
  • we randomize seeds only when rehashing, when the attack can already occur. We did it better with 5.8.1, but this was "optimized" with 5.8.2, which lead the problem with 5.18. old info: cPanel perl randomizes the seed for all hashes, so is even immune to the attack Ruslan is talking about in the public blog post at booking.com, but the full motivation for the current 5.18 design is not publicly accessible. It was accessible, I just did not find it.

I'm working on fixing the 5.18 hash table problem and also evaluating robin-hood (fastest open addressing hash table) and cmph BMZ (fastest perfect hash) and using HW optimized hash functions (crc32 intrinsincs and SSE4.2 or armv8).

My analysis of the ususal hash functions is here: github.com/rurban/smhasher, where you can see how ridicuously slow siphash is, how slow and bad jenkins is, compared to fast hash and more secure functions. There are also the results of the full testsuite for each hash function and the easier to follow stats) evaluating quality vs performance.

Good open addressing hash tables are evaluated at github.com/goossaert/hashmap.

Perfect hash tables are explained at cmph.sourceforge.net/.

The public motivation to slow down the perl hash function is here. The discussion within the security team is not available, but as I was told, only the actual attack is not published, which is good. What is not good is that the developers have no idea how to evaluate hash tables and there is no sane discussion possible in the mailinglist, hence this blog and the various repos. In the end I will submit the best solution as patch.

I am also working a symbolic proof besides the statistical one with the stp SAT solver, to see the "quality" of a hash function over the possible seeds, counting the number of collisions over the typical table sizes. With such a solver you can automatically create attack vectors against most web-sites with bad hash collision strategies, so I'm a bit reluctant to publish that.

Note that "hardening" a hash function does not help in hardening a hash table. It only will get slower, but only marginally better.

Notes: * Edited SIPHASH out. We are using jenkins_hard in blead now. This is 2x faster than siphash, and only 6x slower than unoptimized City, but 4-20x slower than SSE4 optimized crc32 intrinsics, with not that high additional collision costs. * Edited seed + len confusion. This is a good thing to do. In fact CRC32 is currently the leader in least collisions and performance.

Do not use each

The each hash iterator has two problems which almost nobody is aware of, and which might lead to unexpected and hard to detect problems. We were using the perl-compiler since 12 years in production and just recently found out that the B function walksymtable will miss random symbols from time to time. It entirely depends on hash randomization. I cannot predict what each did to your code.

First problem: Do not change the data

Adding or deleting a key might lead to rehashing, which will re-order the keys. The iterator will just continue and then either miss keys or get keys again a second time.

perldoc -f each:

If you add or delete a hash's elements while iterating over it, entries may be skipped or duplicated--so don't do that. Exception: It is always safe to delete the item most recently returned by "each()", so the following code works properly:

while (($key, $value) = each %hash) {
    print $key, "\n";
    delete $hash{$key};   # This is safe
}

Below is the bug fixed with http://perl5.git.perl.org/perl.git/commit/5cc8528c900964306cba9b53c6eaa27af540eaea?f=ext/B/B.pm but even the p5p committer was not able to understand the underlying problem, why it failed. This bug was in core perl since 5.005 (1998) until 5.18.

sub walksymtable {
    my ($symref, $method, $recurse, $prefix) = @_;
    my $sym;
    my $ref;
    my $fullname;
    no strict 'refs';
    $prefix = '' unless defined $prefix;
    while (($sym, $ref) = each %$symref) {
        $fullname = "*main::".$prefix.$sym;
        if ($sym =~ /::$/) {
            $sym = $prefix . $sym;
            if (svref_2object(\*$sym)->NAME ne "main::"
                && $sym ne "<none>::"
                && &$recurse($sym))
            {
                walksymtable(\%$fullname, $method, $recurse, $sym);
            }
        } else {
            svref_2object(\*$fullname)->$method();
        }
    }
}

The fix is to replace

while (($sym, $ref) = each %$symref)

with

foreach my $sym (keys %$symref) {
    my $ref = $symref->{$sym};

walksymtable walks all entries in stashes (symbol table hashes) and all its children, and calls a method on each symbol. There is no apparent destructive code in walksymtable which could change the stash, but calling methods might lead to changes. Here it's &$recurse and ->$method(). It could be totally unexpected, e.g. loading a module which changes the symboltable or @ISA hierarchy, which might lead to skipped symbols. Or just parsing normal perl code, such as attributes or named regex references will magically load code and change the symbol table.

each is a bit more performant then foreach, because foreach stores the list to iterate over and just advances the index into it. But you are still able to change entries in the hash with for/foreach, you are safe to use recursion and you are safe from unexpected side-effects as in the walksymtable function.

Second problem: Not re-entrant

while (($key, $value) = each %hash) {
    use Data::Dumper;
    print $key, "\n";
    delete $hash{$key};
    print Dumper(\%hash);  # This is a bug
}

The problems is that Data::Dumper internally uses each to iterate over the hash, which resets the iterator of the hash to the end. The first while loop will not be able continue, the iterator cannot even be stored and reset manually. This is a Data::Dumper bug not detected and fixed in core perl for 16 years, since perl 5.005 (1998).

each not being re-entrant is not even documented.

The hash iterators should have really been stored in the op, not the hash. This is one of the oldest perl5 bugs which will never be fixed.

So each should be treated as in php: Avoid it like a plague. Only use it in optimized cases where you know what you are doing, but not in libraries as you saw with Data::Dumper.