August 2014 Archives

Perfect Hashes and faster than memcmp

In my previous post about perlcc next steps I briefly mentioned my current project, Perfect::Hash.

# generate c file for readonly lookup
phash keyfile --prefix=phash -nul

# pure-perl usage
use Perfect::Hash;
my @dict = split/\n/,`cat /usr/share/dict/words`;
my $ph = Perfect::Hash->new(\@dict, -minimal);
for (@ARGV) {
  my $v = $ph->perfecthash($_);
  print $dict[$v] eq $_ ? "$_ at line ".($v+1)."\n" : "$_ not found\n";
}

Perfect::Hash->new("keyfile", '-urban', ...)->save_c;
# or just:
phash keyfile --urban
cc -O3 -msse4.2 phash.c ... -lz

phash /usr/share/dict/words --cmph-bdz_ph --nul
cc -O3 phash.c ... -lcmph

phash knows about several fast lookup algorithms for readonly string-based maps (no integer maps yet) and will select the best method to store them as a lookup function in a C file, which you then compile into your project. Like gperf, only better. Exporters for other languages are also planned, to be able to store values in their native language-specific format, e.g. XS for perl-specific SVs, AVs or HVs.

We will use phash to optimize loading of static data and readonly hashes, and to replace cdb or sqlite databases or serialized caches. gperf was enhanced for libunistring to look up unicode data efficiently, compared to icu which uses only straight C++ hashmaps. Searching in constant tables is a remarkably common task, so I am constantly surprised why almost nobody uses optimized data structures and algorithms for it. I only know of google, which also uses consistent hashes to spread the load and keyspace across several servers, apple for their speech apps, and libunistring for unicode mappings. cdb, for example, could easily use perfect hashes instead of overly simple DJB hashes. Or mysql, which is mostly only usable as a glorified readonly concurrent hashtable, could be replaced by such tables. An update with regeneration and recompilation of such a table costs less than a typical update into mysql.

Perfect::Hash

Let's create some sample hash tables:

$ phash /usr/share/dict/words -hanovpp -size 20 -false-positives
$ cat phash.c

#include "phash.h"
#include <string.h>

#ifdef _MSC_VER
#define INLINE __inline
#else
#define INLINE inline
#endif

/* FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ */
static INLINE
unsigned phash_hash(unsigned d, const unsigned char *s, const int l) {
    int i = 0;
    if (!d) d = 0x01000193;
    for (; i < l; i++) {
        d = (d * 0x01000193) ^ *s++;
    }
    return d & 0xffffffff;
}

long phash_lookup(const unsigned char* s) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    };
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    };
    long l = strlen(s);
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    return v;
}

This creates an interim lookup array G with negative indices for direct lookups into the values array V, and with positive indices for a second hash-based lookup into V, where the index is the hash seed > 0. So we typically need 1.5 hash function calls per lookup. Generation of such a table is linear in time; intermediate and resulting space is 2n, one array[n] for G and another for the values. But note that this lookup will only work if we know in advance that the key is a member of the perfect hash. That's why we can use the option -false-positives.
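
For illustration, here is a minimal sketch of such a Hanov/CHD-style construction in C. This is my own sketch, not the Perfect::Hash generator code: it assumes the phash_hash() function from the generated file above is visible in the same translation unit, stores each key's input position as its value (as in the example), and uses plain int arrays for G and V instead of the packed signed/unsigned chars of the generated code.

#include <stdlib.h>
#include <string.h>

typedef struct { int *idx; int n; int h; } bucket_t;

static int by_size_desc(const void *a, const void *b) {
    return ((const bucket_t*)b)->n - ((const bucket_t*)a)->n;
}

/* fill G[n] and V[n] for the n keys; returns 0 on success */
int build_phash(const char **keys, int n, int *G, int *V) {
    bucket_t *B = calloc(n, sizeof(bucket_t));
    char *used  = calloc(n, 1);
    int  *slot  = malloc(n * sizeof(int));
    int i, j, k;
    for (i = 0; i < n; i++) { G[i] = 0; V[i] = 0; B[i].h = i; }
    /* 1. distribute the keys into buckets with the seed-0 hash */
    for (i = 0; i < n; i++) {
        int h = phash_hash(0, (const unsigned char*)keys[i], strlen(keys[i])) % n;
        B[h].idx = realloc(B[h].idx, (B[h].n + 1) * sizeof(int));
        B[h].idx[B[h].n++] = i;
    }
    /* 2. biggest buckets first: search a seed d which sends every key of
          the bucket to a distinct, still free V slot */
    qsort(B, n, sizeof(bucket_t), by_size_desc);
    for (i = 0; i < n && B[i].n > 1; i++) {
        int d;
        for (d = 1; ; d++) {
            int ok = 1;
            for (j = 0; j < B[i].n && ok; j++) {
                const char *key = keys[B[i].idx[j]];
                int s = phash_hash(d, (const unsigned char*)key, strlen(key)) % n;
                if (used[s]) ok = 0;
                for (k = 0; k < j && ok; k++) if (slot[k] == s) ok = 0;
                slot[j] = s;
            }
            if (ok) break;
            if (d > 1000000) return -1;  /* give up, practically never hit */
        }
        G[B[i].h] = d;                   /* positive: second hash with seed d */
        for (j = 0; j < B[i].n; j++) { used[slot[j]] = 1; V[slot[j]] = B[i].idx[j]; }
    }
    /* 3. single-key buckets: claim the next free slot directly and store it
          as a negative index in G */
    for (j = 0; i < n && B[i].n == 1; i++) {
        while (used[j]) j++;
        used[j] = 1;
        G[B[i].h] = -j - 1;              /* negative: direct index into V */
        V[j] = B[i].idx[0];
    }
    for (i = 0; i < n; i++) free(B[i].idx);
    free(B); free(used); free(slot);
    return 0;
}

Processing the biggest buckets first keeps the per-bucket seed search cheap, which is what keeps the whole construction roughly linear in practice.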

In order to use arbitrary keys for the lookup we omit -false-positives; then we need to add a final check whether the found key is really contained in the stored perfect hash. So we need 3n space, one more array for the keys, and one more full memcmp.

We will also add the option -nul to be able to store binary keys, with embedded NUL characters. Note that using \0 in keyfiles is quite tricky.

$ phash /usr/share/dict/words -hanovpp -size 20 -nul
$ tail -n30 phash.c

long phash_lookup(const unsigned char* s, int l) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    };
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    };
    /* keys */
    static const char* K[] = {
        "A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
        "AOL","AOL's","ASCII's","ASL's","ATM's","ATP's",
        "AWOL's","AZ's","AZT's","Aachen",
    };
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    if (memcmp(K[v],s,l)) v = -1;
    return v;
}

The differences are the additional storage of the keys and the final check: if (memcmp(K[v],s,l)) v = -1;

Note that this data structure here is optimized to work with zero-based integer values. For arbitrary values the check would be a bit different and we would need one more array for the values. gperf creates an array of key + value structs and the values can be of any type.

The -hanovpp method (pp stands for pure-perl, hanov for Steve Hanov, who wrote a simple python script at http://stevehanov.ca/blog/index.php?id=119 based on the cmph CHD algorithm) is the default method to create arbitrarily sized perfect hashes, and needs no external library. It creates pure, independent and pretty fast C code.

Fast CRC32

The optimized variants -hanov and -urban use the typically faster hardware-accelerated iSCSI CRC32 function via zlib, which needs only 1 CPU cycle per word, but requires zlib on the system and is only fast if zlib is fast. You cannot beat 1 cycle, but some zlib versions only use the slow software fallback for CRC32, which is usually slower than the simple FNV hash function used with -hanovpp.
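
For illustration, this is roughly how that hardware CRC32 instruction can be driven directly via the SSE4.2 intrinsics. A hedged sketch, not the Perfect::Hash or zlib code; compile with -msse4.2, as in the cc line shown at the top.

#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u64, _mm_crc32_u8 */

static uint32_t crc32c_hw(uint32_t crc, const unsigned char *s, size_t len) {
    uint64_t c = crc;
    while (len >= 8) {               /* one crc32 instruction per 8-byte word */
        uint64_t w;
        memcpy(&w, s, 8);
        c = _mm_crc32_u64(c, w);
        s += 8; len -= 8;
    }
    while (len--)                    /* byte-wise tail */
        c = _mm_crc32_u8((uint32_t)c, *s++);
    return (uint32_t)c;
}

Note that the crc32 instruction implements the iSCSI (Castagnoli) polynomial, which is not the same checksum as zlib's standard crc32; as a hash function for table lookups that difference does not matter.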

Note that faster wordsize-optimized FNV variants are "presented" at http://www.sanmayce.com/Fastest_Hash/index.html#TRIADvsCRC (beware of eye-bleeding though).

gperf

gperf might be able to create even faster and better hashes, because it doesn't fully hash each key; rather it creates a custom, simpler hash function for each key set. It just adds some characters at certain indices until it gets different values for each key. Most other methods only create custom hash seeds for each key set. But gperf cannot reliably create stable hash functions and will either time out on certain key sets or create bad hash tables.
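
The typical shape of such a generated hash function looks roughly like this (a generic illustration with placeholder table values, not gperf's actual output for any key set here):

static unsigned int
gperf_style_hash (register const char *str, register unsigned int len)
{
  /* per-character "asso_values" table: mostly a large miss value,
     with small values only for characters at the selected positions */
  static const unsigned char asso_values[256] = { /* ... */ 0 };
  return len + asso_values[(unsigned char)str[len - 1]]
             + asso_values[(unsigned char)str[0]];
}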

The major variants of phash -gperf include the options -pic and -switch. -pic creates a lookup function optimized for shared libraries, and -switch creates a nested switch statement.

-switch and fast memcmp

gperf --switch=2 is by far worse than my optimized version, phash -switch. With gperf you have to specify the level of nested switches manually; with phash -switch the nesting level is chosen automatically. But the biggest trick is a special memcmp optimization for shorter strings.

The old gperf --switch=n variant:

$ gperf examples/words20 --switch=2

... special hash ...

const char *
in_word_set (str, len)
     register const char *str;
     register unsigned int len;
{
  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
      register int key = hash (str, len);
      if (key <= MAX_HASH_VALUE && key >= MIN_HASH_VALUE) {
          register const char *resword;
          if (key < 16) {
              switch (key - 1) {
                  case 0:  resword = "A"; goto compare;
                  case 2:  resword = "A's"; goto compare;
                  case 3:  resword = "AM's"; goto compare;
                  case 4:  resword = "ATM's"; goto compare;
                  case 5:  resword = "Aachen"; goto compare;
                  case 8:  resword = "AZ's"; goto compare;
                  case 9:  resword = "AZT's"; goto compare;
                  case 10: resword = "AWOL's"; goto compare;
                  case 13: resword = "AC's"; goto compare;
                  case 14: resword = "ASL's"; goto compare;
                }
          } else {
              switch (key - 16) {
                  case 0: resword = "ACTH's"; goto compare;
                  case 1: resword = "ASCII's"; goto compare;
                  case 2: resword = "AOL"; goto compare;
                  case 3: resword = "AI's"; goto compare;
                  case 4: resword = "AOL's"; goto compare;
                  case 5: resword = "AIDS's"; goto compare;
                  case 8: resword = "AB's"; goto compare;
                  case 9: resword = "ABM's"; goto compare;
                  case 13: resword = "AA's"; goto compare;
                  case 14: resword = "ATP's"; goto compare;
              }
          }
          return 0;
        compare:
          if (*str == *resword && !strcmp (str + 1, resword + 1))
            return resword;
        }
    }
  return 0;
}

vs. the new separate phash -switch method:

$ phash examples/words20 -switch
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s, int l) {
    switch (l) {
      case 1: 
        return *(s) == 'A' ? 0 : -1;
      case 3: 
        switch (s[1]) { /* A's, AOL */
          case 39:
            if (*(short*)s == (short)0x2741 /* A's */
        && *(&s[2]) == 's') return 1;
            break;
          case 'O':
            if (*(short*)s == (short)0x4f41 /* AOL */
        && *(&s[2]) == 'L') return 10;
          default:
            return -1;
        }
        return -1;
      case 4: 
        switch (s[1]) { /* AM's, AB's, AZ's, AA's, AI's, AC's */
          case 'A':
            if (*(int*)s == (int)0x73274141 /* AA's */) return 2;
            break;
          case 'B':
            if (*(int*)s == (int)0x73274241 /* AB's */) return 3;
            break;
          case 'C':
            if (*(int*)s == (int)0x73274341 /* AC's */) return 5;
            break;
          case 'I':
            if (*(int*)s == (int)0x73274941 /* AI's */) return 7;
            break;
          case 'M':
            if (*(int*)s == (int)0x73274d41 /* AM's */) return 9;
            break;
          case 'Z':
            if (*(int*)s == (int)0x73275a41 /* AZ's */) return 17;
          default:
            return -1;
        }
        return -1;
      case 5: 
        switch (s[1]) { /* ATP's, AOL's, ASL's, ATM's, ABM's, AZT's */
          case 'B':
            if (*(int*)s == (int)0x274d4241 /* ABM's */
        && *(&s[4]) == 's') return 4;
            break;
          case 'O':
            if (*(int*)s == (int)0x274c4f41 /* AOL's */
        && *(&s[4]) == 's') return 11;
            break;
          case 'S':
            if (*(int*)s == (int)0x274c5341 /* ASL's */
        && *(&s[4]) == 's') return 13;
            break;
          case 'T':
            if (*(int*)s == (int)0x27505441 /* ATP's */
        && *(&s[4]) == 's') return 15;
            if (*(int*)s == (int)0x274d5441 /* ATM's */
        && *(&s[4]) == 's') return 14;
            break;
          case 'Z':
            if (*(int*)s == (int)0x27545a41 /* AZT's */
        && *(&s[4]) == 's') return 18;
          default:
            return -1;
        }
        return -1;
      case 6: 
        switch (s[1]) { /* AIDS's, AWOL's, ACTH's, Aachen */
          case 'C':
            if (*(int*)s == (int)0x48544341 /* ACTH's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 6;
            break;
          case 'I':
            if (*(int*)s == (int)0x53444941 /* AIDS's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 8;
            break;
          case 'W':
            if (*(int*)s == (int)0x4c4f5741 /* AWOL's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 16;
            break;
          case 'a':
            if (*(int*)s == (int)0x68636141 /* Aachen */
        && *(short*)&s[4] == (short)0x6e65 /* en */) return 19;
          default:
            return -1;
        }
        return -1;
      case 7: 
        return *(int*)s == (int)0x49435341 /* ASCII's */
        && *(short*)&s[4] == (short)0x2749 /* I's */
        && *(&s[6]) == 's' ? 12 : -1;
    }
}

So we switch here first on the key length, and then on the character that best discriminates the keys. The new switch method chooses the nesting level automatically for you; all cases with more than 3 comparisons are automatically changed to a separate switch on the best character. And because we know the length and the keys in advance, we can optimize memcmp away by doing word-size comparisons directly. This is 50% - 2000% faster than doing memcmp on each string. See the implementation of the helper function memcmp_const_str. I have heard of people doing this on every string shorter than 128 chars; so far I am using only 36 as the cutoff.
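
The actual helper is in the Perfect::Hash sources; as a rough sketch of the idea only (assuming a little-endian machine, which the constants in the generated code above also assume), a comparison against one known 6-byte key reduces to a 32-bit and a 16-bit load:

#include <string.h>

/* compare s against the constant "ACTH's" (length 6, known at generation time) */
static int eq_ACTHs(const unsigned char *s) {
    int w; short h;
    memcpy(&w, s, 4);        /* the compiler folds these memcpys ... */
    memcpy(&h, s + 4, 2);    /* ... into plain 32-bit and 16-bit loads */
    return w == 0x48544341 /* ACTH */ && h == 0x7327 /* 's */;
}

The generated code above casts the pointer directly instead of going through memcpy, which additionally assumes that unaligned loads are cheap, as they are on x86.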

This is not a very interesting example; you need much bigger key sets to see smarter switch nesting in effect. With this method you can beat even the best hash functions for certain key sets.

For example, with /usr/share/dict/words on debian (size=99171) we went from 0.307571 seconds down to 0.006666 seconds, while the fastest perfect hash needed 0.007391 seconds. See the benchmarks. In this special case it was ~200x faster with the new memcmp optimization method.

Unfortunately we cannot use this new memcmp optimization for all hash tables. We need to know the string to compare and its length in advance, while in a generic hash table lookup the key and its length are run-time parameters. So this only works for the static switch table.

Pearson

There exists another special hashing method, called Pearson hashing. It creates a small 256-byte "pearson" table of randomly shuffled bytes and uses this table to mix with all characters of the string to hash. Such byte-wise operations with a small lookup table are perfect for smaller machines or smaller key sets.

In my benchmarks -pearson wins with ~127 keys, but you need to be a bit lucky to find a good pearson randomization. phash -pearson tries for at most 60 seconds to find the best random table.
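
Generating a candidate pearson table is simple; a minimal sketch (assumed, not the Perfect::Hash code) is a Fisher-Yates shuffle of the bytes 0..255, repeated with new random tables until one with few enough collisions is found:

#include <stdlib.h>

static void make_pearson_table(unsigned char t[256]) {
    int i;
    for (i = 0; i < 256; i++) t[i] = (unsigned char)i;
    for (i = 255; i > 0; i--) {       /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        unsigned char tmp = t[i]; t[i] = t[j]; t[j] = tmp;
    }
}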

Small dictionary with 127 keys

Method       *lookup*  generate compile   c size   exesize  options

hanovpp      0.001247  0.001904 0.069912     3454    11575
hanov        0.001231  0.000946 0.070709     3200    11629
urban        0.001298  0.001020 0.072471     3200    11629
pearson      0.001178  0.068758 0.072311    12354    19240
pearsonnp    0.001253  0.018204 0.072419    12386    19259
cmph-bdz_ph  0.001271  0.000412 0.066264      335    10009
cmph-bdz     0.001317  0.000403 0.066471      424    10033
cmph-bmz     0.001277  0.000539 0.066814     2431    10577
cmph-chm     0.001325  0.000646 0.067026     4356    11057
cmph-fch     0.001372  0.000511 0.066843      942    10177
cmph-chd_ph  0.001310  0.000361 0.066292      418    10033
cmph-chd     0.001323  0.000343 0.066428      735    10129
switch       0.001190  0.000173 0.131323    12517    17450
gperf        0.001199  0.002180 0.064883     8987    12574

The generated code looks like this:

$ phash examples/words20 -pearson
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s) {
    int l = strlen(s);
    long h = 0;
    const char *key = s;
    static unsigned char phash[] = {
        136,176, 67, 19,170, 84, 87,228, 86, 92,100,214,208,201,160,126,
         30,192, 78,189,232,244,142,252,253,147, 34,251,156, 38, 62,141,
         20, 69, 88, 36,241,193, 77,162, 48,247,181,234,194,125, 98,135,
        127, 61,204,180,205,133,132,145, 10,146,229, 18,198, 37, 73,139,
         70,131,249, 51,223,235, 97,202,120,165, 16,  7,186, 89,243, 24,
        216, 43, 45,182,  0, 94,  2,248, 39,188,255,240,219, 96,220,217,
        168,130,  1,190, 85, 75, 44,169, 55,144, 58,224,221,239, 12,159,
        211, 90, 47, 72, 33,  8, 26,231, 56, 57,155, 59,  3,254, 66,185,
         27,  4,  6,102,101,150, 63,153, 52, 81,197,105,161,  9,140, 11,
        207,148,103,199,183, 49,213,143,149,138, 99,117,167, 83,171,175,
        218, 53, 80,113,106,250,124,137,109,104, 22,242,107,134,166, 54,
         14,110,164,236,187,203,237,154,128, 31, 15, 74,179,230,158,177,
        115,210,114,  5,226,196,212, 60,116, 21,157, 32,112,238, 17, 41,
         68,184, 93, 23,245,173,129,172, 25, 79,233,178, 28,121, 40,246,
        119, 50, 13, 65,163,222,206, 95,225,122, 71,111, 46, 35,195, 82,
         42,152, 76, 91,151,118, 29,209,174,227,108, 64,200,215,123,191,

    };
    /* collisions: keys and values */
    static const int   Cv_1[] = {0};
    static const char *Ck_2[] = {"ABM's","AC's", };
    static const int   Cv_2[] = {4,5, };
    static const int   Cv_3[] = {7};
    static const int   Cv_4[] = {3};
    static const int   Cv_5[] = {2};
    static const int   Cv_6[] = {8};
    static const int   Cv_7[] = {1};
    static const int   Cv_8[] = {6};
    static const int   Cv_9[] = {9};
    /* collision keys */
    static const char *Ck[] = { 0, 0, 
        (void*)&Ck_2, 0, 0, 0, 0, 0, 0, 0, };
    /* collision values and direct values */
    static const int *Cv[] = { 0, 
      (void*)&Cv_1, 
      (void*)&Cv_2, 
      (void*)&Cv_3, 
      (void*)&Cv_4, 
      (void*)&Cv_5, 
      (void*)&Cv_6, 
      (void*)&Cv_7, 
      (void*)&Cv_8, 
      (void*)&Cv_9, };
    /* size of collisions */
    static const unsigned char Cs[] = { 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, };
    /* keys */
    static const char* K[] = {
        "A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
    };
    unsigned char c;
    for (c=*s++; c; c=*s++) {
        h = phash[(h ^ c) % 256];
    }
    h = h % 10;
    if (Cs[h] > 1) {
      const char **ck = (const char **)Ck[h];
      int i = 0;
      for (; i < Cs[h]; i++) {
        if (!memcmp(ck[i], key, l)) return Cv[h][i];
      }
    }
    else if (Cs[h] == 1) {
      h = Cv[h][0];
    }
    if (memcmp(K[h], key, l)) h = -1;
    return h;
}

We see that -pearson did not generate a perfect hash here; it rather created a static collision table. But -pearson tries to minimize the collision costs for us by trying random pearson tables for a maximum of 60 seconds and then stores the best variant. In this example we only got one colliding bucket with 2 values.

There are 2 more variants of -pearson: -pearsonnp, "np" for non-perfect, and -pearson8, which is the strict perfect-hash variant with an 8-bit table. -pearsonnp uses a larger pearson lookup table to create fewer collisions, and -pearson8 will most likely fail to create a proper hash table. I am also still experimenting with other pearson variants for bigger tables.

CMPH

The currently supported external variants (which require an external library even at run-time) use the cmph library. They are a bit slower than the other methods, but only a bit. They allow much bigger key sizes, with dictionaries that might even exceed your main memory, and there is even a variant for tables that exceed the available main memory.

$ phash examples/words20 -cmph-bdz_ph
$ cat phash.c
#include "phash.h"
#include <string.h>
#include "cmph.h"

long phash_lookup(const unsigned char* s) {
    static unsigned char *packed_mphf = "\006\000\000\000\000\000\000\000\r\000\000\000\t\000\000"
        "\000\b\177\035\024\033\003\000";
    return cmph_search_packed(packed_mphf, (const char*)s, strlen(s));
}

You need to link this file with -lcmph.

Summary

Creating perfect hashes is easy,

they are grossly underused,

they are fast,

and in certain cases fast data tricks (word-sized cmp instead of memcmp) beat better algorithms.

perlcc next steps

cPanel now uses the new perl compiler B::C with -O3 and --staticxs on 5.14.4 in production, and I'm outlining the next steps.

Our old compiler (with 5.6.2) needed 2:30 hours for each build and produced ~100 binaries of about 30-50MB each. This sounds like a lot, but it is not. It is about the size a single perl script needs in memory, but a plain perl script has to find all the dependent files first, load and parse them, and construct the stashes, the nested hashes of namespaces. None of this is needed at run-time, hence perlcc compiled binaries do not need it. The new compiler needs 30 minutes for the same binaries on our build servers, but does not yet use a lot of internal optimizations. It does use a lot of statically initialized strings and data, so the run-time is better than with 5.6, even if 5.6 is still much faster than the 5.14 run-time. But we still don't collect arrays into independent_comalloc ranges as we did with 5.6, we still don't use link-time or profile-guided optimizations, and we still do not optimize single modules separately.

There are not many shared libraries used in such a compiled binary, just libperl and the used XS shared libraries.

Starting such a single binary is very fast: it is just mapped into memory at once, then the dynamic symbols need to be generated, as B::C cannot yet generate them statically (and perl5 is not really helpful in supporting static symbols), but all other data, strings and ops (perl functions) are generated statically. Then the XS modules are initialized (booted), then some pointers into shared libraries need to be updated, as they pointed to other addresses at compile-time, and then the perl program is started.

perlcc -m

The plan for the next year is to support generating shared modules per perl package. perlcc -m compiles a package to a shared library, buildcc generates the Makefile.depend dependencies for each binary, and then you can selectively use more advanced optimizations per package, i.e. B::CC or rperl optimized compilations, which should run 2-20x faster than the current B::C compiled packages.

B::CC is not stable enough yet, so you can only use it with some modules, and it currently benefits greatly from type information, for which upstream maintainers have no interest. rperl is even more strict: as in real programming languages, compared to our simple dynamic scripting language, it explicitly disallows certain kinds of dynamic magic or array autovivification, which slow down run-time considerably, as benchmarked in my YAPC::EU 2013 talk.

So the plan is to compile several optimized libraries separately and link them together later. For binaries which fork or which need to be updated pretty often, shared libraries are preferred; for other simple binaries static libraries are preferred, as they are loaded much faster. Shared libraries can be updated much more easily, and when you fork a binary its shared libraries are shared, so there is no separate copy in memory. This is also one of the main reasons to change our current perl5 unicode implementation from simple and slow perl data structures to compiled shared libraries, which are just mapped in when needed and can be shared.

Data::Compile

But first some other steps are needed. Before we create compiled libraries per perl package we want to compile only some read-only data structures, mainly hashes. This module is called Data::Compile and will replace our current cdb databases for localization data. cdb maps readonly strings to strings, but cannot store values other than strings and is implemented as a normal hashmap. Our localization data would be much easier to use with stored coderefs, similar to pre-compiled html templates, which are a mix of strings and coderefs. By using certain parts of B::C it is trivial to store data as a shared library, which can be dynaloaded on demand. Loading such a compiled data structure of perl data is of course much faster than loading a btree or colliding hashmap and accessing it at run-time through some magic tie interface. B::C is very efficient in storing perl data statically, and only some GV symbols need to be initialized at boot-time.

Data::Compile will also replace the overly slow serializers, like Sereal, Cpanel::JSON::XS, Data::MessagePack or Storable, which have to convert binary data from and to perl data. Data::Compile freezes perl data natively and just thaws it instantly. So loading the CPAN Metadata file will go from 16 secs to about 0.001ms, and CPAN::SQLite or cpanm, which run those queries at run-time against some server, will not be needed anymore. Updating such a compiled CPAN Metadata file is estimated to need 2 secs, about the same time as updating CPAN with CPAN::SQLite. And CPAN::SQLite still has a locking bug which prevents multiple cpan sessions, so you are better off with cpanm for now.

In order to optimize loading of static data and readonly hashes, and to replace cdb or sqlite databases or serialized caches, we need another module first:

Perfect::Hash

Currently there is only gperf to create perfect hashes in C or C++, then there is the lesser known Bob Jenkins perfect executable to create perfect hashes (with some weird header files), and of course the cmph library to create perfect hashes of bigger, Google-sized dictionaries. No database or data structure can be faster to look up than perfect hashes for readonly data; the fastest single-host readonly databases are cdb and mysql, and perfect hashes beat them by far. Ever wondered why Google lookups are so fast? Well, they distribute hashes across several servers with so-called consistent hashes, which map strings into buckets of n servers, and when they insert and delete entries the remapping (copying to other buckets) is minimized. But the real trick are minimal perfect hashes, which minimize lookup times and storage sizes far beyond normal or sparse hashes or b-trees.

So I created phash to replace gperf in the long term, by using better algorithms to handle any data (gperf fails to work with anagrams or weird keys), to create optimally fast or optimally small C libraries for fast hash lookups, and even to provide backends to output perfect hashes for several popular programming languages, like perl (XS), java, ruby, php or python. As a C library you can only store strings as keys, and integers and strings as values. With those other backends you can store all supported values.

Perfect hashes look different on small 8-bit machines than on fast x86_64 machines, for static libraries or for shared libraries with -fPIC, for <1000 keys or for >1.000.000 keys, for keys with NUL characters or without, for 7-bit keys only, for unicode keys, for case-insensitive key lookup, and much more. Using a high-level language (in this case perl) to analyze the keys and generate a good perfect hash is much easier than fixing and enhancing gperf.

The other main problem is icu, which is a collection of glorified but only moderately efficient hashmaps of unicode tables, and the even worse perl5 implementation of those unicode tables. Encode does it much better by pre-compiling encoding tables into good shared libraries, but unicode has many more tables than just encodings. parrot currently uses icu with a new gperf-generated workaround for missing tables in icu (the non-existing whitespace class and missing namealiases which were broken in icu 5.0), and moarvm came up with its own implementation of those tables to work around icu deficiencies.

So far I have re-implemented the best algorithms to create and use perfect hashes in pure perl, optimized them with XS functions, and I am able to create good C code. I am still testing optimal hash functions and strategies. I need only ~2 seconds to create a perfect hash for 100.000 entries in pure perl; with C libraries this goes down to 0.001 seconds, scaling linearly. The main problem is solved. Compiling such hashes with a C compiler and -O3 needs another second.

E.g. these hashes can be used to replace the constant lookup in ExtUtils::Constant. So I will soon look into separating my current ~10 different implementations of perfect hashes into one simple but good-enough pure-perl version, which will generate pure C code without the need for any external library, and then separate the others into several packages with external dependencies, like zlib for a fast hardware-assisted crc32 hash function (1 op per word), or libcmph and the various output formatters.

Then Data::Compile will use Perfect::Hash::XS to store the read-only hashes. Then the source code will change to use those hashes instead, and then perlcc -m will be able to compile and link arbitrary perl packages with various type annotations and various compilers, be it B::C -O0, -O3, B::CC -O0, -O2 or even rperl.

The next plan is then to create a type inferencer in C, but I'll wait with this until p5p comes to a syntax decision on how to annotate return types of functions. If ever.

vec is slow, little-endian and limited

I remembered from the perl docs to use vec to compress arrays of integers into bitvectors. In my case with Perfect::Hash::Urban the size advantage is 3320 bytes vs 88 bytes for two integer arrays of size 20 with values [0..19], which compress to 8 bits per entry with vec. This is 300x fewer bytes. The theoretical maximal simple compression would be 13 bytes, i.e. 5 bits per entry * 20 = 100 bits = 13 bytes. Stored as a Perl PV this would require 68 bytes of PV overhead + 13 bytes for the bitvector.

But vec only allows sizes which are a power of 2, i.e. 1, 2, 4, 8, 16, 32 and, on 64-bit machines, 64. So it seems to use natural words to store the bits, allowing fast access to each entry. With 64 bits it even throws a "portable" warning. I have no idea what this warning is for nowadays, other than that Storable would not be able to thaw it on 32-bit machines which do not support 64-bit ints. Currently it hurts performance at every single get and set access at run-time for vec with 64-bit entries.

So I implemented my bitvector to store intermediate tables for my perfect hashes in pure-perl and XS with vec, and it was about 2x faster and 300x smaller. That's approximately what I expected.

# convert 2 int arrays @G and @V to one bitvector $G (a string)
use Config;  # for $Config{ptrsize}
my $bits = 1+length(sprintf "%b",$size); # +1 for negative values
for (2,4,8,16,32,($Config{ptrsize}==8?(64):())) {
  next if $bits > $_;
  $bits = $_; last;
}
my $G = "\0" x int($bits * $size / 4); # calloc it
# $G = @G
for my $i (0..$#G) {
  vec($G, $i, $bits) = $G[$i] if $G[$i];
}
# $G+offset = @V
for my $i (0..$#V) { # put V after G
  vec($G, $i+$size, $bits) = $V[$i] if $V[$i];
}
printf("\$G\[$bits]=\"%s\":%d\n", unpack("h*", $G), length($G))
  if $options{-debug};

and read-access is like this:

my $v = vec($G, $index, $bits); # vec version of $v = $G[$index];

But then I tried perfect hashes on bigger dictionaries, like /usr/share/dict/words, which needs tables of 32 bits per entry, as required for more than 65536 entries (up to 4294967295). Note that with that many entries a real bitvector, e.g. via Bit::Vector, might be better than the simple approach with vec, which does not support bit sizes between 16 and 32, or between 32 and 64. But so far I prefer to work without any external dependencies.

Everything works fine in pure-perl, but I wanted faster arithmetic and hash calculations in XS for bigger dictionaries. Like the CPAN Metadata hashes or the various perl5 unicode tables.

But my XS version did not work with bigger dictionaries; it only worked fine with 8-bit tables, i.e. a maximum table size of 64.

My XS implementation of vec looked like this:

static inline
IV vec(char *G, IV index, IV bits) {
  if (bits == 8)
    return *(unsigned char*)(G + index);
  else if (bits == 4) /* select the low or high nibble */
    return (*(unsigned char*)(G + (index / 2)) >> ((index % 2) * 4)) & 15;
  else if (bits == 16) {
    short l = *((short*)G + index); /* __UINT16_MAX__ */
    return (IV)l;
  }
  else if (bits == 32) {
#if INTSIZE == 4
    int l = *((int*)G + index); /* __UINT32_MAX__ */
#else
    long l = *((long*)G + index);
#endif
    return (IV)l;
  }
#ifdef HAVE_QUAD
  else if (bits == 64) {
    long l = *((long*)G + index);
    return (IV)l;
  }
#endif
  return 0; /* unsupported bit size */
}

Every access maps to natural words and index offsets, which compiles down to a single assembler instruction per size. But I got overflows on most indices, so I looked up the perl5 implementation of vec in doop.c and started getting angry.

vec is really little-endian only, not natural endianness, so perl5 does byte fiddling for all the various sizes. But then the bit-size limit makes no sense at all anymore, as a naive implementation supporting little-endian only would be a simple loop independent of the bit size, instead of all the if branches to check the supported bit sizes. So perl5 vec is not only unnecessarily slow and only supports little-endian, not native endianness, it also adds arbitrary limits on the supported bit sizes for no good technical reason.
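
To illustrate the point: a byte-wise, little-endian-only read that works for any bit size is just a short loop. This is a sketch of mine, not the actual doop.c code, but it shows that the power-of-2 restriction buys such an implementation nothing:

/* read entry `index` of `bits` bits each from the bit string s, LSB first */
static unsigned long vec_get_le(const unsigned char *s,
                                unsigned long index, unsigned bits) {
    unsigned long bitpos = index * (unsigned long)bits;
    unsigned long r = 0;
    unsigned i;
    for (i = 0; i < bits; i++, bitpos++)
        r |= (unsigned long)((s[bitpos >> 3] >> (bitpos & 7)) & 1) << i;
    return r;
}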

So either use Bit::Vector or pack/unpack instead; vec is good for nothing. But I'll probably just use my fast nvec implementation for natural endianness and natural word sizes, as shown above. It is a drop-in replacement for vec for bits >= 4, works with cooperative access from both XS and pure-perl, and is fast.

About Reini Urban

Working at cPanel on cperl, B::C (the perl-compiler), parrot, B::Generate, cygwin perl and more guts, keeping the system alive.