On OP_SIGNATURE

Since it is not possible to post p5p criticism to the mailing list, I'll have to do it in my blog. @p5p: think over your guidelines. I don't believe stuff like this should need to be blogged.

DaveM has now introduced a new OP_SIGNATURE op, which assigns the run-time args according to the compiled signature.

It basically speeds up these compiled checks:

    sub f ($a, $b = 0, $c = "foo") {};

=>

    sub f {
        die sprintf("Too many arguments for subroutine at %s line %d.\n", (caller)[1, 2]) unless @_ <= 3;
        die sprintf("Too few arguments for subroutine at %s line %d.\n", (caller)[1, 2]) unless @_ >= 1;
        my $a = $_[0];
        my $b = @_ >= 2 ? $_[1] : 0;
        my $c = @_ >= 3 ? $_[2] : 'foo';
        ();
    }

into one OP which does the same, similar to the new MULTIDEREF: moving op chains into C. DaveM is now the go-to guy for the big DWIM ops, compressing previous op chains into a single one.

This is far too much for a single op, but previously it was all handled either in ENTERSUB or in user code.

The arity checks should be done in the call op (ENTERSUB), and the local assignments should be separate ops. We still have no syntax support for types, which previously could be used in user code (my int $i = $_[0];), and now we even need type hooks. These could also be added after the SIGNATURE op, but they make little sense there, as side effects would appear too early. XS calls do not need the checks, as they do their own, but XS calls are easily detected in ENTERSUB.

The assignment to local lexicals is now buried in this single OP, which makes it impossible to change the currently only supported call-by-value to the faster call-by-reference. I.e. there's still no support to declare call by reference:

    sub myinc (\$a) { $a++ }; my $i=0; myinc($i); print $i; # => 1

so you still have to use $_[0] directly, which means @_ still needs to be filled with all args. That makes every call of a sub with a signature roughly twice as slow as a call without one: once for @_ in ENTERSUB, and a second time for the named args in SIGNATURE. So this new OP basically just hides this slowness by design (blame Zefram for this idea) by bypassing the normal ops which assigned the locals.

Any optimizing compiler now needs to replace this new SIGNATURE op and cannot work on the optree. Fine, it cannot be used as is anyway.

Compiled or run-time polymorphism (dispatch on argument types) now needs to replace SIGNATURE and not ENTERSUB. There's not much difference, both are horrible ops to work with. SIGNATURE is probably easier to replace, but replacing ENTERSUB had its advantages by leaving out all the unneeded recursion, @_ and debugger code. So basically you now have to replace both.

Of course there's still no type support, and still no return-type declaration syntax, though it seems the post-declaration attribute list can now be used, as :const is now supported, albeit for anonsubs only.

So real subs can soon look like:

    sub myinc (int \$a) :int { $a++ }

and you could use the faster i_ops for the result, and, since it's a reference, for the lifetime of the caller's variable. Just don't expect that from p5p in the next 5 years. All the other dynamic languages (python, ruby, php, javascript) announce such features officially, while I have to implement them in private.

p5p still has no idea what they are doing, but they will probably announce it as a great breakthrough, as they did with the Zefram signatures before. Which is somewhat funny, announcing the worst of all existing signature implementations as a positive. People bought that, so it will work this time too.

So far the biggest recent breakthrough, besides the new fast METHOD ops (THANKS!), was to go for the :const attribute for subs (THANKS!), so the other syntax possibilities, => type or perl6-like returns type or is ro, are now very unlikely to appear. This was a great decision, even if it was made unconsciously, and I can finally go forward.

A little warning to EUMM and shell-script users

I sometimes need to write test scripts as shell scripts and not in perl, to be able to test perl scripts without interfering with perl, and also for performance and easier IO.

In order to find out with which perl this distro was built, we need to parse the generated Makefile.

Recent EUMM 7.0x introduced a new feature which broke all my scripts: they started double-quoting PERL and FULLPERL in the generated Makefile. The damage is already done. The only thing you can do is to remove the quotes:

    PERL=`grep "^PERL =" Makefile|cut -c8-`
    PERL=${PERL:-perl}
    PERL=`echo $PERL|sed -e's,^",,; s,"$,,'`

They obviously were afraid of spaces in Windows paths. But only cmd.exe accepts "cmd", no other shell does. So the obvious fix would be to add double quotes on Win32 only, and only if a space appears in the name or the path. Same as we have to do with $^X in system calls, where we have to double-quote $^X explicitly in string context. Like with

    $X = $^X =~ / / ? qq("$^X") : $^X; system("$X ...")

The initial feedback from the maintainers was not positive; they don't care. EUMM needs to write Makefiles, nothing else. The second reply was: just use sh -c $PERL $args. Yeah. Exactly.

So I fear the toolchain is also starting to rot now that the newbies are taking over. Test::Builder is also in great danger with a newbie maintainer; the initial trials were twice as slow, in order to support streaming. Given that p5p has similar technical problems, it doesn't look too good for 5.2x becoming usable any time soon. I'm still forced to use 5.14.4.

Let's just hope CPAN will not get new maintainers.

My fix: https://github.com/rurban/perl-compiler/commit/16379cf29cbffdf8ffce9d0822af0548cfb65051

The sad story of pseudohash criticism

I just had to endure MJD’s horrible pseudohash explanation at the Pittsburgh Workshop: “A new, never-before-seen talk on Perl’s catastrophic experiment with ‘pseudohashes’, which wasted everyone’s time for nine years between 1998 and 2007”

https://www.youtube.com/watch?v=-HlGQtAuZuY

Watch it, you can fast-forward through it. I honestly had a higher opinion of Mark-Jason.

So let’s see what’s wrong with the popular and uninformed pseudohash criticism:

The main points are that storing a hash in array slot 0 for run-time lookup is too complicated, that the exists and delete ops now also need to check for arrays/pseudohashes, and that all the pseudohash checks slowed down general hash usage by 15%, which basically levelled the ~15% advantage of the faster compile-time array lookup on those pseudohashes.

    package Critter;
    use fields qw(NAME TYPE);

    my Critter $h;  # compile-time optimization: href NAME => aref 1
    $h->{NAME};     # ==> $h->[1]

but:

    $key = "NAME";  # defer to run-time lookup of href in aref 0
    $h->{$key};     # ==> $h->[ $h->[0]->{$key} ]

So by allowing the slow run-time access, you need to preserve the hash semantics in the array: the pseudohash is the array [ { NAME => 1, TYPE => 2 }, $name, $type ], with the field-name map kept in slot 0. Still, the compiler knows the type of $h, and can compile the href for $key into an aref via the hash in slot 0.

Same problem with exists and delete.

exists $h->{NAME} is compile-time constant foldable to YES or NO.

delete $h->{NAME} needs to store a sentinel in aref 1, as hashes do internally. This only slows down aref for pseudohashes, but should not slow down href, or aref for normal arrays.

Of course this was not how it was implemented. In good old perl5 fashion $h was kept as a hash, and all the hash ops were extended to check for pseudohashes at run-time. Yes, at run-time, in the ops.

What should have been done instead was to either reject the pseudohash optimization when a run-time key was parsed, maybe with a warning under use warnings.

Or, if you really don’t want to punish the bad behaviour of using computed keys together with explicitly requested compile-time keys, compile $h to an array and not to a hash.

As I said before, perl5 is just badly implemented, but still fixable. use fields could still be a hash hint for the compiler to change it to arrays.

Just don’t expect anything from the current maintainers and the old-timers.

Horrible talk.

New perl5 Porting/bench.pl

Dave Mitchell finally got fed up with the lack of stable perl benchmarks, and the impossibility of catching performance regressions.

This was his announcement, with the sample:

$ Porting/bench.pl -j 8 --raw --fields=Ir,Dr,Dw \
   --tests=expr::assign::scalar_lex \
   perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o

....

expr::assign::scalar_lex
lexical $x = 1

   perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
   --------- --------- --------- --------- --------- --------- ---------
Ir     155.0     155.0     155.0     155.0     161.0     161.0     161.0
Dr      54.0      54.0      54.0      54.0      57.0      57.0      57.0
Dw      30.0      30.0      30.0      30.0      31.0      31.0      31.0

and the bisect usage sample:

D=/home/davem/perl5/git/bleed

$D/Porting/bisect.pl              \
 --start=v5.21.3                  \
 --end=v5.21.4                    \
 -Doptimize=-O2                   \
 -j 16                            \
 --target=miniperl                \
 -- perl5201o $D/Porting/bench.pl \
      -j 8                             \
      --benchfile=$D/t/perf/benchmarks \
      --tests=expr::assign::scalar_lex \
      --perlargs='-Ilib'               \
      --bisect='Ir,153,157'            \
      ./miniperl

p5p had universal praise for it, probably because nobody did proper benchmarks before. Well, it's at least better than nothing.

It uses cachegrind, which means it is much slower than linux perf, but works on all platforms. It does not display error rates and runs each sample only once, so you can only trust it or not trust it, e.g. in case of high load. Dave said the results are trustworthy, even with -j8, probably because it runs through cachegrind.

Currently it only measures micro ops, not the big picture: a sample over the most-used ops, as I argued before in On simple benchmarks, and with the right op distribution, as in Idea - costmodel for B::Stats.

My parrot-bench uses linux perf on linux instead of just cachegrind. It uses bigger samples which work across all versions, it is much faster, running at native speed, and it is more reliable IMHO. I'd rather see a bad error rate than none, and I'd rather see the absolute timings down to 6 digits.

For the API you can compare it to this similar existing microbenchmark library by google: https://github.com/google/benchmark

There are now two different benchmark tests in p5p t/: The old t/benchmark/rt26188-speed-up-keys-on-empty-hash.t, and the new t/perf directory with the various micro benchmarks.

So with this bench, you can say this op got faster, but you cannot say that perl got faster.

What can you expect? We don't even have a proper realtime clock in our Time::HiRes module yet, i.e. asm rdtsc on intel, or any of the other realtime clock sources for the various platforms, as in Devel::NYTProf.
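For illustration, this is roughly what such a clock source looks like on intel: a minimal standalone sketch using the rdtsc instruction via the gcc/clang __rdtsc intrinsic (my simplified example, not the NYTProf code):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() with gcc/clang */

int main(void) {
    uint64_t t0 = __rdtsc();
    /* ... code under measurement ... */
    uint64_t t1 = __rdtsc();
    printf("%llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}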

Perfect Hashes and faster than memcmp

In my previous post about perlcc next steps I talked briefly about my current project, Perfect::Hash.

# generate c file for readonly lookup
phash keyfile --prefix=phash -nul

# pure-perl usage
use Perfect::Hash;
my @dict = split/\n/,`cat /usr/share/dict/words`;
my $ph = Perfect::Hash->new(\@dict, -minimal);
for (@ARGV) {
  my $v = $ph->perfecthash($_);
  print($dict[$v] eq $_ ? "$_ at line ".($v+1)."\n" : "$_ not found\n");
}

Perfect::Hash->new("keyfile", '-urban', ...)->save_c;
# or just:
phash keyfile --urban
cc -O3 -msse4.2 phash.c ... -lz

phash /usr/share/dict/words --cmph-bdz_ph --nul
cc -O3 phash.c ... -lcmph

phash knows about several fast lookup algorithms for readonly string-based maps (no integer maps yet), and will select the best method to store as a lookup function in a C file, which you then compile into your project. Just as with gperf, only better. Also planned are exporters for other languages, to be able to store values in their native language-specific format, e.g. XS for perl-specific SVs, AVs or HVs.

In order to optimize the loading of static data, we will use phash for readonly hashes, to replace cdb or sqlite databases or serialized caches. gperf was enhanced for libunistring, to look up unicode data efficiently, compared to icu, which uses only straight C++ hashmaps. Searching in constant tables is a remarkably common task, so I am constantly surprised why almost nobody uses optimized data structures and algorithms for it. I only know of google, which also uses consistent hashes to spread the load and keyspace across several servers, apple for their speech apps, and libunistring for unicode mappings. cdb e.g. could easily use perfect hashes instead of the overly simple DJB hashes. Or mysql, which is mostly only usable as a glorified readonly concurrent hashtable, could be replaced by such tables. The update with regeneration and recompilation of such a table costs less than a typical update into mysql.

Perfect::Hash

Let's create some sample hash tables:

$ phash /usr/share/dict/words -hanovpp -size 20 -false-positives
$ cat phash.c

#include "phash.h"
#include <string.h>

#ifdef _MSC_VER
#define INLINE __inline
#else
#define INLINE inline
#endif

/* FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ */
static INLINE
unsigned phash_hash(unsigned d, const unsigned char *s, const int l) {
    int i = 0;
    if (!d) d = 0x01000193;
    for (; i < l; i++) {
        d = (d * 0x01000193) ^ *s++;
    }
    return d & 0xffffffff;
}

long phash_lookup(const unsigned char* s) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    };
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    };
    long l = strlen(s);
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    return v;
}

This creates an interim lookup array G, with negative entries for an immediate index into the values array V, and positive entries for a second hash-based lookup into V, where the entry is the hash seed > 0. So we typically have 1.5 hash function calls per lookup. Generation of such a table is linear in time; intermediate and resulting space is 2n, one array[n] for G and another for the values. But note that this lookup will only work if we know in advance that the key is a member of the perfect hash. That's why we can use the option -false-positives.
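For the curious, here is a rough sketch of how such a G/V table can be generated, following the CHD scheme used above. This is a simplified illustration, not the actual Perfect::Hash generator code: bucket the keys by the seedless first-level hash, then, processing the largest buckets first, search one seed d per bucket which places all of its keys into free slots of V; singleton buckets simply get a direct, negated slot index.

#include <stdio.h>
#include <string.h>

#define N 8  /* table size == number of keys, for a minimal perfect hash */

/* same FNV mixing as in the generated lookup code */
static unsigned fnv(unsigned d, const char *s) {
    if (!d) d = 0x01000193;
    while (*s) d = (d * 0x01000193) ^ (unsigned char)*s++;
    return d;
}

int main(void) {
    const char *keys[N] = {"AA","AB","AC","AD","AE","AF","AG","AH"};
    int G[N] = {0}, V[N], bucket[N][N], bsize[N] = {0};
    for (int i = 0; i < N; i++) V[i] = -1;  /* -1 marks a free slot */

    /* 1. bucket the keys by the seedless first-level hash */
    for (int i = 0; i < N; i++) {
        unsigned h = fnv(0, keys[i]) % N;
        bucket[h][bsize[h]++] = i;
    }
    /* 2. place buckets, largest first */
    for (int size = N; size >= 1; size--) {
        for (int b = 0; b < N; b++) {
            if (bsize[b] != size) continue;
            if (size == 1) {          /* singleton: direct slot, stored negated */
                int slot = 0;
                while (V[slot] != -1) slot++;
                V[slot] = bucket[b][0];
                G[b] = -slot - 1;
            } else {                  /* search one seed d for the whole bucket */
                /* (a real generator would retry or resize if no d is found) */
                for (int d = 1; d < 10000; d++) {
                    int used[N] = {0}, slot[N], ok = 1;
                    for (int k = 0; k < size && ok; k++) {
                        int s = fnv(d, keys[bucket[b][k]]) % N;
                        if (V[s] != -1 || used[s]) ok = 0;
                        else { used[s] = 1; slot[k] = s; }
                    }
                    if (!ok) continue;
                    for (int k = 0; k < size; k++) V[slot[k]] = bucket[b][k];
                    G[b] = d;
                    break;
                }
            }
        }
    }
    for (int i = 0; i < N; i++)
        printf("G[%d] = %3d   V[%d] = %d\n", i, G[i], i, V[i]);
    return 0;
}

Each bucket only searches its own seed and never disturbs already placed keys, which is what keeps the expected generation time linear.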

In order to use arbitrary keys for the lookup, we omit -false-positives; we then need to add a final check whether the found key is really contained in the stored perfect hash. So we need 3n space, one more array for the keys, plus one more full memcmp per lookup.

We will also add the option -nul to be able to store binary keys, with embedded NUL characters. Note that using \0 in keyfiles is quite tricky.

$ phash /usr/share/dict/words -hanovpp -size 20 -nul
$ tail -n30 phash.c

long phash_lookup(const unsigned char* s, int l) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    };
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    };
    /* keys */
    static const char* K[] = {
        "A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
        "AOL","AOL's","ASCII's","ASL's","ATM's","ATP's",
        "AWOL's","AZ's","AZT's","Aachen",
    };
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    if (memcmp(K[v],s,l)) v = -1;
    return v;
}

The differences are the additional storage of the keys and the final if (memcmp(K[v],s,l)) v = -1;

Note that this data structure here is optimized to work with zero-based integer values. For arbitrary values the check would be a bit different and we would need one more array for the values. gperf creates an array of key + value structs and the values can be of any type.

This -hanovpp method (pp stands for pure-perl, hanov for Steve Hanov, who wrote a simple python script at http://stevehanov.ca/blog/index.php?id=119 based on the cmph CHD algorithm) is the default method to create arbitrarily sized perfect hashes, and needs no external library. It creates pure, independent and pretty fast C code.

Fast CRC32

The optimized versions -hanov and -urban use the typically faster hardware-accelerated iSCSI CRC32 function from zlib, which uses only 1 CPU cycle per word, but requires zlib on the system, and is only fast if zlib is fast. You cannot beat 1 cycle, but some zlib versions only use the slow software fallback for CRC32, which is mostly slower than the simple FNV hash function used with -hanovpp.
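If you don't want to go through zlib, the hardware CRC32-C is also directly reachable via the SSE4.2 intrinsics. A hedged sketch, assuming gcc/clang on x86-64 with -msse4.2 (an illustration, not the zlib or Perfect::Hash code):

#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>   /* SSE4.2 _mm_crc32_* intrinsics */

/* iSCSI CRC32-C over a buffer: one crc32 instruction per 8-byte word */
uint32_t crc32c_hash(const char *s, size_t len, uint32_t seed) {
    uint64_t crc = seed;
    while (len >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);                 /* safe unaligned load */
        crc = _mm_crc32_u64(crc, w);
        s += 8; len -= 8;
    }
    while (len--)                         /* byte-wise tail */
        crc = _mm_crc32_u8((uint32_t)crc, (uint8_t)*s++);
    return (uint32_t)crc;
}

The memcpy instead of a pointer cast keeps the unaligned word loads legal; compilers turn it into a plain load.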

Note that faster word-size optimized FNV variants are "presented" at http://www.sanmayce.com/Fastest_Hash/index.html#TRIADvsCRC (beware of eye-bleeding though).

gperf

gperf might be able to create even faster and better hashes, because it doesn't fully hash each key; rather it creates a custom, simpler hash function for each key set. It just adds up some characters at certain indices until it gets different values for each key. Most other methods only create custom hash seeds for each key set. But gperf cannot reliably create stable hash functions, and will either time out on certain key sets or create bad hash tables.
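To illustrate, a gperf-emitted hash function typically looks like this sketch: only the length and a few selected character positions are mixed through a generated asso_values table. The zeroed table below is a placeholder; gperf computes its contents per key set:

/* sketch of a gperf-style hash: only len and a few character
   positions are inspected, never the whole key */
unsigned gperf_style_hash(const char *str, unsigned len) {
    /* filled in by the generator so that all keys
       end up with distinct hash values */
    static const unsigned char asso_values[256] = { 0 };
    return len + asso_values[(unsigned char)str[0]]
               + asso_values[(unsigned char)str[len - 1]];
}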

The major variants of phash -gperf include the options -pic and -switch. -pic creates a lookup function optimized for shared libraries, and -switch creates a nested switch statement.

-switch and fast memcmp

gperf --switch=2 is by far worse than my optimized version, phash -switch. With gperf you have to manually specify the level of nested switches; with phash -switch the nesting level is created automatically. But the biggest trick is a special memcmp optimization for shorter strings.

The old gperf --switch=n variant:

$ gperf examples/words20 --switch=2

... special hash ...

const char *
in_word_set (str, len)
     register const char *str;
     register unsigned int len;
{
  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
      register int key = hash (str, len);
      if (key <= MAX_HASH_VALUE && key >= MIN_HASH_VALUE) {
          register const char *resword;
          if (key < 16) {
              switch (key - 1) {
                  case 0:  resword = "A"; goto compare;
                  case 2:  resword = "A's"; goto compare;
                  case 3:  resword = "AM's"; goto compare;
                  case 4:  resword = "ATM's"; goto compare;
                  case 5:  resword = "Aachen"; goto compare;
                  case 8:  resword = "AZ's"; goto compare;
                  case 9:  resword = "AZT's"; goto compare;
                  case 10: resword = "AWOL's"; goto compare;
                  case 13: resword = "AC's"; goto compare;
                  case 14: resword = "ASL's"; goto compare;
                }
          } else {
              switch (key - 16) {
                  case 0: resword = "ACTH's"; goto compare;
                  case 1: resword = "ASCII's"; goto compare;
                  case 2: resword = "AOL"; goto compare;
                  case 3: resword = "AI's"; goto compare;
                  case 4: resword = "AOL's"; goto compare;
                  case 5: resword = "AIDS's"; goto compare;
                  case 8: resword = "AB's"; goto compare;
                  case 9: resword = "ABM's"; goto compare;
                  case 13: resword = "AA's"; goto compare;
                  case 14: resword = "ATP's"; goto compare;
              }
          }
          return 0;
        compare:
          if (*str == *resword && !strcmp (str + 1, resword + 1))
            return resword;
        }
    }
  return 0;
}

vs. the new separate phash -switch method:

$ phash examples/words20 -switch
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s, int l) {
    switch (l) {
      case 1: 
        return *(s) == 'A' ? 0 : -1;
      case 3: 
        switch (s[1]) { /* A's, AOL */
          case 39:
            if (*(short*)s == (short)0x2741 /* A's */
        && *(&s[2]) == 's') return 1;
            break;
          case 'O':
            if (*(short*)s == (short)0x4f41 /* AOL */
        && *(&s[2]) == 'L') return 10;
          default:
            return -1;
        }
        return -1;
      case 4: 
        switch (s[1]) { /* AM's, AB's, AZ's, AA's, AI's, AC's */
          case 'A':
            if (*(int*)s == (int)0x73274141 /* AA's */) return 2;
            break;
          case 'B':
            if (*(int*)s == (int)0x73274241 /* AB's */) return 3;
            break;
          case 'C':
            if (*(int*)s == (int)0x73274341 /* AC's */) return 5;
            break;
          case 'I':
            if (*(int*)s == (int)0x73274941 /* AI's */) return 7;
            break;
          case 'M':
            if (*(int*)s == (int)0x73274d41 /* AM's */) return 9;
            break;
          case 'Z':
            if (*(int*)s == (int)0x73275a41 /* AZ's */) return 17;
          default:
            return -1;
        }
        return -1;
      case 5: 
        switch (s[1]) { /* ATP's, AOL's, ASL's, ATM's, ABM's, AZT's */
          case 'B':
            if (*(int*)s == (int)0x274d4241 /* ABM's */
        && *(&s[4]) == 's') return 4;
            break;
          case 'O':
            if (*(int*)s == (int)0x274c4f41 /* AOL's */
        && *(&s[4]) == 's') return 11;
            break;
          case 'S':
            if (*(int*)s == (int)0x274c5341 /* ASL's */
        && *(&s[4]) == 's') return 13;
            break;
          case 'T':
            if (*(int*)s == (int)0x27505441 /* ATP's */
        && *(&s[4]) == 's') return 15;
            if (*(int*)s == (int)0x274d5441 /* ATM's */
        && *(&s[4]) == 's') return 14;
            break;
          case 'Z':
            if (*(int*)s == (int)0x27545a41 /* AZT's */
        && *(&s[4]) == 's') return 18;
          default:
            return -1;
        }
        return -1;
      case 6: 
        switch (s[1]) { /* AIDS's, AWOL's, ACTH's, Aachen */
          case 'C':
            if (*(int*)s == (int)0x48544341 /* ACTH's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 6;
            break;
          case 'I':
            if (*(int*)s == (int)0x53444941 /* AIDS's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 8;
            break;
          case 'W':
            if (*(int*)s == (int)0x4c4f5741 /* AWOL's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 16;
            break;
          case 'a':
            if (*(int*)s == (int)0x68636141 /* Aachen */
        && *(short*)&s[4] == (short)0x6e65 /* en */) return 19;
          default:
            return -1;
        }
        return -1;
      case 7: 
        return *(int*)s == (int)0x49435341 /* ASCII's */
        && *(short*)&s[4] == (short)0x2749 /* I's */
        && *(&s[6]) == 's' ? 12 : -1;
    }
}

So we switch here first on the key length, and then on the best character. The new switch method creates the nesting levels automatically for you; all cases with more than 3 comparisons are automatically changed to a separate switch on the best character. And because we know the length and the keys in advance, we can optimize memcmp away by doing word-size comparisons directly. This is 50% - 2000% faster than doing memcmp on each string; see the implementation of the helper function memcmp_const_str. I heard of people doing this on every string shorter than 128 chars; so far I'm using only 36 as the cutoff.
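The principle behind memcmp_const_str, sketched (an illustration of the idea; the actual helper emits such comparisons at generation time): when the literal and its length are fixed, the byte-wise memcmp collapses into one or two word-sized compares. The generated code above casts directly, e.g. *(int*)s == (int)0x73274141; a portable equivalent uses memcpy:

#include <string.h>
#include <stdint.h>

/* generic check: a library memcmp call per candidate string */
int eq_generic(const char *s, int l) {
    return l == 4 && !memcmp(s, "AA's", 4);
}

/* unrolled check: a single 32-bit load and compare, as in phash -switch */
int eq_word(const char *s, int l) {
    int32_t w;
    if (l != 4) return 0;
    memcpy(&w, s, 4);       /* safe unaligned load */
    return w == 0x73274141; /* "AA's" on little-endian */
}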

This is not a very interesting example; you need much bigger key sets to see the smarter switch nesting in effect. With this method you can beat even the best hash functions for certain key sets.

For example, with /usr/share/dict/words on debian (size=99171) we got from 0.307571 seconds down to 0.006666 seconds, while the fastest perfect hash needed 0.007391 seconds. See the benchmarks. In this special case the new memcmp optimization method made it ~46x faster.

Unfortunately we cannot use this new memcmp optimization method for general hash tables. The string and the length to compare must be known in advance, at generation time, while for a hash table the key and its length are parameters of the lookup function. So this only works for the static switch tables.

Pearson

There exists another special hashing method, called Pearson hashing. This creates a short, 256-byte "pearson" table of randomly shuffled bytes, and uses this table to mix in all characters of the string to hash. Such byte-wise operations with a small lookup table are perfect for smaller machines or smaller key sets.

In my benchmarks -pearson wins with ~127 keys, but you need to be a bit lucky to find a good pearson randomization. phash -pearson tries for at most 60 seconds to find the best random table.

Small dictionary with 127 keys

Method       *lookup*  generate compile   c size   exesize  options

hanovpp      0.001247  0.001904 0.069912     3454    11575
hanov        0.001231  0.000946 0.070709     3200    11629
urban        0.001298  0.001020 0.072471     3200    11629
pearson      0.001178  0.068758 0.072311    12354    19240
pearsonnp    0.001253  0.018204 0.072419    12386    19259
cmph-bdz_ph  0.001271  0.000412 0.066264      335    10009
cmph-bdz     0.001317  0.000403 0.066471      424    10033
cmph-bmz     0.001277  0.000539 0.066814     2431    10577
cmph-chm     0.001325  0.000646 0.067026     4356    11057
cmph-fch     0.001372  0.000511 0.066843      942    10177
cmph-chd_ph  0.001310  0.000361 0.066292      418    10033
cmph-chd     0.001323  0.000343 0.066428      735    10129
switch       0.001190  0.000173 0.131323    12517    17450
gperf        0.001199  0.002180 0.064883     8987    12574

The generated code looks like this:

$ phash examples/words20 -pearson
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s) {
    int l = strlen(s);
    long h = 0;
    const char *key = s;
    static unsigned char phash[] = {
        136,176, 67, 19,170, 84, 87,228, 86, 92,100,214,208,201,160,126,
         30,192, 78,189,232,244,142,252,253,147, 34,251,156, 38, 62,141,
         20, 69, 88, 36,241,193, 77,162, 48,247,181,234,194,125, 98,135,
        127, 61,204,180,205,133,132,145, 10,146,229, 18,198, 37, 73,139,
         70,131,249, 51,223,235, 97,202,120,165, 16,  7,186, 89,243, 24,
        216, 43, 45,182,  0, 94,  2,248, 39,188,255,240,219, 96,220,217,
        168,130,  1,190, 85, 75, 44,169, 55,144, 58,224,221,239, 12,159,
        211, 90, 47, 72, 33,  8, 26,231, 56, 57,155, 59,  3,254, 66,185,
         27,  4,  6,102,101,150, 63,153, 52, 81,197,105,161,  9,140, 11,
        207,148,103,199,183, 49,213,143,149,138, 99,117,167, 83,171,175,
        218, 53, 80,113,106,250,124,137,109,104, 22,242,107,134,166, 54,
         14,110,164,236,187,203,237,154,128, 31, 15, 74,179,230,158,177,
        115,210,114,  5,226,196,212, 60,116, 21,157, 32,112,238, 17, 41,
         68,184, 93, 23,245,173,129,172, 25, 79,233,178, 28,121, 40,246,
        119, 50, 13, 65,163,222,206, 95,225,122, 71,111, 46, 35,195, 82,
         42,152, 76, 91,151,118, 29,209,174,227,108, 64,200,215,123,191,

    };
    /* collisions: keys and values */
    static const int   Cv_1[] = {0};
    static const char *Ck_2[] = {"ABM's","AC's", };
    static const int   Cv_2[] = {4,5, };
    static const int   Cv_3[] = {7};
    static const int   Cv_4[] = {3};
    static const int   Cv_5[] = {2};
    static const int   Cv_6[] = {8};
    static const int   Cv_7[] = {1};
    static const int   Cv_8[] = {6};
    static const int   Cv_9[] = {9};
    /* collision keys */
    static const char *Ck[] = { 0, 0, 
        (void*)&Ck_2, 0, 0, 0, 0, 0, 0, 0, };
    /* collision values and direct values */
    static const int *Cv[] = { 0, 
      (void*)&Cv_1, 
      (void*)&Cv_2, 
      (void*)&Cv_3, 
      (void*)&Cv_4, 
      (void*)&Cv_5, 
      (void*)&Cv_6, 
      (void*)&Cv_7, 
      (void*)&Cv_8, 
      (void*)&Cv_9, };
    /* size of collisions */
    static const unsigned char Cs[] = { 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, };
    /* keys */
    static const char* K[] = {
        "A","A's","AA's","AB's","ABM's","AC's","ACTH's","AI's","AIDS's","AM's",
    };
    unsigned char c;
    for (c=*s++; c; c=*s++) {
        h = phash[(h ^ c) % 256];
    }
    h = h % 10;
    if (Cs[h] > 1) {
      const char **ck = (const char **)Ck[h];
      int i = 0;
      for (; i < Cs[h]; i++) {
        if (!memcmp(ck[i], key, l)) return Cv[h][i];
      }
    }
    else if (Cs[h] == 1) {
      h = Cv[h][0];
    }
    if (memcmp(K[h], key, l)) h = -1;
    return h;
}

We see that -pearson did not generate a perfect hash; it rather created a static collision table. But -pearson tries to minimize the collision costs for us, by trying random pearson tables for a maximum of 60 seconds and then storing the best variant. In this example we only got one colliding bucket with 2 values.

There are 2 variants of -pearson: -pearsonnp ("np" for non-perfect) and -pearson8, which is the strict perfect-hash variant with an 8-bit table. -pearsonnp will use a larger pearson lookup table to create fewer collisions, and -pearson8 will most likely fail to create a proper hash table. I'm also still experimenting with other pearson variants for bigger tables.

CMPH

The currently supported external variants (which require an external library even at run-time) use the cmph library. They are a bit slower than the other methods, but only a bit. They allow much bigger key sets, and there even exists a variant for dictionaries which exceed the available main memory.

$ phash examples/words20 -cmph-bdz_ph
$ cat phash.c
#include "phash.h"
#include <string.h>
#include "cmph.h"

long phash_lookup(const unsigned char* s) {
    static unsigned char *packed_mphf = "\006\000\000\000\000\000\000\000\r\000\000\000\t\000\000"
        "\000\b\177\035\024\033\003\000";
    return cmph_search_packed(packed_mphf, (const char*)s, strlen(s));
}

You need to link this file with -lcmph.

Summary

Creating perfect hashes is easy,

they are grossly underused,

they are fast,

and in certain cases fast data tricks - word-sized cmp instead of memcmp - beat better algorithms.