A little warning to EUMM and shell-script users

I sometimes need to write shell-script test scripts and not perl, to be able to test perl scripts, without interfering with perl, and also for performance and easier IO reasons.

In order to find out with which perl this distro was built, we need to parse the generated Makefile.

Recent EUMM 7.0x introduced a new feature which broke all my scripts. They started double-quoting PERL and FULLPERL in the generated Makefile. Damage is already done. They only thing you can do is to remove the quote.

PERL=`grep "^PERL =" Makefile|cut -c8-`
PERL=`echo $PERL|sed -e's,^",,; s,"$,,'`

They obviously were afraid of spaces in Windows paths. Only cmd.exe accepts "cmd", no other shell. So the obvious fix would be to add double quotes on Win32 only, and only of a space appears on the NAME or the PATH. Same as we have to do with $^X in system calls, where we have to double-quote $^X explicitly in string -context. Like with

$X = $^X =~ / / ? qq("$^X") : $^X; system("$X ...")

Initial feedback to the maintainers was not positive, they don't care. EUMM needs to write Makefiles, nothing else. The second reply was: Just use sh -c $PERL $args. Yeah. Exactly.

So I fear the toolchain also starts rotting now with the newbies taking over. Test::Builder is also in great danger with a newbie maintainer. The initial trials were twice as slow to be able to support streaming. Given that p5p has similar technical problems it doesn't look to good for 5.2x being usable too soon. I'm still forced to use 5.14.4.

Let's just hope CPAN will not get new maintainers.

My fix: https://github.com/rurban/perl-compiler/commit/16379cf29cbffdf8ffce9d0822af0548cfb65051

The sad story of pseudohash criticism

I just had to endure MJD’s horrible pseudohash explanation at the Pittsburgh Workshop. “A new, never-before-seen talk on Perl’s catastrophic experiment with “pseudohashes”, which wasted everyone’s time for nine years between 1998 and 2007”


Watch it, you can fast forward through it. I honestly had higher opinions on Marc-Jason.

So let’s see what’s wrong with the popular and uninformed pseudohash critic:

Their main points are that storing a hash in the array slot 0 for run-time lookup is too complicated, the exists and delete ops need to check for arrays / pseudohashes now also, and all the pseudohash checks slowed down general hash usage by 15%. Which basically levelled the advantage of faster compile-time accelerated array lookup on those pseudo hashes, which was ~15%.

package Critter;
use fields qw(NAME TYPE);

my Critter $h;    # compile-time optimization: href NAME => aref 1
$h->{NAME};     # ==> $h->[1]


$key = "NAME";  # defer to run-time lookup of href in aref 0
$h->{$key};       # ==> $h->[ $h->[0]->{$key} ]

So by allowing the slow run-time access, you need to preserve the hash semantics of the array. Still, the compilers knows about the type of $h, and can still compile it to a href $key aref 0.

Same problem with exists and delete.

exists $h->{NAME} is compile-time constant foldable to YES or NO.

delete $h->{NAME} needs to store a sentinel as with hashes in aref 1. This only slows down aref for pseudohashes, but should not slow down href or aref for arrays.

Of course this was not how it was implemented. In good old perl5 fashion $h was kept as hash, and all the hash ops were extended to check for pseudohashes at run-time. Yes, at run-time, in the ops.

What should have been done instead was to either reject pseudohash optimization when a run-time key was parsed, maybe with a warning under use warnings.

Or if you really don’t want to punish bad behaviour by using computed keys with explicitly requested compile-time keys, compile $h to arrays and not to hashes.

As I said before, perl5 is just badly implemented, but still fixable. use fields could still be a hash hint for the compiler to change it to arrays.

Just don’t expect anything from the current maintainers and the old-timers.

Horrible talk.

New perl5 Porting/bench.pl

Dave Mitchel finally got fed up by the lack of stable perl benchmarks, and the impossibility do catch performance regressions.

This was his announcement, with the sample:

$ Porting/bench.pl -j 8 --raw --fields=Ir,Dr,Dw \
   --tests=expr::assign::scalar_lex \
   perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o


lexical $x = 1

   perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
   --------- --------- --------- --------- --------- --------- ---------
Ir     155.0     155.0     155.0     155.0     161.0     161.0     161.0
Dr      54.0      54.0      54.0      54.0      57.0      57.0      57.0
Dw      30.0      30.0      30.0      30.0      31.0      31.0      31.0

and the bisect usage sample:


$D/Porting/bisect.pl              \
 --start=v5.21.3                  \
 --end=v5.21.4                    \
 -Doptimize=-O2                   \
 -j 16                            \
 --target=miniperl                \
 -- perl5201o $D/Porting/bench.pl \
      -j 8                             \
      --benchfile=$D/t/perf/benchmarks \
      --tests=expr::assign::scalar_lex \
      --perlargs='-Ilib'               \
      --bisect='Ir,153,157'            \

p5p had universal praise for it, because probably nobody did proper benchmarks before. Well, it's at least better than nothing.

It uses cachegrind, which means it is much slower than linux perf, but works on all platforms. It does not display error rates, it runs the sample only once, so you can only trust it, or do not trust it, e.g. in case of high load. Dave said the results are trustworthy, even with -j8, probably because it runs through cachegrind.

Currently it only measures micro ops, not the big picture, a sample over the most used ops, as I argued before at On simple benchmarks and for the right op distribution at Idea - costmodel for B::Stats.

My parrot-bench uses linux perf on linux instead of just cachegrind. It uses bigger samples, which do work across all versions, it is much faster, using native speed and is more reliable IMHO. I rather see a bad error rate than none, and I rather see the absolute timings down to 6 digits.

For the API you can compare it to this similar existing microbenchmark library by google: https://github.com/google/benchmark

There are now two different benchmark tests in p5p t/: The old t/benchmark/rt26188-speed-up-keys-on-empty-hash.t, and the new t/perf directory with the various micro benchmarks.

So with this bench, you can say this op got faster, but you cannot say that perl got faster.

What can you expect. We even don't have a proper realtime clock in our Time::HiRes module yet. i.e. asm rdtsc on intel or any other realtime clock sources for the various platforms, as in Devel::NYTProf.

Perfect Hashes and faster than memcmp

In my previous post about perlcc next steps I talked shortly about my current project, Perfect::Hash.

# generate c file for readonly lookup
phash keyfile --prefix=phash -nul

# pure-perl usage
use Perfect::Hash;
my @dict = split/\n/,`cat /usr/share/dict/words`;
my $ph = Perfect::Hash->new(\@dict, -minimal);
for (@ARGV) {
  my $v = $ph->perfecthash($_);
  print ($dict[$v] eq $_ ? "$_ at line ".$v+1."\n" : "$_ not found\n");

Perfect::Hash->new("keyfile", '-urban', ...)->save_c;
# or just:
phash keyfile --urban
cc -O3 -msse4.2 phash.c ... -lz

phash /usr/share/dict/words --cmph-bdz_ph --nul
cc -O3 phash.c ... -lcmph

phash knows about several fast lookup algorithms for any readonly string-based maps (no integer maps yet), and will select the best method to store it as a lookup method in a C file to compile it with your project. Such as with gperf, only better. Planned are also exporters for other languages, to be able to store values in their native language-specific format. E.g. XS for perl-specific SVs, AVs or HVs.

In order to optimize loading of static data, readonly hashes to replace cdb or sqlite databases or serialized caches we will use phash. gperf was enhanced for libunistring, to lookup unicode data efficiently, compared to icu which uses only straight C++ hashmaps. Searching in constant tables is quite a remarkably common task, so I am constantly suprised why nobody uses optimized data structures and algorithms to use it. I only know of google which also use consistent hashes to spread the load and keyspace across several servers, apple for their speach apps, and libunistring for unicode mappings. cdb e.g. could easily use perfect hashes instead of overly simple DJB hashes. Or mysql which is mostly only usable as glorified readonly concurrent hashtable could be replaced by such tables. The update with regeneration and recompilation for such a table costs less then a typical update into mysql.


Let's create some sample hash tables:

$ phash /usr/share/dict/words -hanovpp -size 20 -false-positives
$ cat phash.c

#include "phash.h"
#include <string.h>

#ifdef _MSC_VER
#define INLINE __inline
#define INLINE inline

/* FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ */
static INLINE
unsigned phash_hash(unsigned d, const unsigned char *s, const int l) {
    int i = 0;
    if (!d) d = 0x01000193;
    for (; i < l; i++) {
        d = (d * 0x01000193) ^ *s++;
    return d & 0xffffffff;

long phash_lookup(const unsigned char* s) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    long l = strlen(s);
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    return v;

This creates an interim lookup array G with negative indices for immediate in the values array V, and with positive indices for a second hash-based lookup into the values array V, where the index is the hash seed > 0. So we have typically 1.5 hash function calls per lookup. Generation of such a table is linear in time, intermediate and resulting space is 2n, one array[n] for G and another for the values. But note that this lookup will only work if we know in advance that the key is a member of the perfect hash. That's why we can use the option -false-positive.

In order to use arbitrary keys for the lookup, we omit -false-positives, we need to add a final check if the found key is really contained in the stored perfect hash. So we need 3n space, one more array for the keys also, and one more full memcmp.

We will also add the option -nul to be able to store binary keys, with embedded NUL characters. Note that using \0 in keyfiles is quite tricky.

$ phash /usr/share/dict/words -hanovpp -size 20 -nul
$ tail -n30 phash.c

long phash_lookup(const unsigned char* s, int l) {
    unsigned char h;
    signed char d;
    char v;
    /* hash indices, direct < 0, indirect > 0 */
    static const signed char G[] = {
        -16,  1,  1,-15,  0,  0,  0,-14,  1,-13,-12,  1,  3,-10,  0, -8,
          0, -7,  0, -2,
    /* values */
    static const unsigned char V[] = {
          0,  9, 17,  6, 12, 13, 10, 14, 18,  5,  8, 15,  4,  2, 11, 16,
          1,  7,  3, 19,
    /* keys */
    static const char* K[] = {
    h = (unsigned char)(phash_hash(0, s, l) % 20);
    d = G[h];
    v = d < 0
        ? V[(unsigned char)-d-1]
        : d == 0
          ? V[h]
          : V[(unsigned char)(phash_hash(d, s, l) % 20)];
    if (memcmp(K[v],s,l)) v = -1;
    return v;

The differences are the additional storage of the keys and the final if (memcmp(K[v],s,l)) v = -1;

Note that this data structure here is optimized to work with zero-based integer values. For arbitrary values the check would be a bit different and we would need one more array for the values. gperf creates an array of key + value structs and the values can be of any type.

This -hanovpp - pp stands for pure-perl, hanov for Steve Hanov, who wrote a simple python script at http://stevehanov.ca/blog/index.php?id=119 based on the cmph CHD algorithm - method is the default method to create arbitrary sized perfect hashes, and needs no external library. It creates pure and independent, pretty fast C code.

Fast CRC32

The optimized versions -hanov and -urban use the typically faster hardware accelerated iSCSI CRC32 function from zlib, which uses only 1 CPU cycle per word, but requires zlib on the system, and is only fast if zlib is fast. You cannot beat 1 cycle, but some zlib versions only use the slow software fallback version for CRC32, which is most slower than the simple FNV hash function used with -hanovpp.

Note that faster wordsize optimized FNV variants are "presented" at http://www.sanmayce.com/Fastest_Hash/index.html#TRIADvsCRC (Beware of eye-bleeding though)


gperf might be able to create even faster and better hashes, because it doesn't fully hash each key, it rather creates a custom simpler hash function for each key set. It just adds some characters at certain indices until you get different values for each key. Most other methods only create custom hash seeds for each keyset. But gperf can not reliably create stable hash functions and will either time out on certain keysets or will create bad hash tables.

The major variants of phash -gperf include the options -pic and -switch. -pic creates a lookup function optimized for shared libraries, and -switch creates a nested switch statement.

-switch and fast memcmp

gperf --switch=2 is by far worse than my optimized version called phash -switch With gperf you have to manually specify the level of nested switches, with phash -switch the nesting level is created automatically, but the biggest trick is a special memcmp optimization for shorter strings.

The old gperf --switch=n variant:

$ gperf examples/words20 --switch=2

... special hash ...

const char *
in_word_set (str, len)
     register const char *str;
     register unsigned int len;
  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) {
      register int key = hash (str, len);
      if (key <= MAX_HASH_VALUE && key >= MIN_HASH_VALUE) {
          register const char *resword;
          if (key < 16) {
              switch (key - 1) {
                  case 0:  resword = "A"; goto compare;
                  case 2:  resword = "A's"; goto compare;
                  case 3:  resword = "AM's"; goto compare;
                  case 4:  resword = "ATM's"; goto compare;
                  case 5:  resword = "Aachen"; goto compare;
                  case 8:  resword = "AZ's"; goto compare;
                  case 9:  resword = "AZT's"; goto compare;
                  case 10: resword = "AWOL's"; goto compare;
                  case 13: resword = "AC's"; goto compare;
                  case 14: resword = "ASL's"; goto compare;
          } else {
              switch (key - 16) {
                  case 0: resword = "ACTH's"; goto compare;
                  case 1: resword = "ASCII's"; goto compare;
                  case 2: resword = "AOL"; goto compare;
                  case 3: resword = "AI's"; goto compare;
                  case 4: resword = "AOL's"; goto compare;
                  case 5: resword = "AIDS's"; goto compare;
                  case 8: resword = "AB's"; goto compare;
                  case 9: resword = "ABM's"; goto compare;
                  case 13: resword = "AA's"; goto compare;
                  case 14: resword = "ATP's"; goto compare;
          return 0;
          if (*str == *resword && !strcmp (str + 1, resword + 1))
            return resword;
  return 0;

vs. the new seperate phash -switch method:

$ phash examples/words20 -switch
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s, int l) {
    switch (l) {
      case 1: 
        return *(s) == 'A' ? 0 : -1;
      case 3: 
        switch (s[1]) { /* A's, AOL */
          case 39:
            if (*(short*)s == (short)0x2741 /* A's */
        && *(&s[2]) == 's') return 1;
          case 'O':
            if (*(short*)s == (short)0x4f41 /* AOL */
        && *(&s[2]) == 'L') return 10;
            return -1;
        return -1;
      case 4: 
        switch (s[1]) { /* AM's, AB's, AZ's, AA's, AI's, AC's */
          case 'A':
            if (*(int*)s == (int)0x73274141 /* AA's */) return 2;
          case 'B':
            if (*(int*)s == (int)0x73274241 /* AB's */) return 3;
          case 'C':
            if (*(int*)s == (int)0x73274341 /* AC's */) return 5;
          case 'I':
            if (*(int*)s == (int)0x73274941 /* AI's */) return 7;
          case 'M':
            if (*(int*)s == (int)0x73274d41 /* AM's */) return 9;
          case 'Z':
            if (*(int*)s == (int)0x73275a41 /* AZ's */) return 17;
            return -1;
        return -1;
      case 5: 
        switch (s[1]) { /* ATP's, AOL's, ASL's, ATM's, ABM's, AZT's */
          case 'B':
            if (*(int*)s == (int)0x274d4241 /* ABM's */
        && *(&s[4]) == 's') return 4;
          case 'O':
            if (*(int*)s == (int)0x274c4f41 /* AOL's */
        && *(&s[4]) == 's') return 11;
          case 'S':
            if (*(int*)s == (int)0x274c5341 /* ASL's */
        && *(&s[4]) == 's') return 13;
          case 'T':
            if (*(int*)s == (int)0x27505441 /* ATP's */
        && *(&s[4]) == 's') return 15;
            if (*(int*)s == (int)0x274d5441 /* ATM's */
        && *(&s[4]) == 's') return 14;
          case 'Z':
            if (*(int*)s == (int)0x27545a41 /* AZT's */
        && *(&s[4]) == 's') return 18;
            return -1;
        return -1;
      case 6: 
        switch (s[1]) { /* AIDS's, AWOL's, ACTH's, Aachen */
          case 'C':
            if (*(int*)s == (int)0x48544341 /* ACTH's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 6;
          case 'I':
            if (*(int*)s == (int)0x53444941 /* AIDS's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 8;
          case 'W':
            if (*(int*)s == (int)0x4c4f5741 /* AWOL's */
        && *(short*)&s[4] == (short)0x7327 /* 's */) return 16;
          case 'a':
            if (*(int*)s == (int)0x68636141 /* Aachen */
        && *(short*)&s[4] == (short)0x6e65 /* en */) return 19;
            return -1;
        return -1;
      case 7: 
        return *(int*)s == (int)0x49435341 /* ASCII's */
        && *(short*)&s[4] == (short)0x2749 /* I's */
        && *(&s[6]) == 's' ? 12 : -1;

So we switch here first on the key length, and then on the best key. The new switch method creates the nesting level automatically for you, all cases with more than 3 comparisons are automatically changed to a separate switch on the best character. And because we know the length an the keys in advance we can optimize memcmp away by doing word-size comparison directly. This is 50% - 2000% faster then doing memcmp on each string. See the implementation of the helper function memcmp_const_str I heard of people doing this on every string shorter than 128 chars. So far I'm using only 36 as cutoff.

This is not a very interesting example, you need much bigger keysizes to see smarter switch nesting in effect. With this method you can beat even the best hash functions for certain key sets.

For example we got with /usr/share/dict/words on debian with size=99171 from 0.307571 seconds down to 0.006666 seconds, while the fastest perfect hash needed 0.007391 seconds. See the benchmarks. In this special case it was 200x faster with the new memcmp optimization method.

Unfortunately we cannot use this new memcmp optimization method for all hash tables. We need to know one string and the length to compare in advance, and the key and length is a parameter of the lookup method. This only works for this static switch table.


There exists another special hashing method, called Pearson hashing. This creates a short 256 bytes "pearson" table of randomized shuffled bytes and uses this table to mix with all characters of the string to hash. Such byte-wise operation with a small lookup table are perfect for smaller machines or smaller keysets.

In my benchmarks -pearson wins with ~127 keys, but you need to be a bit lucky to find a good pearson randomization. phash -pearson tries for max 60 seconds to find the best random table.

Small dictionary with 127 keys

Method       *lookup*  generate compile   c size   exesize  options

hanovpp      0.001247  0.001904 0.069912     3454    11575
hanov        0.001231  0.000946 0.070709     3200    11629
urban        0.001298  0.001020 0.072471     3200    11629
pearson      0.001178  0.068758 0.072311    12354    19240
pearsonnp    0.001253  0.018204 0.072419    12386    19259
cmph-bdz_ph  0.001271  0.000412 0.066264      335    10009
cmph-bdz     0.001317  0.000403 0.066471      424    10033
cmph-bmz     0.001277  0.000539 0.066814     2431    10577
cmph-chm     0.001325  0.000646 0.067026     4356    11057
cmph-fch     0.001372  0.000511 0.066843      942    10177
cmph-chd_ph  0.001310  0.000361 0.066292      418    10033
cmph-chd     0.001323  0.000343 0.066428      735    10129
switch       0.001190  0.000173 0.131323    12517    17450
gperf        0.001199  0.002180 0.064883     8987    12574

The generated code looks like this:

$ phash examples/words20 -pearson
$ cat phash.c

#include "phash.h"
#include <string.h>

long phash_lookup(const unsigned char* s) {
    int l = strlen(s);
    long h = 0;
    const char *key = s;
    static unsigned char phash[] = {
        136,176, 67, 19,170, 84, 87,228, 86, 92,100,214,208,201,160,126,
         30,192, 78,189,232,244,142,252,253,147, 34,251,156, 38, 62,141,
         20, 69, 88, 36,241,193, 77,162, 48,247,181,234,194,125, 98,135,
        127, 61,204,180,205,133,132,145, 10,146,229, 18,198, 37, 73,139,
         70,131,249, 51,223,235, 97,202,120,165, 16,  7,186, 89,243, 24,
        216, 43, 45,182,  0, 94,  2,248, 39,188,255,240,219, 96,220,217,
        168,130,  1,190, 85, 75, 44,169, 55,144, 58,224,221,239, 12,159,
        211, 90, 47, 72, 33,  8, 26,231, 56, 57,155, 59,  3,254, 66,185,
         27,  4,  6,102,101,150, 63,153, 52, 81,197,105,161,  9,140, 11,
        207,148,103,199,183, 49,213,143,149,138, 99,117,167, 83,171,175,
        218, 53, 80,113,106,250,124,137,109,104, 22,242,107,134,166, 54,
         14,110,164,236,187,203,237,154,128, 31, 15, 74,179,230,158,177,
        115,210,114,  5,226,196,212, 60,116, 21,157, 32,112,238, 17, 41,
         68,184, 93, 23,245,173,129,172, 25, 79,233,178, 28,121, 40,246,
        119, 50, 13, 65,163,222,206, 95,225,122, 71,111, 46, 35,195, 82,
         42,152, 76, 91,151,118, 29,209,174,227,108, 64,200,215,123,191,

    /* collisions: keys and values */
    static const int   Cv_1[] = {0};
    static const char *Ck_2[] = {"ABM's","AC's", };
    static const int   Cv_2[] = {4,5, };
    static const int   Cv_3[] = {7};
    static const int   Cv_4[] = {3};
    static const int   Cv_5[] = {2};
    static const int   Cv_6[] = {8};
    static const int   Cv_7[] = {1};
    static const int   Cv_8[] = {6};
    static const int   Cv_9[] = {9};
    /* collision keys */
    static const char *Ck[] = { 0, 0, 
        (void*)&Ck_2, 0, 0, 0, 0, 0, 0, 0, };
    /* collision values and direct values */
    static const int *Cv[] = { 0, 
      (void*)&Cv_9, };
    /* size of collisions */
    static const unsigned char Cs[] = { 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, };
    /* keys */
    static const char* K[] = {
    unsigned char c;
    for (c=*s++; c; c=*s++) {
        h = phash[(h ^ c) % 256];
    h = h % 10;
    if (Cs[h] > 1) {
      const char **ck = (const char **)Ck[h];
      int i = 0;
      for (; i < Cs[h]; i++) {
        if (!memcmp(ck[i], key, l)) return Cv[h][i];
    else if (Cs[h] == 1) {
      h = Cv[h][0];
    if (memcmp(K[h], key, l)) h = -1;
    return h;

We see that -pearson did not generate a perfect hash, it rather created a static collision table. But -pearson tries to minimize the collision costs for us by trying random pearson tables for a maximum time of 60 seconds and then stores the best variant. In this example here we only got one colliding bucket with 2 values.

There are 2 variants for -pearson, called -pearsonnp, "np" for non-perfect, and -pearson8 which is the strict perfect hash variant and a 8 bit table. -pearsonnp will use a larger pearson lookup table to create less collisions, and -pearson8 will most likely fail to create a proper hash table. I'm also still experimenting with other pearson variants for bigger tables.


The currently supported external variants (which require an external library even at run-time) use the cmph library. They are a bit slower than the other methods, but only a bit. They allow much bigger key sizes, with dictionaries which might even exceed your main memory, and there exists even an variant to use tables which exceed the available main memory.

$ phash examples/words20 -cmph-bdz_ph
$ cat phash.c
#include "phash.h"
#include <string.h>
#include "cmph.h"

long phash_lookup(const unsigned char* s) {
    static unsigned char *packed_mphf = "\006\000\000\000\000\000\000\000\r\000\000\000\t\000\000
    return cmph_search_packed(packed_mphf, (const char*)s, strlen(s));

You need to link this file with -lcmph.


Creating perfect hashes is easy,

they are grossly underused,

they are fast,

and in certain cases fast data tricks - word-sized cmp instead of memcmp - beat better algorithms.

perlcc next steps

cPanel uses now the new perl compiler B::C with -O3 and --staticxs with 5.14.4 in production, and I'm outlining the next steps.

Our old compiler (with 5.6.2) needed 2:30 hours for each build, produced ~100 binaries of about 30-50MB size. This sounds a lot but it is not. This is about the size a single perl scripts needs in memory, but a single perl script has to find all the dependent files first, load and parse it, and construct the stashes, the nested hashes of namespaces. All this is not needed at all during run-time, hence perlcc compiled binaries do not need it. The new compiler uses 30 minutes for the same binaries on our build servers, but is not yet using a lot of internal optimizations. It does use a lot of statically initialized strings and data, so the run-time is better than with 5.6, even if 5.6 is still much faster than the 5.14 run-time. But we still don't collect arrays into _independent_comalloc ranges as we did with 5.6, we still don't use link-time nor profile guided optimizations and we still do not optimize single modules seperately.

There are not many shared libraries used in such a compiled binary, just libperl and the used XS shared libraries.

Starting such a single binary is very fast, it's just mapped into memory at once, then the dynamic symbols need to be generated, as B::C cannot yet generate them statically, and perl5 is not really helpful in supporting static symbols, but all other data, strings and ops (perl functions) are generated statically. Then the XS modules are initialized (booted), then some pointers pointing into some shared libraries need to updated, as they pointed to some other pointer at compile-time and then the perl program is started.

perlcc -m

The plan for the next year is to support generating shared modules per perl package. perlcc -m compiles a package to a shared library, buildcc generates the Makefile.depend dependencies for each binary, and then you can use selectively more advanced optimizations per package. I.e. B::CC or rperl optimized compilations, which should run 2-20x faster than the current B::C compiled packages.

B::CC is not stable enough yet, so you can only use it with some modules, and it currently benefits greatly from type information, for which upstream maintainers have no interest. With rperl it is even more strict, as those modules are not even more strict, as in real programming languages compared to our simple dynamic scripting language, it explicitly disallows certain kinds of dynamic magic or array autovivification, which slows down run-time so much as benchmarked in my YAPC::EU 2013 talk.

So the plan is to compile to several optimized libraries seperately and link them together later. For binaries which fork or which need to be updated pretty often, shared libraries are preferred, for other simple binaries static libraries are preferred as they are loaded much faster. Shared libraries can be updated much easier, and when you fork a binary its shared libraries are shared, there's no seperate copy in memory. This is also one of the main reasons to change our current perl5 unicode implementation from simple and slow perl data structures, to compiled shared libraries, which are just mapped in when needed and can be shared.


But first some other steps are needed. Before we will create compiled libraries per perl packages we want to compile only some read-only datastructures, hashes mainly. The name of this module is Data::Compile and will replace our current cdb databases for localization data. cdb maps readonly strings to strings, but cannot store other values than strings and is implemented as normal hashmap. Our localization data would be much easier to use with stored coderefs, similar to pre-compiled html templates, which which is a mix of strings and coderefs. By using certain parts of B::C it is trivial to store data as shared library, which can be dynaloaded on demand. Loading such a compiled datastructure of perl data is of course much faster then loading a btree or colliding hashmap and accessing it at run-time through some magic tie interface. B::C is very efficient in storing perl data statically, and only some GV symbols need to be initialized at boot-time.

Data::Compile will also replace the overly slow serializers, like Sereal, Cpanel::JSON::XS, Data::MessagePack or Storable, which have to convert binary data from and to perl data. Data::Compile freezes perl data natively and just thaws it at instance. So loading the CPAN Metadata file will go from 16 secs to 0.001ms, similar to CPAN::SQLite and cpanm which does those queries at run-time to some server will not be needed anymore. Updating such a compiled CPAN Metadata file is estimated to need 2 secs, which needs about the same time as updating CPAN with CPAN::SQLite. And CPAN::SQLite still has a locking bug, which prevents multiple cpan sessions, so you are easier of with cpanm now.

In order to optimize loading of static data, readonly hashes, to replace cdb or sqlite databases or serialized caches we need another module first:


Currently there's is only gperf to create perfect hashes in C or C++, and then there is also the lesser known bob jenkins perfect executable to create perfect hashes (with some weird header files) and of course the cmph library to create perfect hashes of bigger dictionaries, google size. No single database or datra-structure can be faster to lookup then perfect hashes for readonly data, the fastest single-host readonly databases are cdb and mysql. And perfect hashes beat them by far. Ever wondered why Google lookups are so fast? Well, they distribute hashes across several servers with so-called consistent hashes, which map strings into buckets of n servers, and when they insert and delete entries the remapping (copying to other buckets) is minimized. But the real trick are minimal perfect hashes, which minimize lookup times and storage sizes far beyond normal or sparse hashes or b-trees.

So I created phash to replace gperf in the long term, by using better algorithms to handle any data (gperf fails to work with anagrams or weird keys), to create optimally fast C libraries or optimally small C libraries for fast hash lookups, and even provide backends to output perfect hashes for several popular programming languages, like perl (XS), java, ruby, php or python. As C library you can only store strings as keys, and integers and strings as values. With those other backends you can store all supported values.

Perfect hashes look differently on small 8-bit machines than on fast x86_64 machines, for static libraries or for shared libraries with -fPIC, for <1000 keys or for >1.000.000 keys, for keys with NUL characters or not, for 7-bit keys only, for unicode keys, for case-insensitive key lookup, and much more. Using a high-level language to analyze the keys to generate a good perfect hash (in this case perl) is much easier than fixing and enhancing gperf.

The other main problem is icu which is a collection of glorified but moderately efficient hashmaps of unicode tables and the even worse perl5 implementation of those unicode tables. Encode does it much better by pre-compiling encoding tables into good shared libraries, but unicode has much more tables than just encodings. parrot currently uses icu with a new gperf generated workaround for missing tables in icu (the not-existing whitespace class and missing namealiases which were broken in icu 5.0), and moarvm came up with it's own implementation of those tables to workaround icu deficiencies.

So far I have re-implemented the best algorithms to create and use perfect hashes in pure perl and optimized it with XS functions and I am able to create good C code. I am still testing optimal hash functions and strategies. I need only ~2 seconds to create a perfect hash for 100.000 entries in pure perl, with C libraries this goes down to 0.001 seconds, scaling linearily. The main problem is solved. Compilation with a C compiler and -O3 of such hashes need another second.

E.g. these hashes can be used to replace the constant lookup in ExtUtils::Constants. So I'll look soon into seperate my current ~10 different implementations of perfect hashes into one simple but good enough pure-perl version, which will generate pure-c code, without the need for any external library, and then seperate the others into several packages with external dependencies, like zlib for a fast hardware-assisted crc32 hash function (1 op per word), or libcmph and the various output formatters.

Then Data::Compiled will use Perfect::Hash::XS to store the read-only hashes. And then source code will change to use those hashes instead, and then perlcc -m will be able to compile and link arbitrary perl packages with various type annotations and various compilers, if B::C -O0, -O3, B::CC -O0, -O2 or even rperl.

The next plan is then to create a type inferencer in C, but I'll wait for this until p5p comes to a syntax decision how to annotate return types of functions. If ever.