April 2014 Archives

Statistics for perl hash tables

The following data is from a 64-bit perl-5.19.10 (i.e. 32-bit jenkins, corrected from 64-bit siphash, which randomly shuffles the first 2 entries of a collision chain), running the core test suite with -DH.

number of hash keys:

0   3634661x
1   3263596x
2   4361350x
3   2745134x
4   4199196x
5   2921502x
... and so on linearly falling ...
29  1821569x
30  736893x
31  312185x
32  643500x
33  379865x
34  371403x
35  1050967x
36  1016470x
37  1226918x
38  8558286x
39  1671515x
40  4594200x
41  2859521x
42  627789x
43  686036x
...
67  220743x
68  3382026x
69  1225141x
70  147903x
71  95882x
...
307 8330x
308 8962x
309 7902x
310 9041x
...
513 2413x
514 2713x
515 2708x
516 1945x
...
1098    410x
1099    433x
1100    402x
1101    400x
...
1848    69x
2225    14756x
2226    66x
2283    14877x
2284    63x
2309    14818x
2310    60x
2658    25321x
2659    56x
4239    9221x
4240    53x
5654    23526x
5655    46x
6355    1799x
6356    41x
6768    1641x
6769    38x
7476    2001x
7477    35x
7983    37x
7984    76x
7985    27x
7989    25x
7990    23x
8292    54x
8293    19x
9716    18x
10132   19x
10133   18x
13060   1968x
13061   15x
19745   39503x
19746   13x
20893   3795x
20894   9x
22124   3862x
22125   5x
23647   3865x
23648   1x
25137   328x

hash table sizes:

0   1933x
1   294x
3   432x
7   24128585x
15  14099448x
31  15040566x
63  28189821x
127 19795904x
255 10955641x
511 3851922x
1023    2009591x
2047    961707x
4095    180486x
8191    236082x
16383   141396x
32767   136848x

depth of linear search of collisions:

0   19076114x
1   76051832x
2   19870023x
3   4091878x
4   577090x
5   46417x
6   16933x
7   346x
8   23x

It was produced with this patch and a short script to sum up the occurrences of the hash keys, sizes and tried collisions. The tests are still a work in progress; see more in this perl-hash-stats repo.

The perl5 test suite has a median key size of 20 and an average of 133.2. The most commonly used key sizes are 4, 101 and 2.

Average case (perl core testsuite)

| Hash Function     | collisions| cycles/hash |
|:------------------|----------:|------------:|
| CRC32             | 1.078     | 29.78       |
| SUPERFAST         | 1.081     | 34.72       |
| ONE_AT_A_TIME_HARD| 1.092     | 83.75       |
| SIPHASH           | 1.091     | 154.68      |
| ONE_AT_A_TIME     | 1.098     | 43.62       |
| ONE_AT_A_TIME_OLD | 1.100     | 43.62       |
| MURMUR3           | 1.105     | 34.03       |
| DJB2              | 1.131     | 44.73       |
| CITY              |           | 30.13       |
| SDBM              | -         | 30.57       |

Fewer collisions are better; fewer cycles/hash is faster. A hash table lookup consists of one hash-function call of constant cost (depending only on the length of the key), followed by resolving 0-x collisions (0-8 in our average case); a small sketch of this follows the list below.

  • collisions are the number of linked-list iterations per hash table usage.
  • cycles/hash is measured with smhasher for 10-byte keys (see "Small key speed test").
  • SDBM and DJB2 did not produce a workable miniperl with -O2; I needed to patch them. Seeing that HASH=0, which effectively creates one long linear collision chain in HvARRAY[0], does not work in current perl5 makes me feel bad. Note that adding seed + len is there to prevent the \0 attack.
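
To make the collision metric concrete, here is a toy chained hash table in Perl: the extra passes through the bucket loop are what the collisions column above averages. This is only an illustrative sketch with a made-up multiply-by-33 stand-in hash function, not perl's actual C implementation.

use strict; use warnings;

# Toy chained hash table: an array of buckets, each bucket an array of [key, value] pairs.
my $nbuckets = 8;
my @table    = map { [] } 1 .. $nbuckets;
my $hashfn   = sub {                       # stand-in hash function, not perl's
    my $h = 0;
    $h = ($h * 33 + ord) & 0xffffffff for split //, $_[0];
    return $h;
};

sub store {
    my ($key, $val) = @_;
    push @{ $table[ $hashfn->($key) % $nbuckets ] }, [ $key, $val ];
}

sub lookup {
    my ($key) = @_;
    for my $entry (@{ $table[ $hashfn->($key) % $nbuckets ] }) {
        return $entry->[1] if $entry->[0] eq $key;   # each extra pass here is one collision
    }
    return undef;
}

store($_, length $_) for qw(foo bar baz quux);
print lookup("quux"), "\n";    # prints 4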

So this is a pretty good distribution for testing better hash tables. Sorry, no pretty plots yet; later.

The problems I see with the new 5.18 hash tables are manifold:

  • instead of using a fast and optimized hash function we now use a slow, unoptimized one. Instead of picking a good and fast function and avoiding collision attacks by other means, we try to avoid attacks by hashing more slowly. The chosen hash functions are also not the best ones (faster and with fewer collisions) as independently verified on average data.
  • we hash unnecessary bits. Typically only the last 5-9 bits are needed (table sizes 32-512); the biggest hash table in the core test suite needs 15 bits!
  • we cargo-cult salting by adding the random seed to the end of the key. Every hash function combines the key with the salt anyway, and we only need the last bits. The last bits should prefer the key, not the seed. This is wrong. We add the len to the seed to prevent Ruslan's \0 attack.
  • the real problem is only to avoid linear collision exhaustion, i.e. O(n/2) search time, but we didn't change the collision strategy, e.g. from a linear list to a binary-sorted array, i.e. O(log n). A sorted array of collision entries is much faster to search and much more cache friendly; that's why open addressing is so popular. Sorting up to 6 collisions is thankfully trivial (a hard-coded insertion sort), and we only need to fall back to a full sort function on real attacks. See the sketch after this list.
  • we randomize seeds only when rehashing, at which point the attack may already have happened. We did this better with 5.8.1, but that was "optimized" away with 5.8.2, which led to the problem with 5.18. Old info: cPanel perl randomizes the seed for all hashes, so it is even immune to the attack Ruslan is talking about in the public blog post at booking.com, but the full motivation for the current 5.18 design is not publicly accessible. (It was accessible, I just did not find it.)
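
To illustrate two of the points above (only the low bits select the bucket, and a sorted collision array can be binary-searched instead of walked linearly), here is a small Perl sketch. It is only a model of the proposal; the names, sizes and values are invented for the example, and this is not perl's C implementation.

use strict; use warnings;

# Only the low bits of the 32-bit hash value select the bucket: a table with
# 64 buckets uses just 6 bits, so most of the hashed bits do no useful work.
my $hash   = 0xDEADBEEF;
my $size   = 64;                          # power of two, as perl's table sizes are
my $bucket = $hash & ($size - 1);         # the last 6 bits only

# Proposed collision strategy: keep a bucket's entries sorted by key and
# binary-search them, O(log n) instead of the O(n/2) linear list walk.
sub bucket_find {
    my ($entries, $key) = @_;             # $entries: arrayref of [key, value] pairs, sorted by key
    my ($lo, $hi) = (0, $#$entries);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $entries->[$mid][0] cmp $key;
        return $entries->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $lo = $mid + 1 } else { $hi = $mid - 1 }
    }
    return undef;                         # not in this bucket
}

my @bucket = ( [ alpha => 1 ], [ beta => 2 ], [ gamma => 3 ] );   # sorted by key
printf "bucket %d, beta => %d\n", $bucket, bucket_find(\@bucket, 'beta');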

I'm working on fixing the 5.18 hash table problem and am also evaluating robin-hood hashing (the fastest open-addressing hash table), cmph BMZ (the fastest perfect hash), and HW-optimized hash functions (crc32 intrinsics with SSE4.2 or armv8).
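
As a Perl-level illustration of the CRC32 idea: the CPAN module String::CRC32 computes the plain zlib CRC32 in software, not the SSE4.2 crc32c instruction, so this only shows the shape of seeding and bucket masking, not the speed; the seed and table size here are made up.

use strict; use warnings;
use String::CRC32 qw(crc32);

my $seed = 0x9747b28c;                   # made-up per-process seed for this example
my $size = 64;                           # table size, a power of two

sub bucket_for {
    my ($key) = @_;
    my $h = crc32($key, $seed);          # crc32() takes an optional initial CRC value
    return $h & ($size - 1);             # only the low bits pick the bucket
}

printf "%-4s -> bucket %2d\n", $_, bucket_for($_) for qw(foo bar baz);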

My analysis of the usual hash functions is at github.com/rurban/smhasher, where you can see how ridiculously slow siphash is, and how slow and bad jenkins is, compared to faster and more secure functions. There are also the results of the full test suite for each hash function, and easier-to-follow stats evaluating quality vs. performance.

Good open addressing hash tables are evaluated at github.com/goossaert/hashmap.

Perfect hash tables are explained at cmph.sourceforge.net/.

The public motivation to slow down the perl hash function is here. The discussion within the security team is not available, but as I was told, only the actual attack is not published, which is good. What is not good is that the developers have no idea how to evaluate hash tables and no sane discussion is possible on the mailing list, hence this blog and the various repos. In the end I will submit the best solution as a patch.

I am also working on a symbolic proof, in addition to the statistical one, using the stp SAT solver, to see the "quality" of a hash function over the possible seeds, counting the number of collisions over the typical table sizes. With such a solver you can automatically create attack vectors against most web sites with bad hash collision strategies, so I'm a bit reluctant to publish that.

Note that "hardening" a hash function does not help in hardening a hash table. It only will get slower, but only marginally better.

Notes:

  • Edited SIPHASH out. We are using jenkins_hard in blead now. This is 2x faster than siphash, and only 6x slower than unoptimized City, but 4-20x slower than SSE4-optimized crc32 intrinsics, with not that high additional collision costs.
  • Edited the seed + len confusion. This is a good thing to do. In fact CRC32 is currently the leader in least collisions and performance.

Do not use each

The each hash iterator has two problems which almost nobody is aware of, and which might lead to unexpected and hard-to-detect bugs. We have been using the perl-compiler in production for 12 years and only recently found out that the B function walksymtable will miss random symbols from time to time. It depends entirely on hash randomization. I cannot predict what each did to your code.

First problem: Do not change the data

Adding or deleting a key might lead to rehashing, which re-orders the keys. The iterator just continues and will then either miss keys or return keys a second time.

perldoc -f each:

If you add or delete a hash's elements while iterating over it, entries may be skipped or duplicated--so don't do that. Exception: It is always safe to delete the item most recently returned by "each()", so the following code works properly:

while (($key, $value) = each %hash) {
    print $key, "\n";
    delete $hash{$key};   # This is safe
}
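
Conversely, adding keys during the loop is exactly the unsafe case: an insert can trigger a split/rehash and reorder the buckets under the live iterator (recent perls even warn about using each after insertion). A minimal sketch of the failure mode; the key names and %seen are just for the demo:

my %hash = map { ("key$_" => $_) } 1 .. 20;
my %seen;
while (my ($key, $value) = each %hash) {
    $seen{$key}++;
    $hash{"new$key"} = $value;   # insert during iteration: may trigger a split/rehash,
                                 # so original keys can be skipped or returned twice
}
# %seen may now be missing some of the original 20 keys, or have counted some twice.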

Below is the bug, fixed with http://perl5.git.perl.org/perl.git/commit/5cc8528c900964306cba9b53c6eaa27af540eaea?f=ext/B/B.pm, but even the p5p committer was not able to understand the underlying problem of why it failed. This bug had been in core perl since 5.005 (1998), until 5.18.

sub walksymtable {
    my ($symref, $method, $recurse, $prefix) = @_;
    my $sym;
    my $ref;
    my $fullname;
    no strict 'refs';
    $prefix = '' unless defined $prefix;
    while (($sym, $ref) = each %$symref) {    # the stash's single each iterator stays live across the calls below
        $fullname = "*main::".$prefix.$sym;
        if ($sym =~ /::$/) {
            $sym = $prefix . $sym;
            if (svref_2object(\*$sym)->NAME ne "main::"
                && $sym ne "<none>::"
                && &$recurse($sym))
            {
                walksymtable(\%$fullname, $method, $recurse, $sym);
            }
        } else {
            svref_2object(\*$fullname)->$method();    # the callback may load code and change the stash
        }
    }
}

The fix is to replace

while (($sym, $ref) = each %$symref)

with

foreach my $sym (keys %$symref) {
    my $ref = $symref->{$sym};

walksymtable walks all entries in stashes (symbol table hashes) and all their children, and calls a method on each symbol. There is no apparent destructive code in walksymtable which could change the stash, but calling methods might lead to changes: here these are &$recurse and ->$method(). It could be something totally unexpected, e.g. loading a module which changes the symbol table or the @ISA hierarchy, which might lead to skipped symbols. Even just parsing normal perl code, such as attributes or named regex references, will magically load code and change the symbol table.

each is a bit more performant than foreach, because foreach first stores the whole list to iterate over and then just advances an index into it. But with for/foreach you are still able to change entries in the hash, you can safely recurse, and you are safe from unexpected side effects like those in the walksymtable function above. A quick measurement follows below.
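
If you want to measure the difference on your own build, here is a quick Benchmark sketch; the hash size and loop bodies are arbitrary, and on typical builds the gap is small compared to the cost of the loop body:

use strict; use warnings;
use Benchmark qw(cmpthese);

my %hash = map { ($_ => $_ * 2) } 1 .. 10_000;

cmpthese(-2, {
    each_loop => sub {
        my $sum = 0;
        while (my ($k, $v) = each %hash) { $sum += $v }
    },
    keys_loop => sub {
        my $sum = 0;
        $sum += $hash{$_} for keys %hash;
    },
});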

Second problem: Not re-entrant

while (($key, $value) = each %hash) {
    use Data::Dumper;
    print $key, "\n";
    delete $hash{$key};
    print Dumper(\%hash);  # This is a bug
}

The problem is that Data::Dumper internally uses each to iterate over the hash, which resets the hash's single iterator. The outer while loop cannot continue correctly, and the iterator cannot even be saved and restored manually. This is a Data::Dumper bug that went undetected and unfixed in core perl for 16 years, since perl 5.005 (1998).
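
You can trigger the same reset without Data::Dumper: any keys, values or each on the same hash inside the loop body clobbers its single iterator. A minimal demonstration (the guard is only there so the sketch terminates):

my %hash = (a => 1, b => 2, c => 3, d => 4);
my $count = 0;
while (my ($key) = each %hash) {
    print "$key\n";           # keeps printing the same first key
    my @all = keys %hash;     # keys() resets the hash's single iterator,
                              # so the next each() starts from the beginning again
    last if ++$count > 5;     # guard: without it this loop would never end
}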

each not being re-entrant is not even documented.

The hash iterator should really have been stored in the op, not in the hash itself. This is one of the oldest perl5 bugs, and it will never be fixed.

So each should be treated as it is in PHP: avoid it like the plague. Only use it in optimized cases where you know exactly what you are doing, and not in libraries, as you saw with Data::Dumper.

About Reini Urban

Working at cPanel on cperl, B::C (the perl-compiler), parrot, B::Generate, cygwin perl and more guts, keeping the system alive.