## Statistics for perl hash tables

The following data is from a 64bit perl-5.19.10 (i.e. 32 bit jenkins - *corrected from 64 bit siphash* - which randomly shuffles the 2 first entries of collisions), running the core test suite with -DH.

## number of hash keys:

```
0 3634661x
1 3263596x
2 4361350x
3 2745134x
4 4199196x
5 2921502x
... and so on linearly falling ...
29 1821569x
30 736893x
31 312185x
32 643500x
33 379865x
34 371403x
35 1050967x
36 1016470x
37 1226918x
38 8558286x
39 1671515x
40 4594200x
41 2859521x
42 627789x
43 686036x
...
67 220743x
68 3382026x
69 1225141x
70 147903x
71 95882x
...
307 8330x
308 8962x
309 7902x
310 9041x
...
513 2413x
514 2713x
515 2708x
516 1945x
...
1098 410x
1099 433x
1100 402x
1101 400x
...
1848 69x
2225 14756x
2226 66x
2283 14877x
2284 63x
2309 14818x
2310 60x
2658 25321x
2659 56x
4239 9221x
4240 53x
5654 23526x
5655 46x
6355 1799x
6356 41x
6768 1641x
6769 38x
7476 2001x
7477 35x
7983 37x
7984 76x
7985 27x
7989 25x
7990 23x
8292 54x
8293 19x
9716 18x
10132 19x
10133 18x
13060 1968x
13061 15x
19745 39503x
19746 13x
20893 3795x
20894 9x
22124 3862x
22125 5x
23647 3865x
23648 1x
25137 328x
```

## hash table sizes:

```
0 1933x
1 294x
3 432x
7 24128585x
15 14099448x
31 15040566x
63 28189821x
127 19795904x
255 10955641x
511 3851922x
1023 2009591x
2047 961707x
4095 180486x
8191 236082x
16383 141396x
32767 136848x
```

## depth of linear search of collisions:

```
0 19076114x
1 76051832x
2 19870023x
3 4091878x
4 577090x
5 46417x
6 16933x
7 346x
8 23x
```

It was produced with this patch and a short script to sum up the occurences of the hash keys, sizes and tried collisions. The tests are still work in progress, see some more in this perl-hash-stats repo

The perl5 testsuite has a key size of median = 20, and avg of 133.2. The most commonly used key sizes are 4, 101 and 2.

## Average case (perl core testsuite)

```
| Hash Function | collisions| cycles/hash |
|:------------------|----------:|------------:|
| CRC32 | 1.078 | 29.78 |
| SUPERFAST | 1.081 | 34.72 |
| ONE_AT_A_TIME_HARD| 1.092 | 83.75 |
| SIPHASH | 1.091 | 154.68 |
| ONE_AT_A_TIME | 1.098 | 43.62 |
| ONE_AT_A_TIME_OLD | 1.100 | 43.62 |
| MURMUR3 | 1.105 | 34.03 |
| DJB2 | 1.131 | 44.73 |
| CITY | | 30.13 |
| SDBM | - | 30.57 |
```

Less collisions are better, less cycles/hash is faster. A hash table lookup consists of one constant hash function (depending only on the length of the key) and then resolving 0-x collisions (in our avg case 0-8).

- collisions are the number of linked list iterations per hash table usage.
- cycles/hash is measured with smhasher for 10 byte keys. (see "Small key speed test")
- SDBM and DJBJ did not produce a workable miniperl with -O2. Needed to patch them. Seeing that a HASH=0, effectively creating a long list of linear collisions in HvARRAY[0], does not work in current perl5, makes me feel bad. Note that seed + len is to prevent from the \0 attack.

So it's a pretty good distribution to test better hash tables. Sorry, no beautiful plots yet, later.

The problem I see with the new 5.18 hash tables is manifold:

- instead of using a fast and optimized hash function we are using a slow, unoptimized hash function. Instead of using a good and fast function and avoid collision attacks, we try to avoid attacks by using slower hashing. The choosen hash functions are also not the best (faster and less collisions) independently verified for average data.
- we hash unnecessary bits. needed are typically the last 5-9 bits (size 32-512) The biggest hash table in the core test suite needs 15 bits!
~~we cargo-cult salting by adding the random seed to the end of the key. every hash function combines the key with the salt anyway, and we only need the last bits. The last bits should prefer the key, not the seed.~~This is wrong. We add the len to the seed to prevent from Ruslan's \0 attack.- the problem is only to avoid linear collision exhaustion, O(n/2) search time, but we didn't change the collision strategy, e.g. from linear list to binary sorted array, i.e. O(log n). A sorted array of collision entries is much faster to search and much more cache friendly. That's why open addressing is so popular. Sorting up to 6 collisions is thanksfully trivial (hard-coded insertion-sort), and we only need to fallback to complicated called sort functions on real attacks.
- we randomize seeds only when rehashing, when the attack can already occur. We did it better with 5.8.1, but this was "optimized" with 5.8.2, which lead the problem with 5.18.
old info:
~~cPanel perl randomizes the seed for all hashes, so is even immune to the attack Ruslan is talking about in the public blog post at booking.com, but the full motivation for the current 5.18 design is not publicly accessible.~~It was accessible, I just did not find it.

I'm working on fixing the 5.18 hash table problem and also evaluating robin-hood (fastest open addressing hash table) and cmph BMZ (fastest perfect hash) and using HW optimized hash functions (crc32 intrinsincs and SSE4.2 or armv8).

My analysis of the ususal hash functions is here: github.com/rurban/smhasher, where you can see how ridicuously slow siphash is, how slow and bad jenkins is, compared to fast hash and more secure functions. There are also the results of the full testsuite for each hash function and the easier to follow stats) evaluating quality vs performance.

Good open addressing hash tables are evaluated at github.com/goossaert/hashmap.

Perfect hash tables are explained at cmph.sourceforge.net/.

The public motivation to slow down the perl hash function is here. The discussion within the security team is not available, but as I was told, only the actual attack is not published, which is good. What is not good is that the developers have no idea how to evaluate hash tables and there is no sane discussion possible in the mailinglist, hence this blog and the various repos. In the end I will submit the best solution as patch.

I am also working a symbolic proof besides the statistical one with the stp SAT solver, to see the "quality" of a hash function over the possible seeds, counting the number of collisions over the typical table sizes. With such a solver you can automatically create attack vectors against most web-sites with bad hash collision strategies, so I'm a bit reluctant to publish that.

Note that "hardening" a hash function does not help in hardening a hash table. It only will get slower, but only marginally better.

Notes: * Edited SIPHASH out. We are using jenkins_hard in blead now. This is 2x faster than siphash, and only 6x slower than unoptimized City, but 4-20x slower than SSE4 optimized crc32 intrinsics, with not that high additional collision costs. * Edited seed + len confusion. This is a good thing to do. In fact CRC32 is currently the leader in least collisions and performance.