Statistics for perl hash tables
The following data is from a 64bit perl-5.19.10 (i.e. 32 bit jenkins - corrected from 64 bit siphash - which randomly shuffles the 2 first entries of collisions), running the core test suite with -DH.
number of hash keys:
0 3634661x 1 3263596x 2 4361350x 3 2745134x 4 4199196x 5 2921502x ... and so on linearly falling ... 29 1821569x 30 736893x 31 312185x 32 643500x 33 379865x 34 371403x 35 1050967x 36 1016470x 37 1226918x 38 8558286x 39 1671515x 40 4594200x 41 2859521x 42 627789x 43 686036x ... 67 220743x 68 3382026x 69 1225141x 70 147903x 71 95882x ... 307 8330x 308 8962x 309 7902x 310 9041x ... 513 2413x 514 2713x 515 2708x 516 1945x ... 1098 410x 1099 433x 1100 402x 1101 400x ... 1848 69x 2225 14756x 2226 66x 2283 14877x 2284 63x 2309 14818x 2310 60x 2658 25321x 2659 56x 4239 9221x 4240 53x 5654 23526x 5655 46x 6355 1799x 6356 41x 6768 1641x 6769 38x 7476 2001x 7477 35x 7983 37x 7984 76x 7985 27x 7989 25x 7990 23x 8292 54x 8293 19x 9716 18x 10132 19x 10133 18x 13060 1968x 13061 15x 19745 39503x 19746 13x 20893 3795x 20894 9x 22124 3862x 22125 5x 23647 3865x 23648 1x 25137 328x
hash table sizes:
0 1933x 1 294x 3 432x 7 24128585x 15 14099448x 31 15040566x 63 28189821x 127 19795904x 255 10955641x 511 3851922x 1023 2009591x 2047 961707x 4095 180486x 8191 236082x 16383 141396x 32767 136848x
depth of linear search of collisions:
0 19076114x 1 76051832x 2 19870023x 3 4091878x 4 577090x 5 46417x 6 16933x 7 346x 8 23x
It was produced with this patch and a short script to sum up the occurences of the hash keys, sizes and tried collisions. The tests are still work in progress, see some more in this perl-hash-stats repo
The perl5 testsuite has a key size of median = 20, and avg of 133.2. The most commonly used key sizes are 4, 101 and 2.
Average case (perl core testsuite)
| Hash Function | collisions| cycles/hash | |:------------------|----------:|------------:| | CRC32 | 1.078 | 29.78 | | SUPERFAST | 1.081 | 34.72 | | ONE_AT_A_TIME_HARD| 1.092 | 83.75 | | SIPHASH | 1.091 | 154.68 | | ONE_AT_A_TIME | 1.098 | 43.62 | | ONE_AT_A_TIME_OLD | 1.100 | 43.62 | | MURMUR3 | 1.105 | 34.03 | | DJB2 | 1.131 | 44.73 | | CITY | | 30.13 | | SDBM | - | 30.57 |
Less collisions are better, less cycles/hash is faster. A hash table lookup consists of one constant hash function (depending only on the length of the key) and then resolving 0-x collisions (in our avg case 0-8).
- collisions are the number of linked list iterations per hash table usage.
- cycles/hash is measured with smhasher for 10 byte keys. (see "Small key speed test")
- SDBM and DJBJ did not produce a workable miniperl with -O2. Needed to patch them. Seeing that a HASH=0, effectively creating a long list of linear collisions in HvARRAY, does not work in current perl5, makes me feel bad. Note that seed + len is to prevent from the \0 attack.
So it's a pretty good distribution to test better hash tables. Sorry, no beautiful plots yet, later.
The problem I see with the new 5.18 hash tables is manifold:
- instead of using a fast and optimized hash function we are using a slow, unoptimized hash function. Instead of using a good and fast function and avoid collision attacks, we try to avoid attacks by using slower hashing. The choosen hash functions are also not the best (faster and less collisions) independently verified for average data.
- we hash unnecessary bits. needed are typically the last 5-9 bits (size 32-512) The biggest hash table in the core test suite needs 15 bits!
we cargo-cult salting by adding the random seed to the end of the key. every hash function combines the key with the salt anyway, and we only need the last bits. The last bits should prefer the key, not the seed.This is wrong. We add the len to the seed to prevent from Ruslan's \0 attack.
- the problem is only to avoid linear collision exhaustion, O(n/2) search time, but we didn't change the collision strategy, e.g. from linear list to binary sorted array, i.e. O(log n). A sorted array of collision entries is much faster to search and much more cache friendly. That's why open addressing is so popular. Sorting up to 6 collisions is thanksfully trivial (hard-coded insertion-sort), and we only need to fallback to complicated called sort functions on real attacks.
- we randomize seeds only when rehashing, when the attack can already occur. We did it better with 5.8.1, but this was "optimized" with 5.8.2, which lead the problem with 5.18.
cPanel perl randomizes the seed for all hashes, so is even immune to the attack Ruslan is talking about in the public blog post at booking.com, but the full motivation for the current 5.18 design is not publicly accessible.It was accessible, I just did not find it.
I'm working on fixing the 5.18 hash table problem and also evaluating robin-hood (fastest open addressing hash table) and cmph BMZ (fastest perfect hash) and using HW optimized hash functions (crc32 intrinsincs and SSE4.2 or armv8).
My analysis of the ususal hash functions is here: github.com/rurban/smhasher, where you can see how ridicuously slow siphash is, how slow and bad jenkins is, compared to fast hash and more secure functions. There are also the results of the full testsuite for each hash function and the easier to follow stats) evaluating quality vs performance.
Good open addressing hash tables are evaluated at github.com/goossaert/hashmap.
Perfect hash tables are explained at cmph.sourceforge.net/.
The public motivation to slow down the perl hash function is here. The discussion within the security team is not available, but as I was told, only the actual attack is not published, which is good. What is not good is that the developers have no idea how to evaluate hash tables and there is no sane discussion possible in the mailinglist, hence this blog and the various repos. In the end I will submit the best solution as patch.
I am also working a symbolic proof besides the statistical one with the stp SAT solver, to see the "quality" of a hash function over the possible seeds, counting the number of collisions over the typical table sizes. With such a solver you can automatically create attack vectors against most web-sites with bad hash collision strategies, so I'm a bit reluctant to publish that.
Note that "hardening" a hash function does not help in hardening a hash table. It only will get slower, but only marginally better.
Notes: * Edited SIPHASH out. We are using jenkins_hard in blead now. This is 2x faster than siphash, and only 6x slower than unoptimized City, but 4-20x slower than SSE4 optimized crc32 intrinsics, with not that high additional collision costs. * Edited seed + len confusion. This is a good thing to do. In fact CRC32 is currently the leader in least collisions and performance.