Statistics for perl hash tables
The following data is from a 64-bit perl-5.19.10 (i.e. 32-bit Jenkins, corrected from 64-bit SipHash; it randomly shuffles the first two entries of a collision chain), running the core test suite with -DH.
number of hash keys (second column: number of occurrences; the next two tables use the same format):
0 3634661x
1 3263596x
2 4361350x
3 2745134x
4 4199196x
5 2921502x
... and so on linearly falling ...
29 1821569x
30 736893x
31 312185x
32 643500x
33 379865x
34 371403x
35 1050967x
36 1016470x
37 1226918x
38 8558286x
39 1671515x
40 4594200x
41 2859521x
42 627789x
43 686036x
...
67 220743x
68 3382026x
69 1225141x
70 147903x
71 95882x
...
307 8330x
308 8962x
309 7902x
310 9041x
...
513 2413x
514 2713x
515 2708x
516 1945x
...
1098 410x
1099 433x
1100 402x
1101 400x
...
1848 69x
2225 14756x
2226 66x
2283 14877x
2284 63x
2309 14818x
2310 60x
2658 25321x
2659 56x
4239 9221x
4240 53x
5654 23526x
5655 46x
6355 1799x
6356 41x
6768 1641x
6769 38x
7476 2001x
7477 35x
7983 37x
7984 76x
7985 27x
7989 25x
7990 23x
8292 54x
8293 19x
9716 18x
10132 19x
10133 18x
13060 1968x
13061 15x
19745 39503x
19746 13x
20893 3795x
20894 9x
22124 3862x
22125 5x
23647 3865x
23648 1x
25137 328x
hash table sizes:
0 1933x
1 294x
3 432x
7 24128585x
15 14099448x
31 15040566x
63 28189821x
127 19795904x
255 10955641x
511 3851922x
1023 2009591x
2047 961707x
4095 180486x
8191 236082x
16383 141396x
32767 136848x
depth of linear search of collisions:
0 19076114x
1 76051832x
2 19870023x
3 4091878x
4 577090x
5 46417x
6 16933x
7 346x
8 23x
It was produced with this patch and a short script to sum up the occurrences of the hash key counts, table sizes and tried collisions. The tests are still work in progress; see some more in this perl-hash-stats repo.
In the perl5 test suite the number of keys per hash has a median of 20 and an average of 133.2. The most common key counts are 4, 101 and 2.
Average case (perl core test suite)
| Hash Function | collisions| cycles/hash |
|:------------------|----------:|------------:|
| CRC32 | 1.078 | 29.78 |
| SUPERFAST | 1.081 | 34.72 |
| ONE_AT_A_TIME_HARD| 1.092 | 83.75 |
| SIPHASH | 1.091 | 154.68 |
| ONE_AT_A_TIME | 1.098 | 43.62 |
| ONE_AT_A_TIME_OLD | 1.100 | 43.62 |
| MURMUR3 | 1.105 | 34.03 |
| DJB2 | 1.131 | 44.73 |
| CITY | | 30.13 |
| SDBM | - | 30.57 |
Fewer collisions are better, fewer cycles/hash is faster. A hash table lookup consists of one hash function call with constant cost (depending only on the length of the key) and then resolving 0-n collisions (0-8 in our average case).
- collisions are the number of linked list iterations per hash table usage.
- cycles/hash is measured with smhasher for 10 byte keys. (see "Small key speed test")
- SDBM and DJB2 did not produce a workable miniperl with -O2; I needed to patch them. Seeing that HASH=0, effectively creating a long list of linear collisions in HvARRAY[0], does not work in current perl5 makes me feel bad. Note that seed + len is there to prevent the \0 attack (see the sketch below).
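To illustrate what such a seeded hash function looks like, here is a minimal sketch of a seeded one-at-a-time hash, simplified from the family perl uses (not the exact blead implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of a seeded one-at-a-time hash. Folding the key
 * length into the seed is the "seed + len" hardening that prevents
 * the \0 (key extension) attack mentioned above. */
static uint32_t one_at_a_time(const unsigned char *key, size_t len,
                              uint32_t seed)
{
    uint32_t h = seed + (uint32_t)len;
    for (size_t i = 0; i < len; i++) {
        h += key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}
```

The bucket index is then just the low bits, hash & (size - 1), which is why only the last few bits of the hash value matter for typical table sizes.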
So it's a pretty good distribution for testing better hash tables. Sorry, no beautiful plots yet; later.
The problem I see with the new 5.18 hash tables is manifold:
- instead of using a fast and optimized hash function we are using a slow, unoptimized one. Instead of picking a good and fast function and avoiding collision attacks structurally, we try to avoid attacks by hashing more slowly. The chosen hash functions are also not the best (faster with fewer collisions) as independently verified on average data.
- we hash unnecessary bits. Typically only the last 5-9 bits are needed (table sizes 32-512); the biggest hash table in the core test suite needs 15 bits!
- we cargo-cult salting by adding the random seed to the end of the key. Every hash function combines the key with the salt anyway, and we only need the last bits; the last bits should prefer the key, not the seed. (This is wrong: we add the len to the seed to prevent Ruslan's \0 attack.)
- the problem is only to avoid linear collision exhaustion, O(n/2) search time, but we didn't change the collision strategy, e.g. from a linear list to a binary-sorted array, i.e. O(log n). A sorted array of collision entries is much faster to search and much more cache friendly; that's why open addressing is so popular. Sorting up to 6 collisions is thankfully trivial (a hard-coded insertion sort), and we only need to fall back to complicated sort functions under real attacks (see the sketch after this list).
- we randomize seeds only when rehashing, when the attack can already occur. We did it better with 5.8.1, but this was "optimized" with 5.8.2, which led to the problem with 5.18.
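To make the sorted-array idea concrete, here is a minimal C sketch (hypothetical types and names, not perl's HvARRAY code): each bucket keeps its colliding entries in a gapless array sorted by key string, so lookup becomes a binary search instead of a linked-list walk:

```c
#include <string.h>
#include <stdlib.h>

typedef struct { const char *key; void *value; } entry_t;
typedef struct { entry_t *entries; size_t len; } bucket_t;

/* O(log n) binary search instead of the O(n/2) linked-list walk */
static entry_t *bucket_find(bucket_t *b, const char *key) {
    size_t lo = 0, hi = b->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, b->entries[mid].key);
        if (cmp == 0) return &b->entries[mid];
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    return NULL;
}

/* insert keeping the array sorted; the shifting loop is the
 * hard-coded insertion sort, trivial for the 0-8 collisions
 * seen in the stats above */
static int bucket_insert(bucket_t *b, const char *key, void *value) {
    entry_t *e = realloc(b->entries, (b->len + 1) * sizeof(entry_t));
    if (!e) return -1;
    b->entries = e;
    size_t i = b->len;
    while (i > 0 && strcmp(key, e[i - 1].key) < 0) {
        e[i] = e[i - 1];   /* shift greater keys up to make a gap */
        i--;
    }
    e[i].key = key;
    e[i].value = value;
    b->len++;
    return 0;
}
```

(As discussed in the comments below, entries within one bucket share the same hash, so the sort key has to be the key string, not the hash value.)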
old info:
cPanel perl randomizes the seed for all hashes, so it is even immune to the attack Ruslan is talking about in the public blog post at booking.com, but the full motivation for the current 5.18 design is not publicly accessible. (It was accessible, I just did not find it.)
I'm working on fixing the 5.18 hash table problem and also evaluating robin-hood (the fastest open addressing hash table) and cmph BMZ (the fastest perfect hash), and using HW-optimized hash functions (crc32 intrinsics with SSE4.2 or armv8).
My analysis of the usual hash functions is here: github.com/rurban/smhasher, where you can see how ridiculously slow siphash is, and how slow and bad jenkins is, compared to fast and more secure functions. There are also the results of the full test suite for each hash function, and the easier-to-follow stats evaluating quality vs. performance.
Good open addressing hash tables are evaluated at github.com/goossaert/hashmap.
Perfect hash tables are explained at cmph.sourceforge.net/.
The public motivation to slow down the perl hash function is here. The discussion within the security team is not available, but as I was told, only the actual attack is unpublished, which is good. What is not good is that the developers have no idea how to evaluate hash tables, and no sane discussion is possible on the mailing list; hence this blog and the various repos. In the end I will submit the best solution as a patch.
I am also working on a symbolic proof, besides the statistical one, with the stp SAT solver, to see the "quality" of a hash function over the possible seeds, counting the number of collisions over the typical table sizes. With such a solver you can automatically create attack vectors against most websites with bad hash collision strategies, so I'm a bit reluctant to publish that.
Note that "hardening" a hash function does not help in hardening a hash table. It only will get slower, but only marginally better.
Notes:
- Edited SIPHASH out. We are using jenkins_hard in blead now. This is 2x faster than siphash and only 6x slower than unoptimized City, but 4-20x slower than SSE4-optimized crc32 intrinsics, with not that much additional collision cost.
- Edited the seed + len confusion. This is a good thing to do. In fact CRC32 is currently the leader in fewest collisions and performance.
What is column 2, and what is its unit, in "number of hash keys" and the next 2 tables?
> instead of using a fast and optimized hash function we are using a slow, unoptimized almost-crypto hash function with 64 bits
That ultra-slow siphash was removed in 5.17.10 (http://perl5.git.perl.org/perl.git/commit/8dafa723e1a85f9bfb12b62414c660009b085761); its slow but slightly faster predecessor murmurhash was removed in 5.17.8 (http://perl5.git.perl.org/perl.git/commit/3db6cbfca39da94d152d3e860e2aa79b9c6bb161). RTM 5.16 uses one-at-a-time, which is the ancient traditional algo.
> we hash unnecessary bits. Typically only the last 5-9 bits are needed (table sizes 32-512); the biggest hash table in the core test suite needs 15 bits!
Maybe I don't get something here, but CPUs can natively only do 32/8-bit operations, or 64/32/8-bit operations (16-bit operand overrides are synthesized in CPU firmware IME). No performance is saved by doing & masks to zero out the high bits during hashing, in other words by synthesizing two's-complement 11-bit integers.
> the problem is only to avoid linear collision exhaustion, O(n/2) search time, but we didn't change the collision strategy, e.g. from a linear list to a binary-sorted array, i.e. O(log n). A sorted array of collision entries is much faster to search and much more cache friendly. That's why open addressing is so popular. Sorting up to 6 collisions is thankfully trivial (a hard-coded insertion sort), and we only need to fall back to complicated sort functions under real attacks.
So you are proposing that perl switch from a linked list of collisions to a gapless array, with the entries of the array sorted by hash number? Wouldn't that increase insert/removal times to O(n) as the # of collided entries grows, including memcpying the collision array to collapse or create the gap on removal/insert? You are supposing that hash lookup is done much more frequently than hash modification; is that something you believe for all Perl workloads, or just your workloads?
> My analysis of the usual hash functions is here: github.com/rurban/smhasher, where you can see how ridiculously slow siphash really is, compared to fast and still secure functions. There are also the results of the full test suite for each hash function
FWIW, http://www.cpantesters.org/distro/B/Benchmark-Perl-CoreHashes.html?oncpan=1&distmat=1&version=0.02 . Some of the alternative P5 algos cause massive test failures https://rt.perl.org/Ticket/Display.html?id=120208
> I'm working on fixing the 5.18 hash table problem and also evaluating robin-hood (fastest open addressing hash table) and cmph BMZ (fastest perfect hash)
In reading about those topics there is too much collegiate math and not enough Turing-machine explanation, so I can't comment.
~ bulk88
Reini, Perl 5.18/19 is NOT using siphash. Python uses siphash nowadays. Perl uses a hardened version of one-at-a-time unless you explicitly edit header files to choose a different hash function. The hardened one-at-a-time is a lot faster than siphash.
So to be honest, your usual railing against the oh-so-secretive non-cabal is turned upside down by not having bothered to properly read what Perl even does. Search for PERL_HASH_FUNC_ONE_AT_A_TIME_HARD in the sources.
This all being said, if you have actual improvements to the hash (data structure) implementation in the perl core, and you manage to keep a civil tone and paranoia at bay, I am actually pretty confident your contributions would be welcomed. During the 5.18 hash revamp, Yves already had a number of additional ideas that he didn't pursue to the limit for lack of time.
Thanks Reini for being our low level optimization and security guy! Don't let anyone get you off track!
Steffen, you are right. I see we are using ONE_AT_A_TIME_HARD in blead now. That's what the statistics are run against. Obviously I was only reading Yves' blog post and didn't follow the many changes of the now-default function: JenkinsOOAT => Murmur3 => SIP => Jenkins_HARD.
I'm running now the stats against the other hash functions to measure the effect on the collisions.
BTW: I see no conspiracy tone in my post. It's just sad that I cannot read about the rationale and possible attacks which led to some design decisions that really look crazy, not to say stronger words, from the outside.
If you don't publish that, you'll need to live with my allegations. And it just makes it much harder for anybody not in your secret security team to come to any conclusions.
I only think I understand what Ruslan meant.
To my knowledge, there was only one thing available in the private thread that shouldn't be exposed still. That was a complete exploit of the original weakness that Yves fixed. What I mean by that is "a complete set of data to use to make many web servers running older Perls use all available memory". Furthermore, to my knowledge, the actual weakness that was fixed by Yves is publicly known and described and is actually part of the blog post. Or to put it differently, I think there's a lot less secrecy than you might realize!
This is what I am referring to: You're assuming there's some choice to keep lots of relevant things secret. There isn't. Also: "The public motivation to slow down the perl hash function is here. The discussion within the security team is not available." The Perl hash function wasn't actually slowed down in a significant way at all (cf. it not being siphash).
Regarding use of CityHash: I'm not on firm ground on this, but I believe that Yves rejected it because its properties as a hash function weren't good enough (in that its output distribution isn't sufficiently flat). {citation required}
Steffen: Ok, sounds fair not to publish the attack itself.
I'm working on testing my ideas currently.
The typical case should be fast and the abnormal attack costs should not be linear.
So far with our linear collision scheme I got:
_mm_crc32_u64: 1.078 collision cost / hash, but ~20x faster with SSE4.
jenkins_hard: 0.456 collision cost / hash
The cost of resolving 2-8 collisions is not that high.
I also need to add optimizations for shorter keys.
I'll add more pages when the new data is ready.
I'd like to clear a couple of things up.
First, regarding the hash security changes, we have not published the key discovery attack on the ONE_AT_A_TIME_OLD function, and we have not disclosed the full attack key set I calculated to prove you could use all the memory on a box with a fairly small set of keys.
The use of OAAT_HARD was chosen because of unpublished analysis that it was harder to attack than a number of the other options while still performing relatively well.
Anyway, I put a lot of effort into making it possible to switch the hash function, and into using Perl as a test bed for finding a better one. I positively welcome you finding one.
Regarding your patch to monitor this stuff: it is quite interesting; I have done similar stuff myself. May I point out, however, that there are a variety of statistical tools in Hash::Util for probing the collisions in our hash functions. If you simply construct a hash and then use them, you can find out all sorts of interesting information about the distribution of the hash functions you choose to investigate.
Please be aware though, that Perl as a public release will have to put security before performance, and will have to ship with an algorithm that is at least "not known to be broken" by our security team. The hash function used needs to be robust against key discovery attacks and key extension attacks.
Anyway, adding a different hash function is a matter of tinkering with hv_func.h and setting up the appropriate defines so we create a key schedule/seed of the appropriate size (possibly other initialization), and then implementing a function that does the hashing with a particular call signature. Once you do that, you can hook in whatever hash function you like and try it out. If you find a good one that is faster and more secure, I will be very happy.
Also I should mention you compared against CityHash, which as far as I know is broken in that you can construct a multi-collision attack (what I call a key extension attack above).
https://131002.net/siphash/#at
https://131002.net/siphash/citycollisions-20120730.tar.gz
For the record: I'm with Yves on this. If you do find a hash function (or a change to how the data structure is implemented) that speeds things up while retaining security, I'll be very happy and grateful!
I think a lot of what looks like secrecy and cabal-ism is merely the result of a very low bus number for all of the perl core. Not only are there only two dozen people at most with a good understanding of the core, but worse, for any one particular area of the code, it is likely there is only one person who understands its design – sometimes two, and for quite a few areas, zero. The result of that is that a lot of decisions are made unilaterally by whoever is working in some area, simply because there is no one else with any real input aside from technicalities like style and consistency with other parts of the code. And much of the time, that person doesn’t write out their thinking process in much detail, or at all – and I don’t know if it would help if they tried, because it’s hard to express something effectively if you have no sense of who you are trying to communicate it to.
As a situation this is awful, it’s just not really driven by anyone’s purposeful intent.
As for seeming crazy, a lot of crazy decisions seem (and often (if not always) are) rational when they’re made conservatively on top of existing decisions. You always come up with better decisions if you know where you are going and you start from scratch – but the nature of maintaining software over the long term is that you don’t know where you will be going years down the line, and you always have to work on top of your old decisions. You end up with lots of decisions that are crazy if you take them by themselves. (This is one of the things that suck big time about software development. Unfortunately it seems inherent.)
bulk88:
Yes, but you can optimize better for certain keys. For key length <= 8 we can use one single _mm_crc32_u64 instruction on 64-bit. On 32-bit we can use _mm_crc32_u32 for key length <= 4.
The quality of the crc32 function is good enough for this case when the collision strategy is good enough.
When SSE4 is not available, we could test shorter and faster hash functions which hash only the lower bits well enough for smaller tables.
I.e. the API for PERL_HASH needs to know the hash table size, and rehashing needs to recalculate the hashes on certain jumps from the fast to the slower hash function. A sketch of the short-key fast path follows below.
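A hypothetical sketch of that fast path (my own names; compile with -msse4.2; not a patch against blead):

```c
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u64 */

/* Hash a key with the crc32c instruction, one instruction per
 * 8 bytes; a key of length <= 8 needs a single _mm_crc32_u64. */
static uint32_t crc32_hash(const char *key, size_t len, uint32_t seed) {
    uint64_t h = seed;
    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, key, 8);
        h = _mm_crc32_u64(h, chunk);
        key += 8; len -= 8;
    }
    if (len) {                     /* zero-padded tail, no OOB read */
        uint64_t chunk = 0;
        memcpy(&chunk, key, len);
        h = _mm_crc32_u64(h, chunk);
    }
    return (uint32_t)h;
}
```

The bucket index would then be crc32_hash(key, len, seed) & (tablesize - 1), consuming exactly the low bits the table size requires.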
The sorted array idea is used by some hash tables which care about collision attacks. Since the hash is equal on collisions you couldn't sort by hash; you need to sort by the key strings. The rest is true. The 4th column should be the operation used, but I haven't marked it well enough yet.
Insert/delete on collisions needs to memcpy/realloc this array, yes. It's expensive, but collisions are pretty rare, search is favored over changes, and arrays are much more cache friendly than linked lists.
But I hope the other hash tables will work out better. Everybody is totally crazy about robin hood now, even if the implementation is not trivial at all (and many got it wrong). See the hashmap link (the variant named backshift).
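For reference, a toy robin-hood insert (my own simplified code: power-of-two table, hash 0 abused as the empty marker, and no deletion, which backshift exists to solve):

```c
#include <stdint.h>

#define TSIZE 64   /* power of two */

typedef struct { uint32_t hash; const char *key; void *val; } slot_t;
static slot_t table[TSIZE];   /* hash == 0 means empty in this toy */

/* distance of a slot's occupant from its ideal bucket */
static uint32_t dib(uint32_t hash, uint32_t pos) {
    return (pos + TSIZE - (hash & (TSIZE - 1))) & (TSIZE - 1);
}

static void rh_insert(uint32_t hash, const char *key, void *val) {
    uint32_t pos = hash & (TSIZE - 1);
    slot_t cur = { hash, key, val };
    for (uint32_t d = 0;; d++, pos = (pos + 1) & (TSIZE - 1)) {
        if (table[pos].hash == 0) {   /* empty slot: done */
            table[pos] = cur;
            return;
        }
        /* "rob the rich": evict any occupant sitting closer to its
         * ideal bucket than we are to ours, then keep probing with
         * the evicted entry */
        if (dib(table[pos].hash, pos) < d) {
            slot_t tmp = table[pos];
            table[pos] = cur;
            cur = tmp;
            d = dib(cur.hash, pos);
        }
    }
}
```

This keeps the variance of probe distances small, which is what makes lookups fast; the tricky parts are deletion (backshifting the following cluster) and resizing, which is where many implementations go wrong.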
I'm also testing with a better/user-specific -DH patch (printing to a log.hash filehandle) for our realistic workload.
But I guess the make test numbers are pretty good. If not, I need to add some tests to skew them more towards reality, esp. for benchmarks. I had to tune the parrot benchmarks similarly recently.
Don't worry about Reini's crazier ideas like sorting things by their hash number. That will never ever ever happen.
It's a patently ridiculous idea, completely insecure, and nonsensical.
yves: This was bulk's idea, not mine. If the hash numbers are the same in an array of collisions, sorting by them would be wrong.
But this quickly got below my level of interaction, sorry.
Aristotle: That is my theory also. The implementation is just bad (that's what I see) and the decisions were not rational.
The contradicting theory (in p5p's favor) would have been that they found something to keep the current collision scheme (the problem) within their non-public security discussion, and chose not to disclose it.
My best guess is that nobody but the implementors has a basic idea about hash tables. That's what I found out when trying to explain it on IRC. Wikipedia is really good at explaining it. Perl is not the only dynamic language with these problems.
In fact the others are even worse: they copycat the worst decisions of perl hashes.
"Oh, we need security, we need improve our hash func and add a random seed". (The 2nd being actually a good decision)
python taking siphash, instead of better collision resolution, hmm. They either trust perl more than their own judgement or have no clue either.
ruby still uses the stone-old (1980, Peter Moore) hash table (a linear linked list) as we do, but their code is still readable, they actually use primes for the table size to limit collisions, and they actually count their collisions. But still, same problem. https://github.com/ruby/ruby/blob/trunk/st.c
And the best is the wikipedia entry about python and perl hash tables being "highly optimized". Read my comment on the Talk page about this lie/uninformedness.
https://en.wikipedia.org/wiki/Hash_Table#In_programming_languages and https://en.wikipedia.org/wiki/Talk:Hash_table#Wrong_.22highly_optimized.22_Perl_claim
"What is not good is that the developers have no idea how to evaluate hash tables and there is no sane discussion possible in the mailinglist"
Asserting that nobody except you is competent to have an opinion does, indeed, largely ensure that no sane discussion is possible. Occam's Razor suggests that the problem does not, however, lie with the mailing list.
About fifteen years ago, I re-engineered significant chunks of my visible personality and interaction patterns with other people in order to put me in situations where I had to improve my social and communication skills.
It's been an ongoing and uphill effort since, and I'm still far from perfect - but it's been very noticeable to me that the investment has paid off many times over, simply because it doesn't actually matter if I know better than everybody else if I can't successfully explain why I'm right - and if people find me too unpleasant to interact with to listen to in the first place, I've effectively failed in advance.
Currently you're faceplanting at the first hurdle before even getting a chance to try and vault the second, and I would suggest you give serious consideration to the idea that either making such an investment yourself, or finding somebody who has already done so to partner with in bringing your eventual findings to p5p, would substantially increase the odds of getting a productive end result.
It'd seem a shame to have all this effort wasted because you're so busy telling people they're wrong that you never get round to successfully communicating why you're right.
-- mst, out
mst: It is a sad thing to say, and it would have been better to avoid these allegations, but other people need to work with perl5 too, so they need to be told what the state of the onion is and what to expect. Most people have left p5p already. I haven't tried communicating with p5p recently, but my impressions only got worse when I tried to explain it on the #p5p channel.
The best thing I can do is write blog posts explaining the problems, and create reproducible code and test cases, in the end patches, and maybe the relevant devs will look up unknown concepts on wikipedia and understand a bit more. There's also a lot of side-chatter from the junior side going on, calling names and creating an unpleasant atmosphere all over, aside from the technical nonsense they are repeating over and over.
I thought it was your job to prevent the folks from calling names. You failed to do so this time.