threads::lite and the chameneoses
The Challenge
Last week on Stack Overflow I came across an interesting challenge. Since my new module threads::lite seems to have stabilized enough for such a task, I decided to port the Erlang submission to it (while using some helper routines from the Perl submission). The porting was a fairly straightforward process that resulted in a pleasantly readable program (especially when compared to the other entry).
Then I ran it. It ran almost 6 times as slow as the other Perl program!
Reason enough to profile it and see what was going on. Renodino's comment proved to be quite accurate: Storable and locking seemed to be the main culprits.
To tackle the first issue I added a simple but effective feature: if a message contains only simple elements (no references or undefined values), it is packed instead of frozen.
So I changed my script to use only simple messages. The result: it was suddenly 3 times as fast (edit: compared to my first version). The lesson was simple: avoid expensive serialization steps when you don't need them. Storable is fast, but not fast enough for this kind of task.
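The fast path can be sketched in plain Perl (a simplified illustration of the idea; the pack format here is my own choice, not threads::lite's actual wire format):

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Fast path: a message of plain, defined scalars round-trips
# through pack/unpack using a length-prefixed string format.
my @message = ('token', '42', 'blue');
my $packed  = pack '(N/a*)*', @message;
my @again   = unpack '(N/a*)*', $packed;

# Slow path: freeze/thaw also handles references and undef,
# but does far more work per message.
my $frozen = freeze(\@message);
my $thawed = thaw($frozen);
```

The pack route only works because the benchmark's messages are flat lists of defined scalars; anything containing references still has to go through Storable.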
But there is more to be gained. The message-passing routines consume a lot of CPU. Run times fluctuate quite a bit and all profilers seem to report unreliable results (except NYTProf, which simply panicked), which makes profiling rather hard. So far, the costly part appears to be blocking for input when receiving on an empty queue. Unfortunately, that scenario is quite common in this benchmark. I don't really know how to solve it though; so far my attempts at finding something faster have been less than successful. This proves to be an interesting challenge.
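One possible mitigation, sketched here with the core Thread::Queue module rather than threads::lite's internals (the receive_message helper is a hypothetical name of mine): try a non-blocking dequeue first, and only fall back to the blocking call when the queue really is empty.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new;

# Receiving from a non-empty queue is cheap: dequeue_nb just takes
# the item. Blocking on an empty queue goes through a condition
# wait, which is where the extra microseconds go.
sub receive_message {
    my ($q) = @_;
    my $msg = $q->dequeue_nb;                   # cheap path: queue not empty
    return defined $msg ? $msg : $q->dequeue;   # slow path: block and wait
}

$queue->enqueue('hello');
my $got = receive_message($queue);
```

This only helps when the queue is usually non-empty; in this benchmark the receivers are mostly waiting, so the slow path still dominates.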
What does this mean?
Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.
Not that much, actually. The results of this benchmark aren't all that relevant for most real applications. It's a benchmark of synchronization primitives in a very tight loop. It proves how fast you can communicate a single byte of data between threads, but that's not very representative of real programs. Message passing is not a synchronization primitive, but a high-level abstraction built on top of them.
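To make that concrete, here is roughly what a message channel looks like when assembled by hand from threads::shared's lock and condition-variable operations (send_msg and recv_msg are hypothetical names, not threads::lite's API):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# A message channel is just a shared array guarded by a lock and a
# condition variable; here the array serves as all three.
my @channel :shared;

sub send_msg {
    my ($msg) = @_;
    lock @channel;
    push @channel, $msg;
    cond_signal @channel;    # wake one blocked receiver
}

sub recv_msg {
    lock @channel;
    cond_wait @channel until @channel;    # block while the channel is empty
    return shift @channel;
}

my $sender = threads->create(sub { send_msg('ping') });
my $got    = recv_msg();
$sender->join;
```

Everything a real message-passing layer adds on top of this (serialization, selective receive, monitoring) is extra work per message, which is exactly what this benchmark punishes.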
It is telling that the Erlang entry is actually two orders of magnitude slower than the fastest C entry, despite Erlang's reputation as a highly capable language for multi-threading. In my opinion, the Erlang score is the one to benchmark against, not the C one: it's far more representative of real programs. That still means I'm not doing all that well (Erlang is an order of magnitude faster), but it does put things in perspective.
As for the speed, sending a message appears to cost 7-10 µs on my laptop. I expect receiving a message costs about the same when it doesn't have to block; when it does have to block, it seems to cost about 30-60 µs. That's not terribly slow, but if you call it millions of times you will notice the impact.
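That arithmetic is easy to check: at 10 µs per send, a million messages already add ten seconds of pure messaging overhead. A rough way to measure the non-blocking path on your own machine, using the core Thread::Queue module as a stand-in for threads::lite:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new;
my $n     = 100_000;

# Enqueue and dequeue on the same thread, so the receive never
# blocks: this times only the cheap, non-blocking path.
my $start = [gettimeofday];
for (1 .. $n) {
    $queue->enqueue('x');
    $queue->dequeue;
}
my $elapsed = tv_interval($start);
printf "%.2f us per send/receive pair\n", 1e6 * $elapsed / $n;
```

The absolute numbers will differ from threads::lite's (and from machine to machine); the point is only that per-message cost times message count is the figure that matters.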
But there's something more important here. The Erlang entry isn't only the shortest, but also the most readable solution on the list. I'm an experienced Perl programmer, yet I find the Perl entry hard to comprehend: I can't tell from reading the code whether its locking is correct. The C entry is even worse (though that's mostly because it's bizarrely optimized). I'm not nearly as fluent in Erlang (in fact I've only worked with it on one project), but I could immediately understand what it was doing and why it would work.
That's the sort of thing I aim to recreate.
Maybe you should try submitting your entry if it's 3 times faster than the original one? Or did you mean it's 3 times faster than the 6 times slower version?
As you've said, it's not like these benchmarks show real-world performance, but it still makes Perl look bad in one way or another.
I worked on this a while back, but never submitted the entry, don't know why really. It's based on this version.
Take a look at mine and try it out. I think it's quite fast, maybe 7x faster, and uses standard modules all around.