C::Blocks Advent Day 11
This is the C::Blocks Advent Calendar, in which I release a new treat each day about the C::Blocks library. Yesterday I explained how to use C::Blocks in multithreaded Perl code. Today I will compare C::Blocks to other TinyCC-based Perl libraries.
C::Blocks is built on the Tiny C Compiler, but it's not the only way to include C in your scripts. First of all, tcc itself has a -run option, so saying something like tcc -run my-code.c will immediately compile and run your code. This means tcc sorta turns C into a scripting language all by itself. But if we're talking about Perl, the first distribution to work with tcc was C::TCC. This was little more than a glorified wrapper around tcc's own run function, except that you could assemble your C code into a string rather than storing it in a file. The first wrappers to focus on serious interactions between Perl and C were my C::TinyCompiler and Steffen Mueller's XS::TCC. Steffen's module focused on making it easy to write C code that could be called from Perl; my distribution sorta attempted that, but mostly tried to provide a framework for writing C libraries that could be used from Perl. Graham Ollis' FFI::TinyCC is a more recent addition and leverages the amazing capabilities of FFI::Platypus. His library is similar in spirit to Steffen's, but creates its xsubs using a foreign-function interface instead of the usual XS mechanism. All of these present tradeoffs of different sorts. If you find yourself needing some of these capabilities, which library should you use?
Caveats
Just like with my last post on benchmarks, this post has been delayed even more than usual. The reason for the delay is a growing body of data on spectacular slow-downs for incremental changes to functions created using C::Blocks. Much of my spare time over the last few days has focused on chasing down the root cause of these slow-downs and discussing them on the tcc mailing list. As far as I can tell, the slowdowns that I can produce arise not from tcc itself, but from unfortunate processor cache evictions that arise from interactions between tcc-compiled code and the Perl interpreter. As far as I can tell, XS modules do not suffer from these sorts of problems, but more testing is needed to be sure. Regardless, now is a great time to mention that C::Blocks is still in PRE-beta. I'll discuss what that means in a future treat, but for now know that performance optimizations are not a key goal yet. I want these issues sorted out before v1.0. Now, on to the comparisons and discussion.
Base Recommendations
I do not recommend C::TCC or C::TinyCompiler for anything. The first of these does not provide any facilities for passing Perl data to C (or back), and the latter of these (mine) is not well designed for its purpose. Anything you can do with C::TinyCompiler can be done better with one of the others.
If you simply need to call functions in a pre-compiled C library from Perl, and if you can properly form your arguments, you should just use FFI::Platypus. It is the most mature option for doing this sort of thing. The only reason for using C::Blocks in this situation is philosophical, i.e. if you'd rather think of your C code in terms of blocks rather than function calls.
If you want to share C constructs across multiple compiler contexts and even multiple modules, then C::Blocks is really the way to go. I know I'm a bit biased here, but C::Blocks' support for sharing struct, function, enum, typedef, and preprocessor declarations is unmatched. This sort of sharing was one of the chief design goals of the distribution, and was based on things I learned from doing C::TinyCompiler poorly. FFI::TinyCC and XS::TCC do even less than C::TinyCompiler in this realm (though they do what they do much better than C::TinyCompiler). If you're building a large codebase in which you want to freely mix C and Perl, you should use C::Blocks.
If you just want to include C code in your Perl script that exchanges data with Perl, then read on. Inline::C is an option, but if you want to do this with JIT compilation then you should consider XS::TCC, FFI::TinyCC, and C::Blocks. Let's drill down a bit.
Comparing Implementations of a Simple Case
The simplest meaningful example I could come up with was one we've already seen: the random number generator. We saw this on Day 4 as an example of how to declare functions using clex. On Day 5 I compared the performance of the C::Blocks and Inline::C implementations. On Day 8 I demonstrated how easy it is to share code across modules using cshare. We'll use it this time to compare how it is implemented using these various approaches.
I've written a number of implementations. The complete collection can be found as a gist on github. They are all fairly similar in layout so I will only explain the key aspects. All of them except for rand-clex.pl create a Perl function KISS_rand that wraps a C-implemented random number generator. Of course, rand-perl.pl implements its random number generator with pure Perl.
I provide three implementations using C::Blocks. The first example uses csub, differing from the others because it has to explicitly push the return value onto the stack. The second example embeds a cblock in a regular Perl sub. It uses an idiom in which a variable is declared, then modified by the cblock, and finally returned at the end. The third implementation uses clex to declare the function and cblocks to execute it. This is awkward because a cblock is not considered an expression, and so cannot be combined with a postfix foreach loop. None of these implementations are very fluid, evidence that this specific problem was not a major design target.
The implementations for rand-ffi.pl and rand-xs-tcc.pl are nearly identical, differing only in the library that provides tcc_inline. The Inline::C implementation is mostly the same. All three of these extract the function signature and wrap the function call with essentially no work needed by the programmer, making this particular task much easier than it is with C::Blocks.
The only remaining implementation is rand-perl.pl, which implements the generator in pure Perl. This one is almost the same as the others except that all integer operations must be truncated at 32 bits by &ing the results with 0xffffffff. If I don't do this, the random number generator produces numbers that quickly blow up. Apparently my perl interpreter was built with 64-bit integers, and Perl quickly switches to a floating-point representation of a number when it encounters integer overflow.
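To make the truncation point concrete, here is a minimal C sketch. It is not the code from the gist: it assumes Marsaglia's 1999 KISS recipe as a stand-in generator, the seeds are illustrative, and the names kiss32_next, kiss64_masked_next, and kiss_streams_match are made up for this example. The first function relies on uint32_t arithmetic wrapping modulo 2^32 on its own. The second does the same work in 64-bit integers and masks with 0xffffffff after every operation that can carry past bit 31, which is roughly what a pure-Perl version has to do on a perl built with 64-bit integers.

```c
#include <stdint.h>

/* Generator state; field names follow Marsaglia's 1999 KISS macros. */
typedef struct { uint32_t z, w, jsr, jcong; } kiss32_state;
typedef struct { uint64_t z, w, jsr, jcong; } kiss64_state;

#define MASK32 UINT64_C(0xffffffff)

/* 32-bit version: unsigned 32-bit arithmetic wraps modulo 2^32 by itself. */
static uint32_t kiss32_next(kiss32_state *s) {
    s->z = 36969u * (s->z & 65535u) + (s->z >> 16);
    s->w = 18000u * (s->w & 65535u) + (s->w >> 16);
    uint32_t mwc = (s->z << 16) + s->w;
    s->jsr ^= s->jsr << 17;
    s->jsr ^= s->jsr >> 13;
    s->jsr ^= s->jsr << 5;
    s->jcong = 69069u * s->jcong + 1234567u;
    return (mwc ^ s->jcong) + s->jsr;
}

/* 64-bit version: every result that can exceed 32 bits is masked by hand. */
static uint64_t kiss64_masked_next(kiss64_state *s) {
    s->z = (36969u * (s->z & 65535u) + (s->z >> 16)) & MASK32;
    s->w = (18000u * (s->w & 65535u) + (s->w >> 16)) & MASK32;
    uint64_t mwc = ((s->z << 16) + s->w) & MASK32;
    s->jsr ^= (s->jsr << 17) & MASK32;
    s->jsr ^= s->jsr >> 13;            /* right shifts cannot overflow */
    s->jsr ^= (s->jsr << 5) & MASK32;
    s->jcong = (69069u * s->jcong + 1234567u) & MASK32;
    return ((mwc ^ s->jcong) + s->jsr) & MASK32;
}

/* Returns 1 if both generators produce the same first n values, else 0. */
static int kiss_streams_match(unsigned long n) {
    kiss32_state a = {362436069u, 521288629u, 123456789u, 380116160u};
    kiss64_state b = {362436069u, 521288629u, 123456789u, 380116160u};
    for (unsigned long i = 0; i < n; i++) {
        if (kiss32_next(&a) != kiss64_masked_next(&b)) return 0;
    }
    return 1;
}
```

Calling kiss_streams_match with any iteration count should return 1. Dropping the masks on the left shifts in the 64-bit version lets high bits leak back in through the right shift, which is the kind of drift the Perl version has to guard against.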
An important quality metric is ease of writing. The pure-Perl and csub implementations were the hardest to write; the former because it needed debugging to truncate at 32 bits, and the latter because I had to look up how to return values on the Perl stack. The others are all pretty easy to write, with the cblock example losing an edge due to the boilerplate inherent in the declare-modify-return idiom. Ultimately, if I were given this algorithm from a C programmer and asked to implement a Perl-visible one, the only implementation that was tricky was the pure-Perl one. All the others were basically of equal difficulty, I would say.
Benchmarks
Another important quality metric is speed of execution. To check the speed, I timed these on my computer using the time bash built-in, like so:
$ time perl rand-inline.pl 10000000
Random number #10000000 is 292847080
real 0m0.751s
user 0m0.744s
sys 0m0.004s
The times below are rough averages of the "real" times as reported above, but here listed in milliseconds, and are sorted according to execution time for ten million iterations, the longest test.
N =      1,000   10,000   100,000   1,000,000   10,000,000   LOC
clex        60       60        65          95          400    24
inline      55       60        65         125          730    23
xs-tcc     100      100       100         200         1000    23
csub        60       60        70         250         1900    24
perl        10       30        80         490         4500    18
cblock      60       65       120         550         5500    26
clex**      60       60       120         650         5650    24
ffi         50       60       130         800         8000    23
Before doing anything else, I want to discuss the two entries for clex. If you look at the gist, you'll see that rand-clex.pl has a line with a trivial operation commented out. The first set of results in the above table are for that situation. Uncomment that line and on my machine you get the results shown for clex**. In light of these huge swings, it's hard to draw deep conclusions from these timings. All this having been said, here is what I conclude from these:
The first take-away is that Perl really is fast! Since the pure-Perl implementation did not need to load any modules, it easily beat out all the other options for as many as 10,000 iterations. The speed improvements for compiled code only become evident after enough iterations have caught up with the cost of loading the module or compiling the code in the first place. In other words, don't over-estimate the speed gains of a C implementation if you're only doing a few lines of math.
The next take-away is that if you don't plan on calling this kind of function more than 100,000 times, the implementation does not matter. They all take roughly the same amount of time, so if you are already using one of these tools and you find yourself in this situation, you should keep using your tool.
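The break-even point behind these two take-aways can be estimated with a crude linear model: total time is a fixed startup cost plus a constant cost per call. The C sketch below fits that line through two columns of the table for each implementation and solves for the crossover. The timing struct and the function names are hypothetical, and the model assumes both implementations were measured at the same small iteration count.

```c
/* Two wall-clock measurements for one implementation, in milliseconds. */
typedef struct {
    double n_small, ms_small;   /* e.g. the N = 1,000 column */
    double n_large, ms_large;   /* e.g. the N = 10,000,000 column */
} timing;

/* Slope of the two-point line: approximate cost per call in milliseconds. */
static double ms_per_call(timing t) {
    return (t.ms_large - t.ms_small) / (t.n_large - t.n_small);
}

/* Iteration count at which `compiled` catches up with `interpreted`.
   Assumes both share the same n_small. Returns -1 when the compiled
   version never catches up or was never behind under this model. */
static double break_even_calls(timing interpreted, timing compiled) {
    double slope_gap = ms_per_call(interpreted) - ms_per_call(compiled);
    double start_gap = compiled.ms_small - interpreted.ms_small;
    if (slope_gap <= 0.0 || start_gap < 0.0) return -1.0;
    return interpreted.n_small + start_gap / slope_gap;
}
```

Feeding in the perl row (10 ms at 1,000 calls, 4500 ms at 10,000,000) and the inline row (55 ms and 730 ms) gives a crossover of roughly 119,000 calls. The timings are rough averages, so this is only an order-of-magnitude estimate; the table itself already shows perl slightly behind at 100,000 calls.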
The third take-away is that XS::TCC is hindered a bit by its approach. XS::TCC is the only distribution that instructs tcc to #include the Perl header files like perl.h. C::Blocks uses a cached symbol table produced from perl.h whereas FFI::TinyCC packs its arguments in a way that does not rely on perl.h at all. Other tests not shown here indicate that XS::TCC suffers dramatically on repeated string compilation, which can only be explained by the direct loading and parsing of perl.h and friends.
The fourth, caveat-laden take-away is that a cblock calling a function defined in a clex can be faster than a gcc-compiled xsub! "Wait," you say, "what about what you said on Day 5, about gcc-compiled xsubs being faster than C::Blocks???" It's important to emphasize that the for-loop in all of today's examples happens in Perl code, not C code. This means that each and every call to KISS_rand in the Inline::C code has to be invoked via the xsub calling mechanism; the C::Blocks clex code, on the other hand, calls it using just one or two ops in the op-tree. The cost of the argument stack setup and teardown for an xsub ends up dominating the computation cost, which is no surprise in light of the brevity of the function being called.
Just about everything else I might want to conclude would seem unjustified in light of the effects of cache conflicts. I suspect that results could vary significantly if these benchmarks were run on other machines. It appears that FFI::TinyCC suffers from the same sort of cache problems as clex**, though it fares worse, for some reason. As such, it seems that the best-case performance for FFI::TinyCC is probably comparable to, or maybe just a little bit slower than, C::Blocks.
Conclusions from the Simple Case
So, what are we to make of all of this? Here are my take-aways:
- If you're not already using one of these libraries, and if you have a short numeric calculation that does not utilize looping, and if you only call the function a few hundred or thousand times, then just implement the calculation in Perl.
- If you're already using one of these libraries for other one-off functions and you need to call your function only 100,000 times or less, just use whatever library you're currently using.
- If you want predictable scaling and good performance, even in the face of minor changes to your function, Inline::C is your best bet.
- If you call a small function millions or tens of millions of times, and need absolute speed (but aren't planning to write your code in C in the first place), you might be able to get it using clex and cblock. But, be sure to benchmark the heck out of it to be sure. Even better: implement your code using XS; add a keyword hook that injects an op that calls your function when it sees your keyword.
All of this having been said, the case under consideration is a little constrained. I'm assuming you never want to share the C code implementing your function with other modules. Most Perl programmers have learned to live with this as a reality of using XS, but I would like to challenge that. C::Blocks makes it really easy to share your C code across modules.
Also, we haven't considered the case of multiple return values, which necessitates manipulating Perl's stack. As far as I can tell, that case knocks out FFI::TinyCC since it doesn't have any stack manipulation facilities. XS::TCC and Inline::C have those capabilities, but I'm not sure that they can efficiently handle the Perl stack. They do not pass the Perl interpreter into the function where you do your work, so they would need to retrieve it from thread-local storage. I'm told that's a costly operation. At any rate, the additional code for this sort of thing levels the aesthetic playing field, so to speak.
Wrapping it up
Today I compared implementations of the random number generator using a number of different tcc-based libraries. If you find yourself needing to implement a single function to run quickly, there is no obvious choice: you'll have to resort to benchmarks of the various libraries to find out what works best. Writing these benchmarks has been illuminating for me; I hope that today's treat has been illuminating for you as well. I also hope that you consider opening up your C code and putting it on par with your Perl code as something worth sharing.