perlcc next steps

By Reini Urban on August 7, 2014 11:46 PM

cPanel uses now the new perl compiler B::C with -O3 and --staticxs with 5.14.4 in production, and I'm outlining the next steps.

Our old compiler (with 5.6.2) needed 2:30 hours for each build, produced ~100 binaries of about 30-50MB size. This sounds a lot but it is not. This is about the size a single perl scripts needs in memory, but a single perl script has to find all the dependent files first, load and parse it, and construct the stashes, the nested hashes of namespaces. All this is not needed at all during run-time, hence perlcc compiled binaries do not need it. The new compiler uses 30 minutes for the same binaries on our build servers, but is not yet using a lot of internal optimizations. It does use a lot of statically initialized strings and data, so the run-time is better than with 5.6, even if 5.6 is still much faster than the 5.14 run-time. But we still don't collect arrays into _independent_comalloc ranges as we did with 5.6, we still don't use link-time nor profile guided optimizations and we still do not optimize single modules seperately.

There are not many shared libraries used in such a compiled binary, just libperl and the used XS shared libraries.

Starting such a single binary is very fast, it's just mapped into memory at once, then the dynamic symbols need to be generated, as B::C cannot yet generate them statically, and perl5 is not really helpful in supporting static symbols, but all other data, strings and ops (perl functions) are generated statically. Then the XS modules are initialized (booted), then some pointers pointing into some shared libraries need to updated, as they pointed to some other pointer at compile-time and then the perl program is started.

perlcc -m

The plan for the next year is to support generating shared modules per perl package. perlcc -m compiles a package to a shared library, buildcc generates the Makefile.depend dependencies for each binary, and then you can use selectively more advanced optimizations per package. I.e. B::CC or rperl optimized compilations, which should run 2-20x faster than the current B::C compiled packages.

B::CC is not stable enough yet, so you can only use it with some modules, and it currently benefits greatly from type information, for which upstream maintainers have no interest. With rperl it is even more strict, as those modules are not even more strict, as in real programming languages compared to our simple dynamic scripting language, it explicitly disallows certain kinds of dynamic magic or array autovivification, which slows down run-time so much as benchmarked in my YAPC::EU 2013 talk.

So the plan is to compile to several optimized libraries seperately and link them together later. For binaries which fork or which need to be updated pretty often, shared libraries are preferred, for other simple binaries static libraries are preferred as they are loaded much faster. Shared libraries can be updated much easier, and when you fork a binary its shared libraries are shared, there's no seperate copy in memory. This is also one of the main reasons to change our current perl5 unicode implementation from simple and slow perl data structures, to compiled shared libraries, which are just mapped in when needed and can be shared.

Data::Compile

But first some other steps are needed. Before we will create compiled libraries per perl packages we want to compile only some read-only datastructures, hashes mainly. The name of this module is Data::Compile and will replace our current cdb databases for localization data. cdb maps readonly strings to strings, but cannot store other values than strings and is implemented as normal hashmap. Our localization data would be much easier to use with stored coderefs, similar to pre-compiled html templates, which which is a mix of strings and coderefs. By using certain parts of B::C it is trivial to store data as shared library, which can be dynaloaded on demand. Loading such a compiled datastructure of perl data is of course much faster then loading a btree or colliding hashmap and accessing it at run-time through some magic tie interface. B::C is very efficient in storing perl data statically, and only some GV symbols need to be initialized at boot-time.

Data::Compile will also replace the overly slow serializers, like Sereal, Cpanel::JSON::XS, Data::MessagePack or Storable, which have to convert binary data from and to perl data. Data::Compile freezes perl data natively and just thaws it at instance. So loading the CPAN Metadata file will go from 16 secs to 0.001ms, similar to CPAN::SQLite and cpanm which does those queries at run-time to some server will not be needed anymore. Updating such a compiled CPAN Metadata file is estimated to need 2 secs, which needs about the same time as updating CPAN with CPAN::SQLite. And CPAN::SQLite still has a locking bug, which prevents multiple cpan sessions, so you are easier of with cpanm now.

In order to optimize loading of static data, readonly hashes, to replace cdb or sqlite databases or serialized caches we need another module first:

Perfect::Hash

Currently there's is only gperf to create perfect hashes in C or C++, and then there is also the lesser known bob jenkins perfect executable to create perfect hashes (with some weird header files) and of course the cmph library to create perfect hashes of bigger dictionaries, google size. No single database or datra-structure can be faster to lookup then perfect hashes for readonly data, the fastest single-host readonly databases are cdb and mysql. And perfect hashes beat them by far. Ever wondered why Google lookups are so fast? Well, they distribute hashes across several servers with so-called consistent hashes, which map strings into buckets of n servers, and when they insert and delete entries the remapping (copying to other buckets) is minimized. But the real trick are minimal perfect hashes, which minimize lookup times and storage sizes far beyond normal or sparse hashes or b-trees.

So I created phash to replace gperf in the long term, by using better algorithms to handle any data (gperf fails to work with anagrams or weird keys), to create optimally fast C libraries or optimally small C libraries for fast hash lookups, and even provide backends to output perfect hashes for several popular programming languages, like perl (XS), java, ruby, php or python. As C library you can only store strings as keys, and integers and strings as values. With those other backends you can store all supported values.

Perfect hashes look differently on small 8-bit machines than on fast x86_64 machines, for static libraries or for shared libraries with -fPIC, for <1000 keys or for >1.000.000 keys, for keys with NUL characters or not, for 7-bit keys only, for unicode keys, for case-insensitive key lookup, and much more. Using a high-level language to analyze the keys to generate a good perfect hash (in this case perl) is much easier than fixing and enhancing gperf.

The other main problem is icu which is a collection of glorified but moderately efficient hashmaps of unicode tables and the even worse perl5 implementation of those unicode tables. Encode does it much better by pre-compiling encoding tables into good shared libraries, but unicode has much more tables than just encodings. parrot currently uses icu with a new gperf generated workaround for missing tables in icu (the not-existing whitespace class and missing namealiases which were broken in icu 5.0), and moarvm came up with it's own implementation of those tables to workaround icu deficiencies.

So far I have re-implemented the best algorithms to create and use perfect hashes in pure perl and optimized it with XS functions and I am able to create good C code. I am still testing optimal hash functions and strategies. I need only ~2 seconds to create a perfect hash for 100.000 entries in pure perl, with C libraries this goes down to 0.001 seconds, scaling linearily. The main problem is solved. Compilation with a C compiler and -O3 of such hashes need another second.

E.g. these hashes can be used to replace the constant lookup in ExtUtils::Constants. So I'll look soon into seperate my current ~10 different implementations of perfect hashes into one simple but good enough pure-perl version, which will generate pure-c code, without the need for any external library, and then seperate the others into several packages with external dependencies, like zlib for a fast hardware-assisted crc32 hash function (1 op per word), or libcmph and the various output formatters.

Then Data::Compiled will use Perfect::Hash::XS to store the read-only hashes. And then source code will change to use those hashes instead, and then perlcc -m will be able to compile and link arbitrary perl packages with various type annotations and various compilers, if B::C -O0, -O3, B::CC -O0, -O2 or even rperl.

The next plan is then to create a type inferencer in C, but I'll wait for this until p5p comes to a syntax decision how to annotate return types of functions. If ever.