native_pbc in parrot revived (numbers part1)

The design for parrot, the VM (virtual machine) under rakudo (perl6), envisioned a fast binary format for scripts and modules that is compatible across platforms and versions. This is something perl5 was missing. Well, .plc and .pmc files from ByteLoader serve this purpose, but since ByteLoader uses source filters it is not that fast.

Having a platform-independent compiled binary format lets parrot skip the parsing, compiling and optimizing steps each time a script or module is loaded.

Version compatibility was broken with the parrot 1.0 release, which is why I left the project in protest a few years ago. Platform compatibility is still a goal but is seriously broken, because the tests were disabled and nobody cared.

Since I have to wait in perl5 land until p5p can discuss and decide on a syntax for the upcoming improvements, which I can then implement in the type optimizers and the static B::CC compiler, I went back to parrot. p5p needs a few years to understand the most basic performance issues first. The basic obstacles in parrot are gone: parrot is almost bug free and has most of the features rakudo needs, but it is lacking performance.

Platform compatibility

So I tried to enable platform compatibility again. I wrote and fixed most of the native_pbc code several years ago, up to 1.0, and only a little bitrot has crept in since. Platform-compatible means that any platform can write a pbc and any other platform should be able to read it. Normally this would require a fixed format, but not so with .pbc. The pbc format is optimized for native reads and writes, so all integers, pointers and numbers are stored in native format, and when you open such a file on a different platform, converters translate those 3 types. Integers and pointers can be 4 or 8 bytes, little or big endian. This is pretty easy to support.
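To make the integer case concrete, here is a minimal sketch in Python (parrot itself does this in C). The function name and its parameters are made up for illustration. It only shows that word size and byte order are all a reader needs to know.

```python
import struct

def read_native_int(buf, offset, wordsize, big_endian):
    """Read one signed integer as the writing platform laid it out.

    Hypothetical helper: wordsize is 4 or 8, big_endian is the writer's
    byte order. Both would come from the file header.
    """
    if wordsize not in (4, 8):
        raise ValueError("unsupported word size: %d" % wordsize)
    fmt = (">" if big_endian else "<") + ("i" if wordsize == 4 else "q")
    return struct.unpack_from(fmt, buf, offset)[0]

# The same value as a 32-bit big-endian and a 64-bit little-endian
# writer would have stored it.
be32 = struct.pack(">i", -42)
le64 = struct.pack("<q", -42)
print(read_native_int(be32, 0, 4, True), read_native_int(le64, 0, 8, False))
```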

The problem comes with numbers. Until now, double and the Intel-specific long double format were supported. The main problem is that the Intel long double format is tricky and pretty non-standard. It has 80 bits, which is 10 bytes, but the numbers are stored with padding bytes: 12 bytes on 32-bit (2 bytes of padding) and 16 bytes on 64-bit (6 bytes of padding). Here Intel got padding right, but in the normal compiler ABI Intel is the only processor which does not care about alignment, which leads to slow code and countless alignment problems with fast SSE integer code. Most other processors require stricter alignment to be able to access ints and numbers faster. Intel code is also not easy to compile on better processors, because they fail on unaligned access. You cannot just access every single byte in a stream at will. At least you should not.

As it turns out, sparc64 and s390 (AIX) use the IEEE-754 standard quad 16-byte binary format for long double, which is the best we can get so far. GCC since 4.6 supports the same format as __float128 (via the quadmath library). Finally, powerpc has its own third long double format with -mlong-double-128, which is two normal 8-byte doubles one after another, "head" and "tail", and the value is the sum of the two. It's commonly called ppc "double-double". For smaller devices the typical format is single float, 4 bytes, thankfully in IEEE-754 standard format. All compilers can at least read and write it. But when it comes to accessing ... arguments of functions with va_arg(), gcc fails to accept float, because a float passed through ... is promoted to double.
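The two 16-byte formats can be sketched in a few lines of Python. This is only an illustration of the layouts, not parrot's converter code: the function names are made up, and the result is narrowed to a Python float (a C double), so the extra precision is thrown away on purpose.

```python
import math
import struct

def decode_double_double(raw, big_endian=True):
    """ppc long double: two IEEE doubles, value = head + tail."""
    head, tail = struct.unpack((">" if big_endian else "<") + "dd", raw)
    return head + tail

def decode_binary128(raw, big_endian=True):
    """IEEE-754 binary128: 1 sign bit, 15 exponent bits, 112 fraction bits."""
    bits = int.from_bytes(raw, "big" if big_endian else "little")
    sign = -1.0 if bits >> 127 else 1.0
    exponent = (bits >> 112) & 0x7FFF
    fraction = bits & ((1 << 112) - 1)
    if exponent == 0x7FFF:
        return math.nan if fraction else sign * math.inf
    if exponent == 0:
        # zero or subnormal: no implicit leading 1 (underflows to 0.0 here)
        return sign * math.ldexp(float(fraction), -16382 - 112)
    mantissa = (1 << 112) | fraction      # restore the implicit leading 1
    try:
        return sign * math.ldexp(float(mantissa), exponent - 16383 - 112)
    except OverflowError:                 # too large for a double
        return sign * math.inf

# -2.5 as binary128: sign 1, exponent 0x4000 (2**1), mantissa 1.25
quad = ((1 << 127) | (0x4000 << 112) | (1 << 110)).to_bytes(16, "big")
print(decode_binary128(quad))
# The tail of a double-double is below double precision and gets lost.
print(decode_double_double(struct.pack(">dd", 1.0, 2.0 ** -60)))
```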

After rewriting the test library I still found some bugs in the code.

So I fixed a lot of those old bugs, especially various intel long double confusions: the padding bytes (12 or 16 bytes), and a special normalize bit at position 63, which is always 1 when a valid number was written to disk. When converting such a number to a format with an implicit leading bit, this bit must not be copied into the mantissa. The documentation for these formats was also wrong. And I added support for all missing major number formats to parrot: float, double, and long double in its various variants, which are FLOATTYPE_10 for intel, FLOATTYPE_16PPC for the powerpc double-double format, and finally FLOATTYPE_16 for IEEE-754 quadmath, i.e. __float128 or sparc64/s390 long double.
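As an illustration of the padding and of bit 63, here is a small Python sketch that decodes a little-endian intel long double into a Python float. The function name is made up and this is not the parrot converter. It assumes the 10 value bytes come first and the 2 or 6 padding bytes follow.

```python
import math

def decode_x87_extended(raw):
    """Decode an intel 80-bit long double stored in 10, 12 or 16 bytes."""
    if len(raw) not in (10, 12, 16):
        raise ValueError("unexpected storage size: %d" % len(raw))
    # Bytes 0..7: 64-bit significand, bit 63 is the explicit integer bit.
    significand = int.from_bytes(raw[0:8], "little")
    # Bytes 8..9: sign bit and 15-bit exponent. Anything after is padding.
    sign_exp = int.from_bytes(raw[8:10], "little")
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    exponent = sign_exp & 0x7FFF
    if exponent == 0x7FFF:
        fraction = significand & ((1 << 63) - 1)
        return math.nan if fraction else sign * math.inf
    if exponent == 0:
        exponent = 1        # zero and denormals use the minimum exponent
    try:
        # The integer bit is already part of the significand, so the
        # scale is 2**63 and not 2**64 with an implicit leading 1.
        return sign * math.ldexp(float(significand), exponent - 16383 - 63)
    except OverflowError:   # too large for a double
        return sign * math.inf

ONE = (1 << 63).to_bytes(8, "little") + (0x3FFF).to_bytes(2, "little")
print(decode_x87_extended(ONE + b"\xAA" * 2))   # 12-byte storage, 32-bit
print(decode_x87_extended(ONE + b"\xAA" * 6))   # 16-byte storage, 64-bit
```

The padding is filled with garbage here on purpose: a reader must not look at it.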

sparc64

The biggest obstacle to progress was always the lack of an UltraSparc to test the last number format. As it turns out, a simple darwin/ppc Powerbook G4 was enough to generate all needed formats, together with a normal Intel multilib linux. My colleague Erin Schoenhals gave me her old Powerbook for $100. The Powerbook could generate float, double and long double, which is really a 16ppc double-double, and gcc 4.6 could generate __float128, which is the same format as a 64-bit sparc long double.

Good enough tests

One important goal was a stable test suite. That means real errors should be found, invalid old .pbc files should be skipped (remember, pbc is not version compatible anymore), and numbers that differ only by the natural precision loss of a conversion should be compared intelligently. Interestingly, there is not even a good perl5 Test::More or Test::Builder numcmp method to compare floating point numbers to the needed precision. There is Test::Number::Delta on CPAN, but it was not good enough: it only uses some epsilon, not the number of valid precision digits, and the test is also not numerically stable enough. And really, number comparisons should be in the standard. I added a Test::Builder::numcmp method locally. It works on lines of strings, but could easily be changed to also take an arrayref or a single number.
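The idea of comparing by valid digits instead of by epsilon can be sketched like this in Python. This is not the Test::Builder::numcmp code, just a minimal stand-in with a made-up signature that takes two numbers and a digit count.

```python
import math

def numcmp(got, expected, digits):
    """True if both numbers agree when rounded to `digits` significant digits."""
    fmt = "%.*e"
    # %e with (digits - 1) decimals keeps exactly `digits` significant digits
    # and also gives stable spellings for inf, -inf and nan.
    return fmt % (digits - 1, got) == fmt % (digits - 1, expected)

# 123456789.0 after a trip through a 4-byte float is 123456792.0
print(numcmp(123456792.0, 123456789.0, 7))   # True
print(numcmp(123456792.0, 123456789.0, 9))   # False
print(numcmp(math.nan, math.nan, 7))         # True
print(numcmp(-0.0, 0.0, 7))                  # False
```

One known weakness of this simple form: two values that sit just on either side of a rounding boundary compare as different even though they are very close.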

Expected precision loss

So what is the expected precision loss when reading e.g. a float as an intel long double? A float claims to hold 7 digits without loss (FLT_DIG), so such a conversion should keep 7 digits of precision, and the test only needs to check the first 7 digits. The mantissa holds 24 bits, and log10(2**24) ≈ 7.225 decimal digits. So 123456789.0 stored as float and converted to long double needs to be compared with something like /^1234567\d*/ if done naively. It can be 123456755.0 or any other number between 123456700.0 and 123456799.4. Better to round at the last significant digit.
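The numbers in this paragraph are easy to reproduce. A small Python sketch, where the helper name is made up and a Python float stands in for the wider type:

```python
import math
import struct

def as_float32(x):
    """Store x as a 4-byte float and read it back into a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

digits = math.log10(2 ** 24)
stored = as_float32(123456789.0)

print("%.3f" % digits)   # 7.225
print("%.1f" % stored)   # 123456792.0
# Rounded to 7 significant digits both values print the same.
print("%.6e" % stored == "%.6e" % 123456789.0)   # True
```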

But first of all, what is the right precision to survive a number -> string -> number round trip? Numbers need to be sprintf-printed precisely enough and need to be restored from strings precisely enough. Printing more digits than supported will lead to imprecise numbers when they are read back, and so will printing too few digits. The C library provides various macros for this number: FLT_DIG=7, DBL_DIG=16, LDBL_DIG=18, FLT128_DIG=34. But better than trusting your C library vendor is a configure probe, now in auto::format. So parrot outsmarts perl5 now by testing for the best and most precise sprintf format to print numbers. As found experimentally, this number is usually one less than the advertised *_DIG definition: double uses %.15g, not %.16g, float uses %.6g, and so on. But this might vary with the CPU and C library used. Before, parrot used hardcoded magic numbers. And wrongly.
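How such a probe could work is sketched below in Python. This is not the auto::format code, which I am not reproducing here. It is one possible approach under my own assumptions: take a few decimal samples, round them to N digits, push them through the native type, print them with N digits again, and find the largest N for which nothing changes. The sample list and all names are made up for the illustration.

```python
import struct
from decimal import Context, Decimal

def as_float32(x):
    """Round a Python float (a C double) to single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

def as_float64(x):
    return x

# Hand-picked samples (an assumption of this sketch). The first one breaks
# at 16 digits for double, the second one at 7 digits for float.
SAMPLES = [Decimal("9007199254740993"), Decimal("8589973000"),
           Decimal("0.1"), Decimal("123456789")]

def survives(digits, narrow):
    """Do all samples survive string -> number -> string with `digits` digits?"""
    ctx = Context(prec=digits)
    for sample in SAMPLES:
        wanted = ctx.plus(sample)                # sample rounded to `digits` digits
        number = narrow(float(wanted))           # string -> native number
        printed = "%.*e" % (digits - 1, number)  # native number -> string
        if Decimal(printed) != wanted:
            return False
    return True

def best_digits(narrow, limit=40):
    """Largest digit count for which every smaller count also survives."""
    digits = 0
    while digits < limit and survives(digits + 1, narrow):
        digits += 1
    return digits

print(best_digits(as_float32), best_digits(as_float64))   # 6 15
```

A real probe would run against the C library's own sprintf and number parsing, and would want far more samples than these four.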

One might say, why bother? Simply stringify the number when exporting it. Everything is already supported in your C library. Two counter arguments:

  1. Fixed-width size. Native floats are easily stored in fixed-width records, strings are not. Accessing the x-th float on disk is significantly faster with a fixed size, and native floats are also significantly smaller than strings.

  2. Precision loss: With stringification you'll lose precision. In my configure probe I verified that we always lose the last digit. The previous code in imcc had this loss hardcoded, 15 instead of 16.
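Both arguments can be shown with a few lines of Python. This is just an illustration with made-up values and helper names, using an 8-byte double as the native type and the %.15g format mentioned above for the string form.

```python
import struct

values = [0.1 + 0.2, 1.0 / 3.0, 123456789.0]

# 1. Fixed-width records: the i-th double always sits at offset i * 8.
blob = struct.pack("<%dd" % len(values), *values)

def nth_double(blob, i):
    return struct.unpack_from("<d", blob, i * 8)[0]

# 2. Stringified records: variable length, and 15 digits drop the last bit(s).
text = ["%.15g" % v for v in values]
restored = [float(t) for t in text]

print(nth_double(blob, 0) == values[0])   # True, the bits are copied as-is
print(text[0], restored[0] == values[0])  # 0.3 False
print([len(t) for t in text])             # field widths differ per number
```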

Storage

parrot's Configure now also checks for the native floattype and its size. Before, a pbc header only recorded the size of a number; now the type is stored separately from the size. The size of long double can be 10, 12, or 16 and can mean completely different binary representations.
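Why the size alone is not enough can be shown with a small lookup table in Python. Only the names FLOATTYPE_10, FLOATTYPE_16PPC and FLOATTYPE_16 are taken from this post. The other two names and the table layout are assumptions made for this sketch and do not describe the real pbc header.

```python
# type name -> (possible storage sizes in bytes, representation)
# FLOATTYPE_4 and FLOATTYPE_8 are assumed names, see the note above.
FLOAT_TYPES = {
    "FLOATTYPE_4":     ((4,),     "IEEE-754 single"),
    "FLOATTYPE_8":     ((8,),     "IEEE-754 double"),
    "FLOATTYPE_10":    ((12, 16), "intel 80-bit long double plus padding"),
    "FLOATTYPE_16PPC": ((16,),    "ppc double-double"),
    "FLOATTYPE_16":    ((16,),    "IEEE-754 quad (__float128)"),
}

def candidates(size):
    """All float types that a header carrying only `size` could mean."""
    return sorted(name for name, (sizes, _) in FLOAT_TYPES.items()
                  if size in sizes)

print(candidates(8))    # one candidate, size is enough
print(candidates(16))   # three different representations share this size
```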

As the next improvement: parrot used to store the parrot version triple in the ops library header inside the pbc format. But whenever an ops library changes, it is a different version number that needs to change: the PBC_COMPAT version number, or simply the bytecode version. This needs to be done for format changes and for changes to the native ops, because parrot stores and accesses ops only by index, not by name, and re-sorts its ops on every change. This was my main criticism when I left parrot with 1.0, because it was never designed to work this way. Old ops should be readable by newer parrots; only newer ops cannot be understood by older ones. So new ops need to be added to the end.
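The index problem is easy to see with a toy op table in Python. The op names are invented for this example and have nothing to do with the real parrot op list.

```python
old_ops = ["add", "branch", "print"]     # op table of the writing parrot
bytecode = [old_ops.index("print")]      # the file stores only the index: 2

appended = old_ops + ["band"]            # new op added at the end
resorted = sorted(old_ops + ["band"])    # new op added, table sorted again

print(appended[bytecode[0]])   # print  -> old file still means the same
print(resorted[bytecode[0]])   # branch -> old file now runs the wrong op
```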

So now the bytecode version is stored in the ops library header, and newer parrot versions with the same bytecode version can still read old pbc files. Older bytecode versions cannot be read yet, as that requires reverting the policy change from v1.0 back to the pre-v1.0 policy.

mk_native_pbc

The script to generate the native pbc files on every PBC_COMPAT change was pretty immature. I wrote it several years ago. I rewrote it, still as a shell script, but removed all bashisms and enabled generating and testing all supported floating point formats in one go with custom perl Configure options: tools/dev/mk_native_pbc [--my-config-options...]. When called as tools/dev/mk_native_pbc --noconf, it just generates and tests the current configuration.

Tests again

As it turns out, the tested numbers were also horrible. Someone took the easy way and tested only some exponents, while the mantissas were always plain zeros. Numbers can be signed (there's one to two sign bits in the format), there can be -0.0, -Inf, Inf and NaN, and the mantissa is sometimes tricky to convert between the various formats. The new number test now has some such uncommon numbers to actually test the converters and the expected precision loss.
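What such a list of uncommon numbers could look like, and why a plain == is not enough to check them, is sketched here in Python. The values and helper names are my own picks for this example, not the ones in the parrot test files.

```python
import math
import struct

TRICKY = [0.0, -0.0, math.inf, -math.inf, math.nan,
          1.0 / 3.0, -123456789.0, 2.0 ** -20 + 2.0 ** -30]

def as_float32(x):
    """Narrow to a 4-byte float and widen back to a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

def same_number(a, b):
    """Equality that treats NaN as equal to NaN and keeps the sign of zero."""
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return a == b and math.copysign(1.0, a) == math.copysign(1.0, b)

for value in TRICKY:
    narrowed = as_float32(value)
    # Narrowing a second time must not change anything any more.
    print(repr(value), repr(narrowed), same_number(as_float32(narrowed), narrowed))
```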

Too much?

With the 5 types - 4 (float), 8 (double), 10 (intel long double), 16ppc, and 16 (float128) - and little<->big endian, there is a combinatorial explosion in the number of converters. So I removed 50% of them by converting endianness beforehand. Some of the easy conversions are best done by compiler casts whenever the compiler supports both formats, and 16ppc conversions are pretty trivial, so only a few tricky conversions are left, mainly with the intel long double format. The 5*4 converters are still linked as function pointers, assigned at startup time. So it's maintainable and fast.
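The structure of this can be sketched in Python with just the two simplest types, float and double. All names here are made up, and a dict of functions stands in for the C function pointers. The point is only the shape: swap bytes first, then pick one converter per (source, target) pair once, up front.

```python
import struct

def to_little_endian(raw, big_endian):
    """Normalize byte order first, so converters only see one endianness."""
    return raw[::-1] if big_endian else raw

def identity(raw):
    return raw

def f4_to_f8(raw):
    return struct.pack("<d", struct.unpack("<f", raw)[0])

def f8_to_f4(raw):
    # No overflow handling in this sketch.
    return struct.pack("<f", struct.unpack("<d", raw)[0])

# (source size, target size) -> converter. Only 4 and 8 are covered here.
CONVERTERS = {(4, 4): identity, (4, 8): f4_to_f8,
              (8, 4): f8_to_f4, (8, 8): identity}

def make_reader(src_size, src_big_endian, dst_size):
    convert = CONVERTERS[(src_size, dst_size)]   # resolved once, at "startup"
    def read(raw):
        return convert(to_little_endian(raw, src_big_endian))
    return read

# A big-endian float file read on a little-endian double platform.
read_number = make_reader(4, True, 8)
print(struct.unpack("<d", read_number(struct.pack(">f", 1.5)))[0])   # 1.5
```

Reversing all bytes is only correct for the plain IEEE types used here. The padded intel format and the two-part ppc format need more care.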

Optimizations

More optimizations were done by using more than single-byte operations, such as builtin native bswap operations (also a new probe), and int16_t, int32_t and int64_t copy and compare ops. perl5 is, by the way, also pretty unoptimized in this regard. Lots of unaligned single-byte accesses. The worst of all scripting languages as measured by AddressSanitizer. A typical register is 32 or 64 bits wide, and the whole width should be used whenever possible. For a start, the perl5 hash function is only fast on 32-bit CPUs. Fast checks could trade size for speed instead of bitmasking every single bit. Maybe combine the most needed bits into an aligned short. But as long as there are unhandled really big optimization goals (functions, method calls, types, const), these micro optimizations just stay in my head.
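What a whole-word byte swap does, as opposed to copying byte by byte, can be written down in Python as below. This only shows the arithmetic. In C one would use a compiler builtin where the probe finds one, for example GCC's __builtin_bswap32 and __builtin_bswap64.

```python
def bswap32(x):
    """Reverse the 4 bytes of a 32-bit word with shifts and masks."""
    x &= 0xFFFFFFFF
    return (((x & 0x000000FF) << 24) | ((x & 0x0000FF00) << 8) |
            ((x & 0x00FF0000) >> 8) | ((x & 0xFF000000) >> 24))

def bswap64(x):
    """Reverse the 8 bytes of a 64-bit word using two 32-bit swaps."""
    x &= 0xFFFFFFFFFFFFFFFF
    return (bswap32(x & 0xFFFFFFFF) << 32) | bswap32(x >> 32)

print(hex(bswap32(0x11223344)))           # 0x44332211
print(hex(bswap64(0x0102030405060708)))   # 0x807060504030201
```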

Code on https://github.com/parrot/parrot/commits/native_pbc2

In a followup post I'll explain to the general community how to read binary representations of numbers. Reading foreign floats would even deserve a new C library.

About Reini Urban

Working at cPanel on B::C (the perl-compiler), p2, types, parrot, B::Generate, cygwin perl and more guts (LLVM, jit, optimizations).