native_pbc in parrot revived (numbers part1)
The design for parrot, the vm (virtual machine) under rakudo (perl6), envisioned a platform- and version-compatible, fast, binary format for scripts and modules. Something perl5 was missing. Well, .pbc and .pmc from ByteLoader serve this purpose, but since ByteLoader uses source filters it is not that fast.
Having a binary and platform-independent compiled format can skip the parsing, compiling and optimizing steps each time a script or module is loaded.
Version compatibility was broken with the 1.0 parrot release, which is why I left the project in protest a few years ago. Platform compatibility is still a goal, but was seriously broken, because the tests were disabled and nobody cared.
Since I have to wait in perl5 land until p5p can discuss and decide on a syntax for the upcoming improvements, which I can then implement in the type optimizers and the static B::CC compiler, I went back to parrot. p5p needs a few years to understand the most basic performance issues first. The basic obstacles in parrot were gone: parrot is almost bug-free and has most features rakudo needs, but is lacking performance.
So I tried to enable platform compatibility again. I wrote and fixed most of the native_pbc code several years ago, up until 1.0, and only a little bit of bitrot had crept in. Platform-compatible means that any platform can write a pbc and any other platform should be able to read this format. Normally such a goal would require a fixed format; not so .pbc. The pbc format is optimized for native reads and writes, so all integers, pointers and numbers are stored in native format, and when you open such a file on a different platform, converters try to read those three types. Integers and pointers can be 4 or 8 bytes, little- or big-endian. This is pretty easy to support.
The problem comes with numbers. Supported until now were double and the intel-specific long double format. The main problem is that the intel long double format is a tricky and pretty non-standard format. It has 80 bits, which is 10 bytes, but the numbers are stored with padding bytes: 12 bytes on 32-bit and 16 bytes on 64-bit, i.e. 2 or 6 padding bytes. Here Intel got padding right, but in the normal compiler ABI Intel is the only processor that does not care about alignment, which leads to slow code and countless alignment problems with fast SSE integer code. Most other processors require stricter alignment to be able to access ints and numbers faster. Intel code is also not easy to port to those stricter processors, because they fail on unaligned access. You cannot just access every single byte in a stream at will. At least you should not.
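The size/precision mismatch can be seen with a tiny probe. A minimal sketch (the values quoted in the comments assume an x86 gcc/clang target; the helper names are mine):

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Minimal probe for the x86 extended format: the number itself has a
   64-bit mantissa (LDBL_MANT_DIG == 64 on x86), yet sizeof(long double)
   is 12 on the 32-bit ABI and 16 on the 64-bit ABI. The extra 2 or 6
   bytes are padding and never part of the value. */
static size_t long_double_size(void)      { return sizeof(long double); }
static int    long_double_mant_bits(void) { return LDBL_MANT_DIG; }
```

On non-x86 targets these return different combinations (e.g. 16 bytes with a 113-bit mantissa for IEEE quad), which is exactly why a pbc reader must know the float type, not just its size.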
As it turns out, sparc64 and s390 (AIX) use for long double the IEEE-754 standard quad 16-byte binary format, which is the best we can get so far; GCC since 4.6 supports the same format as __float128 (via the quadmath library); and finally powerpc has its own third long double with -mlong-double-128, which is two normal 8-byte doubles one after another, the value being the sum of the two, "head" and "tail". It's commonly called ppc "double-double". For smaller devices the typical format is single float, 4 bytes, thankfully in IEEE-754 standard format. All compilers can at least read and write it. But when it comes to accessing ... arguments from functions with va_arg(), gcc fails to accept float, since floats are promoted to double in varargs.
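The va_arg() point is a C-standard wrinkle rather than a compiler quirk: float arguments passed through "..." undergo default argument promotion to double, so they must be fetched as double. A small illustration (the function name is mine):

```c
#include <assert.h>
#include <stdarg.h>

/* Default argument promotion: a float passed through "..." arrives as
   a double, so va_arg(ap, float) would be undefined behavior. */
static double first_vararg(int count, ...) {
    va_list ap;
    double d;
    va_start(ap, count);
    d = va_arg(ap, double);   /* must read double, not float */
    va_end(ap);
    return d;
}
```

Calling first_vararg(1, 1.5f) yields 1.5 exactly, since 1.5 is representable in both float and double.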
So after rewriting the test library I still found some bugs in the code. I fixed a lot of those old bugs, esp. various intel long double confusions: the padding bytes (12 or 16 bytes), and a special normalization bit at position 63, which is always 1 when a valid number was written to disc; when reading such a number back, this bit is not part of the mantissa. The documentation for these formats was also wrong. And I added support for all missing major number formats to parrot, i.e. long double in its various variants: FLOATTYPE_10 for intel, FLOATTYPE_16PPC for the powerpc double-double format, and finally FLOATTYPE_16 for IEEE-754 quadmath, i.e. __float128 or sparc64/s390 long double.
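That explicit bit 63 can be inspected directly. A sketch assuming a little-endian x86 layout (bytes 0..7 mantissa, bytes 8..9 sign and exponent); the helper name is mine, not parrot's:

```c
#include <assert.h>
#include <float.h>
#include <string.h>

/* Return the explicit "integer" bit (bit 63) of an x86 80-bit long
   double. Unlike IEEE double/quad, this bit is stored rather than
   implied: it is 1 for every normalized value written to disc, and a
   converter must skip it when reassembling the mantissa for other
   formats. Assumes little-endian x86: bit 63 = top bit of byte 7. */
static int integer_bit(long double x) {
    unsigned char b[sizeof(long double)];
    memcpy(b, &x, sizeof b);
    return (b[7] >> 7) & 1;
}
```

On platforms whose long double is not the 80-bit format the byte offset is meaningless, which again illustrates why the converters dispatch on the float type.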
The biggest obstacle for progress was always the lack of an UltraSparc to test the last number format. As it turns out, a simple darwin/ppc Powerbook G4 was enough to generate all needed formats, together with a normal Intel multilib linux. My colleague Erin Schoenhals gave me her old powerbook for $100. The Powerbook could generate float, double and long double, which is really a 16ppc double-double, and gcc 4.6 could generate __float128, which is the same format as a 64-bit sparc long double.
Good enough tests
One important goal was a stable test suite: real errors should be found, invalid old .pbc files should be skipped (remember, pbc is not version-compatible anymore), and numbers differing only by the natural precision loss of a conversion should be compared intelligently. Interestingly, there does not even exist a good perl5 Test::More or Test::Builder numcmp method to compare floating-point numbers to a given precision. There is a Test::Number::Delta on CPAN, but it was not good enough: it only uses some epsilon, not the number of valid precision digits, and its test is also numerically not stable enough. And really, number comparisons should be in the standard. I added a Test::Builder::numcmp method locally. It works on lines of strings, but could easily be changed to also take an arrayref or a single number.
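The idea of comparing only significant digits, rather than an absolute epsilon, can be sketched in a few lines. This is a hypothetical stand-in, not parrot's actual test code, and rounding exactly at a digit boundary can still flip the last compared digit:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compare two numbers on their first `digits` significant decimal
   digits by formatting both in scientific notation. Unlike a fixed
   epsilon (as in Test::Number::Delta), this follows the magnitude of
   the values automatically. */
static int numcmp(double a, double b, int digits) {
    char sa[64], sb[64];
    snprintf(sa, sizeof sa, "%.*e", digits - 1, a);
    snprintf(sb, sizeof sb, "%.*e", digits - 1, b);
    return strcmp(sa, sb) == 0;
}
```

With 7 digits, 123456755.0 and 123456799.4 compare equal, while 1.0 and 2.0 do not.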
Expected precision loss
So what is the expected precision loss when reading e.g. a float with intel long double? A float claims to hold 7 digits without loss, FLT_DIG, so such a conversion should keep 7 digits of precision, and the test only needs to check the first 7 digits. The precision holds 24 mantissa bits, and log10(2**24) ≈ 7.225 decimal digits. So 123456789.0 stored as float and converted to long double needs to be compared with something like /^1234567\d*/ if done naively. It can come back as 123456755.0 or any other value like 123456799.4. Better to round the last significant digit.
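A concrete round trip as a sketch: 123456792 is the nearest representable float to 123456789, since floats at that magnitude are spaced 8 apart (helper name is mine):

```c
#include <assert.h>

/* Store a double as float and widen it back: the narrowing step rounds
   to 24 mantissa bits, the widening step is exact, so only the first
   ~7 decimal digits survive. */
static double float_roundtrip(double x) {
    float f = (float)x;             /* rounds to the nearest float */
    return (double)(long double)f;  /* widening loses nothing */
}
```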
But first of all, what is the right precision to survive a number -> string -> number round trip? Numbers need to be sprintf-printed precisely enough and need to be restored from strings precisely enough. Printing more digits than supported will lead to imprecise numbers when read back, and the same happens when printing too few digits. The C library defines macros for this number: FLT_DIG, DBL_DIG and LDBL_DIG. But better than trusting your C library vendor is a configure probe, which parrot now has. So parrot outsmarts perl5 now by testing for the best and most precise sprintf format to print numbers. As experimentally found out, this number is usually one off from the advertised *_DIG definition: double uses %.16g, and so on for float and long double. But this might vary with the CPU and C library used. Before, parrot used hardcoded magic numbers. And wrongly.
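Such a probe can be sketched as a round-trip loop. This is a simplified configure-style check under my own sample values, not parrot's actual Configure code:

```c
#include <stdio.h>
#include <stdlib.h>

/* Find the smallest %.*g precision that survives a
   double -> string -> double round trip for some awkward values. */
static int probe_double_digits(void) {
    const double samples[] = { 1.0/3.0, 2.0/3.0, 0.1, 1e-101, 123456789.0 };
    const int n = sizeof samples / sizeof samples[0];
    int prec, i;
    for (prec = 6; prec <= 21; prec++) {
        int ok = 1;
        for (i = 0; i < n; i++) {
            char buf[64];
            snprintf(buf, sizeof buf, "%.*g", prec, samples[i]);
            if (strtod(buf, NULL) != samples[i]) { ok = 0; break; }
        }
        if (ok) return prec;   /* the precision worth hardwiring */
    }
    return -1;
}
```

On an IEEE-754 double this lands between DBL_DIG (15) and the guaranteed round-trip precision of 17 digits.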
One might say: why bother? Simply stringify the numbers when exporting them; everything is already supported in your C library. Two counter-arguments:
Fixed-width size: native floats are easily stored in fixed-width records, strings are not. Accessing the x-th float on disc is significantly faster with a fixed size, and native floats are also significantly smaller than strings.
Precision loss: with stringification you'll lose precision. In my configure probe I verified that we always lose the last digit. The previous code in imcc had this loss hardcoded: 15 digits instead of 16.
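The fixed-width argument is plain offset arithmetic. A small sketch, assuming a file of raw native doubles back to back (helper name is mine):

```c
#include <assert.h>
#include <stdio.h>

/* With fixed-width native doubles the i-th value sits at a computable
   offset: one fseek, one fread. Stringified numbers would need a
   linear scan for separators. */
static double nth_double(FILE *fp, long i) {
    double d = 0.0;
    fseek(fp, i * (long)sizeof(double), SEEK_SET);
    if (fread(&d, sizeof d, 1, fp) != 1)
        d = 0.0;   /* out of range: report zero in this sketch */
    return d;
}
```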
parrot's Configure now also checks for the native floattype and its size. Before, a pbc header only recorded the size of a number; now the type is distinct from the size. The size of long double can be 10, 12 or 16 bytes, and the same size can mean completely different binary representations.
As the next improvement: parrot used to store the parrot version triple in the ops library header inside the pbc format. But whenever an ops library changed, the other version number needed to be changed as well: the PBC_COMPAT version number, or simply the bytecode version. This needs to be done for format changes and for changes of native ops, because parrot stores and accesses ops only by index, not by name, and re-sorts its ops on every change. This was my main criticism when I left parrot at 1.0, because it was never designed this way: old ops should be readable by newer parrots, it is only newer ops that cannot be understood by older parrots. So new ops need to be added to the end.
So now the bytecode version is stored in the ops library header, and newer parrot versions with the same bytecode version can still read old pbc files. Older bytecode versions cannot be read yet, as that requires reverting the policy change from v1.0 back to the pre-v1.0 behavior.
The script to generate the native pbc test files was pretty immature; I wrote it several years ago. I rewrote it, still as a shell script, but removed all bashisms, and enabled generating and testing all supported floating-point formats in one go with a custom perl, or just generating and testing the current configuration.
As it turns out, the tested numbers were also horrible. Someone went the easy way and tested only some exponents in the numbers; the mantissas were always blank zeros. Numbers can be signed (there are one to two sign bits in the format), and the mantissa is sometimes tricky to convert between the various formats. The new number test now has some such uncommon numbers to actually exercise the converters and the expected precision loss.
With the 5 types - 4 (float), 8 (double), 10 (intel long double), 16ppc, and 16 (float128) - and little<->big endian, there is a combinatorial explosion in the number of converters. So I removed 50% of them by converting endianness beforehand; some of the easy conversions are best done by compiler casts whenever the compiler supports both formats; 16ppc conversions are pretty trivial to do; so there are only a few tricky conversions left, mainly with the intel long double format. The 5*4 converters are still linked as function pointers, assigned at startup time. So it's maintainable and fast.
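The dispatch scheme can be sketched like this; the names are illustrative, not parrot's actual identifiers:

```c
#include <assert.h>
#include <string.h>

/* After a separate endianness pass, one converter per source format is
   picked once at startup and then called through a function pointer
   for every number read from the pbc. */
typedef enum { F_4, F_8, F_10, F_16PPC, F_16, F_COUNT } floattype;

typedef void (*fetch_num)(double *dst, const unsigned char *src);

static void fetch_8(double *dst, const unsigned char *src) {
    memcpy(dst, src, 8);              /* same format: plain copy */
}
static void fetch_4(double *dst, const unsigned char *src) {
    float f;
    memcpy(&f, src, 4);
    *dst = (double)f;                 /* easy case: a compiler cast */
}

static fetch_num fetch_tab[F_COUNT];  /* indexed by the pbc's floattype */

static void init_fetch(void) {
    fetch_tab[F_4] = fetch_4;
    fetch_tab[F_8] = fetch_8;
    /* F_10, F_16PPC and F_16 need real bit-level converters */
}
```

The sketch assumes the native number type is double; the point is the startup-time assignment, which keeps the per-read path to one indirect call.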
More optimizations were done by using more than single-byte operations, such as builtin native bswap operations (also a new probe), and int16_t, int32_t and int64_t copy and compare ops. perl5 is btw. also pretty unoptimized in this regard: lots of unaligned single-byte accesses, the worst of all scripting languages measured. A typical register is 32 or 64 bits wide, and the whole width should be used whenever possible. For a start, the perl5 hash function is only fast on 32-bit CPUs. Flag checks could trade size for speed instead of bitmasking every single bit; maybe combine the most-needed bits into an aligned short. But as long as there are really big unhandled optimization goals (functions, method calls, types, const), these micro-optimizations just stay in my head.
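The bswap probe boils down to preferring a single native op over a per-byte loop. A sketch of the fallback logic (the intrinsic is gcc/clang-specific; other compilers get the shift version):

```c
#include <assert.h>
#include <stdint.h>

/* Swap a 32-bit word in one native op where available instead of four
   single-byte loads and stores. */
static uint32_t swap32(uint32_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(v);
#else
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
#endif
}
```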
In a followup post I'll explain, for the general community, how to read binary representations of numbers. Reading foreign floats would even deserve a new C library.