Memory savings with -fcow

B::C now has better support for copy-on-write (COW) strings, with about 6% memory savings for 5.20 and 5.22.

The perl5.18 implementation of COW strings is totally broken, as it keeps the COW REFCNT field within the string buffer itself. With that design you can never arrive at a true copy-on-write scheme. You cannot put the string into the .rodata segment as with static const char* pv = "foo"; it needs to be emitted as static char* pv = "foo\000\001";. The byte behind the NUL delimiter is used as the REFCNT byte, which prohibits its use in multi-threading or embedded scenarios. In cperl I was working on moving this counter to an extra field, but the 2 authors made it impossible to write it in a maintainable way. I could easily separate the refcnt flag, but I couldn't make it COW yet.
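To make the layout problem concrete, here is a minimal C sketch of such a buffer. It is an illustration under assumptions, not the perl5 source: the names cow_pv, cow_refcnt and cow_share are hypothetical, and the buffer is declared as an array rather than a pointer to a literal so that writing to it is well-defined C.

```c
#include <assert.h>
#include <string.h>

/* Illustration only: hypothetical names, not the perl5 source.
 *
 * A plain literal such as
 *     static const char *pv = "foo";
 * can live in .rodata because nobody ever writes to it.
 */

/* 5.18-style COW buffer: payload "foo", the NUL delimiter, then one
 * byte for the share count. It must stay writable, so the linker
 * places it in .data instead of .rodata. */
char cow_pv[] = "foo\000\001";

/* The count byte sits directly behind the NUL delimiter. */
unsigned char *cow_refcnt(char *pv)
{
    return (unsigned char *)pv + strlen(pv) + 1;
}

/* Sharing the buffer writes into the buffer itself. */
char *cow_share(char *pv)
{
    unsigned char *rc = cow_refcnt(pv);
    assert(*rc < 255);   /* a single byte caps the number of sharers */
    ++*rc;
    return pv;
}
```

Every call to cow_share() modifies the string's own memory, which is exactly why the buffer cannot be const and why unsynchronized sharing between threads is a problem.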

But even if the COW implementation in the libperl run-time is broken by design, it can still be put to good use to store more strings statically than expected. The problem was that since 5.18, with this COW feature, binaries needed 20% more memory, as I couldn't save the strings statically anymore and had to allocate them dynamically.

In my first attempt I saved some kilobytes of memory by removing the IsCOW flag and storing more strings statically.

But now I do the opposite. Since 5.20, and with -O2, I set the IsCOW flag on many more strings, store them as non-const char* so that the COW refcnt can be updated, and rely on the automatic cow and uncow functions in the runtime to move the static buffer to the heap when it is written to. I no longer need to rely on LEN=0, which indicates a normal static string.
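The "uncow on write" step can be sketched in C roughly as follows. This is a simplified model under stated assumptions, not the libperl code: the Str struct, static_foo, str_uncow and str_set_char are hypothetical names, and a real SV body carries more fields (LEN, flags, and so on).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string head; perl's real SV body has more fields. */
typedef struct {
    char  *pv;      /* buffer: payload, NUL, then the COW count byte */
    size_t cur;     /* payload length without the NUL */
    int    is_cow;  /* buffer may be static or shared: copy before writing */
} Str;

/* Buffer as the compiler would emit it: static, but writable,
 * so that the count byte behind the NUL can be updated. */
char static_foo[] = "foo\000\001";

/* Give s a private heap copy and drop its reference to the old buffer.
 * The old buffer is never passed to free(): it may be static. */
int str_uncow(Str *s)
{
    char *heap;
    unsigned char *rc;

    if (!s->is_cow)
        return 0;
    heap = malloc(s->cur + 1);
    if (!heap)
        return -1;
    memcpy(heap, s->pv, s->cur + 1);          /* payload plus NUL */
    rc = (unsigned char *)s->pv + s->cur + 1; /* count byte of the old buffer */
    if (*rc > 0)
        --*rc;
    s->pv = heap;
    s->is_cow = 0;
    return 0;
}

/* Every write goes through str_uncow() first. */
int str_set_char(Str *s, size_t i, char c)
{
    if (i >= s->cur || str_uncow(s) != 0)
        return -1;
    s->pv[i] = c;
    return 0;
}
```

As long as nothing writes to the string, it costs no heap memory at all; the copy happens only on the first write.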

With a typical example of a medium sized module, Net::DNS::Resolver, 64bit not threaded, the memory usage is now as follows:

5.22:

pcc -O0 -S -e'use Net::DNS::Resolver; my $res = Net::DNS::Resolver->new;
  $res->send("www.google.com"); print `ps -p $$ -O rss,vsz`'
pcc -O3 -S -e'use Net::DNS::Resolver; my $res = Net::DNS::Resolver->new;
  $res->send("www.google.com"); print `ps -p $$ -O rss,vsz`'

               rss
without -fcow: 12832
with -fcow   : 12112
cperl        : 12532

A 6% memory win for 5.22. Even better than with cperl.

The current distribution of .rodata, .data and dynamic heap strings with this example is as follows:

                 .rodata  .data  heap
-fno-cow (-O0):  305      1945   1435
-fcow (-O3):     110      2225   1024
cperl -O3:       107      2112   1001

Thus with -O3 we got about 29% fewer dynamic strings (1435 down to 1024; put the other way, -O0 has 40% more) at the price of nearly 3x fewer .rodata strings and 14% more static .data strings. With cperl the improvements are not so dramatic, as cperl already has many more static optimizations.

Memory savings with cperl and AvSTATIC

B::C and cperl now have proper support for copy-on-grow (COG) and copy-on-write (COW) arrays.

COG means that the array of SV* pointers is allocated by the compiler statically, not dynamically, and that the cperl runtime creates a new array whenever the array is extended (copy-on-grow).

COW means that the array of SV* pointers is allocated by the compiler as constant and static in the .rodata segment, and that the cperl runtime creates a new array whenever an element of the array is changed (copy-on-write).
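The difference between the two modes can be sketched in C like this. It is a simplified model, not cperl's AV implementation: the Array struct, the AvKind enum and the array_* functions are hypothetical names chosen for illustration, and elements are plain void* instead of SV*.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model; cperl's real AV layout and flags differ. */
typedef enum { AV_HEAP, AV_COG, AV_COW } AvKind;

typedef struct {
    void  **alloc;  /* element pointers */
    size_t  fill;   /* elements in use */
    size_t  max;    /* capacity */
    AvKind  kind;   /* AV_COG: static buffer; AV_COW: static and const */
} Array;

/* Copy a compiler-emitted buffer to the heap. The static buffer
 * itself is never freed. */
int array_unshare(Array *av, size_t new_max)
{
    void **heap;

    if (new_max < av->fill) new_max = av->fill;
    if (new_max == 0)       new_max = 1;
    heap = malloc(new_max * sizeof *heap);
    if (!heap)
        return -1;
    memcpy(heap, av->alloc, av->fill * sizeof *heap);
    av->alloc = heap;
    av->max   = new_max;
    av->kind  = AV_HEAP;
    return 0;
}

/* Store: only COW arrays copy; COG arrays are written in place. */
int array_store(Array *av, size_t i, void *sv)
{
    if (i >= av->fill)
        return -1;
    if (av->kind == AV_COW && array_unshare(av, av->max) != 0)
        return -1;
    av->alloc[i] = sv;
    return 0;
}

/* Push: COW always copies, COG copies once the static buffer is full. */
int array_push(Array *av, void *sv)
{
    if (av->kind == AV_COW || av->fill == av->max) {
        size_t new_max = av->max * 2 + 1;
        if (av->kind != AV_HEAP) {
            if (array_unshare(av, new_max) != 0)
                return -1;
        } else {
            void **heap = realloc(av->alloc, new_max * sizeof *heap);
            if (!heap)
                return -1;
            av->alloc = heap;
            av->max   = new_max;
        }
    }
    av->alloc[av->fill++] = sv;
    return 0;
}
```

In this model a COG array pays for the copy only when it grows, while a COW array pays on the first write of any kind, which is why COW is only worthwhile for arrays that are almost never modified.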

With a typical example of a medium sized module, Net::DNS::Resolver, the memory usage is as follows:

pcc -O0 -S -e'use Net::DNS::Resolver; my $res = Net::DNS::Resolver->new;
  $res->send("www.google.com"); print `ps -p $$ -O rss,vsz`'

            rss
with avcow: 12720
without   : 13456

A 5.8% win.

The numbers with a small example are as follows:

                        rss  vsz
cperl5.22.2-nt-avcow    2536 2438744
                   -O3  2532 2438740
cperl5.22.2d-nt-avcog   3516 2451728
perl5.22.1-nt           3316 2438912
perl5.20.3-nt           3264 2438696
perl5.18.2-nt           3036 2438468
perl5.18.4d             4276 2450540
perl5.18.4d-nt          4120 2451332
perl5.16.3              4072 2458904
perl5.16.3-nt           3008 2438420
perl5.14.4              3168 2447764
perl5.14.4-nt           2944 2447540
perl5.14.4-nt -O3       2852 2447472
perl5.12.5              3440 2449964
perl5.12.5-nt           3244 2447716
perl5.10.1-nt           3172 2456836
perl5.8.9               3176 2465976
perl5.8.9d-nt           3096 2438400
perl5.8.5d-nt           3228 2456836
perl5.8.4d-nt           3176 2457792

Here you see that the previously useful perl version perl5.14.4-nt with 2852 kB is now finally made obsolete by cperl with an RSS of 2532 kB.

5.16 introduced binary symbols, and 5.18 added a completely broken implementation of COW strings, which forced all previously statically allocated strings to be allocated dynamically. This caused a 20% memory increase in 5.22, which we could only overcome with cperl, and some tricks in the compiler to disable COW strings at all.

Theoretically I can set all arrays as COW to get the biggest memory win, but at run-time every write then needs to copy the array to the heap, which is a performance and memory loss. So I only COW the arrays which are very unlikely to be changed at all, i.e. all @ISA arrays, @INC and all READONLY arrays.

The current distribution with this example is as follows:

24 COW arrays of size 1, 2x 2, 1x 3, 1x 9. 28 COW arrays in total.

11 COG arrays of size 1, overall 90 COG array sizes with max 169 elements, 89 COG arrays in total.

1338 arrays and 16562 SVs in total.

I haven't measured the hit and miss rate yet, and I haven't fixed COW or COG for other data types, such as strings or hashes. A big improvement would be proper COW or COG for strings of course, with an expected memory win of 10-20%.

Fixed 5.22 problems during my compiler port

I uncovered and fixed many 5.22 problems with cperl already, but in the last months I was busy porting the 3 compilers B::C, B::CC and B::Bytecode to 5.22.

As I said in my interview, it's my belief that if all current p5p core committers would stop committing their bad code it would actually be the best for the perl5 project. They weren't able to implement any of the already properly designed features from perl6 in the last 12 years, and every feature they did implement is just so horribly bad, making our already bad code base, which led to the reimplementation efforts of perl6/parrot with a better core, even worse. With cperl I can only undo a little, but when they start breaking the API and planned features in an incompatible way they should just stop.

Nevertheless, 5.22 added a significant improvement from outside, syber's monomorphic inline caching for method calls besides the internal improvement of multideref by Dave Mitchell.

Now to the problems I had to fix in the last months with that 5.22.0 release:

1. Father broke ByteLoader

cperl #75 perl-compiler

This is something I cannot fix in the compiler. I updated my perl patcher App::perlall with new --patches=Compiler patches to fix this, and cperl of course also has this fix.

I had to write a complicated probe mechanism for ByteLoader to check if the used perl5.22 version is already patched or not. Probing a to-be-built XS submodule is not that easy. A typical chicken and egg problem. I could use my already existing B::C::Flags helper config, which allows custom compiler settings. There I initialize the variable $B::C::Flags::have_byteloader with undef, and when the XS modules are all built I call a helper script to probe for a working ByteLoader, and patch $B::C::Flags::have_byteloader to 0 or 1. I can use this then in the tests to skip or run the bytecode tests. And I had to put this helper script into the hints directory to skip it from being installed. Messing with EUMM libscan() was too dirty for me.

The internal compiler op.c creates a new main or eval environment with newPROG(), setting the entry points PL_main_start and PL_main_root from the intermediate parsed PL_compcv. In the case of an empty source the parser always adds a final ; semicolon, which leads to an empty optree starting with OP_STUB.

But with commit 34b5495 for [perl #77452] the compiler now always adds a LINESEQ in front of the STUB, while the logic in newPROG for source filters which have already set up PL_main_start and PL_main_root was not changed, which led to a broken ByteLoader.

This is an interesting commit as it added a lot of wrong comments about the inner working of this, but didn't update the logic.

The fix in cperl is here and for perlall here, and my perlbug report did not get through.

I can only guess that p5p blocked me again, because they didn't like me to call them incompetent. Blocking bug reports and fixes is worse than just incompetence, but I got used to that recently. They blocked my simple fix for the horrific double-readonly system, and they proudly announced last week some new optimization regarding faster arithmetic, but didn't have a look into my fast arithmetic optimizations which I wrote half a year ago, and which makes them look very bad in the end. Everybody applauded poor Dave for this "fantastic breakthrough". The guys are really that simple. Looking through my improvements would have wasted less time and would have improved it upstream by 30% not just 10%.

2. Dave couldn't implement multideref access for the compiler

cperl #76 perl-compiler #341

Multideref merges sequential hash or array access into one compressed op. This is a pretty good compiler optimization, if the B design would not be so bad.

The upstream design of the new 5.22 B::UNOP_AUX::aux_list method deviates significantly from proper B design. aux_list requires the curcv to be provided, which is not trivial to do for a B module, and it needs this to resolve shared SVs beforehand. Requiring the curcv to resolve the padoffset is unneeded and does not help B and any of its clients. Clients need the padoffset and resolving it e.g. in B::Deparse is to be done in B as with all other threaded and shared SV accessing methods.

Thankfully I can patch most of the B bugs by myself, and don't have to fork it publicly under a worse name. B is already a good enough name, and I don't want to deviate from that, even if p5p consistently refused to maintain B properly in the last years. There was a short time a few years ago where I could work without a patched B, but this period did not last long, and none of my fixes were applied, while other new horrific mistakes made it in.

3. Missing HV::ENAMES api

perl-compiler commit ++

Stashes can be aliased to separate namespaces, and the ENAMES API to access these names never made it into B, and thus never into a compiler. Namespace aliases are rather rare, so this did not cause too much trouble, but now I added ENAMES and could thereby fix most of the remaining compiler limitations, even for 5.14.

4. Missing PADNAME B api

I explained that technically in my interview. Currently we limit the max name length of lexical variables to 60, because we statically allocate the buffers for them. It is not a practical problem, and I'll optimize that sooner or later to smaller static structs.

5. Fixed HEK assertions

HEKs (shared hash keys) are still dynamic, not static, but I could fix the remaining refcount issues at least.

The cperl code to support static HEKs is already there, but I still need to add compiler code and probes to support that.

6. Broken B::RV->FLAGS for GVOP_gv -> CVREF

5.22 has a wrong RV->FLAGS for a GVOP_gv pointing to a CVREF gv(cv ref:). It returns the flags for a GV (0x808009) where it should be just 0x801, a ROK RV. This is suddenly broken in 5.22 because it's a new optimization they did, and of course wrong.

cperl #63

I haven't fixed that yet in cperl, I just added a workaround in the compiler with this patch


Overall we are very happy with the new 5.22 compiler, though we are not yet using the much more advanced cperl optimizations. The B::C optimizations alone lead to ~20% less memory; with cperl and its compiled readonly hashes for Config and warnings and its upcoming support for static GV/AV/CV/PAD/HEK layout it's much more dramatic. This will then be a real COW (copy-on-write) mechanism, able to statically allocate readonly buffers and copy them to the heap when they are changed. For the compiler we only need to ensure that static buffers are not freed, which is trivial with the added flag.

-m support for perlcc, i.e. compiling to modules instead of single binaries, is also improving. This can split various optimizations per module/.pm file, so we can use B::CC compiled modules or even rperl compiled modules. Compile-times should go down from 20min to ~5min, with much faster smoker feedback, and pushing updates live is much faster, because they will be much smaller. The old compile times were 2 hours.

But since fixing B::C for 5.22 needed so much more time than expected I couldn't add most of the planned cperl optimizations for the upcoming cperl-5.22.2 release and B-C-1.53 release.

cperl-5.22.1 released

https://github.com/perl11/cperl/releases/tag/cperl-5.22.1

The name cperl stands for a perl with classes, types, compiler support, or just a company-friendly perl, but currently it's only a better 5.22 based variant without classes.

Currently it is about 1.5x faster than perl5.22 overall, >2x faster than 5.14 and uses the least amount of memory measured since 5.6, i.e. less than 5.10 and 5.6.2, which were the previous leaders. While perl5.22 uses the most memory yet measure…

gdb-dashboard

https://github.com/cyrus-and/gdb-dashboard

wget https://raw.githubusercontent.com/cyrus-and/gdb-dashboard/master/.gdbinit -O .gdb-dashboard

sed -i 's,python Dashboard.start(),#python Dashboard.start(),' .gdb-dashboard

joe .gdbinit
source .gdb-dashboard
python Dashboard.start()
