You think you're an X, but you're only a Y

The other day I was converting the output of a Git::Raw::Commit into JSON using my module JSON::Create, when I noticed an oddity:

{
"commits":[
    {
        "body":null,
        "id":"27ed4669e32ce2d14831c719dfd5b341a659788e",
        "message":"Remove a stray html ending tag\n",
        "time":"1609997818"
    },

The "time" part always comes out as a string, even though it's clearly a number. Was this a bug in my module, some kind of dual-string-and-number wannabe variable which JSON::Create had falsely turned into a string?

As it happens, no. Git::Raw actually puts the number into a string. (The newSVpv there makes a new Perl string, and the sprintf above that does exactly the same job as Perl's sprintf.)

So Git::Raw turns the original C variable, of type git_time_t, a 64-bit integer representing the number of seconds since the "epoch" (1970), into a string, perhaps to avoid the "year two million" problem or whatever, because Perl's numbers can only hold integers of up to 52 or 53 bits exactly.

Anyway, Perl's monkey business with numbers and strings, and its lack of booleans, makes creating JSON quite complicated, although not as complicated as identifying cats in photographs and YouTube videos.

[Images: gary-7-cat-50pc.png, gary-7-cat-cat-50pc.png]

ABC Mart

[Image: 800px-ABCMART_Minamisomaharamachi_Shop.jpg]

One thing I like about Mojolicious is that the documentation puts all the functions and methods in alphabetical order. When you get used to that and then go back to a module like Git::Raw::Repository, with a large number of functions in apparently random order, alphabetical ordering starts to seem like quite a smart move for the reader.

Anyway, I thought so, so I've been putting all the functions, methods, and other things in my modules into alphabetical order. I even started to write tests that they really are alphabetical, since I usually manage to slip up on these things.

I'm Making Headway Now

Last January there was a post on Reddit which claimed that my module JSON::Parse was not only failing some of the JSON Test Suite tests, but also crashing on one of them. I should have got around to doing something about it sooner, but here, finally, are my conclusions.

First of all there was a crash on one of the files, which went something like this: [{"":[{"":[{"", repeated about 100,000 times. (The actual file is here if you really want to see it.) Investigating it on a Linode, I found that after about 80,000 open brackets the stack was overflowing, causing the crash. Even adding a printf in the middle of my code was enough to trigger the overflow, so the problem wasn't really in my code; the stack just seems to be quite small on Linux.

There are various things one could do to tackle this, but it seems a bit unlikely that anyone would genuinely want that many open brackets, so my strategy was to add a "max_depth" of parsing, beyond which the parser stops. I thought 10,000 open { and [ should be enough for anyone, and it satisfies the people who want to run the JSON Test Suite, but I also added options for the user to set and get the maximum depth.

After finishing that I started to wonder what other modules were doing such that they passed this test. A quick glance at the Cpanel::JSON::XS documentation suggests those people had the same idea, but their maximum depth is a measly 512 as opposed to my whopping 10,000. I'm not sure which is better or worse. 😕

There were a couple of other tests in the JSON Test Suite which I failed, to do with Unicode: the JSON Test Suite thinks it is compulsory for JSON parsers to pass through Unicode non-characters, UTF-8 representing surrogate pairs, and so on. I'm not really sure why they think so; there is a bit of a discussion on GitHub, but it doesn't strike me as very convincing. Frankly I think JSON::Parse is a much more useful tool for users if it rejects the non-characters and other crap, 💩 so I would need to see much stronger justification than I have seen so far before altering it in the way the JSON Test Suite suggests.

In other news, I've also made it possible for the module to be installed directly from GitHub, so it should now be possible to start using continuous integration. I've already got Travis working, and I hope to add some more useful things soon.

Misusing newSVpv

I managed to cause another set of obscure bugs by misusing newSVpv. The goal of this code is to split the RGB and the alpha (transparent) part of a PNG image, for the use of the PDF::Builder module on CPAN.

Here the SV in newSVpv is "scalar value", and the "pv" means "string". My code says this:

sv = newSVpv ("", len);

and it caused a crash on Solaris and other operating systems, because the above statement is a bad idea. What happens is that Perl copies "len" bytes from my string, and "len" might be a very large number for a large PNG image, but I've given it the empty string "" to copy from. So Perl tries to copy from uninitialised, or possibly even inaccessible, parts of the computer's memory. I found my error, only after almost giving up, by using valgrind to look for memory errors.

The correct version of this is

 sv = newSV (sv_len);
 SvPOK_on (sv);
 SvCUR_set (sv, sv_len);

We make a new SV, then we tell Perl that our new SV is meant to be a string using SvPOK_on, then we tell Perl the length of our string is sv_len with SvCUR_set.

Here is the manual page from Perl 5.32 (perldoc perlapi) for newSVpv:

**newSVpv** Creates a new SV and copies a string (which may contain "NUL" ("\0") characters) into it. The reference count for the SV is set to 1. If "len" is zero, Perl will compute the length using "strlen()", (which means if you use this option, that "s" can't have embedded "NUL" characters and has to have a terminating "NUL" byte).

It doesn't really say that it is going to copy len bytes of the string, but I suppose that is obvious in retrospect since it does say that the string may contain NUL characters, so the only test it could be making for the end of the string would be the number of bytes.

This function can cause reliability issues if you are likely to pass in empty strings that are not null terminated, because it will run strlen on the string and potentially run past valid memory.

This seems to be warning about what happened to me in Image::PNG::Libpng version 0.55, but what is an "empty string" that is "not null terminated"? An empty string, to me, is "", which is automatically NUL terminated.

Using "newSVpvn" is a safer alternative for non "NUL" terminated strings. For string literals use "newSVpvs" instead. This function will work fine for "NUL" terminated strings, but if you want to avoid the if statement on whether to call "strlen" use "newSVpvn" instead (calling "strlen" yourself).

SV* newSVpv(const char *const s, const STRLEN len)

I don't really see why that would be safer. Here is the documentation for newSVpvn:

**newSVpvn** Creates a new SV and copies a string into it, which may contain "NUL" characters ("\0") and other binary data. The reference count for the SV is set to 1. Note that if "len" is zero, Perl will create a zero length (Perl) string. You are responsible for ensuring that the source buffer is at least "len" bytes long. If the "buffer" argument is NULL the new SV will be undefined.

SV* newSVpvn(const char *const buffer, const STRLEN len)

It seems to be the same thing, except that it doesn't look for the NUL (the zero byte) but insists on being given the string length, so the only time it would be "safer" is if len was accidentally set to zero. So what the documentation means by "an empty string which is not null (NUL) terminated" is the case where the user passes a length of zero along with a pointer which is not NULL, but which points to some random place in memory where no NUL (zero, '\0') byte is to be found; newSVpv then goes on a wild goose chase over random bytes of memory looking for the end of the string. I wonder whether that has ever actually happened to anyone, or whether it was just a possibility that the authors were worried about.

Anyway, that doesn't cover dummies like me who send in NUL-terminated strings but lie about their length to try to get extra memory allocated. I may have been the first person ever to abuse newSVpv in this way.

Also, wouldn't it be clearer to say that it creates a new SV of length len+1 and copies len bytes of a string into it, if the string (buffer) is not NULL, or no bytes if it is NULL?

It would also be nice if it had been explained a bit better what to do with the return value of newSV. My initial problem, the reason I ended up abusing newSVpv, was that I couldn't work out how to tell Perl that the thing I'd created with newSV was a string. The answer turned out to be SvPOK_on, which is documented, but which I in fact found by rooting through the Perl source code trying to work out what SvPOK was checking for. The documentation of newSV in perlguts says this about it:

In the unlikely case of a SV requiring more complex initialization, you can create an empty SV with newSV(len). If len is 0 an empty SV of type NULL is returned, else an SV of type PV is returned with len + 1 (for the NUL) bytes of storage allocated, accessible via SvPVX. In both cases the SV has the undef value.

I don't think my case is particularly unlikely. It seems fairly normal to want to allocate some memory before actually retrieving the "string" which will be written into it; in my case I am calling a libpng routine to write the "string" into the buffer, so I need to allocate the buffer before the call. If I allocate the memory myself and then use newSVpv on my allocated memory, there is an unnecessary copy of the data, so it seems sensible to write straight into the buffer of the SV.

It wasn't very clear to me what an SV of type PV with the undef value was, or how I could turn my SV with the undef value into something which Perl recognised as a string. My own preference would be for the documentation to go into more practical details of how to accomplish tasks.

Alles in Ordnung

Perl returns its hash keys in a random order, and since 5.18 or so the random order changes with every run of the program. So if you loop over your hash keys, you get a different ordering each time.

for my $k (keys %hash) { }

No problem you say, I'll use sort to order my keys.

for my $k (sort keys %hash) { }

But what if you want to use a non-default order, like case-insensitive? Easy-peasy you claim.

for my $k (sort {uc ($a) cmp uc ($b)} keys %hash) { }

Now here's my problem. I'm using XS to loop through the hash, and I want to sort the keys in the hash according to the user's preference.

Well, there must be a way to sort things with the Perl API, mustn't there? There is something called sortsv_flags, which wants a user-defined sorting routine of type SVCOMPARE_t. So if I stuff the hash keys into an array of pointers, I can sort them with Perl's default ordering by using Perl's default comparison routine, which is called Perl_sv_cmp. So that is that "sorted".

However, what if I want to allow a user-defined ordering in my XS program? I had a look with Google and with grep cpan, but I couldn't find much information, or an example of an XS module which does this. Writing a callback to call a Perl function from C is documented in perlcall, but it's not at all clear to me how I can get at the user-defined function from within my SVCOMPARE_t function, since there seems to be no way to pass my object into it.

The way I would do this in C is by using qsort_r, which lets me pass in a void * pointer which I can fill with anything I want. The easiest way to go seemed to be to use that, but qsort_r is another can of worms, since it is not standardised.

qsort_r started out on BSD, then Gnu implemented it, but Gnu had the bright idea of putting the arguments in the opposite order to BSD. Then Bill Gates came along and thought he would implement it too for his "Windows" operating system, but he decided to call it qsort_s, with some of the arguments in the same order as Gnu, and some in the same order as BSD. 👷

Some bright spark figured all this out though and made some macros, which you can find on GitHub, which do it all for you and give you a universal qsort_r-like function. That would be great, except that the macros don't actually work on 🍓 Strawberry Perl, since that looks too much like a Gnu environment, and when I changed the macros so that they gave the correct definition of qsort_s, I found out that some versions of Windows don't even seem to have the qsort_s function.

Thankfully the 👑 Regents of the University of California 👑 have made their operating system open source, so what I did in the end was just to copy the BSD qsort_r source code into my module, give it a different name, and voilà: I now have reliable user-defined sorting in the module, and it seems to work OK so far.

But is this really the best possible solution? Isn't there some clever trick one can use to access Perl's sorting with a user-defined function from XS, and get all the $a and $b stuff too?