The mysterious case of the SVt_PVIV

The other day I wanted to send my friend some silly emojis on LINE, so I updated my flaky old Unicode browser to handle the new-fangled Unicode values above 0x10000, so that I could fetch the emojis, which start around there. The browser also features a Perl script which fetches values from Unicode::UCD using the charinfo function. I updated to Perl 5.32 around the same time. The funny thing was that I then started getting all kinds of errors about invalid JSON in the browser console. My Perl script was sending something of the form {... "script":Common ...} via my module JSON::Create, which is not valid JSON because there are no quotes around Common, so obviously my module was at fault.

Investigating the fault led me into the XS (C) code of my module, where I found that the value associated with the script key in the hash reference returned by charinfo was a scalar of type SVt_PVIV. PV means "pointer value", which is basically a string, and IV means "integer value"; you can probably guess what that is supposed to contain.

My stupid module assumed that the string in an SVt_PVIV was just a representation of the IV part, so it printed the PV as a string without quotes, leading to the unquoted Common above. But that assumption turns out to be wrong. Is it some kind of "dual variable"? It turned out that the IV part wasn't even valid, so forcing the module to treat the SVt_PVIV as an IV didn't work either. The solution at the moment is to test with the SvIOK macro whether the IV part is valid, and treat the scalar as a string if it isn't.
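A quick way to see what I'm talking about is to dump the scalar's internals with Devel::Peek. This isn't code from the module, just a sketch of how one might poke at the value returned by charinfo:

use warnings;
use strict;
use Unicode::UCD 'charinfo';
use Devel::Peek;

# Dump the internals of the "script" value for an emoji code point.
my $info = charinfo (0x1F600);
Dump ($info->{script});

If the value really is an SVt_PVIV whose IV part is invalid, the dump should show an SV of type PVIV with the POK flag set but no IOK, which is exactly the situation the SvIOK test catches.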

The mysterious part for me is why the script value is an SVt_PVIV in the first place. Answers on a postcard, or comment below if you prefer.

I tried to replicate this bug for testing purposes using Scalar::Util's dualvar, but that creates an SVt_PVNV (floating point/string combo), which my daft module treated differently again.
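For what it's worth, here is the kind of thing I tried, together with Devel::Peek showing why it doesn't reproduce the problem. This is only a sketch, not the module's actual test code:

use warnings;
use strict;
use Scalar::Util 'dualvar';
use Devel::Peek;

# dualvar gives back an SVt_PVNV even when the numeric part is an
# integer, so this doesn't reproduce the SVt_PVIV case at all.
my $dv = dualvar (21, 'Common');
Dump ($dv);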

JSON::Create now features indentation

In version 0.27 of JSON::Create I added a new indentation feature, basically out of necessity. Originally the purpose of the module was sending short bits of JSON over the internet, but I've been using JSON more and more for processing data too. I've spent quite a long time working on a web site for recognition of Chinese, and the basic data file for that site is a 168 megabyte JSON file. Leaving that kind of file unindented makes for "interesting" problems if one accidentally opens it in an editor or on a terminal screen; a million characters all on one line tends to confuse even the best-written text reading utilities. So after years of suffering, the relief is tremendous: JSON::Create now has tab-based indentation.

Originally I thought that I should make all kinds of customisable indentation possible, but then it occurred to me that any fool armed with a regular expression can easily alter the indentation however they want. I put a simple example in the documentation.
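Something along these lines is what I have in mind; the substitution swaps each tab of indentation for four spaces. This is just a sketch, with a stand-in string rather than real output from JSON::Create:

use warnings;
use strict;

# Re-indent tab-indented JSON with four spaces per level of indentation.
my $json = "{\n\t\"numbers\":[\n\t\t1,\n\t\t2\n\t]\n}\n";
$json =~ s/^(\t+)/'    ' x length $1/gme;
print $json;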

av_fetch can return NULL

If you create an array by inserting a value like this,

$thing{key}[10] = 1;

and then don't populate the rest of the array, a call to av_fetch on that array to retrieve any of the elements below the tenth one may return NULL.
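From the Perl side you can see the holes easily enough. The unassigned slots simply don't exist, and those are the ones for which av_fetch may hand back NULL at the C level:

use warnings;
use strict;

my %thing;
$thing{key}[10] = 1;

# Slots 0..9 were never assigned, so nothing is stored in them at all.
for my $i (0 .. 10) {
    printf "%2d: %s\n", $i, exists $thing{key}[$i] ? 'exists' : 'no element';
}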

I found this out the hard way, via segmentation faults caused by the following dereference:

https://metacpan.org/source/BKB/JSON-Create-0.24/json-create-perl.c#L1021

The important thing here is not to rely on av_fetch never returning NULL. I fixed the problem in JSON::Create version 0.25:

https://metacpan.org/source/BKB/JSON-Create-0.25/json-create-perl.c#L1036

I chose to put the JSON value "null" into the output if av_fetch returns NULL:

https://metacpan.org/source/BKB/JSON-Create-0.25/json-create-perl.c#L1044
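Here is a quick way to see the behaviour from Perl, assuming version 0.25 or later of the module:

use warnings;
use strict;
use JSON::Create 'create_json';

# The unassigned slots come out as the JSON value null rather than
# causing a crash.
my %thing;
$thing{key}[3] = 1;
print create_json (\%thing), "\n";
# Should print something like {"key":[null,null,null,1]}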

What to do with doubly-broken UTF-8?

I recently got a few test reports like this:

www.cpantesters.org/cpan/report/49de90f8-4ec9-11e9-98fa-fc611f24ea8f

Although I've put all kinds of stuff in my test file:

https://metacpan.org/source/BKB/Lingua-JA-Moji-0.56/t/katakana2syllable.t#L9-13

the CPAN testers don't like that. How to deal with these garbage characters?

The solution is this:

#!/home/ben/software/install/bin/perl
use warnings;
use strict;
no utf8;
use FindBin '$Bin';
my $got = 'ック';
my $expected = 'ソー';

dec ($got);
dec ($expected);

exit;

sub dec
{
    my ($in) = @_;
    # The text has been encoded into UTF-8 twice, so decode it twice to
    # get back to the original characters.
    utf8::decode ($in);
    utf8::decode ($in);
    print "$in\n";
}

This turns the doubly-encoded garbage back into readable characters:

[ben@mikan] {14:28 25} moji 513 $ perl ~/oneoff/superdecode.pl 
ック
ソー

Be generous, but not very

This is another blog post based on my experiences of validating user inputs. Here is the previous one, where I argue against the use of "eval".

"Be generous in what you accept" says the adage. I want to argue that generosity works sometimes, but it's all too easy to be generous to a fault, and cheerfully responding to nonsense input is a losing battle. The problem with being too generous is that you end up correcting or even rejecting the slightly poor inputs in a flawed quest to make sense of complete drivel.

Let's take the case of processing English numbers in words as an example, because I have a web site which does this. It's quite easy to make a number parser which parses well-formed English words like "twenty-one" into numerals, "21", and a lot of people started out in programming doing simple exercises like that, in BASIC or something. The next step is converting somewhat broken but basically understandable inputs like "twentyone" or "fourty two" or "a milion" into numerals. These are clearly numbers alright, so it's just a case of someone whose spelling or punctuation is a bit shaky. So because we're using Perl, we shake a few regular expressions at the problem, like matching fou?rty or \b(b|m|tr)ill?ions?\b or something.
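Here is a sketch of the sort of fix-ups I mean. It isn't the site's actual code, just an illustration:

use warnings;
use strict;

# Patch up some common misspellings and missing punctuation before
# handing the text to the real number parser.
my @inputs = ('fourty two', 'a milion', 'twentyone');
for (@inputs) {
    s/\bfou?rty\b/forty/g;                    # "fourty" -> "forty"
    s/\b(b|m|tr)ill?ions?\b/${1}illion/g;     # "milion", "billions" -> "million", "billion"
    s/\btwenty(one|two|three|four|five|six|seven|eight|nine)\b/twenty-$1/g;
    print "$_\n";
}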

So far so good, but what happens when we descend into the maelstrom of trying to make sense out of absolutely any kind of input? Here are some genuine examples of rejected inputs to the above-mentioned number converter, randomly chosen:

what
thank you very much master
3'000,000
ichiro yamada
Kaku gidai ni mōke rareta mojisū no rūru ni shitagatte bunshō sakusei o okonatte kudasai
8 mili
1000 cherry trees
down
201; 202; 203; 204
diamond
thank
X767
one times ten to the fourty-eighth
ga be fuo n su
September 28, 2008
create account
3-Nen-me
forth
deb
6\30\2007

The problem you get when you try to be "generous" to these kinds of inputs is that you end up wrecking the mechanisms you made to fix up the moderately bad stuff. Here's an example: because it's a Western-to-Japanese converter, one common thing I get is people adding "en" or "yen" to the end of numbers. What happens when you try to allow for this? The problem is that "en" collides with "ten", so if I allow "a million en", then the user who types "twentyten" gets that misinterpreted as "twentyt en", and then gets hit with an error message. This is the big problem with Lingua::EN::Numericalize, which turns "america" into "americ1" in its desperate quest to convert everything into numbers, and it's why I had to stop using it.
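To make the "en" problem concrete, here is roughly what goes wrong with a naive suffix strip. Again, this is a sketch rather than the converter's real code:

use warnings;
use strict;

# Stripping a trailing "en" or "yen" looks harmless until it meets an
# input like "twentyten".
for my $input ('a million en', 'twentyten') {
    (my $stripped = $input) =~ s/\s*y?en\s*$//;
    print "'$input' -> '$stripped'\n";
}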

Here are some 100% genuine inputs which I do accept, again randomly chosen:

thousend
zero , one
thirty-one
One thousand two hundred thirty four
573.4 million
one thousand amd sixty eight
FORTY SEVEN
1 billion, 51 million 855 thousand
Ninety-nine millions, nine hundred and ninety-nine thousands, nine hundred and ninety-nine
two thousand and eighteen
One thousand one hundred eleven
Eighty five thousand four hundred fifty five point eight
1.89 million
397.2 billion
12.5 million
50million300thiusand
seven bilion seven hundred thirteen million two thousand eightynine
one billion nine millions and thirty three
One hundred and twenty-three
5.6 billion

I also encountered this problem with the CPAN module JSON::Repair. I made the module initially in the hope that it would be able to turn all kinds of things, like HJSON, into pure JSON, but I found that if I made it try too hard to repair one kind of breakage, that would end up undermining the repair of another kind. The only solution was to restrict it to a modest number of repairs of fairly unambiguous problems. There is a little bit of documentation about that here.

There is probably some kind of information-theoretic statement of this conclusion in terms of error-correcting codes and redundancy and measures of spaces over spaces and whatnot, but I'm not completely sure what it is. Still, it's interesting that over-generous attempts to accept user inputs end up throwing the baby out with the bathwater, and it isn't something you find out until you have a big collection of user inputs, both valid and invalid, to test your program with.