How should a JSON parser handle character strings?

How should a JSON parser handle character strings with non-ASCII characters? My humble opinion: fatal error. Here's what the JSON parsers on my machine did:

http://pastebin.com/mwui3iDy

19 Comments

How should a JSON parser handle character strings with non-ASCII characters? My humble opinion: fatal error.
The JSON specification states:
3. Encoding

JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.

Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.

00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8

My JSON parser gives:


JSON::Parse: Eat More Turkey \x{2605}

You might be interested that \u2605 is the JSON format for Unicode characters.

You're confusing character strings with byte strings. This is thoroughly documented in the JSON module. When dealing with character strings, use JSON::from_json or JSON->new->decode. Works fine.

If you add these lines to your loop tests, you should have the same results as you currently do with JSON::DWIW.

['JSON::PP', 'JSON::PP::from_json']
['JSON::XS', 'JSON::XS::from_json']

Or if you remove use utf8, JSON::XS::decode_json and JSON::PP::decode_json should start working (and the examples above will not).

To sum it up, pass UTF-8 encoded strings to decode_json or Perl internal strings to from_json. Literal Unicode strings in your Perl code will be in Perl internal format if you use utf8 and UTF-8 encoded otherwise.
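As a rough illustration of that summary, here is a minimal sketch using the object interfaces of the core JSON::PP module. It assumes the source file is saved as UTF-8, and the variable names are made up for the example. The `->utf8` decoder is given bytes and the plain decoder is given characters; the last step feeds a character string to the byte-oriented decoder, which is expected to be rejected, though the exact behaviour may depend on the module and version.

```perl
use strict;
use warnings;
use utf8;                       # literals in this file are character strings
use JSON::PP;
use Encode qw(encode);

my $chars = '["Eat More Turkey ★"]';      # Perl character string
my $bytes = encode('UTF-8', $chars);      # the same text as UTF-8 bytes

# Byte-oriented interface: UTF-8 decoding plus JSON parsing in one step.
my $from_bytes = JSON::PP->new->utf8->decode($bytes);

# Character-oriented interface: JSON parsing only.
my $from_chars = JSON::PP->new->decode($chars);

print "both interfaces agree\n" if $from_bytes->[0] eq $from_chars->[0];

# Mismatched pairing: character string into the byte-oriented decoder.
my $accepted = eval { JSON::PP->new->utf8->decode($chars); 1 };
print $accepted ? "character string accepted\n" : "character string rejected\n";
```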

Peter Sergeant responds:


Ben: if you don't throw a fatal error with a character string with high codepoints, you're doing it wrong.

I'm not sure what practical problem it is that you're addressing. What you seem to be saying is that if I make a Perl parser to parse Perl variables from one form into another (from a JSON string to a Perl hash or list), it should create a fatal error if it encounters a well-defined Perl construction, rather than simply convert that well-defined Perl construction into another well-defined construction in the obvious way.

As you quote from the spec, JSON is encoded as UTF-whatever: this means there is no possible 'right' behaviour when encountering a character that can't be represented in bytes without further information.

In a general context that might be true, but in the context of a Perl program, we do indeed know what to do with such a character, since the inputs and outputs are all Perl variables. As I stated above, I am not sure what actual problem you are trying to solve. Can you give a real-world example of where this kind of fatal error would be desirable?

Just because it's a well-formed Perl variable doesn't make it valid JSON. Because JSON has a defined (and required) encoding, where Perl character strings don't.
If the input is a Perl character string, and the output is a Perl character string, could you please give an example of how this causes problems?
To be in a position where you are parsing a "JSON string" that is in fact a Perl character string with exotic characters, someone somewhere has messed up, because someone has prematurely decoded it.
Could you show an example of a practical situation where that became a problem?
Silently accepting that with an undefined outcome is how applications become encoding nightmares.
Then I'm sure that you will have lots of actual examples of programs which have become encoding nightmares because they turned Perl character strings into other Perl character strings. Could you tell me about one of them?
Throwing a fatal error - unless the user has explicitly asked you not to - is unambiguously the correct behaviour.
I wonder if I could ask you for a simple practical example of what problems happen if this fatal error does not occur.


First of all thanks for responding politely, and apologies for the nagging tone of my previous post.


Let's say you have a junior developer working on a web development project using Catalyst. Another developer has installed a plugin which automatically encodes and decodes UTF-8 on the incoming and outgoing, because it seemed like a good idea.

So the plugin automatically decodes UTF-8 bytes into Perl's character strings, right?

The junior developer writes a handler which accepts, acts on, and then replies with a JSON file. Sadly their JSON handler is a little naughty, and silently, incorrectly handles the incoming data.

That is not an example of a correctly-functioning JSON parser which turns character strings into character strings causing a problem. That is an example of a bug in a JSON parser causing a problem.

No problem is apparent to the developer at this point. They spit out JSON files to a different part of the system, again via the handler, and now a material problem /is/ created, because the UTF-8 magic double-encodes the outgoing JSON.

Again, that is not an example of a functioning JSON parser turning character strings into character strings causing a problem. That is an example of a malfunctioning JSON parser which takes a character string as input, doesn't test it, and assumes that it is a byte string. In other words that is just an example of a bug.

Corrupt data gets put in to the receiver, and the person who has to clear up the mess dies of stress. Had we thrown an error at the time, it would have triggered the developer to try and work out what was going on as opposed to what should be going on.

So the JSON parser which is buggy should have thrown an error because of its bug? Better to fix the bug, no?

Let me be more specific: I would like you to give me an example of where a JSON parser which correctly functions in all other respects, but given a character string as input, outputs character strings, and given a byte string as input, outputs a byte string, causes a problem. I want to see an example, not of bugs in JSON parsers, but of what problems can be caused by a non-buggy JSON parser which does the obvious operation of turning something like

my $json = '{"ß":"ss"}';

into a Perl structure something like

my $perl = {ß => 'ss'};

treating $json as a character string.


So let me turn this back to you and ask: in what situation will you legitimately receive a JSON string that is a character string? It only ever happens as a result of an error, and so it should be treated as such.

Well, the JSON could come from a file opened with something like

open my $file, "<:encoding(utf8)", "json-file";

And it's perfectly straightforward to check whether the input data is a character string or not. In the case of the module I mentioned above, it tests SvUTF8 on the input string and remembers the result:

if (SvUTF8 (json_sv)) {
    json_argo_t * json_argo_data = jpo->ud;
    json_argo_data->utf8 = 1;
}

then it switches on UTF-8 for the output strings:

if (data->utf8) {
    SvUTF8_on (string_sv);
}

I'd really appreciate seeing an example, preferably in code, of how this could possibly cause a problem somehow.


If your module can see FULL WELL that there is something wrong, why would you ever silently handle that by default? In what circumstance would that ever be the right thing to do, rather than making the user specify that they know something is wrong?

Because I don't consider character string inputs to be "something wrong". A routine which converts JSON into a Perl hash shouldn't be any different from something like

s/ß/ss/;

which works differently if there is

use utf8;

at the top of the code than if there is not, and also works differently on byte strings and character strings.
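To make that difference concrete, here is a small sketch. It assumes the source file is saved as UTF-8, and the variable names are hypothetical. The same literal is compiled once under `use utf8` and once under `no utf8`, and the substitution is written with an explicit code point so the pattern is identical in both cases.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
my $as_chars = do { use utf8; "ß" };    # one character, U+00DF
my $as_bytes = do { no utf8;  "ß" };    # the two bytes 0xC3 0x9F

printf "characters: %d, bytes: %d\n", length $as_chars, length $as_bytes;

# The same substitution matches the character string only.
(my $subst_chars = $as_chars) =~ s/\x{DF}/ss/;
(my $subst_bytes = $as_bytes) =~ s/\x{DF}/ss/;

print "character string was rewritten\n" if $subst_chars eq 'ss';
print "byte string was left alone\n"     if $subst_bytes eq $as_bytes;
```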

I note you even rail against this in the documentation of your own module. You say:
Unfortunately that documentation is out of date, I should have deleted that paragraph. That was for an older version with a lexer based on "flex". The new lexer actually doesn't do any checking at all.


OK. It seems that every other experienced Perl developer in the world is happy with the way the JSON module is prepared to deserialise from, and serialise to, character strings as well as byte strings. Only Pete Sergeant has a problem. Now, either Pete is a tin-foil-hat genius, and all the other experienced Perl developers are a bit thick, or Pete's missed something. Since Perl folk are a kindly lot, if Pete has a misunderstanding, we will try to help him.

Let's go back through one of Pete's walkthroughs and see if we can de-bug the understanding:

Let's say you have a junior developer working on a web development project using Catalyst. Another developer has installed a plugin which automatically encodes and decodes UTF-8 on the incoming and outgoing, because it seemed like a good idea.

(I assume you mean "decodes and encodes UTF-8 on the incoming and outgoing" — if it were the way round you suggest we would not be getting anywhere.)

It seemed like a good idea because it is a good idea.

The junior developer writes a handler which accepts, acts on, and then replies with a JSON file. Sadly their JSON handler is a little naughty, and silently, incorrectly handles the incoming data.

Now, you've said here that the "JSON handler is a little naughty". According to what you've written elsewhere, this means that the JSON deserialiser interface used is the one which takes a character string as input, so JSON::from_json or JSON->new->decode (and so does not include the byte-to-character decoding step).

That being the case it will correctly deserialise the incoming data, recovering the original structured data.

So, we've already identified a difference between the Pete world-view, and the view taken by the remainder of the experienced Perl developer community. Perhaps the problem with Pete's understanding lies here?

No problem is apparent to the developer at this point.

That's because in fact there is no problem!

They spit out JSON files to a different part of the system, again via the handler, and now a material problem /is/ created, because the UTF-8 magic double-encodes the outgoing JSON.

OK, well, no.

The developer, having previously used JSON::from_json or JSON->new->decode to deserialise the incoming data, will naturally use JSON::to_json or JSON->new->encode to serialise the outgoing data.

These functions will serialise the structured data to a Perl character string.

This Perl character string will circulate happily in the Perl environment until it reaches the border to the non-Perl world, at which point the border patrol will encode it using UTF-8.

There's no magic; rather it's good practice. Within the Perl environment, character data circulates as character strings. When that data gets to the border, it gets encoded using UTF-8.

Corrupt data gets put in to the receiver, and the person who has to clear up the mess dies of stress.

Again, no. The resultant byte stream will have a single layer of UTF-8 encoding, which is what we want.
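The flow described in this walkthrough can be sketched roughly as follows. This is only an illustration using JSON::PP and Encode, with hypothetical variable names, and with explicit `decode`/`encode` calls standing in for whatever plugin does the border work. The last few lines show, for contrast, the double encoding that would result if the byte-oriented serialiser were used and the border then encoded the result again.

```perl
use strict;
use warnings;
use JSON::PP;
use Encode qw(decode encode);

# UTF-8 encoded JSON as it might arrive from outside the program.
my $incoming_bytes = qq(["Eat More Turkey \xE2\x98\x85"]);

# Border, inbound: bytes become characters.
my $json_chars = decode('UTF-8', $incoming_bytes);

# Inside the program: character-oriented JSON only (no ->utf8).
my $codec     = JSON::PP->new;
my $data      = $codec->decode($json_chars);
my $out_chars = $codec->encode($data);

# Border, outbound: characters become bytes, exactly once.
my $outgoing_bytes = encode('UTF-8', $out_chars);
print "single layer of UTF-8\n" if $outgoing_bytes eq $incoming_bytes;

# Contrast: byte-oriented serialiser followed by the border encoding again.
my $double_encoded = encode('UTF-8', JSON::PP->new->utf8->encode($data));
printf "single: %d bytes, double: %d bytes\n",
    length $outgoing_bytes, length $double_encoded;
```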

It seems to me that the only person dying of stress is Pete, who seems to be stressing over a problem which does not exist.

So I am really struggling to see what the problem is.

Point of terminology: When I say "serialisation", I mean that a data structure is converted into a flat one-dimensional array of somethings, in this case (Unicode) characters. "Deserialisation" is the converse.

I think we are at an impasse here.

I think it is Pete who is at an impasse; the rest of the experienced Perl developer world is happy. The rest of the experienced Perl developer world could ignore Pete and happily go on about its business. The only problem with the status quo is where Pete is put in charge of a project and his failure to grok the situation means he insists on implementing roundabout solutions to problems which do not exist, and in the process inconveniences other people.

Your position is that as you can handle character strings in a way that doesn't cause data corruption, then you should - that as the parse operation is going to perform the conversion anyway, that certain characters have already been decoded is ok. You reinforce this position by asking me for examples where premature conversion causes data corruption - it does not, at the parse step.

My position is that a parser should enforce valid input, and that the only valid input for JSON is UTF-8(16/32/bla) bytes, as per the RFC you quoted.

And stop right there. That is the point of your misunderstanding. The only valid encodings for byte-encoded JSON are UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. However JSON serialised data is fundamentally a stream of Unicode characters. I refer you to the opening words of section 3 of RFC4627:

JSON text SHALL be encoded in Unicode.

Read that over: JSON text SHALL be encoded in Unicode. OK, ready? The next bit says:

The default encoding is UTF-8

The default encoding is UTF-8. In a byte stream environment, the Unicode character stream needs to be encoded to a byte stream, and RFC4627 (handily) lays down the law on exactly which byte encodings are valid.

But Perl is not a byte stream environment; it is a character stream environment, so no explicit encoding is required. Yes, under the bonnet Perl encodes Unicode strings using UTF-8, but as a Perl programmer (as opposed to an XS programmer), that's not supposed to be relevant to me. Let's go on to look at section 5:

A JSON generator produces JSON text.

A JSON generator produces JSON text. And section 3 says "JSON text SHALL be encoded in Unicode". Yeah? It's talking about Unicode text. It does lay down the rules for how that text must be encoded if it is to be passed into a byte space, but nowhere does it say that the JSON serialisation and the byte encoding must be combined into a single inseparable code unit.

Provided that the byte encoding is ultimately done in a way which complies with the rules laid down in RFC4627, then the end result is perfectly compliant.

In Perl, since the existing "border patrol" layers generally operate between UTF-8 and characters, it's perfectly reasonable to leverage those existing layers for processing JSON data. So rather than inseparably welding a set of UTF-8 encoding and decoding layers to the JSON serialisation and deserialisation code, we can do the JSON serialisation and deserialisation in one place, and do the byte encoding and decoding in the border patrol.

This is exactly the model that the JSON libraries support.

Just because you can safely handle incorrect input

Let me stop you again. A JSON deserialiser takes a stream of (Unicode) characters. If you pass it a stream of characters, then that's correct input. So I'll skip until I get to the next bit which isn't predicated on that.

doesn't mean you should: take an HTML parser that doesn't complain about a missing '' tag, for example, or a YAML parser which silently upgrades literal tabs to be 4 spaces.

I believe this is important from years of dealing with encoding issues. It is vitally important - in my opinion - that when dealing with non-ASCII characters, NOTHING does ANYTHING magic. Ever.

I agree. "Magic" behaviour is a source of problems. But there is no magic going on here. The JSON libraries provide a JSON serialiser and deserialiser and as a piece of API sugar also provide methods which will do the UTF-8 byte encoding or decoding step for you. It is up to the programmer of the calling code to choose which interface to use. If the input JSON data, or the desired JSON output data, is a character stream, then the programmer chooses the interface which just does the deserialisation or serialisation. If the input JSON data, or the desired JSON output data, is a byte stream, then the programmer chooses the interface which rolls in the byte decoding or encoding sugar. All perfectly well-behaved and documented.

And silently accepting incorrect input - even if you can handle it perfectly every time - is an example of that.

Indeed it would be. If the byte-oriented interface silently accepts character data, then that is a bug. However the character-oriented interfaces rightly work with JSON character streams.

I've been bitten in production a few times by people switching between JSON parsers (different versions, JSON::Any finding different modules, people deciding to change the default JSON parser throughout the app) which handle character/byte parsing differently. This is a symptom of developers being allowed to get away with not handling their inputs properly by 'helpful' modules.

No. This is a symptom of not having proper software engineering procedure. There should be (regression) tests which cover this, so that if a programmer does any of the things you mentioned, then the tests break, and the new version can't be released until it's fixed. You of all people should know this — you have a blog at writemoretests.com.

The example you give of opening the file containing JSON with an encoding set - again, it doesn't cause an issue, but that doesn't mean it isn't wrong - JSON is bytes, not characters, by design.

This is your misunderstanding popping up again. JSON is Unicode by design. RFC4627 handily mandates UTF encoding for when the Unicode character stream needs to be in a byte stream, but in a system which handles (Unicode) character streams such as Perl, that encoding is irrelevant.

The JSON libraries weren't written by imbeciles. They're sufficiently high profile that if they were as wrong as you imply, someone would have done something about it by now. If you don't agree with them, by all means submit a patch. But make sure you have a thick skin because the maintainers will surely give you a funny look and tell you to go and read the docs.

I think that is about as helpful as I can be. I have now spent several hours of my life trying to help someone I barely know.

In answer to your original question:

How should a JSON parser handle character strings with non-ASCII characters?

My answer would be:

If you pass the character string to an interface which provides byte-decoding sugar on top of the deserialisation, then it should, as you say, result in an exception.

If you pass the character string to the interface which provides just the deserialisation, then it should deserialise the character string into the relevant Perl data structure.

Taking your example string

["Eat More Turkey ★"]

The interfaces which provide byte-decoding sugar on top of the deserialisation are JSON::decode_json or JSON->new->utf8->decode. If you pass that string to either of those interfaces, then an exception should be thrown.

The interfaces which provide just the deserialisation are JSON::from_json or JSON->new->decode. If you pass the string to either of those two interfaces, it should correctly decode that JSON string into the relevant Perl data structure:

[ "Eat More Turkey \x{2605}" ]

I may be a bit late, but I am mostly with Pete here...

JSON parsers often accept JSON texts in various encodings - utf-8 is obviously very common. They sometimes also accept unencoded JSON, which, IMHO, isn't valid JSON text (a perl string isn't "encoded in unicode", it *is* unicode, which isn't encoded in any way).

So apart from changing "ASCII" to "characters with codes >255", Pete is right: if a JSON parser is asked to decode UTF-8 encoded JSON text, then it should (at least optionally) signal an error on invalid input. \x{2222} is invalid in utf-8 encoded json text (as well as in any other standard unicode encoding).

I have no issue if the parser also accepts unencoded unicode strings in some mode (or has some autodetect mode).

If a parser forces some autodetection, that would be valid, but bad taste :) If a parser accepts invalid input, it's plain broken unless it's a well-documented extension.

As for JSON::XS, the reason you get the error is because perl warns about this case, and you probably convert that into an error. Perl warns when a module properly asks for utf-8 and gets unicode. So maybe not even my module does what you want, but at least it can be configured to do so.

About Peter Sergeant

I blog about Perl.