How should a JSON parser handle character strings with non-ASCII characters?
My answer would be:
If you pass the character string to an interface which provides byte-decoding sugar on top of the deserialisation, then it should, as you say, result in an exception.
If you pass the character string to the interface which provides just the deserialisation, then it should deserialise the character string into the relevant Perl data structure.
Taking your example string:
["Eat More Turkey ★"]
The interfaces which provide byte-decoding sugar on top of the deserialisation are JSON::decode_json and JSON->new->utf8->decode. If you pass that string to either of those interfaces, then an exception should be thrown.
The interfaces which provide just the deserialisation are JSON::from_json and JSON->new->decode. If you pass the string to either of those two interfaces, it should correctly decode that JSON string into the relevant Perl data structure:
[ "Eat More Turkey \x{2605}" ]
I think we are at an impasse here.
I think it is Pete who is at an impasse; the rest of the experienced Perl developer world is happy. The rest of the experienced Perl developer world could ignore Pete and happily go on about its business. The only problem with the status quo is where Pete is put in charge of a project and his failure to grok the situation means he insists on implementing roundabout solutions to problems which do not exist, and in the process inconveniences other people.
Your position is that as you can handle character strings in a way that doesn't cause data corruption, then you should - that as the parse operation is going to perform the conversion anyway, that certain characters have already been decoded is ok. You reinforce this position by asking me for examples where premature conversion causes data corruption - it does not, at the parse step.
My position is that a parser should enforce valid input, and that the only valid input for JSON is UTF-8(16/32/bla) bytes, as per the RFC you quoted.
And stop right there. That is the point of your misunderstanding. The only valid encodings for byte-encoded JSON are UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. However, JSON-serialised data is fundamentally a stream of Unicode characters. I refer you to the opening words of section 3 of RFC4627:
JSON text SHALL be encoded in Unicode.
Read that over: JSON text SHALL be encoded in Unicode. OK, ready? The next bit says:
The default encoding is UTF-8
The default encoding is UTF-8. In a byte stream environment, the Unicode character stream needs to be encoded to a byte stream, and RFC4627 (handily) lays down the law on exactly which byte encodings are valid.
But Perl is not a byte stream environment; it is a character stream environment, so no explicit encoding is required. Yes, under the bonnet Perl encodes Unicode strings using UTF-8, but as a Perl programmer (as opposed to an XS programmer), that's not supposed to be relevant to me. Let's go on to look at section 5:
A JSON generator produces JSON text.
A JSON generator produces JSON text. And section 3 says "JSON text SHALL be encoded in Unicode". Yeah? It's talking about Unicode text. It does lay down the rules for how that text must be encoded if it is to be passed into a byte space, but nowhere does it say that the JSON serialisation and the byte encoding must be combined into a single inseparable code unit.
Provided that the byte encoding is ultimately done in a way which complies with the rules laid down in RFC4627, then the end result is perfectly compliant.
In Perl, since the existing "border patrol" layers generally operate between UTF-8 and characters, it's perfectly logical to leverage those existing layers for processing JSON data. So rather than inseparably welding a set of UTF-8 encoding and decoding layers to the JSON serialisation and deserialisation code, we can do the JSON serialisation and deserialisation in one place, and do the byte encoding and decoding in the border patrol.
This is exactly the model that the JSON libraries support.
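For example (a sketch; the filenames are invented), the byte decoding can live in a PerlIO layer at the border while the JSON code only ever sees characters:

    use strict;
    use warnings;
    use JSON qw(from_json to_json);

    # Border patrol on the way in: the PerlIO layer turns UTF-8 bytes
    # into characters before the deserialiser ever sees them.
    open my $in, '<:encoding(UTF-8)', 'incoming.json' or die $!;
    my $json_text = do { local $/; <$in> };
    my $data = from_json($json_text);

    # Border patrol on the way out: serialise to characters, and let
    # the layer encode them to UTF-8 bytes as they leave.
    open my $out, '>:encoding(UTF-8)', 'outgoing.json' or die $!;
    print {$out} to_json($data);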
Just because you can safely handle incorrect input
Let me stop you again. A JSON deserialiser takes a stream of (Unicode) characters. If you pass it a stream of characters, then that's correct input. So I'll skip until I get to the next bit which isn't predicated on that.
doesn't mean you should: take an HTML parser that doesn't complain about a missing '' tag, for example, or a YAML parser which silently upgrades literal tabs to be 4 spaces.
I believe this is important from years of dealing with encoding issues. It is vitally important - in my opinion - that when dealing with non-ASCII characters, NOTHING does ANYTHING magic. Ever.
I agree. "Magic" behaviour is a source of problems. But there is no magic going on here. The JSON libraries provide a JSON serialiser and deserialiser and, as a piece of API sugar, also provide methods which will do the UTF-8 byte encoding or decoding step for you. It is up to the programmer of the calling code to choose which interface to use. If the input JSON data, or the desired JSON output data, is a character stream, then the programmer chooses the interface which just does the deserialisation or serialisation. If the input JSON data, or the desired JSON output data, is a byte stream, then the programmer chooses the interface which rolls in the byte decoding or encoding sugar. All perfectly well behaved and documented.
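With the OO interface the choice is one method in the chain (a sketch, reusing the turkey string from above):

    use strict;
    use warnings;
    use utf8;
    use JSON;
    use Encode qw(encode);

    my $character_string = '["Eat More Turkey ★"]';
    my $utf8_octets      = encode('UTF-8', $character_string);

    # Byte stream in hand: choose the interface with the UTF-8 sugar rolled in.
    my $from_bytes = JSON->new->utf8->decode($utf8_octets);

    # Character stream in hand: choose the interface which just deserialises.
    my $from_chars = JSON->new->decode($character_string);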
And silently accepting incorrect input - even if you can handle it perfectly every time - is an example of that.
Indeed it would be. If the byte-oriented interface silently accepts character data, then that is a bug. However the character-oriented interfaces rightly work with JSON character streams.
I've been bitten in production a few times by people switching between JSON parsers (different versions, JSON::Any finding different modules, people deciding to change the default JSON parser throughout the app) which handle character/byte parsing differently. This is a symptom of developers being allowed to get away with not handling their inputs properly by 'helpful' modules.
No. This is a symptom of not having proper software engineering procedure. There should be (regression) tests which cover this, so that if a programmer does any of the things you mentioned, then the tests break, and the new version can't be released until it's fixed. You of all people should know this — you have a blog at writemoretests.com.
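Such a test is only a few lines (a sketch with Test::More; the fixture data is made up):

    use strict;
    use warnings;
    use Test::More;
    use JSON qw(from_json to_json);

    # If anyone swaps in a JSON parser with different character/byte
    # semantics, this round trip breaks loudly at test time.
    my $original = [ "Eat More Turkey \x{2605}" ];
    is_deeply( from_json( to_json($original) ), $original,
        'non-ASCII characters survive a character-level round trip' );

    done_testing;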
The example you give of opening the file containing JSON with an encoding set - again, it doesn't cause an issue, but that doesn't mean it isn't wrong - JSON is bytes, not characters, by design.
This is your misunderstanding popping up again. JSON is Unicode by design. RFC4627 handily mandates UTF encoding for when the Unicode character stream needs to be in a byte stream, but in a system which handles (Unicode) character streams such as Perl, that encoding is irrelevant.
The JSON libraries weren't written by imbeciles. They're sufficiently high profile that if they were as wrong as you imply, someone would have done something about it by now. If you don't agree with them, by all means submit a patch. But make sure you have a thick skin because the maintainers will surely give you a funny look and tell you to go and read the docs.
I think that is about as helpful as I can be. I have now spent several hours of my life trying to help someone I barely know.
Let's go back through one of Pete's walkthroughs and see if we can debug the understanding:
Let's say you have a junior developer working on a web development project using Catalyst. Another developer has installed a plugin which automatically encodes and decodes UTF-8 on the incoming and outgoing, because it seemed like a good idea.
(I assume you mean "decodes and encodes UTF-8 on the incoming and outgoing" — if it were the way round you suggest we would not be getting anywhere.)
It seemed like a good idea because it is a good idea.
The junior developer writes a handler which accepts, acts on, and then replies with a JSON file. Sadly their JSON handler is a little naughty, and silently, incorrectly handles the incoming data.
Now, you've said here that the "JSON handler is a little naughty". According to what you've written elsewhere, this means that the JSON deserialiser interface used is the one which takes a character string as input, so JSON::from_json or JSON->new->decode (and so does not include the byte-to-character decoding step).
That being the case it will correctly deserialise the incoming data, recovering the original structured data.
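In code (a sketch; get_decoded_body is a hypothetical stand-in for the plugin having already done the byte-to-character decoding):

    use strict;
    use warnings;
    use JSON qw(from_json);

    # Hypothetical stand-in for the framework: the Unicode plugin has already
    # decoded the UTF-8 request bytes, so the handler receives characters.
    sub get_decoded_body { qq{["Eat More Turkey \x{2605}"]} }

    my $incoming = from_json( get_decoded_body() );
    # $incoming is [ "Eat More Turkey \x{2605}" ] - the original, intact.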
So, we've already identified a difference between the Pete world-view, and the view taken by the remainder of the experienced Perl developer community. Perhaps the problem with Pete's understanding lies here?
No problem is apparent to the developer at this point.
That's because in fact there is no problem!
They spit out JSON files to a different part of the system, again via the handler, and now a material problem /is/ created, because the UTF-8 magic double-encodes the outgoing JSON.
OK, well, no.
The developer, having previously used JSON::from_json or JSON->new->decode to deserialise the incoming data, will naturally use JSON::to_json or JSON->new->encode to serialise the outgoing data.
These functions will serialise the structured data to a Perl character string.
This Perl character string will circulate happily in the Perl environment until it reaches the border to the non-Perl world, at which point the border patrol will encode it using UTF-8.
There's no magic; rather it's good practice. Within the Perl environment, character data circulates as character strings. When that data gets to the border, it gets encoded using UTF-8.
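Sketched out (Encode stands in here for whatever the framework's border patrol actually uses):

    use strict;
    use warnings;
    use JSON qw(to_json);
    use Encode qw(encode);

    my $reply = { message => "Eat More Turkey \x{2605}" };

    # Serialise to a Perl character string; no byte encoding happens here.
    my $json_chars = to_json($reply);

    # One encode step, at the border, on the way out of the Perl world.
    my $json_bytes = encode('UTF-8', $json_chars);
    # $json_bytes carries a single layer of UTF-8 - no double encoding.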
Corrupt data gets put in to the receiver, and the person who has to clear up the mess dies of stress.
Again, no. The resultant byte stream will have a single layer of UTF-8 encoding, which is what we want.
It seems to me that the only person dying of stress is Pete, who seems to be stressing over a problem which does not exist.
So I am really struggling to see what the problem is.