Quick note on using module JSON

This Unicode stuff tried to drive me crazy, hopefully I'll record something useful here because the docs are a bit too intricated to understand.

The underlying assumption is that data is "moved" using utf8 encoding, i.e. files and/or stuff transmitted on the network contain data that is utf8 encoded. This boils down to the fact that some characters will be represented by two or more bytes if you look at the raw byte stream.

There are two ways you can obtain such a stream in Perl, depending on what you want to play with at your interface: string of bytes or string of characters. It's easy to see whether a string is the first or the second, because it suffices to call utf8::is_utf8:

$is_string_of_characters = utf8::is_utf8($string);

The name is not very happy, in my opinion, because it tells you whether Perl has activated its internal utf8 representation. I would have preferred to have something called is_characters or whatever.

There's a small corner case in which the test above is false but you are dealing with a string of characters: it's the plain ASCII case, in which the string of bytes and the string of characters are exactly the same. But you understand that this is not a problem.

Independently of how you want to deal at the interface, anyway, we will always assume that you will be using strings of characters in your program, i.e. if your string contains accented characters (for example) the test above will be true.

String of Bytes

If you have a string of bytes that represents valid utf8, your data is already in the right shape to be transmitted and/or saved without doing further transformations. In this case, to save the file you have to set it in raw mode, so that you eliminate the possibility of doing additional transformations on it:

binmode $outfh, ':raw';
print {$outfh} $string_of_bytes;

The same applies for stuff that you want to read, of course:

binmode $infh, ':raw';
$string_of_bytes = do { local $/; <$infh> }; # poor man's slurp

String of Characters

If you have a string of characters, either you pass to the string of bytes representation using Encode::encode and revert to to the previous case:

my $string_of_bytes = encode('utf8', $string_of_characters);

or you tell your interface to do this transformation for you by setting up the proper encoding:

binmode $outfh, ':encoding(utf8)';
print {$outfh} $string_of_characters;

The same goes for your inputs:

binmode $infh, ':encoding(utf8)';
$string_of_characters = do { local $/; <$infh> };

and, of course, if you want your string of characters from a raw bytes representation you can use Encode::decode:

my $string_of_characters = decode('utf8', $string_of_bytes);

The funny thing is that this Encode::decode lets you decide from which encoding you start, not only utf8, but always starting from the assumption that you start from the raw representation. The same applies to Encode::encode, of course; anyway, in my opinion new stuff should stick to utf8 so I'll not dig this further.

What About JSON?

Now that we have set the baseline:

  • all internal stuff will be using Perl's internal Unicode support, which means strings of characters, which means that scalars containing stuff outside ASCII will have their flag set;
  • communications with the external world will be done using the utf8 encoding;

we can finally move on to using the JSON module properly. We have to cope with four different case, depending on the following factors:

  • format: the JSON representation is a string of bytes or a string of characters?
  • direction: are we converting from JSON to Perl or vice-versa?

We'll never say this too many times: in all cases, all the string scalars in the Perl data structure will always be strings of characters (which might appear from utf8::is_utf8 or not depending on whether they contain data outside ASCII or not as we already noted).

String of Bytes

If you're playing with raw bytes on the JSON side, decode_json and encode_json are the functions for you:

$data_structure = decode_json($json_as_string_of_bytes);
$json_as_string_of_bytes = encode_json($data_structure);

String of Characters

On the other hand, if your JSON scalar is (or has to be) a string of characters, you have to use from_json and to_json:

$data_structure = from_json($json_as_string_of_characters);
$json_as_string_of_characters = to_json($data_structure);

Summary

A code fragment is worth a billion words. For stuff that you read:

$json = read_in_some_consistent_way();
$data_structure = utf8::is_utf8($json)
   ? from_json($json)
   : decode_json($json);

For stuff that you have to write:

$json = want_characters()
   ? to_json($data_structure)
   : encode_json($data_structure);

As a final note, if you want to be precise in your new projects you should always stick to using the utf8 Perl IO layer, in order to properly enforce checks on your inputs and forget about encoding issues in your outputs. This means of course that you end up using from_json/to_json.

8 Comments

There is one more special case when utf8::is_utf8 says no, but you are dealing with characters not bytes - this is the dangerous case - and it is when you have characters that are internally encoded as Latin1. Because of this utf8::is_utf8 is rather useless. I think your is_character idea is something that should definitively get into the core - the lack of introspection here is painful.

Are you sure you're attacking this problem the right way? The moment you start worrying about how perl stores the strings internally you're going off the deep end :)

From: http://perldoc.perl.org/perlunifaq.html
"Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8 , _utf8_on or _utf8_off at all."

I still maintain that

$data_structure = utf8::is_utf8($json)
? from_json($json)
: decode_json($json);

can easily lead to crashes, even if $json is already decoded character data see my blog post: http://perlalchemy.blogspot.com/2011/08/isutf8-is-useless-can-we-have.html

Sorry for being rather persistent - but this is a common misconception - so I think it is important to clear it at out at the internets.


forgetting to use utf8::is_utf8(), because you know that it will tell you "yes" unless when the string is pure ASCII, in which case the "no" is equivalent to a "yes" (for the purpose of using it as some sort of "is_character", of course)

This is incorrect - there is the third case of character data, one that I wrote about in my first comment, where is_utf8 says "no", but it is not pure ASCII - this is when the internal representation is Latin1. Bear in mind that I am not talking about byte strings containing Latin1 encoded text - this is about character data internally represented as Latin1.

If you are looking at the UTF8 flag in application code then your code is broken. Period.

The flag does NOT mean that the string contains character data. It ONLY means that the string contains a character > 255 either now or could have, at some point in the past.

For strings that contain characters > 127 < 256, you have no idea whatsoever about whether the string contains characters or bytes regardless of what the UTF-8 flag says. (For strings with only characters < 128 there is no difference whether they are ASCII or binary, and strings with any characters > 255, must be either all characters – or corrupt (character and binary data mixed together).)

The following will produce a perfect copy of lolcat.jpg in lolol.jpg:

my $funneh = do { local ( @ARGV, $/ ) = 'lolcat.jpg'; <> };
utf8::upgrade( $funneh );
# now the utf8 flag on $funneh is set!
open my $copy, '>', 'lolol.jpg' or die $!;
print $copy $funneh;

It makes no difference at all that the data was internally UTF-8-encoded at one point! Because those UTF-8-encoded string elements still represented bytes. And the UTF8 flag would not – because it could not, because that’s not what it means – (not) tell you that this is binary data.

So don’t ask it that.

The flag tells perl how the data is stored, it does not tell Perl code what it means. Strings in Perl are, essentially, simply sequences of arbitrarily large integers that know nothing of characters or bytes; and perl has two different internal formats for storing such integer sequences – a compact random-access one that cannot store integers > 255, and a variable-width one that can store anything and just so happens to use the same encoding as UTF-8 because that is convenient. The UTF8 flag just says which of these two formats a string uses. (It has been pointed out many times over the years that the flag should have been called UOK instead, in keeping with the other internal flags.)

If you check the perldelta you will in fact find that the major innovation that started in Perl 5.12 and largely completed in 5.14 is to (finally) stop the regex engine from making this exact mistake, namely, it no longer derives semantics about a string from its internal representation.

Whether any particular packed integer sequence represents bytes or characters, in Perl, is for the programmer to know. The strings themselves carry no knowledge of what they are. If you are designing an API in which bytes vs characters is a concern, then in each case it accepts a string you must decide whether you prefer bytes or characters there, then document that this is what you expect, and then write the code so it always treats that string the same way – either always as characters or always as bytes. No trying to magically do the right thing: you can’t.

Leave a comment

About Flavio Poletti

user-pic