August 2011 Archives

Quick note on using module JSON

By Flavio Poletti on August 28, 2011 5:36 PM under Perl

This Unicode stuff tried to drive me crazy. Hopefully I'll record something useful here, because the docs are a bit too intricate to understand.

The underlying assumption is that data is "moved" using utf8 encoding, i.e. files and/or stuff transmitted on the network contain data that is utf8 encoded. This boils down to the fact that every character outside ASCII will be represented by two or more bytes if you look at the raw byte stream.

There are two ways you can obtain such a stream in Perl, depending on what you want to play with at your interface: string of bytes or string of characters. It's easy to see whether a string is the first or the second, because it suffices to call utf8::is_utf8:

$is_string_of_characters = utf8::is_utf8($string);

The name is not a very happy choice, in my opinion, because what it really tells you is whether Perl has switched on its internal utf8 representation for that scalar. I would have preferred something called is_characters or the like.

There's a small corner case in which the test above is false but you are dealing with a string of characters: it's the plain ASCII case, in which the string of bytes and the string of characters are exactly the same. But you understand that this is not a problem.

Regardless of what you want to handle at the interface, we will always assume that you use strings of characters inside your program, i.e. if your string contains accented characters (for example) the test above will be true.
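To make the difference concrete, here is a minimal sketch of the test applied to the same data seen as bytes and as characters, plus the plain ASCII corner case. The variable names are mine and only there for illustration; the bytes are written with escapes so the encoding of the source file does not matter.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for an accented "e" (U+00E8): two bytes on the wire.
my $bytes = "\xC3\xA8";

# The same data as a string of characters: one character.
my $chars = decode('utf8', $bytes);

# Plain ASCII literal: the corner case where the flag stays off.
my $ascii = 'hello';

for my $pair ([bytes => $bytes], [chars => $chars], [ascii => $ascii]) {
    my ($label, $string) = @$pair;
    printf "%-5s length=%d is_utf8=%s\n",
        $label, length($string), utf8::is_utf8($string) ? 'yes' : 'no';
}
```

The bytes version reports length 2 with the flag off, the characters version reports length 1 with the flag on, and the ASCII string reports length 5 with the flag off.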

String of Bytes

If you have a string of bytes that represents valid utf8, your data is already in the right shape to be transmitted and/or saved without further transformations. In this case, to save the data you have to set the output filehandle to raw mode, so that no additional transformation is applied to it:

binmode $outfh, ':raw';
print {$outfh} $string_of_bytes;

The same applies for stuff that you want to read, of course:

binmode $infh, ':raw';
$string_of_bytes = do { local $/; <$infh> }; # poor man's slurp
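Putting the two fragments together, a small round trip might look like the sketch below. The temporary file and the sample data are assumptions of mine, just to have something to write to and read from.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# "caff" followed by the two UTF-8 bytes of an accented "e", then a newline.
my $string_of_bytes = "caff\xC3\xA8\n";

# Scratch file, removed automatically when the program exits.
my (undef, $path) = tempfile(UNLINK => 1);

# Write the bytes untouched.
open my $outfh, '>', $path or die "open($path): $!";
binmode $outfh, ':raw';
print {$outfh} $string_of_bytes;
close $outfh or die "close($path): $!";

# Read them back, again untouched.
open my $infh, '<', $path or die "open($path): $!";
binmode $infh, ':raw';
my $read_back = do { local $/; <$infh> };    # poor man's slurp
close $infh;

my $size_on_disk = -s $path;
print "size on disk: $size_on_disk bytes\n";
print "round trip ", ($read_back eq $string_of_bytes ? 'ok' : 'BROKEN'), "\n";
```

The file size equals the length of the string, which is the whole point of raw mode: what you print is what ends up on disk.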

String of Characters

If you have a string of characters, either you convert it to the string of bytes representation using Encode::encode and fall back to the previous case:

use Encode qw(encode decode);
my $string_of_bytes = encode('utf8', $string_of_characters);

or you tell your interface to do this transformation for you by setting up the proper encoding:

binmode $outfh, ':encoding(utf8)';
print {$outfh} $string_of_characters;

The same goes for your inputs:

binmode $infh, ':encoding(utf8)';
$string_of_characters = do { local $/; <$infh> };

and, of course, if you want your string of characters from a raw bytes representation you can use Encode::decode:

my $string_of_characters = decode('utf8', $string_of_bytes);

The funny thing is that Encode::decode lets you choose the encoding you start from, not only utf8, but it always assumes that its input is the raw byte representation. The same applies to Encode::encode, of course, which always produces raw bytes. Anyway, in my opinion new stuff should stick to utf8, so I won't dig into this further.
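As a sanity check, the sketch below shows that the two routes described in this section (encoding by hand and writing raw, or letting the IO layer do the job) put the very same bytes on disk, and that both reading routes give back the same characters. The helpers write_with_layer and read_with_layer are hypothetical names of mine, and so is the sample data.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Four characters: "1", "0", a space and U+20AC (three bytes in UTF-8).
my $string_of_characters = "10 \x{20AC}";

# Hypothetical helper: write a string through the given layer.
sub write_with_layer {
    my ($path, $layer, $string) = @_;
    open my $outfh, '>', $path or die "open($path): $!";
    binmode $outfh, $layer;
    print {$outfh} $string;
    close $outfh or die "close($path): $!";
    return;
}

# Hypothetical helper: slurp a file through the given layer.
sub read_with_layer {
    my ($path, $layer) = @_;
    open my $infh, '<', $path or die "open($path): $!";
    binmode $infh, $layer;
    my $string = do { local $/; <$infh> };
    close $infh;
    return $string;
}

my (undef, $path_a) = tempfile(UNLINK => 1);
my (undef, $path_b) = tempfile(UNLINK => 1);

# Route 1: encode by hand, then write raw bytes.
write_with_layer($path_a, ':raw', encode('utf8', $string_of_characters));

# Route 2: hand over characters and let the layer encode them.
write_with_layer($path_b, ':encoding(utf8)', $string_of_characters);

my $bytes_a = read_with_layer($path_a, ':raw');
my $bytes_b = read_with_layer($path_b, ':raw');
print "same bytes on disk\n" if $bytes_a eq $bytes_b;

# Both input routes give back the original characters.
my $chars_via_layer  = read_with_layer($path_b, ':encoding(utf8)');
my $chars_via_decode = decode('utf8', $bytes_a);
print "same characters back\n"
    if $chars_via_layer eq $string_of_characters
    && $chars_via_decode eq $string_of_characters;

# A different starting encoding: one byte per character in ISO-8859-1.
my $latin1_bytes = encode('iso-8859-1', "\x{E8}");
my $latin1_chars = decode('iso-8859-1', $latin1_bytes);
```

Which route to pick is mostly a matter of where you prefer the conversion to happen; the result on disk does not change.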

What About JSON?

Now that we have set the baseline:

all internal stuff will be using Perl's internal Unicode support, which means strings of characters, which means that scalars containing stuff outside ASCII will have their flag set;
communications with the external world will be done using the utf8 encoding;

we can finally move on to using the JSON module properly. We have to cope with four different cases, depending on the following factors:

format: is the JSON representation a string of bytes or a string of characters?
direction: are we converting from JSON to Perl or vice-versa?

We'll never say this too many times: in all cases, the string scalars in the Perl data structure will always be strings of characters (which utf8::is_utf8 may or may not report as such, depending on whether they contain data outside ASCII, as we already noted).

String of Bytes

If you're playing with raw bytes on the JSON side, decode_json and encode_json are the functions for you:

$data_structure = decode_json($json_as_string_of_bytes);
$json_as_string_of_bytes = encode_json($data_structure);

String of Characters

On the other hand, if your JSON scalar is (or has to be) a string of characters, you have to use from_json and to_json:

$data_structure = from_json($json_as_string_of_characters);
$json_as_string_of_characters = to_json($data_structure);
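Here is a minimal sketch that runs the four functions side by side on the same data, assuming the JSON module from CPAN is installed. The data structure is made up for the occasion; it holds a character outside ASCII so that the difference between the two formats actually shows.

```perl
use strict;
use warnings;
use JSON;                      # exports encode_json, decode_json, to_json, from_json
use Encode qw(encode);

# Perl side: always strings of characters (U+20AC is the euro sign).
my $data_structure = { price => "10 \x{20AC}" };

# JSON as a string of bytes (utf8 encoded, ready for a raw filehandle).
my $json_as_string_of_bytes = encode_json($data_structure);

# JSON as a string of characters (still needs encoding before it goes out).
my $json_as_string_of_characters = to_json($data_structure);

printf "bytes:      length=%d is_utf8=%s\n",
    length($json_as_string_of_bytes),
    utf8::is_utf8($json_as_string_of_bytes) ? 'yes' : 'no';
printf "characters: length=%d is_utf8=%s\n",
    length($json_as_string_of_characters),
    utf8::is_utf8($json_as_string_of_characters) ? 'yes' : 'no';

# Encoding the characters version by hand gives the bytes version.
my $encoded_by_hand = encode('utf8', $json_as_string_of_characters);

# Each decoder goes with its own format and yields the same characters.
my $from_bytes      = decode_json($json_as_string_of_bytes);
my $from_characters = from_json($json_as_string_of_characters);
```

The bytes version comes out two units longer than the characters version, because the euro sign takes three bytes in utf8 but counts as a single character.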

Summary

A code fragment is worth a billion words. For stuff that you read:

$json = read_in_some_consistent_way();
$data_structure = utf8::is_utf8($json)
   ? from_json($json)
   : decode_json($json);

For stuff that you have to write:

$json = want_characters()
   ? to_json($data_structure)
   : encode_json($data_structure);

As a final note, if you want to be precise in your new projects you should always stick to using the utf8 Perl IO layer, in order to properly enforce checks on your inputs and forget about encoding issues in your outputs. This means of course that you end up using from_json/to_json.
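The recommendation in the final note can be sketched as a complete round trip: characters everywhere inside the program, the utf8 IO layer at the border, to_json on the way out and from_json on the way in. The temporary file and the data are assumptions of mine, and the JSON module from CPAN has to be installed.

```perl
use strict;
use warnings;
use JSON;                      # CPAN module, provides to_json and from_json
use File::Temp qw(tempfile);

my $data_structure = { price => "10 \x{20AC}" };

# Scratch file, removed automatically when the program exits.
my (undef, $path) = tempfile(UNLINK => 1);

# Output: JSON as characters, the layer takes care of the encoding.
open my $outfh, '>', $path or die "open($path): $!";
binmode $outfh, ':encoding(utf8)';
print {$outfh} to_json($data_structure);
close $outfh or die "close($path): $!";

# Input: the layer decodes, from_json gets characters.
open my $infh, '<', $path or die "open($path): $!";
binmode $infh, ':encoding(utf8)';
my $json = do { local $/; <$infh> };
close $infh;
my $read_back = from_json($json);

# What landed on disk is what encode_json would have produced as bytes.
my $size_on_disk = -s $path;

print "round trip ",
    ($read_back->{price} eq $data_structure->{price} ? 'ok' : 'BROKEN'), "\n";
```

If you read the same file in raw mode instead, you get a string of bytes and decode_json becomes the right tool, as in the summary above.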