Quick note on using module JSON
This Unicode stuff nearly drove me crazy; hopefully I'll record something useful here, because the docs are a bit too intricate to understand.
The underlying assumption is that data is "moved" using utf8 encoding, i.e. files and/or stuff transmitted on the network contain data that is utf8 encoded. This boils down to the fact that some characters will be represented by two or more bytes if you look at the raw byte stream.
There are two ways you can obtain such a stream in Perl, depending on what you want to play with at your interface: a string of bytes or a string of characters. It's easy to see whether a string is the first or the second, because it suffices to call utf8::is_utf8:
$is_string_of_characters = utf8::is_utf8($string);
The name is not very happy, in my opinion, because it tells you whether Perl has activated its internal utf8 representation. I would have preferred to have something called is_characters or whatever.
There's a small corner case in which the test above is false but you are dealing with a string of characters: it's the plain ASCII case, in which the string of bytes and the string of characters are exactly the same. But you understand that this is not a problem.
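A minimal demonstration of this corner case (assuming the source file is saved as utf8, so that use utf8 turns the literals into strings of characters):

```perl
use strict;
use warnings;
use utf8;   # literals in this source are strings of characters

my $plain    = 'hello';   # pure ASCII: bytes and characters coincide
my $accented = 'caffè';   # contains a character outside ASCII

print utf8::is_utf8($plain)    ? "yes\n" : "no\n";   # no
print utf8::is_utf8($accented) ? "yes\n" : "no\n";   # yes
```

The "no" for the pure-ASCII string is harmless, because the byte and character representations are identical anyway.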
Independently of how you want to deal at the interface, anyway, we will always assume that you will be using strings of characters in your program, i.e. if your string contains accented characters (for example) the test above will be true.
String of Bytes
If you have a string of bytes that represents valid utf8, your data is already in the right shape to be transmitted and/or saved without doing further transformations. In this case, to save the file you have to set it in raw mode, so that you eliminate the possibility of doing additional transformations on it:
binmode $outfh, ':raw';
print {$outfh} $string_of_bytes;
The same applies for stuff that you want to read, of course:
binmode $infh, ':raw';
$string_of_bytes = do { local $/; <$infh> }; # poor man's slurp
String of Characters
If you have a string of characters, either you convert to the string-of-bytes representation using Encode::encode and revert to the previous case:
my $string_of_bytes = encode('utf8', $string_of_characters);
or you tell your interface to do this transformation for you by setting up the proper encoding:
binmode $outfh, ':encoding(utf8)';
print {$outfh} $string_of_characters;
The same goes for your inputs:
binmode $infh, ':encoding(utf8)';
$string_of_characters = do { local $/; <$infh> };
and, of course, if you want your string of characters from a raw bytes representation you can use Encode::decode:
my $string_of_characters = decode('utf8', $string_of_bytes);
The funny thing is that Encode::decode lets you decide which encoding you start from, not only utf8, though always under the assumption that you start from the raw representation. The same applies to Encode::encode, of course; anyway, in my opinion new stuff should stick to utf8, so I'll not dig into this further.
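A small sketch of this flexibility, with iso-8859-1 picked purely as an example of a non-utf8 starting encoding:

```perl
use strict;
use warnings;
use Encode qw< encode decode >;

# the same text, through two different encodings
my $latin1_octets = "caff\xE8";                       # 'caffè' as Latin-1 bytes
my $characters    = decode('iso-8859-1', $latin1_octets);
my $utf8_octets   = encode('utf8', $characters);

printf "%d octets in Latin-1, %d octets in utf8\n",
    length($latin1_octets), length($utf8_octets);     # 5 vs 6
```

The accented character costs one octet in Latin-1 but two in utf8, which is why the lengths differ while the string of characters stays five characters long.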
What About JSON?
Now that we have set the baseline:
- all internal stuff will be using Perl's internal Unicode support, which means strings of characters, which means that scalars containing stuff outside ASCII will have their flag set;
- communications with the external world will be done using the utf8 encoding;
we can finally move on to using the JSON module properly. We have to cope with four different cases, depending on the following factors:
- format: the JSON representation is a string of bytes or a string of characters?
- direction: are we converting from JSON to Perl or vice-versa?
We can never say this too many times: in all cases, all the string scalars in the Perl data structure will always be strings of characters (which may or may not show up in utf8::is_utf8, depending on whether they contain data outside ASCII, as we already noted).
String of Bytes
If you're playing with raw bytes on the JSON side, decode_json and encode_json are the functions for you:
$data_structure = decode_json($json_as_string_of_bytes);
$json_as_string_of_bytes = encode_json($data_structure);
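A quick round trip to see both directions at work. JSON::PP (in core since Perl 5.14) is used here only so the sketch is self-contained; it provides the same encode_json/decode_json interface as the JSON module:

```perl
use strict;
use warnings;
use utf8;
use JSON::PP qw< encode_json decode_json >;

my $data  = { greeting => 'caffè' };   # characters inside the program
my $bytes = encode_json($data);        # utf8-encoded string of bytes

print utf8::is_utf8($bytes) ? "characters\n" : "bytes\n";            # bytes

my $back = decode_json($bytes);        # back to a data structure
print utf8::is_utf8($back->{greeting}) ? "characters\n" : "bytes\n"; # characters
```

Note how the JSON text on the outside is a string of bytes, while the strings inside the Perl data structure are strings of characters, exactly as promised above.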
String of Characters
On the other hand, if your JSON scalar is (or has to be) a string of characters, you have to use from_json and to_json:
$data_structure = from_json($json_as_string_of_characters);
$json_as_string_of_characters = to_json($data_structure);
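The character-side counterpart, again using the JSON::PP core module for self-containedness (it exports the same from_json/to_json pair). The smiley "\x{263A}" is chosen because it cannot fit in a single byte, so the flag check is unambiguous:

```perl
use strict;
use warnings;
use JSON::PP qw< from_json to_json >;

my $json_chars = to_json({ smiley => "\x{263A}" });   # string of characters
print utf8::is_utf8($json_chars) ? "characters\n" : "bytes\n";   # characters

my $data = from_json($json_chars);
print $data->{smiley} eq "\x{263A}" ? "round trip ok\n" : "oops\n";
```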
Summary
A code fragment is worth a billion words. For stuff that you read:
$json = read_in_some_consistent_way();
$data_structure = utf8::is_utf8($json)
? from_json($json)
: decode_json($json);
For stuff that you have to write:
$json = want_characters()
? to_json($data_structure)
: encode_json($data_structure);
As a final note, if you want to be precise in your new projects you should always stick to using the utf8 Perl IO layer, in order to properly enforce checks on your inputs and forget about encoding issues in your outputs. This means, of course, that you end up using from_json/to_json.
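Putting the pieces together, a sketch of that recommended setup (the file name is hypothetical, and JSON::PP stands in for the JSON module so the example runs on a stock Perl): the :encoding(utf8) layer does all the byte work at the border, and from_json/to_json only ever see strings of characters.

```perl
use strict;
use warnings;
use JSON::PP qw< from_json to_json >;

my $file = 'data.json';   # hypothetical file name

# output: let the IO layer encode the characters for us
open my $outfh, '>:encoding(utf8)', $file or die "open: $!";
print {$outfh} to_json({ greeting => "caff\x{E8}" });
close $outfh;

# input: let the IO layer decode (and validate) the utf8 for us
open my $infh, '<:encoding(utf8)', $file or die "open: $!";
my $data = from_json(do { local $/; <$infh> });
close $infh;

print $data->{greeting} eq "caff\x{E8}" ? "ok\n" : "not ok\n";
```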
There is one more special case when utf8::is_utf8 says no, but you are dealing with characters, not bytes (this is the dangerous case), and it is when you have characters that are internally encoded as Latin1. Because of this, utf8::is_utf8 is rather useless. I think your is_character idea is something that should definitely get into the core; the lack of introspection here is painful.
@zby: the underlying assumption in the whole post is that you're dealing with utf8, either explicitly or implicitly, and not another encoding.
As far as I understand, Perl scalars have a flag that tells whether the internal sequence of bytes (because there is always a sequence of bytes!) has to be interpreted as plain octets or as a Unicode string encoded in utf8.
The utf8::is_utf8() function is somewhat correct in its name, because it tells whether the flag for utf8 encoding is set or not, which means whether Perl thinks that the scalar contains a Unicode string encoded in utf8 or not. It is unfortunate because the fact that the internal representation is utf8 has nothing to do with its external usage: I want to know if Perl thinks that the scalar is a Unicode string made of characters, not how it knows this. This is why I'd like it to be called something like "is_character".
Your problem, on the other hand, is more or less unsolvable in my opinion. You have to know where your data come from and how they are supposed to be encoded, and use the relevant function from Encode in order to "import" a string encoded with some different scheme. The same sequence of octets might be a valid string of characters in different encodings, so I see it difficult to support this kind of automagic behaviour.
In other terms, if you read raw octets you MUST know how they are encoded in order to have a Perl scalar string of characters using Encode::decode(). After you have it, anyway, the string itself is internally stored as utf8 INDEPENDENTLY from the initial encoding of the raw octets. For this reason, calling utf8::is_utf8() actually works and says that this is a string of characters if there were characters out of the ASCII scope.
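A sketch of that "import" step, with iso-8859-1 as the example of an initial encoding that the programmer has to know:

```perl
use strict;
use warnings;
use Encode qw< decode >;

my $octets = "caff\xE8";                      # 'caffè' as iso-8859-1 octets
# the programmer knows (MUST know) the encoding of the incoming octets
my $chars  = decode('iso-8859-1', $octets);

# internal storage is now utf8, independently of the original encoding
print utf8::is_utf8($chars) ? "string of characters\n" : "still octets\n";
```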
Are you sure you're attacking this problem the right way? The moment you start worrying about how perl stores the strings internally you're going off the deep end :)
From: http://perldoc.perl.org/perlunifaq.html
"Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8 , _utf8_on or _utf8_off at all."
I still maintain that
$data_structure = utf8::is_utf8($json)
? from_json($json)
: decode_json($json);
can easily lead to crashes, even if $json is already decoded character data; see my blog post: http://perlalchemy.blogspot.com/2011/08/isutf8-is-useless-can-we-have.html
@gizzlon: the summary was - well - a summary to provide a flash of what had been discussed above.
As I was suggesting, in new stuff one should use character-based strings throughout the code and adopt the utf8 encoding at the interfaces by design. Which basically boils down to forgetting about utf8::is_utf8(), because you know that it will tell you "yes" except when the string is pure ASCII, in which case the "no" is equivalent to a "yes" (for the purpose of using it as some sort of "is_character", of course).
On the other hand, it's not easy to guess how the JSON module works and what to_json/encode_json actually do. As I said, the documentation did not help me to make it 100% clear... so I was actually forced to use utf8::is_utf8() in order to see what was going on!
Sorry for being rather persistent, but this is a common misconception, so I think it is important to clear it up on the internets.
This is incorrect - there is the third case of character data, one that I wrote about in my first comment, where is_utf8 says "no", but it is not pure ASCII - this is when the internal representation is Latin1. Bear in mind that I am not talking about byte strings containing Latin1 encoded text - this is about character data internally represented as Latin1.
If you are looking at the UTF8 flag in application code then your code is broken. Period.
The flag does NOT mean that the string contains character data. It ONLY means that the string contains a character > 255 either now or could have, at some point in the past.
For strings that contain characters > 127 and < 256, you have no idea whatsoever whether the string contains characters or bytes, regardless of what the UTF-8 flag says. (For strings with only characters < 128 there is no difference whether they are ASCII or binary, and strings with any characters > 255 must be either all characters, or corrupt: character and binary data mixed together.)
The following will produce a perfect copy of lolcat.jpg in lolol.jpg:
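The code itself did not survive in this copy of the comment; a plausible sketch of the trick being described, with a few stand-in octets written out first so the example is self-contained (the original presumably used a real JPEG), is:

```perl
use strict;
use warnings;

# stand-in bytes for a real image, including octets above 0x7F
open my $fh, '>:raw', 'lolcat.jpg' or die "open: $!";
print {$fh} "\xFF\xD8\xFF\xE0 pretend image data \xC3\xA8\x00\xFF";
close $fh;

# read it through the latin1 layer: the scalar is now internally
# utf8-encoded and utf8::is_utf8() says yes, yet it is binary data
open my $in, '<:encoding(latin1)', 'lolcat.jpg' or die "open: $!";
my $data = do { local $/; <$in> };
close $in;

# write it back through the same layer: a byte-perfect copy
open my $out, '>:encoding(latin1)', 'lolol.jpg' or die "open: $!";
print {$out} $data;
close $out;
```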
It makes no difference at all that the data was internally UTF-8-encoded at one point! Because those UTF-8-encoded string elements still represented bytes. And the UTF8 flag would not tell you that this is binary data, because it could not: that's not what it means.
So don’t ask it that.
The flag tells perl how the data is stored, it does not tell Perl code what it means. Strings in Perl are, essentially, simply sequences of arbitrarily large integers that know nothing of characters or bytes; and perl has two different internal formats for storing such integer sequences – a compact random-access one that cannot store integers > 255, and a variable-width one that can store anything and just so happens to use the same encoding as UTF-8 because that is convenient. The UTF8 flag just says which of these two formats a string uses. (It has been pointed out many times over the years that the flag should have been called UOK instead, in keeping with the other internal flags.)
If you check the perldelta you will in fact find that the major innovation that started in Perl 5.12 and largely completed in 5.14 is to (finally) stop the regex engine from making this exact mistake, namely, it no longer derives semantics about a string from its internal representation.
Whether any particular packed integer sequence represents bytes or characters, in Perl, is for the programmer to know. The strings themselves carry no knowledge of what they are. If you are designing an API in which bytes vs characters is a concern, then in each case it accepts a string you must decide whether you prefer bytes or characters there, then document that this is what you expect, and then write the code so it always treats that string the same way – either always as characters or always as bytes. No trying to magically do the right thing: you can’t.
I agree 100% with Aristotle's explanation and point: the programmer MUST know what comes in and what comes out, and this also means that there is never the need to ask Perl how data are stored internally.
Apart when you want to investigate a bit on how another programmer approached the problem :-)
My use of utf8::is_utf8() in the first place was to understand how the JSON module works. I found that there were two different functions to convert "some JSON" into "a Perl data structure" so I wanted to figure out what was going on and how I was supposed to use them.
Looking at the aftermath, the "Summary" section was a mistake, because it created much more trouble than it was worth (it was meant as just a quick recap of the two pairs of functions).