The UTF8 flag and you

If you are looking at the UTF8 flag in application code then your code is broken. Period.

The flag does NOT mean that the string contains character data. It ONLY means that the string either contains a character > 255 now, or could have at some point in the past.

For strings that contain only characters in the range 128–255, you have no idea whatsoever whether the string contains characters or bytes, regardless of what the UTF8 flag says. (For strings with only characters < 128 there is no difference between ASCII and binary, and strings with any characters > 255 must be either all characters – or corrupt, with character and binary data mixed together.)
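A minimal sketch of that ambiguity: two strings with the identical codepoint sequence, one in each internal format. The flag differs, yet Perl considers them the same string – so the flag cannot be telling you anything about bytes versus characters.

```perl
use strict;
use warnings;

# Two strings with identical contents: the single codepoint 0xAE in both.
my $bytes = "\xAE";     # stored in the compact format, UTF8 flag off
my $chars = "\xAE";
utf8::upgrade($chars);  # stored in the variable-width format, UTF8 flag on

print utf8::is_utf8($bytes) ? 1 : 0, "\n";              # 0
print utf8::is_utf8($chars) ? 1 : 0, "\n";              # 1
print $bytes eq $chars ? "same" : "different", "\n";    # same
```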

The following will produce a perfect copy of lolcat.jpg in lolol.jpg:

my $funneh = do { local ( @ARGV, $/ ) = 'lolcat.jpg'; <> };
utf8::upgrade( $funneh );
# now the utf8 flag on $funneh is set!
open my $copy, '>', 'lolol.jpg' or die $!;
print $copy $funneh;

It makes no difference at all that the data was internally UTF-8-encoded at one point! Those UTF-8-encoded string elements still represented bytes. And the UTF8 flag would not – because it could not, because that’s not what it means – tell you that this is binary data.

So don’t ask it that.

The flag tells perl how the data is stored, it does not tell Perl code what it means. Strings in Perl are, essentially, simply sequences of arbitrarily large integers that know nothing of characters or bytes; and perl has two different internal formats for storing such integer sequences – a compact random-access one that cannot store integers > 255, and a variable-width one that can store anything and just so happens to use the same encoding as UTF-8 because that is convenient. The UTF8 flag just says which of these two formats a string uses. (It has been pointed out many times over the years that the flag should have been called UOK instead, in keeping with the other internal flags.)
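You can watch the two storage formats with the core Devel::Peek module – a sketch, assuming a reasonably modern perl. The internal bytes change across the upgrade, but nothing observable to Perl code does:

```perl
use strict;
use warnings;
use Devel::Peek;    # core module; Dump() prints the internal representation to STDERR

my $str = "caf\xE9";    # 4 elements; U+00E9 stored in the compact one-byte format
Dump($str);             # PV shows the byte 0xE9, FLAGS lack UTF8

utf8::upgrade($str);    # switch to the variable-width internal format
Dump($str);             # PV now shows 0xC3 0xA9, FLAGS include UTF8

# To Perl code, the string is the same sequence of integers either way:
print length($str), "\n";             # 4
printf "%X\n", ord substr $str, 3;    # E9
```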

If you check the perldelta you will in fact find that the major innovation that started in Perl 5.12 and largely completed in 5.14 is to (finally) stop the regex engine from making this exact mistake, namely, it no longer derives semantics about a string from its internal representation.
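The mistake in question is the so-called Unicode bug, and you can still reproduce it where the 5.12 fix is not enabled. A sketch: without the `unicode_strings` feature, whether `\w` matches U+00E9 depends on the internal representation; with it, the representation no longer matters.

```perl
use strict;
use warnings;

my $compact  = "\xE9";    # stored in the compact format
my $upgraded = "\xE9";    # identical content ...
utf8::upgrade($upgraded); # ... stored in the variable-width format

# Without unicode_strings (the default outside a feature bundle),
# the regex engine derives semantics from the internal representation:
print $compact  =~ /\w/ ? 1 : 0, "\n";    # 0 - byte semantics
print $upgraded =~ /\w/ ? 1 : 0, "\n";    # 1 - character semantics

{
    use feature 'unicode_strings';    # available since 5.12
    print $compact  =~ /\w/ ? 1 : 0, "\n";    # 1 - representation is irrelevant
    print $upgraded =~ /\w/ ? 1 : 0, "\n";    # 1
}
```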

Whether any particular packed integer sequence represents bytes or characters, in Perl, is for the programmer to know. The strings themselves carry no knowledge of what they are. If you are designing an API in which bytes vs characters is a concern, then in each case it accepts a string you must decide whether you prefer bytes or characters there, then document that this is what you expect, and then write the code so it always treats that string the same way – either always as characters or always as bytes. No trying to magically do the right thing: you can’t.
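A hypothetical sketch of such an API – `write_blob` and its error message are invented for illustration. It documents that it takes bytes, rejects anything that cannot be bytes instead of guessing, and leaves the choice of encoding to the caller:

```perl
use strict;
use warnings;
use Carp   qw( croak );
use Encode qw( encode );

# This function documents that it accepts *bytes*, and always treats
# its argument as bytes - no attempt to magically do the right thing.
sub write_blob {
    my ( $fh, $bytes ) = @_;
    croak "write_blob() takes a byte string; encode your characters first"
        if $bytes =~ /[^\x00-\xFF]/;
    print { $fh } $bytes;
}

# The caller decides on an encoding explicitly:
open my $fh, '>', 'out.bin' or die $!;
binmode $fh;
write_blob( $fh, encode( 'UTF-8', "caf\x{E9}" ) );
close $fh or die $!;
```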

10 Comments

The following will produce a perfect copy of lolcat.jpg in lolol.jpg:

Except on Windows. Because you didn't binmode the file handles.

A good explanation of the current status - but I wish you formulated some positive plan for all these negatives. This is something many people trip over.

Still, upgrading strings will make a difference when they are used as file names.

As luck would have it I just wrote the following code today. Is this a legitimate use of is_utf8, or is even this a toss up?

  croak "Expecting a byte string, but looks like we got characters"
    if ($bytes =~ /[^\x00-\x7F]/ and utf8::is_utf8($bytes) );

Thanks for the input! My problem is that I do not want to do anything behind the scenes, as it may make a crucial difference. I want to limit my interface to only bulletproof inputs, and have the user accommodate me being paranoid. Based on your feedback I ended up with the following, please comment on sanity therein :)

  croak "Expecting a byte string, but received characters"
    if $bytes =~ /[^\x00-\xFF]/;
  croak "Expecting a byte string, but received what looks like *possible* characters, please utf8_downgrade the input"
    if ($bytes =~ /[\x80-\xFF]/ and utf8::is_utf8($bytes) );

My corresponding tests are:

throws_ok { poke(123, "abc\x{14F}") } qr/Expecting a byte string, but received characters/;

my $itsatrap = "\x{AE}\x{14F}";
throws_ok { poke(123, substr($itsatrap, 0, 1)) }
  qr/\QExpecting a byte string, but received what looks like *possible* characters, please utf8_downgrade the input/;

Noted, indeed your test is much saner.

As far as why force the downgrade: As seen in the code above this is a poke() routine - i.e. write random bytes to a random part of memory. I feel like "random bytes" should be something very deterministic, and want the user to stop and think when supplying e.g. an is_utf8 "\x{AE}":

Was this meant to be ®?
If so was it meant to be the utf8 representation \xC2\xAE of it, or the latin1 \xAE?
Or was it simply meant to be the byte 0xAE with no regard to character representation?
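Those three readings can be spelled out with the Encode module – a sketch of what each interpretation would put on the wire:

```perl
use strict;
use warnings;
use Encode qw( encode );

# The three possible readings of "\x{AE}":
my $as_utf8   = encode( 'UTF-8',  "\x{AE}" );   # "\xC2\xAE" - UTF-8 for (R)
my $as_latin1 = encode( 'latin1', "\x{AE}" );   # "\xAE"     - Latin-1 for (R)
my $raw_byte  = "\xAE";                         # just the byte 0xAE, no character meaning

printf "%vX\n", $as_utf8;      # C2.AE
printf "%vX\n", $as_latin1;    # AE
printf "%vX\n", $raw_byte;     # AE
```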

Was the utf8 flag a mistake?


About Aristotle

Waxing philosophical