The UTF8 flag and you
If you are looking at the UTF8 flag in application code then your code is broken. Period.
The flag does NOT mean that the string contains character data. It ONLY means that the string either contains a character > 255 now, or could have contained one at some point in the past.
For strings that contain characters > 127 and < 256, you have no idea whatsoever whether the string contains characters or bytes, regardless of what the UTF8 flag says. (For strings with only characters < 128 there is no difference between ASCII and binary, and strings with any characters > 255 must be either all characters – or corrupt (character and binary data mixed together).)
The following will produce a perfect copy of lolcat.jpg in lolol.jpg:
my $funneh = do { local ( @ARGV, $/ ) = 'lolcat.jpg'; <> };
utf8::upgrade( $funneh );
# now the utf8 flag on $funneh is set!
open my $copy, '>', 'lolol.jpg' or die $!;
print $copy $funneh;
It makes no difference at all that the data was internally UTF-8-encoded at one point! Those UTF-8-encoded string elements still represented bytes. And the UTF8 flag would not – because it could not, because that’s not what it means – tell you that this is binary data.
So don’t ask it that.
The flag tells perl how the data is stored, it does not tell Perl code what it means. Strings in Perl are, essentially, simply sequences of arbitrarily large integers that know nothing of characters or bytes; and perl has two different internal formats for storing such integer sequences – a compact random-access one that cannot store integers > 255, and a variable-width one that can store anything and just so happens to use the same encoding as UTF-8 because that is convenient. The UTF8 flag just says which of these two formats a string uses. (It has been pointed out many times over the years that the flag should have been called UOK instead, in keeping with the other internal flags.)
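To see the two formats side by side, you can peek at the internals (a minimal sketch; Devel::Peek ships with perl, and Dump prints to STDERR):

use Devel::Peek;

my $str = "caf\x{E9}";    # four elements; the last has the value 0xE9
Dump( $str );             # FLAGS lack UTF8; the PV holds the raw byte 0xE9

utf8::upgrade( $str );    # switch to the variable-width internal format
Dump( $str );             # FLAGS now include UTF8; the PV holds 0xC3 0xA9

# To Perl code the string is the same either way:
print length( $str ), "\n";              # 4 in both formats
print ord substr( $str, 3, 1 ), "\n";    # 233 (0xE9) in both formats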
If you check the perldelta you will in fact find that the major innovation that started in Perl 5.12 and was largely completed in 5.14 is to (finally) stop the regex engine from making this exact mistake: it no longer derives semantics about a string from its internal representation.
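The mistake in question is easy to demonstrate (a sketch; the exact behaviour depends on your perl version and feature bundle):

my $str = "\xE9";            # é, stored in the compact format
my $before = $str =~ /\w/;   # false under the old byte semantics
utf8::upgrade( $str );       # same string, variable-width format
my $after  = $str =~ /\w/;   # true – semantics leaked from the storage format

# Under "use feature 'unicode_strings';" (implied by "use v5.12;" and
# later) both matches give the same answer, whatever the representation.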
Whether any particular packed integer sequence represents bytes or characters, in Perl, is for the programmer to know. The strings themselves carry no knowledge of what they are. If you are designing an API in which bytes vs characters is a concern, then in each case it accepts a string you must decide whether you prefer bytes or characters there, then document that this is what you expect, and then write the code so it always treats that string the same way – either always as characters or always as bytes. No trying to magically do the right thing: you can’t.
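For instance, a function documented to take bytes might enforce that decision at its boundary (a hedged sketch; write_blob is a hypothetical name, not an API from anywhere):

use Carp 'croak';

sub write_blob {
    my ( $fh, $bytes ) = @_;
    # Normalise to the compact format up front. This can only fail if
    # some element is > 255 – i.e. the caller passed something that
    # cannot possibly be a byte string.
    utf8::downgrade( $bytes, 1 )
        or croak 'write_blob() takes a byte string, not characters > 255';
    binmode $fh;
    print { $fh } $bytes;
}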
The following will produce a perfect copy of lolcat.jpg in lolol.jpg:
Except on Windows. Because you didn't binmode the file handles.

A good explanation of the current status – but I wish you formulated some positive plan for all these negatives. This is something many people trip over.
Still, upgrading strings will make a difference when they are used as file names.
As luck would have it I just wrote the following code today. Is this a legitimate use of is_utf8, or is even this a toss up?
bart: argh. Add a use open ':raw'; I guess. I’m not sure that will cover ARGV, though. It might need explicitly opening the file, instead of using the local @ARGV slurp idiom… which I used because I didn’t want to distract with irrelevant details in the first place. (A byte-clean version is sketched at the end of this comment.)

zby: first of all perl itself needs to become completely consistent. Then all emphasis on the UTF8 flag needs to vanish from the docs – it needs to be hidden in a corner where people will only find it if they need it. Then, once the attractive nuisances are gone, we should see what people really need, before we decide how to help them. Something like BLOB could be a stopgap, although I think the polarity is wrong and it should be called TEXT instead (since a character string is always Unicode, whereas there are a myriad things the bytes in a byte string may represent).
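The explicit-open version might look something like this (a sketch; the binmode calls are the operative part):

my $funneh = do {
    open my $in, '<', 'lolcat.jpg' or die $!;
    binmode $in;    # byte-clean on Windows too
    local $/;       # slurp the whole file
    <$in>;
};
open my $copy, '>', 'lolol.jpg' or die $!;
binmode $copy;
print $copy $funneh;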
Christoph: that’s a bug – perl itself is not free of them yet.
ribasushi: that’s wrong. It’s fine if you extend the range to \xFF: a string with characters > 255 can’t possibly be a valid byte string. But then you can toss the is_utf8 check, because that’s always going to be true if the match succeeds. At most you could carp if you ever find the UTF8 flag turned on, but even that is… debatable. I can think of a number of ways in which code in the caller could “pollute” byte strings with the flag which aren’t caused by bugs. (Just declaring your source to be in UTF-8, f.ex., will probably do it in a number of cases – possibly differing in future perls.)

Instead of checking, I’d just downgrade the string (non-fatally if you want to croak with your own error message). That’ll be a no-op for packed strings (faster) and uses more purpose-specific code to scan the string than the regex engine (faster). The conversion cost is amortized because you’d have to do it when you hit I/O anyway. There may be some reallocation cost (I’m not even sure of that), but across the whole data flow I’m almost certain it’d be a win. Benchmark it.
If you stick with the match, you could do the UTF8 flag check first, purely as an optimisation, to skip the match on packed strings, since it can’t succeed on them anyway.
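Spelled out, the two options look something like this (a sketch; the sub names are illustrative, not from the code under discussion):

use Carp 'croak';

# (a) Downgrade instead of matching. A no-op for packed strings; fails
#     (non-fatally, thanks to the second argument) only if some element
#     is > 255. Note it modifies the caller's string in place via @_.
sub assert_bytes_by_downgrade {
    utf8::downgrade( $_[0], 1 )
        or croak 'need a byte string, got characters > 255';
}

# (b) Keep the match, but check the flag first as an optimisation: a
#     packed string cannot contain elements > 255, so skip the scan.
sub assert_bytes_by_match {
    croak 'need a byte string, got characters > 255'
        if utf8::is_utf8( $_[0] ) and $_[0] =~ /[^\x00-\xFF]/;
}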
Thanks for the input! My problem is that I do not want to do anything behind the scenes, as it may make a crucial difference. I want to limit my interface to only bulletproof inputs, and have the user accommodate me being paranoid. Based on your feedback I ended up with the following, please comment on sanity therein :)
My corresponding tests are:
That’s not wrong. I don’t see the point of making the caller downgrade the string though. If it doesn’t have elements > 255, it doesn’t matter which internal format it is stored in.
You could also avoid some unnecessary work there, per above.
Note the order of the tests: if the UTF8 flag is off, you can never find characters > 255, in which case there’s no need to scan the string for them. And in the other case you guarded the match with a check for the flag anyway.
Noted, indeed your test is much saner.
As far as why force the downgrade: as seen in the code above, this is a poke() routine – i.e. write random bytes to a random part of memory. I feel like "random bytes" should be something very deterministic, and I want the user to stop and think when supplying e.g. an is_utf8 "\x{AE}":
Was this meant to be ®?
If so was it meant to be the utf8 representation \xC2\xAE of it, or the latin1 \xAE?
Or was it simply meant to be the byte 0xAE with no regard to character representation?
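The three readings differ materially, which is easy to check (a sketch using the core Encode module):

use Encode 'encode';

my $str = "\x{AE}";   # one element with the value 0xAE – but meaning what?

my $as_utf8   = encode( 'UTF-8',      $str );   # "\xC2\xAE" – two bytes
my $as_latin1 = encode( 'ISO-8859-1', $str );   # "\xAE"     – one byte
my $as_is     = $str;                           # "just the byte 0xAE"

# The UTF8 flag, set or not, cannot tell you which of the three the
# caller meant – only the caller knows.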
Was the utf8 flag a mistake?
No, the flag itself is fine. Its name was a minor mistake and its prominence a big one, along with many bugs in its implementation (such as the bug with open mentioned above, or the regex engine deriving semantics from it). But the concept as such, if implemented and documented consistently, is entirely sane.