Unicode rant

People these days say that you should always decode your UTF-8 strings, enable utf8 binmode on your filehandles, and so on.
This is not true.

I live in Russia, so half of the strings I deal with contain Cyrillic.
99% of the time they are not decoded.
And you know what, it's fine.

Here is the problem with decoded strings:

$ perl -MEncode -E 'my $a = "абв"; my $b = "где"; say $a . decode_utf8($b)'
Wide character in say at -e line 1.
Ð°Ð±Ð²где

If you concatenate two strings, either both of them must be decoded, or neither of them.
You can't mix the two styles.
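
For contrast, here's a sketch of the two consistent options: keep everything as bytes, or decode both sides and encode the result back to octets before printing. Both produce what you'd expect:

$ perl -E 'my $a = "абв"; my $b = "где"; say $a . $b'
абвгде

$ perl -MEncode -E 'my $a = "абв"; my $b = "где"; say encode_utf8(decode_utf8($a) . decode_utf8($b))'
абвгде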

So there is no way to migrate gradually to Unicode in an existing project if it's large enough.

But 99% of the time you don't need decoded strings at all.
And when you actually need them, simple wrappers are enough:

use Encode qw(decode_utf8 encode_utf8);

sub lc_utf8 {
    my $x = shift;
    $x = decode_utf8($x, 1);    # 1 == Encode::FB_CROAK: die on malformed UTF-8
    $x = lc($x);                # lowercase with Unicode semantics
    return encode_utf8($x);     # back to octets
}
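
A quick usage sketch, assuming the sub above is in scope: octets in, octets out, so the rest of the code never sees a decoded string.

say lc_utf8("АБВ");   # prints "абв": Unicode casing applied, result re-encoded to UTF-8 octets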

Well, of course that depends on the tasks your code solves. If you use regexes with /i or do complex linguistic stuff, you need decoded strings, 'use utf8', and all the other crazy stuff tchrist wrote about.
In all other cases... not really.

If you choose to use decoded strings everywhere, you always have to worry about all your inputs and outputs: set binmode for each filehandle, decode output from databases and other I/O modules, etc.
I'm not sure if it's worth it.
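
For the record, that setup looks roughly like this (a sketch; mysql_enable_utf8 is DBD::mysql's flag for returning decoded strings):

binmode STDOUT, ':encoding(UTF-8)';   # once per standard handle...
binmode STDERR, ':encoding(UTF-8)';
open my $fh, '<:encoding(UTF-8)', 'input.txt' or die $!;   # ...and on every open
# plus a switch per I/O module, e.g.:
#   my $dbh = DBI->connect($dsn, $user, $pass, { mysql_enable_utf8 => 1 });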

Anyway, the important thing I wanted to say is this: if you're a CPAN module author and your module does some I/O, for example it's an HTTP library, or maybe a JSON serializer/deserializer, don't make the choice for me. Let me use byte strings, at least as an option.
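
LWP is one example of a module that offers the choice; a sketch of both options:

use LWP::UserAgent;
my $res   = LWP::UserAgent->new->get('http://example.com/');
my $bytes = $res->content;           # raw octets, exactly as received
my $chars = $res->decoded_content;   # decoded per the Content-Type charset, if you want that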

6 Comments

Anyway, the important thing I wanted to say is this: if you're a CPAN module author and your module does some I/O, for example it's an HTTP library, or maybe a JSON serializer/deserializer, don't make the choice for me. Let me use byte strings, at least as an option.
I think it's a very important point; sometimes byte strings are necessary. I recently wrote a decoder for some old Japanese CD-ROMs with the text encoded in Shift-JIS in the midst of random binary junk, and it was really difficult to deal with the bytes in Perl because Perl kept promoting stuff into Unicode characters. In the end I just wrote it in C because it was so much hassle trying to stop Perl from auto-promoting the bytes.

But mostly I have switched over to using UTF-8 encoding everywhere, after an awful experience with a bug where the byte-encoded Japanese actually had an @ mark inside the Shift-JIS encoding of the "ideographic space" character (Unicode U+3000). It looks like a space character in the editor, but it actually has an @ inside it, maybe you can imagine. After that I usually try to go all-Unicode.
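
To spell the trap out: U+3000 is the byte pair 0x81 0x40 in Shift-JIS, and 0x40 by itself is @, so any byte-level match for @ fires inside the character. A minimal sketch:

$ perl -E 'my $sjis = "\x81\x40"; say "matched" if $sjis =~ /\@/'
matched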

It looks like the inverse of the 5.14 'unicode_strings' feature would be useful in these cases (i.e. getting Perl to never auto-promote strings). I wonder if one has been proposed?

(using 'no feature unicode_strings' wouldn't work because that would just revert to the current confusing behaviour)

That’s absolutely not what the unicode_strings feature does. It actually doesn’t change any of the decade-long-standing behaviour of Perl in this area. All it does is make perl consistent in its application of semantics to strings.
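
A sketch of what that consistency means: by default, uc on a plain latin-1 byte string uses ASCII rules, but with the feature enabled it uses Unicode rules, regardless of the string's internal representation (%vX prints each character's ordinal in hex):

$ perl -e 'printf "%vX\n", uc "\xe9"'
E9
$ perl -Mfeature=unicode_strings -e 'printf "%vX\n", uc "\xe9"'
C9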

If you concatenate two strings, either both of them must be decoded, or neither of them. You can’t mix two styles.

Well, yes. Obviously. One is binary data, the other is text. You can't mix a .txt file and a .jpg file and get sensible results either.

So there is no way to migrate gradually to unicode in existing project if it’s large enough.

That’s true. It’s very painful to fix a codebase that is pervasively broken. You pretty much have to do it all in one shot, and then a flag day for rolling it over. Quite probably you’ll need to clean up your database as well (it’s likely that garbage has crept into it if you have been cavalier with encodings so far).

But 99% of the time you don’t need decoded strings at all.

You need them as soon as you look inside the strings. If you use any multibyte encoding and you need length and substr to do something sensible, then you will need to decode the text. If you are only concatenating data together, then you can deal in bytes just as well as in characters.
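
A one-liner makes the difference concrete (no use utf8, so the literal is octets):

$ perl -MEncode -E 'my $s = "абв"; say length $s; say length decode_utf8($s)'
6
3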

I’m not sure if it’s worth it.

The alternative to worrying about each input and each output is worrying about each string operation. And if you decide not to worry about inputs and outputs now, you may realise it was a mistake later, when your codebase has already grown a fair amount – and you said yourself what it means to try to fix a large codebase… too painful.

So just do it right to start with.

As a minor argument, but one that can be very painful in some cases – if you merely use wrappers, then you will sometimes end up having to decode and re-encode the same string over and over and over: a huge waste of cycles. In terms of performance, it’s more expensive to decode everything at the inputs and outputs, but it’s a fairly constant cost, whereas it can be much more expensive to find out that the decision not to decode was a mistake.
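
To illustrate, with hypothetical wrappers in the style of the lc_utf8 above: chaining two of them decodes and re-encodes the same payload twice.

my $s = lc_utf8($input);    # decode, lc, encode
$s    = trim_utf8($s);      # decode the same bytes again, trim, encode again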


About Vyacheslav Matyukhin

I wrote Ubic. I worked at Yandex for many years, and now I'm building my own startup, questhub.io (formerly PlayPerl). I'm also working on Flux, a streaming data processing framework. CPAN ID: MMCLERIC.