November 2011 Archives

Unicode rant

People these days say that you should always decode your UTF-8 strings, enable UTF-8 binmode, etc.
This is not true.

I live in Russia, so half of the strings I deal with contain Cyrillic.
99% of the time they are not decoded.
And you know what, it's fine.

Here is the problem with decoded strings:

$ perl -MEncode -E 'my $a = "абв"; my $b = "где"; say $a.decode_utf8($b)'
Ð°Ð±Ð²где

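If you concatenate two strings, either both of them must be decoded, or neither of them.
You can't mix the two styles: Perl treats every byte of the undecoded string as a separate Latin-1 character, so that half of the result turns into garbage.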
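
The rule is easy to check by comparing lengths. This is a minimal sketch, assuming the script is saved as UTF-8 and does not `use utf8`, so the string literals are plain byte strings; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Assumption: this file is saved as UTF-8 and `use utf8` is NOT in effect,
# so each Cyrillic letter below is stored as two bytes.
my $bytes_a = "абв";
my $bytes_b = "где";

# Both byte strings: the result is still valid UTF-8 bytes.
my $both_bytes = $bytes_a . $bytes_b;

# Both decoded: the result is a 6-character string.
my $both_chars = decode_utf8($bytes_a) . decode_utf8($bytes_b);

# Mixed: each byte of $bytes_a is upgraded as a Latin-1 character,
# so the first half of the result is mojibake.
my $mixed = $bytes_a . decode_utf8($bytes_b);

printf "bytes + bytes: length %d\n", length $both_bytes;   # 12
printf "chars + chars: length %d\n", length $both_chars;   # 6
printf "bytes + chars: length %d\n", length $mixed;        # 9
```

The mixed string has 9 "characters": 6 that are really bytes of UTF-8, plus 3 real Cyrillic letters. Encoding it back does not return the original bytes.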

So there is no way to migrate gradually to Unicode in an existing project if it's large enough.

But 99% of the time you don't need decoded strings at all.
And when you actually need them, simple wrappers are enough:

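use Encode qw(decode_utf8 encode_utf8);

# Lowercase a UTF-8 byte string; takes bytes and returns bytes.
sub lc_utf8 {
    my $x = shift;
    $x = decode_utf8($x, 1);    # 1 is Encode::FB_CROAK: die on malformed UTF-8
    $x = lc($x);                # lc follows Unicode rules on a decoded string
    $x = encode_utf8($x);
    return $x;
}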
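
If several such wrappers are needed, the decode, operate, encode steps can be factored out. The following is only a sketch: `with_decoded`, `uc_utf8` and `ucfirst_utf8` are hypothetical names, not part of Encode, and the literal is assumed to be a byte string (file saved as UTF-8, no `use utf8`).

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper (not part of Encode): run $code on the decoded
# form of a UTF-8 byte string and return the result as UTF-8 bytes.
sub with_decoded {
    my ($code, $bytes) = @_;
    my $chars = decode_utf8($bytes, Encode::FB_CROAK());   # die on malformed input
    return encode_utf8($code->($chars));
}

# Byte-in, byte-out wrappers built on top of the helper.
sub uc_utf8      { return with_decoded(sub { uc $_[0] },      $_[0]) }
sub ucfirst_utf8 { return with_decoded(sub { ucfirst $_[0] }, $_[0]) }

# Assumption: file saved as UTF-8 without `use utf8`, so this is bytes.
print uc_utf8("привет"), "\n";        # ПРИВЕТ
print ucfirst_utf8("привет"), "\n";   # Привет
```

The rest of the program never sees a decoded string, so nothing else has to change.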

Well, of course that depends on the tasks you solve in your code. If you use regexes with /i or do complex linguistic stuff, you need decoded strings and 'use utf8' and all the other crazy stuff tchrist wrote about.
In all other cases... not really.

If you choose to use decoded strings everywhere, you have to always worry about all your inputs and outputs: set binmode for each filehandle, decode output from databases and other I/O modules, etc.
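
For comparison, here is roughly what "decoded everywhere" means at the I/O boundaries. It is a sketch under assumptions: in-memory filehandles stand in for real files or sockets, and the literal is a byte string (file saved as UTF-8, no `use utf8`).

```perl
use strict;
use warnings;

# Assumption: file saved as UTF-8 without `use utf8`, so $raw holds bytes.
my $raw = "Привет\n";

# Input boundary: the :encoding layer decodes bytes into characters.
open my $in, '<:encoding(UTF-8)', \$raw or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

my $lower = lc $line;    # Unicode-aware, because $line is decoded

# Output boundary: the layer encodes characters back into bytes.
my $out_bytes = '';
open my $out, '>:encoding(UTF-8)', \$out_bytes or die "open: $!";
print {$out} $lower, "\n";
close $out;

# STDOUT needs the same treatment, or Perl warns "Wide character in print".
binmode STDOUT, ':encoding(UTF-8)';
print $lower, "\n";
```

Every handle that is missed produces either a warning or double-encoded output, and that is the bookkeeping in question.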
I'm not sure if it's worth it.

Anyway, the important thing I wanted to say is: if you're a CPAN module author and your module does some I/O, for example it's an HTTP library or a JSON serializer/deserializer, don't make the choice for me. Let me use byte strings, at least as an option.
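
One way a module could leave that choice to the caller is an explicit option. This is a minimal sketch, not the API of any real CPAN module: `parse_body` and its `decode` option are hypothetical names.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical library function: by default the payload is handed back
# untouched as bytes; callers who want characters opt in explicitly.
sub parse_body {
    my ($raw_bytes, %opt) = @_;
    return $raw_bytes unless $opt{decode};
    return decode_utf8($raw_bytes, Encode::FB_CROAK());   # die on malformed UTF-8
}

# Assumption: file saved as UTF-8 without `use utf8`, so this is bytes.
my $payload = "привет";

my $as_bytes = parse_body($payload);                # 12 bytes, unchanged
my $as_chars = parse_body($payload, decode => 1);   # 6 decoded characters
```

Defaulting to bytes keeps existing byte-string code working, while callers that prefer decoded strings pay the cost only where they ask for it.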

About Vyacheslav Matyukhin

I wrote Ubic. I worked at Yandex for many years, and now I'm building my own startup, questhub.io (formerly PlayPerl). I'm also working on Flux, a streaming data processing framework. CPAN ID: MMCLERIC.