Unicode rant
People these days say that you should always decode your UTF-8 strings, enable the utf8 binmode, and so on.
This is not true.
I live in Russia, so half of the strings I deal with contain Cyrillic.
99% of the time they are not decoded.
And you know what, it's fine.
Here is the problem with decoded strings:
$ perl -MEncode -E 'my $a = "абв"; my $b = "где"; say $a.decode_utf8($b)'
Ð°Ð±Ð²где
If you concatenate two strings, either both of them must be decoded, or neither of them.
You can't mix two styles.
So there is no way to migrate gradually to Unicode in an existing project if it's large enough.
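For comparison, decode both sides and the concatenation comes out right (same setup as above: a UTF-8 terminal, no binmode set):
$ perl -MEncode -E 'my $a = "абв"; my $b = "где"; say encode_utf8(decode_utf8($a) . decode_utf8($b))'
абвгде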
But 99% of the time you don't need decoded strings at all.
And when you actually need them, simple wrappers are enough:
use Encode qw(decode_utf8 encode_utf8);

# Lowercase a UTF-8 byte string: decode it, lowercase it as characters, encode it back.
sub lc_utf8 {
    my $x = shift;
    $x = decode_utf8($x, 1);   # second argument: croak on invalid UTF-8
    $x = lc($x);
    $x = encode_utf8($x);
    return $x;
}
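A quick usage sketch (assuming the script is saved as UTF-8 without 'use utf8', so the literal below is a byte string, and the terminal expects UTF-8):
print lc("АБВ"), "\n";       # prints "АБВ" unchanged: lc sees only bytes here, none of them in A-Z
print lc_utf8("АБВ"), "\n";  # prints "абв"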
Well, of course that depends on the tasks you solve in your code. If you use regexes with /i or do complex linguistic stuff, you need decoded strings and 'use utf8' and all the other crazy stuff tchrist wrote about.
In all other cases... not really.
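For example, case-insensitive matching of Cyrillic only starts working once the strings are characters rather than bytes. A sketch, assuming a UTF-8 terminal (-Mutf8 makes perl treat the source of the one-liner as UTF-8, i.e. as decoded characters):
$ perl -e 'print "АБВ" =~ /абв/i ? "match\n" : "no match\n"'
no match
$ perl -Mutf8 -e 'print "АБВ" =~ /абв/i ? "match\n" : "no match\n"'
match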
If you choose to use decoded strings everywhere, you have to always worry about all your inputs and outputs: set binmode for each filehandle, decode output from databases and other I/O modules, etc.
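To be concrete, the "decode everything" style means boilerplate like this at every boundary (a sketch; $file and $cgi are just placeholders for whatever your code actually touches):
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';
open my $fh, '<:encoding(UTF-8)', $file or die $!;   # every filehandle you read or write
my $q = decode('UTF-8', $cgi->param('q'));           # every piece of outside input
# ...plus whatever flags your DB driver needs (mysql_enable_utf8, pg_enable_utf8, ...)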
I'm not sure if it's worth it.
Anyway, the important thing I wanted to say is: if you're a CPAN module author and your module does some I/O (for example, it's an HTTP library, or maybe a JSON serializer/deserializer), don't make the choice for me. Let me use byte strings, at least as an option.
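The JSON modules are an example of leaving that choice to the caller (a sketch; $data stands for any Perl structure):
use JSON;
my $bytes = JSON->new->utf8(1)->encode($data);   # UTF-8 encoded byte string, ready to write to a raw filehandle
my $chars = JSON->new->utf8(0)->encode($data);   # decoded character string, for code that works in characters
# decode() follows the same switch: with utf8(1) it expects bytes, with utf8(0) it expects characters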
But I have mostly switched over to using UTF-8 everywhere, after an awful experience with a bug in byte-encoded Japanese: the Shift-JIS encoding of the "ideographic space character" (Unicode 0x3000) actually has an @ mark inside it. So it looks like a space character in the editor, but there is an @ hidden in it; maybe you can imagine the confusion. After that I usually try to go all-Unicode.
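(For the record, that is easy to check: the Shift-JIS encoding of U+3000 is the two bytes 0x81 0x40, and 0x40 is the ASCII code for '@'.)
$ perl -MEncode -E 'printf "%v02X\n", encode("shiftjis", "\x{3000}")'
81.40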
It looks like the inverse of the 5.14 'use unicode_strings' feature would be useful in these cases (i.e. get Perl to never auto-promote strings). I wonder if one has been proposed?
(using 'no feature unicode_strings' wouldn't work because that would just revert to the current confusing behaviour)
That's absolutely not what the unicode_strings feature does. It actually doesn't change any of the decade-long-standing behaviour of Perl in this area. All it does is make perl consistent in its application of semantics to strings.
Well yes. Obviously. One is binary data, the other is text. You can't mix a txt file and a jpg file and get sensible results either.
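To make the unicode_strings point concrete: this is the classic inconsistency the feature removes. Without it, lc applies ASCII rules to a string stored as plain bytes, so Latin-1 "Ä" (0xC4) stays uppercase; with the feature (Perl 5.12+), character rules apply regardless of the internal storage. A sketch:
$ perl -e 'printf "%02X\n", ord lc "\xC4"'
C4
$ perl -e 'use feature "unicode_strings"; printf "%02X\n", ord lc "\xC4"'
E4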
That's true. It's very painful to fix a codebase that is pervasively broken. You pretty much have to do it all in one shot, and then have a flag day for rolling it over. Quite probably you'll need to clean up your database as well (it's likely that garbage has crept into it if you have been cavalier with encodings so far).
You need them as soon as you look inside the strings. If you use any multibyte encoding and you need length and substr to do something sensible, then you will need to decode the text. If you are only concatenating data together, then you can deal in bytes just as well as in characters.
The alternative to worrying about each input and each output is worrying about each string operation. And if you decide not to worry about inputs and outputs now, then by the time you realise it was a mistake, your codebase will already have grown a fair amount – and you said yourself what it means to try to fix a large codebase… too painful.
So just do it right to start with.
As a minor argument, but one that can be very painful in some cases – if you merely use wrappers, then you will sometimes end up having to decode and re-encode the same string over and over and over: a huge waste of cycles. In terms of performance, it’s more expensive to decode everything at the inputs and outputs, but it’s a fairly constant cost, whereas it can be much more expensive to find out that the decision not to decode was a mistake.
On the topic of the Perl concat behaviour: I don't understand why it has to be this way. Why can't perl check whether one string is decoded and the other is encoded, and concatenate their byte versions? Wouldn't that be saner? Why does it even allow me to encode a byte string?
I heard something about the UTF8 flag being unreliable, and I think it's related to this issue somehow... but that's just an unfortunate consequence of bad design choices made in the past, right?
Oh, I see, it's because byte strings are also Latin-1 strings. I should read 'perlunicode' before getting into arguments...
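A small sketch of why the flag alone can't tell you whether a string was ever decoded: perl can flip it without any decoding taking place (utf8::upgrade merely changes how the same Latin-1 data is stored internally):
$ perl -e 'my $s = "\xe9"; print utf8::is_utf8($s) ? "flagged\n" : "not flagged\n"; utf8::upgrade($s); print utf8::is_utf8($s) ? "flagged\n" : "not flagged\n"'
not flagged
flagged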