Strings: Characters vs Data

By Brad Gilbert on December 1, 2010 12:00 PM

I’ve been wondering why Character strings and Binary strings are mostly the same thing in Perl. If you read in a binary file, why would you want it to behave as if it were a Character file?

The main difference currently, is that Unicode Character strings have the utf8 bit set. There isn’t really any difference between ASCII Character strings and Data strings. This also makes it a challenge to read in Data strings that have Character strings embedded in them, especially if those Character strings are encoded as anything other than UTF-8.
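As a rough sketch of that challenge, assume a made-up record layout (the layout, the field names, and the choice of ISO-8859-1 are all assumptions for illustration): a 4-byte id, a 2-byte length, then that many bytes of text. The record as a whole is data, and only the text field should ever be decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical record layout (an assumption for illustration):
#   4-byte big-endian id, 2-byte big-endian length, then that many
#   bytes of ISO-8859-1 text.
my $record = pack('N n/a*', 42, "na\xEFve");

# Step 1: treat the record purely as data and pull the fields apart.
my ($id, $name_octets) = unpack('N n/a*', $record);

# Step 2: decode only the text field, using the encoding the format
# specifies. The result is a Character string.
my $name = decode('ISO-8859-1', $name_octets);

printf "id=%d, name has %d characters, third character is U+%04X\n",
    $id, length $name, ord substr($name, 2, 1);
```

Nothing in the record itself says where the text field starts or how it is encoded; that knowledge has to come from the file format.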

There are some historical reasons for this conflation. Perl was originally designed to work with ASCII Character strings. Since there was almost no difference between 8-bit bytes holding ASCII text and 8-bit bytes holding data, there wasn’t any real need to separate the two. Unfortunately the world (of programming) is no longer this simple, and Perl needed to change to handle Character data that wouldn’t fit in a single 8-bit byte.

This was done by adding a flag to the string type. When it is set, it means the Character string may have characters that are too large to fit in a single byte. When the utf8 flag is not set, it means there probably aren’t any characters larger than 1 byte. Unfortunately it could also mean that the string is actually a binary Data string.
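A minimal sketch of how the flag shows up in practice, assuming the core `Encode` module and the built-in `utf8::is_utf8` function (the sample bytes are made up):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Five bytes as they might be read from a file: "caf" followed by the
# two-byte UTF-8 sequence for U+00E9.
my $octets = "caf\xC3\xA9";

# Decoding turns the octets into a Character string of four characters.
my $chars = decode('UTF-8', $octets);

printf "octets: length %d, utf8 flag %s\n",
    length $octets, utf8::is_utf8($octets) ? 'on' : 'off';
printf "chars:  length %d, utf8 flag %s\n",
    length $chars, utf8::is_utf8($chars) ? 'on' : 'off';
```

The undecoded string has a length of 5 with the flag off, and the decoded one has a length of 4 with the flag on. A pure-ASCII string looks the same to Perl whether it was meant as text or as data.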

All of that ignores the fact that there are quite a few other ways to store Character strings than just UTF-8. Currently the only way to handle them is to convert them to UTF-8 Character strings and set the utf8 flag.
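As an illustration of that conversion step, here is a small sketch using `decode` and `encode` from the core `Encode` module. The sample bytes are assumptions chosen for the example: the word "naïve" stored once as ISO-8859-1 and once as UTF-16LE.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The same text in two different on-disk encodings.
my $latin1_octets = "na\xEFve";               # ISO-8859-1, 5 bytes
my $utf16_octets  = "n\0a\0\xEF\0v\0e\0";     # UTF-16LE, 10 bytes

# Both are converted into the same Character string.
my $from_latin1 = decode('ISO-8859-1', $latin1_octets);
my $from_utf16  = decode('UTF-16LE',   $utf16_octets);

print "same text\n" if $from_latin1 eq $from_utf16;

# Going back out means choosing an encoding again.
my $utf8_octets = encode('UTF-8', $from_latin1);
printf "as UTF-8: %d bytes\n", length $utf8_octets;
```

Whatever the source encoding was, the original byte layout is gone once the string has been decoded; only the characters remain.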

Unfortunately the way this has been implemented makes it more difficult to handle Data strings.

- You have to be careful when appending strings to them, otherwise the result could get the utf8 flag incorrectly applied to it.
- You also have to carefully construct your regular expressions, to avoid matching something other than what you intended.
- You will also have to use pack and unpack to convert the binary data to numbers, and to convert them back.
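The three points above can be sketched as follows. This assumes a default Perl 5 setup without `use feature 'unicode_strings'` or `use locale`, and the 6-byte record is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical 6-byte record: a 16-bit and a 32-bit big-endian integer.
my $data = pack('n N', 258, 0xDEADBEEF);       # bytes 01 02 DE AD BE EF

# 1. Appending: one character above 0xFF changes how the whole string
#    is stored, and it can no longer be downgraded to plain bytes.
my $mixed = $data . "\x{263A}";
my $copy  = $mixed;
my $still_bytes = utf8::downgrade($copy, 1) ? 1 : 0;   # 1 = fail quietly

# 2. Regular expressions: the same byte can match \w or not, depending
#    on the flag. An explicit byte range avoids the ambiguity.
my $byte     = "\xE9";
my $upgraded = $byte;
utf8::upgrade($upgraded);
my $w_plain    = $byte     =~ /\w/           ? 1 : 0;
my $w_upgraded = $upgraded =~ /\w/           ? 1 : 0;
my $w_explicit = $upgraded =~ /[A-Za-z0-9_]/ ? 1 : 0;

# 3. Numbers: unpack turns the bytes back into integers.
my ($short, $long) = unpack('n N', $data);

print "downgrade ok: $still_bytes\n";
print "\\w on plain: $w_plain, on upgraded: $w_upgraded, explicit: $w_explicit\n";
printf "unpacked: %d and 0x%X\n", $short, $long;
```

Under those assumptions the downgrade fails, `\w` matches the byte 0xE9 only in the upgraded copy, and the explicit character class gives the same answer for both.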

The only real way out of this mess is to have separate, or mostly separate, implementations of Character strings and Data strings. Unfortunately this is a lot easier to say than it would be to implement. It would probably require changes that break backwards compatibility, since there are probably many places in existing code that rely on how strings are currently implemented. There is also the question of how you would control the behavior from the Perl language itself, and of how the IO system would have to be modified to handle this correctly.

This change would also go a long way towards having more than one internal encoding to handle Character strings.

I’m not endorsing any of the ideas above; I just want you to consider the possibilities.

Tagged as:

string binary character data

3 Comments

brian d foy | December 1, 2010 12:39 PM | Reply

Writing about this was perhaps the most painful part of Effective Perl Programming (but hopefully not as painful for a new Llama). Instead of saying "strings", which causes a lot of confusion, we'd just write "octets", "characters", and "graphemes". We had to rewrite quite a bit because we were so used to being sloppy, and getting away with it, when talking about characters and strings.

And "characters" isn't what most people think it is either, so it's even more complicated than you say.

Aristotle | December 2, 2010 3:53 AM | Reply

"The main difference currently, is that Unicode Character strings have the utf8 bit set."

This is not true. The utf8 flag can be set on octet strings, and it can be unset on text strings. The fact is, there is no way whatsoever to tell the kind of a string just by looking at it. All you can do is carefully document which kind of string your interfaces expect/return, and always treat them accordingly.
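A small sketch of that point, using the built-in `utf8::upgrade` and `utf8::is_utf8` functions (the sample strings are assumptions for illustration):

```perl
use strict;
use warnings;

# Octets: the first four bytes of a PNG file.
my $octets = "\x89PNG";

# The same octets after an internal upgrade: the content is unchanged,
# but the flag is now on.
my $same = $octets;
utf8::upgrade($same);

# Text: "café" with every character below 256, so the flag is off.
my $text = "caf\xE9";

printf "octets eq upgraded copy: %s\n", $octets eq $same ? 'yes' : 'no';
printf "flag on upgraded octets: %s\n", utf8::is_utf8($same) ? 'on' : 'off';
printf "flag on text:            %s\n", utf8::is_utf8($text) ? 'on' : 'off';
```

The two copies of the PNG header compare equal even though their flags differ, and the text string carries no flag at all, so the flag cannot serve as a type tag.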

Denis Howe | December 3, 2010 3:37 AM | Reply

Thanks for this article. So mostly it just works but occasionally something odd happens, like when I had to copy and paste the mystical incantation:

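open my $in, "<:encoding(UTF-8)", $file or die "Cannot read $file: $!";
my $text = do { local $/; <$in> };    # read the whole file, decoded to characters
close $in;
$text =~ s/foo/bar/g;
open my $out, ">:encoding(UTF-8)", $file or die "Cannot write $file: $!";
print {$out} $text;                   # characters are encoded back to UTF-8 bytes
close $out or die "Cannot close $file: $!";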

without really understanding WHY it fixed the funny characters in my output. What I'd love to see is an explanation with simple examples of bit patterns in files and in memory showing when and why things break and how to fix them.
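One possible illustration of the byte patterns involved, as a sketch only: it assumes the file holds the word "café" in UTF-8, and it uses `decode` and `encode` from the core `Encode` module to stand in for what the `:encoding` layers do on input and output. The `hex_bytes` helper is a made-up name for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show each character of a string as two hex digits.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# What is on disk: "café" in UTF-8, five bytes.
my $file_bytes = "caf\xC3\xA9";

# Read without a decoding layer: each byte becomes one character, so
# the string holds five characters instead of four.
my $undecoded = $file_bytes;

# Writing that through a UTF-8 output layer encodes C3 and A9 again.
my $double = encode('UTF-8', $undecoded);

# Read with a decoding layer: C3 A9 becomes the single character U+00E9,
# and writing it back produces the original bytes.
my $decoded    = decode('UTF-8', $file_bytes);
my $round_trip = encode('UTF-8', $decoded);

print "on disk:        ", hex_bytes($file_bytes), "\n";
print "double-encoded: ", hex_bytes($double),     "\n";
print "round trip:     ", hex_bytes($round_trip), "\n";
```

The double-encoded line comes out as 63 61 66 C3 83 C2 A9, which a UTF-8 viewer shows as "cafÃ©". Decoding on input and encoding on output keeps the bytes at 63 61 66 C3 A9.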
