December 2010 Archives

Strings: Characters vs Data

I’ve been wondering why Character strings and Binary strings are mostly the same thing in Perl. If you read in a binary file, why would you want it to behave as if it were a Character file?

The main difference currently is that Unicode Character strings have the utf8 flag set. There isn’t really any difference between ASCII Character strings and Data strings. This also presents a challenge if you want to read in Data strings that have Character strings embedded in them, especially if those Character strings are encoded as anything other than UTF-8.
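To make the embedded-string problem concrete, here is a minimal sketch. The record layout (a 2-byte big-endian id, a 1-byte length, then a name stored as UTF-16LE bytes) is made up for illustration and does not come from any real file format. The point is only that the binary fields and the text field need different handling, and that the text has to be decoded explicitly.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical record: 2-byte big-endian id, 1-byte name length in bytes,
# then the name itself as UTF-16LE bytes.
my $record = pack('n C a*', 513, 4, "H\0i\0");

# The binary fields come out with unpack; the name is still raw bytes here.
my ($id, $len, $raw_name) = unpack('n C a*', $record);

# Only an explicit decode turns those bytes into a Character string.
my $name = decode('UTF-16LE', substr($raw_name, 0, $len));

print "id=$id name=$name\n";
```

Nothing in `$record` itself says which part is text or what encoding it uses; that knowledge lives entirely in the code that reads it.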

There are some historical reasons for this conflation. Perl was originally designed to work with ASCII Character strings. Since there was almost no difference between ASCII stored in 8-bit bytes and 8-bit data, there wasn’t any real need to separate the two. Unfortunately the world (of programming) is no longer this simple, and Perl needed to change to handle Character data that wouldn’t fit in a single 8-bit byte.

The way this was done was by adding a flag to the string type. When it is set, the Character string may have characters that are larger than could fit in a single byte. When the utf8 flag is not set, there probably aren’t any characters larger than 1 byte. Unfortunately it could also mean that the string is actually a binary Data string.
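The flag can be inspected from Perl with the core function `utf8::is_utf8`. The short sketch below uses made-up sample strings and shows that a plain ASCII string and a string of raw bytes look identical as far as the flag is concerned.

```perl
use strict;
use warnings;

# An ASCII string, and a string holding a character that needs more than one byte.
my $ascii = "hello";
my $wide  = "snowman: \x{2603}";

# Raw binary data: four arbitrary bytes.
my $bytes = "\xDE\xAD\xBE\xEF";

# utf8::is_utf8 reports the internal flag; it is available without "use utf8".
my $ascii_flag = utf8::is_utf8($ascii) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;

# prints "ascii=0 wide=1 bytes=0"
printf "ascii=%d wide=%d bytes=%d\n", $ascii_flag, $wide_flag, $bytes_flag;
```

Only the string with the wide character is flagged; the flag alone cannot tell text from data.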

All of that ignores the fact that there are quite a few other ways to store Character strings than just UTF-8. Currently the only way to handle them is to convert them to UTF-8 Character strings, which sets the utf8 flag.
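In practice that conversion is usually done with the `Encode` module. As a small sketch with made-up sample bytes, the same word stored as Latin-1 and as UTF-16LE decodes to the same Character string, and an encoding has to be chosen again on the way back out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café" as Latin-1 bytes: the é is the single byte 0xE9.
my $latin1_bytes = "caf\xE9";

# The same text as UTF-16LE bytes: two bytes per character.
my $utf16_bytes = "c\0a\0f\0\xE9\0";

# Both decode to the same four-character string.
my $from_latin1 = decode('ISO-8859-1', $latin1_bytes);
my $from_utf16  = decode('UTF-16LE',   $utf16_bytes);
my $same = $from_latin1 eq $from_utf16 ? 1 : 0;

# Encoding to UTF-8 gives five bytes, because é takes two.
my $utf8_bytes = encode('UTF-8', $from_latin1);

printf "same=%d chars=%d utf8_bytes=%d\n",
    $same, length($from_latin1), length($utf8_bytes);
```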


Unfortunately the way this has been implemented makes it more difficult to handle Data strings.

- You have to be careful when appending strings to them, otherwise the result could get the utf8 flag incorrectly applied to it.
- You also have to carefully construct your regular expressions, to avoid matching something other than what you intended.
- You will also have to use pack and unpack to convert the binary data to numbers, and to convert them back.
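The three points above can each be shown in a few lines. This is only a sketch with made-up values, and it assumes default regular expression semantics (no `/a` or `/u` modifier, no `unicode_strings` feature, no `use locale`).

```perl
use strict;
use warnings;
use Encode qw(encode);

# 1. Appending: joining raw bytes with a flagged Character string
#    upgrades the whole result, so the bytes are now treated as characters.
my $data  = "\xFF\xFE";          # two raw bytes
my $text  = "\x{263A}";          # one character above 255, utf8 flag set
my $mixed = $data . $text;
my $mixed_flag = utf8::is_utf8($mixed) ? 1 : 0;
# Writing $mixed out as UTF-8 turns each high byte into two bytes.
my $out_len = length(encode('UTF-8', $mixed));

# 2. Regular expressions: under default semantics, \w treats the same
#    value differently depending on the flag.
my $byte   = "\xE9";
my $before = $byte =~ /\w/ ? 1 : 0;
utf8::upgrade($byte);            # same value, different internal storage
my $after  = $byte =~ /\w/ ? 1 : 0;

# 3. Numbers: binary fields have to go through unpack and pack.
my $number = unpack('N', "\x00\x00\x01\x00");   # 32-bit big-endian
my $packed = pack('N', $number);

print "flag=$mixed_flag out_len=$out_len before=$before after=$after number=$number\n";
```

In the first case the two original bytes would be written out as four, which is rarely what you want for binary data.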


The only real way out of this mess is to have separate, or mostly separate, implementations of Character strings and Data strings. Unfortunately this is a lot easier to say than it would be to implement. It would probably require changes that break backwards compatibility, since there are probably many places in existing code that rely on how strings are currently implemented. There is also the question of how you would control the behavior from the Perl language itself, and of how the IO system would have to be modified to handle this correctly.

This change would also go a long way towards allowing more than one internal encoding for Character strings.


I’m not endorsing any of the ideas above; I just want you to consider the possibilities.

About Brad Gilbert
