Understanding Unicode/UTF8 in Perl

I'm not a Unicode guru, but when working with third parties I often find that a lot of people consistently fail to get the basics of Unicode and encoding right. So here's yet another set of slides about Unicode/UTF-8 in Perl.

It's not meant to be a comprehensive presentation of all things Unicode in Perl. It's meant to insist on a couple of guidelines and to give some pointers for getting a good start on writing a Unicode-compliant application and avoiding common issues.

View them here.


4 Comments

A character isn't a combination of one or more glyphs. Glyphs are used to represent characters and are provided by fonts, which are collections of glyphs. Unicode doesn't define glyphs (although it includes example glyphs in the code charts) and Perl doesn't have any understanding of glyphs. Unicode defines characters, which have abstract meaning but not specific shapes. These characters are assigned code points and can be stored in bytes using character encoding forms. Throughout your slides, the word glyph can generally be replaced with the word character. The Unicode standard uses the terms base character, combining character, and precomposed character. See http://www.unicode.org/glossary/ for details.
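
To make the base / combining / precomposed distinction concrete, here is a minimal Perl 5 sketch using the core Unicode::Normalize module. The variable names are made up for the illustration; the point is only that the same visible "é" can be stored as two different code point sequences, and that they compare equal only after normalization.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Precomposed character: U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $precomposed = "\x{E9}";

# Base character 'e' followed by the combining character U+0301
my $decomposed = "e\x{301}";

# Different code point sequences, so a plain string comparison fails
my $equal_raw = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both sides to the same form, they compare equal
my $equal_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;
my $equal_nfd = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;

printf "raw: %d, NFC: %d, NFD: %d\n", $equal_raw, $equal_nfc, $equal_nfd;
# prints "raw: 0, NFC: 1, NFD: 1"
```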

The term people usually want when they use “glyph” (which they usually use when they’ve heard that “character” does not mean what they think it means) is “grapheme”.

In those slides, the word glyph was erroneously being used for individual code points. Generally when talking about Unicode and its encodings, the three best technical terms for different levels of things that represent characters are bytes, code points, and graphemes. When used in the technical sense, the word character is equivalent to code point (e.g., base character, combining character, control character) but for common English usage it is equivalent to grapheme. No wonder there's so much confusion!
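
In Perl 5, the three levels can be counted roughly as follows. This is only a sketch: it assumes $str already holds decoded characters rather than raw bytes, and the sample string and variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# 'e' + U+0301 COMBINING ACUTE ACCENT, then U+1F600 (a character outside the BMP)
my $str = "e\x{301}\x{1F600}";

# Graphemes: \X matches one extended grapheme cluster
my $graphemes = () = $str =~ /\X/g;

# Code points: length() on a decoded character string counts code points
my $code_points = length $str;

# Bytes: encode first, then length() counts the resulting bytes
my $bytes = length encode('UTF-8', $str);

print "graphemes=$graphemes code_points=$code_points bytes=$bytes\n";
# prints "graphemes=2 code_points=3 bytes=7"
```

The same string gives three different "lengths", which is why it helps to say which level is meant.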

I attempted to explain these concepts in slides at YAPC::NA 2012.

Perl 6 syntax really helps to clarify these concepts.

$str.graphs                   # number of graphemes
$str.codes                    # number of code points
$str.encode($encoding).bytes  # number of bytes when encoded

And there's also $str.chars for the number of characters in "the current (lexically scoped) idea of what a normal character is, usually graphemes." Fortunately, length is banned from the language!

For details, see Synopsis 32: Setting Library - Str.

About Jerome Eteve

I'm a Perl programmer and I blog about Perl and other stuff.