Coping with double encoded UTF-8

By Christian Hansen on October 2, 2010 8:46 PM

A few months ago a client asked if I could help them with a "double encoded UTF-8 data problem": they had managed to store several GB of data with "corrupted" UTF-8 (technically it's not corrupt, since it's well-formed UTF-8). During the process I developed several regexes that I would like to share; they may prove useful to you someday.

Because UTF-8 uses prefix codes, a double encoded UTF-8 sequence is easy to spot: the prefix code falls within the range U+00C2 to U+00F4 and is followed by one or more continuation codes within the range U+0080 to U+00BF. Encoded a second time, each of these becomes a two-octet sequence that starts with C2 or C3.
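To see where those ranges come from, it helps to reproduce the mistake deliberately. The following is a minimal sketch, assuming the core Encode module is available; the helper hex_octets is a hypothetical name introduced here only for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of the original post): render a string
# as space-separated hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# U+00E9 LATIN SMALL LETTER E WITH ACUTE as a one-character string
my $char = "\x{E9}";

# Encoding once gives the well-formed UTF-8 octets for U+00E9.
my $once = encode('UTF-8', $char);

# The mistake: the octets are treated as Latin-1 characters and encoded
# again, so every octet above 0x7F turns into two octets.
my $twice = encode('UTF-8', $once);

print 'once:  ', hex_octets($once),  "\n";   # once:  C3 A9
print 'twice: ', hex_octets($twice), "\n";   # twice: C3 83 C2 A9
```

The lead octet C3 has become C3 83 and the continuation octet A9 has become C2 A9, which is the shape the patterns in this post look for.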

The above translated to a regex:

# matches a "double" encoded UTF-8 sequence within the range U+0080 - U+10FFFF
my $UTF8_double_encoded = qr/
    \xC3 (?: [\x82-\x9F] \xC2 [\x80-\xBF]                                    # U+0080 - U+07FF
           |  \xA0       \xC2 [\xA0-\xBF] \xC2 [\x80-\xBF]                   # U+0800 - U+0FFF
           | [\xA1-\xAC] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]                   # U+1000 - U+CFFF
           |  \xAD       \xC2 [\x80-\x9F] \xC2 [\x80-\xBF]                   # U+D000 - U+D7FF
           | [\xAE-\xAF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]                   # U+E000 - U+FFFF
           |  \xB0       \xC2 [\x90-\xBF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+010000 - U+03FFFF
           | [\xB1-\xB3] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+040000 - U+0FFFFF
           |  \xB4       \xC2 [\x80-\x8F] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+100000 - U+10FFFF
          )
/x;

If we only want to match within the Latin-1 Supplement range:

# matches a "double" encoded UTF-8 sequence within the range U+0080 - U+00FF
my $UTF8_double_encoded_latin1 = qr/
    \xC3 [\x82-\x83] \xC2 [\x80-\xBF] # U+0080 - U+00FF
/x;
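As a quick check of what a pattern restricted to U+0080 - U+00FF should and should not match, here is a small sketch. The helper count_matches and the sample strings are assumptions made for illustration; the pattern allows only the re-encoded lead octets C2 and C3, because those are the only lead octets UTF-8 uses for that range.

```perl
use strict;
use warnings;

# Latin-1 Supplement only: lead octets C2 and C3 re-encode to C3 82 and
# C3 83, continuation octets 80-BF re-encode to C2 80 - C2 BF.
my $latin1_double = qr/\xC3 [\x82-\x83] \xC2 [\x80-\xBF]/x;

# Hypothetical helper: count the matches of a pattern in a string.
sub count_matches {
    my ($string, $re) = @_;
    my $count = () = $string =~ /$re/g;
    return $count;
}

my $double_e_acute  = "caf\xC3\x83\xC2\xA9";  # U+00E9 encoded twice
my $single_e_acute  = "caf\xC3\xA9";          # U+00E9 encoded once
my $double_a_macron = "\xC3\x84\xC2\x80";     # U+0100 encoded twice

printf "%d %d %d\n",
    count_matches($double_e_acute,  $latin1_double),
    count_matches($single_e_acute,  $latin1_double),
    count_matches($double_a_macron, $latin1_double);   # prints "1 0 0"
```

U+0100 is the first code point outside the Latin-1 Supplement; its lead octet C4 re-encodes to C3 84, so it is deliberately left unmatched.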

Example of usage:

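# takes one sequence matched by $UTF8_double_encoded and returns the
# original UTF-8 octets
sub decode_double_encoded($) {
    local $_ = shift;
    # drop the C2/C3 octets added by the second encoding
    s/[\xC2-\xC3]//g;
    # restore the two high bits of the lead octet (10xxxxxx -> 11xxxxxx)
    s/\A(.)/chr(0xC0 | (ord($1) & 0x3F))/e;
    $_;
}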

# following assumes $fh returns octets (binmode() / ":raw")
while (<$fh>) {
    s/($UTF8_double_encoded)/decode_double_encoded($1)/geo;
    print {$output} $_;
}
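The two substitutions in decode_double_encoded are easier to follow on a concrete sequence. The sketch below walks through the same two steps for a double encoded U+20AC (EURO SIGN); the helper hex_octets and the variable names are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as space-separated hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# U+20AC EURO SIGN is E2 82 AC in UTF-8; encoded twice it becomes:
my $double = "\xC3\xA2\xC2\x82\xC2\xAC";

# Step 1: remove the C2/C3 octets introduced by the second encoding.
(my $stripped = $double) =~ s/[\xC2\xC3]//g;

# Step 2: the lead octet now starts with bits 10 instead of 11
# (E2 became A2); set the two high bits again.
my $lead     = ord(substr($stripped, 0, 1));
my $repaired = chr(0xC0 | ($lead & 0x3F)) . substr($stripped, 1);

print hex_octets($stripped), "\n";   # A2 82 AC
print hex_octets($repaired), "\n";   # E2 82 AC
```

The continuation octets survive step 1 unchanged, because a character in the range U+0080 to U+00BF encodes to C2 followed by the same octet value.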

A few days after I fixed the client's data, they asked if I could develop a routine that would detect double encoded UTF-8 data. I came up with this:

# matches a well-formed UTF-8 encoded sequence within the range U+0080 - U+10FFFF
my $UTF8 = qr/
    (?: [\xC2-\xDF] [\x80-\xBF]                           # U+0080 - U+07FF
      |  \xE0       [\xA0-\xBF] [\x80-\xBF]               # U+0800 - U+0FFF
      | [\xE1-\xEC] [\x80-\xBF] [\x80-\xBF]               # U+1000 - U+CFFF
      |  \xED       [\x80-\x9F] [\x80-\xBF]               # U+D000 - U+D7FF
      | [\xEE-\xEF] [\x80-\xBF] [\x80-\xBF]               # U+E000 - U+FFFF
      |  \xF0       [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]   # U+010000 - U+03FFFF
      | [\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]   # U+040000 - U+0FFFFF
      |  \xF4       [\x80-\x8F] [\x80-\xBF] [\x80-\xBF]   # U+100000 - U+10FFFF
    )
/x;

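use Carp qw(croak);

# returns the number of sequences in the string that look double encoded
sub looks_like_double_encoded_utf8 {
    @_ == 1 || croak(q/Usage: looks_like_double_encoded_utf8(string)/);

    my $count = do {
        # &utf8::is_utf8 is called with the current @_
        if (&utf8::is_utf8) {
            # character string: look for characters that spell out UTF-8 octets
            () = $_[0] =~ /$UTF8/og;
        }
        else {
            # octet string: look for double encoded octet sequences
            () = $_[0] =~ /$UTF8_double_encoded/og;
        }
    };

    return $count;
}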
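The two branches can be exercised with a short, self-contained sketch. To keep it brief it uses simplified stand-in patterns that cover only the two-octet range U+0080 - U+07FF, and a hypothetical helper named count_suspects; both are assumptions for illustration and not replacements for the full patterns above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simplified stand-ins for $UTF8 and $UTF8_double_encoded (assumption:
# only the two-octet range U+0080 - U+07FF is covered here).
my $utf8_2       = qr/[\xC2-\xDF][\x80-\xBF]/;
my $utf8_2_twice = qr/\xC3[\x82-\x9F]\xC2[\x80-\xBF]/;

# Hypothetical helper mirroring the branch on utf8::is_utf8.
sub count_suspects {
    my ($string) = @_;
    my $re = utf8::is_utf8($string) ? $utf8_2 : $utf8_2_twice;
    my $count = () = $string =~ /$re/g;
    return $count;
}

# Two double encoded characters (U+00EF and U+00E9), as octets
my $octets = "na\xC3\x83\xC2\xAFve caf\xC3\x83\xC2\xA9";

# Decoded once: a character string that still spells out UTF-8 octets
my $chars = decode('UTF-8', $octets);

print count_suspects($octets), "\n";        # 2 (octet branch)
print count_suspects($chars),  "\n";        # 2 (character branch)
print count_suspects("naive cafe"), "\n";   # 0
```

A string that has already been decoded once carries the double encoding as characters such as U+00C3 followed by U+00A9, which is why the character branch matches against the plain UTF-8 pattern.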

Before developing the above I did have a look at the following modules; they didn't suit my needs because of a required audit process, but they may suit yours:

--chansen

Tagged as:

double encoded UTF-8, UTF-8

2 Comments

confuseAcat | June 24, 2013 3:05 PM

First comment!

Where is $UTF8_double_encoded? And what's the purpose of if (&utf8::is_utf8)?

Christian Hansen replied to comment from confuseAcat | June 26, 2013 11:10 PM

$UTF8_double_encoded is defined in the first code snippet. &utf8::is_utf8 is used to determine whether to match Perl's internal representation of wide characters or the octet representation of UTF-8.

--
chansen
