Coping with double encoded UTF-8

A few months ago a client asked if I could help them with a "double encoded UTF-8 data problem": they had managed to store several GB of data with "corrupted" UTF-8 (technically it's not corrupt, since it's well-formed UTF-8). During the process I developed several regexes that I would like to share; they may prove useful to you someday.

Thanks to UTF-8's use of prefix codes it's easy to spot a double encoded UTF-8 sequence. Each octet of the original sequence has been treated as a code point and encoded again, so the prefix code is within the range U+00C2 to U+00F4, followed by one or more continuation codes within the range U+0080 to U+00BF.

The above translated to a regex:

# matches a "double" encoded UTF-8 sequence within the range U+0080 - U+10FFFF
my $UTF8_double_encoded = qr/
    \xC3 (?: [\x82-\x9F] \xC2 [\x80-\xBF]                                    # U+0080 - U+07FF
           |  \xA0       \xC2 [\xA0-\xBF] \xC2 [\x80-\xBF]                   # U+0800 - U+0FFF
           | [\xA1-\xAC] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]                   # U+1000 - U+CFFF
           |  \xAD       \xC2 [\x80-\x9F] \xC2 [\x80-\xBF]                   # U+D000 - U+D7FF
           | [\xAE-\xAF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]                   # U+E000 - U+FFFF
           |  \xB0       \xC2 [\x90-\xBF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+010000 - U+03FFFF
           | [\xB1-\xB3] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+040000 - U+0FFFFF
           |  \xB4       \xC2 [\x80-\x8F] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+100000 - U+10FFFF
          )
/x;
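
To see where the octet ranges above come from, here is a minimal sketch that double encodes a character on purpose. It assumes the usual cause of the problem (UTF-8 octets mistaken for Latin-1 and encoded a second time); hex_bytes and double_encode are helper names made up for this example, not part of the client's code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# hypothetical helper: render a string's octets as space-separated hex
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
}

# assumption: the damage was done by reading UTF-8 octets as Latin-1
# and encoding the result as UTF-8 a second time
sub double_encode {
    my ($chars) = @_;
    my $once = encode('UTF-8', $chars);
    return encode('UTF-8', decode('ISO-8859-1', $once));
}

print hex_bytes(encode('UTF-8', "\x{E9}")), "\n";   # U+00E9 encoded once:  C3 A9
print hex_bytes(double_encode("\x{E9}")), "\n";     # U+00E9 encoded twice: C3 83 C2 A9
print hex_bytes(double_encode("\x{20AC}")), "\n";   # U+20AC encoded twice: C3 A2 C2 82 C2 AC
```

Every double encoded sequence starts with \xC3, and each remaining original octet is preceded by \xC2, which is the shape the alternations in $UTF8_double_encoded follow.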


If we only want to match within Latin-1 Supplement range:

# matches a "double" encoded UTF-8 sequence within the range U+0080 - U+00FF
my $UTF8_double_encoded_latin1 = qr/
    \xC3 [\x82-\x83] \xC2 [\x80-\xBF] # U+0080 - U+00FF
/x;


Example of usage:

sub decode_double_encoded($) {
    local $_ = shift;
    s/[\xC2-\xC3]//g;
    s/\A(.)/chr(0xC0 | (ord($1) & 0x3F))/e;
    $_;
}

# following assumes $fh returns octets (binmode() / ":raw")
while (<$fh>) {
    s/($UTF8_double_encoded)/decode_double_encoded($1)/geo;
    print {$output} $_;
}
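
The same substitution can be tried on an in-memory string before pointing it at real data. The following is only a sketch: it uses a Latin-1 Supplement pattern to stay short, and $double_latin1, $broken, $fixed and repair_sequence are names made up for the example (repair_sequence does the same thing as decode_double_encoded).

```perl
use strict;
use warnings;

# Latin-1 Supplement only (U+0080 - U+00FF), to keep the sketch short
my $double_latin1 = qr/ \xC3 [\x82-\x83] \xC2 [\x80-\xBF] /x;

# Drop the inserted C2/C3 octets, then restore the high bits of the
# lead octet (0x83 becomes 0xC3, 0xA2 would become 0xE2, and so on).
sub repair_sequence {
    my ($octets) = @_;
    $octets =~ s/[\xC2-\xC3]//g;
    $octets =~ s/\A(.)/chr(0xC0 | (ord($1) & 0x3F))/e;
    return $octets;
}

# "caf" followed by a double encoded U+00E9: C3 83 C2 A9 instead of C3 A9
my $broken = "caf\xC3\x83\xC2\xA9";

(my $fixed = $broken) =~ s/($double_latin1)/repair_sequence($1)/ge;

# prints: 63 61 66 C3 A9
print join(' ', map { sprintf('%02X', ord($_)) } split //, $fixed), "\n";
```

Removing every \xC2 and \xC3 is safe inside a matched sequence because the payload octets are always in the range \x80 to \xBF.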


A few days after I fixed the client's data they asked if I could develop a routine that would detect double encoded UTF-8 data. I came up with this:

# matches a well-formed UTF-8 encoded sequence within the range U+0080 - U+10FFFF
my $UTF8 = qr/
    (?: [\xC2-\xDF] [\x80-\xBF]                           # U+0080 - U+07FF
      |  \xE0       [\xA0-\xBF] [\x80-\xBF]               # U+0800 - U+0FFF
      | [\xE1-\xEC] [\x80-\xBF] [\x80-\xBF]               # U+1000 - U+CFFF
      |  \xED       [\x80-\x9F] [\x80-\xBF]               # U+D000 - U+D7FF
      | [\xEE-\xEF] [\x80-\xBF] [\x80-\xBF]               # U+E000 - U+FFFF
      |  \xF0       [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]   # U+010000 - U+03FFFF
      | [\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]   # U+040000 - U+0FFFFF
      |  \xF4       [\x80-\x8F] [\x80-\xBF] [\x80-\xBF]   # U+100000 - U+10FFFF
    )
/x;

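use Carp qw(croak);

sub looks_like_double_encoded_utf8 {
    @_ == 1 || croak(q/Usage: looks_like_double_encoded_utf8(string)/);
    my $count = do {
        # &utf8::is_utf8 (no parentheses) is called with the current @_
        if (&utf8::is_utf8) {
            # character string: the doubled octets show up as code points
            () = $_[0] =~ /$UTF8/og;
        }
        else {
            # octet string
            () = $_[0] =~ /$UTF8_double_encoded/og;
        }
    };
    return $count;
}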
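
The reason for the two branches is easier to see with a small example. This is a rough sketch, not the routine itself: $double_2 and $utf8_2 are made-up names for the two-octet alternatives of $UTF8_double_encoded and $UTF8, and it assumes the character string was produced by decoding the damaged octets once.

```perl
use strict;
use warnings;
use Encode qw(decode);

# double encoded U+00E9 as raw octets
my $octets = "\xC3\x83\xC2\xA9";

# the same data decoded once: the characters U+00C3 U+00A9
my $chars = decode('UTF-8', $octets);

# two-octet alternatives only, to keep the sketch short
my $double_2 = qr/ \xC3 [\x82-\x9F] \xC2 [\x80-\xBF] /x;
my $utf8_2   = qr/ [\xC2-\xDF] [\x80-\xBF] /x;

# octet string: look for the doubled form
my $n_octets = () = $octets =~ /$double_2/g;

# character string: look for code points that spell out UTF-8 octets
my $n_chars = () = $chars =~ /$utf8_2/g;

printf "octets: flagged=%d matches=%d\n", utf8::is_utf8($octets) ? 1 : 0, $n_octets;
printf "chars:  flagged=%d matches=%d\n", utf8::is_utf8($chars)  ? 1 : 0, $n_chars;
```

Both strings report one match, each through the pattern that fits its kind of string.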


Before developing the above I did have a look at the following modules. They didn't suit my needs because of a required audit process, but they may suit yours:


--chansen

2 Comments

First comment!

Where is $UTF8_double_encoded? And what's the purpose of if (&utf8::is_utf8)?

About Christian Hansen

I blog about Perl.