Coping with double encoded UTF-8
A few months ago a client asked if I could help them with a "double encoded UTF-8 data problem": they had managed to store several GB of data with "corrupted" UTF-8 (technically it's not corrupt UTF-8, since it's well-formed UTF-8). During the process I developed several regexes that I would like to share and that may prove useful to you someday.
Because UTF-8 uses prefix codes, a double encoded UTF-8 sequence is easy to spot: the original lead byte shows up as a code point within the range U+00C2 to U+00F4, followed by one or more continuation bytes that show up as code points within the range U+0080 to U+00BF, and each of those code points has itself been encoded as UTF-8.
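To make the byte pattern concrete, here is a minimal sketch of how a double encoding can arise, using the core Encode module. The sample character and the hex_bytes helper are hypothetical and only there for illustration; they are not part of the original routine.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex octets
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

# U+00E9 encoded once: lead byte C3, continuation byte A9
my $once = encode('UTF-8', "\x{E9}");

# Simulate the accident: each octet of $once is treated as a Latin-1
# character (U+00C3, U+00A9) and encoded a second time
my $twice = encode('UTF-8', $once);

print hex_bytes($once),  "\n";   # C3 A9
print hex_bytes($twice), "\n";   # C3 83 C2 A9
```

The lead byte C3 became the pair C3 83 and the continuation byte A9 became C2 A9, which is the shape the regexes below look for.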
The above translated to a regex:
# matches a "double" encoded UTF-8 sequence within the range U+0080 - U+10FFFF
my $UTF8_double_encoded = qr/
    \xC3 (?: [\x82-\x9F] \xC2 [\x80-\xBF]                                   # U+0080 - U+07FF
         |  \xA0        \xC2 [\xA0-\xBF] \xC2 [\x80-\xBF]                   # U+0800 - U+0FFF
         | [\xA1-\xAC] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]                   # U+1000 - U+CFFF
         |  \xAD        \xC2 [\x80-\x9F] \xC2 [\x80-\xBF]                   # U+D000 - U+D7FF
         | [\xAE-\xAF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]                   # U+E000 - U+FFFF
         |  \xB0        \xC2 [\x90-\xBF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+010000 - U+03FFFF
         | [\xB1-\xB3] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+040000 - U+0FFFFF
         |  \xB4        \xC2 [\x80-\x8F] \xC2 [\x80-\xBF] \xC2 [\x80-\xBF]  # U+100000 - U+10FFFF
    )
/x;
If we only want to match within the Latin-1 Supplement range:
# matches a "double" encoded UTF-8 sequence within the range U+0080 - U+00FF
# (only the lead bytes C2 and C3 occur in this range, double encoded as C3 82 and C3 83)
my $UTF8_double_encoded_latin1 = qr/
    \xC3 [\x82-\x83] \xC2 [\x80-\xBF]   # U+0080 - U+00FF
/x;
Example of usage:
sub decode_double_encoded($) {
    local $_ = shift;
    s/[\xC2-\xC3]//g;                          # drop the octets added by the second encoding
    s/\A(.)/chr(0xC0 | (ord($1) & 0x3F))/e;    # restore the original lead byte
    $_;
}

# following assumes $fh returns octets (binmode() / ":raw")
while (<$fh>) {
    s/($UTF8_double_encoded)/decode_double_encoded($1)/geo;
    print {$output} $_;
}
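As a worked example of the repair step, the sketch below runs the same two substitutions on a single sample string. It assumes a trimmed-down pattern that covers only the first alternative (U+0080 - U+07FF), and the names $two_octet, $broken and $fixed are hypothetical.

```perl
use strict;
use warnings;

# Trimmed-down pattern: only the first alternative (U+0080 - U+07FF)
my $two_octet = qr/ \xC3 [\x82-\x9F] \xC2 [\x80-\xBF] /x;

sub decode_double_encoded {
    local $_ = shift;
    s/[\xC2-\xC3]//g;                          # "\xC3\x83\xC2\xA9" -> "\x83\xA9"
    s/\A(.)/chr(0xC0 | (ord($1) & 0x3F))/e;    # 0x83 -> 0xC3, giving "\xC3\xA9"
    return $_;
}

# "caf" followed by U+00E9 that has been encoded twice
my $broken = "caf\xC3\x83\xC2\xA9";

(my $fixed = $broken) =~ s/($two_octet)/decode_double_encoded($1)/ge;

print $fixed eq "caf\xC3\xA9" ? "repaired\n" : "unchanged\n";   # repaired
```

Removing the C2 and C3 octets is safe inside a match because every other octet of a matched sequence lies in the range 80 to BF.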
A few days after I fixed the client's data they asked if I could develop a routine that would detect double encoded UTF-8 data. I came up with this:
use Carp qw(croak);

# matches a well-formed UTF-8 encoded sequence within the range U+0080 - U+10FFFF
my $UTF8 = qr/
    (?: [\xC2-\xDF] [\x80-\xBF]                           # U+0080 - U+07FF
      |  \xE0        [\xA0-\xBF] [\x80-\xBF]               # U+0800 - U+0FFF
      | [\xE1-\xEC] [\x80-\xBF] [\x80-\xBF]               # U+1000 - U+CFFF
      |  \xED        [\x80-\x9F] [\x80-\xBF]               # U+D000 - U+D7FF
      | [\xEE-\xEF] [\x80-\xBF] [\x80-\xBF]               # U+E000 - U+FFFF
      |  \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]   # U+010000 - U+03FFFF
      | [\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]   # U+040000 - U+0FFFFF
      |  \xF4        [\x80-\x8F] [\x80-\xBF] [\x80-\xBF]   # U+100000 - U+10FFFF
    )
/x;

sub looks_like_double_encoded_utf8 {
    @_ == 1 || croak(q/Usage: looks_like_double_encoded_utf8(string)/);
    my $count = do {
        # &utf8::is_utf8 is called with the current @_, i.e. the string
        if (&utf8::is_utf8) {
            # character string: look for UTF-8 shaped runs of characters
            () = $_[0] =~ /$UTF8/og;
        }
        else {
            # octet string: look for doubly encoded octet sequences
            () = $_[0] =~ /$UTF8_double_encoded/og;
        }
    };
    return $count;
}
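The two branches can be illustrated with a small sketch that holds the same double encoded data once as octets and once as decoded characters. It assumes the core Encode module and uses compact patterns covering only U+0080 - U+07FF; count_double, $utf8_2 and $double_2 are hypothetical names, not part of the routine above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two-octet alternatives only, trimmed from the full patterns
my $utf8_2   = qr/ [\xC2-\xDF] [\x80-\xBF] /x;
my $double_2 = qr/ \xC3 [\x82-\x9F] \xC2 [\x80-\xBF] /x;

sub count_double {
    my ($s) = @_;
    my $count;
    if (utf8::is_utf8($s)) {
        # character string: the damage shows up as U+00C3 U+00A9
        $count = () = $s =~ /$utf8_2/g;
    }
    else {
        # octet string: the damage shows up as C3 83 C2 A9
        $count = () = $s =~ /$double_2/g;
    }
    return $count;
}

my $octets = "\xC3\x83\xC2\xA9";         # U+00E9 encoded twice, raw octets
my $chars  = decode('UTF-8', $octets);   # decoded once: U+00C3 U+00A9

printf "octets: %d, characters: %d\n",
    count_double($octets), count_double($chars);   # octets: 1, characters: 1
```

Both forms report one suspicious sequence, which is why the routine needs a pattern for each representation.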
Before developing the above I did have a look at the following modules. They didn't suit my needs because of a required audit process, but they may suit yours:
--chansen
First comment!
Where is $UTF8_double_encoded? And what's the purpose of if (&utf8::is_utf8)?
$UTF8_double_encoded is defined in the first code snippet. &utf8::is_utf8 is used to determine whether to match Perl's internal representation of wide characters or the octet representation of UTF-8.
--
chansen