October 2010 Archives

What if sv_utf8_upgrade() used heuristic encoding?

Heuristic: decode as native unless it is well-formed UTF-X:

sub heuristic_utf8_upgrade {
    utf8::upgrade($_[0])
      unless utf8::decode($_[0]);
    return !!0;
}
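To see what the heuristic does in practice, here is a minimal sketch that repeats the sub so the snippet runs on its own, then feeds it two byte strings. The variable names `$utf8` and `$latin1` are illustrative only, and "native" is assumed to mean Latin-1, as on most non-EBCDIC platforms.

```perl
use strict;
use warnings;

# Same heuristic as above: try an in-place UTF-8 decode first; if the
# octets are not well-formed, fall back to a native (Latin-1) upgrade.
sub heuristic_utf8_upgrade {
    utf8::upgrade($_[0])
      unless utf8::decode($_[0]);
    return !!0;
}

# Two octets that form well-formed UTF-8 for U+00E9:
# the string is decoded to a single character.
my $utf8 = "\xC3\xA9";
heuristic_utf8_upgrade($utf8);
printf "%d char(s), U+%04X\n", length($utf8), ord($utf8);

# A lone 0xE9 octet is not well-formed UTF-8: utf8::decode() fails and
# leaves it untouched, so it is upgraded as the native character U+00E9.
my $latin1 = "\xE9";
heuristic_utf8_upgrade($latin1);
printf "%d char(s), U+%04X\n", length($latin1), ord($latin1);
```

Both inputs end up as the same one-character string, which is the point of the heuristic: the caller does not need to know which of the two encodings the octets were in.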

Here is some code to play with:

#!/usr/bin/perl
use strict;
use warnings;

{
package encoding::heuristic;

our $Encoding;

BEGIN {
require Encode;
$Encoding = Encode::find_encoding('utf8');
}

sub import {
${^ENCODING} = bless \my $x, __PACKAGE__;…

Coping with double encoded UTF-8

A few months ago a client asked if I could help them with a "double encoded UTF-8 data problem". They had managed to store several GB of data with "corrupted" UTF-8 (technically it's not corrupt UTF-8, since it's well-formed UTF-8). During the process I developed several regexes that I would like to share; they may prove useful to you someday.

Because UTF-8 uses prefix codes, it's easy to spot a double encoded UTF-8 sequence: the prefix code is within the range of ="h…
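As an illustration of the general idea (not the regexes from the post, which are cut off in this excerpt), here is a minimal sketch. It assumes the double encoding happened because UTF-8 octets were misread as Latin-1 and then encoded as UTF-8 a second time. Under that assumption every original lead octet 0xC2..0xF4 turns into 0xC3 followed by 0x82..0xB4, and every continuation octet 0x80..0xBF turns into 0xC2 followed by the same octet. The names `$double` and `fix_double_encoded` are hypothetical.

```perl
use strict;
use warnings;

# Matches one double encoded sequence, on octets (assumed Latin-1 misread).
my $double = qr/
    \xC3 [\x82-\x9F] (?: \xC2 [\x80-\xBF] )      # was a 2-octet sequence
  | \xC3 [\xA0-\xAF] (?: \xC2 [\x80-\xBF] ){2}   # was a 3-octet sequence
  | \xC3 [\xB0-\xB4] (?: \xC2 [\x80-\xBF] ){3}   # was a 4-octet sequence
/x;

# Hypothetical helper: strip one layer of encoding from each match.
sub fix_double_encoded {
    my ($octets) = @_;
    $octets =~ s{($double)}{
        my $seq = $1;
        utf8::decode($seq);     # octets -> characters U+0080..U+00F4
        utf8::downgrade($seq);  # characters -> one octet each
        $seq;
    }ge;
    return $octets;
}

# "é" (U+00E9) is C3 A9 in UTF-8 and C3 83 C2 A9 when double encoded.
print unpack("H*", fix_double_encoded("\xC3\x83\xC2\xA9")), "\n";  # prints c3a9
```

This sketch does not check for overlong forms or surrogates, and it cannot tell a double encoded "é" apart from text that legitimately contains "Ã©", since both produce the same octets. Treat a match as a strong hint rather than proof, and test on a copy of the data first.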

About Christian Hansen

I blog about Perl.