October 2010 Archives

What if sv_utf8_upgrade() used heuristic encoding?

By Christian Hansen on October 3, 2010 12:59 PM

Heuristic: decode as native unless the string is well-formed UTF-X:

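sub heuristic_utf8_upgrade {
    # utf8::decode() returns false unless the string is well-formed UTF-X;
    # in that case treat the string as native and upgrade it instead.
    utf8::upgrade($_[0])
      unless utf8::decode($_[0]);
    return !!0;
}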
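To see what the heuristic does in practice, here is a minimal sketch that runs it on two byte strings. The variable names `$utf8` and `$native` are my own, and the example assumes an ASCII platform where the native encoding is Latin-1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same heuristic as above, repeated so the sketch is self-contained.
sub heuristic_utf8_upgrade {
    utf8::upgrade($_[0])
      unless utf8::decode($_[0]);
    return !!0;
}

# "\xC3\xA9" is the well-formed UTF-8 encoding of U+00E9,
# so the heuristic decodes it to a single character.
my $utf8 = "\xC3\xA9";
heuristic_utf8_upgrade($utf8);

# "\xE9" on its own is not well-formed UTF-8, so it is
# treated as native (assumed Latin-1 here) and merely upgraded.
my $native = "\xE9";
heuristic_utf8_upgrade($native);

printf "%d U+%04X\n", length $utf8,   ord $utf8;      # 1 U+00E9
printf "%d U+%04X\n", length $native, ord $native;    # 1 U+00E9
```

Both inputs end up as the same one-character string, which is the point of the heuristic: the caller no longer has to know which of the two encodings the bytes were in.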

Here is some code to play with:

#!/usr/bin/perl
use strict;
use warnings;

{
    package encoding::heuristic;

    our $Encoding;

    BEGIN {
        require Encode;
        $Encoding = Encode::find_encoding('utf8');
    }

    sub import {
        ${^ENCODING} = bless \my $x, __PACKAGE__;…


Coping with double encoded UTF-8

By Christian Hansen on October 2, 2010 8:46 PM

A few months ago a client asked if I could help them with a "double encoded UTF-8 data problem": they had managed to store several GB of data with "corrupted" UTF-8 (technically it's not corrupt UTF-8, since it's well-formed UTF-8). During the process I developed several regexes that I would like to share and that may prove useful to you someday.

Because UTF-8 uses prefix codes, it's easy to spot a double encoded UTF-8 sequence: the prefix code is within the range of …
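As a rough illustration of the problem (not the regexes from this post), the sketch below checks a whole string with the Encode module rather than matching individual sequences. The helper name `fix_double_encoded_utf8` is hypothetical, and it assumes the double encoding happened via ISO-8859-1 and affected the entire string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns repaired UTF-8 octets, or the input
# unchanged when it does not look double encoded.
sub fix_double_encoded_utf8 {
    my ($octets) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # First pass: the input itself must be well-formed UTF-8.
    my $chars = eval { Encode::decode('UTF-8', $octets, $check) };
    return $octets unless defined $chars;

    # Double encoding via ISO-8859-1 only yields code points below U+0100.
    return $octets if $chars =~ /[^\x00-\xFF]/;

    # Pure ASCII is identical in both forms, nothing to repair.
    return $octets unless $chars =~ /[\x80-\xFF]/;

    # Map every character back to the byte with the same value.
    my $inner = Encode::encode('ISO-8859-1', $chars);

    # Second pass: if those bytes are well-formed UTF-8 as well,
    # assume the input was encoded twice.
    my $ok = eval { Encode::decode('UTF-8', $inner, $check); 1 };
    return $ok ? $inner : $octets;
}

# U+00E9 encoded twice: C3 A9 became C3 83 C2 A9.
my $fixed = fix_double_encoded_utf8("\xC3\x83\xC2\xA9");
print join(' ', map { sprintf '%02X', ord } split //, $fixed), "\n";    # C3 A9
```

A whole-string check like this can misfire on text that legitimately contains a sequence such as "Ã©", and it cannot handle data where only some sequences were double encoded, which is where matching individual sequences is the better tool.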



About Christian Hansen

I blog about Perl.
