March 2019 Archives

What to do with doubly-broken UTF-8?

I recently got a few test reports like this:

www.cpantesters.org/cpan/report/49de90f8-4ec9-11e9-98fa-fc611f24ea8f

Although I've put all kinds of stuff in my test file:

https://metacpan.org/source/BKB/Lingua-JA-Moji-0.56/t/katakana2syllable.t#L9-13

CPAN Testers doesn't like that. How do I deal with these garbage characters?

The solution is to run utf8::decode on the string twice:

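#!/home/ben/software/install/bin/perl
use warnings;
use strict;
# "no utf8" means the string literals below are treated as plain bytes.
no utf8;
use FindBin '$Bin';
my $got = 'ック';
my $expected = 'ソー';

dec ($got);
dec ($expected);

exit;

# Decode the input as UTF-8 twice and print the result.
sub dec
{
    my ($in) = @_;
    # The first decode undoes one layer of UTF-8 encoding.
    utf8::decode ($in);
    # The second decode undoes the other layer.
    utf8::decode ($in);
    print "$in\n";
}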

Decoding twice turns the doubly-encoded garbage back into readable characters:

[ben@mikan] {14:28 25} moji 513 $ perl ~/oneoff/superdecode.pl 
ック
ソー
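
To see why two decodes are needed, here is a minimal sketch that creates the damage on purpose and then repairs it. The helper names double_encode and double_decode are made up for this illustration, and it assumes the garbage comes from UTF-8 bytes being misread as Latin-1 characters and encoded a second time, which may not be exactly what happens on the CPAN Testers side.

```perl
use strict;
use warnings;
use utf8;    # in this sketch the literals are character strings

# Hypothetical helper: reproduce the damage by UTF-8-encoding twice.
sub double_encode
{
    my ($text) = @_;
    utf8::encode ($text);    # characters -> UTF-8 bytes
    utf8::encode ($text);    # each byte treated as a Latin-1 character, encoded again
    return $text;
}

# Hypothetical helper: the same two decodes as in dec, but returning the result.
sub double_decode
{
    my ($bytes) = @_;
    utf8::decode ($bytes);   # undo the outer layer
    utf8::decode ($bytes);   # undo the inner layer
    return $bytes;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word ('ック', 'ソー') {
    my $broken = double_encode ($word);
    my $fixed = double_decode ($broken);
    my $status = $fixed eq $word ? 'restored' : 'not restored';
    print "$word: $status\n";
}
```

The second utf8::decode leaves a string alone if it contains characters above 0xFF, so text like this that was only encoded once survives. Latin-1 text that happens to look like UTF-8 bytes could still be changed by it, so this is best kept to cases where the double encoding is known.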

About Ben Bullock

Perl user since about 2006, I have also released some CPAN modules.