Reading META.yml when it's not UTF-8
Part of the 3% of the distributions I couldn't index with MyCPAN had encoding issues. YAML is supposed to be UTF-8, but I don't always get UTF-8 when I generate a META.yml for distributions that don't have one. I guess I could do the work to poke around in MakeMaker, etc., to convert all the values before I generate the META.yml, but um, no. Not only that, not all of the META.yml files already in the dists are UTF-8. Remember, however, this is a very small part of BackPAN: about 700 distributions out of 140,000 (or about 1/7th of my problem cases).
A couple hundred distros have Makefile.PL files encoded as Latin-1 in a way that matters. If the text isn't collapsible to ASCII, the META.yml ends up with Latin-1 in it. Some YAML parsers refuse to deal with that.
I'm not particularly satisfied with this solution: I assume the file is UTF-8, which is mostly true, and if the YAML loader barfs on it, I try to load it as Latin-1 and convert it.
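# Assume UTF-8 first; fall back to Latin-1 if the YAML loader rejects it
sub _load_meta_yml { $_[0]->_try_utf8( $_[1] ) || $_[0]->_try_latin1( $_[1] ) }

sub _try_utf8 { $_[0]->_load_yaml( $_[0]->_load_file( 'utf8', $_[1] ) ) }

sub _try_latin1 {
    require Encode;
    # from_to() converts the octets in place, so it needs a variable
    Encode::from_to( my $utf8 = $_[0]->_load_file( 'bytes', $_[1] ), 'latin1', 'utf8' );
    $_[0]->_load_yaml( $utf8 );
}

# Slurp the whole file through the named PerlIO layer
sub _load_file {
    $logger->debug( "Trying to load $_[2] as $_[1]" );
    local $/;
    open my $f, "<:$_[1]", $_[2] or do {
        $logger->error( "Could not open $_[2]: $!" );
        return;
    };
    my $content = <$f>;
    return $content;
}

# Returns the parsed data, or a false value (after logging) on failure
sub _load_yaml {
    require YAML::Syck;
    my( $caller ) = ( caller(1) )[3];
    my $yaml = eval { YAML::Syck::Load( $_[1] ) } or
        $logger->error( "$caller: $@" );
    $yaml;
}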
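The code above waits for the YAML loader to complain before it falls back. A variation, sketched here purely as an illustration and not something MyCPAN does, is to make the same decision at the decoding step with the core Encode module: try a strict UTF-8 decode and, if that dies, decode the same octets as Latin-1. The helper name decode_meta_octets is hypothetical, and whether the resulting character string has to be encoded back to UTF-8 octets before parsing depends on the YAML loader in use.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of the original code: take raw octets and
# return a character string plus the name of the encoding that worked.
sub decode_meta_octets {
    my( $octets ) = @_;

    # FB_CROAK makes decode() die on malformed UTF-8; LEAVE_SRC keeps
    # $octets untouched so the fallback can still read it.
    my $string = eval {
        Encode::decode( 'UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return ( $string, 'UTF-8' ) if defined $string;

    # Every octet sequence is valid Latin-1, so this decode cannot fail.
    return ( Encode::decode( 'ISO-8859-1', $octets ), 'ISO-8859-1' );
}

# A lone 0xE9 octet is not valid UTF-8, so this takes the Latin-1 branch.
my( $string, $encoding ) = decode_meta_octets( "author: Andr\xE9\n" );
```

The same caveat applies as with the original fallback: a file in some other single-byte encoding still decodes "successfully" as Latin-1, just with the wrong characters.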
I liked YAML::XS for a bit, but it has a problem with the utf8 pragma that messed up some other stuff I was handling. I don't quite understand it, but LibYAML seems to be fine if everything is always UTF-8, and not so fine otherwise.