Unicode is 20++ years old and still a problem

Just did a quick hack to read product data out of an old shop site and import it into a new one (rough sketch below):

- wget -r
- File::Find
- Mojo::DOM for parsing
- Text::CSV::Slurp for the result
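
The whole thing fits in a page of Perl, roughly like this (a sketch rather than the actual script: the mirror directory and the CSS selectors are invented, and the pages are assumed to be UTF-8):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
use Mojo::DOM;
use Text::CSV::Slurp;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

my @products;

find(sub {
    return unless -f && /\.html?$/;
    open my $fh, '<:raw', $_ or die "open $_: $!";
    my $html = decode('UTF-8', do { local $/; <$fh> });
    my $dom  = Mojo::DOM->new($html);

    # selectors invented for the sketch
    my ($name, $price) = map { $dom->at($_) } '.product-name', '.price';
    return unless $name && $price;

    push @products, { name => $name->text, price => $price->text };
}, 'mirror.example.com');   # the directory wget -r created

print Text::CSV::Slurp->create(input => \@products);
```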

After 11 minutes of crunching through 14K pages, I got a bad surprise:

One file had non-ASCII characters in its name, and File::Find does not use char-mode (it returns filenames as raw bytes). I had forgotten about this, and Text::CSV::Slurp crashed.
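
The mismatch is easy to see in isolation (a sketch assuming a UTF-8 filesystem; the filename is invented):

```perl
use strict;
use warnings;
use Encode qw(decode);

# On a UTF-8 filesystem, a file named "café.html" reaches the wanted
# sub as raw bytes, not characters:
my $from_find = "caf\xC3\xA9.html";

print length($from_find), "\n";                   # 10: the "é" is two bytes
print length(decode('UTF-8', $from_find)), "\n";  # 9: one character after decoding
```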

Why the hell are there so many CPAN modules still ignoring Unicode?

7 Comments

Maybe the author of the software hasn't bumped into your usage pattern? If you want to make them aware of what you found, try filing a bug report (with a failing test, if possible :).

Also, it might be worth considering that people who publish their software are often quite busy. And while they may not have the time to think of everything up front (and therefore don't always publish "perfect" code), they frequently appreciate comments and feedback that help them improve their code over time.

I'm positive BABF would consider your patch, failing test or bug report. :)

https://rt.cpan.org/Public/Dist/Display.html?Name=Text-CSV-Slurp

Most likely, your text files and your file system are not using the same encoding. File systems use a variety of encodings, and I'm not aware of any successful attempt to unify them (see Joliet for an example).

This is mostly an OS issue and there is very little that Perl can do, as guessing the file system encoding seems to be a worse cure than explicitly de-/encoding the filenames.

There seems to be some kind of consensus for Linux to use UTF-8 for filenames / directory entries, but all the world is not Linux(ish).

Further reading might be:

You can work around this and find any file with File::Find: just convert file/dir names to/from bytes before/after passing them to File::Find (see the sketch below).
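
Something like this, say (a sketch; the directory name is invented and the filesystem is assumed to store names as UTF-8):

```perl
use strict;
use warnings;
use File::Find;
use Encode qw(encode decode);

my $fs_enc    = 'UTF-8';                # assumption: UTF-8 filesystem
my $start_dir = 'mirror.example.com';   # character string, invented name

my @files;
find(sub {
    return unless -f;
    # $File::Find::name arrives as bytes; decode it back to characters
    push @files, decode($fs_enc, $File::Find::name, Encode::FB_CROAK);
}, encode($fs_enc, $start_dir));        # hand File::Find bytes, not characters
```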

How is File::Find supposed to know how file names are encoded on your particular filesystem? (Hint: not all filesystems store names as UTF-8 Unicode.)

That said, I don't see what Text::CSV::Slurp is doing to filenames that would cause a problem. If it gets octets from File::Find, it looks like it's passing them right back to an open call.

It's not opening files in UTF-8 mode, but that's sort of a separate problem.

> File::Find could guess the encoding via e.g. Encode::Locale with an accuracy of 99.99%.
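
For reference, that suggestion presumably amounts to something like this (a sketch; the filename is invented, and the guess is only as good as the match between locale and filesystem):

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Encode::Locale;   # registers a 'locale_fs' alias from the environment at load time

# hypothetical byte filename, as File::Find would return it
my $bytes = "caf\xC3\xA9.html";

my $chars = decode('locale_fs', $bytes);   # guess the encoding from the locale
my $again = encode('locale_fs', $chars);   # back to bytes for open() etc.
```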

So you want it to be broken by design, reading and writing garbage from some filesystems?
Filesystem encodings should never be detected via the locale!
Your proposed design leads to data loss.

About Helmut Wollmersdorfer

I blog about Perl.