Unicode is 20++ years old and still a problem

Just did a quick hack to read product data out of an old shop site and import it into a new one (rough sketch below):

- wget -r
- File::Find
- Mojo::DOM for parsing
- Text::CSV::Slurp for the result
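
The whole thing fits in a page of Perl, roughly like this (a sketch rather than the actual script: the mirror directory and the CSS selectors are invented, and the pages are assumed to be UTF-8):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
use Mojo::DOM;
use Text::CSV::Slurp;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

my @products;

find(sub {
    return unless -f && /\.html?$/;
    open my $fh, '<:raw', $_ or die "open $_: $!";
    my $html = decode('UTF-8', do { local $/; <$fh> });
    my $dom  = Mojo::DOM->new($html);

    # selectors invented for the sketch
    my ($name, $price) = map { $dom->at($_) } '.product-name', '.price';
    return unless $name && $price;

    push @products, { name => $name->text, price => $price->text };
}, 'mirror.example.com');   # the directory wget -r created

print Text::CSV::Slurp->create(input => \@products);
```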

After 11 minutes of crunching through 14K pages, I got a bad surprise:

One file had non-ASCII characters in its name, and File::Find does not use char-mode (it returns filenames as raw bytes). I had forgotten about this, and Text::CSV::Slurp crashed.
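
The mismatch is easy to see in isolation (a sketch assuming a UTF-8 filesystem; the filename is invented):

```perl
use strict;
use warnings;
use Encode qw(decode);

# On a UTF-8 filesystem, a file named "café.html" reaches the wanted
# sub as raw bytes, not characters:
my $from_find = "caf\xC3\xA9.html";

print length($from_find), "\n";                   # 10: the "é" is two bytes
print length(decode('UTF-8', $from_find)), "\n";  # 9: one character after decoding
```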

Why the hell are there so many CPAN modules still ignoring Unicode?

7 Comments

Maybe the author of the software hasn't bumped into your usage pattern? If you want to make them aware of what you found, try filing a bug report (with a failing test, if possible :).

Also, it might be worth considering that people who publish their software are often quite busy. And while they may not have the time to think of everything up front (and therefore don't always publish "perfect" code), they frequently appreciate comments and feedback that help them improve their code over time.

I'm positive BABF would consider your patch, failing test or bug report. :)

https://rt.cpan.org/Public/Dist/Display.html?Name=Text-CSV-Slurp

Most likely, your text files and your file system are not using the same encoding. File systems use a variety of encodings, and I'm not aware of any successful attempt to unify them (see Joliet for an example).

This is mostly an OS issue and there is very little that Perl can do, as guessing the file system encoding seems to be a worse cure than explicitly de-/encoding the filenames.

There seems to be some kind of consensus for Linux to use UTF-8 for filenames / directory entries, but all the world is not Linux(ish).

Further reading might be:

You can work around this and find any file with File::Find: just convert file/dir names to/from bytes before/after passing them to File::Find (see the sketch below).
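
Something like this, say (a sketch; the directory name is invented and the filesystem is assumed to store names as UTF-8):

```perl
use strict;
use warnings;
use File::Find;
use Encode qw(encode decode);

my $fs_enc    = 'UTF-8';                # assumption: UTF-8 filesystem
my $start_dir = 'mirror.example.com';   # character string, invented name

my @files;
find(sub {
    return unless -f;
    # $File::Find::name arrives as bytes; decode it back to characters
    push @files, decode($fs_enc, $File::Find::name, Encode::FB_CROAK);
}, encode($fs_enc, $start_dir));        # hand File::Find bytes, not characters
```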

How is File::Find supposed to know how file names are encoded on your particular filesystem? (Hint: not all filesystems store names as UTF-8 Unicode.)

That said, I don't see what Text::CSV::Slurp is doing to filenames that would cause a problem. If it gets octets from File::Find, it looks like it's passing them right back to an open call.

It's not opening files in UTF-8 mode, but that's sort of a separate problem.

> File::Find could guess the encoding via e.g. Encode::Locale with an accuracy of 99.99%.
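
For reference, that suggestion presumably amounts to something like this (a sketch; the filename is invented, and the guess is only as good as the match between locale and filesystem):

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Encode::Locale;   # registers a 'locale_fs' alias from the environment at load time

# hypothetical byte filename, as File::Find would return it
my $bytes = "caf\xC3\xA9.html";

my $chars = decode('locale_fs', $bytes);   # guess the encoding from the locale
my $again = encode('locale_fs', $chars);   # back to bytes for open() etc.
```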

So you want it to be broken by design, reading and writing garbage from some filesystems?
Filesystem encodings should never be detected via the locale!
Your proposed design leads to data loss.

About Helmut Wollmersdorfer

I blog about Perl.