UTF-16 and Windows CRLF, oh my

I recently had to do some quick search/replace on a bundle of Windows XML files. They were all encoded as UTF-16LE, with the Windows \r\n line endings encoded as 0D 00 0A 00.

Perl handles UTF-16LE just fine, and on Windows it handles CRLF line endings out of the box. The problem is that the default CRLF translation happens too close to the filehandle, on the wrong side of the Unicode translation: it works on raw bytes, where a UTF-16LE line ending is 0D 00 0A 00 rather than 0D 0A. The fix is to use the PerlIO layers :raw:encoding(UTF-16LE):crlf. The ":raw" prevents the default CRLF translation from happening at the byte level, the ":encoding(UTF-16LE)" layer translates between characters and the encoded bytes, and the final ":crlf" handles the line endings on the character side of the encoding layer, after decoding on input and before encoding on output.
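To make the layer order concrete, here is a minimal sketch that round-trips a tiny UTF-16LE file through that stack. The sample text and the temp file are made up for illustration, and the data is kept to plain ASCII on purpose; treat it as a demonstration of the idea rather than a tested recipe for every perl version.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# The layer stack from the text: no default translation at the byte level,
# then UTF-16LE decoding, then CRLF translation on the decoded characters.
my $LAYERS = ':raw:encoding(UTF-16LE):crlf';

# Made-up sample: two lines with CRLF endings, encoded as UTF-16LE bytes.
my $sample_bytes = encode('UTF-16LE', "old-dir\r\nnext\r\n");

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;                      # write the sample bytes untouched
print {$tmp_fh} $sample_bytes;
close $tmp_fh or die "close: $!";

# Read through the layer stack; lines should arrive ending in a bare "\n".
open my $in, "<$LAYERS", $path or die "open for read: $!";
my @lines = <$in>;
close $in;

s/old-dir/new-dir/gi for @lines;

# Write back through the same stack, then inspect the raw bytes on disk.
open my $out, ">$LAYERS", $path or die "open for write: $!";
print {$out} @lines;
close $out or die "close: $!";

open my $raw, '<:raw', $path or die "open raw: $!";
my $written_bytes = do { local $/; <$raw> };
close $raw;

# Hex dump of the result, one byte per pair of digits.
print join(' ', unpack('(H2)*', $written_bytes)), "\n";
```

In the hex dump, each line ending should show up as 0d 00 0a 00, the same four bytes the original files use.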

Knowing that is half the battle. The other half is applying those layers. This was a one-time, quick-and-dirty command-line edit, along these lines:

perl -pi.bak -e "s/old-dir/new-dir/gi" file1.xml file2.xml file3.xml

There are two filehandles in play, ARGV and ARGVOUT. Since they are opened after perl interprets your code, you can use the "open" pragma on the command line:

perl -Mopen=IO,:raw:encoding(UTF-16LE) -pi.bak -e "s/old-dir/new-dir/gi" file1.xml file2.xml file3.xml

Alas, if you are piping or redirecting input and output, those filehandles (STDIN and STDOUT) will already be open before the open pragma can alter their behavior. You'll need to call binmode at BEGIN time. Here I use a trick, "-M5;code", to run some code before the line-looping begins. Perl turns -M5 into "use 5;", and whatever follows the semicolon rides along ahead of the -p loop:

some_command_emitting_UTF-16 | perl "-M5;binmode($_,':raw:encoding(UTF-16LE):crlf')for(STDIN,STDOUT)" -pe "s/old-dir/new-dir/gi" > new_file_in_UTF16_with_CRLF_line_endings.txt
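For anything longer than a one-liner, the same binmode call can live in a script. The sketch below is only an illustration: convert_stream is a hypothetical helper name, and two temp files stand in for the pipe and the redirect so the example can run on its own. In a real filter you would pass \*STDIN and \*STDOUT instead.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: re-layer handles that are already open, the way the
# one-liner does for STDIN and STDOUT, then run the substitution per line.
sub convert_stream {
    my ($in, $out) = @_;
    binmode($_, ':raw:encoding(UTF-16LE):crlf') for $in, $out;
    while (my $line = <$in>) {
        $line =~ s/old-dir/new-dir/gi;
        print {$out} $line;
    }
    return;
}

# Stand-in for piped input: a temp file holding made-up UTF-16LE bytes.
my ($src_fh, $src_path) = tempfile(UNLINK => 1);
binmode $src_fh;
print {$src_fh} encode('UTF-16LE', "path=OLD-DIR\r\nend\r\n");
close $src_fh or die "close: $!";

# Both handles are opened with no layers given, like STDIN and STDOUT.
open my $in, '<', $src_path or die "open: $!";
my ($dst_fh, $dst_path) = tempfile(UNLINK => 1);

convert_stream($in, $dst_fh);
close $in;
close $dst_fh or die "close: $!";

# Inspect the raw output bytes.
open my $raw, '<:raw', $dst_path or die "open raw: $!";
my $result_bytes = do { local $/; <$raw> };
close $raw;
print join(' ', unpack('(H2)*', $result_bytes)), "\n";
```

The point of the helper is simply that binmode has to run before the first read or write on each handle; after that, the loop body never needs to know about the encoding.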

Hope this post can prevent some hair-pulling going forward...

6 Comments

Beware of quoting pitfalls in a *nix shell like bash. Within double quotes, $_ would be interpolated by the shell rather than by perl.

perl "-M5;binmode($_,':raw:...')for ..." is probably better written as perl '-M5;binmode($_,":raw:...")for ...' in such a command line interpreter.

That applies if you happen to process files originating from a Windows environment somewhere else. On a Windows command line, the first type of quoting is required, of course.

The PERLIO environment variable may also be able to help you there.

This blog post saved my day! I was getting frustrated at my code for massaging UTF-16 XML files on Windows because I noticed an extra CR being added for some reason. Changed the file open to :raw:encoding(utf16) and it just worked. So easy. Came across the post on reddit. Thanks again!


About Yary

Programming for decades, and learning something new every day.