UTF-16 and Windows CRLF, oh my
I recently had to do some quick search/replace on a bundle of Windows XML files. They are all encoded as UTF-16LE, with the Windows \r\n line endings encoded as 0D 00 0A 00.
Perl handles UTF-16LE just fine, and it handles CRLF line endings on Windows out of the box. The problem is that the default CRLF translation happens too close to the filehandle, on the wrong side of the Unicode translation: it works on the raw bytes, where a UTF-16LE line ending is 0D 00 0A 00 rather than 0D 0A. The fix is to use the PerlIO layers :raw:encoding(UTF-16LE):crlf. The ":raw" prevents the default CRLF translation from happening at the byte level, the ":encoding(UTF-16LE)" part translates between characters and the encoded bytes, and the final ":crlf" handles the line endings on the character side of that translation.
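To see why the layer order matters, here is a minimal sketch that writes "a\n" through both orderings and dumps the resulting bytes. The helper name hex_written is made up for this illustration, and the first layer string is only my way of imitating the Windows default (CRLF translation below the encoding) on any platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through $layers, return the file's bytes as hex.
sub hex_written {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open my $out, ">$layers", $path or die "open $path: $!";
    print {$out} $text;
    close $out or die "close $path: $!";

    # Read the result back as raw bytes, with no translation at all.
    open my $raw, '<:raw', $path or die "open $path: $!";
    my $bytes = do { local $/; <$raw> };
    close $raw;

    return uc join ' ', unpack '(H2)*', $bytes;
}

# Assumption: this ordering mimics CRLF translation happening at the byte level.
my $wrong = hex_written(':raw:crlf:encoding(UTF-16LE)', "a\n");
# The ordering recommended in this post.
my $right = hex_written(':raw:encoding(UTF-16LE):crlf', "a\n");

print "crlf below encoding: $wrong\n";
print "crlf above encoding: $right\n";
```

With the CRLF layer below the encoding I would expect 61 00 0D 0A 00, which is no longer valid UTF-16LE, and with it above the encoding 61 00 0D 00 0A 00.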
Knowing that is half the battle. The other half is applying those layers. This was a one-time, quick-and-dirty command-line edit, along these lines:
perl -pi.bak -e "s/old-dir/new-dir/gi" file1.xml file2.xml file3.xml
There are two filehandles in play, ARGV and ARGVOUT. Since they are opened after perl interprets your code, you can use the "open" pragma on the command line:
perl -Mopen=IO,:raw:encoding(UTF-16LE):crlf -pi.bak -e "s/old-dir/new-dir/gi" file1.xml file2.xml file3.xml
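If the edit outgrows a one-liner, the same idea can be sketched as a script with explicit opens. This is not what I ran, just an illustration: replace_in_file is a made-up helper, the .bak suffix copies what -pi.bak does, and the fixture file at the bottom exists only so the sketch runs on its own.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Copy qw(copy);
use File::Temp qw(tempdir);

my $LAYERS = ':raw:encoding(UTF-16LE):crlf';

# Hypothetical helper: edit one file in place, keeping the original as "$path.bak".
sub replace_in_file {
    my ($path) = @_;
    copy($path, "$path.bak") or die "backup $path: $!";

    open my $in,  "<$LAYERS", "$path.bak" or die "read $path.bak: $!";
    open my $out, ">$LAYERS", $path       or die "write $path: $!";
    while (my $line = <$in>) {
        # $line ends in a plain "\n" here; :crlf restores "\r\n" on output.
        $line =~ s/old-dir/new-dir/gi;
        print {$out} $line;
    }
    close $in;
    close $out or die "close $path: $!";
}

# Demo fixture (made up): one UTF-16LE line with a CRLF ending.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/file1.xml";
open my $fixture, '>:raw', $file or die "create $file: $!";
print {$fixture} encode('UTF-16LE', qq{<path dir="Old-Dir"/>\r\n});
close $fixture or die "close $file: $!";

replace_in_file($file);
```

For real use, the fixture would go away and the helper would be called once per file name in @ARGV.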
Alas, if you are piping or redirecting input and output, those filehandles will already be open before the open pragma can alter their behavior. You'll need to call binmode at BEGIN time. Here I use a trick, "-M5;code", to run some code before the line-looping begins:
some_command_emitting_UTF-16 | perl "-M5;binmode($_,':raw:encoding(UTF-16LE):crlf')for(STDIN,STDOUT)" -pe "s/old-dir/new-dir/gi" > new_file_in_UTF16_with_CRLF_line_endings.txt
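Spelled out as a script, the binmode approach might look like the sketch below. The filter_utf16 helper is hypothetical. In a real pipe it would be called with \*STDIN and \*STDOUT. Here two in-memory handles, opened with no layers at all, stand in for them so the example runs without a pipe.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $LAYERS = ':raw:encoding(UTF-16LE):crlf';

# Hypothetical helper: retrofit the layers onto already-open handles, then filter.
sub filter_utf16 {
    my ($in, $out) = @_;
    binmode($_, $LAYERS) or die "binmode: $!" for $in, $out;
    while (my $line = <$in>) {
        $line =~ s/old-dir/new-dir/gi;
        print {$out} $line;
    }
}

# In a pipeline: filter_utf16(\*STDIN, \*STDOUT);
# Stand-ins for the demo: handles that are already open before binmode is called.
my $in_bytes  = encode('UTF-16LE', "cd old-dir\r\n");
my $out_bytes = '';
open my $in,  '<', \$in_bytes  or die "open input: $!";
open my $out, '>', \$out_bytes or die "open output: $!";

filter_utf16($in, $out);
close $in;
close $out or die "close output: $!";

print "output is ", length($out_bytes), " bytes of UTF-16LE\n";
```

The point is the same as in the one-liner: the handles exist before your code runs, so the layers have to be applied with binmode rather than at open time.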
Hope this post can prevent some hair-pulling going forward...
Beware of quoting pitfalls in a *nix shell like bash. Within double quotes, $_ would be interpolated by the shell rather than by perl.
In such a command line interpreter, perl "-M5;binmode($_,':raw:...')for ..." is better written as perl '-M5;binmode($_,":raw:...")for ...'.
That applies if you happen to be processing files that originated on Windows somewhere else. In a Windows command line, the first type of quoting is required, of course.
Right, and in fact, in the Windows cmd shell (the good old DOS-heritage one) no quoting at all is needed around the -M option, since in that shell the $ sign has no special meaning!
And yes, this tip is good outside of Windows too: it's good anytime you need to work with files that are neither ASCII nor UTF-8. With UTF-8, the -C options can get your disposable command line working quickly. As a bonus, it also detects UTF-16 or UTF-32, BUT the CRLFs in my files still tripped it up, requiring me to suss out these options.
The PERLIO environment variable may also be able to help you there.
This blog post saved my day! I was getting frustrated at my code for massaging UTF-16 XML files on Windows because I noticed an extra CR being added for some reason. Changed the file open to :raw:encoding(utf16) and it just worked. So easy. Came across the post on Reddit. Thanks again!
yay! Glad it's helping.