\r, \n, and ... \R ?

It's common knowledge that on Windows a line of text generally ends with the two-character sequence \r\n, and on a POSIX system it ends with \n.

Less well known is Perl's support for the escape sequence '\R'.

What's \R?

It's definitely not the inverse of \r.

It's a pattern (so for now it's only useful in regexes) that matches a newline as defined by TR-13, the Unicode Consortium's guidelines for what counts as a newline. In a regular expression it matches \r, \n, \r\n, or any of the few other character sequences used to represent newlines, so you don't have to remember them. It's worth noting that it matches a character sequence, not a single character, so it doesn't really make sense in a bracketed character class.
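To make that concrete, here is a minimal sketch (the sample string and variable names are made up for illustration) that splits a string with mixed line endings on \R, and compares it with a naive character class. It needs Perl 5.10 or later, where \R was introduced.

```perl
use strict;
use warnings;

# Made-up sample: four "lines" separated by three different conventions,
# \r\n (Windows), \r (classic Mac OS), and \n (Unix).
my $text = "alpha\r\nbeta\rgamma\ndelta";

# \R treats \r\n as one newline, so no empty field shows up
# between "alpha" and "beta".
my @lines = split /\R/, $text;
print scalar(@lines), " lines: @lines\n";

# A bracketed class only ever matches one character at a time, so it
# sees \r\n as two separators and yields an extra empty field.
my @naive = split /[\r\n]/, $text;
print scalar(@naive), " fields with [\\r\\n]\n";
```

The extra empty field in the second split is the practical reason \R has to be a sequence-matching escape, not something you can drop into a character class.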

There has been talk on perl5-porters over the course of 5.20's development cycle about how to extend that beyond regular expressions into stream processing. Nothing has happened in core yet, but the discussion inspired a module I've posted to CPAN: PerlIO::unicodeeol. The module adds a PerlIO layer that converts anything matching \R into a plain \n on input, so that text can be processed uniformly regardless of whether its lines end in \r, \r\n, \n, or any of the rest. To use it, add ":unicodeeol" with binmode or when opening the file (perldoc PerlIO has far more details); from then on, everything Unicode considers a line ending looks like \n.

The process is not reversible, though, so it is not suitable if you need to preserve the original line endings.
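As a rough illustration of the effect, and not the module's actual implementation, the same normalization can be approximated on an in-memory string with a plain substitution. The helper name normalize_eol is hypothetical, and the sketch ignores details a real stream layer has to deal with, such as a \r\n pair split across two buffer reads.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PerlIO::unicodeeol: collapse every
# newline sequence that \R recognizes into a single \n.
sub normalize_eol {
    my ($text) = @_;
    (my $copy = $text) =~ s/\R/\n/g;   # work on a copy, leave the input alone
    return $copy;
}

my $windows = normalize_eol("one\r\ntwo\r\n");
my $old_mac = normalize_eol("one\rtwo\r");
my $unix    = normalize_eol("one\ntwo\n");

# All three results are now the same string, which is also why the
# original line endings cannot be recovered afterwards.
print "identical\n" if $windows eq $old_mac && $old_mac eq $unix;
```

Once the three inputs have collapsed into one string, nothing records which convention each one started with, so the conversion only goes one way.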

Thanks to Karl Williamson for his work on Unicode in Perl, and Audrey Tang, whose PerlIO::eol I cribbed from.

4 Comments

cool!

Do you have a link to the related p5p discussions on this? I'm curious to see what technological hurdles have been identified for having something like this in core.

However hard it may be, I do hope that one day we get a pause-and-ask-for-more mechanism in the regexp engine…

As in Audrey Tang's PerlIO::eol, I would suggest that instead of always replacing \R with \n, you let the user specify which platform's line ending to substitute. Like Audrey's module, allow LF/CR/CRLF/NATIVE (or Unix/Mac/Windows/Native). The default would be LF (Unix) to stay backward compatible.

About Peter Martini

I like thinking about machines, especially virtual machines like Perl's VM, the Java VM, and kvm/qemu.