ruby vs perl / github cannot use utf8

My little ruby derived vm cannot parse unicode codepoints >0xffff yet.
perl can of course.

User sromanov found this little limitation and wanted to file a bug report about it.
Since my vm is hosted on github, github is written in ruby, and ruby has the same problem as my app, it turned into a catch-22.

See http://www.fileformat.info/info/unicode/utf8.htm

With 3-byte sequences you can represent at most 0xFFFF,
with 4-byte sequences at most 0x10FFFF.

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3+6+6+6 = 21 payload bits, with Unicode capped at 0x10FFFF (1,114,111)
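
To make the bit layout above concrete, here is a minimal sketch in ruby that packs a codepoint into UTF-8 bytes by hand. The helper name utf8_bytes is made up for this illustration and is not part of ruby's standard library; it also does not reject the surrogate range 0xD800..0xDFFF, which a strict encoder would.

```ruby
# Encode one Unicode codepoint as UTF-8 bytes by hand.
# utf8_bytes is a hypothetical helper, not a ruby built-in.
# Assumption: surrogates (0xD800..0xDFFF) are not rejected here.
def utf8_bytes(cp)
  raise ArgumentError, "codepoint out of range: #{cp}" unless (0..0x10FFFF).cover?(cp)

  if cp <= 0x7F
    [cp]                                  # 0xxxxxxx
  elsif cp <= 0x7FF
    [0xC0 | (cp >> 6),                    # 110xxxxx
     0x80 | (cp & 0x3F)]                  # 10xxxxxx
  elsif cp <= 0xFFFF
    [0xE0 | (cp >> 12),                   # 1110xxxx
     0x80 | ((cp >> 6) & 0x3F),
     0x80 | (cp & 0x3F)]
  else
    [0xF0 | (cp >> 18),                   # 11110xxx
     0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F),
     0x80 | (cp & 0x3F)]
  end
end

snowman = utf8_bytes(0x2603)   # fits into 3 bytes
camel   = utf8_bytes(0x1F42B)  # needs the 4-byte form

puts snowman.map { |b| format("%02X", b) }.join(" ")  # prints "E2 98 83"
puts camel.map { |b| format("%02X", b) }.join(" ")    # prints "F0 9F 90 AB"
```

Anything that only implements the first three branches stops at 0xFFFF, which is the kind of limitation described here.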

In my case \x{2603} was representable, but \x{1f42b} not.
perl could of course parse and print \x{1f42b} properly.

We could not file this bug report. In the first attempts the issue simply was not saved, and there was no error message at all.
Initially I thought this was caused by the recent UI relaunch, so I filed a github support request. I am still waiting for that email to be ack'ed.
Then I went to the irc channel #github on freenode, where they asked me to describe my problem further. Apparently they knew nothing about any issues with their issue system.
Good! So I tried to find a syntax problem in my issue report, and partially deleted paragraphs from my text. And after removing the unicode char >0xffff I got the error message: "There was an error posting your comment: Body contains unicode characters above 0xffff"

There was still \x{1f42b} and \u1f42b in the text, but I deleted a 🐫.

I further added spaces after the \x and \u and could save my bugreport.
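
I do not know how github implements its check, but finding the offending characters in a text is simple. A minimal sketch, assuming the input is a valid UTF-8 string; the helper name chars_above_bmp is made up for this example:

```ruby
# List every character whose codepoint is above 0xFFFF.
# chars_above_bmp is a hypothetical helper; this is not github's code.
def chars_above_bmp(text)
  text.each_char.select { |ch| ch.ord > 0xFFFF }
end

body = "snowman \u{2603} and camel \u{1F42B}"
offenders = chars_above_bmp(body)

# Report each offender as U+XXXX so it can be found in the source text.
offenders.each { |ch| puts format("U+%04X", ch.ord) }  # prints "U+1F42B"
```

Running something like this over a draft before posting would have pointed at the camel right away, instead of bisecting paragraphs by hand.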
Then I found others having similar problems. They posted images of their characters, which github could not display.

The stripped down report is at https://github.com/fogus/potion/issues/33
The original report pasted to an app written in perl is at http://paste.perldancer.org/1wxXFG7Z3gluv

I find this extremely disturbing and revealing. If you want to host content on the web, please use a VM and database which support the basics. I consider utf8 basic nowadays. Why the hell ruby? It's slower than perl, and can do less.

This Socialtext blog SW is written in perl, so it will be able to display those characters, and it will be able to store this blogpost, and it will not silently throw away my blogpost without any error message. So I will not consider theories that the database or the UI folks at socialtext messed up.

8 Comments

Just as a heads up, this is Movable Type, not Socialtext.

In Perl it is also possible to use the range above 0x10FFFF. This feature was re-enabled with Perl 5.14 (or 5.16?).

Many editors also have the restriction of a 16-bit internal representation. So, sometimes it's hard to create test cases with an editor directly.

Unicode is very well designed for backwards and forward compatibility. But some developers restrict it by design.
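
As an illustration of why 16 bits are not enough: in UTF-16 a codepoint above 0xFFFF has to be split into two 16-bit units, a surrogate pair. A minimal sketch in ruby, with utf16_units being a made-up helper name for this example:

```ruby
# Split a codepoint into its UTF-16 code units.
# utf16_units is a hypothetical helper, not a ruby built-in.
def utf16_units(cp)
  return [cp] if cp <= 0xFFFF       # fits into a single 16-bit unit

  v = cp - 0x10000                  # 20 bits remain
  [0xD800 + (v >> 10),              # high surrogate: top 10 bits
   0xDC00 + (v & 0x3FF)]            # low surrogate: bottom 10 bits
end

units = utf16_units(0x1F42B)
puts units.map { |u| format("%04X", u) }.join(" ")  # prints "D83D DC2B"
```

Software that treats each 16-bit unit as one character sees two characters here, which is where such editors go wrong.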

Fortunately JavaScript does not have the above restrictions.

Any proof for the claim that Ruby is slower than Perl?

Don't know which version of ruby you're using (mine is 2.0.0p0, but confirmed with 1.9 as well) - Ruby can handle these characters with no issues.


% irb
irb(main):001:0> "\u{1f42b}".length
=> 1
irb(main):002:0> "\u{1f42b}".ord
=> 128043

github not being able to accept these comments is probably a separate issue.

Parsing UTF-8 for codepoints above 0xffff works as well, obviously.

[13] pry(main)> "\u{1f42b}".bytes.to_a.pack("C*")
=> "\xF0\x9F\x90\xAB"
[14] pry(main)> "\xF0\x9F\x90\xAB".force_encoding("UTF-8").ord
=> 128043

About Reini Urban

Working at cPanel on cperl, B::C (the perl-compiler), parrot, B::Generate, cygwin perl and more guts, keeping the system alive.