ruby vs perl / github cannot use utf8

By Reini Urban on June 21, 2013 5:22 PM

My little ruby-derived vm cannot yet parse unicode code points >0xffff.
perl can, of course.

User sromanov found this little limitation and wanted to file a bug report about it.
Since my vm is hosted on github, github is written in ruby, and ruby has the same problem as my app, it turned out to be a catch-22.

See http://www.fileformat.info/info/unicode/utf8.htm

With 3-byte sequences you can represent at most 0xFFFF,
with 4-byte sequences at most 0x10FFFF.

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: (3+6+6+6) = 21 payload bits, max 0x10FFFF (1,114,111)
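To make the bit layout above concrete, here is a minimal sketch of a hand-written UTF-8 encoder in Ruby. The helper name utf8_bytes is made up for this illustration; it is not part of Ruby and it is not the code in my vm. It only follows the byte patterns from the table linked above and does not reject surrogates.

```ruby
# Hand-rolled UTF-8 encoder for a single code point (illustrative only;
# utf8_bytes is a hypothetical helper, not a Ruby built-in).
# Note: it does not reject the surrogate range 0xD800..0xDFFF.
def utf8_bytes(cp)
  raise ArgumentError, "code point out of range" if cp < 0 || cp > 0x10FFFF

  if cp <= 0x7F
    [cp]                                        # 0xxxxxxx
  elsif cp <= 0x7FF
    [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]      # 110xxxxx 10xxxxxx
  elsif cp <= 0xFFFF
    [0xE0 | (cp >> 12),                         # 1110xxxx
     0x80 | ((cp >> 6) & 0x3F),                 # 10xxxxxx
     0x80 | (cp & 0x3F)]                        # 10xxxxxx
  else
    [0xF0 | (cp >> 18),                         # 11110xxx
     0x80 | ((cp >> 12) & 0x3F),                # 10xxxxxx
     0x80 | ((cp >> 6) & 0x3F),                 # 10xxxxxx
     0x80 | (cp & 0x3F)]                        # 10xxxxxx
  end
end

# U+2603 fits into three bytes, U+1F42B needs four.
[0x2603, 0x1F42B].each do |cp|
  hex = utf8_bytes(cp).map { |b| format("%02X", b) }.join(" ")
  puts format("U+%04X -> %s", cp, hex)
end
```

The last branch is the one a parser limited to 0xffff never reaches.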

In my case \x{2603} was representable, but \x{1f42b} was not.
perl could of course parse and print \x{1f42b} properly.

We could not file this bug report: in the first attempts the issue simply was not saved, and there was no error message.
Initially I thought this was caused by the recent UI relaunch, so I filed a github support request. I am still waiting for that email to be ack'ed.
Then I went to the IRC channel #github on freenode, where they asked me to describe my problem further. Apparently they knew nothing about any issues with their issue system.
Good! So I tried to find a syntax problem in my issue report, and partially deleted paragraphs from my text. And after removing the unicode char >0xffff I got the error message: "There was an error posting your comment: Body contains unicode characters above 0xffff"

There was still \x{1f42b} and \u1f42b in the text, but I deleted a 🐫.

I further added spaces after the \x and \u and could save my bugreport.
Then I found others having similar problems; they posted images of the characters which github could not display.
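The error message quoted above suggests a check on the code points in the body. I do not know how github implements it; the following is only a sketch of such a check in Ruby, with a made-up helper name (astral_chars). It also shows why the escaped spellings \x{1f42b} and \u1f42b are harmless plain ASCII, while the literal character is not.

```ruby
# Hypothetical helper: return the characters of a string whose code point
# is above 0xFFFF, i.e. outside the Basic Multilingual Plane.
# This is an assumption about what such a check might look like,
# not github's actual code.
def astral_chars(text)
  text.each_char.select { |ch| ch.ord > 0xFFFF }
end

# The escaped forms are plain ASCII text; only the literal camel is > 0xFFFF.
report = "snowman \u{2603}, escaped \\x{1f42b} and \\u1f42b, literal \u{1f42b}"

offenders = astral_chars(report)
if offenders.empty?
  puts "no characters above 0xffff"
else
  puts "above 0xffff: " + offenders.map { |ch| format("U+%04X", ch.ord) }.join(", ")
end
```

A check like this would at least allow a proper error message up front, instead of a silent failure.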

The stripped down report is at https://github.com/fogus/potion/issues/33
The original report pasted to an app written in perl is at http://paste.perldancer.org/1wxXFG7Z3gluv

I find this extremely disturbing and revealing. If you want to host content on the web, please use a VM and a database which support the basics. I consider utf8 basic nowadays. Why the hell ruby? It's slower than perl, and can do less.

This Socialtext blog SW is written in perl, so it will be able to display those characters, and it will be able to store this blogpost, and it will not silently throw away my blogpost without any error message. So I will not consider theories that the database or the UI folks at socialtext messed up.

8 comments

autarch.urth.org | June 21, 2013 6:00 PM

Just as a heads up, this is Movable Type, not Socialtext.

Reini Urban | June 21, 2013 8:21 PM

Thanks Dave.
Movable Type is also written in perl, and it also looks safer than ruby. The available parsers are not as nice as the extended markdown syntax they use at github, but first of all the parser should be stable. Even php can do that. I never had any unicode troubles with my phpwiki. mediawiki also has no troubles.

Reini Urban | June 21, 2013 8:25 PM

I have now extended the available code range for utf8 \uxxxx from 0xffff to 0x10ffff via the new syntax \UXXXXX for potion/p2. It took 20 minutes.
https://github.com/fogus/potion/issues/33

The 5th char to \U is optional.

Helmut Wollmersdorfer | June 22, 2013 6:37 AM

In Perl it is also possible to use the range above 0x10FFFF. This feature was re-enabled with Perl 5.14 (or 5.16?).

Many editors are also restricted to a 16-bit internal representation. So sometimes it's hard to create test cases directly with an editor.

Unicode is very well designed for backward and forward compatibility. But some developers restrict it by design.

Fortunately JavaScript does not have the above restrictions.

vsespb | June 22, 2013 8:15 PM

Any proof for the claim that Ruby is slower than Perl?

Tatsuhiko Miyagawa | June 22, 2013 8:56 PM

Don't know which version of ruby you're using (mine is 2.0.0p0, but confirmed with 1.9 as well) - Ruby can handle these characters with no issues.

    % irb
    irb(main):001:0> "\u{1f42b}".length
    => 1
    irb(main):002:0> "\u{1f42b}".ord
    => 128043

github not being able to accept these comments is probably a separate issue.

Tatsuhiko Miyagawa | June 22, 2013 9:02 PM

Parsing UTF-8 byte sequences for code points above 0xffff works as well, obviously.

    [13] pry(main)> "\u{1f42b}".bytes.to_a.pack("C*")
    => "\xF0\x9F\x90\xAB"
    [14] pry(main)> "\xF0\x9F\x90\xAB".force_encoding("UTF-8").ord
    => 128043

Reini Urban | June 24, 2013 5:33 AM

Thanks Miyagawa san for checking this out.
In my case it was the parser, and my first suspicion in the github case is also the markdown parser. Otherwise a better error message would have appeared. Had it been the database, at least the preview would have appeared.

About Reini Urban

Working at cPanel on cperl, B::C (the perl-compiler), parrot, B::Generate, cygwin perl and more guts, keeping the system alive.
