Unicode abuse

I was looking at doing a little bit of political activism on Twitter and, as part of this, thought about maximising the amount of information in each tweet, à la Tweet Compressor, which abuses Unicode to squeeze more into the 140-character (not byte!) limit for tweets.

Here's the implementation:


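use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals

sub tweet_compress {
    my $tweet = shift;
    $tweet =~ s/\. ?$//;    # we don't need no end of sentence punctuation
    # Longer sequences come first, so "ffl" and "ffi" are replaced before
    # "fi" and "fl" can match inside them. The last two pairs drop the space
    # after a full stop or comma.
    my @orig = ( qw/ffl ffi cc ms ns ps in ls fi fl iv ix vi oy ii xi nj/, '. ', ', ' );
    my @new  = ( qw/ﬄ ﬃ ㏄ ㎳ ㎱ ㎰ ㏌ ʪ ﬁ ﬂ ⅳ ⅸ ⅵ ѹ ⅱ ⅺ ǌ/, '.', ',' );
    $tweet =~ s/\Q$orig[$_]\E/$new[$_]/g for 0 .. $#orig;
    return $tweet;
}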
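
The loop above applies its substitutions in list order, so a shorter sequence listed earlier can pre-empt a longer one that contains it. One way to make the order irrelevant is a single pass over one alternation, sorted longest first. This is only a sketch: the `%compact` table and the name `tweet_compress_single_pass` are made up for illustration, and the table holds just a handful of the pairs.

```perl
use strict;
use warnings;
use utf8;    # the table below contains UTF-8 literals

# Hypothetical lookup table: ASCII sequence => single-character replacement.
my %compact = (
    'ffi' => 'ﬃ', 'ffl' => 'ﬄ', 'fi' => 'ﬁ', 'fl' => 'ﬂ',
    'cc'  => '㏄', 'ms'  => '㎳', 'iv' => 'ⅳ', 'ix' => 'ⅸ',
);

# Sort keys longest first so the regex tries "ffi" before "fi".
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %compact;
my $pattern = qr/($alternation)/;

sub tweet_compress_single_pass {
    my ($tweet) = @_;
    # One left-to-right pass; each match is looked up in the table.
    $tweet =~ s/$pattern/$compact{$1}/g;
    return $tweet;
}
```

Because the text is scanned once, a replacement character can never be matched again by a later rule, which is a side effect the array-based loop does not guarantee.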

Doing the rest of the right thing with Unicode is a bit annoying (e.g.
binmode STDOUT, ':utf8';
to output the tweet correctly to stdout), and I really wish there were better Unicode docs with a lower cognitive load.
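
To make the character-versus-byte distinction concrete, here is a small sketch that counts both for the same string. The helper name `chars_and_bytes` is invented for this example, and it uses `:encoding(UTF-8)`, a stricter relative of the `:utf8` layer mentioned above; either should do for printing a tweet.

```perl
use strict;
use warnings;
use utf8;                  # the literal below is UTF-8 in the source
use Encode qw(encode);

# Encode characters as UTF-8 when printing to stdout.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: returns (character count, UTF-8 byte count).
sub chars_and_bytes {
    my ($text) = @_;
    return ( length($text), length( encode( 'UTF-8', $text ) ) );
}

my $tweet = 'ﬁve ㎳';
my ( $chars, $bytes ) = chars_and_bytes($tweet);
print "$tweet: $chars characters, $bytes bytes\n";
```

The compressed characters each take several bytes in UTF-8, so the trick only pays off because the limit is counted in characters.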

2 Comments

I had done this for fun a while back after seeing Tweet Compressor. It's important not to compress URLs in order for them to be useful without someone retyping them manually. TC detects them and leaves them unchanged.
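
For what it's worth, leaving URLs alone could be sketched roughly like this. It is not how Tweet Compressor detects URLs (I don't know how it does that); the `https?://\S+` pattern is a naive assumption, and `compress_text` is a two-rule stand-in for the real substitution table.

```perl
use strict;
use warnings;
use utf8;    # the replacements below are UTF-8 literals

# Stand-in for the full compressor, with only two substitutions.
sub compress_text {
    my ($text) = @_;
    $text =~ s/ffi/ﬃ/g;
    $text =~ s/fi/ﬁ/g;
    return $text;
}

# Hypothetical wrapper: compress everything except URLs.
sub compress_except_urls {
    my ($tweet) = @_;
    # The capture group makes split keep the URLs, so the list alternates:
    # even indices are plain text, odd indices are URLs.
    my @pieces = split m{(https?://\S+)}, $tweet;
    for my $i ( 0 .. $#pieces ) {
        $pieces[$i] = compress_text( $pieces[$i] ) if $i % 2 == 0;
    }
    return join '', @pieces;
}
```

Called on 'fix http://example.com/office office', this should compress the words on either side while the URL stays clickable.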

About Holy Zarquon's Singing Fish

Catalyst hacker, management researcher and health informatician.