Unicode abuse

By Holy Zarquon's Singing Fish on July 20, 2010 10:19 PM

I was looking at doing a little bit of political activism on Twitter and, as part of this, thought about maximising the amount of information in each tweet à la Tweet Compressor, which abuses Unicode to squeeze more text into the 140-character (not byte!) limit for tweets.
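To make the character-versus-byte distinction concrete, here is a minimal sketch. The helper name `char_and_byte_length` is made up for illustration, and it assumes the core `Encode` module is available.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns (character count, UTF-8 byte count) for a string.
sub char_and_byte_length {
    my ($text) = @_;
    return ( length($text), length( encode( 'UTF-8', $text ) ) );
}

for my $text ( "fi", "\x{FB01}" ) {    # U+FB01 is the single-character "fi" ligature
    my ( $chars, $bytes ) = char_and_byte_length($text);
    print "$chars character(s), $bytes byte(s)\n";
}
```

The ligature counts as one character even though it takes three bytes in UTF-8, which is why the trick only pays off against a limit counted in characters.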

Here's the implementation:

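use utf8;

sub tweet_compress {
    my $tweet = shift;
    $tweet =~ s/\. ?$//;    # we don't need no end of sentence punctuation

    # Longer sequences (ffl, ffi) come before fl and fi, otherwise the
    # shorter ligatures are substituted first and the longer ones never match.
    my @orig = ( qw/cc ms ns ps in ls ffl ffi fi fl iv ix vi oy ii xi nj/, ". ", ", " );
    # One replacement per entry in @orig, separated by spaces so qw splits them.
    my @new = qw/㏄ ㎳ ㎱ ㎰ ㏌ ʪ ﬄ ﬃ ﬁ ﬂ ⅳ ⅸ ⅵ ѹ ⅱ ⅺ ǌ ． ，/;

    $tweet =~ s/\Q$orig[$_]\E/$new[$_]/g for 0 .. $#orig;
    return $tweet;
}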

Doing the rest of the right thing with Unicode is a bit annoying (e.g.

binmode STDOUT, ':utf8';

to output the tweet correctly to STDOUT), and I really wish there were better Unicode docs that didn't have such a high cognitive load.
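For what it's worth, the rest of the plumbing might look roughly like the sketch below. It assumes the tweets arrive as UTF-8 command-line arguments, and `decode_args` is a hypothetical helper name. It uses the `:encoding(UTF-8)` layer, which validates data, where `:utf8` does not.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw command-line bytes into character strings.
sub decode_args {
    return map { my $bytes = $_; decode( 'UTF-8', $bytes ) } @_;
}

# Decode on the way in and encode on the way out for the standard handles.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

# Assumption: each argument is one tweet, passed in as UTF-8 bytes.
my @tweets = decode_args(@ARGV);
print "$_\n" for @tweets;
```

Decoding at the edges means everything in between, including `length` and the substitutions, works on characters, not bytes.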


2 Comments

Nova Patch | July 20, 2010 10:54 PM | Reply

I had done this for fun a while back after seeing Tweet Compressor. It's important not to compress URLs in order for them to be useful without someone retyping them manually. TC detects them and leaves them unchanged.

Holy Zarquon's Singing Fish replied to comment from Nova Patch | July 21, 2010 3:30 AM | Reply

Nick:

Good point. If this had been relevant to my problem space, I'd have realised this of course :)
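The point about URLs could be handled along these lines. This is only a sketch: `compress_except_urls` is a hypothetical wrapper, the URL pattern is a rough approximation (it will swallow trailing punctuation), and it is not how Tweet Compressor itself is known to do it.

```perl
use strict;
use warnings;

# Hypothetical wrapper: apply $compress (a code ref) to everything except URLs.
sub compress_except_urls {
    my ( $tweet, $compress ) = @_;

    # A capturing group in split keeps the matched URLs in the returned list.
    my @parts = split m{(https?://\S+)}, $tweet;
    return join '', map { m{\Ahttps?://} ? $_ : $compress->($_) } @parts;
}

# Stand-in compressor for the demo: only handles the "fi" ligature.
my $demo = sub { my $s = shift; $s =~ s/fi/\x{FB01}/g; return $s };

# The first "fix" gets the ligature; the one inside the URL is left alone.
my $out = compress_except_urls( "fix http://example.com/fix", $demo );
```

Passing the compressor in as a code ref keeps the URL handling separate from the substitution table.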


About Holy Zarquon's Singing Fish

Catalyst hacker, management researcher and health informatician.
