CPAN modules for the Indonesian language

By Steven Haryanto on August 10, 2012 9:31 AM

(This post is just a lazy "proof-of-concept" for a kind of module review article that, instead of comparing similar/competing modules for a single task, lists the various modules surrounding a theme or task.)

So far, this is a short list. You'd guess correctly that more prominent languages like English have far more tools available.

Converting numbers to/from words. There's Lingua::ID::Nums2Words for converting numbers to words and Lingua::ID::Words2Nums for converting words to numbers. I've seen far more widespread use of the former (e.g. 45000 to "empat puluh lima ribu") than of the latter. The former usually shows up in applications that print receipts or cheques, while the latter I've only seen used in a script for practising the Indonesian language. Side note: these two modules are actually my first CPAN modules ever, written 13 years ago!
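To make the number-to-words task concrete, here is a minimal sketch of the spelling rules involved (the "se-" prefix for one hundred and one thousand, and the "belas" teens). It is written in Python only to stay self-contained, and it is not how Lingua::ID::Nums2Words is implemented; the function name number_to_words_id and the supported range are assumptions made for illustration.

```python
# Illustrative sketch only: spells out non-negative integers in Indonesian.
# The names below are hypothetical and unrelated to the CPAN module's API.

UNITS = ["nol", "satu", "dua", "tiga", "empat",
         "lima", "enam", "tujuh", "delapan", "sembilan"]

# Scale words above a thousand, largest first.
SCALES = [(10**12, "triliun"), (10**9, "miliar"), (10**6, "juta")]


def _below_thousand(n):
    """Return the list of words for 0 < n < 1000 (empty list for 0)."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds == 1:
        words.append("seratus")            # 100 is "seratus", not "satu ratus"
    elif hundreds > 1:
        words += [UNITS[hundreds], "ratus"]
    if rest == 10:
        words.append("sepuluh")
    elif rest == 11:
        words.append("sebelas")
    elif 12 <= rest <= 19:
        words += [UNITS[rest - 10], "belas"]
    else:
        tens, ones = divmod(rest, 10)
        if tens:
            words += [UNITS[tens], "puluh"]
        if ones:
            words.append(UNITS[ones])
    return words


def number_to_words_id(n):
    """Spell out an integer in the range 0 <= n < 10**15."""
    if not 0 <= n < 10**15:
        raise ValueError("number out of supported range")
    if n == 0:
        return "nol"
    words = []
    for value, name in SCALES:
        count, n = divmod(n, value)
        if count:
            words += _below_thousand(count) + [name]
    thousands, n = divmod(n, 1000)
    if thousands == 1:
        words.append("seribu")             # 1000 is "seribu"
    elif thousands:
        words += _below_thousand(thousands) + ["ribu"]
    words += _below_thousand(n)
    return " ".join(words)


print(number_to_words_id(45000))  # empat puluh lima ribu
```

The reverse direction (words to numbers) would need a tokenizer that undoes the same special cases, which is part of why it is the less commonly implemented of the two.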

Translating text to other languages. No offline modules exist for this, but you can use Google Translate, which supports Indonesian. API access is currently not free, however. Once you pay Google for access, WWW::Google::Translate and Lingua::Translate::Google let you use it from Perl.

Detecting the language of text. There's Lingua::Identify. I have not done any extensive testing, but it's good enough, I guess, as long as the text is not too short (fewer than 10-20 words) and uses the formal dialect of the language.

Parsing number from text. There's Parse::Number::ID. So far it only parses numbers, but I would like it to support mixed phrases too, like "1,5 juta" (1.5e6).
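As a rough idea of what parsing a mixed phrase like "1,5 juta" could involve, here is a minimal sketch. It is in Python to stay self-contained and is not the Parse::Number::ID implementation or API; the function name parse_number_phrase_id and the multiplier table are assumptions. It relies on the Indonesian convention of a comma as the decimal mark and a dot as the thousands separator.

```python
# Illustrative sketch only: parse an Indonesian-formatted number with an
# optional scale word. All names here are hypothetical.
import re
from decimal import Decimal

# Scale words this sketch understands (both spellings of "billion").
MULTIPLIERS = {
    "ribu": 10**3,
    "juta": 10**6,
    "miliar": 10**9,
    "milyar": 10**9,
    "triliun": 10**12,
}

_PATTERN = re.compile(
    r"\s*(?P<int>[+-]?(?:\d{1,3}(?:\.\d{3})+|\d+))"  # 1.500 or 1500
    r"(?:,(?P<frac>\d+))?"                            # ,25 (decimal part)
    r"(?:\s*(?P<unit>[A-Za-z]+))?\s*"                 # optional scale word
)


def parse_number_phrase_id(text):
    """Return the value as a float, or None if the text is not recognized."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        return None
    unit = match.group("unit")
    if unit is not None and unit.lower() not in MULTIPLIERS:
        return None
    # Drop thousands separators and switch to a dot as the decimal mark.
    integer_part = match.group("int").replace(".", "")
    value = Decimal(integer_part + "." + (match.group("frac") or "0"))
    if unit is not None:
        value *= MULTIPLIERS[unit.lower()]
    return float(value)


print(parse_number_phrase_id("1,5 juta"))  # 1500000.0
```

Using Decimal for the intermediate value avoids rounding surprises before the multiplier is applied; a real module would also have to decide how to treat ambiguous input such as "1.5", which this sketch simply rejects.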

Stemming (finding the root form of a word) and hyphenation (breaking a word into syllables). I've read somewhere that Microsoft has support for this in its Office suite, and that the support was supposedly developed by their Indonesian team. But don't count on my feeble memory for this. There are no free/open source libraries available that I know of.

Dictionary (monolingual). The official dictionary for the Indonesian language is Kamus Besar Bahasa Indonesia (KBBI). So far there are 4 editions: i (1988), ii (1991), iii (2005), and iv (2008). The third edition has been available online from the official website since 2008, but there's no news of a planned update to the fourth. For this third edition, I have scraped the website and converted the content into an offline Stardict database (see project repo). I have obtained permission to redistribute it. The conversion is not perfect, due to the lack of proper markup in the original content, but I guess it's good enough. No Perl library/API for this has been written yet. You can use the available Stardict tools to extract the list of entries to a text file.

PDFs of the fourth edition have also been circulating on file sharing websites/networks, but I am not sure of their legality and currently have no plans to convert them. The PDF has the same terrible, visually-oriented markup that lacks semantics, one that smells like the whole thing was composed in Microsoft Word.

Dictionary (bilingual). To be written. Summary: no good quality, free bilingual (English/Indonesian) dictionary is available online.

Person names.

Guessing gender from name. I wrote Locale::ID::GuessGender::FromFirstName a couple of years ago, but am not satisfied with it at all. I don't think it's usable yet.
Parsing name into components. No modules exist yet. I plan on writing one, but it hasn't materialized.

Dialect languages. To be written.

Fun modules. To be written. Alay, ...

4 comments

Neil Bowers | August 10, 2012 7:47 PM

This would be good as an interactive matrix: languages on one axis, and tasks on the other. Maybe start off with a public google docs spreadsheet?

Steven Haryanto replied to comment from Neil Bowers | August 10, 2012 8:23 PM

I'll consider it when lists for English and a couple of other languages have appeared. Nudge + wink. ;-)

Holy Zarquon's Singing Fish | August 11, 2012 8:19 AM

The big problem with Indonesian as written online is that much of it is done in txt spk with lots and lots of shortcuts, which makes it a nightmare for me to read (given I don't read formal Indonesian very well in the first place, and usually have to read out loud to have a better chance of understanding). But the text speak really throws me. I don't think the auto-translate tools do a brilliant job of it either (although toggletext does have an option that works to an extent).

Steven Haryanto replied to comment from Holy Zarquon's Singing Fish | August 11, 2012 8:39 AM

You're right. Even I don't read much Indonesian anymore :)

About Steven Haryanto

A programmer (mostly Perl 5 nowadays). My CPAN ID: SHARYANTO. I'm sedusedan on perlmonks. My twitter is stevenharyanto (but I don't tweet much). Follow me on github: sharyanto.