Using Perl to create a Wiktionary database


Perl is known for its unmatched text processing, but I didn't quite get it until I used it for a personal project. I am still using only the baby Perl that I am learning from 'Learning Perl'. I wanted to make an Android app with content from Wiktionary, so I needed a database with three fields (word, part of speech and meaning) and one table per starting letter. I downloaded the Wiktionary dump, which is a tab-separated values (TSV) file. Each line has this format:

English\tword\tpart of speech\tmeaning

I had to turn each line into an SQL statement and write the statements to separate files, i.e. "a" words to a.txt and so on until "z" words to z.txt, and then use SQLite Manager for Firefox to convert them into an SQLite database, as SQLite is easy to work with on Android. I wrote a Perl script that removes the "English" at the start of each line, splits the rest on "\t" with the split operator and stores the pieces in $word, $part and $meaning. It also makes the first letter lower case, removes the links found in the TSV file, skips phrases and replaces ' with '', since a bare ' breaks the SQL statements and '' escapes it. I extracted the first letter of each word using substr.
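Here is a minimal sketch of those clean-up steps, not the original script. It assumes the links are wiki-style [[target]] or [[target|label]] markup and that a "phrase" is any entry whose word contains whitespace; parse_entry is a hypothetical name.

```perl
use strict;
use warnings;

# Turn one TSV line into (word, part, meaning), or return nothing to skip it.
# Assumptions: links look like [[target]] or [[target|label]], and a
# "phrase" is any entry whose word contains whitespace.
sub parse_entry {
    my ($line) = @_;
    chomp $line;

    # Drop the leading "English" column; skip lines that do not start with it.
    $line =~ s/^English\t// or return;

    # A limit of 3 keeps any stray tabs inside the meaning intact.
    my ($word, $part, $meaning) = split /\t/, $line, 3;
    return unless defined $meaning && length $word;

    return if $word =~ /\s/;    # avoid phrases
    $word = lcfirst $word;      # first letter to lower case

    # Replace [[target|label]] with label and [[target]] with target.
    $meaning =~ s/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/$1/g;

    # Double every single quote so it survives inside an SQL string literal.
    s/'/''/g for $word, $part, $meaning;

    return ($word, $part, $meaning);
}

my @fields = parse_entry("English\tAirplane\tNoun\tA [[fly|flying]] machine\n");
print join(' | ', @fields), "\n" if @fields;
# prints: airplane | Noun | A flying machine
```

Passing a limit of 3 to split means a tab inside a definition ends up in $meaning instead of being silently dropped.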

Each statement looks like insert into a values ("airplane","Noun","A flying machine"); and is written to a.txt, and so on until z.txt.
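The per-letter output could be sketched like this. It is only an illustration: sql_for and write_statements are hypothetical names, the fields are assumed to be already escaped, and the sketch uses single-quoted literals (the standard SQL form, and the one the '' doubling is meant for) where the sample statement above shows double quotes.

```perl
use 5.010;
use strict;
use warnings;

# Build one INSERT statement for the table named after the word's first letter.
# Assumes the fields are already escaped ('' for ').
sub sql_for {
    my ($word, $part, $meaning) = @_;
    my $letter = lc substr($word, 0, 1);
    return unless $letter =~ /^[a-z]$/;    # no table for digits, symbols, etc.
    return ($letter, "insert into $letter values ('$word','$part','$meaning');");
}

# Write each statement to "<letter>.txt" inside $dir, opening each file once.
sub write_statements {
    my ($dir, @entries) = @_;
    my %fh_for;
    for my $entry (@entries) {
        my ($letter, $sql) = sql_for(@$entry) or next;
        $fh_for{$letter} //= do {
            open my $fh, '>', "$dir/$letter.txt"
                or die "Cannot open $dir/$letter.txt: $!";
            $fh;
        };
        print { $fh_for{$letter} } "$sql\n";
    }
    close $_ for values %fh_for;
    return;
}

my (undef, $sql) = sql_for('airplane', 'Noun', 'A flying machine');
print "$sql\n";
# prints: insert into a values ('airplane','Noun','A flying machine');
```

Calling write_statements('.', [ 'airplane', 'Noun', 'A flying machine' ]) would create a.txt in the current directory. Keeping the file handles in a hash avoids reopening a file for every word.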

I finished the script and executed it from the terminal. The dump had 200_000+ words and I thought it would take a while to do it. I stretched a bit, yawned, and just as I was about to get up my laptop beeped; thankfully I didn't have any Mountain Dew in my mouth. I opened the folder and found 26 text files, one per letter, each holding its SQL statements. Perl did it all in just 10 freaking seconds. It took me 20 seconds just to open the 60MB TSV file, yet Perl did this and much more in 10 seconds. I copied the SQL statements over, and then it took me 10 minutes to make the 41MB database with SQLite Manager.

But today I learned that I could have made a similar database with the DBI module. It could have been done in another 10 seconds. So that makes my first blog post. Thank you, Perl. You are a Hypercool beast. Inspired by the open source nature of Perl, I made my app open source on GitHub too, and it is available on Google Play as Wordzilla. I hope it helps someone who wants Wiktionary as a database or a 200_000+ word list, and I am happy if it does.
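For the DBI route mentioned above, a rough sketch might look like the following. It is not code from the app: it assumes the DBD::SQLite driver is installed, and load_entries and the column names (word, part, meaning) are made up for illustration.

```perl
use 5.010;
use strict;
use warnings;
use DBI;

# Insert [word, part, meaning] rows straight into an SQLite database.
# Assumes DBD::SQLite is installed; one table per starting letter with
# three text columns is an illustrative layout.
sub load_entries {
    my ($db_file, @entries) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$db_file", '', '',
        { RaiseError => 1, AutoCommit => 0 });

    my (%sth_for, $count);
    $count = 0;
    for my $entry (@entries) {
        my ($word, $part, $meaning) = @$entry;
        my $letter = lc substr($word, 0, 1);
        next unless $letter =~ /^[a-z]$/;

        # Create the table and prepare its INSERT the first time a letter appears.
        $sth_for{$letter} //= do {
            $dbh->do("create table if not exists $letter (word text, part text, meaning text)");
            $dbh->prepare("insert into $letter values (?, ?, ?)");
        };

        # Placeholders handle quoting, so no '' doubling is needed here.
        $sth_for{$letter}->execute($word, $part, $meaning);
        $count++;
    }

    $dbh->commit;     # a single transaction for all rows
    %sth_for = ();    # release statement handles before disconnecting
    $dbh->disconnect;
    return $count;
}

my $inserted = load_entries(':memory:',
    [ 'airplane', 'Noun', 'A flying machine' ],
    [ "o'clock", 'Adverb', 'Of the clock' ]);
print "$inserted row(s) inserted\n";
```

Passing a file name instead of ':memory:' would write the database to disk. Turning AutoCommit off and committing once is a common way to keep bulk inserts into SQLite fast.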

#!perl

use 5.010;
use OpenSource;

say "Thank you Perl!" ;

Disclaimer:

This post is not a promotional one; it's just about how Perl helped me get my job done. The app contains no ads or in-app purchases and earns me no revenue. It exists just to promote the Hypercool work done by the open source and free software community.


About Xtreak

An Android developer and a Perl newbie