An overview of spell checking modules

Spell checking is one of those problems that is already solved... sorta.

Like all problems, it really depends on context. Take Jon Bentley's "Programming Pearls: A Spelling Checker", where he examines the problem space and the differences between a spell checker and a spelling corrector. I started by searching for the keyword 'spell' across all of CPAN.

wget http://www.cpan.org/modules/01modules.index.html
ack -i spell 01modules.index.html

The above covered all 22,442 distribution names but not the submodule names. A few MetaCPAN searches later, I was able to compile the following list.

Direct checkers - modules that actually do the spell checking


  • Lingua::Ispell A module encapsulating access to the Ispell program via IPC::Open2

  • Meta::Tool::Aspell Runs aspell for you. Meta is a class library of about 250 classes and is abandonware.

  • Text::Aspell Perl interface to the GNU Aspell library

  • Text::Hunspell Perl interface to the GNU Hunspell library

  • Text::Ispell A wrapper module for Ispell. The ispell cli is called via IPC::Open2.

Indirect - relies on another module to do the actual checking

POD only checkers


  • Pod::Spell::CommonMistakes Catches common typos in POD by using Pod::Spell to format the text and then comparing it against a custom wordlist from Pod::Spell::CommonMistakes::WordList. No system spell checker is required.

  • Pod::Spelling Send POD to a spelling checker using either Lingua::Ispell or Text::Aspell. A test library is provided via Test::Pod::Spelling

  • Test::Spelling check for spelling errors in POD files. Pod::Spell is used for parsing and an open3 call is made to either 'spell', 'aspell', 'ispell', or 'hunspell' for spell checking.

XML

Spell checking as a test

Checks spelling via remote service/application


  • Bing::Search::Source::Spell uses Bing to spell check text.

  • Lingua::AtD Provides an OO wrapper for After the Deadline grammar and spelling service.

  • Lingua::MSWordSpell Uses Microsoft Word's Spellchecker over OLE automation instead of something like ispell

  • Net::Google::Spelling simple OOP-ish interface to the Google SOAP API for spelling suggestions. This appears abandoned based on last update date, number of open bugs and the fact it has more failed test reports than passes.

  • WebService::KoreanSpeller A Korean spell checker

Everything else

While many of these modules are actively developed and useful, many do not fit my requirements for this project. I want to spell check any kind of UTF-8 encoded text without needing an Internet connection or a closed source program. The first two groups, direct and indirect spell checkers, appear to meet these requirements.

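So which one to use? Let's take a look under the hood. GNU Ispell gives spelling suggestions if a word is not found in its dictionary. When searching for possible corrections to present, it uses a Damerau–Levenshtein distance of 1, meaning a suggestion differs from the misspelled word by a single insertion, deletion, substitution, or transposition of adjacent letters.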
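
To make a distance of 1 concrete, here is a minimal Python sketch that generates every string one edit away from a word and keeps the ones found in a word list. It only illustrates the idea and is not Ispell's actual implementation. The function names edits_within_one and suggest and the tiny WORDS set are made up for this example, and only lowercase ASCII letters are considered.

```python
import string


def edits_within_one(word, alphabet=string.ascii_lowercase):
    """Return every string exactly one edit away from `word`.

    One edit is a single deletion, adjacent transposition,
    substitution, or insertion (Damerau-Levenshtein distance 1).
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    replaces = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    candidates = deletes | transposes | replaces | inserts
    candidates.discard(word)  # swapping or replacing identical letters is not an edit
    return candidates


def suggest(word, dictionary):
    """Return sorted dictionary words one edit away, or [] if the word is known."""
    if word in dictionary:
        return []
    return sorted(edits_within_one(word) & dictionary)


# Hypothetical word list, standing in for a real dictionary file.
WORDS = {"spell", "spelling", "checker", "shell", "spill"}

print(suggest("spel", WORDS))   # ['spell'] (one insertion)
print(suggest("sepll", WORDS))  # ['spell'] (one transposition)
```

Words such as "shell" and "spill" are two edits away from "spel", so a distance of 1 never offers them.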

GNU Aspell is an Ispell replacement that handles UTF-8 by default, has 70 supported language dictionaries, and supports using multiple dictionaries at once.

Hunspell is an advanced spell checker based on MySpell that supports both dictionaries and rules. It is currently used by LibreOffice/OpenOffice and has dictionaries for 99 languages.

It looks like Aspell and Hunspell are the front runners. I rejected Meta::Tool::Aspell because it has not been updated since 2002 and the distribution it belongs to has tons of other modules that are not needed for this problem. That leaves Text::Aspell, Text::Hunspell, Search::Tools::SpellCheck, and Text::SpellChecker. Before trying these modules out, I decided to run the command line versions of aspell and hunspell against some test data and compare the output.

Long story short, they both generate too much noise because of how they function. If a word is in the dictionary, or can otherwise be matched, everything works out. But I am dealing with text of any kind, from source code files to memos and email, and much of it contains legitimate words that no dictionary lists. Things like company names, places, people's names, animals, plants, email addresses, and technical terms can easily be flagged as incorrect.
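
One way to picture the noise problem is a pre-filter that drops tokens a dictionary has no hope of knowing before anything reaches the spell checker. The Python sketch below is my own illustration under stated assumptions, not one of the sample programs mentioned in this post: the function name words_to_check, the filtering rules, and the allowlist are all hypothetical.

```python
def words_to_check(text, allowlist=()):
    """Return the tokens from `text` that are worth sending to a spell checker.

    Skips email addresses, URLs, tokens containing digits or symbols,
    camelCase or ALLCAPS identifiers, and anything in `allowlist`.
    """
    allowed = {word.lower() for word in allowlist}
    kept = []
    for token in text.split():
        if "@" in token or "://" in token:
            continue  # email address or URL
        token = token.strip(".,;:!?\"'()[]{}<>")
        if not token.replace("'", "").isalpha():
            continue  # digits, underscores, paths, operators, empty tokens
        if token[1:] != token[1:].lower():
            continue  # capital after the first letter: likely an identifier or acronym
        if token.lower() in allowed:
            continue  # known name or technical term
        kept.append(token)
    return kept


sample = (
    "Email kirk@example.com about getUserName in "
    "http://example.com/docs, Kimmel said teh fix works."
)
print(words_to_check(sample, allowlist={"Kimmel"}))
# ['Email', 'about', 'in', 'said', 'teh', 'fix', 'works']
```

A filter like this trades recall for quiet: it also hides real typos inside skipped tokens, and names still need to be collected into the allowlist by hand.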

I wrote a few sample programs to try to work around this problem. I will cover the results of my research in a future post.

About Kimmel

I like writing Perl code, and since most of it is open source I might as well talk about it too. @KirkKimmel on Twitter