An overview of spell checking modules
Spell checking is one of those problems that is already solved... sorta.
Like all problems, it really depends on context. Take Jon Bentley's "Programming Pearls: A Spelling Checker", where he examines the problem space and the differences between a spell checker and a spelling corrector. I started by searching for the keyword 'spell' across all of CPAN.
ack -i spell 01modules.index.html
The search above covered all 22,442 distribution names but not the names of the modules inside them. A few MetaCPAN searches later, I was able to compile the following list.
Direct checkers - modules that actually do the spell checking
- Lingua::Ispell A module encapsulating access to the Ispell program via IPC::Open2
- Meta::Tool::Aspell Runs aspell for you. It is part of Meta, a class library of about 250 classes that is now abandonware.
- Text::Aspell Perl interface to the GNU Aspell library
- Text::Hunspell Perl interface to the GNU Hunspell library
- Text::Ispell A wrapper module for Ispell. The ispell cli is called via IPC::Open2.
Indirect - relies on another module to do the actual checking
- Search::Tools::SpellCheck Uses Text::Aspell to offer spelling suggestions
- Text::SpellChecker OO interface for spell-checking a block of text. Uses either Text::Aspell or Text::Hunspell
POD only checkers
- Pod::Spell::CommonMistakes Catches common typos in POD by using Pod::Spell to format the text and then comparing it against a custom wordlist from Pod::Spell::CommonMistakes::WordList. No system spell checker is required.
- Pod::Spelling Send POD to a spelling checker using either Lingua::Ispell or Text::Aspell. A test library is provided via Test::Pod::Spelling
- Test::Spelling Checks for spelling errors in POD files. Pod::Spell is used for parsing and an open3 call is made to either 'spell', 'aspell', 'ispell', or 'hunspell' for spell checking.
- Apache::AxKit::Language::SpellCheck is an XML Text Spell Checker for the Apache AxKit. Checking is done via Text::Aspell
- xml_spellcheck is a cli application for spell checking XML files. It makes a system call to 'aspell -c' directly.
Spell checking as a test
- Dist::Zilla::Plugin::PodSpellingTests is DEPRECATED! The old name of the PodSpelling plugin
- Dist::Zilla::Plugin::SpellingCommonMistakesTests Generates a Test::Pod::Spelling::CommonMistakes release test
- Test::Pod::Spelling::CommonMistakes Checks POD for common spelling mistakes using Pod::Spell::CommonMistakes.
- Dist::Zilla::Plugin::Test::PodSpelling Generates a Test::Spelling author test
- Perl::Critic::Policy::Documentation::PodSpelling Spell check the POD. Aspell is used via an open command.
Checks spelling via remote service/application
- Bing::Search::Source::Spell uses Bing to spell check text.
- Lingua::AtD Provides an OO wrapper for After the Deadline grammar and spelling service.
- Lingua::MSWordSpell Uses Microsoft Word's Spellchecker over OLE automation instead of something like ispell
- Net::Google::Spelling A simple OOP-ish interface to the Google SOAP API for spelling suggestions. This appears abandoned, based on its last update date, its number of open bugs, and the fact that it has more failed test reports than passes.
- WebService::KoreanSpeller A Korean spell checker
- Gtk2::Spell Perl bindings to GtkSpell, used in concert with Gtk2::TextView.
- Lingua::Jspell Perl interface to the Jspell morphological analyzer.
- Lingua::Spelling::Alternative Use affix files generated by the ispell tools to return alternative spellings of a given word
- Pod::Spell A formatter for spell checking POD. It has no actual checking capabilities built in.
- Text::SpellChecker::GUI Implements a user interface to Text::SpellChecker
- Tie::Ispell Ties a hash with an Ispell dictionary
- tkispell Perl/Tk user interface for Ispell
While many of these modules are actively developed and useful, many do not fit my requirements for this project. I want to spell check any kind of UTF-8 encoded text without needing an Internet connection or a closed source program. The first two groups, the direct and indirect spell checkers, appear to meet these requirements.
So which one to use? Let's take a look under the hood. GNU Ispell gives spelling suggestions if a word is not found in its dictionary. When searching for possible corrections to present, it uses a Damerau–Levenshtein distance of 1. In other words, a candidate must be reachable from the misspelled word by a single insertion, deletion, substitution, or swap of two adjacent characters.
GNU Aspell is an Ispell replacement that can handle UTF-8 by default, has 70 supported language dictionaries, and supports using multiple dictionaries at once.
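To make "a distance of 1" concrete, here is a minimal sketch in pure Perl. It computes the restricted Damerau–Levenshtein distance (also called optimal string alignment) and keeps the dictionary words that are exactly one edit away. The names osa_distance and suggestions and the tiny word list are my own assumptions for illustration. This is not how Ispell is implemented internally, only a demonstration of the metric.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Restricted Damerau-Levenshtein (optimal string alignment) distance.
# Counts insertions, deletions, substitutions, and swaps of two
# adjacent characters. Hypothetical helper, not part of any CPAN module.
sub osa_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ($m, $n) = (scalar @s, scalar @t);

    # $d[$i][$j] holds the distance between the first $i characters
    # of $s and the first $j characters of $t.
    my @d;
    $d[$_][0] = $_ for 0 .. $m;
    $d[0][$_] = $_ for 0 .. $n;

    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,            # deletion
                $d[$i][$j - 1] + 1,            # insertion
                $d[$i - 1][$j - 1] + $cost,    # substitution or match
            );

            # Swap of two adjacent characters.
            if (   $i > 1
                && $j > 1
                && $s[$i - 1] eq $t[$j - 2]
                && $s[$i - 2] eq $t[$j - 1])
            {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

# Keep only the dictionary words exactly one edit away from $word.
sub suggestions {
    my ($word, @dictionary) = @_;
    return grep { osa_distance($word, $_) == 1 } @dictionary;
}

# Illustrative word list, not a real dictionary.
my @words = qw(spell spill shell smell spelling);
print join(', ', suggestions('sepll', @words)), "\n";
```

With this word list, only 'spell' is a single edit (one adjacent swap) away from 'sepll'. The other words need two or more edits. Comparing against every dictionary word like this is slow for a real dictionary, so treat it as a way to understand the metric rather than a practical checker.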
Hunspell is an advanced spell checker based on MySpell that supports both dictionaries and rules. It is currently used by LibreOffice/OpenOffice and has dictionaries for 99 languages.
It looks like Aspell and Hunspell are the front runners. I rejected Meta::Tool::Aspell because it has not been updated since 2002 and the distribution it belongs to has tons of other modules that are not needed for this problem. That leaves Text::Aspell, Text::Hunspell, Search::Tools::SpellCheck, and Text::SpellChecker. Before trying these modules out, I decided to run the command line versions of aspell and hunspell against some test data and compare the output.
Long story short, they both generate too much noise for my purposes because of how they function. If a word is in the dictionary, or can be matched through the checker's rules, everything works out. Anything else is reported as a misspelling. Since I am dealing with text of any kind, from source code files to memos and email, that means things like company names, places, peoples' names, animals, plants, email addresses, and technical terms can easily be flagged as incorrect.
I wrote a few sample programs to try to work around this problem. I will cover the results of my research in a future post.
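The noise is easy to reproduce without any spell checker installed. The sketch below is a deliberately naive dictionary lookup in pure Perl. The word list, the flag_unknown function, and the sample memo are all made up for illustration. It is not how aspell or hunspell work internally, and it is not one of the sample programs mentioned in this post.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, deliberately tiny word list standing in for a real dictionary.
my %dictionary = map { $_ => 1 } qw(
    please email the report to at before friday
);

# Return every whitespace-separated token that is not in the word list.
sub flag_unknown {
    my ($text) = @_;
    my @flagged;
    for my $token (split ' ', $text) {
        # Trim leading and trailing punctuation, keep inner characters.
        (my $word = $token) =~ s/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$//g;
        next unless length $word;
        push @flagged, $word unless $dictionary{ lc $word };
    }
    return @flagged;
}

my $memo = 'Please email the Text::Aspell report to Zoë at zoe@example.com before Friday.';
print "$_\n" for flag_unknown($memo);
```

Every ordinary word passes, yet the module name, the person's name, and the email address are all reported, even though none of them is misspelled. A real dictionary is far larger, but the same kinds of tokens still fall outside it.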