Lingua::* - From 17 to 61 Languages: Resurrecting and Modernizing PetaMem's Number Conversion Suite
We took PetaMem's Lingua::* number conversion modules - dormant for 13 years, since 2013, with 17 languages - and brought them back to life. The suite now covers 61 languages across 7 writing systems (Latin, Cyrillic, Arabic, Devanagari, Armenian, Hebrew, CJK), including all 24 EU official languages plus Latin, Hindi, Yiddish, Mongolian, Uyghur, and more.
New in this release: cross-language numeral arithmetic with overloaded operators, ordinal support for 14 languages, capabilities introspection, and a Galois-field-based transitive test that walks the entire number space across all languages - 5000 steps, zero failures.
my $a = Lingua::Word2Num->new("zwanzig"); # German 20
my $b = Lingua::Word2Num->new("šestnáct"); # Czech 16
say(($a + $b)->as('fr')); # trente-six
say(($a + $b)->as('la')); # triginta sex
Everything on CPAN: cpanm Task::Lingua::PetaMem
Where We Started
PetaMem has maintained Lingua::* number conversion modules on CPAN since 2002. The original use case was straightforward: convert numbers to their written form for cheques and financial documents - "1234 - in Worten: eintausendzweihundertvierunddreißig". The reverse direction (Word2Num) came later for NLP applications.
By 2013, the collection covered 17 languages and went dormant. The code used SVN versioning, mixed coding styles, and each language module had been implemented independently - some using Parse::RecDescent grammars, others with regex pipelines, yet others with OO interfaces. Some were PetaMem originals, some were forks from other CPAN authors.
The Modernization
In March 2026, we decided to bring the suite back to life. The goals:
- Unified boilerplate: use 5.16.0; use utf8; use warnings; everywhere
- Standardize on Export::Attrs and consistent API naming
- Move all legacy module names to canonical Lingua::XXX::Num2Word / Lingua::XXX::Word2Num
- Date-based versioning (0.YYMMDDX)
- Parallel build system with Parallel::ForkManager
- Auto-discovery: wrappers find new language modules from the filesystem
- Proper CPAN kwalitee (Changes, META.json, LICENSE, SECURITY.md, tests)
61 Languages
The reference implementation - Lingua::DEU::Word2Num -
uses a clean Parse::RecDescent grammar in under 90 lines. We used this
as the template for every new language. Each language gets:
- Word2Num: A declarative RecDescent grammar parsing natural language numerals to integers
- Num2Word: A recursive function converting integers to natural language text
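To make the Num2Word half concrete, here is a minimal sketch of a recursive integer-to-text function for English. It is not taken from the suite: the name num2en, the lookup tables, and the supported range (0 to 999,999,999) are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper, not the suite's API: spell out 0..999_999_999 in English.
my @ONES = qw(zero one two three four five six seven eight nine ten
              eleven twelve thirteen fourteen fifteen sixteen seventeen
              eighteen nineteen);
my @TENS = ('', '', qw(twenty thirty forty fifty sixty seventy eighty ninety));

sub num2en {
    my ($n) = @_;
    return $ONES[$n] if $n < 20;
    if ($n < 100) {
        my ($t, $o) = (int($n / 10), $n % 10);
        return $o ? "$TENS[$t]-$ONES[$o]" : $TENS[$t];
    }
    # Split off the largest unit, then recurse on quotient and remainder.
    for my $unit ([1_000_000, 'million'], [1_000, 'thousand'], [100, 'hundred']) {
        my ($size, $name) = @$unit;
        next if $n < $size;
        my ($q, $r) = (int($n / $size), $n % $size);
        my $head = num2en($q) . " $name";
        return $r ? "$head " . num2en($r) : $head;
    }
}

my $words = num2en(1234);
print "$words\n";   # one thousand two hundred thirty-four
```

Real languages need more than this (gender, case, inverted tens and units as in German), which is why each language gets its own module.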
The current language roster spans seven writing systems:
Latin script: Afrikaans, Albanian, Azerbaijani, Basque, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Hungarian, Icelandic, Indonesian, Irish, Italian, Latvian, Lithuanian, Luxembourgish, Maltese, Norwegian, Occitan, Polish, Portuguese, Romanian, Sardinian, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Turkish, Vietnamese, Welsh, Latin
Cyrillic: Belarusian, Bulgarian, Kazakh, Kyrgyz, Macedonian, Mongolian, Russian, Serbian, Ukrainian
Arabic script: Arabic, Persian, Uyghur
Other scripts: Chinese (traditional), Greek, Hebrew, Hindi (Devanagari), Armenian, Japanese (romanized + kanji), Korean (Hangul + romanized), Thai, Yiddish (Hebrew script)
All 24 EU official languages are covered.
The Wrapper Interface
Individual modules can be used directly, but the wrappers provide a unified API accepting both ISO 639-1 and ISO 639-3 codes:
use Lingua::Num2Word qw(cardinal);
say cardinal('de', 42); # zweiundvierzig
say cardinal('ja', 42); # yon ju ni
say cardinal('ar', 42); # اثنان وأربعون
use Lingua::Word2Num qw(cardinal);
say cardinal('fr', 'quarante-deux'); # 42
Cross-Language Numeral Arithmetic
A distinctive feature: Lingua::Word2Num objects
support overloaded arithmetic across languages. The constructor
auto-detects the source language:
use Lingua::Word2Num;
my $a = Lingua::Word2Num->new("zwanzig"); # German 20
my $b = Lingua::Word2Num->new("šestnáct"); # Czech 16
say $a + $b; # 36
say(($a + $b)->as('de')); # sechsunddreissig
say(($a + $b)->as('fr')); # trente-six
say(($a + $b)->as('ar')); # ستة وثلاثون
$a++;
say $a->as('la'); # viginti unus
Arithmetic returns new numeral objects. ->as($lang)
renders into any supported language on demand. The semantics are
clean: arithmetic produces numbers, words require
explicit ->as().
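The pattern behind this can be sketched with Perl's core overload pragma. The class below is a hypothetical stand-in, not the real Lingua::Word2Num implementation: the package name Sketch::Numeral, the hard-coded rendering table, and the absence of language detection are all simplifications for illustration.

```perl
use strict;
use warnings;

package Sketch::Numeral {
    # Hypothetical stand-in for a numeral object; only the overloading
    # pattern is the point here.
    use overload
        '+'  => sub { Sketch::Numeral->new($_[0]->value + _num($_[1])) },
        '-'  => sub {
            my ($x, $y, $swapped) = @_;
            my ($l, $r) = ($x->value, _num($y));
            ($l, $r) = ($r, $l) if $swapped;   # handles e.g. 50 - $obj
            Sketch::Numeral->new($l - $r);
        },
        '""' => sub { $_[0]->value },          # stringifies as the plain number
        '0+' => sub { $_[0]->value };

    my %RENDER = (   # tiny hard-coded table, for illustration only
        fr => { 36 => 'trente-six' },
        la => { 36 => 'triginta sex' },
    );

    sub new   { my ($class, $value) = @_; return bless { value => $value }, $class }
    sub value { return $_[0]{value} }
    sub _num  { return ref $_[0] ? $_[0]->value : $_[0] }
    sub as    { my ($self, $lang) = @_; return $RENDER{$lang}{ $self->value } }
}

my $x   = Sketch::Numeral->new(20);
my $y   = Sketch::Numeral->new(16);
my $sum = $x + $y;            # a new Sketch::Numeral, not a bare scalar
print "$sum\n";               # 36
my $fr = $sum->as('fr');
print "$fr\n";                # trente-six
```

Because each operator returns a fresh object, the operands stay unchanged and the result can be rendered into any language later.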
Ordinals
Fourteen languages currently support ordinal conversion:
use Lingua::Num2Word qw(ordinal has_capability);
say ordinal('de', 3); # dritte
say ordinal('en', 3); # third
say ordinal('fr', 3); # troisième
say ordinal('tr', 3); # üçüncü
# check before calling
if (has_capability('de', 'ordinal')) { ... }
The capabilities() introspection lets callers discover
what each language module supports (cardinal, ordinal, and future
features) without trial and error.
The Galois Walk: Testing at Scale
Traditional per-language unit tests verify individual conversions. But how do you test cross-language consistency across 61 languages and the full number space without exhaustive enumeration?
We use a multiplicative generator over a prime field: g = 7 mod
999999937 (the largest prime below 10^9). Starting
from 1, each step multiplies by 7 modulo the prime, producing a
deterministic, non-sequential walk through the entire number space -
from single digits through hundreds of millions. At each step, the
current value is converted to words in a rotating language, parsed
back to a number, and the generator advances.
A single test of 5000 steps touches all 61 languages, all magnitude
ranges, and all language-pair transitions. Values are clamped to each
language's declared range (via capabilities()), so
languages with smaller intervals still get tested within their valid
space.
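The walk described above can be sketched in a few lines. This is not the suite's test code: the three toy "languages" spell numbers digit by digit, their max values are invented, and clamping by modulo is only one possible reading of "clamped to the declared range". The generator (multiply by 7 modulo 999999937, starting from 1) follows the description in the text.

```perl
use strict;
use warnings;

use constant { G => 7, P => 999_999_937 };

# Hypothetical toy "languages": each spells a number digit by digit and
# declares the largest value it supports (values invented for this sketch).
my %LANG = (
    en => { max => 999_999_999, digits => [qw(zero one two three four five six seven eight nine)] },
    de => { max => 999_999,     digits => [qw(null eins zwei drei vier fuenf sechs sieben acht neun)] },
    la => { max => 9_999,       digits => [qw(nihil unus duo tres quattuor quinque sex septem octo novem)] },
);

sub to_words {
    my ($lang, $n) = @_;
    return join ' ', map { $LANG{$lang}{digits}[$_] } split //, $n;
}

sub to_number {
    my ($lang, $words) = @_;
    my $d = $LANG{$lang}{digits};
    my %value = map { $d->[$_] => $_ } 0 .. 9;
    return 0 + join '', map { $value{$_} } split / /, $words;
}

sub galois_walk {
    my ($steps) = @_;
    my @order = sort keys %LANG;
    my ($x, $failures) = (1, 0);
    for my $i (0 .. $steps - 1) {
        my $lang  = $order[ $i % @order ];            # rotate through languages
        my $value = $x % ($LANG{$lang}{max} + 1);     # clamp into the declared range
        $failures++ if to_number($lang, to_words($lang, $value)) != $value;
        $x = ($x * G) % P;                            # advance the generator
    }
    return $failures;
}

my $failures = galois_walk(5000);
print "$failures failures\n";
```

The intermediate product never exceeds 7 * 999999936, so the arithmetic stays well inside native 64-bit integers.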
The walk immediately proved its value: in its first run, it uncovered parser deficiencies in Korean, Chinese, and Bulgarian that had gone undetected through years of conventional testing. All were fixed - the current exhaustive walk runs 5000/5000 with zero failures.
CPAN Distribution Architecture
Each language produces two CPAN distributions (Num2Word + Word2Num), plus wrapper modules and three Task meta-packages:
shell> cpanm Task::Lingua::PetaMem # install everything
shell> cpanm Task::Lingua::Word2Num # just word→number
shell> cpanm Task::Lingua::Num2Word # just number→word
The Lingua modules are cherry-picked from a larger internal PetaMem
library and packaged for CPAN via an internal automated script. This
script auto-discovers language modules from the filesystem, builds
distributions in parallel, generates README files with native language
descriptions, derives changelogs from git history (sanitized - no
internal information leaks), and auto-tags after successful uploads.
A --query option fetches CPAN Testers results and CPANTS
kwalitee scores directly from the command line, without building
anything - giving us a tight feedback loop between development and
the CPAN ecosystem.
Legacy module names (the
pre-2026 Lingua::NLD::Numbers, Lingua::SPA::Numeros,
etc.) are preserved as deprecation wrappers that delegate to the
canonical Num2Word namespace with a carp
warning.
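A deprecation wrapper of this kind can be very small. The sketch below uses invented package names (Sketch::Legacy::Numbers, Sketch::Canonical::Num2Word) and a placeholder conversion; only the shape, a carp warning followed by delegation, reflects what is described above.

```perl
use strict;
use warnings;

package Sketch::Canonical::Num2Word {
    # Placeholder for the real conversion routine.
    sub num2word { my ($n) = @_; return "<$n in words>" }
}

package Sketch::Legacy::Numbers {
    use Carp qw(carp);

    sub num2word {
        # carp reports the warning from the caller's point of view.
        carp 'Sketch::Legacy::Numbers is deprecated; use Sketch::Canonical::Num2Word';
        goto &Sketch::Canonical::Num2Word::num2word;   # delegate with the same @_
    }
}

my $text = Sketch::Legacy::Numbers::num2word(42);   # warns on STDERR
print "$text\n";                                     # <42 in words>
```

Existing callers keep working and see the warning pointing at their own line, which makes migration easier to track.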
AI as a Development Partner
This modernization was carried out with heavy involvement of AI coding agents - credited in every module's POD as "PetaMem AI Coding Agents". This is not a disclaimer; it is a badge of pride.
The AI agents implemented language modules from linguistic specifications, wrote Parse::RecDescent grammars for languages they had never seen test data for, debugged subtle parser failures found by the Galois walk, and produced code that passes rigorous roundtrip verification across the full number space. The human role was specification, architecture, quality control, and linguistic verification - the kind of work where domain expertise matters. The agents handled the volume - 61 languages, each with two modules, POD, tests, and CPAN packaging.
We stand by the quality of the distributed code. Every module roundtrips correctly through the exhaustive Galois walk. Every distribution scores high on CPANTS kwalitee. The code is readable, documented, and tested. That it was produced with AI assistance does not diminish it - it enabled it. No single developer could have written and verified RecDescent grammars for Armenian, Uyghur, and Sardinian in the same week.
That said, we are deliberately holding back. The codebase moves fast when AI agents are involved - new languages, bug fixes, kwalitee improvements, and feature additions can happen in hours rather than weeks. But uploading 100+ distributions to CPAN daily, while technically possible, would cause unnecessary load on the mirrors and testing infrastructure, and frankly more attention strain than benefit for the community. We batch our releases, run the exhaustive Galois walk before every upload, and will aim for quality over frequency.
What's Next
The infrastructure is in place for continued growth:
- More ordinals: Currently supported for 14 languages, with the remaining languages to be added incrementally
- More languages: Adding a language requires just two .pm files - the wrappers and build system auto-discover them
- Phase 3 rewrite: A handful of legacy "foreign code" modules (IND, POR, ENG::Inflect) still use non-standard APIs. These are candidates for rewrite to the RecDescent pattern
- Decimal and negative numbers: The capabilities system is ready for these features
The full suite is on CPAN under PETAMEM. Source code is maintained by PetaMem s.r.o.
- Richard C. Jelinek, PetaMem s.r.o.