Lingua::* - From 17 to 61 Languages: Resurrecting and Modernizing PetaMem's Number Conversion Suite

By PetaMem on March 28, 2026 1:03 PM under AI, Lingua, NLP

We took PetaMem's 13-year-old Lingua::* number conversion modules - dormant since 2013 with 17 languages - and brought them back to life. The suite now covers 61 languages across 7 writing systems (Latin, Cyrillic, Arabic, Devanagari, Armenian, Hebrew, CJK), including all 24 EU official languages plus Latin, Hindi, Yiddish, Mongolian, Uyghur, and more.

New in this release: cross-language numeral arithmetic with overloaded operators, ordinal support for 14 languages, capabilities introspection, and a Galois-field-based transitive test that walks the entire number space across all languages - 5000 steps, zero failures.

my $a = Lingua::Word2Num->new("zwanzig");      # German 20
my $b = Lingua::Word2Num->new("šestnáct");     # Czech 16
say(($a + $b)->as('fr'));    # trente-six
say(($a + $b)->as('la'));    # triginta sex

Everything on CPAN: cpanm Task::Lingua::PetaMem

Where We Started

PetaMem has maintained Lingua::* number conversion modules on CPAN since 2002. The original use case was straightforward: convert numbers to their written form for cheques and financial documents - "1234 - in Worten: eintausendzweihundertvierunddreißig". The reverse direction (Word2Num) came later for NLP applications.

By 2013, the collection covered 17 languages and went dormant. The code used SVN versioning, mixed coding styles, and each language module had been implemented independently - some using Parse::RecDescent grammars, others with regex pipelines, yet others with OO interfaces. Some were PetaMem originals, some were forks from other CPAN authors.

The Modernization

In March 2026, we decided to bring the suite back to life. The goals:

Unified boilerplate: use 5.16.0; use utf8; use warnings; everywhere
Standardize on Export::Attrs and consistent API naming
Move all legacy module names to canonical Lingua::XXX::Num2Word / Lingua::XXX::Word2Num
Date-based versioning (0.YYMMDDX)
Parallel build system with Parallel::ForkManager
Auto-discovery: wrappers find new language modules from the filesystem
Proper CPAN kwalitee (Changes, META.json, LICENSE, SECURITY.md, tests)
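
The date-based scheme keeps versions ordered under plain numeric comparison, which is all an installer needs to decide which release is newer. A minimal sketch of how such a string could be built, assuming X is a single-digit per-day release counter (the list above does not say what X stands for); the helper name date_version is hypothetical and not part of the suite:

```perl
use 5.16.0;
use warnings;
use version;   # core module, used here only to compare the results

# Hypothetical helper: build a 0.YYMMDDX version string.
# Assumption: X is a per-day release counter in the range 0..9.
sub date_version {
    my ($year, $month, $day, $x) = @_;
    die "counter must be a single digit\n" unless $x =~ /\A[0-9]\z/;
    return sprintf '0.%02d%02d%02d%d', $year % 100, $month, $day, $x;
}

my $older = date_version(2026, 3, 28, 1);   # 0.2603281
my $newer = date_version(2026, 4,  5, 0);   # 0.2604050

# A later date always yields a numerically larger version.
say "$newer is newer than $older"
    if version->parse($newer) > version->parse($older);
```

Because the date fields are zero-padded to a fixed width, the ordering holds without any knowledge of the scheme on the consumer side.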

61 Languages

The reference implementation - Lingua::DEU::Word2Num - uses a clean Parse::RecDescent grammar in under 90 lines. We used this as the template for every new language. Each language gets:

Word2Num: A declarative RecDescent grammar parsing natural language numerals to integers
Num2Word: A recursive function converting integers to natural language text
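
To illustrate the Num2Word half, here is a minimal sketch of a recursive converter for English, limited to 0..999999. It is not the code shipped in any of the modules; the function name num2en and the range limit are assumptions made for the example.

```perl
use 5.16.0;
use warnings;

my @ONES = qw(zero one two three four five six seven eight nine ten
              eleven twelve thirteen fourteen fifteen sixteen seventeen
              eighteen nineteen);
my @TENS = ('', '', qw(twenty thirty forty fifty sixty seventy eighty ninety));

# Hypothetical helper: spell a non-negative integer below one million.
# Each magnitude peels off its leading part and recurses on the remainder.
sub num2en {
    my ($n) = @_;
    die "expected an integer in 0..999999\n"
        unless defined $n && $n =~ /\A[0-9]{1,6}\z/;
    return $ONES[$n] if $n < 20;
    if ($n < 100) {
        my ($tens, $ones) = (int($n / 10), $n % 10);
        return $TENS[$tens] . ($ones ? "-$ONES[$ones]" : '');
    }
    if ($n < 1000) {
        my ($hundreds, $rest) = (int($n / 100), $n % 100);
        return "$ONES[$hundreds] hundred" . ($rest ? ' ' . num2en($rest) : '');
    }
    my ($thousands, $rest) = (int($n / 1000), $n % 1000);
    return num2en($thousands) . ' thousand' . ($rest ? ' ' . num2en($rest) : '');
}

say num2en(1234);   # one thousand two hundred thirty-four
```

The Word2Num direction is the mirror image: a grammar with one rule per magnitude that sums the values of the parts it matches.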

The current language roster spans seven writing systems:

Latin script: Afrikaans, Albanian, Basque, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Hungarian, Icelandic, Indonesian, Irish, Italian, Latvian, Lithuanian, Luxembourgish, Maltese, Norwegian, Occitan, Polish, Portuguese, Romanian, Sardinian, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Turkish, Vietnamese, Welsh, Azerbaijani, Latin

Cyrillic: Belarusian, Bulgarian, Kazakh, Kyrgyz, Macedonian, Mongolian, Russian, Serbian, Ukrainian

Arabic script: Arabic, Persian, Uyghur

Other scripts: Chinese (traditional), Greek, Hebrew, Hindi (Devanagari), Armenian, Japanese (romanized + kanji), Korean (Hangul + romanized), Thai, Yiddish (Hebrew script)

All 24 EU official languages are covered.

The Wrapper Interface

Individual modules can be used directly, but the wrappers provide a unified API accepting both ISO 639-1 and ISO 639-3 codes:

use Lingua::Num2Word qw(cardinal);
say cardinal('de', 42);    # zweiundvierzig
say cardinal('ja', 42);    # yon ju ni
say cardinal('ar', 42);    # اثنان وأربعون

# In a separate script or package: both wrappers export a sub named cardinal
use Lingua::Word2Num qw(cardinal);
say cardinal('fr', 'quarante-deux');  # 42
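
Accepting both code families comes down to normalizing the input before picking a module. A minimal sketch of that step, assuming the wrapper maps codes onto the Lingua::XXX::Num2Word / Lingua::XXX::Word2Num naming scheme; the table below is a small illustrative subset and module_for is a hypothetical helper, not the wrappers' actual code:

```perl
use 5.16.0;
use warnings;

# Illustrative subset of an ISO 639-1 to ISO 639-3 table.
my %ISO1_TO_ISO3 = (
    cs => 'ces', de => 'deu', en => 'eng', es => 'spa',
    fr => 'fra', nl => 'nld', sv => 'swe',
);
my %KNOWN_ISO3 = map { $_ => 1 } values %ISO1_TO_ISO3;

# Hypothetical helper: turn a 2- or 3-letter code into a module name.
# $direction is 'Num2Word' or 'Word2Num'.
sub module_for {
    my ($code, $direction) = @_;
    my $iso3 = $ISO1_TO_ISO3{ lc($code) } // lc($code);
    die "unknown language code '$code'\n" unless $KNOWN_ISO3{$iso3};
    return sprintf 'Lingua::%s::%s', uc($iso3), $direction;
}

say module_for('de',  'Word2Num');   # Lingua::DEU::Word2Num
say module_for('swe', 'Word2Num');   # Lingua::SWE::Word2Num
```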

Cross-Language Numeral Arithmetic

A distinctive feature: Lingua::Word2Num objects support overloaded arithmetic across languages. The constructor auto-detects the source language:

use Lingua::Word2Num;
my $a = Lingua::Word2Num->new("zwanzig");      # German 20
my $b = Lingua::Word2Num->new("šestnáct");     # Czech 16
say $a + $b;                # 36
say(($a + $b)->as('de'));    # sechsunddreissig
say(($a + $b)->as('fr'));    # trente-six
say(($a + $b)->as('ar'));    # ستة وثلاثون
$a++;
say $a->as('la');           # viginti unus

Arithmetic returns new numeral objects. ->as($lang) renders into any supported language on demand. The semantics are clean: arithmetic produces numbers, words require explicit ->as().
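
The mechanism behind this is Perl's core overload pragma. The sketch below shows the pattern with a toy class: Toy::Numeral is a made-up name, and its as() method merely spells digits one by one instead of producing real numerals, so it stands in for the actual language modules rather than reproducing them.

```perl
use 5.16.0;
use warnings;

package Toy::Numeral {
    use overload
        '+'      => sub { $_[0]->_add($_[1]) },
        '""'     => sub { $_[0]{value} },   # plain use yields the number
        '0+'     => sub { $_[0]{value} },
        fallback => 1;

    # Toy renderers: one word per digit (non-negative integers only).
    my %DIGITS = (
        en => [qw(zero one two three four five six seven eight nine)],
        it => [qw(zero uno due tre quattro cinque sei sette otto nove)],
    );

    sub new {
        my ($class, $value) = @_;
        return bless { value => $value }, $class;
    }

    # Addition returns a fresh object and leaves both operands untouched.
    sub _add {
        my ($self, $other) = @_;
        my $rhs = ref($other) ? $other->{value} : $other;
        return __PACKAGE__->new($self->{value} + $rhs);
    }

    # Words are only produced on explicit request.
    sub as {
        my ($self, $lang) = @_;
        my $words = $DIGITS{$lang} or die "unsupported language '$lang'\n";
        return join ' ', map { $words->[$_] } split //, $self->{value};
    }
}

my $x   = Toy::Numeral->new(20);
my $y   = Toy::Numeral->new(16);
my $sum = $x + $y;

say $sum;                    # 36
say $sum->as('en');          # three six
say(($x + $y)->as('it'));    # tre sei
```

Note the doubled parentheses in the last line: say ($x + $y)->as(...) would be parsed as a call to say with $x + $y as its argument list.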

Ordinals

Fourteen languages currently support ordinal conversion:

use Lingua::Num2Word qw(ordinal has_capability);
say ordinal('de', 3);    # dritte
say ordinal('en', 3);    # third
say ordinal('fr', 3);    # troisième
say ordinal('tr', 3);    # üçüncü
# check before calling
if (has_capability('de', 'ordinal')) { ... }

The capabilities() introspection lets callers discover what each language module supports (cardinal, ordinal, and future features) without trial and error.
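
As a rough picture of what such introspection can look like, here is a sketch built on a plain lookup table. The data layout, the placeholder language codes xx and yy, and the helper name supports are all assumptions for illustration; the real modules define their own capability data and API.

```perl
use 5.16.0;
use warnings;

# Hypothetical capability table with placeholder language codes.
my %CAPABILITIES = (
    xx => { cardinal => 1, ordinal => 1, range => [0, 999_999_999] },
    yy => { cardinal => 1, ordinal => 0, range => [0, 9_999] },
);

# Hypothetical helper: true if the language declares the feature.
sub supports {
    my ($lang, $feature) = @_;
    return $CAPABILITIES{$lang} && $CAPABILITIES{$lang}{$feature} ? 1 : 0;
}

for my $lang (sort keys %CAPABILITIES) {
    my ($min, $max) = @{ $CAPABILITIES{$lang}{range} };
    say "$lang: ordinal ", (supports($lang, 'ordinal') ? 'yes' : 'no'),
        ", range $min..$max";
}
```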

The Galois Walk: Testing at Scale

Traditional per-language unit tests verify individual conversions. But how do you test cross-language consistency across 61 languages and the full number space without exhaustive enumeration?

We use a multiplicative generator over a prime field: g=7 mod 999999937 (the largest prime below 10⁹). Starting from 1, each step multiplies by 7 modulo the prime, producing a deterministic, non-sequential walk through the entire number space - from single digits through hundreds of millions. At each step, the current value is converted to words in a rotating language, parsed back to a number, and the generator advances.

A single test of 5000 steps touches all 61 languages, all magnitude ranges, and all language-pair transitions. Values are clamped to each language's declared range (via capabilities()), so languages with smaller intervals still get tested within their valid space.
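
The shape of the walk can be sketched in a few lines of core Perl. The generator and modulus are the ones named above; everything else is an assumption made for the example: the two "languages" are toy converters that spell digits one by one, and clamping is modelled as reduction into 0..max, since the text does not say how the real test clamps.

```perl
use 5.16.0;
use warnings;

use constant PRIME     => 999_999_937;   # modulus from the text
use constant GENERATOR => 7;             # multiplier from the text

# Toy stand-in for a language module: spells a number digit by digit
# and declares a maximum value it claims to support.
sub make_toy_language {
    my ($max, @digit_words) = @_;
    my %digit_of = map { $digit_words[$_] => $_ } 0 .. 9;
    return {
        max      => $max,
        to_words => sub { join ' ', map { $digit_words[$_] } split //, $_[0] },
        to_num   => sub { 0 + join '', map { $digit_of{$_} } split ' ', $_[0] },
    };
}

my @languages = (
    make_toy_language(999_999_999, qw(zero one two three four five six seven eight nine)),
    make_toy_language(9_999,       qw(zero uno due tre quattro cinque sei sette otto nove)),
);

# Walk the multiplicative sequence, rotating through the languages.
# Each value is rendered to words, parsed back, and compared.
sub galois_walk {
    my ($steps) = @_;
    my ($value, $failures) = (1, 0);
    for my $step (0 .. $steps - 1) {
        my $lang = $languages[ $step % @languages ];
        # Assumption: clamping is modelled as reduction into 0..max.
        my $clamped = $value % ($lang->{max} + 1);
        my $parsed  = $lang->{to_num}->( $lang->{to_words}->($clamped) );
        $failures++ if $parsed != $clamped;
        $value = ($value * GENERATOR) % PRIME;
    }
    return ($value, $failures);
}

my ($last, $failures) = galois_walk(5000);
say "$failures failure(s) in 5000 steps";
```

The intermediate product never exceeds 7 × 999999936, so the arithmetic stays well inside native 64-bit integers and needs no bignum support.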

The walk immediately proved its value: in its first run, it uncovered parser deficiencies in Korean, Chinese, and Bulgarian that had gone undetected through years of conventional testing. All were fixed - the current walk completes 5000/5000 steps with zero failures.

CPAN Distribution Architecture

Each language produces two CPAN distributions (Num2Word + Word2Num), plus wrapper modules and three Task meta-packages:

shell> cpanm Task::Lingua::PetaMem    # install everything
shell> cpanm Task::Lingua::Word2Num   # just word→number
shell> cpanm Task::Lingua::Num2Word   # just number→word

The Lingua modules are cherry-picked from a larger internal PetaMem library and packaged for CPAN via an internal automated script. This script auto-discovers language modules from the filesystem, builds distributions in parallel, generates README files with native language descriptions, derives changelogs from git history (sanitized - no internal information leaks), and auto-tags after successful uploads. A --query option fetches CPAN Testers results and CPANTS kwalitee scores directly from the command line, without building anything - giving us a tight feedback loop between development and the CPAN ecosystem.

Legacy module names (the pre-2026 Lingua::NLD::Numbers, Lingua::SPA::Numeros, etc.) are preserved as deprecation wrappers that delegate to the canonical Num2Word namespace with a carp warning.
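
The delegation pattern itself is small. A minimal sketch, using made-up package and sub names (Toy::NLD::Numbers, Toy::NLD::Num2Word, convert, num2word) rather than the real legacy APIs, which differ from module to module:

```perl
use 5.16.0;
use warnings;

# Stand-in for a canonical module; the sub name is illustrative.
package Toy::NLD::Num2Word {
    sub num2word {
        my ($n) = @_;
        return "<words for $n>";
    }
}

# Stand-in for a legacy name kept only for backwards compatibility.
package Toy::NLD::Numbers {
    use Carp qw(carp);

    sub convert {
        # carp reports the warning from the caller's point of view.
        carp 'Toy::NLD::Numbers is deprecated, use Toy::NLD::Num2Word';
        return Toy::NLD::Num2Word::num2word(@_);
    }
}

my $words = Toy::NLD::Numbers::convert(42);   # emits the warning on STDERR
say $words;
```

Existing callers keep working and get pointed at the new name each time they go through the old one.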

AI as a Development Partner

This modernization was carried out with heavy involvement of AI coding agents - credited in every module's POD as "PetaMem AI Coding Agents". This is not a disclaimer; it is a badge of pride.

The AI agents implemented language modules from linguistic specifications, wrote Parse::RecDescent grammars for languages they had never seen test data for, debugged subtle parser failures found by the Galois walk, and produced code that passes rigorous roundtrip verification across the full number space. The human role was specification, architecture, quality control, and linguistic verification - the kind of work where domain expertise matters. The agents handled the volume - 61 languages, each with two modules, POD, tests, and CPAN packaging.

We stand by the quality of the distributed code. Every module roundtrips correctly through the exhaustive Galois walk. Every distribution scores high on CPANTS kwalitee. The code is readable, documented, and tested. That it was produced with AI assistance does not diminish it - it enabled it. No single developer could have written and verified RecDescent grammars for Armenian, Uyghur, and Sardinian in the same week.

That said, we are deliberately holding back. The codebase moves fast when AI agents are involved - new languages, bug fixes, kwalitee improvements, and feature additions can happen in hours rather than weeks. But uploading 100+ distributions to CPAN daily, while technically possible, would cause unnecessary load on the mirrors and testing infrastructure, and frankly more attention strain than benefit for the community. We batch our releases, run the exhaustive Galois walk before every upload, and will aim for quality over frequency.

What's Next

The infrastructure is in place for continued growth:

More ordinals: Currently supported for 14 languages, with the remaining languages to be added incrementally
More languages: Adding a language requires just two .pm files - the wrappers and build system auto-discover them
Phase 3 rewrite: A handful of legacy "foreign code" modules (IND, POR, ENG::Inflect) still use non-standard APIs. These are candidates for rewrite to the RecDescent pattern
Decimal and negative numbers: The capabilities system is ready for these features
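
Filesystem auto-discovery of this kind can be sketched with core modules alone. The helper below is hypothetical, not the wrappers' actual code, and assumes modules live at Lingua/<CODE>/<Direction>.pm with a three-letter uppercase code, following the naming scheme described earlier:

```perl
use 5.16.0;
use warnings;
use File::Glob qw(bsd_glob);   # handles directories containing spaces
use File::Spec;

# Hypothetical helper: list the language codes that provide a given
# direction ('Num2Word' or 'Word2Num') under any of the directories.
sub discover_languages {
    my ($direction, @lib_dirs) = @_;
    my %found;
    for my $dir (grep { !ref } @lib_dirs) {   # @INC may hold code refs
        my $pattern = File::Spec->catfile($dir, 'Lingua', '*', "$direction.pm");
        for my $path (bsd_glob($pattern)) {
            my @parts = File::Spec->splitdir($path);
            my $code  = $parts[-2];           # parent directory name
            $found{$code} = 1 if $code =~ /\A[A-Z]{3}\z/;
        }
    }
    my @codes = sort keys %found;
    return @codes;
}

say join ' ', discover_languages('Num2Word', @INC);
```

With this approach, dropping a new Lingua/XXX/ directory into the library path is enough for the language to show up on the next run.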

The full suite is on CPAN under PETAMEM. Source code is maintained by PetaMem s.r.o.

- Richard C. Jelinek, PetaMem s.r.o.

Tagged as: ai, cpan, lingua, nlp, petamem, text-processing

3 Comments

Matthew Persico | April 3, 2026 6:16 AM

This appears to be a vast amount of excellent work; kudos. But I am going to be "that guy":

If it were me, I would either

* convert to semantic (maj.min.prior) versioning with which most of the OSS world is already familiar

* put the century in the date. I know no one expects this stuff to be running 75 years from now but having lived through Y2K, looking at a century-less date evokes the same cringe as listening to nails on a chalkboard. I would also put some kind of separator character between the YYYYMMDD date and the X position to make the string more readable.

Thank you. Good luck!

PetaMem | April 5, 2026 10:15 AM

Two things:

1)
How would you automate that?

How should a fully automated release mechanism decide which version to set?

2)
Is it really important that the version is readable to you?

Or is it more important that any tool (cpanm?) is able to discern which one is newer and suggest an update?

We'll add the century to the version when we get there. First, let's see what happens ~2038 with the epoch time.

Christian Hansen | April 6, 2026 9:12 PM

I took a look at Lingua::SWE::Word2Num, and the documentation states that the input text must be UTF-8 encoded. That feels like an awkward API, does it really expect raw UTF-8 code units instead of native Perl strings?

From the SYNOPSIS:

use Lingua::SWE::Word2Num;
my $num = Lingua::SWE::Word2Num::w2n( 'fyrtiotve' );

There’s no word like “fyrtiotve” in Swedish, it should probably be “fyrtiotvå.”

Why publish each language as its own separate distribution when they appear to share the same dependencies? This approach creates a logistical nightmare for both end users and package managers.
