Perl Weekly Challenge # 10: Roman Numerals and the Jaro-Winkler Distance

These are some answers to the Week 10 of the Perl Weekly Challenge organized by Mohammad S. Anwar.

Spoiler Alert: This weekly challenge deadline is due in several days from now (June 2, 2019). This blog post offers some solutions to this challenge, please don't read on if you intend to complete the challenge on your own.

Challenge # 1: Roman Numerals

Write a script to encode/decode Roman numerals. For example, given Roman numeral CCXLVI, it should return 246. Similarly, for decimal number 39, it should return XXXIX. Checkout Wikipedia page for more information.

Of course, there are some Perl modules on the CPAN to convert from and to Roman numerals, but there wouldn't be any challenge if the idea were to use an existing module.

Most people know more or less how Roman numerals work. They use Latin letters to represent numbers:

|---------------------------------------------------|
| Symbol |  I  |  V  |  X  |  L  |  C  |  D  |  M   |
|---------------------------------------------------|
| Value  |  1  |  5  |  10 |  50 | 100 | 500 | 1000 |
|---------------------------------------------------|

In general, Roman numerals use additive notation: for example, MCLXXIII means 1000 + 100 + 50 + 20 + 3 = 1173. Or, at least, this is so when the symbols are written from left to right in decreasing value order.

If, however, a given symbol has a smaller value than a symbol placed on its right, then this is an example of subtractive notation: in that case, the smaller symbol is subtracted from the one its right. For example, IV means 1 subtracted from 5, i.e. 5 - 1 = 4. Similarly, IX and XC respectively mean 10 - 1 = 9 and 100 - 10 = 90. And MCMXLIX corresponds to 1000 + ( 1000 - 100) + (50 - 10) + (10 - 1) = 1949.

The overall problem, though, is that there is no general standard for Roman numerals. Applying the rules above makes it possible to decode more or less unambiguously any Roman numeral coded according to such aforesaid rules, but there may be several different possible ways to encode a number into a Roman numeral. For example, 99 could be encoded as XCXI or IC (or even XCVIIII or possibly LXXXXVIIII). The first transcription (XCXI) seems to be the most frequent one, so this is the one we will chose when encoding to Roman numerals. Still, IC seems to be a valid Roman numeral for 99, so we will try at least to be able to decode it if we find it.

Note that there is no Roman numeral for zero and the largest possible Roman numeral with the above rules is 3,999.

Roman Numerals in Perl 5

If Roman numerals only had the additive notation, it would be very easy: for converting a Roman numeral, just pick up each of the symbols in turn, add them up, and you're done. The trouble comes with subtractive notation.

So my first idea to decode a Roman numeral was to remove any subtractive notation part from the input Roman numeral and replace it with an additive notation. For example, given the numeral MCIX, I would replace IX with VIIII, thus yielding MCVIIII; it is now very easy to add the symbols' values to find 1009. We can use a series of regex substitutions for that:

sub remove_subtractive {
    my $roman = shift;
    for ($roman) {
        s/IV/IIII/;             # 4
        s/IX/VIIII/;            # 9    
        s/IL/XLVIIII/;          # 49
        s/XL/XXXX/;             # 40 to 49
        s/IC/LXXXXVIIII/;       # 99
        s/XC/LXXXX/;            # 90 to 98
        s/ID/XDVIIII/;          # 499
        s/XD/CDLXXXX/;          # 490 to 499
        s/CD/CCCC/;             # 400 to 499
        s/IM/CMLXXXXVIIII/;     # 999
        s/XM/CMLXXXX/;          # 990 to 998
        s/CM/DCCCC/;            # 900 to 999
    }
    return $roman;
}

sub from_roman {
    my $additive = remove_subtractive uc shift;
    my $arabic = 0; 
    for (split //, $additive){
        $arabic += $roman_table{$_};
    }
    return $arabic;
}

That's of course way too complicated. As soon as I started typing the first few of these regex substitutions in the remove_subtractive subroutine, I knew I wanted to find a better way to decode Roman numerals. I nonetheless completed it, because I wanted to show it on this blog. I also tested it quite thoroughly, and it seems to work properly.

The new idea is to read the symbols one by one from left to right and to add the values, keeping track of the previously seen value. If the current value is larger than the previous value, then we were in a case of a subtractive combination at the previous step, and we need to subtract twice the previous value (once because it is a subtractive combination, and once again because we have previously erroneously added it). That's actually quite simple (see how the code of the from_roman subroutine below is much shorter and simpler than what we had tried above).

For encoding Arabic numerals to Roman numerals, the easiest is to perform integer division with decreasing values corresponding to Roman numerals (i.e. M D C L X V I). For example, suppose we want to encode 2019. We first try to divide by 1,000 (corresponding to M). We get 2, so the start of the string representing the Roman numeral will be MM. Then we continue with the remainder, i.e. 19. We try integer division successively with 500, 100 and 50 and get 0, so don't do anything with the result. Next we try with 10 and get 1, so the temporary result is now MMX. The remainder is 9; if we continue the same way with 9, we would divide by 5, add V to our string, and eventually obtain MMXVIIII, which is a correct (simplistic) Roman numeral for 2019, but not really what we want, since we want to apply the subtractive combination and get MMXIX.

Rather than reprocessing VIIII into IX (we've seen before how tedious this could be with regexes), we can observe that if our list of decreasing Roman values also includes IX (9), then it will work straight without any need to reprocess the result. So, our list of decreasing values corresponding to Roman numperals needs to be augmented with subtractive cases to M CM D CD C XC L XL X IX V IV I (corresponding to numbers 1000, 900, 500, 100, 90, 50, 40, 10, 9, 5, 4, 1). Using this list instead of the original one removes any need for special processing for subtractive combinations: we just need to keep doing integer divisions with the decreasing values and continue the rocessing with the remainder. This what the to_roman subroutine below does.

So our program to convert from and to Roman numerals is as follows:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;


my %rom_tab = (I => 1,  V => 5, X => 10, L => 50, C => 100, D => 500, M => 1000);

sub from_roman {
    my $roman = uc shift;
    my $numeric = 0;
    my $prev_letter = "M";
    for my $letter (split //, $roman) {
        $numeric -= 2 * $rom_tab{$prev_letter} 
            if $rom_tab{$letter} > $rom_tab{$prev_letter};
        $numeric += $rom_tab{$letter};
        $prev_letter = $letter;
    }
    return $numeric;
}

sub to_roman {
    my $arabic = shift;
    warn "$arabic out of bounds" unless $arabic > 0 and $arabic < 4000;
    my %hash = %rom_tab;
    $hash{$_->[0]} = $_->[1] for (['IV', 4], ['IX', 9], ['XL', 40], 
        ['XC', 90], ['CD', 400], ['CM', 900] );
    my $roman = "";
    for my $key (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
        my $num = int ($arabic / $hash{$key});
        $roman .= $key x $num;
        $arabic -= $hash{$key} * $num; 
    }
    return $roman;
}

say "From Roman to Arabic";
say "$_\t=>\t", from_roman $_ for qw <MM MCM LXXIII XCIII IC XCIX xv>;

my @test_nums = qw <19 42 67 90 97 99 429 498 687 938 949 996 2145 3597>;
say "From Arabic to Roman";
say "$_\t=>\t", to_roman $_ for @test_nums;

say "Some round trips: from Arabic to Roman to Arabic";
say "$_\t=>\t", from_roman to_roman $_ for @test_nums;

say "Sanity check (round trip through the whole range)";
for (1..3999) {
    my $result = from_roman to_roman $_;
    say "Error on $_ " unless $result == $_;
}

And this is the result of the various tests:

$ perl roman.pl
From Roman to Arabic
MM      =>      2000
MCM     =>      1900
LXXIII  =>      73
XCIII   =>      93
IC      =>      99
XCIX    =>      99
xv      =>      15
From Arabic to Roman
19      =>      XIX
42      =>      XLII
67      =>      LXVII
90      =>      XC
97      =>      XCVII
99      =>      XCIX
429     =>      CDXXIX
498     =>      CDXCVIII
687     =>      DCLXXXVII
938     =>      CMXXXVIII
949     =>      CMXLIX
996     =>      CMXCVI
2145    =>      MMCXLV
3597    =>      MMMDXCVII
Some round trips: from Arabic to Roman to Arabic
19      =>      19
42      =>      42
67      =>      67
90      =>      90
97      =>      97
99      =>      99
429     =>      429
498     =>      498
687     =>      687
938     =>      938
949     =>      949
996     =>      996
2145    =>      2145
3597    =>      3597
Sanity check (round trip through the whole range)

Note that the sanity check (the last test) does not print anything because the round trip worked correctly for the whole range between 1 and 3,999 (it would display only anomalies, but did not find any).

Also note that, in the Roman to Arabic conversion, both IC and XCIX return 99, as expected, whereas, in the opposite conversion, 99 returns XCIX.

Roman Numerals in Perl 6

Since we think that we now have a good algorithm to convert to and from Roman numerals (and we don't see any P6 feature to significantly simplify it), we can just translate that into Perl 6:

use v6;

subset Roman-str of Str where $_ ~~ /^<[IVXLCDMivxlcdm]>+$/;

my %rom-tab = < I 1   V 5   X 10   L 50   C 100  D 500  M 1000 
               IV 4  IX 9   XL 40  XC 90  CD 400   CM 900 >;
my @ordered_romans = reverse sort { %rom-tab{$_} }, keys %rom-tab;

sub from-roman (Roman-str $roman) {
    my $numeric = 0;
    my $prev_letter = "M";
    for $roman.uc.comb -> $letter {
        $numeric -= 2 * %rom-tab{$prev_letter} 
            if %rom-tab{$letter} > %rom-tab{$prev_letter};
        $numeric += %rom-tab{$letter};
        $prev_letter = $letter;
    }
    return $numeric;
}

sub to-roman (Int $arabic is copy where  { 0 < $_ < 4000 }) {
    my $roman = "";
    for @ordered_romans -> $key {
        my $num = ($arabic / %rom-tab{$key}).Int;
        $roman ~= $key x $num;
        $arabic -= %rom-tab{$key} * $num; 
    }
    return $roman;
}

Besides the small necessary syntactical adjustments between P5 and P6, there are two or three additional changes. First, I used the subroutine signatures to add some limited form of data validation. For checking input Roman numerals, I created the Roman-str subtype (well, really, a subset) which accepts strings that are made only with the seven letters used in Roman numerals (both lower and upper case). This makes it possible to validate (to a certain extent) the argument passed to the from-roman subroutine. Of course, some strings made of these letters may still be invalid Roman numerals, but, at least, we'll get an exception if we inadvertently pass an Arabic number to it.

Similarly, since, according to our rules, Roman numerals can represent numbers between 1 and 3,999, the signature of the to-roman subroutine only accepts integers larger than 0 and less than 4,000.

Also, since I knew from the beginning that I was going to use the M CM D CD C XC L XL X IX V IV I sequence of Roman numerals in the to-roman subroutine, I initialized the %rom-tab hash with all these values, rather than adding some of them (those with two letters) to the hash afterwards as it was done in the P5 version. This is not a problem since the other subroutine (from-roman) using the hash only looks up single letters.

Testing the Perl 6 Roman Numerals Program

It would be fairly easy to use almost exactly the same tests as in Perl 5;

say "$_\t=>\t", from_roman $_ for <MM MCM LXXIII XCIII IC XCIX xv>;

my @test_nums =  <19 42 67 90 97 99 429 498 687 938 949 996 2145 3597>;
say "From Arabic to Roman";
say "$_\t=>\t", to_roman $_.Int for @test_nums;

say "Some round trips: from Arabic to Roman to Arabic";
say "$_\t=>\t", from_roman to_roman $_.Int for @test_nums;

say "Sanity check (round trip through the whole range)";
for (1..3999) {
    my $result = from_roman to_roman $_;
    say "Error on $_ " unless $result == $_;
}

These tests work well and produce the same output as the one listed above for the Perl 5 program.

But there is a better way of testing, which I introduced in my blog on week 2 of the Perl Weekly Challenge: the Test module. I was hoping at the time that I would be able to provide a more significant example, and this challenge is a perfect opportunity to do that. This enables us to do much more thorough testing, including tests on exceptions (for example for a type mismatch in a subroutine's argument).

This my example testing code:

use Test;
plan 45;

say "\nFrom Roman to Arabic";
for < MM 2000 MCM 1900 LXXIII 73 XCIII 93 IC 99 XCIX 99 xv 15> -> $roman, $arabic {
    is from-roman($roman), $arabic, "$roman => $arabic";
}
isnt from-roman("VII"), 8, "OK: VII not equal to 8";
for <12 foo bar MCMA> -> $param {
    dies-ok {from-roman $param}, "Caught exception OK in from-roman: wrong parameter";
}
say "\nFrom Arabic to Roman";
my %test-nums = map { $_[0] => $_[1] }, (
    <19 42 67 90 97 99 429 498 687 938 949 996 2145 3597> Z 
    <XIX XLII LXVII XC XCVII XCIX CDXXIX CDXCVIII DCLXXXVII 
     CMXXXVIII CMXLIX CMXCVI MMCXLV MMMDXCVII>);
for %test-nums.keys -> $key {
    is to-roman($key.Int), %test-nums{$key}, "$key => %test-nums{$key}";
}
for 0, 4000, "foobar", 3e6 -> $param {
    dies-ok { to-roman $param}, "Caught exception OK in to-roman: wrong parameter";
}
say "\nSome round trips: from Arabic to Roman to Arabic";
for %test-nums.keys.sort -> $key {
    is from-roman(to-roman $key.Int), $key, "Round trip OK for $key";
}
my $upper-bound = 3999;
say "\nSanity check (round trip through the whole range 1 .. $upper-bound range)";

lives-ok {
    for (1..$upper-bound) -> $arabic {
        die "Failed round trip on $arabic" if from-roman(to-roman $arabic) != $arabic;
    }
}, "Passed round trip on the full 1..$upper-bound range";

The second line above says that we're going to run 45 test cases (the last test case, the sanity check round trip, is actually testing 3,999 subcases, but it counts as only 1 case).

The is function test for equality of its first two arguments (and isnt tests reports "ok" is the values are not equal). The dies-ok checks that the code being tested throws an exception (good here to check that invalid subroutine arguments are rejected) and the lives-ok check that the code block being tested does not throw any exception.

These tests produce the following output:

1..45

From Roman to Arabic
ok 1 - MM => 2000
ok 2 - MCM => 1900
ok 3 - LXXIII => 73
ok 4 - XCIII => 93
ok 5 - IC => 99
ok 6 - XCIX => 99
ok 7 - xv => 15
ok 8 - OK: VII not equal to 8
ok 9 - Caught exception OK in from-roman: wrong parameter
ok 10 - Caught exception OK in from-roman: wrong parameter
ok 11 - Caught exception OK in from-roman: wrong parameter
ok 12 - Caught exception OK in from-roman: wrong parameter

From Arabic to Roman
ok 13 - 687 => DCLXXXVII
ok 14 - 97 => XCVII
ok 15 - 938 => CMXXXVIII
ok 16 - 498 => CDXCVIII
ok 17 - 19 => XIX
ok 18 - 429 => CDXXIX
ok 19 - 3597 => MMMDXCVII
ok 20 - 2145 => MMCXLV
ok 21 - 67 => LXVII
ok 22 - 90 => XC
ok 23 - 99 => XCIX
ok 24 - 996 => CMXCVI
ok 25 - 949 => CMXLIX
ok 26 - 42 => XLII
ok 27 - Caught exception OK in to-roman: wrong parameter
ok 28 - Caught exception OK in to-roman: wrong parameter
ok 29 - Caught exception OK in to-roman: wrong parameter
ok 30 - Caught exception OK in to-roman: wrong parameter

Some round trips: from Arabic to Roman to Arabic
ok 31 - Round trip OK for 19
ok 32 - Round trip OK for 2145
ok 33 - Round trip OK for 3597
ok 34 - Round trip OK for 42
ok 35 - Round trip OK for 429
ok 36 - Round trip OK for 498
ok 37 - Round trip OK for 67
ok 38 - Round trip OK for 687
ok 39 - Round trip OK for 90
ok 40 - Round trip OK for 938
ok 41 - Round trip OK for 949
ok 42 - Round trip OK for 97
ok 43 - Round trip OK for 99
ok 44 - Round trip OK for 996

Sanity check (round trip through the whole range 1 .. 3999 range)
ok 45 - Passed round trip on the full 1..3999 range

If you want to know more about testing strategies in Perl 6, please look at the Perl 6 testing tutorial, the Test module documentation, or the Debugging section of Chapter 14 of my Perl 6 book.

Challenge 2: The Jaro-Winkler Distance

Write a script to find the Jaro-Winkler distance between two strings. For more information check the Wikipedia page.

The Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 1 equates to no similarity and 0 is an exact match.

To find the Jaro-Winkler distance between two strings, we need first to find the Jaro similarity. Then we compute the Jaro-Winkler similarity, and the Jaro-Winkler distance is just 1 minus the Jaro-Winkler similarity.

Jaro-Winkler in Perl 5

This is my attempt at computing the Jaro-Winkler distance in Perl 5:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;

sub max { $_[0] > $_[1] ? $_[0] : $_[1] }
sub min { $_[0] > $_[1] ? $_[1] : $_[0] }

sub simj {
    my ($str1, $str2) = @_;
    my $len1 = length $str1;
    my $len2 = length $str2;
    my $dmax = int (max($len1, $len2) / 2) -1;
    my @st1 = split //, $str1;
    my $i = 0;
    my @matches;
    for my $letter (split //, $str2) {
        push @matches, $letter if (grep { $letter eq $_ } 
            @st1[max(0,$i-$dmax)..min($i+$dmax,$len1-1)]);
        $i++;
    }
    my $nb_matches = scalar @matches;
    return 0 if $nb_matches == 0;
    my %matching_letters = map { $_ => 1} @matches;
    my @matches_str1 = grep exists $matching_letters{$_}, 
        split //, $str1;
    my $disorder = 0;
    for my $i (0..$nb_matches-1) {
        $disorder++ if $matches[$i] ne $matches_str1[$i];
    }
    my $transposition = $disorder / 2;
    return ($nb_matches / $len1 + $nb_matches / $len2 + 
        ($nb_matches - $transposition)/$nb_matches) / 3;
}
sub simw {
    my ($str1, $str2) = @_;
    my $p_constant = 0.1;
    my $length_prefix = 0;
    for my $count (0..3) {
        last if substr $str1, $count, 1 ne substr $str2, $count, 1;
        $length_prefix++;
    }
    my $simj = simj $str1, $str2;
    return $simj + $length_prefix * $p_constant * (1 - $simj);
}

my @tests = ( ["FOO", "BAR"], ["CRATE", "TRACE"], 
    ["CRATE", "CRATE"], ["TRACE", "CRATE"], 
    ["CREATE", "TRACT"], ["DWAYNE", "DUANE"], );
for my $word_pair (@tests) {
    my ($w1, $w2) = @$word_pair;
    my $simw = simw $w1, $w2;
    say "Jaro-Winkler distance between $w1 and $w2 is: ", 1 - $simw;
}

Note that the Wikipedia description of the transposition calculation isn't very detailed, so I'm not sure to really understand it. I think that the way it is done here makes sense, but I'm not sure this is really what is required.

With the tests listed at the bottom of the above program, this program displays the following output:

$ perl jaro_dist.pl
Jaro-Winkler distance between FOO and BAR is: 1
Jaro-Winkler distance between CRATE and TRACE is: 0.266666666666667
Jaro-Winkler distance between CRATE and CRATE is: 0
Jaro-Winkler distance between TRACE and CRATE is: 0.266666666666667
Jaro-Winkler distance between CREATE and TRACT is: 0.3
Jaro-Winkler distance between DWAYNE and DUANE is: 0.177777777777778

Keeping in mind the doubt about the calculation of the transposition, the results seem consistent. In particular, two indentical words have a distance of 0 and two completely different words (no common letter) have a distance of 1.

Jaro-Winkler in Perl 6

This is the same algorithm for computing the Jaro-Winkler distance as in Perl 5:

use v6;

sub simjaro (Str $str1, Str $str2) {
    my $len1 = $str1.chars;
    my $len2 = $str2.chars;
    my $dmax = (max($len1, $len2) / 2).Int -1;
    my @st1 = $str1.comb;
    my $i = 0;
    my @matches;
    for $str2.comb -> $letter {
        push @matches, $letter if (grep { $letter eq $_ }, 
            @st1[max(0,$i-$dmax)..min($i+$dmax,$len1-1)]);
        $i++;
    }
    my $nb_matches = @matches.elems;
    return 0 if $nb_matches == 0;
    my %matching_letters = map { $_ => 1}, @matches;
    my @matches_str1 = grep { %matching_letters{$_}:exists }, 
        $str1.comb;
    my $disorder = 0;
    for 0..$nb_matches-1 -> $i {
        $disorder++ if @matches[$i] ne @matches_str1[$i];
    }
    my $transposition = $disorder / 2;
    return ($nb_matches / $len1 + $nb_matches / $len2 + 
        ($nb_matches - $transposition)/$nb_matches) / 3;
}
sub simwinkler (Str $str1, Str $str2) {
    my $p_constant = 0.1;
    my $length_prefix = 0;
    for 0..3 -> $count {
        last if substr $str1, $count, 1 ne substr $str2, $count, 1;
        $length_prefix++;
    }
    my $simj = simjaro $str1, $str2;
    return $simj + $length_prefix * $p_constant * (1 - $simj);
}

my @tests = < FOOB BARF   CRATE TRACE   CRATE CRATE   TRACE CRATE   
    CREATE TRACT   DWAYNE DUANE >;
for @tests -> $w1, $w2 {
    my $simw = simwinkler $w1, $w2;
    say "Jaro-Winkler distance between $w1 and $w2 is: ", (1 - $simw).fmt("\t%.3f");
}

This program displays essentially the same output as the P5 program:

perl6 jano_dist.p6
Jaro-Winkler distance between FOOB and BARF is:         1.000
Jaro-Winkler distance between CRATE and TRACE is:       0.267
Jaro-Winkler distance between CRATE and CRATE is:       0.000
Jaro-Winkler distance between TRACE and CRATE is:       0.267
Jaro-Winkler distance between CREATE and TRACT is:      0.300
Jaro-Winkler distance between DWAYNE and DUANE is:      0.178

Aside from the small syntax adjustments needed between P5 and P6, note the convenient possibility in P6 to iterate over several values of a list at each step:

for @tests -> $w1, $w2 {
    # ...
}

Wrapping up

There was a third challenge this week: Send email using SendGrid API. For more information, visit the official page. This challenge was proposed by Gabor Szabo. The API challenge is optional but would love to see your solution.

As mentioned in earlier blog posts, I know next to nothing about this kind of topic, so I won't undertake anything on that subject and even less blog about it. Please try this challenge and provide answers if you know more that I do on such topic.

The next week Perl Weekly Challenge is due to start very soon. If you're interested in participating in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 6 p.m. BST (British summer time) on next Sunday, June 9. And, please, also spread the word about the Perl Weekly Challenge if you can.

Leave a comment

About laurent_r

user-pic I am the author of the "Think Perl 6" book (O'Reilly, 2017) and I blog about Perl (5 and 6).