Perl Weekly Challenge 96: Reverse Words and Edit Distance (and Decorators in Perl)

These are some answers to the Week 96 of the Perl Weekly Challenge organized by Mohammad S. Anwar.

Spoiler Alert: This weekly challenge deadline is due in a couple of days (January 24, 2021). This blog post offers some solutions to this challenge, please don’t read on if you intend to complete the challenge on your own.

Task 1: Reverse Words

You are given a string $S.

Write a script to reverse the order of words in the given string. The string may contain leading/trailing spaces. The string may have more than one space between words in the string. Print the result without leading/trailing spaces and there should be only one space between words.

Example 1:

Input: $S = "The Weekly Challenge"
Output: "Challenge Weekly The"

Example 2:

Input: $S = "    Perl and   Raku are  part of the same family  "
Output: "family same the of part are Raku and Perl"

Reverse Words in Raku

we simply chain the words, reverse and join method invocations:

use v6;

my $input = @*ARGS[0] // "    Perl and   Raku are  part of the same family  ";
say $input.words.reverse.join(" ");

Example output:

$ raku reverse-words.raku
family same the of part are Raku and Perl
~
$ raku reverse-words.raku "Don't ask what your country can do for you, ask what you can do for   your country  "
country your for do can you what ask you, for do can country your what ask Don't

Of course, this short script can easily be transformed into a Raku one-liner:

$ raku -e '@*ARGS[0].words.reverse.join(" ").say;' "    Perl and   Raku are  part of the same family  "
family same the of part are Raku and Perl

Reverse Words in Perl

In Perl, we use the same idea, just chaining function calls instead of method invocations:

use strict;
use warnings;
use feature "say";

my $input = shift // "    Perl and   Raku are  part of the same family  ";
say join " ", reverse split /\s+/, $input;

Output:

$ perl reverse-words.pl
family same the of part are Raku and Perl

Of course, we could also make it as a Perl one-liner:

$ perl -E 'say join " ", reverse split /\s+/, shift' "    Perl and   Raku are  part of the same family  "
family same the of part are Raku and Perl

Reverse Words in Scala

Using the same idea as in Raku, i.e. chaining the split, reverse, and mkString method invocations:

object reverseWords extends App {
  val in = "    Perl and   Raku are  part of the same family  "
  println(in.split("\\s+").reverse.mkString(" "))
}

Output:

family same the of part are Raku and Perl

Reverse Words in Python

Whether using a function-call syntax (in Perl), or a method-invocation syntax (in Raku and Scala), our three programs above all use a functional programming approach chaining pure functions and using immutable data, more precisely a data flow or pipeline pattern. In this programming model, each piece of data is channeled through a series of successive transformations in which the returned data elements produced by one of the operations are fed to the next operation, and so on. This is possible because each operation of the pipeline is a “pure function” that takes an input and produces an output to be processed by the next operation.

This is not easily transposable in Python, because some operators acts are pure function as in the three other languages, and some others (such as reverse in our case) are not pure functions and modify the data in-place instead of sending back the modified data as a return value. It would certainly be possible to work around the limitation in Python (using for example maps), but this is much less natural than in the three other languages, and it probably doesn’t make much sense to try to force the data flow model into Python. Therefore, our Python implementation will use intermediate temporary variables, as in traditional procedural programming.

def reverse_words(in_str):
    words = in_str.split()
    words.reverse()
    return " ".join(words)

input = "    Perl and   Raku are  part of the same family  "
print(reverse_words(input))

Output:

$ python3 reverse-words.py
Perl and Raku are part of the same family

Task 2: Edit Distance

You are given two strings $S1 and $S2.

Write a script to find out the minimum operations required to convert $S1 into $S2. The operations can be insert, remove or replace a character. Please check out Wikipedia page for more information.

Example 1:

Input: $S1 = "kitten"; $S2 = "sitting"
Output: 3

Operation 1: replace 'k' with 's'
Operation 2: replace 'e' with 'i'
Operation 3: insert 'g' at the end

Example 2:

Input: $S1 = "sunday"; $S2 = "monday"
Output: 2

Operation 1: replace 's' with 'm'
Operation 2: replace 'u' with 'o'

In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations (usually single character edits) required to transform one string into the other. When the operations permettied are insertion, deletion, or substitution of a character, edit distance is usually called Levenshtein distance, named after the Soviet mathematician Vladimir Levenshtein.

The Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is given by lev ⁡ (*a*, *b*) where

levenstein_dist.jpg

where the tail of some string x is a string of all but the first character of x and *x*[*n*] is the nth character of the string x, starting with character 0.

Note that, in the above formula, the first element in the minimum corresponds to deletion, the second to insertion and the third to replacement.

This definition can lead directly to a naïve recursive implementation. The problem, though, is that such naïve implementation would have an exponential time complexity and would unusable even for moderately large strings (especially if the strings are markedly different). As an example, the naïve (not optimized) version of the Raku recursive subroutine implementation presented below for the pseudo random strings “LMIjkHFSAE” and “dmqkdjfERZG” takes more than one minute:

$ time raku edit-distance.raku
11 LMIjkHFSAE - dmqkdjfERZG
-
real    1m15,592s
user    0m0,015s
sys     0m0,046s

The reason for that is that the recursive subroutine is called many times with the same input in the process. If we can cache (or memoize) the results to avoid having to recompute again and again the same result, then the time to compute the Levenshtein distance falls dramatically and becomes roughly proportional to the product of the two string lengths. This still makes the process quite inefficient for very long strings (such as DNA strands), but it’s usually OK for strings representing words in common human languages.

The recursive approach is a form of top-down dynamic programming, i.e. it breaks a large problem into smaller and smaller subproblems, until the subproblems can be solved. Other solutions use bottom-up dynamic programming, i.e. start from small elementary problems and expand them to larger problems; they often use a matrix or table and iterate over the matrix to expand it. The bottom-up approach also as a time complexity roughly proportional to the product of the two string lengths. So, while the two approaches may not be equivalent (the bottom-up approach is likely to be faster), they have the same time complexity, which means that they essentially scale up essentially the same way when the size of the strings increase. Here, we will use the recursive (top-down) approach.

Note that our implementations will work the other way around, from right to left. The only reason for this is that I had written about four years ago an edit-distance program in Perl, and I found it easier to start from that implementation.

Edit Discance in Raku

We present two implementations of a cached implementation of the recursive solution.

Memoized Version (Raku)

First, we implement a “naïve” edit-distance recursive function and use the Raku Memoize module (written by my friend Elizabeth Mattijsen as a port of Mark-Jason Dominus’s equivalent Perl module) to automatically perform the caching of already computed distances:

use v6;
use Memoize;

sub edit-distance (Str $left, Str $right) {
    # If one of the substrings is empty, return the length of the other
    return $right.chars unless $left;
    return $left.chars unless $right;
    my $shortened-left  = substr $left,  0, *-1;
    my $shortened-right = substr $right, 0, *-1;

    # If the last chars are the same, we ignore them
    # and call edit-distance on shortened strings    
    return edit-distance $shortened-left, $shortened-right 
        if substr($left, *-1) eq substr($right, *-1);

    # Else find the minimum between the three operations
    return 1 + min(
        edit-distance($left,       $shortened-right), #insert
        edit-distance($shortened-left,  $right),      #remove
        edit-distance($shortened-left,  $shortened-right)  #replace
    );
}    

memoize("edit-distance");

my @test-pairs = (
    [ < kitten sitting >],
    [ < Monday Friday > ],
    [ < Sunday Saturday > ],
    [ < January February > ],
    [ < November December > ],
    [ < constitutionally anticonstitutional > ],
    [ < LMIjkHFSAE dmqkdjfERZG >],
);
for @test-pairs -> @test {
    my ($str1, $str2) = @test;
    print edit-distance($str1, $str2), " $str1 - $str2\n";
}

This program produces the following output:

$ time raku edit-distance.raku
3 kitten - sitting
3 Monday - Friday
3 Sunday - Saturday
4 January - February
3 November - December
6 constitutionally - anticonstitutional
11 LMIjkHFSAE - dmqkdjfERZG
-
real    0m1,452s
user    0m0,015s
sys     0m0,046s

Note how much faster this program is (1.45 seconds for 8 tests), compared to the non-optimized version with only the last test discussed above (about 1 min 15 sec for only one string pair).

Implementing A Cache Manually in Raku

I have often said that I eschew using off-the-shelf modules in the context of a programming challenge because I feel that it is sort of cheating. In the case of the above solution, I used the Memoize module because it wasn’t really part of the Levenshtein distance algorithm, but only a performance optimization. In that case in point, however, that performance optimization is crucial (making the difference between a usable and a not usable implementation), so that I feel it is necessary to show a solution that implements the cache manually. This might be more useful for a beginner or a student wishing to understand how caching or memoizing works:

use v6;

my %cache;

sub edit-distance (Str $left, Str $right) {
    sub cache-distance (Str $l, Str $r) {
        %cache{"$l;$r"} = edit-distance($l, $r) unless %cache{"$l;$r"}:exists;
        return %cache{"$l;$r"};
    }

    # If one of the substrings is empty, return the length of the other
    return $right.chars unless $left;
    return $left.chars unless $right;
    my $shortened-left  = substr $left,  0, *-1;
    my $shortened-right = substr $right, 0, *-1;
    # say " $shortened-left  $shortened-right";

    # If the last chars are the same, we ignore them
    # and call edit-distance on shortened strings
    if substr($left, *-1) eq substr($right, *-1) { 
        return cache-distance $shortened-left, $shortened-right;
    }

    # Else find the minimum between the three operations
    return 1 + min(
        cache-distance($left,       $shortened-right), #insert
        cache-distance($shortened-left,  $right),      #remove
        cache-distance($shortened-left, $shortened-right)  #replace
    );
}    

my @test-pairs = (
    [ < kitten sitting >],
    [ < Monday Friday > ],
    [ < Sunday Saturday > ],
    [ < January February > ],
    [ < November December > ],
    [ < constitutionally anticonstitutional > ],
    [ < LMIjkHFSAE dmqkdjfERZG >],
);
for @test-pairs -> @test {
    my ($str1, $str2) = @test;
    print edit-distance($str1, $str2), " $str1 - $str2\n";
}

Note that I have implemented the cache management as a separate lexically-scoped subroutine, cache-distance, because there were four recursive calls to edit-distance in the body of the edit-distance and I did not want to implement the cache management code four times.

This script displays the following output:

$ time raku edit-distance_cache.raku
3 kitten - sitting
3 Monday - Friday
3 Sunday - Saturday
4 January - February
3 November - December
6 constitutionally - anticonstitutional
11 LMIjkHFSAE - dmqkdjfERZG

real    0m0,398s
user    0m0,015s
sys     0m0,031s

We see another significant performance improvement, probably because it is more efficient to tailor a cache for a specific problem, compared to a general solution such as using the Memoize module (and perhaps also because it takes some time to compile the module, not quite sure).

Edit Distance in Perl

Memoized Version (Perl)

First, we implement a “naïve” edit-distance recursive function and use the core Perl Memoize module (written by Mark-Jason Dominus) to automatically perform the caching of already computed distances:

use strict;
use warnings;
use feature 'say';
use Memoize;

sub min {
    my $rv = shift;
    for my $tmp (@_) {
        $rv = $tmp if $tmp < $rv;
    }
    return $rv;
}

sub edit_distance {
    my ($left, $right) = @_;

    # If one of the substrings is empty, return the length of the other
    return length $right unless $left;
    return length $left  unless $right;

    my $shortened_left  = substr $left,  0, -1;
    my $shortened_right = substr $right, 0, -1;
    # In the last chars are the same, we ignore them
    # and call edit_distance on shortened strings
    return edit_distance ($shortened_left, $shortened_right) if substr($left, -1) eq substr($right, -1);

    # Else find the minimum between the three operations
    return 1 + min(
        edit_distance($left, $shortened_right), #insert
        edit_distance($shortened_left, $right), #remove
        edit_distance($shortened_left, $shortened_right) #replace
    );
}

memoize("edit_distance");

my @test_pairs = (
    [ qw<kitten sitting>],
    [ qw<Monday Friday> ],
    [ qw<Sunday Saturday> ],
    [ qw<January February> ],
    [ qw<November December > ],
    [ qw<constitutionally anticonstitutional > ],
    # [ qw<LMIjkHFSAE dmqkdjfERZG>],
);
for my $ar (@test_pairs) {
    my ($str1, $str2) = @$ar;
    say edit_distance($str1,$str2), " $str1 - $str2 ";
}

This program displays the following output and time measures:

$ time perl edit-distance.pl
3 kitten - sitting
3 Monday - Friday
3 Sunday - Saturday
4 January - February
3 November - December
6 constitutionally - anticonstitutional
11 LMIjkHFSAE - dmqkdjfERZG

real    0m0,103s
user    0m0,015s
sys     0m0,046s

Creating a Decorator in Perl

As I said before, I don’t like to use ready-made modules in programming challenges (which the reason why I also implemented myself the min subroutine). We could manually implement some form of wrapper around the naïve edit_distance subroutine, as we more or less did in Raku, to manage the cache. However, I thought it would be more fun to implement a decorator in order to modify the behavior of the edit_distance subroutine.

Originally, a decorator was a OO-programming design pattern making it possible to assign new properties or responsibilities to an object, without modifying that object’s class. The idea has later been expanded to other programming models. For example, PEP 318 in Python introduced decorators as functions that are designed to change the behavior of other functions without modifying the code of these other functions. This may be useful, for example, to modify the behavior of a legacy or complex function that you don’t want to change, or to modify in some specific case the behavior of a module that you otherwise don’t want to change because other programs may be using that module. It is this extended meaning of decorators that we mean here. So, let’s assume we don’t want to modify the code of our edit_distance subroutine (perhaps it’s complicated and I spent so much time debugging it that I want to leave it alone). Rather than changing the code of that subroutine, we will decorate it to add to it the caching functionality.

Contrary to Python, Perl doesn’t have a specific syntax for decorators, but, as we will see, it is relatively easy to use higher-order functions and typeglobs to implement our own decorators.

First, we define a decorate subroutine which takes as input parameter a reference to the subroutine to be cached and returns an anonymous subroutine that checks the cache and returns the value in the cache if it exists, and else calls the subref it has received as a parameter:

sub decorate {
    my $coderef = shift;  # the argument is a ref to edit_distance
    my %cache;
    return sub {
        my ($l, $r) = @_;
        $cache{"$l;$r"} = $coderef->(@_) unless exists $cache{"$l;$r"};
        return $cache{"$l;$r"};
    }
}

Note that we define the %cache in the decorate subroutine. The anonymous subroutine thus acts as a closure and keeps its access to the cache.

Then, we replace the original edit_distance subroutine with the anonymous subroutine returned by decorate in the main symbol table:

{
    # local scope for the deactivation of the redefine warning
    no warnings 'redefine';
    # we replace the edit-distance subroutine by its 
    # decorated version in the main symbol table
    *main::edit_distance = decorate(\&edit_distance);
}

Here, *main::edit_distance is a typeglob representing the entry of the original edit_distance subroutine in the symbol table. Before that, we deactivate the redefine warning (to avoid a warning upon the subroutine definition, and we do that in a code block to limit the scope of the deactivation to this code line.

Now, whenever the code will call the edit_distance subroutine, it is the anonymous subroutine produced by the decorate subroutine that will be called instead. Thus, the edit_distance subroutine seen by the rest of the program is now memoized (the edit distances are cached), although we did not change anything to the code defining it. That subroutine is called recursively four times in its own code, but we don’t need to charge all these subroutine calls.

The overall program now looks like this:

use strict;
use warnings;
use feature qw/say/;

sub min {
    my $rv = shift;
    for my $tmp (@_) {
        $rv = $tmp if $tmp < $rv;
    }
    return $rv;
}

sub edit_distance {
    my ($left, $right) = @_;

    # If one of the substrings is empty, return the length of the other
    return length $right unless $left;
    return length $left  unless $right;

    my $shortened_left  = substr $left,  0, -1;
    my $shortened_right = substr $right, 0, -1;
    # In the last chars are the same, we ignore them
    # and call edit_distance on shortened strings
    return edit_distance ($shortened_left, $shortened_right) if substr($left, -1) eq substr($right, -1);

    # Else find the minimum between the three operations
    return 1 + min(
        edit_distance($left, $shortened_right), #insert
        edit_distance($shortened_left, $right), #remove
        edit_distance($shortened_left, $shortened_right) #replace
    );
}

# The decorator returns the edit_distance subroutine wrapped in 
# code lines performing the caching of values
sub decorate {
    my $coderef = shift;
    my %cache;
    return sub {
        my ($l, $r) = @_;
        $cache{"$l;$r"} = $coderef->(@_) unless exists $cache{"$l;$r"};
        return $cache{"$l;$r"};
    }
}

{
    # local scope for the deactivation of the redefine warning
    no warnings 'redefine';
    # we replace the edit-distance subrouytine by its 
    # decorated version in the main symbol table
    *main::edit_distance = decorate(\&edit_distance);
}

my @test_pairs = (
    [ qw<kitten sitting> ],
    [ qw<Monday Friday> ],
    [ qw<Sunday Saturday> ],
    [ qw<January February> ],
    [ qw<November December > ],
    [ qw<constitutionally anticonstitutional > ],
    [ qw<LMIjkHFSAE dmqkdjfERZG>],
);
for my $ar (@test_pairs) {
    my ($str1, $str2) = @$ar;
    say edit_distance($str1,$str2), " $str1 - $str2 ";
}

This script produces the following output and execution times:

$ time perl  edit-distance_decorator.pl
3 kitten - sitting
3 Monday - Friday
3 Sunday - Saturday
4 January - February
3 November - December
6 constitutionally - anticonstitutional
11 LMIjkHFSAE - dmqkdjfERZG

real    0m0,064s
user    0m0,000s
sys     0m0,046s

Note that our manually decorated subroutine is slightly faster that the original memoized version.

Edit Distance in Python

I do not know whether there is something equivalent to the Memoize module in Python, so I will manage the cache manually (in the cached_distance subroutine):

cache = dict()

def cached_distance(left, right):
    key = left + ';' + right
    if key not in cache:
        cache[key] = edit_distance(left, right)  
    return cache[key]

def edit_distance(left, right):
    lr = len(right)
    ll = len(left)
    if not left: return lr
    if not right: return ll
    shortened_l = left[0:ll-1]
    shortened_r = right[0:lr-1]
    if left[ll-1] == right[lr-1]:
        return cached_distance(shortened_l, shortened_r)

    return 1 + min( cached_distance(left, shortened_r),     # Insert 
                    cached_distance(shortened_l, right),    # Remove 
                    cached_distance(shortened_l, shortened_r)  # Replace 
                  )     

tests = ( [ "kitten", "sitting" ], [ "Monday", "Friday" ], 
          [ "Sunday", "Saturday" ], [ "January", "February" ],
          [ "November", "December" ],
          [ "constitutionally", "anticonstitutional" ],
        )

for test in tests:
    print (test[0], test[1], edit_distance(test[0], test[1]  ))

This produces the following output:

$ time python3 edit-distance.py
kitten sitting 3
Monday Friday 3
Sunday Saturday 3
January February 4
November December 3
constitutionally anticonstitutional 6

real    0m0,114s
user    0m0,000s
sys     0m0,062s

Wrapping up

The next week Perl Weekly Challenge will start soon. If you want to participate in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 23:59 BST (British summer time) on Sunday, January 31, 2021. And, please, also spread the word about the Perl Weekly Challenge if you can.

Leave a comment

About laurent_r

user-pic I am the author of the "Think Perl 6" book (O'Reilly, 2017) and I blog about the Perl 5 and Raku programming languages.