Perl Weekly Challenge 96: Reverse Words and Edit Distance (and Decorators in Perl)
These are some answers to the Week 96 of the Perl Weekly Challenge organized by Mohammad S. Anwar.
Spoiler Alert: This weekly challenge deadline is due in a couple of days (January 24, 2021). This blog post offers some solutions to this challenge, please don’t read on if you intend to complete the challenge on your own.
Task 1: Reverse Words
You are given a string $S
.
Write a script to reverse the order of words in the given string. The string may contain leading/trailing spaces. The string may have more than one space between words in the string. Print the result without leading/trailing spaces and there should be only one space between words.
Example 1:
Input: $S = "The Weekly Challenge"
Output: "Challenge Weekly The"
Example 2:
Input: $S = " Perl and Raku are part of the same family "
Output: "family same the of part are Raku and Perl"
Reverse Words in Raku
we simply chain the words, reverse and join method invocations:
use v6;
my $input = @*ARGS[0] // " Perl and Raku are part of the same family ";
say $input.words.reverse.join(" ");
Example output:
$ raku reverse-words.raku
family same the of part are Raku and Perl
~
$ raku reverse-words.raku "Don't ask what your country can do for you, ask what you can do for your country "
country your for do can you what ask you, for do can country your what ask Don't
Of course, this short script can easily be transformed into a Raku one-liner:
$ raku -e '@*ARGS[0].words.reverse.join(" ").say;' " Perl and Raku are part of the same family "
family same the of part are Raku and Perl
Reverse Words in Perl
In Perl, we use the same idea, just chaining function calls instead of method invocations:
use strict;
use warnings;
use feature "say";
my $input = shift // " Perl and Raku are part of the same family ";
say join " ", reverse split /\s+/, $input;
Output:
$ perl reverse-words.pl
family same the of part are Raku and Perl
Of course, we could also make it as a Perl one-liner:
$ perl -E 'say join " ", reverse split /\s+/, shift' " Perl and Raku are part of the same family "
family same the of part are Raku and Perl
Reverse Words in Scala
Using the same idea as in Raku, i.e. chaining the split
, reverse
, and mkString
method invocations:
object reverseWords extends App {
val in = " Perl and Raku are part of the same family "
println(in.split("\\s+").reverse.mkString(" "))
}
Output:
family same the of part are Raku and Perl
Reverse Words in Python
Whether using a function-call syntax (in Perl), or a method-invocation syntax (in Raku and Scala), our three programs above all use a functional programming approach chaining pure functions and using immutable data, more precisely a data flow or pipeline pattern. In this programming model, each piece of data is channeled through a series of successive transformations in which the returned data elements produced by one of the operations are fed to the next operation, and so on. This is possible because each operation of the pipeline is a “pure function” that takes an input and produces an output to be processed by the next operation.
This is not easily transposable in Python, because some operators acts are pure function as in the three other languages, and some others (such as reverse
in our case) are not pure functions and modify the data in-place instead of sending back the modified data as a return value. It would certainly be possible to work around the limitation in Python (using for example maps), but this is much less natural than in the three other languages, and it probably doesn’t make much sense to try to force the data flow model into Python. Therefore, our Python implementation will use intermediate temporary variables, as in traditional procedural programming.
def reverse_words(in_str):
words = in_str.split()
words.reverse()
return " ".join(words)
input = " Perl and Raku are part of the same family "
print(reverse_words(input))
Output:
$ python3 reverse-words.py
Perl and Raku are part of the same family
Task 2: Edit Distance
You are given two strings $S1
and $S2
.
Write a script to find out the minimum operations required to convert $S1
into $S2
. The operations can be insert, remove or replace a character. Please check out Wikipedia page for more information.
Example 1:
Input: $S1 = "kitten"; $S2 = "sitting"
Output: 3
Operation 1: replace 'k' with 's'
Operation 2: replace 'e' with 'i'
Operation 3: insert 'g' at the end
Example 2:
Input: $S1 = "sunday"; $S2 = "monday"
Output: 2
Operation 1: replace 's' with 'm'
Operation 2: replace 'u' with 'o'
In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations (usually single character edits) required to transform one string into the other. When the operations permettied are insertion, deletion, or substitution of a character, edit distance is usually called Levenshtein distance, named after the Soviet mathematician Vladimir Levenshtein.
The Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is given by lev (*a*, *b*)
where
where the tail of some string x is a string of all but the first character of x and *x*[*n*]
is the nth character of the string x, starting with character 0.
Note that, in the above formula, the first element in the minimum corresponds to deletion, the second to insertion and the third to replacement.
This definition can lead directly to a naïve recursive implementation. The problem, though, is that such naïve implementation would have an exponential time complexity and would unusable even for moderately large strings (especially if the strings are markedly different). As an example, the naïve (not optimized) version of the Raku recursive subroutine implementation presented below for the pseudo random strings “LMIjkHFSAE” and “dmqkdjfERZG” takes more than one minute:
$ time raku edit-distance.raku
11 LMIjkHFSAE - dmqkdjfERZG
-
real 1m15,592s
user 0m0,015s
sys 0m0,046s
The reason for that is that the recursive subroutine is called many times with the same input in the process. If we can cache (or memoize) the results to avoid having to recompute again and again the same result, then the time to compute the Levenshtein distance falls dramatically and becomes roughly proportional to the product of the two string lengths. This still makes the process quite inefficient for very long strings (such as DNA strands), but it’s usually OK for strings representing words in common human languages.
The recursive approach is a form of top-down dynamic programming, i.e. it breaks a large problem into smaller and smaller subproblems, until the subproblems can be solved. Other solutions use bottom-up dynamic programming, i.e. start from small elementary problems and expand them to larger problems; they often use a matrix or table and iterate over the matrix to expand it. The bottom-up approach also as a time complexity roughly proportional to the product of the two string lengths. So, while the two approaches may not be equivalent (the bottom-up approach is likely to be faster), they have the same time complexity, which means that they essentially scale up essentially the same way when the size of the strings increase. Here, we will use the recursive (top-down) approach.
Note that our implementations will work the other way around, from right to left. The only reason for this is that I had written about four years ago an edit-distance
program in Perl, and I found it easier to start from that implementation.
Edit Discance in Raku
We present two implementations of a cached implementation of the recursive solution.
Memoized Version (Raku)
First, we implement a “naïve” edit-distance
recursive function and use the Raku Memoize module (written by my friend Elizabeth Mattijsen as a port of Mark-Jason Dominus’s equivalent Perl module) to automatically perform the caching of already computed distances:
use v6;
use Memoize;
sub edit-distance (Str $left, Str $right) {
# If one of the substrings is empty, return the length of the other
return $right.chars unless $left;
return $left.chars unless $right;
my $shortened-left = substr $left, 0, *-1;
my $shortened-right = substr $right, 0, *-1;
# If the last chars are the same, we ignore them
# and call edit-distance on shortened strings
return edit-distance $shortened-left, $shortened-right
if substr($left, *-1) eq substr($right, *-1);
# Else find the minimum between the three operations
return 1 + min(
edit-distance($left, $shortened-right), #insert
edit-distance($shortened-left, $right), #remove
edit-distance($shortened-left, $shortened-right) #replace
);
}
memoize("edit-distance");
my @test-pairs = (
[ < kitten sitting >],
[ < Monday Friday > ],
[ < Sunday Saturday > ],
[ < January February > ],
[ < November December > ],
[ < constitutionally anticonstitutional > ],
[ < LMIjkHFSAE dmqkdjfERZG >],
);
for @test-pairs -> @test {
my ($str1, $str2) = @test;
print edit-distance($str1, $str2), " $str1 - $str2\n";
}
This program produces the following output:
$ time raku edit-distance.raku
3 kitten - sitting
3 Monday - Friday
3 Sunday - Saturday
4 January - February
3 November - December
6 constitutionally - anticonstitutional
11 LMIjkHFSAE - dmqkdjfERZG
-
real 0m1,452s
user 0m0,015s
sys 0m0,046s
Note how much faster this program is (1.45 seconds for 8 tests), compared to the non-optimized version with only the last test discussed above (about 1 min 15 sec for only one string pair).
Implementing A Cache Manually in Raku
I have often said that I eschew using off-the-shelf modules in the context of a programming challenge because I feel that it is sort of cheating. In the case of the above solution, I used the Memoize
module because it wasn’t really part of the Levenshtein distance algorithm, but only a performance optimization. In that case in point, however, that performance optimization is crucial (making the difference between a usable and a not usable implementation), so that I feel it is necessary to show a solution that implements the cache manually. This might be more useful for a beginner or a student wishing to understand how caching or memoizing works:
use v6;
my %cache;
sub edit-distance (Str $left, Str $right) {
sub cache-distance (Str $l, Str $r) {
%cache{"$l;$r"} = edit-distance($l, $r) unless %cache{"$l;$r"}:exists;
return %cache{"$l;$r"};
}
# If one of the substrings is empty, return the length of the other
return $right.chars unless $left;
return $left.chars unless $right;
my $shortened-left = substr $left, 0, *-1;
my $shortened-right = substr $right, 0, *-1;
# say " $shortened-left $shortened-right";
# If the last chars are the same, we ignore them
# and call edit-distance on shortened strings
if substr($left, *-1) eq substr($right, *-1) {
return cache-distance $shortened-left, $shortened-right;
}
# Else find the minimum between the three operations
return 1 + min(
cache-distance($left, $shortened-right), #insert
cache-distance($shortened-left, $right), #remove
cache-distance($shortened-left, $shortened-right) #replace
);
}
my @test-pairs = (
[ < kitten sitting >],
[ < Monday Friday > ],
[ < Sunday Saturday > ],
[ < January February > ],
[ < November December > ],
[ < constitutionally anticonstitutional > ],
[ < LMIjkHFSAE dmqkdjfERZG >],
);
for @test-pairs -> @test {
my ($str1, $str2) = @test;
print edit-distance($str1, $str2), " $str1 - $str2\n";
}
Note that I have implemented the cache management as a separate lexically-scoped subroutine, cache-distance
, because there were four recursive calls to edit-distance
in the body of the edit-distance
and I did not want to implement the cache management code four times.
This script displays the following output:
$ time raku edit-distance_cache.raku
3 kitten - sitting
3 Monday - Friday
3 Sunday - Saturday
4 January - February
3 November - December
6 constitutionally - anticonstitutional
11 LMIjkHFSAE - dmqkdjfERZG
real 0m0,398s
user 0m0,015s
sys 0m0,031s
We see another significant performance improvement, probably because it is more efficient to tailor a cache for a specific problem, compared to a general solution such as using the Memoize
module (and perhaps also because it takes some time to compile the module, not quite sure).
Edit Distance in Perl
Memoized Version (Perl)
First, we implement a “naïve” edit-distance
recursive function and use the core Perl Memoize
module (written by Mark-Jason Dominus) to automatically perform the caching of already computed distances:
use strict;
use warnings;
use feature 'say';
use Memoize;
sub min {
my $rv = shift;
for my $tmp (@_) {
$rv = $tmp if $tmp < $rv;
}
return $rv;
}
sub edit_distance {
my ($left, $right) = @_;
# If one of the substrings is empty, return the length of the other
return length $right unless $left;
return length $left unless $right;
my $shortened_left = substr $left, 0, -1;
my $shortened_right = substr $right, 0, -1;
# In the last chars are the same, we ignore them
# and call edit_distance on shortened strings
return edit_distance ($shortened_left, $shortened_right) if substr($left, -1) eq substr($right, -1);
# Else find the minimum between the three operations
return 1 + min(
edit_distance($left, $shortened_right), #insert
edit_distance($shortened_left, $right), #remove
edit_distance($shortened_left, $shortened_right) #replace
);
}
memoize("edit_distance");
my @test_pairs = (
[ qw<kitten sitting>],
[ qw<Monday Friday> ],
[ qw<Sunday Saturday> ],
[ qw<January February> ],
[ qw<November December > ],
[ qw<constitutionally anticonstitutional > ],
# [ qw<LMIjkHFSAE dmqkdjfERZG>],
);
for my $ar (@test_pairs) {
my ($str1, $str2) = @$ar;
say edit_distance($str1,$str2), " $str1 - $str2 ";
}
This program displays the following output and time measures:
$ time perl edit-distance.pl
3 kitten - sitting
3 Monday - Friday
3 Sunday - Saturday
4 January - February
3 November - December
6 constitutionally - anticonstitutional
11 LMIjkHFSAE - dmqkdjfERZG
real 0m0,103s
user 0m0,015s
sys 0m0,046s
Creating a Decorator in Perl
As I said before, I don’t like to use ready-made modules in programming challenges (which the reason why I also implemented myself the min
subroutine). We could manually implement some form of wrapper around the naïve edit_distance
subroutine, as we more or less did in Raku, to manage the cache. However, I thought it would be more fun to implement a decorator in order to modify the behavior of the edit_distance
subroutine.
Originally, a decorator was a OO-programming design pattern making it possible to assign new properties or responsibilities to an object, without modifying that object’s class. The idea has later been expanded to other programming models. For example, PEP 318 in Python introduced decorators as functions that are designed to change the behavior of other functions without modifying the code of these other functions. This may be useful, for example, to modify the behavior of a legacy or complex function that you don’t want to change, or to modify in some specific case the behavior of a module that you otherwise don’t want to change because other programs may be using that module. It is this extended meaning of decorators that we mean here. So, let’s assume we don’t want to modify the code of our edit_distance subroutine (perhaps it’s complicated and I spent so much time debugging it that I want to leave it alone). Rather than changing the code of that subroutine, we will decorate it to add to it the caching functionality.
Contrary to Python, Perl doesn’t have a specific syntax for decorators, but, as we will see, it is relatively easy to use higher-order functions and typeglobs to implement our own decorators.
First, we define a decorate
subroutine which takes as input parameter a reference to the subroutine to be cached and returns an anonymous subroutine that checks the cache and returns the value in the cache if it exists, and else calls the subref it has received as a parameter:
sub decorate {
my $coderef = shift; # the argument is a ref to edit_distance
my %cache;
return sub {
my ($l, $r) = @_;
$cache{"$l;$r"} = $coderef->(@_) unless exists $cache{"$l;$r"};
return $cache{"$l;$r"};
}
}
Note that we define the %cache
in the decorate
subroutine. The anonymous subroutine thus acts as a closure and keeps its access to the cache.
Then, we replace the original edit_distance
subroutine with the anonymous subroutine returned by decorate
in the main symbol table:
{
# local scope for the deactivation of the redefine warning
no warnings 'redefine';
# we replace the edit-distance subroutine by its
# decorated version in the main symbol table
*main::edit_distance = decorate(\&edit_distance);
}
Here, *main::edit_distance
is a typeglob representing the entry of the original edit_distance
subroutine in the symbol table. Before that, we deactivate the redefine
warning (to avoid a warning upon the subroutine definition, and we do that in a code block to limit the scope of the deactivation to this code line.
Now, whenever the code will call the edit_distance
subroutine, it is the anonymous subroutine produced by the decorate
subroutine that will be called instead. Thus, the edit_distance
subroutine seen by the rest of the program is now memoized (the edit distances are cached), although we did not change anything to the code defining it. That subroutine is called recursively four times in its own code, but we don’t need to charge all these subroutine calls.
The overall program now looks like this:
use strict;
use warnings;
use feature qw/say/;
sub min {
my $rv = shift;
for my $tmp (@_) {
$rv = $tmp if $tmp < $rv;
}
return $rv;
}
sub edit_distance {
my ($left, $right) = @_;
# If one of the substrings is empty, return the length of the other
return length $right unless $left;
return length $left unless $right;
my $shortened_left = substr $left, 0, -1;
my $shortened_right = substr $right, 0, -1;
# In the last chars are the same, we ignore them
# and call edit_distance on shortened strings
return edit_distance ($shortened_left, $shortened_right) if substr($left, -1) eq substr($right, -1);
# Else find the minimum between the three operations
return 1 + min(
edit_distance($left, $shortened_right), #insert
edit_distance($shortened_left, $right), #remove
edit_distance($shortened_left, $shortened_right) #replace
);
}
# The decorator returns the edit_distance subroutine wrapped in
# code lines performing the caching of values
sub decorate {
my $coderef = shift;
my %cache;
return sub {
my ($l, $r) = @_;
$cache{"$l;$r"} = $coderef->(@_) unless exists $cache{"$l;$r"};
return $cache{"$l;$r"};
}
}
{
# local scope for the deactivation of the redefine warning
no warnings 'redefine';
# we replace the edit-distance subrouytine by its
# decorated version in the main symbol table
*main::edit_distance = decorate(\&edit_distance);
}
my @test_pairs = (
[ qw<kitten sitting> ],
[ qw<Monday Friday> ],
[ qw<Sunday Saturday> ],
[ qw<January February> ],
[ qw<November December > ],
[ qw<constitutionally anticonstitutional > ],
[ qw<LMIjkHFSAE dmqkdjfERZG>],
);
for my $ar (@test_pairs) {
my ($str1, $str2) = @$ar;
say edit_distance($str1,$str2), " $str1 - $str2 ";
}
This script produces the following output and execution times:
$ time perl edit-distance_decorator.pl
3 kitten - sitting
3 Monday - Friday
3 Sunday - Saturday
4 January - February
3 November - December
6 constitutionally - anticonstitutional
11 LMIjkHFSAE - dmqkdjfERZG
real 0m0,064s
user 0m0,000s
sys 0m0,046s
Note that our manually decorated subroutine is slightly faster that the original memoized version.
Edit Distance in Python
I do not know whether there is something equivalent to the Memoize
module in Python, so I will manage the cache manually (in the cached_distance
subroutine):
cache = dict()
def cached_distance(left, right):
key = left + ';' + right
if key not in cache:
cache[key] = edit_distance(left, right)
return cache[key]
def edit_distance(left, right):
lr = len(right)
ll = len(left)
if not left: return lr
if not right: return ll
shortened_l = left[0:ll-1]
shortened_r = right[0:lr-1]
if left[ll-1] == right[lr-1]:
return cached_distance(shortened_l, shortened_r)
return 1 + min( cached_distance(left, shortened_r), # Insert
cached_distance(shortened_l, right), # Remove
cached_distance(shortened_l, shortened_r) # Replace
)
tests = ( [ "kitten", "sitting" ], [ "Monday", "Friday" ],
[ "Sunday", "Saturday" ], [ "January", "February" ],
[ "November", "December" ],
[ "constitutionally", "anticonstitutional" ],
)
for test in tests:
print (test[0], test[1], edit_distance(test[0], test[1] ))
This produces the following output:
$ time python3 edit-distance.py
kitten sitting 3
Monday Friday 3
Sunday Saturday 3
January February 4
November December 3
constitutionally anticonstitutional 6
real 0m0,114s
user 0m0,000s
sys 0m0,062s
Wrapping up
The next week Perl Weekly Challenge will start soon. If you want to participate in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 23:59 BST (British summer time) on Sunday, January 31, 2021. And, please, also spread the word about the Perl Weekly Challenge if you can.
Leave a comment