Perl Weekly Challenge # 18: Longest Common Substrings, Priority Queues, and a Functional Object System
These are some answers to the Week 18 of the Perl Weekly Challenge organized by Mohammad S. Anwar.
Spoiler Alert: This weekly challenge deadline is due in several days from now (July 28, 2019). This blog post offers some solutions to this challenge, please don't read on if you intend to complete the challenge on your own.
Challenge # 1: Longest Common Substrings
Write a script that takes 2 or more strings as command line parameters and print the longest common substring. For example, the longest common substring of the strings “ABABC”, “BABCA” and “ABCBA” is string “ABC” of length 3. Other common substrings are “A”, “AB”, “B”, “BA”, “BC” and “C”. Please check this wiki page for details.
I can see at least two ways to tackle the problem (to simplify, let's say between two strings). One is to have two nested loops, one on the letters of the first string and the second one on the letters of the second string, and to store the substrings (or, possibly, the longest so far). The other is to generate all the substrings of each word and then to compare them. I'll use the first approach in Perl 5 and the second one in Perl 6 (because P6 has some functionalities making the second approach easy and interesting, and probably quite efficient).
Note that the programs below will consider only extended ASCII strings for simplicity. A couple of very minor changes might be needed for dealing properly with full Unicode strings.
Longest Common Substring in Perl 5
We loop over the input strings and, for each pair of input strings, use the compare2str
subroutine to find the common substrings. The %common
hash is used to store the common substrings (we use a hash because it automatically removes duplicate substrings (which we don't need) and it is easier and faster to filter the result in order to keep only substrings that are common to all input strings.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
sub compare2str {
my ($str1, $str2) = @_;
my @st1 = split //, $str1;
my @st2 = split //, $str2;
my %result;
my $common = '';
my ($i, $j) = (0, 0);
while ($i <= $#st1) {
while ($j <= $#st2) {
if ($st1[$i] eq $st2[$j]) {
$common .= $st1[$i];
$result{$common} = 1;
my ($k, $l) = ($i, $j);
while (1) {
$k++; $l++;
if ($k <= $#st1 and $l<= $#st2
and $st1[$k] eq $st2[$l]) {
$common .= $st1[$k];
$result{$common} = 1;;
} else {
$common = '';
last;
}
}
}
$j++;
}
$j = 0;
$i++;
}
return keys %result;
}
die "Must supply at least two strings\n" unless @ARGV >= 2;
my %common = map { $_ => 1 } compare2str shift, $ARGV[0];
while (@ARGV > 1) {
%common = map { $_ => 1 } grep $common{$_},
compare2str shift, $ARGV[0];
}
my $max = "";
for (keys %common) {
$max = $_ if length $_ > length $max;
}
say "Largest common substring: $max";
This works as expected. For example:
$ perl substrings.pl ABABCTO BABCTO ABCTBA
Largest common substring: ABCT
However, it turns out the code is quite complicated for a relatively easy task.
Of course, we could use a CPAN module such as String::LCSS, but that would be cheating: we want to implement the algorithm ourselves. It is certainly quite possible to improve the code above here and there in order to make it more concise and possibly clearer, but I will not do that because I want to test a different approach. I'll come back to that after we've seen our Perl 6 solution.
Longest Common Substring in Perl 6
As mentioned earlier, our approach for the Perl 6 version is to generate all the unique substrings of each input string and then only keep the substrings that are common to all strings. It is then easy to find the longest substring.
To generate all the substrings of a given string, we could probably use the regex engine with the :exhaustive
adverb, to get all the overlapping matches, but we'll rather use the rotor
built-in subroutine, which isn't mentioned very often although it is very powerful and expressive.
These are two examples using rotor
under the REPL:
> 'abcd'.comb.rotor(1);
((a) (b) (c) (d))
> 'abcd'.comb.rotor(2);
((a b) (c d))
In these examples, rotor
groups the elements of the invocant into groups of 1 and 2 elements. We're a long way from generating all the substrings of a given string. But we can do better:
> say 'abcd'.comb.rotor($_) for 1..4;;
((a) (b) (c) (d))
((a b) (c d))
((a b c))
((a b c d))
This is already much better, but we're still missing some of the desired substrings such as bc
and bcd
.
The rotor
method can take as parameter a key-value pair, whose value (the second item) specifies a gap between the various matches:
> (1..10).rotor(2 => 1)
((1 2) (4 5) (7 8))
As you can see, we obtain pairs of values, with a gap of 1 between the pairs (item 3, 6 and 9 are omitted from the list. Now, the gap can also be negative and, in that case, we get all successive pairs from the range:
> (1..10).rotor(2 => -1)
((1 2) (2 3) (3 4) (4 5) (5 6) (6 7) (7 8) (8 9) (9 10))
The rotor
subroutine can in fact do much more than that (check the rotor documentation), but I've basically shown the features that we'll use here.
The other Perl 6 functionality that we will use here is the the Set
type and the associated intersection (∩
or (&)
) operator. This operator does exactly what set intersection do in the mathematical set theory: it returns the elements that are common to the two sets.
We can now code the largest common substring in Perl 6:
use v6;
use Test;
sub substrings (Str $in) {
my @result = $in.comb;
append @result, map { .join('') }, $in.comb.rotor: $_ => 1-$_ for 2..$in.chars;
return set @result;
}
sub largest-substring (@words) {
my Set $intersection = substrings shift @words;
while (my $word = shift @words) {
$intersection ∩= substrings $word;
}
return $intersection.keys.max({.chars});
}
multi MAIN (*@words where *.elems > 1) {
say largest-substring @words;
}
multi MAIN () {
plan 2;
my @words = <ABABC BABCA ABCBA>;
cmp-ok largest-substring(@words), 'eq', 'ABC', "Testing 3 strings";
@words = 'abcde' xx 5;
cmp-ok largest-substring(@words), 'eq', 'abcde', "Testing identical strings";
}
Running the program with no argument to run the tests produces this:
$ perl6 substrings.p6
1..2
ok 1 - Testing 3 strings
ok 2 - Testing identical strings
And with the same parameters as our P5 implementation, we get the same result:
perl6 substrings.p6 ABABCTO BABCTO ABCTBA
ABCT
Longest Common Substring in Perl 5 - Reloaded
The P6 script doing the bulk of the actual work is only 12 lines of code, whereas the equivalent in our initial Perl 5 implementation is 31 lines. Even though the P6 version is using features that don't exist in P5, perhaps we can mimic the same approach:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
sub substrings {
my @chars = split //, shift;
my %substr; # using a hash to remove duplicates
for my $i (0..$#chars) {
for my $j ($i..$#chars) {
$substr{ join '', @chars[$i..$j] } = 1;
}
}
return keys %substr;
}
my %result = map { $_ => 1} substrings shift;
for my $word (@ARGV) {
%result = map {$_ => 1} grep $result{$_}, substrings $word;
}
my $max = 0;
for (keys %result) {
$max = $_ if length $_ > length $max;
}
say "Largest common substring: $max";
This prints out:
$ perl substrings.pl ABABCTO BABCTO ABCTBA
Largest common substring: ABCT
Now, the bulk of the actual work is done in 14 lines of code, less than half the original P5 implementation, and barely more that the P6 implementation. It is quite interesting: the features of P6 gave some ideas on how to solve the problem better in Perl 5.
Priority Queues
Write a script to implement Priority Queue. It is like regular queue except each element has a priority associated with it. In a priority queue, an element with high priority is served before an element with low priority. Please check this wiki page for more informations. It should serve the following operations:
1) is_empty: check whether the queue has no elements.
2) insertwithpriority: add an element to the queue with an associated priority.
3) pullhighestpriority_element: remove the element from the queue that has the highest priority, and return it. If two elements have the same priority, then return element added first.
There are numerous ways to design simple priority queues (at least when performance is not an issue, for instance is the data set isn't very large). For example, it might be sufficient to maintain an array of arrays (AoA), where each of the arrays is a pair containing the item and associated priority. Or an array of hashes (AoH) based on the same idea. This means that each time we want to pull the highest priority element, we need to traverse the whole data structure to find the item with the highest priority. This may be quite slow when there are many items, but this may not matter if our data structure only has a few dozen items.
Another way is to build a hash of arrays (HoA), where the hash keys are the priority and the hash values references to arrays. When the number of priorities is relatively small (compared to the number of items in the queues), this tends to be more efficient, but note that we still need to traverse the keys of the hash until we find the highest priority. An AoA with the index being the priority and the sub-hashes the item might be more efficient (because the priorities remain sorted), but this requires the priorities to be relatively small positive integers. We still have to traverse the top data structure until we find the first non-empty sub-array. We will use this data structure for our Perl 5 implementation.
Another possibility is to use a heap, a data structure that usually has better performance (when it matters). We'll come back to that later.
Priority Queues in Perl 5 (Objects in Functional Programming)
A typical way of implementing queues with their various attached methods would obviously be with object-oriented programming (OOP), using either the standard Perl 5 object system, or any of the more modern variants such as Moose, Moo, Mouse, Mo, or whatever.
But since I'm not a typical person (and since, as those who have read some of my previous posts know by now, I like functional programming), I'll do it in an atypical way and build my objects with functional programming. As you'll see, it's quite simple. Defining my own light-weight object systems and my simple "queue class" takes just one subroutine with only 21 lines of code:
sub new_queue {
my @queue; # an AoA
my $is_empty = sub {
for my $item (@queue) {
next unless defined $item;
return 0 if @$item > 0;
}
return 1;
};
my $insert_with_prio = sub {
my ($item, $prio) = @_;
push @{$queue[$prio]}, $item;
};
my $pull_highest_prio = sub {
for my $item (reverse @queue) {
next unless defined $item;
return shift @$item if @$item > 0;
}
};
return $is_empty, $insert_with_prio, $pull_highest_prio;
}
The new_queue
subroutine creates an empty @queue
object and returns three coderefs ($is_empty
, $insert_with_prio
, and $pull_highest_prio
). The new_queue
subroutine can be considered as a function factory which manufactures three coderefs for the benefit of the caller. Each of the three coderefs is an anonymous function that closes over the @queue
AoA (in other words, they are closures).
The code that will call the new_queue
"constructor" will only receive in return the three coderefs (let's call them "methods"). It will have no access whatsoever to the @queue
object other than using these methods, so that's pretty good encapsulation.
Let's now look at the calling code:
my ($is_empty, $insert, $pull_prio) = new_queue;
for my $num (1..20) { # inserting 20 items into the queue
$insert->($num,
$num % 10 == 0 ? 10 :
$num % 5 == 0 ? 5 :
$num % 3 == 0 ? 3 :
$num % 2 == 0 ? 2 :
1);
}
for my $num (1..20) {
say $pull_prio->();
}
say "Empty queue" if $is_empty->();
The first code line is a call to the object constructor and the code retrieves the three coderefs (the methods).
The rest of the code above is just for testing purpose. We will insert into our queue 20 numbers (between 1 and 20) and assign them priorities as follows:
Priority 10 for numbers evenly divisible by 10;
Priority 5 for numbers evenly divisible by 5 (but not 10 and 20);
Priority 3 for numbers evenly divisible by 3 (but not 15);
Priority 2 for other even numbers;
Priority 1 for other numbers.
This is the structure of the @queue
after insertion of the 20 items with their priority:
0 undef
1 ARRAY(0x6003018a0)
0 1
1 7
2 11
3 13
4 17
5 19
2 ARRAY(0x600301978)
0 2
1 4
2 8
3 14
4 16
3 ARRAY(0x6007fdf90)
0 3
1 6
2 9
3 12
4 18
4 undef
5 ARRAY(0x600842a88)
0 5
1 15
6 undef
7 undef
8 undef
9 undef
10 ARRAY(0x600842bf0)
0 10
1 20
So, when running this code, we should get the items in the following order: 10, 20, 5, 15, numbers evenly divided by 3 (except 15) in ascending order, etc. And we should finally get a "Empty queue" message if we have exhausted the queue. Let's see:
$ perl queues.pl
10
20
5
15
3
6
9
12
18
2
4
8
14
16
1
7
11
13
17
19
Empty queue
Looks good, doesn't it?
Priority Queues in Perl 6
For the Perl 6 implementation, we want use a binary heap to store the data. But this requires quite a bit of explanations, and this blog post is going to get too long.
So, I'll stop here for now and prepare another blog post for that part in the next couple of days. Stay tuned if you want to know about priorities queues using binary heaps.
Update: This is the new post on heaps and priority queues in Perl 6 (and also in Perl 5)
Wrapping up
The next week Perl Weekly Challenge is due to start soon. If you want to participate in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 23:59 BST (British summer time) on Sunday, August 4. And, please, also spread the word about the Perl Weekly Challenge if you can.
And see you hopefully at the European Perl Conference in Riga, Aug. 7 to 9, 2019.
I simply loved the solutions. My favorite this time is the Perl5 solution mimicking Perl6 feature. I amazed how much command you have over the language. Also I liked the Perl5 solution for Priority Queue, specially new_queue(). Thanks for great explanation as well, no book can teach this.