Perl Weekly Challenge # 5: Anagrams

These are some answers to the Week 5 of the Perl Weekly Challenge organized by Mohammad S. Anwar.

Challenge #1: Anagrams of a Word

Write a program which prints out all anagrams for a given word. For more information about Anagram, please check this Wikipedia page.

I'm not sure I fully understand the question, but my understanding is that we are looking for existing words which are anagrams of each other. For example, "pots", "spot", "tops", and "post" are anagrams of each other because they have exactly the same letters rearranged in a different order.

Just as for last week second challenge, I will use a words.txt file containing 113,809 lower-case English words usually accepted for crossword puzzles and other word games. The words.txt file can be found on my Github repository. The original list was contributed to the public domain by Internet activist Grady Ward in the context of the Moby Project. This word list is also mirrored at Project Gutenberg.

For the purpose of testing the programs below, the words.txt file is located in my current directory.

Perl 5 Solution: the is_anagram Subroutine

We first create an is_anagram subroutine that takes two words as parameters and return a true value if the words are anagrams of each other and false otherwise.

use strict;
use warnings;
use feature 'say';
sub is_anagram {
    my ($word1, $word2) = @_;
    return 0 if length $word1 != length $word2;
    my $letters1 = join "", sort split "", $word1;
    my $letters2 = join "", sort split "", $word2;
    return $letters1 eq $letters2
}
for ( [qw / ban bane/], [qw /post post/], [qw / post spot /], [qw /post spot/], [qw / pots spot/], [qw /pots taps/] ) {
    say "$$_[0] $$_[1]:\t", is_anagram($$_[0], $$_[1]) ? "True" : "False";
}

The is-anagram subroutine first returns a false value if the input words don't have the same length. Words of different lengths cannot be anagrams and it is believed that checking the length of the input words is significantly faster than the subsequent processing of their individual letters. Therefore, it should speed up processing when we will use this subroutine repeatedly for a large number of word combinations. When words have the same length, then we just split the words into individual letters, sort the letters, and produce new "normalized" or "canonical form" strings with the input letters; if the new normalized strings are equal, then the words were anagrams of each other.

This will display the following output:

ban bane:       False
post post:      True
post spot:      True
post spot:      True
pots spot:      True
pots taps:      False

Of course, we consider here "post" to be an anagram of itself. We could make a different special rule when the words are the same, if needed.

Finding All Anagrams of a Word From a Word List

Another possible interpretation of the challenge is that we are given a word and a word list, and should output the words that are anagrams of the input word. We can use the same is_anagram subroutine as follows:

sub is_anagram { ... }
my $word = "post";
for (qw /past post spot tops taps pots top/) {
    say "$word $_\t", is_anagram($word, $_) ? "True" : "False";
}

We're now getting the following output:

post past       False
post post       True
post spot       True
post tops       True
post taps       False
post pots       True
post top        False

Let's see now whether we can find more anagrams in our word list (still using the same is_anagram subroutine as before):

sub is_anagram { ... }
my $word = "post";
while (<>) {
    chomp;
    say if is_anagram $word, $_;
}

Here, the words.txt file is passed as a parameter to the script.

And, yes, we find two more anagrams that we had not thought about before in our word list ("opts" and "stop"):

opts
post
pots
spot
stop
tops

Perl 6 Solution: the is-anagram Subroutine

Normalizing the Words

Even though it might end up being slightly less useful than in P5, let's start with a similar is-anagram subroutine:

sub is-anagram (Str $word1, Str $word2) {
    return False if $word1.chars != $word2.chars;
    return $word1.comb.sort eq $word2.comb.sort;
}
for <ban bane post stop pots stop post pots pots taps> -> $w1, $w2 {
    say "$w1 $w2:\t", is-anagram $w1, $w2;
}

This displays the following output:

$ perl6 is-anagram.p6
ban bane:       False
post stop:      True
pots stop:      True
post pots:      True
pots taps:      False

Note that this works correctly because the eq operator in the is-anagram subroutine coerces its arguments into strings. Also note that the code is significantly shorter and simpler in Perl 6; I can understand that some people don't like the changes introduced in P6 compared to P5 (in some cases, I initially had the same reaction), but this illustrates that there are good reasons for these syntax changes.

Using a Bag

Another approach is to use one of a variety of Perl 6 data structure types called Set, Bag and Mix. They are immutable ordered collections of unique and weighed items.

You might construct a Set as follows:

> my $s = set <banana apple orange orange banana pear apple>;
set(apple banana orange pear)

As you can see, duplicates have been removed. Sets only tell us whether at least one item of a given name has been encountered.

A bag, by contrast, also keeps track of how many of each item have been seen:

> my $b = bag <banana apple orange orange banana pear apple orange>;
bag(banana(2), orange(3), pear, apple(2))
> say $b{'banana'}
2

Mixes are similar to bags, except that the elements' weights don't have to be integers.

One interesting thing about these collections is that they can use many set operators commonly used in mathematics, such as the (elem) set membership operator or (<) subset operator (you can also use the Unicode math set theory symbols in the P6 code, but I'm not sure how they would render on this blog post):

> say "Found it!" if 'apple' (elem) $s; 
Found it!
> say "It is a subset" if qw/orange banana/ (<) $s
It is a subset
> say "Found several oranges" if  qw/orange orange/ (<) $b
Found several oranges

We can now try the following alternate subroutine using bags:

sub is-anagram (Str $word1, Str $word2) {
    return $word1.comb.Bag === $word2.comb.Bag;
}
for <ban bane post stop pots stop post pots pots taps> -> $w1, $w2 {
    say "$w1 $w2:\t", is-anagram $w1, $w2;
}

The === value identity operator used between two bags returns True it the bags are identical. This displays the same output as before.

Creating an Anagram Operator

Just a bit of fun: rather than creating an is-anagram subroutine, we could create the ana operator:

sub infix:<ana> (Str $word1, Str $word2) {
    return $word1.comb.Bag === $word2.comb.Bag;
}
for <ban bane post stop pots stop post pots pots taps> -> $w1, $w2 {
    say "$w1 $w2:\t", $w1 ana $w2;
}

This prints the same result. Note, however that this feels significantly slower. My guess is that it is mainly compilation time. Adding the following line at the end of the previous script:

say now - INIT now;

shows that the run time is less than 0.02 seconds:

perl6 is-anagram.p6
ban bane:       False
post stop:      True
pots stop:      True
post pots:      True
pots taps:      False
0.0156261

I might undertake a serious benchmark one day, but it really seems that the perceived response time really has to do with compile time.

Just as with P5, the possible understanding of the challenge is that we are given a word and a word list, and should output the words that are anagrams of the input word. We can use the same is-anagram subroutine as follows and use directly the words.txt file:

my $post-bag = "post".comb.Bag;
sub is-anagram (Str $word) {
    return $word.comb.Bag === $post-bag;
}
for "words.txt".IO.lines -> $line {
    say $line if is-anagram $line; 
}

This program displays the same anagrams as our equivalent P5 program:

perl6 is-anagram.p6
opts
post
pots
spot
stop
tops

I must admit that, at this point, the P6 version is running markedly slower than the P5 version. Having said that, I should add that I haven't upgraded my Rakudo version for more than a year, maybe it is better now. More on this later.

Challenge #2: Highest Number of Anagrams

Write a program to find the sequence of characters that has the most anagrams.

Again, I'm not completely sure about the requirement, but I'll assume that the point is to go through all items of a word list and find those that have the highest number of anagrams.

Perl 5 Solution

The one thing that we don't want to do here is to compare each word with each other, for example in two nested loops, as this would take ages with a large input list (we would need almost 13 billion comparisons with our word list). Rather, we want to read each word from the file only once and store these words in a hash, with the key being the canonical or normalized form of the word (say a string where the letters of the word have been sorted) and the value an array of the original words. In other words, we will use a hash of arrays (HoA). Once this data structure is populated, we can sort it by the number of items in the array or, probably better, roll out some other way to pick up the entry with the largest number of items.

Since we know we have at least one case with 6 anagrams (post, spot, etc.), we can skip input words with less than three letters. Similarly, in the for loop to find the highest counts of anagrams, we can initialize the $max variable at 5 to skip any entry with a lower count.

use strict;
use warnings;
use feature 'say';

my %words;   # our HoA
my $file_in = "words.txt";
open my $IN, "<", $file_in or die "Ouverture impossible $file_in $!";
while (my $word = <$IN>) {
    next unless $word =~ /\w/;  # skipping empty lines if any
    $word =~ s/\s+$//;          # removing trailing spaces, new lines and carriage returns (if any)
    next if length $word < 3; 
    my $key = join '', sort split //, $word;  # normalizing the word for the hash key
    push @{$words{$key}}, $word;              # storing the word in the HoA
}
close $IN;
my @max_anagrams;
my $max = 5;
for my $key (keys %words) {
    next if @{$words{$key}} < $max;
    if (@{$words{$key}} == $max) {
        push @max_anagrams, $key;
    } else {
        @max_anagrams = ($key); 
        $max = scalar @{$words{$key}};
    }
}
say "$_:\t @{$words{$_}}" for (@max_anagrams);

The script finds two groups of eleven anagrams and runs in less than one second:

$ time perl anagrams.pl
aelrst:  alerts alters artels estral laster ratels salter slater staler stelar talers
aeprs:   apers asper pares parse pears prase presa rapes reaps spare spear

real    0m0.957s
user    0m0.781s
sys     0m0.077s

I must admit that there are in these two lists quite a few words which I had never seen before.

Perl 6 Solution

I wish I could suggest a markedly different solution in Perl 6, for example using bags or other such structures as we did before, but storing the words in a hash as in Perl 5 really seems to be the best and simplest solution. I will still try to depart from the P5 version and use idiomatic P6:

my %words;
for "words.txt".IO.lines -> $line { 
    next unless $line ~~ /\w/;  # skipping empty lines if any
    $line ~~ s/\s+$//;          # removing trailing spaces if any
    next if $line.chars < 3;
    my $key = $line.comb.sort.join('');
    push %words{$key}, $line;
}
my @max-anagrams;
my $max = 5;
for %words.keys -> $key {
    given %words{$key}.elems {
        when $_ < $max  { next }
        when $_ == $max { @max-anagrams.push($key) }
        default         { @max-anagrams = $key,; $max = $_}
    }
}
say "$_:\t %words{$_}" for (@max-anagrams);

The script finds the same groups of eleven anagrams as the P5 script:

aelrst:  alerts alters artels estral laster ratels salter slater staler stelar talers
aeprs:   apers asper pares parse pears prase presa rapes reaps spare spear

The P6 script is significantly more concise (by about one third), in part because it avoids some boiler plate code, and in part because the P6 syntax is more expressive. The P6 version is also, in my humble opinion, easier to read, in the sense that it avoids the somewhat obscure syntax of P5 references.

When I first wrote the above code yesterday, I had sort of forgotten that there is a max builtin routine that is designed to return the maximum value from a list. We could remove the second for loop and simplify the code above quite significantly:

my %words;  
for "words.txt".IO.lines -> $line { 
    next unless $line ~~ /\w/;  # skipping empty lines if any
    $line ~~ s/\s+$//;          # removing trailing spaces if any
    next if $line.chars < 3;
    my $key = $line.comb.sort.join('');
    push %words{$key}, $line;
}
my $max = max map { .elems }, values %words; 
say "$_:\t %words{$_}" if %words{$_}.elems == $max for %words.keys;

Finding the largest count of anagrams (the $max variable defined in the penultimate code line above) could also be written with a method invocation syntax and the use of the unary >> hyperoperator as:

my $max = %words.values>>.elems.max;

Either way, the P6 script is now far more concise and, with just 10 code lines, about 2.7 times shorter than the P5 code. Note, however, that the code of this new version might be slightly less efficient, because we are now traversing the %words hash twice instead of only once. As we will see below, this is somewhat immaterial because performance bottlenecks are elsewhere (namely, in the first for loop of our original version).

Performance Issue with Perl 6

The P6 script above is more concise than its P5 counterpart, but it runs in about 19.3 seconds, which is about 20 times slower than the P5 equivalent. Let's face it: that's a real problem when dealing with a relatively large amount of data (although a file of 113-k words really isn't that large). In my daily work activity with Perl 5, I'm processing regularly files that have several GB or even sometimes tens of GB of data; I simply couldn't afford to do it with P6 at this point.

I tried to investigate a bit the reasons for such slow processing. It appears that the program is spending more than 97.5% of the time in the first for loop (of the original version) which reads the words.txt file and populates the %words hash (18.8 sec. out of a total of 19.3 sec.). Commenting out the two code lines with regexes (which are not really needed with my clean input file) brings the time spent for the first for loop to 13.0 sec. So, quite obviously, the regex engine's speed (or, rather, lack thereof) is a significant part of the problem, but there is more.

What else in that for loop is so slow? I suspected it might use the regex engine in the IO.lines method that provides individual lines to the program. But, further tests showed that this is not the case. The real slow lines in that loop are those two:

my $key = $line.comb.sort.join('');   #  6.5 seconds
push %words{$key}, $line;             #  6.3 seconds

Then, it occurred to me that (as mentioned earlier) I have a rather old version of Perl 6/Rakudo on the box where I ran these tests:

perl6 -v
This is Rakudo Star version 2017.10 built on MoarVM version 2017.10
implementing Perl 6.c.

Maybe a newer version of Rakudo will bring better performance, as a lot of performance enhancement work has been done recently. I tried to download a more recent version, but, unfortunately, the Rakudo download site is temporarily unavailable these days. So I decided at least to try on another PC where I have a slightly more recent version (2018.04.1), although this other PC is, I believe, less powerful and slower. The performance seems to be significantly better:

  • Full for loop, including regexes: 13.9 seconds;
  • Full for loop, without the regexes: 9.6 seconds;
  • Full for loop, without the regexes and without the push: 5.3 seconds.

That's about 30% faster, so it's better, but we are still very far behind Perl 5. I wish I could test with a much more recent version, but that's not possible for the time being. I might come back to this subject when the download server becomes available again.

Wrapping up

The next week Perl Weekly Challenge is due to start very soon. If you're interested in participating in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 6 p.m. BST (British summer time) on next Sunday, May 5. And, please, also spread the word about the Perl Weekly Challenge if you can.

1 Comment

Leave a comment

About laurent_r

user-pic I am the author of the "Think Perl 6" book (O'Reilly, 2017) and I blog about the Perl 5 and Raku programming languages.