Perl Weekly Challenge #014

Challenge 1: Van Eck's sequence

I must admit I found the definition of the van Eck sequence hard to follow. The EXAMPLE at OEIS helped:

We start with a(1) = 0. 0 has not appeared before, so the rule says a(2) = 0. Now 0 HAS occurred before, at a(1), which is 1 term before, so a(3) = 1. 1 has not occurred before, so a(4) = 0. 0 appeared most recently at term a(2), which is 2 terms earlier, so a(5) = 2. 2 has not occurred before, so a(6) = 0. And so on.

Ok, so this is a recursive sequence because new elements are defined in terms of previously-occurring elements (Wikipedia). But the best algorithm for generating a recursive sequence is not always a recursive algorithm. In this case, it seems easiest to use a simple loop while maintaining two arrays, containing (i) the sequence so far, and (ii) the index of each element’s most recent occurrence in the sequence. With this data to hand, generating the next element is fairly straightforward. The only tricky part is deciding when to update the index array. In generating element n + 1, we need the distance between n and the most recent previous occurrence of the nth element. So the index data for the nth element should be recorded only after the (n + 1)th element has been generated.

Perl 5 solution

Once again I use Regexp::Common to verify that the command-line input (the required sequence length) is an integer. In sub van_eck it is simpler to by-pass the first (i.e., 0-index) element of the @seq array, thereby treating $seq[n] as the nth term in the series. To accomplish this I prepend undef to the array before calculating the series, then shift it off before returning the completed sequence.

File ch-1.pl


use strict;
use warnings;
use Const::Fast;
use Regexp::Common;

const my $LENGTH =>  27;
const my $USAGE  => "USAGE: perl $0 [<length>]";

$| = 1;

MAIN:
{
    scalar @ARGV <= 1
        or die "\n$USAGE\n";

    my $length = $ARGV[0] // $LENGTH;

    $length =~ /^$RE{num}{int}$/ && $length > 0
        or die "\nInvalid length '$length': " .
                 "must be an integer > 0\n";

    print "\nThe first $length terms in Van Eck's sequence are:\n",
           join( ', ', van_eck($length)->@* ), "\n";
}

sub van_eck
{
    my ($len) =  @_;
    my  @seq  = (undef, 0);
    my  %indices;

    for my $n (1 .. $len - 1)
    {
        my $old_term = $seq[$n];

        push @seq, exists $indices{$old_term} ?
                     $n - $indices{$old_term} : 0;

        $indices{$old_term} = $n;
    }

    shift   @seq;
    return \@seq;
}

Perl 6 solution

The translation from Perl 5 to Perl 6 is straightforward. Note that Perl 5’s undef becomes Nil, and exists changes from a function into a subscript adverb.

File ch-1.p6


use v6;

my Int constant $LENGTH = 27;

sub MAIN(Int:D $length = $LENGTH)
{
    say "\nThe first $length terms in Van Eck's sequence are:\n",
         van-eck($length).join(', ');
}

sub van-eck(Int:D $length --> Array)
{
    my @seq = (Nil, 0);
    my %indices;

    for 1 .. $length - 1 -> Int $n
    {
        my $old_term = @seq[$n];

        push @seq, %indices{$old_term}:exists ??
              $n - %indices{$old_term}        !! 0;

        %indices{$old_term} = $n;
    }

    shift  @seq;
    return @seq;
}

Challenge 2: Making words from US state postal abbreviations

For the dictionary I chose “words_alpha.txt,” a plaintext file containing 370,099 English words (one word per line), which I had previously downloaded from GitHub. The US state names, together with each state’s two-character postal abbreviation, are taken from the Wikipedia article.

First thought in a case like this is to build candidate “words” using 2-character abbreviations as building blocks; then look them up in the dictionary to determine whether they’re valid English words. But a simple calculation shows that this is impractical: 12-character words could be built in 506 = 15,625,000,000 different ways!

A better approach is to test each word in the dictionary to determine whether it can be decomposed into the given building blocks. This is still a costly operation, but it’s far more manageable. Also, it can be streamlined: (1) any word with an odd number of letters can be immediately discarded; and (2) a word can be discarded as soon as any of its constituent 2-letter sequences fails to match a building block.

(There are other potential optimisations which I haven’t investigated. For example, a candidate word could be discarded up-front if it contains any letter not present in at least one of the state abbreviations.)

Perl 5 solution

For testing each 2-letter substring of a candidate word, I build $STATES_REGEX as an alternation of all the allowable two-letter codes. Writing this by hand would involve duplication of the data already present in %STATES_POSTAL, which would violate the DRY principle and make unnecessary work! So I use eval together with qr// and join, and Perl builds the regex for me.

In two places I use global matching in an expression of the form $string =~ /../g (in list context, of course) to decompose $string into its constituent 2-character substrings.

Once all the solutions have been found, it’s necessary to find the longest. First, I sort the solutions in descending order of word length. This guarantees that $words->[0] is as long as any other solution; but there might be others of the same length. (In fact, using the dictionary I’ve chosen, there is only one solution — but the script can’t know that in advance!) So I continue up the array, incrementing $index, until array element $words->[$index] has a length shorter than that of the first solution. The last solution is thus $words->[$index - 1]. The set of longest solutions is returned as an array slice.

File ch-2.pl


use strict;
use utf8;
use warnings;
use Const::Fast;
use constant TIMER => 1;

const my %STATES_POSTAL =>
      (
          AL => 'Alabama',        AK => 'Alaska',
          AZ => 'Arizona',        AR => 'Arkansas',
          CA => 'California',     CO => 'Colorado',
          CT => 'Connecticut',    DE => 'Delaware',
          FL => 'Florida',        GA => 'Georgia',
          HI => 'Hawaii',         ID => 'Idaho',
          IL => 'Illinois',       IN => 'Indiana',
          IA => 'Iowa',           KS => 'Kansas',
          KY => 'Kentucky',       LA => 'Louisiana',
          ME => 'Maine',          MD => 'Maryland',
          MA => 'Massachusetts',  MI => 'Michigan',
          MN => 'Minnesota',      MS => 'Mississippi',
          MO => 'Missouri',       MT => 'Montana',
          NE => 'Nebraska',       NV => 'Nevada',
          NH => 'New Hampshire',  NJ => 'New Jersey',
          NM => 'New Mexico',     NY => 'New York',
          NC => 'North Carolina', ND => 'North Dakota',
          OH => 'Ohio',           OK => 'Oklahoma',
          OR => 'Oregon',         PA => 'Pennsylvania',
          RI => 'Rhode Island',   SC => 'South Carolina',
          SD => 'South Dakota',   TN => 'Tennessee',
          TX => 'Texas',          UT => 'Utah',
          VT => 'Vermont',        VA => 'Virginia',
          WA => 'Washington',     WV => 'West Virginia',
          WI => 'Wisconsin',      WY => 'Wyoming',
      );

# Regular expression to match (case-insensitively, but otherwise
# exactly) any one of the 50 state postal abbreviations

const my $STATES_REGEX =>
          eval 'qr/^(?:' . join('|', keys %STATES_POSTAL) . ')$/i';

const my $WORDSFILE => 'words_alpha.txt';

$| = 1;

MAIN:
{
    use if TIMER, 'Time::HiRes' => qw( gettimeofday tv_interval );
    my  $t0 = [gettimeofday] if TIMER;
    my ($total, $words) = get_words();

    printf "\n%d words read from file '$WORDSFILE', of which\n" .
             "%6d can be formed from US state abbreviations\n",
              $total, scalar @$words;

    if (scalar @$words == 0)
    {
        print "\nNo solutions found\n";
    }
    else
    {
        my $solutions = get_solutions($words);

        printf "\nThe longest of these ha%s %d letters:\n",
               (scalar @$solutions == 1) ? 's' : 've',
                length $solutions->[0];

        for my $solution (@$solutions)
        {
            my @states = map { $STATES_POSTAL{uc $_} }
                         $solution =~ /../g;

            printf "%s = %s\n", $solution, join(' + ', @states);
        }
    }

    my $t = tv_interval($t0)     if TIMER;
    print "\n", $t, " seconds\n" if TIMER;
}

sub get_words
{
    my $total = 0;
    my @words;

    open my $fh, '<', $WORDSFILE
        or die "Cannot open file '$WORDSFILE' for reading, stopped";

    WORD:
    while (my $word = <$fh>)
    {
        ++$total;
        chomp $word;
        next unless length($word) % 2 == 0;

        for my $pair ($word =~ /../g)
        {
            next WORD unless $pair =~ $STATES_REGEX;
        }

        push @words, $word;
    }

    close $fh
        or die "Cannot close file '$WORDSFILE', stopped";

    return ($total, \@words);
}

sub get_solutions
{
    my ($words)  = @_;
       @$words   = sort { length $b <=> length $a } @$words;  # Desc
    my  $index   = 0;
    my  $max_len = length $words->[$index];

    1 while length $words->[++$index] == $max_len;

    return [ @$words[0 .. --$index] ];
}

Perl 6 solution

To improve performance (see Timings, below) I abandoned the alternation-regex approach in favour of Perl 6’s built-in Set functionality. So, instead of $pair ~~ $STATES_REGEX I have:

$pair.uc ∈ $STATES;

where $STATES is a set of strings (the 2-character postal abbreviations) and the operator tests whether the object on its LHS is an element of the set on its RHS.

In sub get-solutions the sort on word length is accomplished with a simpler syntax: $words.sort: { .chars };. However, I don’t yet know how to make this sort in descending order, so I had two choices:

  1. Use reverse, e.g. $words.sort( { .chars } ).reverse;
  2. Reverse the index logic and search down the array by decrementing $index.

In the event, I chose the latter option, although either would work.

File ch-2.p6


use v6;

my constant $TIMER = 1;

# Abbreviations according to the "USPS" (United States Postal
# Service) column of the Table in
# https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations

my constant %STATES_POSTAL =
(
    :AL('Alabama'),        :AK('Alaska'),
    :AZ('Arizona'),        :AR('Arkansas'),
    :CA('California'),     :CO('Colorado'),
    :CT('Connecticut'),    :DE('Delaware'),
    :FL('Florida'),        :GA('Georgia'),
    :HI('Hawaii'),         :ID('Idaho'),
    :IL('Illinois'),       :IN('Indiana'),
    :IA('Iowa'),           :KS('Kansas'),
    :KY('Kentucky'),       :LA('Louisiana'),
    :ME('Maine'),          :MD('Maryland'),
    :MA('Massachusetts'),  :MI('Michigan'),
    :MN('Minnesota'),      :MS('Mississippi'),
    :MO('Missouri'),       :MT('Montana'),
    :NE('Nebraska'),       :NV('Nevada'),
    :NH('New Hampshire'),  :NJ('New Jersey'),
    :NM('New Mexico'),     :NY('New York'),
    :NC('North Carolina'), :ND('North Dakota'),
    :OH('Ohio'),           :OK('Oklahoma'),
    :OR('Oregon'),         :PA('Pennsylvania'),
    :RI('Rhode Island'),   :SC('South Carolina'),
    :SD('South Dakota'),   :TN('Tennessee'),
    :TX('Texas'),          :UT('Utah'),
    :VT('Vermont'),        :VA('Virginia'),
    :WA('Washington'),     :WV('West Virginia'),
    :WI('Wisconsin'),      :WY('Wyoming'),
);

my constant $STATES = Set[Str].new( %STATES_POSTAL.keys );

# Dictionary file downloaded from
# https://github.com/dwyl/english-words

my Str constant $WORDS_FILE = 'words_alpha.txt';

sub MAIN()
{
    my DateTime $t0 = DateTime.now if $TIMER;

    my (Int:D $total, Array:D $words) = get-words();

    say "\n$total words read from file '$WORDS_FILE', of which\n",
         sprintf('%*d', $total.chars, $words.elems),
        ' can be formed from US state abbreviations';

    if ($words.elems == 0)
    {
        say "\nNo solutions found";
    }
    else
    {
        my (Int:D $max-len, List:D $solutions) =
            get-solutions($words);

        say "\nThe longest of these ha",
             $solutions.elems == 1 ?? 's' !! 've',
            "$max-len letters:";

        for @$solutions -> Str:D $solution
        {
            my @states = ($solution ~~ m:g/../).map:
                         { %STATES_POSTAL{.uc} };

            say $solution, ' = ', @states.join(' + ');
        }
    }

    if ($TIMER)
    {
        my DateTime $t = DateTime.now;
        say "\nTime elapsed: { $t - $t0 } seconds";
    }
}

sub get-words(--> List:D)
{
    my Int $total = 0;
    my     @words;

    WORD: for $WORDS_FILE.IO.lines -> Str $word
    {
        ++$total;

        next unless $word.chars % 2 == 0;

        for $word ~~ m:g/../ -> Match $pair
        {
            next WORD unless $pair.uc ∈ $STATES;
        }

        @words.push: $word;
    }

    return $total, @words;
}

sub get-solutions(Array:D $words --> List:D)
{
    my     @words   = $words.sort: { .chars };
    my Int $index   = @words.end;
    my Int $max-len = @words[ $index ].chars;

    Nil while @words[ --$index ].chars == $max-len;

    return $max-len, @words[ ++$index .. * ];
}

Timings

The main problem with the Perl 6 solution, as compared with its Perl 5 counterpart, is its performance:

Script Time (seconds)
ch-2.pl 1.6
ch-2-REGEX.p6 36.4
ch-2.p6 22.7


where “ch-2-REGEX.p6” is like “ch-2.p6” but uses an alternation regex similar to the one used in “ch-2.pl.” As can be seen, using Perl 6’s operator on a Set gives a significant speedup over an alternation-regex; but even so, the Perl 6 code is an order of magnitude slower than its Perl 5 equivalent.

Solution

The output of the Perl 5 script above (with timing code added) is:


16:03 >perl ch-2.pl

370099 words read from file 'words_alpha.txt', of which
   532 can be formed from US state abbreviations

The longest of these has 12 letters:
cacogalactia = California + Colorado + Georgia + Louisiana + Connecticut + Iowa

1.635201 seconds

16:04 >

For the record, cacogalactia means bad milk: either “a bad condition of the milk” (Wordnik) or “producing unhealthy milk” (Medical Dictionary).

2 Comments

Hi Athanasius,

With respect to the words from US Postal Codes challenge in P6, a possible solution that I did not use because I did not build a list of words but just kept track of the longest word so far. But since you build a list, using the "max" method might be slightly simpler and possibly slightly faster than sorting the data.

    $solution ~~ m:g/../
    $solution.comb(2) # much faster
    $a % 2 == 0
    $a %% 2 # clearer

Leave a comment

About Athanasius

user-pic PerlMonk; experienced Perl5 hacker; learning Perl6.