Perl Weekly Challenge # 22: Sexy Prime Pairs and Compression Algorithm

These are some answers to the Week 22 of the Perl Weekly Challenge organized by Mohammad S. Anwar.

Spoiler Alert: This weekly challenge deadline is due in several days from now (August 25, 2019). This blog post offers some solutions to this challenge, please don't read on if you intend to complete the challenge on your own.

Challenge # 1: Sexy Prime Pairs

Write a script to print first 10 Sexy Prime Pairs. Sexy primes are prime numbers that differ from each other by 6. For example, the numbers 5 and 11 are both sexy primes, because 11 - 5 = 6. The term “sexy prime” is a pun stemming from the Latin word for six: sex. For more information, please checkout wiki page.

My first question, when reading this definition, was whether sexy primes had to be consecutive prime numbers. The example provided (as well as those found in the the Wikipedia page) shows that it needs not be the case: 5 and 11 are not consecutive primes (since 7 is also prime). If sexy primes had to be consecutive primes, then the first such pair would be (23, 29). With that answer to my question, it seems to me that all we need to do is to look at each prime number p and check whether p + 6 is prime (and stop as soon as we have 10 sexy pairs.

Note that (1, 7) is not a sexy prime pair (despite having a gap of 6), because 1 is not considered to be a prime number. Therefore, to avoid the risk of finding a false sexy prime pair, we will start our search with number 2.

Sexy Prime Pairs in Perl 6

We first build a lazy infinite list @sexy-primes of prime numbers such that each such prime + 6 is also prime, and then print the pairs:

use v6;

my @sexy-primes = grep { .is-prime and ($_ + 6).is-prime}, (2, 3, *+2 ... Inf);
say "@sexy-primes[$_] ", @sexy-primes[$_] + 6 for ^10;

Note that, as a basis for finding the primes, we use a sequence operator with an explicit generator in order to check parity only for odd numbers. This avoids useless computations on even numbers which cannot be prime (except for 2). This might be considered premature optimization (and we all know what Donald Knuth said about premature optimization). Well, yes, but, at the same time, I don't like to let my programs do unnecessary work.

And this prints:

$ perl6 sexy-pairs.p6
5 11
7 13
11 17
13 19
17 23
23 29
31 37
37 43
41 47
47 52

This program is so short that we can easily get rid of the @sexy-primes temporary array and transform the script into a Perl6 one-liner:

$ perl6 'say "$_ ", $_+6 for (2...*).grep({.is-prime && ($_ + 6).is-prime})[^10];'
5 11
7 13
11 17
13 19
17 23
23 29
31 37
37 43
41 47
47 53

Sexy Prime Pairs in Perl 5

Since we know from our tests with Perl 6 that we're not going to look for prime numbers much larger than 50, we don't need to try hard to optimize the primality check subroutine as we've done in some previous weekly challenges. Our is_prime subroutine will simply test all possible factors between 2 and the square root of the number being checked.

Since Perl 5 doesn't have infinite lists, we will just use an infinite while loop instead and break out of it with a last statement once we've found what we need.

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;

sub is_prime{
    my $num = shift;
    for my $i (2 .. $num ** .5) {
        return 0 if $num % $i == 0;
    }
    return 1;
}

my ($candidate, $count) = (2, 0);
while (1) {
    if (is_prime $candidate and is_prime $candidate + 6) {
        say "$candidate ", $candidate + 6;
        $count ++
    }
    last if $count >= 10;
    $candidate ++;
}

And this prints out the same output as our P6 programs, no point of repeating it here.

Note that the program runs in 0.08 second, there was really no need to try to optimize performance.

Lempel–Ziv–Welch (LZW) Compression

Write a script to implement Lempel–Ziv–Welch (LZW) compression algorithm. The script should have method to encode/decode algorithm. The wiki page explains the compression algorithm very nicely.

Lempel–Ziv–Welch (LZW) is a lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978.

The scenario described by Welch encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence with no code yet in the dictionary. The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary.

For encoding (or, really, compressing) a string, we buffer input the characters in a sequence (note that we use here the variables names from the Wikipedia page to facilitate understanding) until the next is not in the %dict hash. Emit the code for , and add plus the next character to the hash. Start buffering again with the next character. Concretely, we first populate the %dict hash with the single possible letters. Then, we traverse the input string character by character and build the sequence as long as it exists in the dict hash. When the new sequence to be built does not exist in the hash, we add the previous sequence to the result, add the new one to the hash and start a new sequence with the last visited character.

For decoding (decompressing), we use the same initial hash as when encoding (we don't need the final hash, so we don't need to transmit the dictionary, which can be hard coded). Additional entries can be reconstructed as they are always simply concatenations of previous entries. Concretely, we populate %dict hash as before, but inverting keys and values. Then we go through the codes one by one; if a code exists in the hash, we just convert it and add it to the output; else, we build the new sequence, add it to the output and add the sequence concatenated with the sequence's first character to the hash.

There is nothing specific to either Perl 5 or Perl 6 in the above explanations, so they apply to both our P5 and P6 implementations below.

LZW Compression in Perl 6

For a start, we will use an input string ('TOBEORNOTTOBEORTOBEOR...') consisting only of capital letters ('A'..'Z'), as in the Wikipedia article, and populate our initial hash %dict with corresponding numeric codes between 0 and 25.

use v6;

constant $start-dict-size = 26;

sub encode (Str $in) {
    my %dict = map { $_[0] => $_[1] }, 
        ( ('A'..'Z') Z (^$start-dict-size) );
    my $ω = "";
    my @result = gather {
        for $in.comb -> $c {
            my $ωc = $ω ~ $c;
            if %dict{$ωc}:exists {
                $ω = $ωc;
            } else {
                take %dict{$ω};
                %dict{$ωc} = +%dict;
                $ω = $c;
            }
        }
        take %dict{$ω} if $ω.chars;
    }
    # say %dict;
    return @result;
}
sub decode (@encoded) {
    my $dict-size = $start-dict-size;
    my %dict = map { $_[1] => $_[0] }, 
        ( ('A'..'Z') Z (^$start-dict-size) );
    my $ω = %dict{shift @encoded};
    my @result = gather {
        take $ω; 
        for @encoded -> $i {
            my $str;
            if %dict{$i}:exists {
                $str = %dict{$i};
            } elsif  $i == $dict-size {
                $str = $ω ~ $ω.substr(0,1) 
            }
            take $str;
            %dict{$dict-size++} = $ω ~ $str.substr(0,1);
            $ω = $str;
        }
    }
    return join "", @result;
}

my $input_str = 'TOBEORNOTTOBETOBEORNOTTOBETOBEORNOTTOBE';
my @encoded = encode $input_str;
say @encoded;
say decode @encoded;

Running this code produces a correct round trip and displays the following output:

$ perl6 LZW_compression.p6
[19 14 1 4 14 17 13 14 19 26 28 35 29 31 33 37 37 30 32 34 27 4]
TOBEORNOTTOBETOBEORNOTTOBETOBEORNOTTOBE

The encoded (compressed) code has 22 numbers that could each be encoded over 6 bits, so that's a total of 132 bits. The input string had 39 bytes, i.e. 312 bits. In other words, we obtain a compression ratio of 2.36. Admittedly, we could have used a fixed-length encoding scheme and encoded each character of the input string over 5 bits, which would have led to a total of 195 bits, leading to a compression ratio of 1.6. We still get an LZW compression ratio which is 1.47 times better than a fixed-length encoding.

The reason for this better compression ratio is that many of our numeric codes represent two letters of the input, and some of them even more letters; for example, numeric code (35) stands for 3 letters, "TOB", and code 37 stands for 4 letters, "TOBE":

19 14 1 4 14 17 13 14 19 26 28 35  29 31 33 37   37   30 32 34 27 4
T  O  B E O  R  N  O  T  TO BE TOB EO RN OT TOBE TOBE OR NO TT OB E

Encoding only ASCII capital letters is of course very limited. Leaving aside Unicode, we would like at least to be able to compress bytes encoded over 256 bits. For this, we only need to change the $start-dict-size constant to 256 and to populate the initial %dict hash accordingly. For example, this way for the encode subroutine:

my %dict = map { .chr => $_ }, ^$start-dict-size;

And this way in the decode subroutine:

my %dict = map { $_ => .chr }, ^$start-dict-size;

The compressed code still has 22 numbers, but the compression rate would fall down, because these numbers would now need to be encoded over more bits:

[84 79 66 69 79 82 78 79 84 256 258 265 259 261 263 267 267 260 262 264 257 69]

And we can now compress data not comprising only of capital ASCII letters. For example, with the following input string:

To be or not to be, to be or not to be, that's the question

we obtain the following output:

perl6 LZW_compression.p6
[84 111 32 98 101 32 111 114 32 110 111 116 32 116 257 259 44 268 270 260 262 264 266 
273 258 101 272 116 104 97 116 39 115 268 104 260 113 117 101 115 116 105 111 110]
To be or not to be, to be or not to be, that's the question

LZW Compression in Perl 5

As for the P6 implementation, we will use an input string ('TOBEORNOTTOBE...') consisting only of capital letters ('A'..'Z'), as in the Wikipedia article, and populate our initial hash %dict with corresponding numeric codes between 0 and 25. Translating the Perl 6 implementation into Perl 5 is a bit tedious because of all these small pesky syntax differences between P5 and P6, but is conceptually a piece of cake. Here we go:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw /say/;
use constant start_dict_size => 256;
use utf8;

sub encode {
    my $in = shift;
    my %dict = map { chr $_ => $_ } 0 .. start_dict_size - 1;
    my $ω = "";
    my @result;

    for my $c (split //, $in) {
        my $ωc = $ω . $c;
        if (exists $dict{$ωc}) {
            $ω = $ωc;
        } else {
            push @result, $dict{$ω};
            $dict{$ωc} = scalar keys %dict;
            $ω = $c;
        }
    }
    push @result, $dict{$ω} if length $ω;
    return @result;
}
sub decode {
    my @encoded = @_;
    my $dict_size = start_dict_size;
    my %dict = map { $_ => chr } 0 .. start_dict_size - 1;;
    my $ω = $dict{shift @encoded};
    my @result = ($ω); 
    for my $i (@encoded) {
        my $str;
        if (exists $dict{$i}) {
            $str = $dict{$i};
        } elsif  ($i == $dict_size) {
            $str = $ω . substr $ω, 0, 1; 
        } else { die "Error on $i" }
        push @result, $str;
        $dict{$dict_size++} = $ω . substr $str, 0, 1;
        $ω = $str;
    }
    return join "", @result;
}

my $input_str = 'TOBEORNOTTOBETOBEORNOTTOBETOBEORNOTTOBE';
my @encoded = encode $input_str;
say "@encoded";
say decode(@encoded);

The round trip works as before:

$ perl LZW_compression.pl
84 79 66 69 79 82 78 79 84 256 258 265 259 261 263 267 267 260 262 264 257 69
TOBEORNOTTOBETOBEORNOTTOBETOBEORNOTTOBE

(The numerical codes are not the same as in the original P6 implementation because we've used directly a starting %dict hash with 256 entries, but they are the same as in out second P6 test.)

Wrapping up

The next week Perl Weekly Challenge is due to start soon. If you want to participate in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 23:59 BST (British summer time) on Sunday, September 1. And, please, also spread the word about the Perl Weekly Challenge if you can.

Leave a comment

About laurent_r

user-pic I am the author of the "Think Perl 6" book (O'Reilly, 2017) and I blog about the Perl 5 and Raku programming languages.