July 2019 Archives

Chopping substrings

The first task of the 18th Weekly Challenge was to find the longest common substring(s) in a set of strings. That is, given a set of strings like “ABABC”, “BABCA” and “ABCBA”, print out “ABC”.

The “best practice” solution is a technique known as suffix trees, which requires some moderately complex coding. However, we can get very reasonable performance for strings up to hundreds of thousands of characters long using a much simpler approach.

Let’s start with a very concise, but suboptimal, solution. If we has a list of all the possible substrings for each string:


...then we could treat each list as a set, take the intersection of those sets:

ABC, AB, BA, BC, A, B, C

...then simply find the longest element: ABC

Of course, sometimes the intersection of all possible substrings will have more than one longest element:


...so we need to ensure that our algorithm finds them all.

Given that Raku has sets and set operations built right into the language, the only challenge here is to generate the full set of all possible substrings for each string. And, given that Raku has very sophisticated regular expression matching, that’s no real challenge either.

In a normal regex match:

   my $str = '_ABC_ABCBA_';

   say $str.match(/ <[ABC]>+ /);
   # Says: 「ABC」

The string is searched for the first substring that matches the pattern: in the above example, the first sequence of one or more of 'A', 'B', or 'C'. But we can also ask the Raku regex engine to return every separate substring that matches, by adding a :g (or :global) adverb:

say $str.match(/ <[ABC]>+ /, :g);
# Says: ( 「ABC」 「ABCBA」 )

Or we can ask for every overlapping substring that matches, by adding an :ov
(or :overlap) adverb:

say $str.match(/ <[ABC]>+ /, :ov);
# Says: ( 「ABC」 「BC」 「C」 「ABCBA」 「BCBA」 「CBA」 「BA」 「A」 )

Or we can ask for every possible substring that matches, by adding an :ex
(or :exhaustive) adverb:

say $str.match(/ <[ABC]>+ /, :ex);
# Says: ( 「ABC」 「AB」 「A」 「BC」 「B」 「C」
# 「ABCBA」 「ABCB」 「ABC」 「AB」 「A」
# 「BCBA」 「BCB」 「BC」 「B」
# 「CBA」 「CB」 「C」 「BA」 「B」 「A」 )

And, of course, this last alternative is precisely what we need: a way to extract every possible substring of a string. Except that we need to find substrings containing all possible characters, not just A/B/C, so we need to match against the “match anything” dot metacharacter:

say $str.match(/ .+ /, :ex);
# Says: ( 「_ABC_ABCBA_」 「_ABC_ABCBA」 「_ABC_ABCB」
# 「_ABC_ABC」 「_ABC_AB」 「_ABC_A」)
# ...(54 more elements here)...
# 「BA_」 「BA」 「B」 「A_」 「A」 「_」 )

As the above example illustrates, these regex matches return a sequence of Match objects (denoted by ... brackets on output), rather than a list of strings. But such objects are easy to convert to a list of strings, by calling the string coercion method on each of them (using a vector method call: ».):

say $str.match(/ .+ /, :ex)».Str;
# ...(54 more elements here)...
# BA_ BA B A_ A _ )

To find the common substrings of two or more strings, we need the exhaustive list of substrings for each of them. If the strings are stored in an array, we can use another vector method call (».) on that array to perform the same substring-ification on each of the strings:

@strings».match(/.+/, :ex)».Str

Now we just need to treat each of the resulting lists as a set, and take the intersection of them. That’s easy because Raku has a built-in operator, which we could use to reduce all the lists to a single intersection, like so:

[∩] @strings».match(/.+/, :ex)».Str

To extract a list of substrings from this set, we simple ask for its keys:

keys [∩] @strings».match(/.+/, :ex)».Str

We now have a list of the common substrings, so we just need a way to find the longest substring(s) in that set. Raku has a built-in max function, but that would only find the first substring of maximal length. We need all of them.

To get them, we’ll augment the existing built-in function with a new variant; one that returns every maximal value, not just the first. This new subroutine takes the same arguments as the existing max builtin, but requires an extra :all adverb, to indicate that we want all the maxima:

   # "Exhaustive" maximal...
   multi max (:&by = {$_}, :$all!, *@list) {
       # Find the maximal value...
       my $max = max my @values = @list.map: &by;

       # Extract and return all values matching the maximal...
       @list[ @values.kv.map: {$^index unless $^value cmp $max} ];

Now we can keep just the longest common substrings (i.e. those substrings whose .chars value is maximal) like so

max :all, :by{.chars}, keys [∩] @strings».match(/.+/, :ex)».Str

The original challenge specification requires that the program accept a list of strings as command-line arguments, and prints their longest common substrings, which looks like this:

   sub MAIN (*@strings) {
       .say for
           max :all, :by{.chars},
                keys [∩] @strings».match(/.+/, :ex)».Str

Problem solved!

Except for those pesky bioinformaticists. Give them this solution and they’ll immediately look for the longest common substrings in half a dozen DNA sequences with ten thousand base pairs. And instantly discover the fundamental flaw in this approach: that the number of possible substrings in a string of length N is (N²+N)/2, so the costs of computing and storing them quickly become prohibitive.

For example, to find the longest common substrings of six 10000-base DNA strings, we’d need to extract and refine just over 300 million substrings. That quickly becomes inpractical, especially when the bioinformaticists decide that they’re more interested in DNA strands with hundreds of thousands of base pairs, at which point the initial exhaustive lists of substrings now have at least 30 billion elements collectively. We clearly need a more scalable approach.

Fortunately, it’s not hard to devise one. The number of substrings of any length in a N-character string is O(), but the number of substrings of a specific length M is only N-M+1. For example in a 10-character string like "0123456789", there are only six 5-character substrings: "01234", "12345", "23456", "34567", "45678", and "56789".

So, if we knew in advance how long the longest common substring was going to be
(i.e. the optimal value of M), we could generate and compare only substrings of that
exact length (/. ** {$M}/), using :ov overlapping pattern matching
(instead of :ex exhaustive):

@strings».match(/. ** {M}/, :ov)».Str

That would reduce the computational complexity of the algorithm to a far more scalable O(N).

Of course, initially we have no clue how long the longest common substring is going to be (except that it must be between 1 and N characters long). So we do what all good computer scientists do in such situations: we guess. But good computer scientists don’t like to call it “guessing”; that doesn’t sound scientific at all. So they have a fancy formal name for this approach: the Binary-Chop Search Algorithm.

In a binary search, we start with some kind of reasonable upper and lower bounds
of the optimal substring length we’re seeking ($min and $max). At each step,
we compute our “guess” of that optimal length ($M) by averaging those two bounds
($M = round ($min+$max)/2), then somehow check whether our guess was correct:
in this case, by checking if the strings do in fact have any common sunstrings of length $M.

If the guess is too small, then the optimal substring length must be bigger, so the guess becomes our new lower bound ($min = $M + 1). Similarly, if the guess is too high, then it becomes our new upper bound ($max = $M - 1). We repeat this guessing process, halving the search interval at each step, until we either guess correctly, or else run out of candidates...which occurs when the updated value of $min eventually exceeds the updated value of $max.

Because a binary-chop search halves the search interval at each step, the complete search always requires O(log N) guesses in the worst case, so if we check each guess using our O(N) fixed-length common substring computation, we’ll end up with an O(N log N) algorithm for finding the longest common substring.

Which looks like this:

   sub lcs (@strings) {
       # Initial bounds are 1 to length of shortest string...
       my ($min, $max) = (1, @strings».chars.min);

       # No substrings found yet...
       my @substrs;

       # Search until out-of-bounds...
       while $min ≤ $max {
           # Next guess for length of longest common substring...
           my $M = round ($min+$max)/2;

           # Did we find any common substrings of that length???
           if [∩] @strings».match(/. ** {$M}/,:ov)».Str -> $found {
               # If so, remember them, and raise the lower bound...
               @substrs = $found.keys;
               $min = $M+1;
           else {
               # Otherwise, lower the upper bound...
               $max = $M-1;

       # Return longest substrings found...
       return @substrs;

That’s ten times as much code to achieve exactly the same result, but with it we also achieved a considerable improvement in performance. The one-liner solution takes just under 90 seconds to find the longest common substrings from six 1000-character strings; the binary-chop solution finds the same answer in 1.27 seconds. The general performance of the two solutions looks like this:

Graph of algorithmic performance showing that the cost of the binary-chop approach growing logarithmically with string length, compared to the quadratic growth of the one-liner

More importantly, at about 1000 characters per string, the one-liner starts failing
(possibly due to memory issues), whereas the binary-chop version has no problem
up to 10000-character strings.

Except that, at that point, the binary chop solution is taking nearly a minute to find the common substrings. Time to optimize our optimization.

The slow-down on longer string lengths seems mostly to be coming from the use of a regex to break the string into overlapping M-character substrings on each iteration. Regexes in Raku are immensely powerful, but they are still being internally optimized, so they’re not yet as fast as (say) Perl regexes.

More significantly, Raku regexes have a fundamental performance limitation in a situation like this: each result from the match has to be returned in a full Match object, which then has to be converted back to a regular Str object. That greatly increases the time and space required to run the algorithm.

But if we could work directly with strings we’d save most of the space, and a great deal of the time spent in constructing and transforming Match objects. And, in this case, that’s easy to do...because all we need at each search step is to construct a list of overlapping substrings of a fixed length $M:

   "ABCDEFGH"           # $str

   "ABCDE"              # $str.substr(0,$M)
    "BCDEF"             # $str.substr(1,$M)
     "CDEFG"            # $str.substr(2,$M)
      "DEFGH"           # $str.substr(3,$M)

In other words, for each string, we just need to construct:

   # For each offset, start at offset, take substring of length M
   (0..$str.chars-$M).map(-> $offset { $str.substr($offset, $M) })

...which, for all the strings is just:

   # For each string...
   @strings.map(-> $str {
       # Build a list of substrings at each offset...
       (0..$str.chars-$M).map(-> $offset {$str.substr($offset,$M)})

Or we could write it more compactly, using the topic variable $_ instead of $str,
and a placeholder variable $^i instead of $offset:

   @strings.map:{(0 .. .chars-$M).map:{.substr($^i,$M)}}

Using that expression to replace the overlapping regex match, and tidying the binary search loop by using internal state variables instead of external my variables, we get:

   sub lcs (@strings) {
      # Search...
      loop {
         # Bounds initially 1 to length of shortest string...
         state ($min, $max) = (1, @strings».chars.min);

         # When out-of-bounds, return longest substrings...
         return state @substrs if $max < $min;

         # Next guess for length of longest common substring...
         my $M = round ($min+$max)/2;

         # Did we find common substrings of that length???
         if [∩] @strings.map:{(0.. .chars-$M).map:{.substr($^i,$M)}}
             # If so, remember them, and increase the lower bound...
             @substrs = $^found.keys;
             $min = $M+1;
         else {
             # Otherwise, reduce the upper bound...
             $max = $M-1;

Changing the way we extract candidate substrings provides another considerable boost in performance. Whereas the regex-based version took 54.6 seconds to find the longest common substring of six 10000-character strings, this new substring-based version finds the same answer in 3.1 seconds. And our new version scales much better too:

Graph of algorithmic performance showing that substring-based binary-chop is asymptotically around 20 times faster than the previous regex-based approach

Which just goes to show: no matter how cool and powerful your programming language is, no matter how much it lets you achieve in a cunning one-liner, when it comes to scalable performance, nothing beats the right algorithm.


Six slices of pie

The first task of the 15th Weekly Challenge was to find the optimal position in the line for dessert.

This is a reformulation of the old conundrum of creating buggy whips or cowboy movies or physical book stores or carbon-based energy sources: when is the best time to opt in for an ever-increasing slice of an ever-diminishing pie?

In this task, the pie is big enough for 100 people, but the first person in line gets a slice equivalent to only 1% of it. The second person gets 2% of what’s left, the third gets 3% of what remains after that, et cetera, et cetera, until the one-hundredth person receives 100% of whatever crumbs are left after the preceding ninety-nine have taken their share. In other words, we’re “Making Apple-pie Great Again!”

The actual task is to determine which is the optimal position in the line. Is it better to get in early and get a small slice of a still-large pie? Or to wait and eventually obtain a bigger slice of the leftovers? Let’s answer that question six different ways.

1. Imperative slices

I started my developer career as a C programmer and I still find it natural to approach small problems like this one in an imperative mode. So my very first attempt to answer the question (mercifully, in Raku rather than in C) looked like this:

    # Start with a full pie, and no idea who gets the most...
    my $pie       = 1.00;    # i.e. 100%
    my $max-share = 0.00;    # i.e.   0%
    my $max-index;

    # For everyone in line...
    for 1..100 -> $N {

        # Take N% of what’s left as the next share...
        my $share = $pie * $N/100;
        $pie -= $share;

        # Remember it if it’s maximal...
        if $share > $max-share {
            ($max-index, $max-share) = ($N, $share);

    # Report the biggest share taken...
    say "$max-index => $max-share";

It’s not pretty, nor concise, nor simple, but it’s quick and it produces an answer:

    10 => 0.06281565095552947

In other words, the optimal place in line is to be tenth, which secures you just over 6¼ percent of the original pie.

But this solution has some problems...and not just the obvious ones of gawkiness, prolixity, and far too many moving parts. The deeper issue is that it makes an significant unwarranted assumption (one that happens to be correct, though that’s no excuse as I didn’t know it at the time).

So let’s set about improving the solution.

The unwarranted assumption I made was that there actually was a single optional place in the queue. But what if there were two equally good places? Or three? Or a plateau after which all places were equally good? The code above actually only tells us that the person in tenth place is the first person who receives a maximal share. What if there were others?

It’s also a red flag that we’re identifying that first maximal value “manually”, by checking each new share as we create it within the loop. Raku has a perfectly good built-in max function that could do that for us. Though, of course, what we really need is a max function that returns all the places where the maximum is found, not just the first. Well...Raku has that too: maxpairs

Yet another red flag in the code is the need to add comments documenting the various numbers representing percentages. (e.g. 1.00 and $N/100). There ought to be a way to have the code itself make that obvious.

Here’s my second attempt at an imperative solution, incorporating all those improvements:

    # Add a percentage operator...
    sub postfix:<%> (Numeric $x) { $x / 100 }

    # Start with 100% of the pie, and no shares yet...
    my $pie = 100%;
    my %share;

    # For everyone in line...
    for 1..100 -> $N {
        # Remove a share for the Nth person from the pie...
        $pie -= ( %share{$N} = $pie * $N% );

    # Finally, report the share(s) with maximal values...
    say %share.maxpairs;

Surprisingly, Raku doesn’t have a “percentage” operator built in. Well, perhaps it’s not so surprising, given that it already uses the % symbol as a prefix sigil, an infix operator, an infix regex metaoperator, and a nameless static variable. Adding a permanent fifth meaning to that single character might not greatly improve overall code clarity.

Except that, here, it definitely does. Which is why Raku makes it trivially easy to add a postfix % operator that simply divides its numeric argument by 100:

    sub postfix:<%> (Numeric $x) { $x / 100 }

So now we can just write:

$pie = 100%


%share{$N} = $pie * $N%

without needing comments to explain that they're percentages.

And because we’re storing each computed share in a hash table, we can use the built-in .maxpairs method to retrieve a list of all the entries of %share with maximal values. Printing out that list, we find it contains only a single pair:

    (10 => 0.06281565095552947)

...thereby confirming that the tenth person in line is the only one who achieves optimal pie.

2. Stateful pie

That second impertaive version is 33% shorter and probably 33% more readable, but it still has some issues. For example, any time I see a variable like %shares that exists only to be populated in a for loop and then used exactly once afterwards, I get twitchy. And then I reach for a gather/take:

    say maxpairs gather for 1..100 -> $N {
        state $pie = 100%;
        $pie -=  my $share = $pie * $N%;
        take $N => $share;

In this version we’ve moved the $pie variable inside the loop, declaring it as a (persistent) state variable instead of a (transient) my variable. We still compute N% of the remaining pie on each iteration, but now we take that value, along with the current value of N, as a pair: take $N => $share

This means that, after the loop terminates, the gather will return a list of pairs whose keys are the positions in line, and whose values are the respective shares of the pie.

Once the loop terminates and the gather returns the list of pairs, we just need to filter out the maximal shares. Surprisingly, Raku doesn’t offer a suitable maxpairs built-in function (only the .maxpairs method on hashes), but fortunately Raku makes it easy to build a very powerful one that will handle everything we’re going to throw at it in the rest of this exploration:

  # Maximal values from a list of pairs...
  multi maxpairs (@list where @list.all ~~ Pair, :&by = {$_}) {
      my $max = max my @values = @list».value.map: &by;
      @list[ @values.pairs.map: {.key unless .value cmp $max} ];

  # Maximal values extracted from a list of containers...
  constant Container = Iterable | Associative;
  multi maxpairs (@list where @list.all ~~ Container, :&by!) {
      my $max = max my @values = @list.map: &by;
      @list[ @values.pairs.map: {.key unless .value cmp $max} ] :p;

  # Maximal values directly from a list of anything else...
  multi maxpairs (:&by = {$_}, *@list) {
      my $max = max my @values = @list.map: &by;
      @list[ @values.pairs.map: {.key unless .value cmp $max} ] :p;

There are three multiply dispatched versions of the function because we need to process various types of lists slightly differently. Nevertheless, in each case, we first extract the actual values we'll be maximizing by (my @values = @list.map: &by), then find the maximum of those values (my $max = max my @values), then then we find the indices of every list element with that maximal value (.key unless .value cmp $max), and finally return the elements at those indices, as a list of pairs (@list[...] :p).

With that function now “built in”, we can immediately identify and print the list of every maximal pair gathered in the loop: say maxpairs gather for 1..100 {...}

3. Sequential pie

Whenever I realise that I'm using a gather to generate a sequence of values in Raku, I immediately wonder whether I could do so more easily with the ... sequence generation operator. This operator is especially well adapted to building new values from previous values, which is exactly what we’re doing here: progressively reducing the pie as we progressively increase the percentages we’re handing out.

The ... operator has three components: an initial value, some code for building the next value from the previous, and a terminating condition. In this case, our initial condition is just that we have a full pie, and no-one has yet been given a share of it:

    hash( pie => 100%, N => 0, share => 0% )

To build a new value from the previous one, we simply increment N, then build a new hash in which N is given its updated value, a new share of N% is allocated, and the pie itself is diminished by that share (by reducing it to (100-N)%). That is:

sub next-share (\previous) {
given previous<N> + 1 -> \N {
return hash
N => N,
share => previous<pie> * N%
pie => previous<pie> * (100 - N)%,

Then we can build our list of shares:

constant shares
= hash(pie=>100%, N=>0, share=>0%), &next-share ... *;

Note that our terminating condition is * (“Whatever, dude!”), so this is actually a lazily evaluated infinite list. But we’re only interested in the first 101 values (i.e. from N=>0 to N=>100), so we extract those and then find the maximal pair(s), using each element’s share value to compare them:

    say maxpairs :by{.<share>}, shares[^100];

Notice how far we have now come from the original iterative solution. This approach more-or-less just describes what we want. We simply state the initial condition and describe how the values change. Then we ask for the specific information we require: Show us the maximum of the pairs, sorted by share, from the first 100 shares.

Both solutions require about the same amount of code, but this one is far less likely to include algorithmic mistakes, computational errors, or out-by-one problems. It doesn’t even require any variables to get the job done.

4. Functional pie

And, speaking of not requiring any variables, what about a pure functional solution?

Well that’s easy too. We just use the classic recursive list generation pattern:

To return a list of shares of the remaining pie...
With the next share (i.e. N% of what’s left)...
Return that share and its position in the list
Plus the list of shares of the remaining pie
(Provided there’s still any pie left to share)

Which, translated line-for-line into Raku, looks like this:

    sub shares (\pie = 100%, \N = 1) {
        given N% * pie -> \share {
            pair(N, share),
            |shares( pie - share, N+1)
                if 1% <= share <= 100%

Or, if you prefer your recursive and base cases better distinguished:

    # Recursive definition of shares list...
    multi shares (\pie = 100%, \N = 1) {
        given pie * N% -> \share {
            pair(N, share), |shares(pie - share, N+1)

    # Base case returns an empty list...
    multi shares (\pie, \N where N > 100) { Empty }

Either way, once we generate the complete list of shares, we just need to extract the maximal pair(s) from that list:

    say maxpairs shares();

5. Mathematical pie

A few days after I completed that flurry of code evolutions, I woke up one morning chagrined by the realization that I’d missed an “utterly obvious” solution.

I thought about the pattern of percentages allocated to each person in line; how they increased linearly:

    1%, 2%, 3%, 4%, ...

And then I though about how much pie was still available at any point, and how it decreased not linearly, but exponentially (because taking each share decreases the pie by N% of what's left, not by N% of the original pie):

    100%,  100%×99%,  100%×99%×98%,  100%×99%×98%×97%, ...

If I had those two complete lists (successive share percentages and progressive pie remains), then I could just multiple each of the corresponding elements together to find out how much each person in line would receive:

     1%,     2%,           3%,             4%, ...
      ×       ×             ×               ×
    100%,  100%×99%,  100%×99%×98%,  100%×99%×98%×97%, ...

Creating a list of the linearly increasing percentages is trivial in Raku:

    (1%, 2% ... 100%)

And creating a list of exponentially decreasing leftovers is hardly any more difficult:

    [\*] (100%, 99% ... 1%)

The [\*] operator produces a list of the partial products from successive multiplication of the elements in percentages list. In other words, it multiplies the first two elements, and then the first three, and then the first four, et cetera, until it’s produced a list of all of the partial products. So we get:

    100%, (100% * 99%), (100% * 99% * 98%) ...

Given those two lists, we can simply multiple each percentage by each corresponding leftover amount, using vector multiplication:

(1%, 2% ... 100%) »*« [\*] (100%, 99% ... 1%)

...giving us a list of who got how much pie. Then we can just filter that list for the maximal index/pie-share pair:

    say maxpairs (1%, 2%...100%)  »*«  [\*] (100%, 99%...1%);

And there’s our answer once again...now in a single line.

6. Visual pie

Of course, every one of these solutions has relied on us having that very handy maxpairs function, to search through the outcomes and locate the optional place in line.

If we didn’t have that function, or any other “find the maximum” function, we could still eyeball the answer, if only we could see what the data looks like.

Well, that’s easy in Raku too, thanks to a lovely pair of modules created by Moritz Lenz:

    use SVG;
    use SVG::Plot;

    # Work out the shares, as before...
    constant shares = (1%, 2% ... 100%) »*« [\*] (100%, 99% ... 1%);

    # Plot the data and write it to an SVG file...
    spurt 'chart.svg'.IO,
                title  => 'Percentage shares of the pie',
                values => [shares »*» 100,],
                x      => [0..100],

    # Take a look...
    shell 'open chart.svg';

...which generates and displays the following graph:

Graph of pie shares, peaking at (10, 6.28) with a long tail

...which, at a glance, tells us that person number 10 is still getting the best deal, with just over 6% of the pie.

So, no matter whether you prefer to slice your pies imperatively, statefully, declaratively, functionally, mathematically, or visually, Raku makes it easy write code that works in the way you are most comfortable thinking.

And that’s important...whether it’s something as simple as letting you seamlessly add a missing built-in like % or maxpairs, or something as convenient as generating complex sequences and then applying vector arithmetic between them, or something as profound as allowing you to code fluently in whichever programming paradigm helps you to reason about problems most effectively.

Because people shouldn’t have to adapt to the needs of programming languages;
programming languages should adapt to the needs of people.


Vigenère vs Vigenère

The second task of the 15th Weekly Challenge was to implement
little more complicated than it seems, because the cipher that's named
after Blaise de Vigenère wasn’t actually invented by him, and the cipher
that Vigenère actually invented isn’t named after him.

So should we implement the Vigenère Cipher...or Vigenère’s cipher?
Why not both!

The Vigenère Cipher was devis…

Infinite work is less work

first ten strong and weak primes. A prime pn is "strong" if it's larger
than the average of its two neighbouring primes (i.e. pn > (pn-1+pn+1)/2).
A prime is "weak" if it's smaller than the average of its two neighbours.

Of course, this challenge would be trivial if we happened to have a list
of all the prime numbers. Then we'd just filter ou…

As simple as possible...but no simpler

The first task of last week's Weekly Challenge

0, 0, 1, 0, 2, 0, 2, 2, 1, 6, 0, 5, 0, 2, 6, 5, 4, 0,...

The first challenge is to understand just what van Eck's sequence is,
as the various online explanations are less than instantly helpful.

Van Eck's sequence is a list of integers starting at zero, where the next number
in the sequence is given by the distance between the current number
and the nea…

About Damian Conway