As simple as possible...but no simpler

By Damian Conway on July 1, 2019 10:11 PM

The first task of last week's Weekly Challenge
was to generate van Eck's sequence:

0, 0, 1, 0, 2, 0, 2, 2, 1, 6, 0, 5, 0, 2, 6, 5, 4, 0,...

The first challenge is to understand just what van Eck's sequence is,
as the various online explanations are less than instantly helpful.

Van Eck's sequence is a list of integers starting at zero, where the next number
in the sequence is given by the distance between the current number
and the nearest preceding occurrence of that same number.

For example, if the current number at index N (let's call it: Aₙ) is 7,
then to compute the number at index N+1 we look back through the
sequence for the most recent preceding occurrence of 7
(at some earlier index M). Then the next number in the sequence
is simply the distance between those two occurrences of 7.
Namely: N - M

The only complication is that, if there is no prior occurrence of the current number
anywhere in the preceding sequence, then the next number is simply zero.
That's less arbitrary than you might suppose: we can think of that zero value
as the distance from Aₙ to itself. That is, if there is no preceding occurrence,
then the only possible candidate for a "preceding" occurrence is just Aₙ itself,
so the distance in that case is N - N, which is zero.

Okay, so that's van Eck's sequence. Now how do we generate it in Raku?

If you've been following my recent blog posts then it probably won't
come as much of a surprise to learn that Raku has an appropriate
built-in tool that will let us accomplish the entire task a single
statement. That tool is the sequence operator.

The ... sequence operator takes an initial list of values,
followed by a code object (i.e. a block or subroutine),
followed by a termination condition:

my @sequence = @initial-list, &code-obj ... $termination;

It builds a sequence (literally an object of class Seq), starting with the initial values,
then generates extra values after the initial list by calling the code object repeatedly
on the final elements of the initial list. Each time the code object is called, the value
it returns is appended to the sequence, until a value is returned that smartmatches
the termination condition, at which point the sequence is complete. For example:

# initial code-obj termination
my @odd-nums = 1, 3, 5, {$^n + 2} ... 999;
my @N-powers = $N, $N², {$^x * $N} ... * > 1_000_000;
my @fibonacci = 0, 1, 1, {$^a + $^b} ... ∞;

These examples differ from van Eck's sequence in one important respect.
Each of them uses only one of the preceding values (or in the case of
Fibonacci, two of them) in order to compute the next. But the next number
in van Eck's sequence depends on both the immediate preceding value (Aₙ)
and the entire sequence before that...which must be searched to locate
an appropriate previous occurrence of Aₙ.

Naturally, Raku supports this kind of sequence too: if your code object uses
the special array parameter @_ (or indeed, any other variadic array parameter),
then it will be passed the entire preceding sequence every time.

So now all we need to do is build a code object (in this case: a subroutine)
that, when passed the entire preceding sequence in @_, computes the next
van Eck's number. It might look like this:

sub next-vanEck's {
my $N = @_.end;
my $M = try ($N-1...0).first: {@_[$^m] == @_[$N]};

with $M { return $N - $M }
else { return 0 }
}

The final index in the current sequence is given by @_.end
(which is like $#_ in Perl), so that's our current N position.
To find the nearest preceding identical number, we then walk backwards
from index N-1 down to zero, looking for the first index ($^m)
where the value at that index (i.e. @_[$^m]) is equal to the value
at index N (i.e. @_[$N]). Then we return N-M if M was found,
or zero otherwise.

By the way:

That with $M {...} block is just a convenient shorthand
for: if $M.defined { my $_ := $M; ...}
The try prefix on the search for M is there to handle the edge-case
where $N is zero, in which case $N-1...0 would produce
a negative array index, which would raise an exception.
Yes, it's perfectly okay to use an apostrophe or hyphen in a subroutine name
(or in any other identifier) in Raku, provided it's surrounded
by normal identifier characters, so it can't be confused with
the start of a character string.

Once we have a code object that calculates the next van Eck's number,
then the entire van Eck's sequence is just:

constant vanEck's = 0, &next-vanEck's ... ∞;

It's fine to let the sequence run to infinity because ... generates
its values lazily: only computing as many elements as are actually needed
to meet a given request.

Once we have our constant sequence, we can ask for any particular index,
at which point next-vanEck's will be called as many times as necessary
to compute the requested value:

say vanEck's[100]; # prints: 23

That's certainly a clean, compact, uncluttered solution,
but it was not accomplished as advertised: "in a single statement".

However, the version above is only a starting point,
There are numerous improvements we could make to it
...by thinking and coding more Raku-ishly.

For a start, it's often a red flag when a piece of code iterates though
indexes and then looks up the corresponding array elements. Raku,
in particular, makes it just as easy to iterate those array elements directly
and then find the corresponding index if we need it (as we do here):

# Version 2...
sub next-vanEck's {
my $N = @_.end;
my $M = @_.head(*-1).first(@_.tail):end:k;

with $M { return $N - $M }
else { return 0 }
}

In this version, we find the index M by taking all the elements of the list
prior to the last (@_.head(*-1)), then searching for the first such element
that is equal to the final element (.first(@_.tail)).

But, because we need the closest preceding match, we need to search
backwards through the prior elements, so we add an :end adverb to tell
.first to search from the end. Then, when we find a match, we need the
position at which that match was found (not the value that was found there),
so we add a :k adverb to tell .first to return the matching "key"
(i.e. the index where the match occurred) instead of the matching value.

This new way of locating the M index requires marginally less code,
and is less prone to off-by-one errors, but the real benefit is that it's
also significantly faster. Because it's iterating array elements instead
of array indexes, it doesn't have to do two explicit array look-ups
in order to compare every prior element with the final element.

There's another minor code optimization possible here. As we saw back
at the beginning of this exploration, the special case of returning a zero
if there's no M index is equivalent to setting M to N in that case.
If we do that, then we can just return N-M, without ever needing
to test for the special case. Like so:

# Version 3...
sub next-vanEck's {
my $N = @_.end;
my $M = @_.head(*-1).first(@_.tail):end:k //$N;
return $N - $M;
}

If the .first search fails to find a preceding match, it will return
an undefined value instead, so the trailing defined-or will simply
use the N value instead, in which case N-M will be zero, as required.

And, of course, having already reduced the code so much, the only reason
to persevere with the $N and $M variables is for self-documentation.
Otherwise, we could simply replace $N with @_.end,
and $M with the new head-first-tail search expression, producing:

# Version 4...
sub next-vanEck's {
@_.end - (@_.head(*-1).first(@_.tail):end:k // @_.end)
}

At which point, we hardly need the named subroutine either.
Instead of:

constant vanEck's
= 0, &next-vanEck's ... ∞;

we could simply "inline" the subroutine body as a code block,
producing a genuine single-statement solution:

constant vanEck's
= 0, {@_.end-(@_.head(*-1).first(@_.tail):end:k//@_.end)}...∞;

Job done!

Or is it? Sure, if we now write:

say vanEck's[100]; # Should print: 23

...we get 23 printed out, as expected.

But what happens if we write:

say vanEck's[10_000]; # Should print: 14

Well, we do get 14 printed out...eventually.
About a minute later. :-(

If we then tried vanEck's[100_000], we'd have time
for a long, leisurely lunch break...at a restaurant an hour away.
And if we wanted vanEck's[1_000_000], we would
quite literally have to come back for it next week.

So...we managed to make our solution as simple as possible,
but at the cost of truly woeful, completely non-scalable
runtime performance.

And, when you stop to think about it, it's not surprising that it runs
so poorly. To generate the next van Eck's number, our code has to look
through (potentially) every one of the N-1 earlier numbers, comparing
each of them to the most recent number in the sequence. In other
words, computing a single van Eck's number takes O(N) operations,
so computing N of them must be O(N²). Hence, over a minute to find
a mere 10 thousand numbers of the sequence, several hours to find the
first 100 thousand, and nearly seven days to find the first million.

Obviously, before we can reap the countless benefits of an unlimited supply
of van Eck's numbers, we're going to need to find a way to eliminate
that repeated O(N) search inside the code object.

Happily, that's not too difficult. There's a kind of algorithmic Lorentz contraction
that we can often apply to these kinds of situations: we can literally shrink
the time dimension...by expanding the space dimension. More prosaically,
we can trade more memory for extra speed.

The trick in this case is to permanently track (in a static variable) the index M
of the most recent prior instance of each number in the sequence so far.
Then, when we encounter that number again, we can perform a single look-up
on our variable to find the most-recent prior index for that number.
All we have to do then is update our remembered M value each time
(since we've just seen the same number again at a later index: N).

A next-vanEck's subroutine based on that approach would look like this:

# Version 4...
sub next-vanEck's {
state @M;
my $N = @_.end;
my $M = @M[@_.tail] // $N;
@M[@_.tail] = $N;
return $N - $M;
}

The @M array tracks the most recent prior indexes of each previous value.
So, when we need that M value to compute the next number, now we simply
look up its location up directly: @M[@_.tail]

We then update the @M array with the new most-recent position of
the current value: @M[@_.tail] = $N

Running the sequence operation with this version of the subroutine,
we can compute vanEck's[10_000] in about 15 seconds, instead of 60.
A fourfold improvement in performance is always gratifying,
but we can do much better than that.

We're only passing the entire preceding sequence in (as @_) so we can
access the final value (@_.tail) and the current value of N (@.end)
at which it occurs. But we could, instead, simply pass in that final value directly.
And the value of N merely increments on each call, so we could track it
trivially in another state variable.

Which looks like this:

# Version 5...
sub next-vanEck's ($Aₙ) {
state ($N, @M);
$N++;
my $M = @M[$Aₙ] // $N;
@M[$Aₙ] = $N;
return $N - $M;
}

Because the subroutine now takes a single argument $Aₙ
(yes, Raku allows subscripted alphanumerics in identifiers),
each time the sequence operator generates a new value,
it now passes the code object only the final value
of the sequence so far.

We also track the current index, by incrementing $N
each time the subroutine is called.

As before, we determine the M value (which is now: @M[$Aₙ]),
then update it, then return the newly computed next value.

But, because the sequence operator no longer needs to pass the entire
sequence-so-far into the subroutine for every new number it generates,
the cost of each call to next-vanEck's drops significantly.
With this version of the subroutine we can compute vanEck's[10_000]
approximately 100 times faster than previously, in about 0.15 seconds.
Or find vanEck's[100_000] in about 1.5 seconds.
Or vanEck's[1_000_000] in about 15 seconds.
That's not just vastly faster than the previous approach;
it's now a completely different order of complexity:
clearly O(N) linear time, rather than O(N²) quadratic.

This massive improvement in performance was not, however, free.
The @M array is consistently about 97% as long as the sequence itself,
though slightly sparse, so it may only require about 85% as much space.
Nevertheless, this means that the memory footprint of this solution has
effectively doubled compared to version 1. In most situations, though, that's
a perfectly acceptable trade-off: memory is relatively cheap; time is priceless.

But could we also code this version in a single statement?
Of course, we can.

Once again, we simply inline the subroutine's code block:

constant vanEck's = 0, { state ($N, @M);
$N++;
my $M = @M[$^Aₙ] // $N;
@M[$^Aₙ] = $N;
$N - $M;
} ... ∞;

Now, of course, if I'd started with that, just sprung it on you with
no warning and no explanation, then you'd be quite justified in
complaining that it's scary and incomprehensible and (gasp!) different.
New concepts and new approaches and new tools and new syntax
are always challenging. Especially all four at once.

But as you come to understand the power and expressiveness of Raku,
then perhaps a solution like that is not so scary, and you start to see
how it's possible to write code that looks very much like the original
description of the problem, in something very like the original notation
of the problem, and hence which is actually more comprehensible
to anyone who is already familiar with the task domain.

And, because there's always more than one way to do it in Raku,
it's still perfectly fine to write that final version as:

constant vanEck's = 0, &next-vanEck's ... ∞;

...hiding away the algorithm and the optimizations inside that subroutine.
And later changing them within the subroutine when you find a cleverer
approach, a better algorithm, or an even more efficient Lorentzian trade-off.

Both Perl and Raku cop a lot of flak for their unapologetically
"More Than One Way To Do It" philosophies. But, to me,
the flexible and transparadigm nature of the two languages
simply means that I'm never stuck with the inadequate concepts,
or the inefficient algorithms, or the inconvenient software tools,
or the intrinsic performance limitations that someone else
hard-wired into their "One Right Way To Do It".

And, for me, that makes the choice of Raku...as simple as possible.

5 comments

5 Comments

Mohammad S Anwar | July 2, 2019 1:16 PM | Reply

Great analysis and big push for the language itself. You have been inspirational as ever.

$Bob | July 2, 2019 6:03 PM | Reply

Damian's communication style (in person, in presentation, in writing, or in any other form that I've interacted with him) is always so clear. This is yet another example of why I love the Perl community: we're trying to do better than each other, in a way that brings the whole community up with us.

Keep up the good work!

Cuong Manh Le | July 4, 2019 4:11 AM | Reply

You version 4 must be:

```
constant vanEck's = 0, {@_.end - (@_.head(*-1).first(@_.tail):end:k//(@_.end))}...∞;
```

Otherwise, we will get undefined error on `$N`.

Vinny | July 4, 2019 4:13 PM | Reply

I learned a lot reading this--thanks! Your explanations are so clear, and the improvements in speed you got from thinking and algorithm improvement makes the idea that perl6, which might be 30% slower than xyz language, moot. I'm trying to switch to perl6, and hope you keep up this quality of blogging!

Damian Conway replied to comment from Cuong Manh Le | July 5, 2019 9:03 AM | Reply

Yes indeed!
Thanks for spotting that.
I've corrected it in the text.
Much appreciated!

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Damian Conway

More info »

Damian Conway