PDL features I'd like to see in Perl 6

I recently asked around #perl6 as to a mailing list where I might discuss PDL features that I'd like to see in Perl 6. Synopsis 9 is supposed to discuss these features, but the PDLish ideas feels like a straight port of PDL, rather than a rethink of what's important. I wanted to discuss things a bit.

The answer was, "Write a blog post." This post contains what I consider to be the essential ingredients of PDL that I think are easily achievable for Perl 6 v1.0. I want Perl 6 to provide an expressive language for writing operations on high dimensional data. I believe that we (and I do include myself) can get this done by Christmas if others can help me out.

PDL's most important feature is that the low-level looping is implemented in C. I do not expect Perl 6 to produce C code to implement this functionality, and so I should say up front that I do not expect a Perl 6 implementation of these features to be fast. But I don't care about fast for v1.0. I'll be happy if they're simply present. We can work on fast later.

Machine data representation of multidimensional arrays

Perl 6 has support for machine representation of multidimensional arrays, so we have the basic data container already.

autolooping

This is, hands down, the killer feature of PDL.

In PDL parlance, this called "threading", specifically "implicit threading". It has nothing to do with parallel threads, which is why I call it autolooping.

PDL's support for autolooping is why I keep coming back to Perl and PDL in my moments of weakness. No other language supports autolooping as well as PDL does. For a more complete explanation of this concept, see PDL::Threading and the Threading section of PDL::Indexing.

The idea is that you define operations expecting a certain number of dimensions. If the data that you end up analyzing has more dimensions, the higher dimensions are automatically looped over. If I write code that expects a matrix, autolooping would allow it to handle a "stack" of matrices.

The idea goes further, though. If one or some of your input has fewer dimensions, but the extent of that dimension can be ascertained from another array that is supposed to share the dimension's extent, then the higher dimensions for the short data are "filled in" by simply repeating the data! If I expect a matrix but I am given a scalar, it is treated like a matrix of the correct dimensions, all equal to the given scalar. For example, this is how PDL implements adding a vector to each row of a matrix.

Believe it or not, a basic implementation of this is not as difficult as it might at first seem. You need to make sure that the dimensions of all arrays are commensurate, and then you need to tabulate the strides for each dimension of each array. The bookkeeping gets a little tedious, but it's doable and I will be happy to dig into details when it comes time to implement something. I discuss notation and examples at greater length below.

Slices that operate as aliases

In Perl 6, you can create an alias of a variable. Changes you make to the alias modify the original. In PDL, saying my $copy = $pdl_data; automatically creates an alias, since that's how references work in Perl 5.

What's more, with PDL it's possible to take a slice data and assign that slice to a variable. This expression, my $slice = $pdl_data->slice([0, 20, 2]), creates an alias to every other element of $pdl_data from the first to the 21st element. Modifications to the contents of $slice modify the data in $pdl_data. (This is called "data flow" in PDL parlance.)

It would be really nice for Perl 6 to have array aliases as a language feature—does it?—in which case multi-dimensional aliases should come along for the ride without much fuss.

Arbitrary slices (that also operate as aliases)

This is the second most important feature of PDL.

We all know that in Perl 5 you can slice an arbitrary list of elements from an array by saying @slice = @data[2, 3, 5, 7, 11]. If you happen to assign to such a slice, you can overwrite the values.

PDL takes this one step further, allowing you to alias arbitrary slices. In PDL you might use the index method to pull out a similar slice as the one just demonstrated: $slice = $data->index([2, 3, 5, 7, 11]). The fact that this is an alias makes it much more expressive than the simple array slicing provided by Perl 5. I can pass this slice along to other methods, which can modify the contents and have the effects reflected in my original data. PDL also has the where method, which makes it particularly easy to identify data with interesting properties. For example, if I want to set all negative numbers to zero, I simply say $data->where($data < 0) .= 0. In Perl 6, not only can we provide this method, but we can optimize it in ways that PDL cannot.

I know exactly how to implement such aliases in Perl 5 using tied arrays; surely a similar approach can be used for arbitrary slices.

Dimension checking and alignment via function signatures

In Perl 6 you can declare the type of a variable in the function signature. While you may not consider array dimensions to be part of the type, some sort of notation for this kind of validation will be necessary to avoid unwieldy boilerplate. Whether it is implemented in the function signature or via separate syntax, the logic for this sort of checking can be implemented by borrowing parts of the autolooping machinery. For detailed reading, see the links already given under autolooping.

Basically, I want more than just to say, "The two arguments to this function should be 2d arrays." I want to be able to say, sub foo ($first($nx, $ny), $second($nx, $ny)) to declare a function in which $first and $second are autoloop compatible. In fact, I might have a function in which I expect $second to have three dimensions, like this: sub foo ($first($nx, $ny), $second($nx, $nz, $ny)). The notation here should be identical to the autoloop notation discussed below.

Easy dimension rearrangement

You will notice that I suggest autolooping and dimension alignment in function signatures. I have implied that the dimension size enforcement is based on the order of the dimensions, akin to implicit threading in PDL. If we go that route, then we need an easy way to rearrange dimensions. Suppose you have PDL $data that is 10x5. If you want to access the very last element, you'd look at position (9, 4). With PDL, it's easy to transpose the data and to access it the other way around, i.e. by saying (4, 9).

More on autolooping

As a basic example, suppose I want to loop over the pixels of an RGB image (dimensions 3 by Nx by Ny) and produce a gray scale image (dimensions Nx by Ny). I could easily implement this using for loops. But, what happens if I feed this a movie, i.e. a stack of such images? In that case I would have an array with dimensions (3, Nx, Ny, Nframes) and I would expect an output "movie", an array with dimensions (Nx, Ny, Nframes). The traditional approach is to manually wrap the original loop in yet another loop to handle the higher dimension. The problem is that such an approach is hard to handle in the general case and requires much too much boilerplate. The best solution would be a loop syntax that automatically loops over all frames and applies the code to the data in each frame. I would like Perl 6 to have the keyword autoloop, demonstrated in the two examples below. (See this stack overflow discussion for details about the formula.)

my $gam = 2.2;
autoloop (@input(3, $Nx, $Ny), my @output($Nx, $Ny)) {
    for my $i (0 .. $Nx-1) {
        for my $j (0 .. $Ny-1) {
            my ($r, $g, $b) = @input[:, $i, $j];
            my $y = 0.2126 * $r**$gam + 0.7152 * $g**$gam
                + 0.0722 * $b**$gam;
            @output[$i, $j] = 116 * $y**(1/3) - 16;
        }
    }
}

In my conception, autoloop not only delimits a block and provides autolooping, but it also serves to declare the dimension variables, scoped within the block, and the output variable my @output($Nx, $Ny) which is scoped to live outside the block.

Notice that the output at $i and $j only depends upon the input at $i and $j. This means I can be even more succinct:

my $gam = 2.2;
autoloop (@input(3), my @output()) {
    my ($r, $g, $b) = @input;
    my $y = 0.2126 * $r**$gam + 0.7152 * $g**$gam
        + 0.0722 * $b**$gam;
    @output[] = 116 * $y**(1/3) - 16;
}

With the magic of autolooping, if @input is three dimensional, @output will be two dimensional.

I could wrap this in a function, and specify my high-dimensional array, using a function signature like this:

sub rgb_to_grayscale (@input(3, *)) {
    my $gam = 2.2;
    autoloop (@input(3), my @output()) {
        my ($r, $g, $b) = @input;
        my $y = 0.2126 * $r**$gam + 0.7152 * $g**$gam
            + 0.0722 * $b**$gam;
        @output[] = 116 * $y**(1/3) - 16;
    }
    return @output;
}

How we get to something a bit trickier: an implementation of convolution with reflective boundary conditions. This is easier to implement than other boundary conditions due to the way that negative subscripts are handled. Notice that the function declaration indicates that I want two 2d arrays, but I do not require that they have the same dimensions.

sub convolve2d ($input($Nx, $Ny), $weights($px, $py)) {
    # Not strictly necessary, but it's highly doubtful that they
    # want a convolution with reflection boundary conditions in
    # which the weights is wider/taller than the input itself.
    croak("Weights patch ($px x $py) is larger than the image ($Nx x $Nx)")
        if $px > $Nx or $py > $Ny;

    autoloop ($input($Nx, $Ny), $weights($px, $py), my $output($Nx, $Ny)) {
        # In here, I have variables $Nx, $Ny, $px, and $py, which are
        # the sizes of their respective dimensions. These are new
        # lexical variables, independent of the variables by the same
        # name in the outer scope.

        # Loop over all points in the input image
        for my $i (0 .. $Nx - 1) {
            for my $j (0 .. $Ny - 1) {

                # Start with empty output
                @output[$i, $j] = 0;

                # Loop over the patch
                for my $k (0 .. $px - 1) {
                    my $di = $k - int($px / 2);
                    for my $l (0 .. $py - 1) {
                        my $dj = $l - int($py / 2);
                        # Negative indexing automatically handles
                        # reflection boundary for me.
                        @output[$i, $j] += $input[$i + $di, $j + $dj]
                             * @weights[$k, $l];
                    }
                }

            } # loop over j/Ny
        } # loop over i/Nx
    }

    return @output;
}

6 Comments

In my perfect world, you don't need autoloop at all.

The compiler should be able to infer the dimmensions mismatch and do the autoloop itself, and on the way, use multiple CPUs when available and vectorization extensions efficiently.

If you happen to assign to such a slice, you can overwrite the values.
That's not exactly true:

@data  = ('a' .. 'l');
@slice = @data[2, 3, 5, 7, 11];
print "@slice\n";
$_++ for @slice;
print "@slice\n@data <-- no change!\n";

smls here. (Using stackexchange OpenID since logging in normally is borked right now for me.)

Type checks against array dimensions/sizes

This seems to be firmly planned, but not yet implemented:

➜ perl6 -e 'sub foo (@a($x, $y)) { ... };'
===SORRY!=== Error while compiling -e
Shape declaration with () is reserved
...

Aliasing slices

The following already works in Perl 6 right now (with Rakudo on MoarVM):


my int @a = 2, 4, 6, 8; # create a compact native array
my @b := @a[1, 3];      # take a slice, and bind it to a new name
@b = 40, 80;
@a[3]++;
@b[1]++;
say @a;                 #-> 2 40 6 82

However, this is probably not performant and flexible enough to satisfy PDL use-cases.

For a start, a slice of a compact native array is not itself a compact native array - it is a normal List object[1] that happens to contain individual native references as elements.

That is, the aliasing does not happen at the slice level, but at the element level. This means that even when taking a slice using a range specifier, it has to be unrolled - e.g. 1..10000 is a memory-efficient Range object in Perl 6, but @a[1..10000] returns a List of 10000 elements.

Also, the native references can't be introspected to find out which source array (or which index in the source array) they're from - all that is guaranteed is that reading from / writing to them will read/write at the correct piece of memory.

S09#Subscript_and_slice_notation recognizes that a hypothetical PDL-in-Perl6 may want to implement a different behavior for its slices.

PDL in Perl 6 according to S09

All in all, the design/behavior of Perl 6's built-in "compact native array" type seems to have been optimized for simple array-munging and C-library-interoperability use-cases, and not to serve as the main type for an advanced PDL-like API.

In fact, I don't interpret S09 as suggesting that there will ever be a PDL implementation in Perl 6 core, at all.

When it mentions PDL, it does so only in order to demonstrate that the array-related syntactic elements in Perl 6 are generic enough that a hypothetical PDL implementation *could* make use of them, i.e. that such a PDL type (possibly provided by a CPAN module) could implement the "same interface" as Perl 6's built-in "compact native array" type.

These syntactic elements include:

  • @-sigiled variables
  • element type declarations for such variables
  • shape declarations for such variables (i.e. dimensions and sizes)
  • type checks against the aforementioned declarations in function signatures etc.
  • the postcircumfix [ ] operator for positional indexing and slicing

A prospective PDL type could benefit from all of them, but it would overload them with different semantics than the built-in arrays where appropriate.

For example, the PDL type might overload postcircumfix [ ] such that @piddle1[5000..10000] returns a new PDL object which efficiently aliases the requested slice of the existing PDL object @piddle1.

How this PDL type would be implemented, is IMO orthogonal to those interface concerns. It could be a pure Perl 6 class munging "compact native array" instances under the hood, or it could be a thin binding over a bundled C library.

Autolooping

I suppose the PDL type would also take care of autolooping operations on it, though I don't understand things well enough to comment on what that would entail.

Your proposed 'autoloop' keyword does feel strange to me. Are you sure there wouldn't be a way to integrate this with normal for loops? (Which, in Perl 6, can iterate over lazy lists and iterators.)

But I'll leave that discussion to people with a better understanding of both PDL and Perl 6... :)

----

[1] Well, technically it is a Parcel object in current Rakudo, but it will hopefully become List after the GLR ("Great List Refactor") is finished.

I think the new refaliasing feature (experimental in 5.22) will allow you to alias slices like so:

\( @slice[ 0 .. 4 ] ) = \( @data[ 2, 3, 5, 7, 11 ] );

Obviously it’s ugly to have to count out the LHS slice by hand:

\( @slice[ 0 .. $#$_ ] ) = \( @data[ @$_ ] )
    for [2, 3, 5, 7, 11];

That’s quite a mouthful, but at least you can package it up in a function:

sub alias_slice {
    my ( $src, @dst ) = shift;
    \( @dst[ 0 .. $#_ ] ) = \( @$src[ @_ ] );
    \@dst;
}

\my @slice = alias_slice \@data, 2, 3, 5, 7, 11;

Not exactly pretty… but also not too terribly awful either.

Call me crazy, but I had a freelance gig converting Python numpy/scipy to Perl PDL... And barely knowing Python it was a lot easier to understand than PDL.

I can see why numpy/scipy has gained a lot of traction.

I'd like to see P6 PDL to be like numpy/scipy to ease crossover to Perl 6 and lower the entry level for beginners. In my opinion usability and clear documentation for the non-math scholar has hurt what PDL could have been and could be a lesson for PDL Perl 6.

Leave a comment

About David Mertens

user-pic This is my blog about numerical computing with Perl.