Perl's first real CUDA bindings released

Since my first blog post back in December, I've written and made thorough use of a simple Perl interface for CUDA. Today I've posted it on GitHub, and in this post I'll give a relatively simple example of how to use CUDA with Perl via Inline::C. (In case you're wondering, CUDA is a technology from NVIDIA that lets you compile and execute highly parallel code on your CUDA-capable video card.)

First, of course, you'll need to install ExtUtils::nvcc. At the moment this only works on Linux (it may work on Mac OS X, but definitely not yet on Windows), and it has only been confirmed on Ubuntu. See the directions on the ExtUtils::nvcc wiki. (If you manage to install it on other systems, please let me know and edit the wiki or send me your notes!) Once you have that installed, installing CUDA::Minimal is a simple CPAN install.

First Script

So, at this point I will assume you've installed CUDA::Minimal. What can you do with it? Here's a simple example:

use strict;
use warnings;
use CUDA::Minimal;
use ExtUtils::nvcc;
use Inline C => DATA => ExtUtils::nvcc::Inline;

# Some CUDA kernels and their Perl wrappers are defined below. Let's
# create some data and invoke them!

my $N_values = 10;
my $host_data = pack('f*', 1..$N_values);

# Copy the data to the video card and get the pointer in the video card's
# memory. MallocFrom allocates enough memory and copies the contents:
my $input_dev_ptr = MallocFrom($host_data);

# Before processing the data on the video card, I need to allocate some
# memory on the card where the results will be stored. Malloc allocates
# enough memory but does not copy contents:
my $output_dev_ptr = Malloc($host_data);

# Run the kernel:
invoke_the_kernel($input_dev_ptr, $output_dev_ptr, $N_values);

# To see the results, allocate a new host array:
SetSize(my $results_array, length($host_data));
# and copy the results back:
Transfer($output_dev_ptr => $results_array);
print "$_\n" foreach (unpack 'f*', $results_array);

# Finally, free the device memory:
Free($input_dev_ptr, $output_dev_ptr);

__END__

__C__

// A simple kernel that triples the value of the input data and stores
// the result in the output array:
__global__ void triple(float * in_g, float * out_g) {
    out_g[threadIdx.x] = in_g[threadIdx.x] * 3;
}

// A little wrapper for the kernel that Inline::C knows how to parse:
void invoke_the_kernel(SV * in_SV, SV * out_SV, int N_values) {
    // Unpack the device pointers:
    float * d_in = INT2PTR(float *, SvIV(in_SV));
    float * d_out = INT2PTR(float *, SvIV(out_SV));

    // invoke the kernel:
    triple <<<1, N_values>>>(d_in, d_out);
}


That's not exactly hello world. Let's pull it apart.

Boilerplate

Starting from the top, we see some fairly standard boilerplate enabling strictures and warnings. Since this is an example for CUDA::Minimal, we'll need that too.

The last two lines of use statements should look a bit intriguing to you:

use ExtUtils::nvcc;
use Inline C => DATA => ExtUtils::nvcc::Inline;


ExtUtils::nvcc is part of the Perl CUDA toolchain, and it provides simple functions for configuring the three main build tools: ExtUtils::MakeMaker, Module::Build, and, as shown here, Inline::C. (This sets the cc and ld flags, as explained in the docs.)

Allocating Memory

After creating a Perl string filled with a packed array of floating-point data, we come to the first lines of CUDA::Minimal:

my $input_dev_ptr = MallocFrom($host_data);


MallocFrom was imported from CUDA::Minimal by default. (Yeah, it imports functions by default. It's supposed to be easy to use. :-) MallocFrom is one of those handy functions that packs a lot of functionality compared with its CUDA C counterparts. It (1) determines the size of your host-side memory, (2) allocates the same amount of memory on the device, (3) copies the host-side data to the device, and (4) returns the pointer to the memory location on the device. All that with one function call!
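For comparison, here is a rough sketch of what steps (2) through (4) look like in plain CUDA C. (This is illustrative only, with error checking omitted; the function name malloc_from is mine, not part of any API. Step (1), finding the size, happens on the Perl side via the length of the packed string.)

```c
// Rough CUDA C equivalent of MallocFrom's work (error checking omitted):
float * malloc_from(const float * h_data, size_t n_bytes) {
    float * d_ptr;
    cudaMalloc((void **)&d_ptr, n_bytes);                        // (2) allocate on the device
    cudaMemcpy(d_ptr, h_data, n_bytes, cudaMemcpyHostToDevice);  // (3) copy host data over
    return d_ptr;                                                // (4) return the device pointer
}
```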

The next step allocates even more memory on the device. This is the memory on the video card where the results will go:

my $output_dev_ptr = Malloc($host_data);


Malloc is very similar to MallocFrom, except that it does not copy the contents of $host_data over to the device. It simply allocates the same amount of memory and returns the device pointer.

Location and Terminology

This is a good point to introduce some terminology. CUDA provides a way to run almost arbitrary code in parallel on your video card. Video cards do not have direct access to your CPU's RAM, and your CPU does not have direct access to your video card's RAM. (NVIDIA's CUDA Toolkit 4.0 makes this a small lie, but stick with me.) It is therefore conventional to refer to the video card as the device, and to the CPU and its associated RAM as the host. Device pointers are often prefixed with d_ and host pointers with h_ to help with bookkeeping.

Just as we distinguish host memory from device memory, we use an entirely different name for functions that run on the video card: they are called kernels. We call functions on the host CPU, but we launch kernels on the device.

Launching the Kernel

CUDA::Minimal does not provide a means for launching kernels directly. (Perl bindings for the so-called CUDA Driver API, which allows you to do this and many other things outside the scope of CUDA::Minimal, are my next project.) However, Perl provides a means for calling C functions using either Inline::C or plain ol' XS. If that code is compiled with nvcc (using ExtUtils::nvcc to simplify the configuration), you can invoke a kernel using CUDA C's kernel-invocation syntax. I'll discuss that in a little bit. The point is that from the standpoint of Perl we are simply calling a function which happens to be defined using XS code instead of Perl code:

invoke_the_kernel($input_dev_ptr, $output_dev_ptr, $N_values);


The kernel launch is a bit of magic of which Perl is blissfully unaware.

Getting the Results and Cleaning Up

Having run the kernel, I next bring back the results. I do this by first allocating some new memory on the CPU. Perl provides a handful of methods for setting the length of scalar variables, but I can never remember them so I created a SetSize function to handle it for me. I copy the results from the video card back to this host memory and print the results:

SetSize(my $results_array, length($host_data));
Transfer($output_dev_ptr => $results_array);
print "$_\n" foreach (unpack 'f*', $results_array);


You may have noticed that I use Transfer to copy data both to and from the device. The use of the fat comma (=>) is highly recommended as it gives a very clear indication of the flow of data. Under the hood, Transfer examines the details of the scalars that you pass and determines if either or both arguments are device pointers, and takes the appropriate action. (If both are device pointers, however, you must specify the number of bytes to copy in a third, optional argument.) Finally, we free the device-side memory:
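To make the directions concrete, here is a sketch of all three forms, reusing the variables from the script above (the byte count in the last form is illustrative):

```perl
# Host-to-device: the device pointer is on the right-hand side:
Transfer($host_data => $input_dev_ptr);

# Device-to-host: the device pointer is on the left-hand side:
Transfer($output_dev_ptr => $results_array);

# Device-to-device: both arguments are device pointers, so the number
# of bytes to copy must be given as the third argument:
Transfer($input_dev_ptr => $output_dev_ptr, length($host_data));
```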

Free($input_dev_ptr, $output_dev_ptr);

Your program will execute fine without freeing the memory, and the memory will (as best I can tell) be reclaimed by the video card when your program exits. However, in a long-running script, failure to free device memory may lead to allocation issues, so it is good practice to free device-side memory when you're done with it.

The Kernel Definition

CUDA kernels are defined like normal functions in C with one special addition: the qualifier __global__ before the return type:

__global__ void triple(float * in_g, float * out_g) {
    out_g[threadIdx.x] = in_g[threadIdx.x] * 3;
}


Furthermore, all kernels have access to the built-in variables threadIdx, blockIdx, blockDim, and gridDim, though I will not explain those now. Notably, Inline::C does not recognize __global__ declarations as normal C function definitions, which is very important for using Inline::C with CUDA: it leaves the kernels alone instead of trying to wrap them for Perl. (It is likely a bug, but it sure is useful.)
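As an aside on those built-in variables: a common pattern combines them to give each thread a global index, so that a kernel can span multiple blocks. A sketch (the name triple_big is mine, not part of the script above):

```c
// Multi-block variant: each thread computes its global index from its
// block number, the block size, and its position within the block.
__global__ void triple_big(float * in_g, float * out_g, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)                  // the last block may have excess threads
        out_g[i] = in_g[i] * 3;
}
```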

The Kernel Wrapper

As I said earlier, Perl does not (yet) have wrappers for the API that would allow direct kernel invocation. However, Inline::C knows how to parse a standard C function definition and expose that function to Perl for me. The role of the C function, then, is to unpack the arguments and invoke the kernel:

void invoke_the_kernel(SV * in_SV, SV * out_SV, int N_values) {
    // Unpack the device pointers:
    float * d_in = INT2PTR(float *, SvIV(in_SV));
    float * d_out = INT2PTR(float *, SvIV(out_SV));

    // invoke the kernel:
    triple <<<1, N_values>>>(d_in, d_out);
}


The first two lines of this function convert Perl's internal representation of the input scalars into native C float pointers. These pointers refer to the locations on the video card where the data resides; they are passed along to the kernel so it knows where to retrieve its input and where to store its results.
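Going the other direction (that is, how a device pointer ends up inside a Perl scalar in the first place) is presumably the mirror image, using perl's PTR2IV macro. A hypothetical sketch, not CUDA::Minimal's actual code:

```c
// Hypothetical: wrap a freshly allocated device pointer in a Perl
// scalar by storing the pointer value as an integer.
SV * wrap_device_pointer(float * d_ptr) {
    return newSViv(PTR2IV(d_ptr));
}
```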

The last and most interesting part of this whole chunk of code is the means by which we invoke the kernel: func-name <<<N-blocks, block-size>>>(args). If you strip away the part between the triple angle brackets, we have something that looks like a normal function call. The triple angle brackets are an extension to ANSI C provided by nvcc, and they are one reason that CUDA code must be compiled with nvcc: when nvcc finds them, it inserts code that sets up the video card to run the kernel with the given grid and block dimensions.
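To see why both launch parameters matter, here is a sketch of a wrapper that launches across several blocks, rounding the block count up so that every element gets a thread. (This assumes a hypothetical kernel, triple_big, that computes a global index as blockIdx.x * blockDim.x + threadIdx.x; the simple triple kernel above uses only threadIdx.x and so is limited to a single block.)

```c
// Hypothetical multi-block launch: ceiling division gives the number
// of blocks needed to cover all N_values elements.
void invoke_big_kernel(SV * in_SV, SV * out_SV, int N_values) {
    float * d_in  = INT2PTR(float *, SvIV(in_SV));
    float * d_out = INT2PTR(float *, SvIV(out_SV));
    int threads_per_block = 256;
    int n_blocks = (N_values + threads_per_block - 1) / threads_per_block;
    triple_big <<<n_blocks, threads_per_block>>>(d_in, d_out, N_values);
}
```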

Summary

Today I've given a simple Perl script that manages CUDA memory, compiles a CUDA kernel, and invokes the CUDA kernel with a very thin wrapper written in C. Tomorrow I'll discuss CUDA interoperability with PDL.

1 Comment

YAY! This is awesome thanks!!!

About David Mertens

This is my blog about numerical computing with Perl.