CUDA::Minimal takes two steps backwards, one step forward
Edit: I got my original approach to work, see my follow-up.
A week ago I wrote about how I thought play-perl was great. I put up a bunch of ideas and waited to see what others would encourage me to do. Two of my ideas got two votes each (some others got single votes), one of which involved revising CUDA::Minimal so that it works again. (CUDA is a means for executing massively parallelized code on your video card.) Well, it compiles now, but it doesn't quite work the way I had hoped, and I am facing some basic architectural redesign issues. Herein I describe the old way things worked, and why that won't work anymore. Can anybody offer some ideas for how I might move forward?
First, let me explain the original design. Originally, I released ExtUtils::nvcc. The goal of this module was to operate as a drop-in replacement for your C compiler in ExtUtils::MakeMaker, Module::Build, and Inline::C, one that actually invoked nVidia's nvcc. nvcc lets normal C (or C++) code pass through untouched, but it also handles a few minor language extensions, making it easy and relatively painless to mix device (video card) and host (regular CPU) code. Using this compiler wrapper, I could write XS code with CUDA-C, and it Just Worked.
That was three years ago (almost exactly), and I hadn't touched the code since July of 2010. Last September, I tried recompiling CUDA::Minimal, and it didn't work anymore. I don't fully understand what is going on, but it seems as though nVidia introduced a C macro into their libraries (which get pulled in when you compile with nvcc) that trips up some of Perl's internals. (I tried compiling with different versions of Perl---including the one I originally used in my development---and they all failed, so I presume it was a change on nVidia's end, not Perl's.)
I need to fix this because I wrote some serious numerical code a couple of years ago that I want to use again, and which is locked into this old system. I am also motivated to score my first points on play-perl.org. :-)
My current solution gets CUDA::Minimal back up and running. You can once again use it to transfer data to and from your video card, because the underlying interface for those operations was a set of plain C functions to begin with. However, you can't just write your CUDA code in your XS files anymore, and you can't write simple XS wrappers around kernel launches.
To get an idea of what nvcc lets you do, see the code under this post. The function on line 10, which starts with __global__, is meant to be compiled for and run on your video card. You need to run that through nvcc. The other important piece is on line 30, where you see the triple-angle-bracket notation. That code is replaced by nvcc with a set of declarations and function invocations that builds the call stack and invokes the kernel on the video card. Once upon a time, you could include these triple-angle-bracket invocations directly in your XS or Inline::C code and it worked. Now it doesn't.
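For readers who have never seen CUDA-C, here is a minimal sketch of the two nvcc extensions I just described. The kernel and function names are made up for illustration; this is not the code from the listing under the post.

```cuda
// A device function, marked __global__ so nvcc compiles it for the GPU.
// Each thread computes its own index and scales one element.
__global__ void scale(float *data, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

// Host-side launch. The <<<blocks, threads>>> execution configuration is
// the nvcc extension: nvcc rewrites it into runtime-API calls that set up
// the argument stack and invoke the kernel on the video card.
void launch_scale(float *dev_data, float factor, int n_blocks, int n_threads) {
    scale<<<n_blocks, n_threads>>>(dev_data, factor);
}
```

Plain C compilers choke on both `__global__` and the triple-angle-bracket syntax, which is why this code has to pass through nvcc.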
Architecturally, you can get around this by creating a separate source file with C functions that perform whatever kernel invocation you want, and compiling that source file with nvcc. You would then write an XS file that uses a common C header file, and link the nvcc-compiled code at the last second. However, that's a major change in how one would call kernels from XS code. It's an extra pair of files, and managing the proper linker arguments is a pain.
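Concretely, the workaround looks something like the following sketch (file and function names are hypothetical). The wrapper function is plain C as far as the XS side can tell, so the XS file never sees any CUDA syntax:

```cuda
/* ---- kernels.h (hypothetical), shared with the XS file ------------- */
/* The extern "C" guard keeps nvcc, which compiles as C++, from
 * mangling the wrapper's name, so the XS code can link against it. */
#ifdef __cplusplus
extern "C" {
#endif
void run_my_kernel(float *dev_data, int n_blocks, int n_threads);
#ifdef __cplusplus
}
#endif

/* ---- kernels.cu (hypothetical), compiled separately with nvcc ------ */
#include "kernels.h"

__global__ void my_kernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

/* Plain C entry point: this is what the XS code calls. The
 * triple-angle-bracket launch stays hidden in the nvcc-compiled file. */
extern "C" void run_my_kernel(float *dev_data, int n_blocks, int n_threads) {
    my_kernel<<<n_blocks, n_threads>>>(dev_data);
}
```

The XS file includes kernels.h, calls run_my_kernel, and the two object files get linked together at the end, which is exactly the extra pair of files and linker bookkeeping I'd rather not impose on users.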
I could try to dig around in the Perl source to figure out which C macro is getting tripped, but that would be a lot of work. (The error reportedly comes from somewhere in the regex engine.) I could try to write wrapper modules that would minimize the extra effort for consuming modules. I could try to introduce a new module that takes a string containing your source code and returns a ready-to-call sub. All of these are hard work, and none of them is as easy or as elegant as the original approach. Does anybody have a better idea?