PDL in Rust -- Part Two




Two weeks ago we posted "PDL in Rust - A Native Reimplementation of the Perl Data Language". At the time the score was 45 tests, all green. That was enough to say "it compiles, it runs, here is the arithmetic surface." It was not enough to say "you can use this."

We are now on the second number. The current PDL implementation in pperl covers roughly 3,000 assertions end-to-end: about 1,400 on the Perl-facing connector side and about 1,600 on the engine side. As of this writing roughly 98% of the connector assertions match upstream PDL 2.103 exactly, and for most of the remaining couple of dozen we already know why they fail. By the time you read this the numbers will have drifted a little in our favour - give or take - but the shape is the point, not the decimal.

Everything described below is shipping, today, at https://perl.petamem.com.

What Part One Was

Part One was a flag-planting exercise. Someone on Reddit asked whether pperl would support PDL; we said yes; we then had to actually do it. Forty-five tests covering construction, arithmetic, reductions, a handful of transcendentals, and operator overloading. The point was to prove the mechanism - that a pure-Rust engine behind a pperl native module could stand up to Perl-side code that expects `use PDL; my $x = pdl([1,2,3]);` to just work. It did. But forty-five tests is not a usable PDL.

This post is about what happened when we turned the heat up.

What's in the Box

The Perl-facing module hierarchy now mirrors upstream PDL's own, file-for-file where it matters:

PDL              - top-level boot, exports pdl/zeroes/ones/...
PDL::Core        - constructors, accessors, DESTROY, magic
PDL::Ops         - arithmetic, comparison, bitwise, unary
PDL::Ufunc       - reductions, sorting, statistics
PDL::Math        - trig, hyperbolic, Bessel, erf, special functions
PDL::Primitive   - matmult, which, where, clip, append, convolve
PDL::Slices      - slice, xchg, mv, reshape, clump, dummy, range
PDL::Basic       - sequence, xvals, yvals, zvals, rvals, linvals
PDL::MatrixOps   - det, inv, LU, eigens, trace, norm
PDL::Bad         - bad-value flags, setbadat, copybad, locf
PDL::FFT         - Cooley-Tukey radix-2 FFT/IFFT
PDL::NiceSlice   - source filter: $x(:,0) syntax
PDL::Lite        - lightweight import variant
PDL::LiteF       - lightweight import with fewer functions

Fifteen data types. Broadcasting. Bad values. Dataflow. Affine slicing. Storable round-trip. FFT. 2D image convolutions. Matrix decompositions. Source-filter support for PDL::NiceSlice's $x(:,0) syntax. Operator overloading for ~40 operators. It is PDL, not a demo subset. Whether it walks quite like the grown-up yet is for the user to judge.
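Broadcasting deserves one concrete illustration of what "It is PDL" means in mechanism terms. Below is a minimal sketch of the dim-matching rule (a toy function of our own, not the engine's code), assuming upstream's convention that missing or size-1 dims stretch to match the other operand:

```rust
// Sketch only: PDL-style broadcast of two dim lists.
// Dims align from dim 0; a missing or size-1 dim stretches to match.
fn broadcast_dims(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        let da = *a.get(i).unwrap_or(&1); // missing dim acts as size 1
        let db = *b.get(i).unwrap_or(&1);
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None, // incompatible sizes: broadcast error
        }
    }
    Some(out)
}

fn main() {
    // a 3-vector against a 3x4 matrix broadcasts to 3x4
    assert_eq!(broadcast_dims(&[3], &[3, 4]), Some(vec![3, 4]));
    // incompatible sizes are rejected
    assert_eq!(broadcast_dims(&[3], &[2]), None);
    println!("broadcast ok");
}
```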

The Connector, in One Example

Part One described the split at a high level - a standalone Rust crate for the engine, a pperl native module for the bridge. This time let us walk one operation end to end, because the connector is where the Perl audience will want the detail.

Consider $A x $B, the matrix-multiply overload:

use PDL;
my $A = pdl([[1,2],[3,4]]);
my $B = pdl([[5,6],[7,8]]);
my $C = $A x $B;

Here is what the x overload looks like on our side. The handler is a plain Rust function registered into the PDL stash at boot. No use overload block, no BEGIN-time aliasing, no XSUB wrapper file:

// At boot, overload entries go straight into the PDL stash.
let overloads: &[(&[u8], unsafe extern "C" fn(*mut CV))] = &[
    (b"(+\0",    xs_plus_overload),
    (b"(*\0",    xs_mult_overload),
    (b"(x\0",    xs_matmult_overload),   // matrix multiply
    // ~40 more operators …
];
for &(name, func) in overloads {
    // Registers e.g. b"(x\0" as PDL::(x in the stash.
    newXS(name.as_ptr() as *const c_char, Some(func), file);
}

The handler itself extracts the PDLs from their SVs, validates dimensions, calls into the engine, wraps the result:

unsafe extern "C" fn xs_matmult_overload(_cv: *mut CV) {
    let mark  = xs_pop_mark();
    let items = xs_items(mark);
    let (a_sv, b_sv, _swap) = binop_args(mark, items);
    xs_sp_set(mark);
    let mut a_temp = false;
    let mut b_temp = false;
    let a = sv_to_pdl_or_scalar(a_sv, &mut a_temp);
    let b = sv_to_pdl_or_scalar(b_sv, &mut b_temp);
    let a_dims = core_vtable::pdl_dims(a);
    let b_dims = core_vtable::pdl_dims(b);
    if a_dims[0] != b_dims[1] {
        Perl_croak(b"PDL: matmult: dimension mismatch\0".as_ptr() as *const c_char);
    }
    let c = core_vtable::pdl_alloc_output();
    // --- THE ENTIRE DISPATCH ---
    // One call into Rust-PDL. No XS wrapping, no interpreter
    // round-trip, no hidden allocations.
    let err = rust_pdl::primitive::pp_matmult
                 ::pdl_run_matmult(a, b, c);
    if a_temp { core_vtable::pdl_destroy(a); }
    if b_temp { core_vtable::pdl_destroy(b); }
    if err != 0 {
        Perl_croak(b"PDL: matmult failed\0".as_ptr() as *const c_char);
    }
    xs_push(sv_2mortal(pdl_to_sv(c)));
}

Three things about this example are worth drawing out for a Perl audience.

The overload dispatch is a function-pointer table. $A x $B resolves as: stash lookup → function-pointer call → Rust. Upstream PDL's equivalent path is: use overload table lookup → dispatch to a Perl sub → XSUB boundary → PP-generated C → inner kernel. pperl removes the two middle layers entirely, because the handler is the kernel-caller, compiled into the interpreter binary.

sv_to_pdl is a magic walk, not a method call. PDL SVs in pperl store the engine's *mut c_pdl as PERL_MAGIC_ext on the inner SV - the same mechanism Perl itself uses for tied variables, dualvars, and a dozen other features. Extracting the pointer is a flag check plus a linked-list walk. No method dispatch, no can(), no string comparisons:

let flags = SvFLAGS(inner);
if flags & (SVs_GMG | SVs_SMG | SVs_RMG) != 0 {
    let mut mg = (*sv_any).xmg_u.xmg_magic;  // sv_any = SvANY(inner)
    while !mg.is_null() {
        if (*mg).mg_virtual == &PDL_MAGIC_VTBL {
            return (*mg).mg_ptr as *mut c_pdl;  // ← the c_pdl pointer, raw
        }
        mg = (*mg).mg_moremagic;
    }
}

So $A->at(0) in pperl is: one stash method lookup, one magic walk, one pointer-arithmetic offset inside a Rust function. We did not invent a new mechanism for binding Rust objects to Perl variables - we reused Perl's own magic infrastructure, because that is what it is for.

Type promotion: upstream's rule, reimplemented from scratch. Matrix multiply with mixed types - long x float - must promote to double for numerical stability. Upstream's rule is "an NV scalar forces minimum promotion to double; an NV PDL keeps its wider type if already ≥ double." We ported the rule verbatim; the implementation is thirty lines of Rust match arms. No C preprocessor switch tables. No .xs file. No PDL::PP code-generation templates. When we discovered that pperl was keeping float instead of promoting to double for float + 0.2, the fix was two match arms, one rebuild, twenty-five tests recovered. In the upstream world the equivalent change would touch PDL::PP's templates and trigger a rebuild of forty op files.
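The rule is short enough to sketch. The following is a hypothetical condensed version of those match arms - the names `Ty`, `Operand`, and `result_type` are ours for illustration, not pperl's actual table:

```rust
// Hypothetical sketch of the promotion rule described above.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Ty { Byte, Short, Long, Float, Double, LDouble }

enum Operand {
    Pdl(Ty),  // the other operand is a PDL of some type
    NvScalar, // a Perl floating-point scalar
    IvScalar, // a Perl integer scalar
}

fn result_type(pdl: Ty, other: &Operand) -> Ty {
    match other {
        // An NV scalar forces minimum promotion to double;
        // a PDL already >= double keeps its wider type.
        Operand::NvScalar => pdl.max(Ty::Double),
        // An IV scalar leaves the PDL's type alone.
        Operand::IvScalar => pdl,
        // PDL-PDL: the wider type wins.
        Operand::Pdl(t) => pdl.max(*t),
    }
}

fn main() {
    // float + 0.2 must promote to double -- the bug mentioned above.
    assert_eq!(result_type(Ty::Float, &Operand::NvScalar), Ty::Double);
    // an ldouble PDL keeps its wider type against an NV scalar.
    assert_eq!(result_type(Ty::LDouble, &Operand::NvScalar), Ty::LDouble);
    println!("promotion ok");
}
```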

pd2multi - We Build a Compiler

Upstream PDL has PDL::PP: a macro/template processor that weaves .pd definitions with C fragments and hands the output to cc. It has served PDL for thirty years. It works. But the output is always C, and it is a text-substitution engine, not a compiler.

We could not extend it. We needed something that could emit Rust today, and potentially other targets later - a C backend to validate against upstream's own output, a GPU backend for OpenCL, a Perl backend to let pperl's JIT see through the operation into the surrounding loop. So we built a proper parser → AST → emitter pipeline. pd2multi ingests the same input language, builds an explicit intermediate representation, and delegates output to pluggable backend emitters.

This is internal infrastructure. Nothing user-facing ships from it yet; it is the scaffolding under the engine. It is roughly two orders of magnitude more work than the template-substitution approach it replaces. The payoff is optionality: when we want to add a GPU backend, it is one more emitter walking the AST, not a new compiler.
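The shape of that pipeline can be sketched in a few lines. This is a toy, not pd2multi's real IR - the names `Expr`, `Emitter`, and `RustEmitter` are illustrative only. The point is that a backend is a trait implementation walking a shared AST, so adding a target means adding one more impl, not a new template pass:

```rust
// Toy sketch of the parser -> AST -> emitter split.
enum Expr {
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// A backend is just a walker over the shared AST; a C or GPU
// backend would be another impl of this trait.
trait Emitter {
    fn emit(&self, e: &Expr) -> String;
}

struct RustEmitter;
impl Emitter for RustEmitter {
    fn emit(&self, e: &Expr) -> String {
        match e {
            Expr::Var(v) => v.clone(),
            Expr::Add(a, b) => format!("({} + {})", self.emit(a), self.emit(b)),
            Expr::Mul(a, b) => format!("({} * {})", self.emit(a), self.emit(b)),
        }
    }
}

fn main() {
    // a * b + c as one AST, rendered by one backend
    let ast = Expr::Add(
        Box::new(Expr::Mul(
            Box::new(Expr::Var("a".into())),
            Box::new(Expr::Var("b".into())),
        )),
        Box::new(Expr::Var("c".into())),
    );
    assert_eq!(RustEmitter.emit(&ast), "((a * b) + c)");
    println!("emit ok");
}
```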

Correctness

The connector test suite runs every assertion against both upstream perl5+PDL and pperl+Rust-PDL, compares outputs, and categorizes each result. A representative snapshot from the harness at the time of writing:

Tests: ~1380 matched, ~25 mismatched, ~110 extra / ~1400 total

Matched - the large majority, give or take: pperl produces the same output as upstream.

Mismatched - on the order of two dozen: pperl differs from upstream. Most of these are real work in progress and we know where the gaps are. The biggest cluster is in forward-flow dataflow - when a parent PDL is mutated, downstream slice children should see the update eagerly. Our implementation is currently lazy, which catches most cases via dirty-flag refresh but misses the paths that read child data without going through the normal accessor. A handful of tests in 220-broadcasting fail on that. The rest are distributed drift: error message text, one edge case in unbroadcast, Storable round-trip issues. None are conceptual; all are bounded.

Extra - around a hundred: tests that pperl passes and upstream does not. Some of these are test-surface coverage - we wrote more thorough tests for overload dispatch, import semantics, and native-function dispatch than upstream did. Others are legitimate behavioural differences where we produce a correct result and upstream does not; the precision and diagnostics subsections below cover the interesting cases.
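The lazy scheme behind the forward-flow gap above can be sketched as follows. This is a hypothetical miniature, with names of our own choosing: the child carries a dirty flag and refreshes from its parent on the next accessor read - which is exactly why reads that bypass the accessor see stale data:

```rust
use std::cell::{Cell, RefCell};
use std::rc::Rc;

// Sketch only: lazy dirty-flag dataflow between a parent PDL
// and a slice child. Real pperl uses the engine's c_pdl graph.
struct Parent { data: RefCell<Vec<f64>> }

struct SliceChild {
    parent: Rc<Parent>,
    offset: usize,
    len: usize,
    cache: RefCell<Vec<f64>>,
    dirty: Cell<bool>,
}

impl SliceChild {
    // The accessor path: refresh from the parent if marked dirty.
    // Paths that read `cache` directly would see stale data.
    fn data(&self) -> Vec<f64> {
        if self.dirty.get() {
            let p = self.parent.data.borrow();
            *self.cache.borrow_mut() = p[self.offset..self.offset + self.len].to_vec();
            self.dirty.set(false);
        }
        self.cache.borrow().clone()
    }
}

// Scenario: slice elements 1..=2 of a 4-element parent, mutate
// the parent, then read the child through the accessor.
fn demo() -> Vec<f64> {
    let parent = Rc::new(Parent { data: RefCell::new(vec![1.0, 2.0, 3.0, 4.0]) });
    let child = SliceChild {
        parent: Rc::clone(&parent),
        offset: 1,
        len: 2,
        cache: RefCell::new(Vec::new()),
        dirty: Cell::new(true),
    };
    parent.data.borrow_mut()[1] = 99.0; // parent mutated...
    child.data()                        // ...child sees it lazily
}

fn main() {
    assert_eq!(demo(), vec![99.0, 3.0]);
    println!("dataflow ok");
}
```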

There is a separate, smaller number worth singling out.

Where pperl and Upstream Disagree, and pperl is Right

A handful of tests produce different results between the two implementations where pperl's result is more correct. Six in the first sweep; we are no longer surprised when we find another. Three patterns are worth naming.

Precision. pctover computes the k-th percentile along a dimension:

my $x = pdl(0,0,6,3,5,0,7,14,94,5,5,8,7,7,1,6,7,13,10,2,101,19,7,7,5);
my $r = $x->pctover(0.9);
printf "%.20f\n", $r->at();

Upstream says 17.00000000000000710543. pperl says 17.00000000000000000000. Both stringify to 17, so to the eye they are identical. But $r->at() == 17 is false on upstream (the 7e-15 drift catches the numeric comparison) and true on pperl. The test was written as a plain == 17 assertion, assuming the kernel would return an exact integer.

Why does pperl get it right? Same numerical recipe, different float accounting. Upstream's PP-generated C coerces intermediates into PDL_Double early and rounds once per sample. The Rust kernel, passing through LLVM's optimizer, keeps a wider accumulator and rounds once at the final divide. It is not cleverness on our side - it is what the modern toolchain does with the same numerical recipe when nothing in the code forces early narrowing. The same pattern appears in xlogvals and a couple of other reductions.
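The effect is easy to reproduce outside PDL. The following is not the pctover kernel - just a self-contained illustration of the same accounting difference: an accumulator that rounds at every step versus one that stays wide and rounds once at the end:

```rust
// Same recipe, different accumulator width.
fn sum_narrow(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32; // rounds to f32 at every step
    for &x in xs {
        acc += x;
    }
    acc
}

fn sum_wide(xs: &[f32]) -> f32 {
    let mut acc = 0.0f64; // stays wide; one rounding at the final cast
    for &x in xs {
        acc += x as f64;
    }
    acc as f32
}

fn main() {
    // 0.1 is inexact in binary; a million additions amplify the
    // per-step rounding in the narrow accumulator.
    let xs = vec![0.1f32; 1_000_000];
    let (narrow, wide) = (sum_narrow(&xs), sum_wide(&xs));
    assert!((wide - 100_000.0).abs() < (narrow - 100_000.0).abs());
    println!("narrow={narrow} wide={wide}");
}
```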

Perl context. lgamma returns two values: the log-gamma value, and the sign. Upstream's PP-generated XS returns both as a two-element list regardless of context. In scalar context, Perl takes the last element - so my $y = lgamma($x) assigns the sign (always 1), not the actual gamma. That is not what a Perl programmer expects. pperl respects GIMME_V: scalar context returns the value, list context returns the pair.

This is not a numerical-methods win. It is a Perl-semantics win - the kind that only a Perl-aware reimplementation can notice. We didn't set out to behave differently from upstream; we set out to behave the way Perl specifies.
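In sketch form the context rule looks like this - a hypothetical model with names of our own, not the interpreter API; the real handler reads GIMME_V from the interpreter and pushes SVs onto the stack:

```rust
// Sketch only: context-sensitive return, modelled after GIMME_V.
#[derive(PartialEq, Debug)]
enum Gimme { Scalar, List }

// lgamma conceptually produces (value, sign); what the caller
// receives depends on calling context.
fn lgamma_return(value: f64, sign: i32, ctx: Gimme) -> Vec<f64> {
    match ctx {
        // scalar context: the caller gets the log-gamma value itself
        Gimme::Scalar => vec![value],
        // list context: (value, sign), as upstream documents
        Gimme::List => vec![value, sign as f64],
    }
}

fn main() {
    // Upstream's XS pushes both regardless of context, so scalar
    // context takes the LAST element -- the sign, always 1.
    assert_eq!(lgamma_return(1.7917595, 1, Gimme::Scalar), vec![1.7917595]);
    assert_eq!(lgamma_return(1.7917595, 1, Gimme::List), vec![1.7917595, 1.0]);
    println!("context ok");
}
```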

Diagnostics. pdl("nonsense blurb") should reject its input. Upstream does reject it - but the error message reads found 'e' as part of a larger word in nonsense blurb, because the parser is reporting that it saw e (Euler's constant) embedded in a longer token, and that is what it complains about. pperl reports found disallowed character(s) 'nonsense,blurb', which is what an actual user wanted to know. A smaller case of the same pattern: $x->slice('0:-14') on a 10-element PDL silently returns a garbage view upstream; pperl croaks with an out-of-bounds error. Good diagnostics are a correctness property.

And in the interest of honest accounting: we have also found places where our implementation was wrong and upstream was right, and we conformed. The float + 0.2 promotion rule mentioned earlier was one of those: we were returning float, upstream correctly promoted to double, we fixed ourselves to match. The flow of corrections is bidirectional and that is the point - two independent implementations of the same spec are a stronger guarantee of correctness than either one alone.

Performance

The startup story from Part One holds and has tightened up. Here is the harness output from the current connector test suite, picking a representative sample:

PDL/010-constructor.t     ... ok  (17/17)   perl5: 84ms  / pperl: 9ms
PDL/110-core.t            ... ok  (121/121) perl5: 88ms  / pperl: 13ms
PDL/120-ops.t             ... ok  (75/75)   perl5: 94ms  / pperl: 12ms
PDL/150-primitive.t       ... ok  (60/60)   perl5: 86ms  / pperl: 11ms
PDL/190-matrixops.t       ... ok  (35/35)   perl5: 90ms  / pperl: 11ms
PDL/260-pdl-from-string.t ... ok  (60/60)   perl5: 86ms  / pperl: 13ms
PDL/300-niceslice.t       ... ok  (30/30)   perl5: 137ms / pperl: 75ms
PDL/330-flexraw-fastraw.t ... ok  (30/30)   perl5: 96ms  / pperl: 20ms

Across the full PDL suite, pperl's per-test wall time sits at 9–13ms for the typical path and climbs to 20–75ms for tests that do real I/O or source filtering. Upstream's corresponding numbers are 82–137ms. The typical startup advantage is 8–10×; the fastest cases reach 13×. This is not a compute-throughput benchmark - it is the cost of use PDL plus a few operations. And it holds because the entire PDL stack is compiled into the pperl binary. There is no module loading, no PP compilation, no dynamic linking.

Compute-throughput numbers are what most readers will actually care about, so here is a table from the current benchmark suite - release build, microseconds per call, averaged over a warm loop. Worth stating up front: these are dry numbers. No JIT loop-fusion, no Rayon parallelism, no GPU offload - none of the mechanisms that define pperl's architectural advantage are engaged here. What is being measured is the raw kernel-versus-kernel cost and the per-call dispatch overhead. The ceiling is considerably higher than what these numbers show; the floor is what they are.

op                    size    upstream µs    pperl µs    ratio
---------------   --------    -----------    --------    -----
pdl_from_str            10           66.6         3.4    20x
pdl_from_str           100          583.4        18.3    32x
pdl_from_str          1000         6667.7       183.4    36x

stringify               10            8.6         1.5     5.6x
stringify              100           60.0        13.1     4.6x
stringify             1000          620.6       140.8     4.3x

add_scalar             100            3.4         1.2     2.8x
matmult_sq              10            5.4         2.3     2.3x

slice                  100            2.0         1.3     1.5x
slice                10000            2.1         1.3     1.7x
slice              1000000            2.2         1.4     1.5x

add_vec            1000000         1860.4      1939.1     0.96x
mul_vec            1000000         1879.3      1790.6     1.05x
sum                1000000         1468.3      1344.1     1.09x
exp                1000000         8174.9      8110.6     1.01x
matmult_sq             200         6329.4      7070.6     0.90x

sumover            1000000         4723.4     13214.5     0.36x

Three things stand out.

Parsing and formatting win cleanly. pdl_from_str - the pdl("1 2 3; 4 5 6") literal parser - is 20-36× faster because ours is a native Rust tokenizer where upstream's is a pure-Perl regex chain. Stringify is 4-6× faster for the same structural reason: a native formatter rather than Perl-side string assembly. These gains are not tuning artefacts; they will persist.
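The structural difference is easy to picture. A toy version of a native one-pass parser for the same literal syntax - emphatically not pperl's actual tokenizer - needs no regex machinery and can report bad tokens directly:

```rust
// Sketch only: split rows on ';', numbers on whitespace/commas,
// in a single pass, reporting the offending token on failure.
fn parse_matrix(s: &str) -> Result<Vec<Vec<f64>>, String> {
    s.split(';')
        .map(|row| {
            row.split(|c: char| c.is_whitespace() || c == ',')
                .filter(|t| !t.is_empty())
                .map(|t| {
                    t.parse::<f64>()
                        .map_err(|_| format!("disallowed token '{t}'"))
                })
                .collect()
        })
        .collect()
}

fn main() {
    assert_eq!(
        parse_matrix("1 2 3; 4 5 6"),
        Ok(vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]])
    );
    // Bad input is reported by token, in the spirit of the
    // diagnostics section above.
    assert!(parse_matrix("nonsense blurb").is_err());
    println!("parse ok");
}
```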

Small-op dispatch wins. On small arrays (100-element add_scalar, 10×10 matmult), pperl is 2-3× faster because the XS overhead is gone. This is the function-pointer overload dispatch from the connector walkthrough, now with numbers: Perl → stash lookup → Rust kernel, no intervening XSUB layer. On large arrays the kernel dominates and the dispatch win disappears into the noise - which is the correct outcome, not a disappointment.

Slice is zero-copy and measurably so. A slice operation on a 1M-element PDL takes about 1.4 µs on pperl, independent of input size - a constant-time affine view construction, as it should be. Upstream is within the same order of magnitude (about 2 µs); both implementations do the right thing here.
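The constant-time claim follows from what a slice actually builds. A minimal sketch of an affine view - names hypothetical - shows that constructing it touches only the header, never the elements:

```rust
// Sketch only: an affine view is (base, offset, dims, strides).
struct View<'a> {
    data: &'a [f64],
    offset: usize,
    dims: Vec<usize>,
    strides: Vec<usize>, // in elements
}

impl<'a> View<'a> {
    // Constant-time: only the affine header changes, independent
    // of how many elements the underlying storage holds.
    fn slice1d(&self, start: usize, end: usize) -> View<'a> {
        View {
            data: self.data,
            offset: self.offset + start * self.strides[0],
            dims: vec![end - start + 1],
            strides: vec![self.strides[0]],
        }
    }

    fn at(&self, i: usize) -> f64 {
        self.data[self.offset + i * self.strides[0]]
    }
}

fn main() {
    let storage: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();
    let v = View { data: &storage, offset: 0, dims: vec![storage.len()], strides: vec![1] };
    let s = v.slice1d(10, 19); // like $x->slice('10:19')
    assert_eq!(s.at(0), 10.0);
    assert_eq!(s.dims, vec![10]);
    println!("slice ok");
}
```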

Outliers do happen. sumover at 1M elements is currently slower than upstream; it was detected, it is understood, and it is being worked on. That is the expected shape of a young engine against a thirty-year-old one.

These numbers are a baseline: the "dry" performance without JIT, autoparallelization, or GPU. Compiler autovectorization happens for both implementations. The numbers for JIT+Rayon and/or GPU will land in a dedicated post once the relevant parts of the JIT are ready.

How pperl PDL Compares to Upstream

Area                            pperl + Rust-PDL                                   upstream + C PDL
Core numerics                   ~98% parity; occasionally more precise             The reference implementation
Startup (use PDL + trivial op)  ~9–13ms typical                                    ~82–90ms typical
Build dependency                cargo build, no C toolchain                        C compiler, XS, PP code-gen, Makefile toolchain
Thread model                    Rayon (prototypes working)                         Perl ithreads + pthread-guarded globals
Source filters                  Yes - NiceSlice works                              Yes
Forward-flow dataflow           Partial; a handful of known test failures          Full
PDL::PP compiler                No - replaced by pd2multi (internal)               Yes - runs at build time
Inline::Pdlpp                   No (no in-process C compiler)                      Yes
FITS / astronomy I/O            Out of scope (user-space, on top of PDL)           Yes, via Astro::FITS
Perl ithreads                   No - Rayon replaces the model                      Yes

On Compatibility

Our commitment is to upstream PDL's specified behaviour. When the specification and the current implementation disagree, we side with the specification. lgamma in scalar context is the cleanest illustration: Perl's GIMME_V contract is unambiguous; upstream's PP-generated XS happens not to honour it; ours does. We did not set out to deviate - we set out to mirror Perl.

Everywhere else, upstream is the reference. Thirty years of PDL carry numerical decisions and edge-case wisdom that the documentation does not always explain. Whenever we disagreed with upstream during implementation, our default assumption was that we were missing context. The float + 0.2 promotion case earlier in this post is exactly that: we thought we were right, upstream was right, and we conformed.

What's Next

Three threads, in rough priority order.

Rayon-based parallelism. First prototypes are running: PDL operations inside pperl's JIT-compiled while-loops distribute across cores via work-stealing. Early results look promising. Mandelbrot-scale workloads already parallelize cleanly; the interesting work is in making parallelism show up for realistic PDL pipelines without per-call overhead eating the win.

OpenCL / GPU acceleration. This is a priority goal, and it is the reason pd2multi exists as a proper compiler rather than a template system. A GPU backend is one more emitter walking the AST, not a separate codegen project. The engineering cost is real but bounded.

JIT fusion across PDL boundaries. Cranelift compilation of surrounding Perl loops that call PDL operations, so the JIT can see through the op and fuse it with the loop body. This is the structural payoff of having a pure-Rust engine - the JIT can look inside - and it is the thing that will make the case for reimplementation in performance terms, not just compatibility terms. Serious work on this is pending the Rayon and GPU threads stabilizing.

No dates. We will post numbers when there are numbers to post.

In the meantime: use PDL; in pperl, today, at https://perl.petamem.com. Roughly three thousand assertions of work, about ninety-eight per cent of them matching the upstream beacon, a double-digit startup win, and a handful of cases where independent implementation caught things that a single implementation had missed.

- Richard C. Jelinek, PetaMem s.r.o.


About PetaMem

All things Perl.