PDL in Rust -- Part Two

Two weeks ago we posted "PDL in Rust - A Native
Reimplementation of the Perl Data Language". At the time the
score was 45 tests, all green. That was enough to say "it compiles,
it runs, here is the arithmetic surface." It was not enough to say
"you can use this."
We are now on the second number. The current PDL implementation in pperl covers roughly 3,000 assertions end-to-end: about 1,400 on the Perl-facing connector side and about 1,600 on the engine side. As of this writing roughly 98% of the connector assertions match upstream PDL 2.103 exactly, and for most of the remaining couple of dozen we already know why they fail. By the time you read this the numbers will have drifted a little in our favour - give or take - but the shape is the point, not the decimal.
Everything described below is shipping, today, at https://perl.petamem.com.
What Part One Was
Part One was a flag-planting exercise. Someone on Reddit asked whether pperl would support PDL; we said yes; we then had to actually do it. Forty-five tests covering construction, arithmetic, reductions, a handful of transcendentals, and operator overloading. The point was to prove the mechanism - that a pure-Rust engine behind a pperl native module could stand up to Perl-side code that expects `use PDL; my $x = pdl([1,2,3]);` to just work. It did. But forty-five tests is not a usable PDL.
This post is about what happened when we turned the heat up.
What's in the Box
The Perl-facing module hierarchy now mirrors upstream PDL's own, file-for-file where it matters:
PDL            - top-level boot, exports pdl/zeroes/ones/...
PDL::Core      - constructors, accessors, DESTROY, magic
PDL::Ops       - arithmetic, comparison, bitwise, unary
PDL::Ufunc     - reductions, sorting, statistics
PDL::Math      - trig, hyperbolic, Bessel, erf, special functions
PDL::Primitive - matmult, which, where, clip, append, convolve
PDL::Slices    - slice, xchg, mv, reshape, clump, dummy, range
PDL::Basic     - sequence, xvals, yvals, zvals, rvals, linvals
PDL::MatrixOps - det, inv, LU, eigens, trace, norm
PDL::Bad       - bad-value flags, setbadat, copybad, locf
PDL::FFT       - Cooley-Tukey radix-2 FFT/IFFT
PDL::NiceSlice - source filter: $x(:,0) syntax
PDL::Lite      - lightweight import variant
PDL::LiteF     - lightweight import with fewer functions
Fifteen data types. Broadcasting. Bad values. Dataflow. Affine
slicing. Storable round-trip. FFT. 2D image convolutions. Matrix
decompositions. Source-filter support for PDL::NiceSlice's
$x(:,0) syntax. Operator overloading for ~40
operators. It is PDL, not a demo subset. Whether it walks quite
like the grown-up yet is for the user to judge.
The Connector, in One Example
Part One described the split at a high level - a standalone Rust crate for the engine, a pperl native module for the bridge. This time let us walk one operation end to end, because the connector is where the Perl audience will want the detail.
Consider $A x $B, the matrix-multiply overload:
use PDL;
my $A = pdl([[1,2],[3,4]]);
my $B = pdl([[5,6],[7,8]]);
my $C = $A x $B;
Here is what the x overload looks like on our side.
The handler is a plain Rust function registered into the PDL stash
at boot. No use overload block, no BEGIN-time aliasing,
no XSUB wrapper file:
// At boot, overload entries go straight into the PDL stash.
let overloads: &[(&[u8], unsafe extern "C" fn(*mut CV))] = &[
    (b"(+\0", xs_plus_overload),
    (b"(*\0", xs_mult_overload),
    (b"(x\0", xs_matmult_overload), // matrix multiply
    // ~40 more operators …
];
for &(name, func) in overloads {
    // Registers each handler as PDL::(+ , PDL::(* , PDL::(x , ...
    let full = [b"PDL::" as &[u8], name].concat();
    newXS(full.as_ptr() as *const c_char, Some(func), file);
}
The handler itself extracts the PDLs from their SVs, validates dimensions, calls into the engine, wraps the result:
unsafe extern "C" fn xs_matmult_overload(_cv: *mut CV) {
    let mark = xs_pop_mark();
    let items = xs_items(mark);
    let (a_sv, b_sv, _swap) = binop_args(mark, items);
    xs_sp_set(mark);

    let mut a_temp = false;
    let mut b_temp = false;
    let a = sv_to_pdl_or_scalar(a_sv, &mut a_temp);
    let b = sv_to_pdl_or_scalar(b_sv, &mut b_temp);

    let a_dims = core_vtable::pdl_dims(a);
    let b_dims = core_vtable::pdl_dims(b);
    if a_dims[0] != b_dims[1] {
        Perl_croak(b"PDL: matmult: dimension mismatch\0".as_ptr() as *const c_char);
    }

    let c = core_vtable::pdl_alloc_output();

    // --- THE ENTIRE DISPATCH ---
    // One call into Rust-PDL. No XS wrapping, no interpreter
    // round-trip, no hidden allocations.
    let err = rust_pdl::primitive::pp_matmult::pdl_run_matmult(a, b, c);

    if a_temp { core_vtable::pdl_destroy(a); }
    if b_temp { core_vtable::pdl_destroy(b); }
    if err != 0 {
        Perl_croak(b"PDL: matmult: engine error\0".as_ptr() as *const c_char);
    }
    xs_push(sv_2mortal(pdl_to_sv(c)));
}
Three things about this example are worth drawing out for a Perl audience.
The overload dispatch is a function-pointer table.
$A x $B resolves as: stash lookup → function-pointer
call → Rust. Upstream PDL's equivalent path is: use overload
table lookup → dispatch to a Perl sub → XSUB boundary → PP-generated C
→ inner kernel. pperl removes the two middle layers entirely, because
the handler is the kernel-caller, compiled into the interpreter
binary.
sv_to_pdl is a magic walk, not a method call.
PDL SVs in pperl store the engine's *mut c_pdl as
PERL_MAGIC_ext on the inner SV - the same mechanism Perl
itself uses for tied variables, dualvars, and a dozen other features.
Extracting the pointer is a flag check plus a linked-list walk. No
method dispatch, no can(), no string comparisons:
let flags = SvFLAGS(inner);
if flags & (SVs_GMG | SVs_SMG | SVs_RMG) != 0 {
    let mut mg = (*sv_any).xmg_u.xmg_magic;
    while !mg.is_null() {
        if (*mg).mg_virtual == &PDL_MAGIC_VTBL {
            return (*mg).mg_ptr as *mut c_pdl; // ← the c_pdl pointer, raw
        }
        mg = (*mg).mg_moremagic;
    }
}
So $A->at(0) in pperl is: one stash method lookup, one
magic walk, one pointer-arithmetic offset inside a Rust function. We
did not invent a new mechanism for binding Rust objects to Perl
variables - we reused Perl's own magic infrastructure, because that is
what it is for.
Type promotion: upstream's rule, reimplemented from scratch.
Matrix multiply with mixed types - long x float - must
promote to double for numerical stability. Upstream's rule is "an NV
scalar forces minimum promotion to double; an NV PDL keeps its wider
type if already ≥ double." We ported the rule verbatim; the
implementation is thirty lines of Rust match arms. No C preprocessor
switch tables. No .xs file. No PDL::PP code-generation
templates. When we discovered that pperl was keeping float
instead of promoting to double for float + 0.2
- the fix was two match arms, one rebuild, twenty-five tests
recovered. In the upstream world the equivalent change would touch
PDL::PP's templates and trigger a rebuild of forty op files.
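To make the shape of those match arms concrete, here is a minimal sketch of the NV-scalar promotion rule. The type enum and the helper name are illustrative, not pperl's actual identifiers; only the rule itself is taken from the text above.

```rust
// Illustrative sketch of the NV-scalar promotion rule; PdlType and the
// function name are hypothetical, not the shipping code.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PdlType {
    Byte,
    Short,
    Long,
    Float,
    Double,
    LongDouble,
}

/// Result type of a binary op between a PDL and a Perl NV scalar.
/// Upstream's rule: the NV scalar forces a minimum promotion to Double,
/// while a PDL that is already Double or wider keeps its own type.
fn promote_with_nv_scalar(pdl_ty: PdlType) -> PdlType {
    match pdl_ty {
        // float + 0.2 lands here: promote to Double, not Float.
        PdlType::Byte | PdlType::Short | PdlType::Long | PdlType::Float => PdlType::Double,
        // Already >= Double: keep the wider type.
        wider => wider,
    }
}

fn main() {
    // The bug described above was exactly this arm returning Float.
    assert_eq!(promote_with_nv_scalar(PdlType::Float), PdlType::Double);
    assert_eq!(promote_with_nv_scalar(PdlType::LongDouble), PdlType::LongDouble);
    println!("promotion rule ok");
}
```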
pd2multi - We Built a Compiler
Upstream PDL has PDL::PP: a macro/template processor that weaves
.pd definitions with C fragments and hands the output to
cc. It has served PDL for thirty years. It works. But
the output is always C, and it is a text-substitution engine, not
a compiler.
We could not extend it. We needed something that could emit Rust
today, and potentially other targets later - a C backend to validate
against upstream's own output, a GPU backend for OpenCL, a Perl
backend to let pperl's JIT see through the operation into the
surrounding loop. So we built a proper parser → AST → emitter
pipeline. pd2multi ingests the same input language,
builds an explicit intermediate representation, and delegates output
to pluggable backend emitters.
This is internal infrastructure. Nothing user-facing ships from it yet; it is the scaffolding under the engine. It is roughly two orders of magnitude more work than the template-substitution approach it replaces. The payoff is optionality: when we want to add a GPU backend, it is one more emitter walking the AST, not a new compiler.
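To make "one more emitter walking the AST" concrete, here is a deliberately tiny sketch of the shape. The names and the two-node AST are illustrative; the real intermediate representation is far richer than this.

```rust
// Hypothetical miniature of the pd2multi pipeline: an explicit AST plus
// pluggable backend emitters. Names are illustrative, not the crate's API.
enum Expr {
    Var(&'static str),
    Add(Box<Expr>, Box<Expr>),
}

trait Emitter {
    fn emit(&self, e: &Expr) -> String;
}

struct RustEmitter;
impl Emitter for RustEmitter {
    fn emit(&self, e: &Expr) -> String {
        match e {
            Expr::Var(v) => v.to_string(),
            Expr::Add(a, b) => format!("({} + {})", self.emit(a), self.emit(b)),
        }
    }
}

struct CEmitter;
impl Emitter for CEmitter {
    fn emit(&self, e: &Expr) -> String {
        match e {
            Expr::Var(v) => format!("(double){}", v),
            Expr::Add(a, b) => format!("({} + {})", self.emit(a), self.emit(b)),
        }
    }
}

fn main() {
    let ast = Expr::Add(Box::new(Expr::Var("a")), Box::new(Expr::Var("b")));
    // A new target (GPU, Perl, ...) is one more impl of Emitter walking
    // the same AST; the parser and the IR are untouched.
    println!("{}", RustEmitter.emit(&ast));
    println!("{}", CEmitter.emit(&ast));
}
```

The parser builds `Expr` once; each backend is an independent `Emitter` impl, which is the whole payoff over template substitution.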
Correctness
The connector test suite runs every assertion against both upstream perl5+PDL and pperl+Rust-PDL, compares outputs, and categorizes each result. A representative snapshot from the harness at the time of writing:
Tests: ~1380 matched, ~25 mismatched, ~110 extra / ~1400 total
Matched - the large majority: pperl produces the same output as upstream.
Mismatched - on the order of two dozen: pperl
differs from upstream. Most of these are real work in progress and
we know where the gaps are. The biggest cluster is in forward-flow
dataflow - when a parent PDL is mutated, downstream slice children
should see the update eagerly. Our implementation is currently
lazy, which catches most cases via dirty-flag refresh but misses
the paths that read child data without going through the normal
accessor. A handful of tests in 220-broadcasting fail on that. The
rest are distributed drift: error message text, one edge case in
unbroadcast, Storable round-trip issues. None are
conceptual; all are bounded.
Extra - around a hundred: tests that pperl passes and upstream does not. Some of these are test-surface coverage - we wrote more thorough tests for overload dispatch, import semantics, and native-function dispatch than upstream did. Others are legitimate behavioural differences where we produce a correct result and upstream does not; the precision and diagnostics subsections below cover the interesting cases.
There is a separate, smaller number worth singling out.
Where pperl and Upstream Disagree, and pperl is Right
A handful of tests produce different results between the two implementations where pperl's result is more correct. Six in the first sweep; we are no longer surprised when we find another. Three patterns are worth naming.
Precision. pctover computes the
k-th percentile along a dimension:
my $x = pdl(0,0,6,3,5,0,7,14,94,5,5,8,7,7,1,6,7,13,10,2,101,19,7,7,5);
my $r = $x->pctover(0.9);
printf "%.20f\n", $r->at();
Upstream says 17.00000000000000710543. pperl says
17.00000000000000000000. Both stringify to 17,
so to the eye they are identical. But $r->at() == 17
is false on upstream (the 7e-15 drift catches the numeric
comparison) and true on pperl. The test was written as a
plain == 17 assertion, assuming the kernel would
return an exact integer.
Why does pperl get it right? Same numerical recipe, different
float accounting. Upstream's PP-generated C coerces intermediates
into PDL_Double early and rounds once per sample. The
Rust kernel, passing through LLVM's optimizer, keeps a wider
accumulator and rounds once at the final divide. It is not
cleverness on our side - it is what the modern toolchain does with
the same numerical recipe when nothing in the code forces early
narrowing. The same pattern appears in xlogvals and a
couple of other reductions.
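The effect is easy to reproduce outside either codebase. The following toy has nothing to do with PDL's actual kernels; it just demonstrates, in miniature, how an early-narrowed accumulator and a wide one diverge on the same recipe.

```rust
/// Sum `n` copies of 0.1 with the accumulator narrowed to f32 at every
/// step - the shape of the error that early coercion can introduce.
fn narrow_sum(n: u32) -> f32 {
    let mut acc = 0.0f32;
    for _ in 0..n {
        acc += 0.1f32; // rounding error accumulates at f32 precision
    }
    acc
}

/// Same recipe, but the accumulator stays wide (f64) and rounds once at
/// the end - what a modern toolchain is free to keep doing when nothing
/// in the source forces early narrowing.
fn wide_sum(n: u32) -> f64 {
    let mut acc = 0.0f64;
    for _ in 0..n {
        acc += 0.1f64;
    }
    acc
}

fn main() {
    let n = 10_000_000;
    // The exact answer is 1_000_000. The narrow accumulator drifts by
    // a visibly large amount; the wide one stays within rounding
    // distance of the true value.
    println!("narrow = {}", narrow_sum(n));
    println!("wide   = {}", wide_sum(n));
}
```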
Perl context. lgamma returns two
values: the log-gamma value, and the sign. Upstream's PP-generated
XS returns both as a two-element list regardless of context. In
scalar context, Perl takes the last element - so
my $y = lgamma($x) assigns the sign (always
1), not the actual gamma. That is not what a Perl
programmer expects. pperl respects GIMME_V: scalar
context returns the value, list context returns the pair.
This is not a numerical-methods win. It is a Perl-semantics win - the kind that only a Perl-aware reimplementation can notice. We didn't set out to behave differently from upstream; we set out to behave the way Perl specifies.
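Schematically, the context-aware return looks like this. The enum and the `Vec` return shape are illustrative stand-ins; in the real handler the context comes from Perl's GIMME_V and the results are pushed onto the stack as SVs.

```rust
// Illustrative sketch of context-aware lgamma return; not actual
// perl-API calls.
#[derive(Clone, Copy)]
enum Gimme {
    Scalar,
    List,
}

/// What lgamma hands back to Perl, depending on caller context.
fn lgamma_return(value: f64, sign: i32, ctx: Gimme) -> Vec<f64> {
    match ctx {
        // Scalar context: the caller gets the log-gamma value itself,
        // not the trailing sign element of a flattened list.
        Gimme::Scalar => vec![value],
        // List context: the documented (value, sign) pair.
        Gimme::List => vec![value, sign as f64],
    }
}

fn main() {
    let (v, s) = (24f64.ln(), 1); // lgamma(5) = ln(24), sign +1
    assert_eq!(lgamma_return(v, s, Gimme::Scalar), vec![v]);
    assert_eq!(lgamma_return(v, s, Gimme::List), vec![v, 1.0]);
    println!("context handling ok");
}
```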
Diagnostics. pdl("nonsense blurb")
should reject its input. Upstream does reject it - but the error
message reads found 'e' as part of a larger word in nonsense
blurb, because the parser is reporting that it saw
e (Euler's constant) embedded in a longer token, and
that is what it complains about. pperl reports found
disallowed character(s) 'nonsense,blurb', which is what an
actual user wanted to know. A smaller case of the same pattern:
$x->slice('0:-14') on a 10-element PDL silently
returns a garbage view upstream; pperl croaks with an out-of-bounds
error. Good diagnostics are a correctness property.
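A minimal sketch of the bounds rule behind that croak, as a hypothetical helper rather than pperl's actual slice parser: negative indices count back from the end, and anything that resolves outside the dimension is an error instead of a view.

```rust
/// Resolve a PDL-style 'start:end' range against a dimension of length
/// `len`. Negative indices count back from the end; out-of-bounds
/// results are an error rather than a garbage view.
/// (Hypothetical helper for illustration, not the shipping parser.)
fn resolve_range(start: i64, end: i64, len: i64) -> Result<(i64, i64), String> {
    let fix = |i: i64| if i < 0 { i + len } else { i };
    let (s, e) = (fix(start), fix(end));
    if s < 0 || e < 0 || s >= len || e >= len {
        return Err(format!(
            "slice {start}:{end} out of bounds for dimension of size {len}"
        ));
    }
    Ok((s, e))
}

fn main() {
    // '0:-1' on 10 elements is the whole dimension.
    assert_eq!(resolve_range(0, -1, 10), Ok((0, 9)));
    // '0:-14' on 10 elements resolves the end to -4: out of bounds.
    assert!(resolve_range(0, -14, 10).is_err());
    println!("bounds checks ok");
}
```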
And in the interest of honest accounting: we have also found
places where our implementation was wrong and upstream was right,
and we conformed. The float + 0.2 promotion rule
mentioned earlier was one of those: we were returning
float, upstream correctly promoted to double,
we fixed ourselves to match. The flow of corrections is
bidirectional and that is the point - two independent
implementations of the same spec are a stronger guarantee of
correctness than either one alone.
Performance
The startup story from Part One holds and has tightened up. Here is the harness output from the current connector test suite, picking a representative sample:
PDL/010-constructor.t ...... ok (17/17)   perl5: 84ms  / pperl: 9ms
PDL/110-core.t ............. ok (121/121) perl5: 88ms  / pperl: 13ms
PDL/120-ops.t .............. ok (75/75)   perl5: 94ms  / pperl: 12ms
PDL/150-primitive.t ........ ok (60/60)   perl5: 86ms  / pperl: 11ms
PDL/190-matrixops.t ........ ok (35/35)   perl5: 90ms  / pperl: 11ms
PDL/260-pdl-from-string.t .. ok (60/60)   perl5: 86ms  / pperl: 13ms
PDL/300-niceslice.t ........ ok (30/30)   perl5: 137ms / pperl: 75ms
PDL/330-flexraw-fastraw.t .. ok (30/30)   perl5: 96ms  / pperl: 20ms
Across the full PDL suite, pperl's per-test wall time sits at
9–13ms for the typical path and climbs to 20–75ms for tests that do
real I/O or source filtering. Upstream's corresponding numbers are
82–137ms. The typical startup advantage is 8–10×; the fastest cases
reach 13×. This is not a compute-throughput benchmark - it is the
cost of use PDL plus a few operations. And it holds
because the entire PDL stack is compiled into the pperl binary.
There is no module loading, no PP compilation, no dynamic linking.
Compute-throughput numbers are what most readers will actually care about, so here is a table from the current benchmark suite - release build, microseconds per call, averaged over a warm loop. Worth stating up front: these are dry numbers. No JIT loop-fusion, no Rayon parallelism, no GPU offload - none of the mechanisms that define pperl's architectural advantage are engaged here. What is being measured is the raw kernel-versus-kernel cost and the per-call dispatch overhead. The ceiling is considerably higher than what these numbers show; the floor is what they are.
op             size     upstream µs  pperl µs  ratio
-------------- -------- -----------  --------  -----
pdl_from_str   10       66.6         3.4       20x
pdl_from_str   100      583.4        18.3      32x
pdl_from_str   1000     6667.7       183.4     36x
stringify      10       8.6          1.5       5.6x
stringify      100      60.0         13.1      4.6x
stringify      1000     620.6        140.8     4.3x
add_scalar     100      3.4          1.2       2.8x
matmult_sq     10       5.4          2.3       2.3x
slice          100      2.0          1.3       1.5x
slice          10000    2.1          1.3       1.7x
slice          1000000  2.2          1.4       1.5x
add_vec        1000000  1860.4       1939.1    0.96x
mul_vec        1000000  1879.3       1790.6    1.05x
sum            1000000  1468.3       1344.1    1.09x
exp            1000000  8174.9       8110.6    1.01x
matmult_sq     200      6329.4       7070.6    0.90x
sumover        1000000  4723.4       13214.5   0.36x
Three things stand out.
Parsing and formatting win cleanly.
pdl_from_str - the pdl("1 2 3; 4 5 6")
literal parser - is 20-36× faster because ours is a native Rust
tokenizer where upstream's is a pure-Perl regex chain. Stringify is
4-6× faster for the same structural reason: a native formatter
rather than Perl-side string assembly. These gains are not tuning
artefacts; they will persist.
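For flavour, here is a toy version of such a tokenizer: whitespace-separated values, semicolon-separated rows, and a hard error naming the offending token. This is illustrative only; the real parser handles brackets, bad values, nested dimensions, and more.

```rust
/// Toy matrix-literal parser in the spirit of pdl("1 2 3; 4 5 6"):
/// rows split on ';', values on whitespace, anything non-numeric is a
/// hard error naming the offending token. Illustrative sketch only.
fn parse_matrix(src: &str) -> Result<Vec<Vec<f64>>, String> {
    src.split(';')
        .map(|row| {
            row.split_whitespace()
                .map(|tok| {
                    tok.parse::<f64>()
                        .map_err(|_| format!("found disallowed token '{tok}'"))
                })
                .collect()
        })
        .collect()
}

fn main() {
    assert_eq!(
        parse_matrix("1 2 3; 4 5 6"),
        Ok(vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]])
    );
    // Junk input is rejected with the token the user actually typed.
    assert!(parse_matrix("nonsense blurb").is_err());
    println!("parser ok");
}
```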
Small-op dispatch wins. On small arrays
(100-element add_scalar, 10×10 matmult),
pperl is 2-3× faster because the XS overhead is gone. This is the
function-pointer overload dispatch from the connector walkthrough,
now with numbers: Perl → stash lookup → Rust kernel, no intervening
XSUB layer. On large arrays the kernel dominates and the dispatch
win disappears into the noise - which is the correct outcome, not
a disappointment.
Slice is zero-copy and measurably so. A
slice operation on a 1M-element PDL takes about 1.4
µs on pperl, independent of input size - a constant-time affine
view construction, as it should be. Upstream is within the same
order of magnitude (about 2 µs); both implementations do the right
thing here.
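In miniature, a constant-time affine view looks like the sketch below. The struct layout is illustrative, not the engine's actual pdl representation; the point is that slicing touches only the bookkeeping triple, never the buffer.

```rust
/// Minimal affine view over a shared buffer: slicing only recomputes
/// (offset, len, stride) - O(1), no copy, regardless of buffer size.
/// Illustrative sketch, not the engine's actual layout.
struct View<'a> {
    data: &'a [f64],
    offset: usize,
    len: usize,
    stride: usize,
}

impl<'a> View<'a> {
    fn new(data: &'a [f64]) -> Self {
        View { data, offset: 0, len: data.len(), stride: 1 }
    }

    /// Constant-time slice covering [start..=end]: pure bookkeeping.
    fn slice(&self, start: usize, end: usize) -> View<'a> {
        assert!(start <= end && end < self.len, "slice out of bounds");
        View {
            data: self.data,
            offset: self.offset + start * self.stride,
            len: end - start + 1,
            stride: self.stride,
        }
    }

    fn at(&self, i: usize) -> f64 {
        assert!(i < self.len, "index out of bounds");
        self.data[self.offset + i * self.stride]
    }
}

fn main() {
    let buf: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();
    let v = View::new(&buf);
    let s = v.slice(10, 19).slice(2, 5); // views of a view compose
    assert_eq!(s.at(0), 12.0);
    assert_eq!(s.len, 4);
    println!("zero-copy views ok");
}
```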
Outliers do happen. sumover at 1M elements is
currently slower than upstream; it was detected, it is understood,
and it is being worked on. That is the expected shape of a young
engine against a thirty-year-old one.
These numbers are just a baseline: the "dry" performance, without JIT, autoparallelization, or GPU offload. Compiler autovectorization applies to both sides. The numbers for JIT+Rayon and/or GPU will land in a dedicated post once the relevant parts of the JIT are ready.
How pperl PDL Compares to Upstream
| Area | pperl + Rust-PDL | upstream + C PDL |
|---|---|---|
| Core numerics | ~98% parity with upstream; occasionally more precise | The reference implementation |
| Startup (use PDL + trivial op) | ~9–13ms typical | ~82–90ms typical |
| Build dependency | cargo build, no C toolchain | C compiler, XS, PP code-gen, Makefile toolchain |
| Thread model | Rayon (prototypes working) | Perl ithreads + pthread-guarded globals |
| Source filters | Yes - NiceSlice works | Yes |
| Forward-flow dataflow | Partial; a handful of known test failures | Full |
| PDL::PP compiler | No - replaced by pd2multi (internal) | Yes - runs at build time |
| Inline::Pdlpp | No (no in-process C compiler) | Yes |
| FITS / astronomy I/O | Out of scope (user-space, on top of PDL) | Yes, via Astro::FITS |
| Perl ithreads | No - Rayon replaces the model | Yes |
On Compatibility
Our commitment is to upstream PDL's specified behaviour. When
the specification and the current implementation disagree, we side
with the specification. lgamma in scalar context is
the cleanest illustration: Perl's GIMME_V contract is
unambiguous; upstream's PP-generated XS happens not to honour it;
ours does. We did not set out to deviate - we set out to mirror
Perl.
Everywhere else, upstream is the reference. Thirty years of PDL
carry numerical decisions and edge-case wisdom that the
documentation does not always explain. Whenever we disagreed with
upstream during implementation, our default assumption was that we
were missing context. The float + 0.2 promotion case
earlier in this post is exactly that: we thought we were right,
upstream actually was, and we conformed.
What's Next
Three threads, in rough priority order.
Rayon-based parallelism. First prototypes are running: PDL operations inside pperl's JIT-compiled while-loops distribute across cores via work-stealing. Early results look promising. Mandelbrot-scale workloads already parallelize cleanly; the interesting work is in making parallelism show up for realistic PDL pipelines without per-call overhead eating the win.
OpenCL / GPU acceleration. This is a priority goal, and it is the reason pd2multi exists as a proper compiler rather than a template system. A GPU backend is one more emitter walking the AST, not a separate codegen project. The engineering cost is real but bounded.
JIT fusion across PDL boundaries. Cranelift compilation of surrounding Perl loops that call PDL operations, so the JIT can see through the op and fuse it with the loop body. This is the structural payoff of having a pure-Rust engine - the JIT can look inside - and it is the thing that will make the case for reimplementation in performance terms, not just compatibility terms. Serious work on this is pending the Rayon and GPU threads stabilizing.
No dates. We will post numbers when there are numbers to post.
In the meantime: use PDL; in pperl, today, at
https://perl.petamem.com.
Roughly three thousand assertions of work, about ninety-eight per
cent of them matching the upstream beacon, a double-digit startup
win, and a handful of cases where independent implementation caught
things that a single implementation had missed.
- Richard C. Jelinek, PetaMem s.r.o.