Happy sharing
So you've got a bunch of Perl worker processes and they need to share state. A work queue, a counter, a lookup table - the usual. What do you reach for?
Perl has solid options here, and they've been around for a while. File::Map gives you clean zero-copy access to mmap'd files - substr, index, even regexes run directly on mapped memory. LMDB_File wraps the Lightning Memory-Mapped Database - mature, ACID-compliant, lock-free concurrent readers via MVCC, crash-safe persistence. Hash::SharedMem offers a purpose-built concurrent hash with lock-free reads and atomic copy-on-write updates. Cache::FastMmap is the workhorse for shared caching - mmap-backed pages with per-page fcntl locking, LRU eviction, and optional compression.
These are all good, proven tools. But they have something in common: they're about storage. You put data in, you get data out. They don't give you a queue that consumers can block on. They don't give you a pub/sub channel, a ring buffer, a semaphore, a priority heap, or a lock-free MPMC algorithm. They don't do atomic counters or futex-based blocking with timeouts.
That's the gap the Data::*::Shared family fills - fourteen Perl modules that give you proper, typed, concurrent data structures backed by mmap. Not better storage - concurrent data structures that happen to live in shared memory. Queues, hash maps, pub/sub, stacks, ring buffers, heaps, graphs, sync primitives - the works. All written in XS/C, all designed to work across fork()'d processes with zero serialization overhead.
Let me walk you through what's in the box.
The Approach
Every module in the family uses the same core recipe:
- mmap(MAP_SHARED) for the actual shared memory - no serialization, no copies, just raw memory visible to all processes
- Linux futex for blocking/waiting - when a queue is empty and you want to wait for data, you sleep in the kernel, not in a spin loop
- CAS (compare-and-swap) for lock-free operations where possible - no mutex, no contention, just atomic CPU instructions
- PID-based crash recovery - if a process dies holding a lock, other processes detect the stale PID and recover automatically
Requires Linux (futex, memfd), 64-bit Perl 5.22+. A deliberate tradeoff - portable it isn't, but fast it is.
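The mmap(MAP_SHARED) piece of that recipe is easy to see in miniature. Here's a minimal sketch (in Python rather than XS/C, for brevity) of an anonymous shared mapping surviving fork() - the same mechanism the modules build on, minus the futex and CAS layers:

```python
import mmap
import os
import struct

# Anonymous mapping; MAP_SHARED by default on Unix, inherited across fork().
shm = mmap.mmap(-1, 8)

pid = os.fork()
if pid == 0:                              # child: write an int64 in place
    struct.pack_into("<q", shm, 0, 42)
    os._exit(0)

os.waitpid(pid, 0)                        # parent: wait, then read the same page
value, = struct.unpack_from("<q", shm, 0)
print(value)  # 42
```

Both processes address the same physical pages, so the child's write is immediately visible to the parent - no pipe, no serialization, no copy.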
Three ways to create the backing memory:
# File-backed - persistent, survives restarts
my $q = Data::Queue::Shared::Int->new('/tmp/myq.shm', 1024);
# Anonymous - fork-inherited, no filesystem footprint
my $q = Data::Queue::Shared::Int->new(undef, 1024);
# memfd - passable via Unix socket fd, no filesystem visibility
my $q = Data::Queue::Shared::Int->new_memfd("my_queue", 1024);
The Modules
Here's the full roster, grouped by use case.
Message Passing
Data::Queue::Shared - Your bread-and-butter MPMC (multi-producer, multi-consumer) bounded queue. Integer variants use the Vyukov lock-free algorithm; string variant uses a mutex with a circular arena. Blocking and non-blocking modes, batch operations, the whole deal.
use Data::Queue::Shared;
my $q = Data::Queue::Shared::Int->new(undef, 4096);
# In producer
$q->push(42);
$q->push_multi(1, 2, 3, 4, 5);
# In consumer
my $val = $q->pop_wait(1.5); # block up to 1.5s
my @batch = $q->pop_multi(100);
Single-process throughput: ~5M ops/s for integers. That's roughly 3x MCE::Queue and 6x POSIX message queues.
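For the curious, the core trick of the Vyukov algorithm is a per-cell sequence number that encodes whether a cell is writable or readable on the current lap. Here's a single-threaded Python sketch of the protocol (the real module does the position updates with atomic CAS instructions in C; this just shows the bookkeeping):

```python
class VyukovQueue:
    def __init__(self, capacity):          # capacity must be a power of two
        self.mask = capacity - 1
        self.cells = [{"seq": i, "data": None} for i in range(capacity)]
        self.enq = 0                       # next enqueue position (atomic in C)
        self.deq = 0                       # next dequeue position (atomic in C)

    def push(self, value):
        pos = self.enq
        cell = self.cells[pos & self.mask]
        if cell["seq"] != pos:             # cell not yet consumed: queue full
            return False
        self.enq = pos + 1                 # CAS(enq, pos, pos+1) in the C version
        cell["data"] = value
        cell["seq"] = pos + 1              # publish: cell is now ready to read
        return True

    def pop(self):
        pos = self.deq
        cell = self.cells[pos & self.mask]
        if cell["seq"] != pos + 1:         # nothing published here: queue empty
            return None
        self.deq = pos + 1                 # CAS(deq, pos, pos+1) in the C version
        value = cell["data"]
        cell["seq"] = pos + self.mask + 1  # recycle the cell for the next lap
        return value

q = VyukovQueue(4)
for v in (1, 2, 3):
    q.push(v)
print([q.pop(), q.pop(), q.pop()])  # FIFO order: [1, 2, 3]
```

Because producers and consumers only ever race on their own counter plus one cell's sequence word, contended pushes and pops never touch the same cache line as uncontended ones - that's where the throughput comes from.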
Data::PubSub::Shared - Broadcast pub/sub over a ring buffer. Publishers write, subscribers each track their own cursor. If a subscriber falls behind, it auto-recovers to the oldest available message. No back-pressure on writers.
my $ps = Data::PubSub::Shared::Int->new(undef, 8192);
$ps->publish(42);
my $sub = $ps->subscribe;
my $val = $sub->poll_wait(1.0);
Batch publishing hits ~170M msgs/s for integers. Yes, really. It's just writing to mapped memory.
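The cursor-and-recovery scheme is simple enough to sketch. In this hedged Python model (the real module does this with atomic counters in shared memory), publishers only bump a head counter, and a subscriber that falls off the back of the ring snaps forward instead of stalling the writer:

```python
class Ring:
    def __init__(self, capacity):
        self.cap = capacity
        self.buf = [None] * capacity
        self.head = 0                      # total messages ever published

    def publish(self, value):              # never blocks, never waits on readers
        self.buf[self.head % self.cap] = value
        self.head += 1

class Subscriber:
    def __init__(self, ring):
        self.ring = ring
        self.cursor = ring.head            # each subscriber tracks its own cursor

    def poll(self):
        r = self.ring
        oldest = max(0, r.head - r.cap)
        if self.cursor < oldest:           # overwritten while we slept:
            self.cursor = oldest           # auto-recover to oldest available
        if self.cursor == r.head:
            return None                    # caught up, nothing new
        value = r.buf[self.cursor % r.cap]
        self.cursor += 1
        return value

ring = Ring(4)
sub = Subscriber(ring)
for v in range(10):
    ring.publish(v)
print(sub.poll())  # 6 - messages 0..5 were overwritten; oldest retained is 6
```

No back-pressure by construction: the writer's hot path is one store and one increment, regardless of how many subscribers exist or how far behind they are.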
Data::ReqRep::Shared - Request-response pattern with per-request reply routing. Client acquires a response slot, sends a request carrying the slot ID, server replies to that specific slot. Supports both sync and async client styles.
# Server
my ($request, $id) = $rr->recv_wait(1.0);
$rr->reply($id, "processed: $request");
# Client (async)
my $id = $rr->send("do something");
my $response = $rr->get_wait($id, 2.0);
Around 200K req/s cross-process - competitive with Unix domain sockets but with true MPMC support.
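A rough Python model of the slot-routing idea (the actual module allocates slots with an atomic bitmap and blocks on futexes, none of which is shown here): the client reserves a response slot, the request carries the slot id, and the server writes its reply into exactly that slot, so many clients can share one server without crosstalk.

```python
from collections import deque

class ReqRep:
    def __init__(self, nslots):
        self.free = deque(range(nslots))   # bitmap-allocated in the real module
        self.slots = [None] * nslots       # one response cell per outstanding request
        self.requests = deque()

    def send(self, payload):               # client: reserve a slot, enqueue request
        slot = self.free.popleft()
        self.requests.append((slot, payload))
        return slot

    def recv(self):                        # server: take the next (slot, payload)
        return self.requests.popleft()

    def reply(self, slot, response):       # server: route reply to that slot
        self.slots[slot] = response

    def get(self, slot):                   # client: collect reply, recycle slot
        response, self.slots[slot] = self.slots[slot], None
        self.free.append(slot)
        return response

rr = ReqRep(4)
sid = rr.send("ping")
slot, req = rr.recv()
rr.reply(slot, req.upper())
print(rr.get(sid))  # PING
```

Because replies are addressed by slot rather than by arrival order, a client can collect its answer even when the server finishes other requests first - that's what makes the async style work.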
Key-Value
Data::HashMap::Shared - This is the big one. Concurrent hash map with elastic capacity, optional LRU eviction (clock algorithm with lock-free reads), optional per-key TTL, atomic counters, sharding, cursors. Eleven type variants from II (int-int) to SS (string-string).
use Data::HashMap::Shared::SS;
my $map = Data::HashMap::Shared::SS->new('/tmp/cache.shm', 100_000);
$map->put("user:123", "alice");
my $name = $map->get("user:123");
# LRU cache with max 10K entries
my $cache = Data::HashMap::Shared::SS->new('/tmp/lru.shm', 100_000, 10_000);
# TTL - entries expire after 60 seconds
my $ttl = Data::HashMap::Shared::II->new('/tmp/ttl.shm', 100_000, 0, 60);
# Atomic counter (lock-free fast path under read lock)
$map->incr("hits:page_a");
Cross-process string reads: 3.25M/s. Integer lookups hit ~10M/s. And you get built-in LRU and TTL without an external cache layer.
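The clock algorithm mentioned above is worth a quick sketch. In this simplified Python model (an assumed structure - the real module keeps the referenced bits in shared memory and reads them lock-free), get() marks an entry and the eviction hand sweeps until it finds an unmarked one:

```python
class ClockCache:
    def __init__(self, capacity):
        self.cap = capacity
        self.keys, self.vals, self.ref = [], [], []
        self.hand = 0                      # the clock hand

    def get(self, key):
        i = self.keys.index(key)
        self.ref[i] = 1                    # mark referenced (a plain store in C)
        return self.vals[i]

    def put(self, key, val):
        if len(self.keys) < self.cap:      # still room: no eviction needed
            self.keys.append(key)
            self.vals.append(val)
            self.ref.append(0)
            return
        while self.ref[self.hand]:         # sweep, clearing referenced bits:
            self.ref[self.hand] = 0        # recently-used entries get a second chance
            self.hand = (self.hand + 1) % self.cap
        self.keys[self.hand] = key         # found a cold entry: evict in place
        self.vals[self.hand] = val
        self.ref[self.hand] = 0
        self.hand = (self.hand + 1) % self.cap

c = ClockCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")          # touch "a" so it survives
c.put("c", 3)       # evicts "b", the unreferenced entry
print(sorted(c.keys))  # ['a', 'c']
```

The appeal over strict LRU is exactly what the module exploits: reads only set a bit, so they need no lock and no list reshuffling - all the ordering work happens on the (rarer) eviction path.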
Sequential & Positional
Data::Stack::Shared - Lock-free LIFO stack. Push, pop, peek. ~6.4M ops/s.
Data::Deque::Shared - Double-ended queue. Push/pop from both ends. Lock-free CAS. ~6.3M ops/s.
Data::RingBuffer::Shared - Fixed-size circular buffer that overwrites on wrap. No consumer tracking - you just read by position. Great for metrics windows and rolling logs. ~11.7M writes/s.
Data::Log::Shared - Append-only log. Unlike Queue (consumed on read) or RingBuffer (overwritten), Log retains everything until explicitly truncated. CAS-based append, cursor-based reads. ~8.9M appends/s.
Resource Management
Data::Pool::Shared - Object pool with allocate/free. CAS-based bitmap allocation, typed slots (I64, I32, F64, Str), scope guards for automatic cleanup, raw C pointers for FFI integration. PID-tracked slots are auto-recovered when a process dies.
my $pool = Data::Pool::Shared::I64->new(undef, 256);
my $idx = $pool->alloc;
$pool->set($idx, 42);
# ...
$pool->free($idx);
# Or with auto-cleanup
{
    my $guard = $pool->alloc_guard;
    $pool->set($$guard, 99);
} # auto-freed here
Data::BitSet::Shared - Fixed-size bitset with per-bit atomic CAS operations. Good for flags, membership tracking, allocation bitmaps. ~10.5M ops/s.
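The per-bit addressing is the same everywhere: bit i lives in 64-bit word i >> 6 under mask 1 << (i & 63). A Python sketch of the arithmetic (the real module applies the mask with atomic fetch-OR/fetch-AND instructions, modeled here with plain operators):

```python
class BitSet:
    def __init__(self, nbits):
        self.words = [0] * ((nbits + 63) >> 6)   # 64 bits per word

    def set(self, i):                    # atomic fetch_or in the C version
        old = self.words[i >> 6]
        self.words[i >> 6] = old | (1 << (i & 63))
        return (old >> (i & 63)) & 1     # previous value, i.e. test-and-set

    def clear(self, i):                  # atomic fetch_and with the inverted mask
        self.words[i >> 6] &= ~(1 << (i & 63))

    def test(self, i):
        return (self.words[i >> 6] >> (i & 63)) & 1

b = BitSet(128)
b.set(70)
print(b.test(70), b.test(71))  # 1 0
```

Returning the previous bit value is what makes this useful for allocation bitmaps: a set() that returns 0 means you just won that slot.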
Data::Buffer::Shared - Type-specialized arrays (I8 through F64, plus Str) with atomic per-element access. Seqlock for bulk reads, RW lock for bulk writes. Think shared sensor arrays or metric buffers.
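The seqlock discipline behind those bulk reads is compact enough to sketch: the writer makes a sequence counter odd while it writes and even again afterwards, and a reader only trusts a snapshot taken under an even, unchanged sequence. (A single-threaded Python model; the real version pairs this with memory barriers in C.)

```python
class SeqlockBuffer:
    def __init__(self, data):
        self.seq = 0                 # even: consistent; odd: write in progress
        self.data = list(data)

    def write(self, data):
        self.seq += 1                # go odd: readers will retry
        self.data = list(data)
        self.seq += 1                # back to even: consistent again

    def read(self):
        while True:
            s1 = self.seq
            if s1 & 1:
                continue             # writer active - retry (real code backs off)
            snapshot = list(self.data)
            if self.seq == s1:       # sequence unchanged: snapshot is consistent
                return snapshot

buf = SeqlockBuffer([1, 2, 3])
buf.write([4, 5, 6])
print(buf.read())  # [4, 5, 6]
```

The payoff is that readers never block writers and never take a lock at all - they just pay an occasional retry, which is why bulk reads stay fast under write load.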
Graphs & Priority
Data::Graph::Shared - Directed weighted graph with mutex-protected mutations. Node bitmap pool, adjacency lists, per-node data. ~3.9M node adds/s, ~13.3M lookups/s.
Data::Heap::Shared - Binary min-heap for priority queues. Mutex-protected, futex blocking when empty. ~5.3M pushes/s.
Synchronization Primitives
Data::Sync::Shared - Five cross-process sync primitives in one module: Semaphore, Barrier, RWLock, Condvar, and Once. All futex-based, all with PID-based stale lock recovery, all with scope guards.
use Data::Sync::Shared;
my $sem = Data::Sync::Shared::Semaphore->new(undef, 4); # 4 permits
{
    my $guard = $sem->acquire_guard;
    # at most 4 processes here concurrently
}
my $barrier = Data::Sync::Shared::Barrier->new(undef, $num_workers);
$barrier->wait; # blocks until all workers arrive
my $once = Data::Sync::Shared::Once->new(undef);
if ($once->enter) {
    init_expensive_thing();
    $once->done;
}
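Once is a tiny state machine, which makes it a nice one to sketch. In this Python model (an assumed simplification - the real enter() is a CAS, and losers sleep on a futex until the winner calls done()):

```python
class Once:
    FRESH, RUNNING, DONE = 0, 1, 2

    def __init__(self):
        self.state = Once.FRESH

    def enter(self):
        if self.state == Once.FRESH:     # CAS(FRESH -> RUNNING) in the C version
            self.state = Once.RUNNING
            return True                  # this caller runs the initialization
        return False                     # real code futex-waits here until DONE

    def done(self):
        self.state = Once.DONE           # plus a futex_wake for any waiters

once = Once()
print(once.enter())  # True
once.done()
print(once.enter())  # False
```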
At a Glance
| Module | Pattern | Concurrency | Throughput |
|---|---|---|---|
| Queue::Shared | MPMC queue | lock-free (Int), mutex (Str) | ~5M ops/s |
| PubSub::Shared | broadcast pub/sub | lock-free (Int), mutex (Str) | ~170M/s batched |
| ReqRep::Shared | request-response | lock-free (Int), mutex (Str) | ~200K req/s |
| HashMap::Shared | hash map + LRU/TTL | futex RW lock, seqlock reads | ~10M gets/s |
| Stack::Shared | LIFO stack | lock-free CAS | ~6.4M ops/s |
| Deque::Shared | double-ended queue | lock-free CAS | ~6.3M ops/s |
| RingBuffer::Shared | circular buffer | lock-free CAS | ~11.7M writes/s |
| Log::Shared | append-only log | lock-free CAS | ~8.9M appends/s |
| Pool::Shared | object pool | lock-free bitmap | ~3.3M alloc/s |
| BitSet::Shared | bitset | lock-free CAS | ~10.5M ops/s |
| Buffer::Shared | typed arrays | atomic + seqlock | per-type |
| Graph::Shared | directed graph | mutex | ~13.3M lookups/s |
| Heap::Shared | priority queue | mutex | ~5.3M pushes/s |
| Sync::Shared | sem/barrier/rwlock/condvar/once | futex | - |
Type Specialization
Most modules come in typed variants - Int16, Int32, Int64, Str, and so on. This isn't just for type safety. An Int16 queue uses a quarter of the memory of an Int64 queue, which means four times the cache density on the same hardware. When you're doing millions of operations per second, cache lines matter.
Event Loop Integration
Every module supports eventfd() for integration with event loops like EV, Mojo, or AnyEvent:
my $fd = $q->eventfd;
# register $fd with your event loop
# on readable: $q->eventfd_consume; then poll/pop
Signaling is explicit ($q->notify) so you can batch writes before waking consumers.
Playing Nice with Others: PDL, FFI::Platypus, OpenGL::Modern
One thing I want to highlight is that these aren't isolated islands. Because everything lives in mmap'd memory with known layouts, you get natural interop with other systems that work with raw pointers and packed data.
PDL is the obvious one. If you're doing numerical work in Perl - signal processing, image manipulation, statistics - PDL is your workhorse. The Buffer module's as_scalar returns a zero-copy scalar reference directly over the mmap'd region. Feed that to PDL and you've got an ndarray backed by shared memory:
use Data::Buffer::Shared::F64;
use PDL;
my $buf = Data::Buffer::Shared::F64->new('/tmp/signal.shm', 10000);
# one process fills the buffer with sensor data...
# another process reads it as a PDL:
my $pdl = PDL->new_from_specification(double, 10000);
${$pdl->get_dataref} = ${$buf->as_scalar};
$pdl->upd_data;
printf "mean=%.4f stddev=%.4f\n", $pdl->stats;
For typed arrays you can also use get_raw/set_raw for bulk transfers - a single memcpy under the hood, seqlock-guarded for consistency. That means you can build a multiprocess image pipeline where one process captures frames into a shared U8 buffer, another runs PDL convolutions on it, and a third renders the result - all communicating through shared memory with eventfd notifications, no serialization anywhere.
FFI::Platypus works just as naturally. Pool and Buffer both expose ptr() / data_ptr() - raw C pointers as unsigned integers, ready to hand to any C function through FFI. Need to call libc qsort directly on your shared data? Go ahead:
use Data::Pool::Shared;
use FFI::Platypus;
my $pool = Data::Pool::Shared::I64->new(undef, 1000);
# ... alloc and fill slots ...
my $ffi = FFI::Platypus->new(api => 2);
$ffi->lib(undef); # libc
$ffi->attach([qsort => 'c_qsort'] =>
    ['opaque', 'size_t', 'size_t', '(opaque,opaque)->int'] => 'void');
my $comparator = $ffi->closure(sub { ... }); # compares two int64 slots via their pointers
c_qsort($pool->data_ptr, 1000, 8, $comparator);
# slots are now sorted in-place, visible to all processes
Pool slots are contiguous in memory (data_ptr + idx * elem_size), so any C library that expects a flat array works out of the box.
OpenGL::Modern is where it gets fun. Buffer::F32 is essentially a shared vertex buffer. One process computes positions, another renders them - connected by a shared mmap region and eventfd:
# Compute process:
my $verts = Data::Buffer::Shared::F32->new('/tmp/verts.shm', 30000);
$verts->set_slice(0, @new_positions);
$verts->notify;
# Render process:
my $ref = $verts->as_scalar;
# on eventfd readable:
glBufferSubData_p(GL_ARRAY_BUFFER, 0, $$ref); # zero-copy upload
Pool goes further - it's a natural fit for particle systems. Particles are dynamically spawned (alloc) and despawned (free), each with a fixed-size state struct. A spawner process allocates particles, a physics process updates them, and the renderer uploads the live slots to a VBO via ptr(). The raw pointer goes straight to glBufferSubData_c - no packing, no intermediate copies.
The common thread here is that the data is already in the format the consuming library expects. F32 buffers are packed floats. I64 pools are packed int64s. There's no Perl-side serialization layer to bypass because there was never one to begin with.
Optional Keyword API
If you install XS::Parse::Keyword, several modules expose lexical keywords that bypass Perl method dispatch entirely:
use Data::Queue::Shared;
q_int_push $q, 42;
my $v = q_int_pop $q;
Zero dispatch overhead. The XS function gets called directly. It's optional - the method API works fine - but it's there when you need every last microsecond.
The Big Picture
Here's how the pieces fit together in a typical system:
- Data::Queue::Shared distributes work from producers to a pool of workers
- Data::HashMap::Shared acts as a shared cache or config store that all workers read from
- Data::PubSub::Shared broadcasts events or status updates to whoever's listening
- Data::Sync::Shared coordinates startup (Barrier), limits concurrency (Semaphore), and protects shared initialization (Once)
- Data::Pool::Shared manages reusable resource slots
- Data::RingBuffer::Shared or Data::Log::Shared holds recent metrics or audit trails
All of this running across fork()'d processes, communicating through shared memory at millions of operations per second, no serialization overhead.
Getting Started
Values are typed C scalars or fixed-length strings - no automatic serialization of arbitrary Perl structures. That's by design: raw mmap'd memory is what makes everything fast and FFI-friendly, but it means you won't be sharing hashrefs or blessed objects directly.
All modules follow the same pattern:
use Data::Queue::Shared;
# Pick your backing: file, anonymous, or memfd
my $q = Data::Queue::Shared::Int->new(undef, 4096);
if (fork() == 0) {
    $q->push($$); # child pushes its PID
    exit;
}
my $child_pid = $q->pop_wait(5.0);
say "Child reported in: $child_pid";
The modules are on GitHub under the vividsnow account. Each one has its own repo, test suite, and benchmarks you can run yourself.
If you've ever wished Perl had something like Go's channels and sync primitives but for fork()'d processes - well, now it does. Fourteen of them, actually.
Happy sharing