Happy sharing
So you've got a bunch of Perl worker processes and they need to share state. A work queue, a counter, a lookup table - the usual. What do you reach for?
Perl has solid options here, and they've been around for a while. File::Map gives you clean zero-copy access to mmap'd files - substr, index, even regexes run directly on mapped memory. LMDB_File wraps the Lightning Memory-Mapped Database - mature, ACID-compliant, lock-free concurrent readers via MVCC, crash-safe persistence. Hash::SharedMem offers a purpose-built concurrent hash with lock-free reads and atomic copy-on-write updates. Cache::FastMmap is the workhorse for shared caching - mmap-backed pages with per-page fcntl locking, LRU eviction, and optional compression.
These are all good, proven tools. But they have something in common: they're about storage. You put data in, you get data out. They don't give you a queue that consumers can block on. They don't give you a pub/sub channel, a ring buffer, a semaphore, a priority heap, or a lock-free MPMC algorithm. They don't do atomic counters or futex-based blocking with timeouts.
That's the gap the Data::*::Shared family fills - fourteen Perl modules that give you proper, typed, concurrent data structures backed by mmap. Not better storage - concurrent data structures that happen to live in shared memory. Queues, hash maps, pub/sub, stacks, ring buffers, heaps, graphs, sync primitives - the works. All written in XS/C, all designed to work across fork()'d processes with zero serialization overhead.
Let me walk you through what's in the box.
The Approach
Every module in the family uses the same core recipe:
- mmap(MAP_SHARED) for the actual shared memory - no serialization, no copies, just raw memory visible to all processes
- Linux futex for blocking/waiting - when a queue is empty and you want to wait for data, you sleep in the kernel, not in a spin loop
- CAS (compare-and-swap) for lock-free operations where possible - no mutex, no contention, just atomic CPU instructions
- PID-based crash recovery - if a process dies holding a lock, other processes detect the stale PID and recover automatically
Requires Linux (futex, memfd), 64-bit Perl 5.22+. A deliberate tradeoff - portable it isn't, but fast it is.
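The mmap(MAP_SHARED) piece of that recipe is easy to see in miniature. Here's a minimal sketch (in Python rather than XS/C, for brevity) of an anonymous shared mapping surviving fork() - the same mechanism the modules build on, minus the futex and CAS layers:

```python
import mmap
import os
import struct

# Anonymous mapping; MAP_SHARED by default on Unix, inherited across fork().
shm = mmap.mmap(-1, 8)

pid = os.fork()
if pid == 0:                              # child: write an int64 in place
    struct.pack_into("<q", shm, 0, 42)
    os._exit(0)

os.waitpid(pid, 0)                        # parent: wait, then read the same page
value, = struct.unpack_from("<q", shm, 0)
print(value)  # 42
```

Both processes address the same physical pages, so the child's write is immediately visible to the parent - no pipe, no serialization, no copy.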
Three ways to create the backing memory:
# File-backed - persistent, survives restarts
my $q = Data::Queue::Shared::Int->new('/tmp/myq.shm', 1024);
# Anonymous - fork-inherited, no filesystem footprint
my $q = Data::Queue::Shared::Int->new(undef, 1024);
# memfd - passable via Unix socket fd, no filesystem visibility
my $q = Data::Queue::Shared::Int->new_memfd("my_queue", 1024);
The Modules
Here's the full roster, grouped by use case.
Message Passing
Data::Queue::Shared - Your bread-and-butter MPMC (multi-producer, multi-consumer) bounded queue. Integer variants use the Vyukov lock-free algorithm; string variant uses a mutex with a circular arena. Blocking and non-blocking modes, batch operations, the whole deal.
use Data::Queue::Shared;
my $q = Data::Queue::Shared::Int->new(undef, 4096);
# In producer
$q->push(42);
$q->push_multi(1, 2, 3, 4, 5);
# In consumer
my $val = $q->pop_wait(1.5); # block up to 1.5s
my @batch = $q->pop_multi(100);
Single-process throughput: ~5M ops/s for integers. That's roughly 3x MCE::Queue and 6x POSIX message queues.
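For the curious, the core trick of the Vyukov algorithm is a per-cell sequence number that encodes whether a cell is writable or readable on the current lap. Here's a single-threaded Python sketch of the protocol (the real module does the position updates with atomic CAS instructions in C; this just shows the bookkeeping):

```python
class VyukovQueue:
    def __init__(self, capacity):          # capacity must be a power of two
        self.mask = capacity - 1
        self.cells = [{"seq": i, "data": None} for i in range(capacity)]
        self.enq = 0                       # next enqueue position (atomic in C)
        self.deq = 0                       # next dequeue position (atomic in C)

    def push(self, value):
        pos = self.enq
        cell = self.cells[pos & self.mask]
        if cell["seq"] != pos:             # cell not yet consumed: queue full
            return False
        self.enq = pos + 1                 # CAS(enq, pos, pos+1) in the C version
        cell["data"] = value
        cell["seq"] = pos + 1              # publish: cell is now ready to read
        return True

    def pop(self):
        pos = self.deq
        cell = self.cells[pos & self.mask]
        if cell["seq"] != pos + 1:         # nothing published here: queue empty
            return None
        self.deq = pos + 1                 # CAS(deq, pos, pos+1) in the C version
        value = cell["data"]
        cell["seq"] = pos + self.mask + 1  # recycle the cell for the next lap
        return value

q = VyukovQueue(4)
for v in (1, 2, 3):
    q.push(v)
print([q.pop(), q.pop(), q.pop()])  # FIFO order: [1, 2, 3]
```

Because producers and consumers only ever race on their own counter plus one cell's sequence word, contended pushes and pops never touch the same cache line as uncontended ones - that's where the throughput comes from.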
Data::PubSub::Shared - Broadcast pub/sub over a ring buffer. Publishers write, subscribers each track their own cursor. If a subscriber falls behind, it auto-recovers to the oldest available message. No back-pressure on writers.
my $ps = Data::PubSub::Shared::Int->new(undef, 8192);
$ps->publish(42);
my $sub = $ps->subscribe;
my $val = $sub->poll_wait(1.0);
Batch publishing hits ~170M msgs/s for integers. Yes, really. It's just writing to mapped memory.
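The cursor-and-recovery scheme is simple enough to sketch. In this hedged Python model (the real module does this with atomic counters in shared memory), publishers only bump a head counter, and a subscriber that falls off the back of the ring snaps forward instead of stalling the writer:

```python
class Ring:
    def __init__(self, capacity):
        self.cap = capacity
        self.buf = [None] * capacity
        self.head = 0                      # total messages ever published

    def publish(self, value):              # never blocks, never waits on readers
        self.buf[self.head % self.cap] = value
        self.head += 1

class Subscriber:
    def __init__(self, ring):
        self.ring = ring
        self.cursor = ring.head            # each subscriber tracks its own cursor

    def poll(self):
        r = self.ring
        oldest = max(0, r.head - r.cap)
        if self.cursor < oldest:           # overwritten while we slept:
            self.cursor = oldest           # auto-recover to oldest available
        if self.cursor == r.head:
            return None                    # caught up, nothing new
        value = r.buf[self.cursor % r.cap]
        self.cursor += 1
        return value

ring = Ring(4)
sub = Subscriber(ring)
for v in range(10):
    ring.publish(v)
print(sub.poll())  # 6 - messages 0..5 were overwritten; oldest retained is 6
```

No back-pressure by construction: the writer's hot path is one store and one increment, regardless of how many subscribers exist or how far behind they are.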
Data::ReqRep::Shared - Request-response pattern with per-request reply routing. Client acquires a response slot, sends a request carrying the slot ID, server replies to that specific slot. Supports both sync and async client styles.
# Server
my ($request, $id) = $rr->recv_wait(1.0);
$rr->reply($id, "processed: $request");
# Client (async)
my $id = $rr->send("do something");
my $response = $rr->get_wait($id, 2.0);
Around 200K req/s cross-process - competitive with Unix domain sockets but with true MPMC support.
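A rough Python model of the slot-routing idea (the actual module allocates slots with an atomic bitmap and blocks on futexes, none of which is shown here): the client reserves a response slot, the request carries the slot id, and the server writes its reply into exactly that slot, so many clients can share one server without crosstalk.

```python
from collections import deque

class ReqRep:
    def __init__(self, nslots):
        self.free = deque(range(nslots))   # bitmap-allocated in the real module
        self.slots = [None] * nslots       # one response cell per outstanding request
        self.requests = deque()

    def send(self, payload):               # client: reserve a slot, enqueue request
        slot = self.free.popleft()
        self.requests.append((slot, payload))
        return slot

    def recv(self):                        # server: take the next (slot, payload)
        return self.requests.popleft()

    def reply(self, slot, response):       # server: route reply to that slot
        self.slots[slot] = response

    def get(self, slot):                   # client: collect reply, recycle slot
        response, self.slots[slot] = self.slots[slot], None
        self.free.append(slot)
        return response

rr = ReqRep(4)
sid = rr.send("ping")
slot, req = rr.recv()
rr.reply(slot, req.upper())
print(rr.get(sid))  # PING
```

Because replies are addressed by slot rather than by arrival order, a client can collect its answer even when the server finishes other requests first - that's what makes the async style work.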
Key-Value
Data::HashMap::Shared - This is the big one. Concurrent hash map with elastic capacity, optional LRU eviction (clock algorithm with lock-free reads), optional per-key TTL, atomic counters, sharding, cursors. Eleven type variants from II (int-int) to SS (string-string).
use Data::HashMap::Shared::SS;
my $map = Data::HashMap::Shared::SS->new('/tmp/cache.shm', 100_000);
$map->put("user:123", "alice");
my $name = $map->get("user:123");
# LRU cache with max 10K entries
my $cache = Data::HashMap::Shared::SS->new('/tmp/lru.shm', 100_000, 10_000);
# TTL - entries expire after 60 seconds
my $ttl = Data::HashMap::Shared::II->new('/tmp/ttl.shm', 100_000, 0, 60);
# Atomic counter (lock-free fast path under read lock)
$map->incr("hits:page_a");
Cross-process string reads: 3.25M/s. Integer lookups hit ~10M/s. And you get built-in LRU and TTL without an external cache layer.
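The clock algorithm mentioned above is worth a quick sketch. In this simplified Python model (an assumed structure - the real module keeps the referenced bits in shared memory and reads them lock-free), get() marks an entry and the eviction hand sweeps until it finds an unmarked one:

```python
class ClockCache:
    def __init__(self, capacity):
        self.cap = capacity
        self.keys, self.vals, self.ref = [], [], []
        self.hand = 0                      # the clock hand

    def get(self, key):
        i = self.keys.index(key)
        self.ref[i] = 1                    # mark referenced (a plain store in C)
        return self.vals[i]

    def put(self, key, val):
        if len(self.keys) < self.cap:      # still room: no eviction needed
            self.keys.append(key)
            self.vals.append(val)
            self.ref.append(0)
            return
        while self.ref[self.hand]:         # sweep, clearing referenced bits:
            self.ref[self.hand] = 0        # recently-used entries get a second chance
            self.hand = (self.hand + 1) % self.cap
        self.keys[self.hand] = key         # found a cold entry: evict in place
        self.vals[self.hand] = val
        self.ref[self.hand] = 0
        self.hand = (self.hand + 1) % self.cap

c = ClockCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")          # touch "a" so it survives
c.put("c", 3)       # evicts "b", the unreferenced entry
print(sorted(c.keys))  # ['a', 'c']
```

The appeal over strict LRU is exactly what the module exploits: reads only set a bit, so they need no lock and no list reshuffling - all the ordering work happens on the (rarer) eviction path.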
Sequential & Positional
Data::Stack::Shared - Lock-free LIFO stack. Push, pop, peek. ~6.4M ops/s.
Data::Deque::Shared - Double-ended queue. Push/pop from both ends. Lock-free CAS. ~6.3M ops/s.
Data::RingBuffer::Shared - Fixed-size circular buffer that overwrites on wrap. No consumer tracking - you just read by position. Great for metrics windows and rolling logs. ~11.7M writes/s.
Data::Log::Shared - Append-only log. Unlike Queue (consumed on read) or RingBuffer (overwritten), Log retains everything until explicitly truncated. CAS-based append, cursor-based reads. ~8.9M appends/s.
Resource Management
Data::Pool::Shared - Object pool with allocate/free. CAS-based bitmap allocation, typed slots (I64, I32, F64, Str), scope guards for automatic cleanup, raw C pointers for FFI integration. PID-tracked slots are auto-recovered when a process dies.
my $pool = Data::Pool::Shared::I64->new(undef, 256);
my $idx = $pool->alloc;
$pool->set($idx, 42);
# ...
$pool->free($idx);
# Or with auto-cleanup
{
    my $guard = $pool->alloc_guard;
    $pool->set($$guard, 99);
} # auto-freed here
Data::BitSet::Shared - Fixed-size bitset with per-bit atomic CAS operations. Good for flags, membership tracking, allocation bitmaps. ~10.5M ops/s.
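The per-bit addressing is the same everywhere: bit i lives in 64-bit word i >> 6 under mask 1 << (i & 63). A Python sketch of the arithmetic (the real module applies the mask with atomic fetch-OR/fetch-AND instructions, modeled here with plain operators):

```python
class BitSet:
    def __init__(self, nbits):
        self.words = [0] * ((nbits + 63) >> 6)   # 64 bits per word

    def set(self, i):                    # atomic fetch_or in the C version
        old = self.words[i >> 6]
        self.words[i >> 6] = old | (1 << (i & 63))
        return (old >> (i & 63)) & 1     # previous value, i.e. test-and-set

    def clear(self, i):                  # atomic fetch_and with the inverted mask
        self.words[i >> 6] &= ~(1 << (i & 63))

    def test(self, i):
        return (self.words[i >> 6] >> (i & 63)) & 1

b = BitSet(128)
b.set(70)
print(b.test(70), b.test(71))  # 1 0
```

Returning the previous bit value is what makes this useful for allocation bitmaps: a set() that returns 0 means you just won that slot.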
Data::Buffer::Shared - Type-specialized arrays (I8 through F64, plus Str) with atomic per-element access. Seqlock for bulk reads, RW lock for bulk writes. Think shared sensor arrays or metric buffers.
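The seqlock discipline behind those bulk reads is compact enough to sketch: the writer makes a sequence counter odd while it writes and even again afterwards, and a reader only trusts a snapshot taken under an even, unchanged sequence. (A single-threaded Python model; the real version pairs this with memory barriers in C.)

```python
class SeqlockBuffer:
    def __init__(self, data):
        self.seq = 0                 # even: consistent; odd: write in progress
        self.data = list(data)

    def write(self, data):
        self.seq += 1                # go odd: readers will retry
        self.data = list(data)
        self.seq += 1                # back to even: consistent again

    def read(self):
        while True:
            s1 = self.seq
            if s1 & 1:
                continue             # writer active - retry (real code backs off)
            snapshot = list(self.data)
            if self.seq == s1:       # sequence unchanged: snapshot is consistent
                return snapshot

buf = SeqlockBuffer([1, 2, 3])
buf.write([4, 5, 6])
print(buf.read())  # [4, 5, 6]
```

The payoff is that readers never block writers and never take a lock at all - they just pay an occasional retry, which is why bulk reads stay fast under write load.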
Graphs & Priority
Data::Graph::Shared - Directed weighted graph with mutex-protected mutations. Node bitmap pool, adjacency lists, per-node data. ~3.9M node adds/s, ~13.3M lookups/s.
Data::Heap::Shared - Binary min-heap for priority queues. Mutex-protected, futex blocking when empty. ~5.3M pushes/s.
Synchronization Primitives
Data::Sync::Shared - Five cross-process sync primitives in one module: Semaphore, Barrier, RWLock, Condvar, and Once. All futex-based, all with PID-based stale lock recovery, all with scope guards.
use Data::Sync::Shared;
my $sem = Data::Sync::Shared::Semaphore->new(undef, 4); # 4 permits
{
    my $guard = $sem->acquire_guard;
    # at most 4 processes here concurrently
}
my $barrier = Data::Sync::Shared::Barrier->new(undef, $num_workers);
$barrier->wait; # blocks until all workers arrive
my $once = Data::Sync::Shared::Once->new(undef);
if ($once->enter) {
    init_expensive_thing();
    $once->done;
}
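Once is a tiny state machine, which makes it a nice one to sketch. In this Python model (an assumed simplification - the real enter() is a CAS, and losers sleep on a futex until the winner calls done()):

```python
class Once:
    FRESH, RUNNING, DONE = 0, 1, 2

    def __init__(self):
        self.state = Once.FRESH

    def enter(self):
        if self.state == Once.FRESH:     # CAS(FRESH -> RUNNING) in the C version
            self.state = Once.RUNNING
            return True                  # this caller runs the initialization
        return False                     # real code futex-waits here until DONE

    def done(self):
        self.state = Once.DONE           # plus a futex_wake for any waiters

once = Once()
print(once.enter())  # True
once.done()
print(once.enter())  # False
```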
At a Glance
| Module | Pattern | Concurrency | Throughput |
|---|---|---|---|
| Queue::Shared | MPMC queue | lock-free (Int), mutex (Str) | ~5M ops/s |
| PubSub::Shared | broadcast pub/sub | lock-free (Int), mutex (Str) | ~170M/s batched |
| ReqRep::Shared | request-response | lock-free (Int), mutex (Str) | ~200K req/s |
| HashMap::Shared | hash map + LRU/TTL | futex RW lock, seqlock reads | ~10M gets/s |
| Stack::Shared | LIFO stack | lock-free CAS | ~6.4M ops/s |
| Deque::Shared | double-ended queue | lock-free CAS | ~6.3M ops/s |
| RingBuffer::Shared | circular buffer | lock-free CAS | ~11.7M writes/s |
| Log::Shared | append-only log | lock-free CAS | ~8.9M appends/s |
| Pool::Shared | object pool | lock-free bitmap | ~3.3M alloc/s |
| BitSet::Shared | bitset | lock-free CAS | ~10.5M ops/s |
| Buffer::Shared | typed arrays | atomic + seqlock | per-type |
| Graph::Shared | directed graph | mutex | ~13.3M lookups/s |
| Heap::Shared | priority queue | mutex | ~5.3M pushes/s |
| Sync::Shared | sem/barrier/rwlock/condvar/once | futex | - |
Type Specialization
Most modules come in typed variants - Int16, Int32, Int64, Str, and so on. This isn't just for type safety. An Int16 queue uses a quarter of the memory of an Int64 queue, which means four times the cache density on the same hardware. When you're doing millions of operations per second, cache lines matter.
Event Loop Integration
Every module supports eventfd() for integration with event loops like EV, Mojo, or AnyEvent:
my $fd = $q->eventfd;
# register $fd with your event loop
# on readable: $q->eventfd_consume; then poll/pop
Signaling is explicit ($q->notify) so you can batch writes before waking consumers.
Playing Nice with Others: PDL, FFI::Platypus, OpenGL::Modern
One thing I want to highlight is that these aren't isolated islands. Because everything lives in mmap'd memory with known layouts, you get natural interop with other systems that work with raw pointers and packed data.
PDL is the obvious one. If you're doing numerical work in Perl - signal processing, image manipulation, statistics - PDL is your workhorse. The Buffer module's as_scalar returns a zero-copy scalar reference directly over the mmap'd region. Feed that to PDL and you've got an ndarray backed by shared memory:
use Data::Buffer::Shared::F64;
use PDL;
my $buf = Data::Buffer::Shared::F64->new('/tmp/signal.shm', 10000);
# one process fills the buffer with sensor data...
# another process reads it as a PDL:
my $pdl = PDL->new_from_specification(double, 10000);
${$pdl->get_dataref} = ${$buf->as_scalar};
$pdl->upd_data;
printf "mean=%.4f stddev=%.4f\n", $pdl->stats;
For typed arrays you can also use get_raw/set_raw for bulk transfers - a single memcpy under the hood, seqlock-guarded for consistency. That means you can build a multiprocess image pipeline where one process captures frames into a shared U8 buffer, another runs PDL convolutions on it, and a third renders the result - all communicating through shared memory with eventfd notifications, no serialization anywhere.
FFI::Platypus works just as naturally. Pool and Buffer both expose ptr() / data_ptr() - raw C pointers as unsigned integers, ready to hand to any C function through FFI. Need to call libc qsort directly on your shared data? Go ahead:
use Data::Pool::Shared;
use FFI::Platypus;
my $pool = Data::Pool::Shared::I64->new(undef, 1000);
# ... alloc and fill slots ...
my $ffi = FFI::Platypus->new(api => 2);
$ffi->lib(undef); # libc
$ffi->attach([qsort => 'c_qsort'] =>
    ['opaque', 'size_t', 'size_t', '(opaque,opaque)->int'] => 'void');
my $comparator = $ffi->closure(sub { ... }); # compares two int64 slots via their pointers
c_qsort($pool->data_ptr, 1000, 8, $comparator);
# slots are now sorted in-place, visible to all processes
Pool slots are contiguous in memory (data_ptr + idx * elem_size), so any C library that expects a flat array works out of the box.
OpenGL::Modern is where it gets fun. Buffer::F32 is essentially a shared vertex buffer. One process computes positions, another renders them - connected by a shared mmap region and eventfd:
# Compute process:
my $verts = Data::Buffer::Shared::F32->new('/tmp/verts.shm', 30000);
$verts->set_slice(0, @new_positions);
$verts->notify;
# Render process:
my $ref = $verts->as_scalar;
# on eventfd readable:
glBufferSubData_p(GL_ARRAY_BUFFER, 0, $$ref); # zero-copy upload
Pool goes further - it's a natural fit for particle systems. Particles are dynamically spawned (alloc) and despawned (free), each with a fixed-size state struct. A spawner process allocates particles, a physics process updates them, and the renderer uploads the live slots to a VBO via ptr(). The raw pointer goes straight to glBufferSubData_c - no packing, no intermediate copies.
The common thread here is that the data is already in the format the consuming library expects. F32 buffers are packed floats. I64 pools are packed int64s. There's no Perl-side serialization layer to bypass because there was never one to begin with.
Optional Keyword API
If you install XS::Parse::Keyword, several modules expose lexical keywords that bypass Perl method dispatch entirely:
use Data::Queue::Shared;
q_int_push $q, 42;
my $v = q_int_pop $q;
Zero dispatch overhead. The XS function gets called directly. It's optional - the method API works fine - but it's there when you need every last microsecond.
The Big Picture
Here's how the pieces fit together in a typical system:
- Data::Queue::Shared distributes work from producers to a pool of workers
- Data::HashMap::Shared acts as a shared cache or config store that all workers read from
- Data::PubSub::Shared broadcasts events or status updates to whoever's listening
- Data::Sync::Shared coordinates startup (Barrier), limits concurrency (Semaphore), and protects shared initialization (Once)
- Data::Pool::Shared manages reusable resource slots
- Data::RingBuffer::Shared or Data::Log::Shared holds recent metrics or audit trails
All of this running across fork()'d processes, communicating through shared memory at millions of operations per second, no serialization overhead.
Getting Started
Values are typed C scalars or fixed-length strings - no automatic serialization of arbitrary Perl structures. That's by design: raw mmap'd memory is what makes everything fast and FFI-friendly, but it means you won't be sharing hashrefs or blessed objects directly.
All modules follow the same pattern:
use Data::Queue::Shared;
# Pick your backing: file, anonymous, or memfd
my $q = Data::Queue::Shared::Int->new(undef, 4096);
if (fork() == 0) {
    $q->push($$); # child pushes its PID
    exit;
}
my $child_pid = $q->pop_wait(5.0);
say "Child reported in: $child_pid";
The modules are on GitHub under the vividsnow account. Each one has its own repo, test suite, and benchmarks you can run yourself.
If you've ever wished Perl had something like Go's channels and sync primitives but for fork()'d processes - well, now it does. Fourteen of them, actually.
Happy sharing