You decided to go fast without locks. Great choice—until a crash dump shows a freed node being dereferenced by a different thread, and your “obvious” fix with volatile doesn’t fix anything. The truth is simple and sharp: lock-free is not just about atomics and compare_exchange. It’s also about safe memory reclamation. If you pop a node while another thread can still see it, you must not free it yet—or you invite use-after-free and the notorious ABA problem.
This post builds up the foundations you actually need to ship lock-free structures in C: what “lock-free” means, why ABA appears in the first place, how pointer tagging helps (sometimes), and why production codebases standardize on either hazard pointers or epoch-based reclamation to free memory safely. We’ll write real C11 code, reason with the memory model, and keep the vibe practical.
What “lock-free” actually promises
Progress guarantees, in one breath:
- Obstruction-free: a thread makes progress in isolation.
- Lock-free: the system makes progress overall—at least one thread completes in a finite number of steps.
- Wait-free: every thread completes in a bounded number of its own steps.
Most classic stacks/queues you’ll hand-roll in C target lock-free. We’ll implement them with C11 atomics and make memory ordering explicit so the compiler and CPU cooperate.
A minimal lock-free stack (Treiber) without reclamation
We’ll start with the canonical Treiber stack. It works—and it leaks—because we won’t reclaim memory yet. That’s deliberate: separating logic from reclamation makes each part provable.
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
struct node {
int value;
struct node *next;
};
// Head of the stack; atomic pointer so CAS is well-defined
static _Atomic(struct node *) head = ATOMIC_VAR_INIT(NULL);
void push(struct node *n) {
struct node *old = atomic_load_explicit(&head, memory_order_relaxed);
do {
n->next = old; // prepare link to current head
} while (!atomic_compare_exchange_weak_explicit(
&head, &old, n,
memory_order_acq_rel, // success: publish n and order with readers
memory_order_acquire)); // failure: observe updated head before retry
}
// Returns a node the caller owns. Caller MUST NOT free it yet (reclamation to come).
struct node *pop(void) {
struct node *old = atomic_load_explicit(&head, memory_order_acquire);
for (;;) {
if (old == NULL) return NULL; // empty
struct node *next = old->next; // requires acquire-load of head
if (atomic_compare_exchange_weak_explicit(
&head, &old, next,
memory_order_acq_rel, // success: detach node
memory_order_acquire)) // failure: head changed; reload old
return old;
// CAS failed: old has been updated with the current head; loop
}
}
Notes:
- The acquire load of head and the acq_rel CAS ensure that when a consumer sees a node, its fields are fully initialized by the producer that linked it.
- We intentionally “leak” popped nodes for now. Freeing here would be a race: another thread may still be reading the old node it fetched before our CAS.
Why reclamation is hard in lock-free code
With coarse locks, freeing is easy: you hold a lock that excludes readers. In lock-free land, other threads can read old at any time between your load and CAS—even after you’ve detached it—so freeing immediately can produce a use-after-free when they dereference old->next later. We need a protocol to decide “no thread can possibly hold a reference to this node anymore.”
Two families dominate:
- Hazard pointers: readers publish the pointers they are about to touch; writers only free nodes that are not listed as “hazardous.”
- Epoch-based reclamation: time is divided into epochs; readers mark entry/exit from quiescent states; writers free nodes only after at least two epoch advances.
Before we get there, we must understand the ABA problem that poisons naive CAS designs.
The ABA problem (and why CAS can lie)
CAS checks “is the word still equal to what I saw?” If yes, it installs the new value. But equality of bits does not imply equality of meaning.
Classic scenario on our Treiber stack:
- Thread T1 reads head = A.
- T1 is descheduled.
- Thread T2 pops A (head becomes B) and frees A.
- Memory allocator reuses A’s address for a new node A′ and pushes it: head becomes A again.
- T1 wakes up, sees head still equals A (bitwise), and CAS succeeds, linking its node after A′. The list is now corrupt, and you’ll eventually crash.
We can mitigate ABA by making the pointer value change monotonically—even when the same address is recycled—typically by tagging pointers with a small counter. That helps single-word CAS detect reuse, but it’s not a complete memory reclamation strategy.
Pointer tagging: a quick primer
If your nodes are aligned (they are), the low 1–3 bits of pointers are zero. You can store a small counter in those bits and increment on each update. Comparing the tagged word catches A→B→A′ cycles.
#include <inttypes.h>
struct tagged_ptr {
uintptr_t raw; // [ .. pointer .. | tag_low_bits ]
};
static inline struct tagged_ptr make_tagged(struct node *p, unsigned tag) {
return (struct tagged_ptr){ .raw = ((uintptr_t)p) | (tag & 0x3u) };
}
static inline struct node *tp_ptr(struct tagged_ptr t) {
return (struct node *)(t.raw & ~((uintptr_t)0x3u));
}
static inline unsigned tp_tag(struct tagged_ptr t) { return (unsigned)(t.raw & 0x3u); }
static inline struct tagged_ptr tp_bump(struct tagged_ptr t, struct node *newp) {
return make_tagged(newp, (tp_tag(t) + 1u) & 0x3u);
}
A tagged CAS uses _Atomic(uintptr_t) instead of _Atomic(struct node*). This defeats many ABA cases without widening the head to two words. Caveats: a 2-bit tag can still wrap; increase the tag width if alignment allows. Tagging helps correctness but does not decide the safety of free()—you still need hazard pointers or epochs to know when reclamation is safe.
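To make that concrete, here is a minimal sketch of a pop over a tagged head, assuming the struct node and tagged_ptr helpers above (tagged_head and pop_tagged are illustrative names, not part of the earlier code):
// Illustrative tagged-head pop. The head is a single word: pointer bits plus a
// small tag that is bumped on every successful update, so a recycled address
// no longer compares equal.
static _Atomic(uintptr_t) tagged_head = ATOMIC_VAR_INIT(0); // NULL pointer, tag 0

struct node *pop_tagged(void) {
    uintptr_t old = atomic_load_explicit(&tagged_head, memory_order_acquire);
    for (;;) {
        struct tagged_ptr cur = { .raw = old };
        struct node *p = tp_ptr(cur);
        if (!p) return NULL;                         // empty stack
        struct node *next = p->next;                 // still needs HP/EBR to be safe
        struct tagged_ptr want = tp_bump(cur, next); // new pointer, tag + 1
        if (atomic_compare_exchange_weak_explicit(
                &tagged_head, &old, want.raw,
                memory_order_acq_rel, memory_order_acquire))
            return p;                                // a reused address with a stale tag fails the CAS
        // CAS failed: old now holds the current tagged word; retry
    }
}
A push would bump the tag the same way. Note that the old->next read is still only safe once a reclamation scheme guarantees the node cannot have been freed.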
The concrete ABA timeline above points to a twofold fix: make equality checks robust (tagging) and defer reclamation until no readers can hold the pointer anymore (hazard pointers or epochs).
Memory ordering we’ll rely on
Lock-free structures live or die on ordering edges. We’ll use:
- memory_order_release when publishing a new head, so readers that see it (via acquire) also see the initialized node fields.
- memory_order_acquire when consuming the head, so subsequent reads (e.g., n->next, payload) observe what the publisher wrote.
- memory_order_acq_rel on a successful CAS, to both consume the old state and publish the new.
- For the failure order of compare_exchange_*, memory_order_acquire (or relaxed if you do a separate load) is typically enough.
This mirrors the producer/consumer handoff pattern and keeps us out of undefined-behavior territory.
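Stripped of the stack, that handoff looks like this. A minimal sketch with illustrative names (msg, ready): the release store publishes msg, and an acquire load that observes ready == 1 is guaranteed to also observe msg == 42.
#include <stdatomic.h>

static int msg;                                      // plain data, written before publication
static _Atomic int ready = ATOMIC_VAR_INIT(0);

void producer_publish(void) {
    msg = 42;                                                 // A: write payload
    atomic_store_explicit(&ready, 1, memory_order_release);   // B: publish; A is visible to observers of B
}

int consumer_try_read(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire))   // synchronizes-with B
        return msg;                                           // guaranteed to read 42
    return -1;                                                // not published yet
}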
Why volatile is not the fix
volatile talks to the compiler about I/O-like side effects. It does not make operations atomic, it does not create inter-thread happens-before edges, and it does not fix ABA. Use <stdatomic.h> and explicit memory orders; treat volatile as a tool for MMIO, not concurrency.
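A small contrast, with illustrative names: a stop flag shared between threads must be _Atomic, not volatile.
#include <stdatomic.h>
#include <stdbool.h>

volatile bool stop_v = false;                  // WRONG for threads: not atomic, no ordering;
                                               // concurrent access is a data race (UB in C11)

_Atomic bool stop_a = ATOMIC_VAR_INIT(false);  // OK: atomic, with explicit orders

void worker(void) {
    while (!atomic_load_explicit(&stop_a, memory_order_acquire)) {
        /* do work */
    }
}

void controller(void) {
    atomic_store_explicit(&stop_a, true, memory_order_release);
}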
The reclamation contract (what we must prove before free)
When we detach a node X from a structure, freeing X is safe only when both of the following hold:
1. No thread can still hold a pointer to X from a prior read.
2. No thread can obtain a pointer to X in the future from the shared structure (i.e., X is no longer reachable).
Pointer tagging helps ensure condition (2) isn’t subverted by ABA (we won’t “re-see” an old bit pattern by accident). Hazard pointers and epochs are strategies to prove (1) without global locks:
- Hazard pointers: readers publish the exact pointers they will dereference. Writers retire nodes and later scan the global hazard set; any node not listed is safe to free.
- Epoch-based reclamation: readers enter an epoch while they may hold raw pointers; writers free nodes after two epoch advances, guaranteeing all prior readers have exited.
We’ll implement both patterns on top of the Treiber stack and compare their ergonomics and performance.
Testable baseline: Treiber stack with a retire list (no frees yet)
Let’s establish a baseline “retire but don’t free” path; we’ll plug hazard/epoch policies into try_reclaim() later.
#include <stdatomic.h>
#include <stdlib.h>
struct retire_node { struct node *n; struct retire_node *next; };
static _Atomic(struct retire_node *) retire_head = ATOMIC_VAR_INIT(NULL);
static void retire(struct node *n) {
struct retire_node *r = malloc(sizeof *r);
r->n = n; r->next = NULL;
struct retire_node *old = atomic_load_explicit(&retire_head, memory_order_relaxed);
do { r->next = old; } while (!atomic_compare_exchange_weak_explicit(
&retire_head, &old, r, memory_order_release, memory_order_relaxed));
}
// Placeholder: hazard/epoch policies will scan retire list and free safe nodes.
static void try_reclaim(void) {
// no-op for now
}
struct node *safe_pop(void) {
struct node *x = pop();
if (x) { retire(x); try_reclaim(); }
return x;
}
Why this shape works for both approaches:
- Writers “retire” nodes immediately after detachment.
- A background or opportunistic try_reclaim() applies the policy: either scan hazards (hazard pointers) or check epochs and batch-free (epochs).
What’s next in the design
From here, the two reclamation families diverge in reader/writer responsibilities:
- Hazard pointers push work to readers (publish before deref; clear after). Writers scan retire lists against hazards. Low latency under contention but more per-read overhead and memory for hazard slots.
- Epoch-based reclamation keeps readers light (enter/exit epoch) and batches frees. Great throughput and simple fast-path; pauses can spike if epochs cannot advance (e.g., stalled threads).
We’ll structure the APIs so you can switch policies without touching the stack/queue logic.
A quick benchmarking lens
When we compare policies, measure both throughput and tail-latency:
- Throughput: pushes/pops per second at many threads.
- Pause behavior: p99 of individual pop latency under churn.
- Memory footprint: average and p99 of retired-but-not-freed nodes.
Microbenchmarks should pin threads, pre-allocate nodes to avoid allocator bias, and exercise contention patterns (SPSC, MPMC bursts). Keep the core invariant: correctness first, then speed.
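For example, on Linux you might pin each benchmark thread before it starts measuring. A sketch; pthread_setaffinity_np is a GNU extension, so this is not portable C11:
// Pin the calling thread to a given CPU (Linux, GNU extension).
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Returns 0 on success, an error number on failure
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}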
Hazard pointers in practice: a minimal, portable implementation
We’ll build a compact hazard-pointer manager you can drop into a C11 project. It assumes a bounded number of threads and allocates each thread a fixed number of hazard slots. Readers publish the pointer they intend to dereference, re-validate, then proceed. Writers retire nodes and occasionally scan the retire list against the global hazard set to free what’s safe.
Configuration and global state
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <limits.h>
// Tunables
#define HP_MAX_THREADS 128 // maximum threads that will call hp_thread_init
#define HP_SLOTS_PER_T 2 // hazards per thread (1–3 typical for stacks/queues)
#define HP_SCAN_THRESHOLD 64 // retired nodes to accumulate before a scan
// Global hazard table: each thread owns a contiguous block of slots
static _Atomic(struct node *) g_hazards[HP_MAX_THREADS * HP_SLOTS_PER_T];
// Monotonic thread id assignment
static _Atomic(unsigned) g_next_tid = ATOMIC_VAR_INIT(0);
// Thread-local bookkeeping
static _Thread_local unsigned hp_tid = UINT_MAX; // uninitialized sentinel
static _Thread_local unsigned hp_retired_since_scan = 0;
static inline unsigned hp_base_index(void) { return hp_tid * HP_SLOTS_PER_T; }
static inline _Atomic(struct node *) *hp_slot(unsigned i) {
return &g_hazards[hp_base_index() + i];
}
// Initialize hazard slots to NULL once per process (optional hygiene)
static void hp_global_init(void) {
for (size_t i = 0; i < (size_t)HP_MAX_THREADS * HP_SLOTS_PER_T; ++i) {
atomic_store_explicit(&g_hazards[i], NULL, memory_order_relaxed);
}
}
Thread registration and slot helpers
// Must be called by each participating thread before using hazard pointers
static bool hp_thread_init(void) {
if (hp_tid != UINT_MAX) return true; // already initialized
unsigned id = atomic_fetch_add_explicit(&g_next_tid, 1, memory_order_relaxed);
if (id >= HP_MAX_THREADS) return false; // out of slots
hp_tid = id;
for (unsigned i = 0; i < HP_SLOTS_PER_T; ++i) {
atomic_store_explicit(hp_slot(i), NULL, memory_order_relaxed);
}
return true;
}
static inline void hp_set(unsigned i, struct node *p) {
    // Release makes the publication visible to scanners that acquire-load.
    // Note: a release store does not order against the *later* load that
    // re-validates the shared pointer; callers add a seq_cst fence (see pop_hp).
    atomic_store_explicit(hp_slot(i), p, memory_order_release);
}
static inline void hp_clear(unsigned i) {
atomic_store_explicit(hp_slot(i), NULL, memory_order_release);
}
Retire and scan with hazard comparison
We’ll reuse the retire_head global list from earlier and implement a non-blocking scan: atomically detach the list, compare it against the current hazard set, free what’s not protected, and push the survivors back.
// From the earlier baseline, assumed visible in this translation unit:
//   struct retire_node { struct node *n; struct retire_node *next; };
//   static _Atomic(struct retire_node *) retire_head;
//   static void retire(struct node *n);
static void hp_scan(void) {
    // Temporary array for the hazard snapshot
    const size_t HCAP = (size_t)HP_MAX_THREADS * HP_SLOTS_PER_T;
    struct node **haz = malloc(HCAP * sizeof *haz);
    if (!haz) return; // allocation failed; try again on a later scan

    // Detach the current retire batch FIRST, then snapshot hazards. The order
    // matters: every node in this batch was unlinked before the detach, so any
    // hazard that validly protects it was published (and re-validated) before
    // that, and the snapshot below will observe it. Snapshotting hazards before
    // detaching would let a concurrently retired node slip past a freshly
    // published hazard.
    struct retire_node *list =
        atomic_exchange_explicit(&retire_head, NULL, memory_order_acq_rel);
    atomic_thread_fence(memory_order_seq_cst); // pairs with the readers' publish fence

    // Snapshot the hazard set
    size_t hlen = 0;
    for (size_t i = 0; i < HCAP; ++i) {
        struct node *p = atomic_load_explicit(&g_hazards[i], memory_order_acquire);
        if (p) haz[hlen++] = p;
    }

    struct retire_node *keep = NULL; // nodes still protected
    // For each retired node, free it only if no hazard slot points at it
    while (list) {
        struct retire_node *r = list; list = list->next; r->next = NULL;
        bool is_protected = false;
        for (size_t i = 0; i < hlen; ++i) {
            if (haz[i] == r->n) { is_protected = true; break; }
        }
        if (!is_protected) {
            free(r->n);
            free(r);
        } else {
            r->next = keep; keep = r;
        }
    }
    // Push back the kept nodes so a future scan can reconsider them
    while (keep) {
        struct retire_node *r = keep; keep = keep->next;
        struct retire_node *old = atomic_load_explicit(&retire_head, memory_order_relaxed);
        do { r->next = old; } while (!atomic_compare_exchange_weak_explicit(
            &retire_head, &old, r, memory_order_release, memory_order_relaxed));
    }
    free(haz);
}
Notes:
- Detaching the batch with atomic_exchange isolates a consistent set without blocking other threads from retiring more nodes; doing the detach before the hazard snapshot is what makes the comparison sound.
- Hazard loads use acquire (plus the pairing fence) to synchronize with readers’ publications.
- Freeing happens outside of any shared structure; freed nodes cannot reappear because they were unlinked earlier.
A hazard-protected pop
Publishing a hazard follows the “publish, then validate” dance: take a snapshot of head, publish it, re-read head to confirm it hasn’t changed, then proceed. If head changed, clear the slot and retry.
struct node *pop_hp(void) {
if (!hp_thread_init()) return NULL; // or abort: too many threads
for (;;) {
struct node *old = atomic_load_explicit(&head, memory_order_acquire);
if (!old) return NULL;
hp_set(0, old); // publish intent to dereference old
atomic_thread_fence(memory_order_seq_cst); // order the hazard store before the re-read below
// Re-validate that head still equals old under the hazard
if (atomic_load_explicit(&head, memory_order_acquire) != old) {
hp_clear(0);
continue; // changed under us; retry with a fresh candidate
}
struct node *next = old->next; // safe: old is protected
if (atomic_compare_exchange_weak_explicit(
&head, &old, next,
memory_order_acq_rel,
memory_order_acquire)) {
hp_clear(0); // stop protecting old
retire(old); // detach already done; retire for deferred free
if (++hp_retired_since_scan >= HP_SCAN_THRESHOLD) {
hp_scan(); hp_retired_since_scan = 0;
}
return old;
}
// CAS failed; hazard still points to an obsolete node; clear and retry
hp_clear(0);
}
}
Why this is correct:
- Any thread trying to free old must first find it on the retire list and compare it against the hazard set. Because the scan detaches the retire batch before snapshotting hazards, and we fence between publishing the hazard and re-validating head, a scanner that could free old is guaranteed to observe our hazard and keep old alive.
- The re-validation ensures we never dereference a node that was concurrently removed and replaced; if head changed, we loop before touching old.
Reader and writer responsibilities at a glance
Reader responsibilities:
- Initialize hazard-thread state once.
- Before dereferencing a shared pointer, publish it into a hazard slot, fence, re-validate the shared location, then proceed (illustrated in the peek sketch below).
- Clear the hazard slot when done to let future scans reclaim memory.
Writer responsibilities:
- After unlinking a node from the data structure, call retire(node).
- Periodically call hp_scan() (e.g., every HP_SCAN_THRESHOLD retires or on a timer) to free unprotected nodes.
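Here is the reader-side dance in isolation: a hazard-protected peek at the top value. A sketch; peek_hp is an illustrative helper built on the hp_* functions, the head variable, and the fence discussed above.
// Read the top node's value without popping it: publish, fence, re-validate,
// dereference, clear. Returns false if the stack is empty.
static bool peek_hp(int *out) {
    if (!hp_thread_init()) return false;
    for (;;) {
        struct node *top = atomic_load_explicit(&head, memory_order_acquire);
        if (!top) return false;                      // empty
        hp_set(0, top);                              // publish intent
        atomic_thread_fence(memory_order_seq_cst);   // order publish before the re-read
        if (atomic_load_explicit(&head, memory_order_acquire) != top) {
            hp_clear(0);                             // head moved; retry
            continue;
        }
        *out = top->value;                           // safe: top is protected
        hp_clear(0);
        return true;
    }
}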
Common pitfalls and guardrails
- Never mix atomic and non-atomic access to shared pointers. Use _Atomic(struct node*) consistently for head and similar shared fields.
- Don’t forget to re-validate after publishing the hazard. Publishing alone does not pin the snapshot you loaded if the shared pointer has already changed.
- Ensure HP_MAX_THREADS * HP_SLOTS_PER_T covers the worst-case concurrency; otherwise threads may fail to initialize hazard slots.
- Hazard slots are per-thread state. If a thread exits without clearing them, future scans still treat its last published pointer as protected. In long-lived services, either reuse threads or add a deregistration path.
Variants and micro-optimizations
- Use more than one hazard slot when traversing multi-pointer structures (e.g., linked queues where you protect both the current and next pointer during traversal).
- Replace the linear hazard search with a hash set or radix table when HP_MAX_THREADS is large. For the common case (hundreds of hazards), linear scans are cache-friendly and competitive.
- Batch retire() per thread and push to the global list via a single multi-node append to reduce contention (see the sketch after this list).
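A sketch of that last optimization, assuming struct retire_node, retire_head, and retire() from the baseline are visible (retire_batched and RETIRE_BATCH are illustrative names):
// Per-thread retire batching: collect nodes in a thread-local list and splice
// them into the global retire list with one CAS per batch instead of one CAS
// per node.
#define RETIRE_BATCH 32

static _Thread_local struct retire_node *batch_head = NULL;
static _Thread_local struct retire_node *batch_tail = NULL;
static _Thread_local unsigned batch_len = 0;

static void retire_batched(struct node *n) {
    struct retire_node *r = malloc(sizeof *r);
    if (!r) { retire(n); return; }           // allocation failed: fall back to the baseline path
    r->n = n; r->next = batch_head;
    batch_head = r;
    if (!batch_tail) batch_tail = r;         // first node in the batch is the tail
    if (++batch_len < RETIRE_BATCH) return;

    // Splice the whole private batch onto the global list with one CAS loop
    struct retire_node *old = atomic_load_explicit(&retire_head, memory_order_relaxed);
    do { batch_tail->next = old; } while (!atomic_compare_exchange_weak_explicit(
        &retire_head, &old, batch_head, memory_order_release, memory_order_relaxed));
    batch_head = batch_tail = NULL;
    batch_len = 0;
}
Contention on retire_head now scales with batches rather than individual retirements; a thread that stops retiring should flush its private batch on shutdown.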
Epoch-based reclamation (EBR): lightweight readers, batched frees
Hazard pointers make readers publish exact pointers. Epoch-based reclamation keeps readers lighter: they only announce “I may dereference raw pointers now” by entering the current epoch and announce they are safe again by exiting. Writers tag retired nodes with the current epoch and free them only after at least two global epoch advances so that all readers that could have seen them are gone.
Global state and thread records
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
#define EBR_MAX_THREADS 128
#define EBR_SCAN_THRESHOLD 128
static _Atomic(unsigned) ebr_global_epoch = ATOMIC_VAR_INIT(1); // start at 1
struct ebr_thread_rec {
_Atomic(unsigned) active; // 0 = quiescent, 1 = inside epoch
_Atomic(unsigned) epoch; // epoch value captured at entry
};
static struct ebr_thread_rec ebr_threads[EBR_MAX_THREADS];
static _Atomic(unsigned) ebr_next_tid = ATOMIC_VAR_INIT(0);
static _Thread_local unsigned ebr_tid = UINT_MAX;
static _Thread_local unsigned ebr_retired_since_scan = 0;
static inline bool ebr_thread_register(void) {
if (ebr_tid != UINT_MAX) return true;
unsigned id = atomic_fetch_add_explicit(&ebr_next_tid, 1, memory_order_relaxed);
if (id >= EBR_MAX_THREADS) return false;
ebr_tid = id;
atomic_store_explicit(&ebr_threads[id].active, 0u, memory_order_relaxed);
atomic_store_explicit(&ebr_threads[id].epoch, 0u, memory_order_relaxed);
return true;
}
static inline void ebr_enter(void) {
    if (!ebr_thread_register()) return; // handle error upstream
    unsigned e = atomic_load_explicit(&ebr_global_epoch, memory_order_acquire);
    atomic_store_explicit(&ebr_threads[ebr_tid].epoch, e, memory_order_release);
    atomic_store_explicit(&ebr_threads[ebr_tid].active, 1u, memory_order_release);
    // Order the active/epoch announcement before the shared-pointer loads that
    // follow; a release store alone does not order against later loads.
    atomic_thread_fence(memory_order_seq_cst);
}
static inline void ebr_exit(void) {
    if (ebr_tid == UINT_MAX) return; // never registered (registration failed)
    // Mark quiescent: once active is 0, scanners ignore this thread's epoch
    atomic_store_explicit(&ebr_threads[ebr_tid].active, 0u, memory_order_release);
    // Optionally refresh our recorded epoch so a later min computation is not
    // held back by a stale value
    unsigned e = atomic_load_explicit(&ebr_global_epoch, memory_order_relaxed);
    atomic_store_explicit(&ebr_threads[ebr_tid].epoch, e, memory_order_relaxed);
}
// Try to bump the global epoch if all active threads have caught up
static void ebr_maybe_advance(void) {
unsigned cur = atomic_load_explicit(&ebr_global_epoch, memory_order_acquire);
unsigned min_active = UINT_MAX;
for (unsigned i = 0; i < EBR_MAX_THREADS; ++i) {
if (atomic_load_explicit(&ebr_threads[i].active, memory_order_acquire)) {
unsigned te = atomic_load_explicit(&ebr_threads[i].epoch, memory_order_acquire);
if (te < min_active) min_active = te;
}
}
if (min_active >= cur) {
(void)atomic_compare_exchange_strong_explicit(
&ebr_global_epoch, &cur, cur + 1,
memory_order_acq_rel, memory_order_acquire);
}
}
Retire list tagged with epochs and a scanner
struct ebr_retire_node { struct node *n; unsigned epoch; struct ebr_retire_node *next; };
static _Atomic(struct ebr_retire_node *) ebr_retire_head = ATOMIC_VAR_INIT(NULL);
static void ebr_retire(struct node *n) {
struct ebr_retire_node *r = malloc(sizeof *r);
r->n = n;
r->epoch = atomic_load_explicit(&ebr_global_epoch, memory_order_relaxed);
struct ebr_retire_node *old = atomic_load_explicit(&ebr_retire_head, memory_order_relaxed);
do { r->next = old; } while (!atomic_compare_exchange_weak_explicit(
&ebr_retire_head, &old, r, memory_order_release, memory_order_relaxed));
}
static void ebr_scan(void) {
ebr_maybe_advance();
unsigned cur = atomic_load_explicit(&ebr_global_epoch, memory_order_acquire);
// Compute the minimum epoch among active threads; inactive threads don’t block
unsigned min_active = UINT_MAX;
for (unsigned i = 0; i < EBR_MAX_THREADS; ++i) {
if (atomic_load_explicit(&ebr_threads[i].active, memory_order_acquire)) {
unsigned te = atomic_load_explicit(&ebr_threads[i].epoch, memory_order_acquire);
if (te < min_active) min_active = te;
}
}
// Detach retire list batch
struct ebr_retire_node *list = atomic_exchange_explicit(&ebr_retire_head, NULL, memory_order_acq_rel);
struct ebr_retire_node *keep = NULL;
while (list) {
struct ebr_retire_node *r = list; list = list->next; r->next = NULL;
// Free if node was retired at least two epochs earlier and all active threads are past it
if (min_active != UINT_MAX && r->epoch + 2u <= min_active) {
free(r->n);
free(r);
} else if (r->epoch + 2u <= cur && min_active == UINT_MAX) {
// No active threads: safe to free if enough global time elapsed
free(r->n);
free(r);
} else {
r->next = keep; keep = r;
}
}
// Push back survivors
while (keep) {
struct ebr_retire_node *r = keep; keep = keep->next;
struct ebr_retire_node *old = atomic_load_explicit(&ebr_retire_head, memory_order_relaxed);
do { r->next = old; } while (!atomic_compare_exchange_weak_explicit(
&ebr_retire_head, &old, r, memory_order_release, memory_order_relaxed));
}
}
A pop using epoch-based reclamation
Readers enter/exit a read-side critical section around operations that may dereference raw pointers. No hazard publication is needed.
struct node *pop_ebr(void) {
ebr_enter();
for (;;) {
struct node *old = atomic_load_explicit(&head, memory_order_acquire);
if (!old) { ebr_exit(); return NULL; }
struct node *next = old->next; // safe while in critical section
if (atomic_compare_exchange_weak_explicit(
&head, &old, next,
memory_order_acq_rel, memory_order_acquire)) {
ebr_retire(old);
if (++ebr_retired_since_scan >= EBR_SCAN_THRESHOLD) {
ebr_scan(); ebr_retired_since_scan = 0;
}
ebr_exit();
return old;
}
}
}
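The same discipline covers read-only traversals. A sketch (contains_ebr is an illustrative helper): the whole walk sits inside one enter/exit pair, so every node reachable from head stays allocated until we exit.
// Search the stack for a value under EBR. All dereferences happen between
// ebr_enter() and ebr_exit(); nodes we can reach may be retired concurrently,
// but they cannot be freed until after we exit.
static bool contains_ebr(int value) {
    ebr_enter();
    bool found = false;
    for (struct node *p = atomic_load_explicit(&head, memory_order_acquire);
         p != NULL;
         p = p->next) {
        if (p->value == value) { found = true; break; }
    }
    ebr_exit();
    return found;
}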
When epochs shine (and when they bite)
Strengths:
- Ultra-light readers: two atomic stores on entry/exit, no hazard publication per pointer.
- Batching frees reduces allocator churn and can boost throughput.
- Fits traversal-heavy structures (lists, hash tables) with minimal per-step overhead.
Trade-offs:
- A stalled thread that never calls ebr_exit() can delay reclamation indefinitely, growing memory.
- Coarse timing: freeing waits for global epoch advances; worst-case latency can be larger than hazard pointers under contention.
- Requires disciplined read-side critical sections around every dereference of shared nodes.
Practical guardrails
- Keep read-side critical sections short; avoid blocking calls inside them.
- Ensure all participating threads are registered; provide a shutdown path that marks threads quiescent before joining (see the sketch after this list).
- Drive ebr_scan() and ebr_maybe_advance() periodically (thresholds or timers) to keep memory bounded.
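A sketch of that shutdown path (ebr_thread_shutdown is an illustrative helper; in this simple scheme thread ids are never recycled, so a production version would also return the slot to a free list):
// Per-thread shutdown hook for EBR: call it from a worker right before it
// exits so a parked record never holds back reclamation.
static void ebr_thread_shutdown(void) {
    if (ebr_tid == UINT_MAX) return;   // this thread never registered
    // Mark quiescent; scanners only consider active threads, so this is enough.
    atomic_store_explicit(&ebr_threads[ebr_tid].active, 0u, memory_order_release);
    // Refresh our recorded epoch for hygiene.
    unsigned e = atomic_load_explicit(&ebr_global_epoch, memory_order_relaxed);
    atomic_store_explicit(&ebr_threads[ebr_tid].epoch, e, memory_order_relaxed);
}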
Hazard pointers vs epochs: quick guidance
- Use hazard pointers when:
- You need predictable reclamation latency and can afford a couple of atomic stores per dereference.
- You have few outstanding hazards per thread and moderate thread counts.
- Use epochs when:
- Read paths are hot and must be minimal.
- You can guarantee timely quiescent states (no long-paused threads).
- Batching frees aligns with your allocator and workload.
A unified policy seam: swap HP and EBR without touching data-structure logic
Keep your stack/queue code agnostic to the reclamation policy by routing through a tiny interface. At compile time (or runtime with function pointers), choose HP or EBR. This keeps tests honest and refactors cheap.
// Reclamation policy interface (minimal)
enum rp_kind { RP_HP, RP_EBR };
struct rp_vtbl {
void (*thread_init)(void);
void (*enter)(void); // no-op for HP
void (*exit)(void); // no-op for HP
void (*protect)(unsigned slot, struct node *p); // HP only; no-op for EBR
void (*unprotect)(unsigned slot); // HP only; no-op for EBR
void (*retire)(struct node *n);
void (*scan)(void);
};
// HP-backed implementation
static void hp_thread_init_wrap(void){ (void)hp_thread_init(); }
static void hp_enter_noop(void){}
static void hp_exit_noop(void){}
static void hp_protect(unsigned i, struct node *p){ hp_set(i, p); atomic_thread_fence(memory_order_seq_cst); /* order publish before the caller's re-validation */ }
static void hp_unprotect(unsigned i){ hp_clear(i); }
static void hp_retire_wrap(struct node *n){ retire(n); }
static void hp_scan_wrap(void){ hp_scan(); }
static const struct rp_vtbl RP_HP_VTBL = {
hp_thread_init_wrap, hp_enter_noop, hp_exit_noop,
hp_protect, hp_unprotect, hp_retire_wrap, hp_scan_wrap
};
// EBR-backed implementation
static void ebr_thread_init_wrap(void){ (void)ebr_thread_register(); }
static void ebr_protect_noop(unsigned slot, struct node *p){ (void)slot; (void)p; }
static void ebr_unprotect_noop(unsigned slot){ (void)slot; }
static void ebr_retire_wrap(struct node *n){ ebr_retire(n); }
static void ebr_scan_wrap(void){ ebr_scan(); }
static const struct rp_vtbl RP_EBR_VTBL = {
ebr_thread_init_wrap, ebr_enter, ebr_exit,
ebr_protect_noop, ebr_unprotect_noop, ebr_retire_wrap, ebr_scan_wrap
};
Usage sketch in a generic pop_generic:
static inline struct node *pop_generic(const struct rp_vtbl *rp) {
rp->thread_init();
rp->enter();
for (;;) {
struct node *old = atomic_load_explicit(&head, memory_order_acquire);
if (!old) { rp->exit(); return NULL; }
rp->protect(0, old); // HP publishes; EBR no-op
if (rp == &RP_HP_VTBL && atomic_load_explicit(&head, memory_order_acquire) != old) {
rp->unprotect(0); // re-validate for HP
continue;
}
struct node *next = old->next;
if (atomic_compare_exchange_weak_explicit(
&head, &old, next, memory_order_acq_rel, memory_order_acquire)) {
rp->unprotect(0);
rp->retire(old);
rp->scan();
rp->exit();
return old;
}
rp->unprotect(0);
}
}
Testing and validation that catch real bugs
- Compile with sanitizers:
- ThreadSanitizer: -fsanitize=thread -fPIE -pie -pthread to catch racy non-atomic accesses around auxiliary state. Note: TSan understands C11 atomics.
- UndefinedBehaviorSanitizer: -fsanitize=undefined to catch undefined behavior the optimizer would otherwise exploit silently.
- Stress at -O3 -march=native, where optimizations surface UB quickly.
- Run on real cores; vary thread counts and affinity; test SPSC, MPSC, and MPMC patterns.
- Add assertions on core invariants: e.g., bounded retired-list length under steady load, non-negative queue length, node pointer non-NULL after successful acquire.
Minimal harness outline:
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdatomic.h>
static atomic_bool g_run = ATOMIC_VAR_INIT(true);
void *producer(void *arg) {
const struct rp_vtbl *rp = (const struct rp_vtbl *)arg;
rp->thread_init();
while (atomic_load_explicit(&g_run, memory_order_relaxed)) {
struct node *n = malloc(sizeof *n); n->value = 1; n->next = NULL;
push(n);
}
return NULL;
}
void *consumer(void *arg) {
const struct rp_vtbl *rp = (const struct rp_vtbl *)arg;
rp->thread_init();
uint64_t cnt = 0;
while (atomic_load_explicit(&g_run, memory_order_relaxed)) {
struct node *n = pop_generic(rp);
if (n) { ++cnt; }
}
printf("consumed=%llu\n", (unsigned long long)cnt);
return NULL;
}
Drive with HP or EBR and compare throughput and peak retired-node backlog.
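A sketch of the wiring; thread counts and the run window are illustrative, and the cast drops const only for the pthread API:
#include <unistd.h>   // sleep()

int main(int argc, char **argv) {
    // Pick the policy from the command line: "e" for EBR, anything else for HP
    const struct rp_vtbl *rp = (argc > 1 && argv[1][0] == 'e') ? &RP_EBR_VTBL
                                                               : &RP_HP_VTBL;
    pthread_t prod[4], cons[4];
    for (int i = 0; i < 4; ++i) pthread_create(&prod[i], NULL, producer, (void *)rp);
    for (int i = 0; i < 4; ++i) pthread_create(&cons[i], NULL, consumer, (void *)rp);

    sleep(5);                                               // run for a fixed window
    atomic_store_explicit(&g_run, false, memory_order_relaxed);

    for (int i = 0; i < 4; ++i) pthread_join(prod[i], NULL);
    for (int i = 0; i < 4; ++i) pthread_join(cons[i], NULL);
    return 0;
}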
Benchmarking methodology that surfaces the trade-offs
- Throughput: operations/sec for push+pop across threads.
- Latency: p50/p95/p99 of pop under load (with timing in the consumer loop).
- Memory: maximum retired-but-not-freed nodes (HP: pending after scans; EBR: awaiting epoch).
- Sensitivity:
- Vary HP_SCAN_THRESHOLD and EBR_SCAN_THRESHOLD to trade amortization against latency.
- Vary hazard slots per thread; measure the effect on scan time and cache behavior.
- Introduce stalled threads to test EBR’s worst case (epochs cannot advance) and HP’s robustness.
Interpretation:
- If pauses spike with EBR under stalled readers, prefer HP or enforce quiescent deadlines.
- If readers are extremely hot and predictable, EBR often wins on CPU with fewer atomic stores per dereference.
- For moderate cores and few hazards, HP’s scans are small and fast, delivering predictable free latency.
Memory ordering cheat sheet for these structures
- Loads of shared pointers: memory_order_acquire.
- Stores that publish new nodes: memory_order_release.
- Successful CAS that both consumes and publishes: memory_order_acq_rel.
- CAS failure order: memory_order_acquire (or relaxed if followed by a fresh acquire load).
- Hazard publication: release store followed by a seq_cst fence before re-validating; hazard scans load with memory_order_acquire after detaching the retire batch.
Production hardening checklist
- Prefer fixed-size per-thread caches for retired nodes to reduce malloc churn; free in batches.
- Bound retire-list growth; trigger scans via counts and timers.
- Provide thread deregistration for hazard pointers (clear slots) and EBR (mark quiescent) on shutdown.
- Tag pointers where address reuse is common; widen tags if alignment allows.
- Keep read-side critical sections (EBR) free of blocking calls and syscalls.
- Log metrics: retired backlog, scan duration, frees/sec, epoch advances/sec.
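For the metrics item, plain relaxed counters are usually enough. A sketch with illustrative names:
// Cheap global counters; relaxed atomics are fine because these are
// statistics, not synchronization.
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t stat_retired   = ATOMIC_VAR_INIT(0); // nodes retired
static _Atomic uint64_t stat_freed     = ATOMIC_VAR_INIT(0); // nodes actually freed
static _Atomic uint64_t stat_epoch_adv = ATOMIC_VAR_INIT(0); // successful epoch bumps

static inline void stat_inc(_Atomic uint64_t *c) {
    atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}

// Retired-but-not-freed backlog at this instant; export it and alert if it
// grows without bound.
static inline uint64_t stat_backlog(void) {
    return atomic_load_explicit(&stat_retired, memory_order_relaxed)
         - atomic_load_explicit(&stat_freed, memory_order_relaxed);
}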
Closing thoughts
Lock-free isn’t just CAS tricks—it’s a contract about visibility and lifetime. Tagged pointers curb ABA, while hazard pointers and epochs turn “when is it safe to free?” into enforceable policies. Pick the approach that matches your workload’s truth: if readers must be featherweight and you can guarantee quiescent progress, epochs are elegant and fast. If you need tight control over free latency in the face of slow or preempted threads, hazard pointers provide sharp guarantees with small, predictable costs. Either way, write down the happens-before edges, test under optimization, and keep your reclamation policy a swappable seam. That’s how you get both speed and sleep.