You decided to go fast without locks. Great choice—until a crash dump shows a freed node being dereferenced by a different thread, and your “obvious” fix with volatile doesn’t fix anything. The truth is simple and sharp: lock-free is not just about atomics and compare_exchange. It’s also about safe memory reclamation. If you pop a node while another thread can still see it, you must not free it yet—or you invite use-after-free and the notorious ABA problem.
This post builds up the foundations you actually need to ship lock-free structures in C: what “lock-free” means, why ABA appears in the first place, how pointer tagging helps (sometimes), and why production codebases standardize on either hazard pointers or epoch-based reclamation to free memory safely. We’ll write real C11 code, reason with the memory model, and keep the vibe practical.
What “lock-free” actually promises
Progress guarantees, in one breath:
- Obstruction-free: a thread makes progress in isolation.
- Lock-free: the system makes progress overall—at least one thread completes in a finite number of steps.
- Wait-free: every thread completes in a bounded number of its own steps.
Most classic stacks/queues you’ll hand-roll in C target lock-free. We’ll implement them with C11 atomics and make memory ordering explicit so the compiler and CPU cooperate.
A minimal lock-free stack (Treiber) without reclamation
We’ll start with the canonical Treiber stack. It works—and it leaks—because we won’t reclaim memory yet. That’s deliberate: separating logic from reclamation makes each part provable.
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
struct node {
int value;
struct node *next;
};
// Head of the stack; atomic pointer so CAS is well-defined
static _Atomic(struct node *) head = ATOMIC_VAR_INIT(NULL);
void push(struct node *n) {
struct node *old = atomic_load_explicit(&head, memory_order_relaxed);
do {
n->next = old; // prepare link to current head
} while (!atomic_compare_exchange_weak_explicit(
&head, &old, n,
memory_order_acq_rel, // success: publish n and order with readers
memory_order_acquire)); // failure: observe updated head before retry
}
// Returns a node the caller owns. Caller MUST NOT free it yet (reclamation to come).
struct node *pop(void) {
struct node *old = atomic_load_explicit(&head, memory_order_acquire);
for (;;) {
if (old == NULL) return NULL; // empty
struct node *next = old->next; // requires acquire-load of head
if (atomic_compare_exchange_weak_explicit(
&head, &old, next,
memory_order_acq_rel, // success: detach node
memory_order_acquire)) // failure: head changed; reload old
return old;
// CAS failed: old has been updated with the current head; loop
}
}
Notes:
- The acquire load of head and the acq_rel CAS ensure that when a consumer sees a node, its fields are fully initialized by the producer that linked it.
- We intentionally “leak” popped nodes for now. Freeing here would be a race: another thread may still be reading the old node it fetched before our CAS.
Why reclamation is hard in lock-free code
With coarse locks, freeing is easy: you hold a lock that excludes readers. In lock-free land, other threads can read old at any time between your load and CAS—even after you’ve detached it—so freeing immediately can produce a use-after-free when they dereference old->next later. We need a protocol to decide “no thread can possibly hold a reference to this node anymore.”
Two families dominate:
- Hazard pointers: readers publish the pointers they are about to touch; writers only free nodes that are not listed as “hazardous.”
- Epoch-based reclamation: time is divided into epochs; readers mark entry/exit from quiescent states; writers free nodes only after at least two epoch advances.
Before we get there, we must understand the ABA problem that poisons naive CAS designs.
The ABA problem (and why CAS can lie)
CAS checks “is the word still equal to what I saw?” If yes, it installs the new value. But equality of bits does not imply equality of meaning.
Classic scenario on our Treiber stack:
- Thread T1 reads head = A.
- T1 is descheduled.
- Thread T2 pops A (head becomes B) and frees A.
- Memory allocator reuses A’s address for a new node A′ and pushes it: head becomes A again.
- T1 wakes up, sees head still equals A (bitwise), and CAS succeeds, linking its node after A′. The list is now corrupt, and you’ll eventually crash.
We can mitigate ABA by making the pointer value change monotonically—even when the same address is recycled—typically by tagging pointers with a small counter. That helps single-word CAS detect reuse, but it’s not a complete memory reclamation strategy.
Pointer tagging: a quick primer
If your nodes are aligned (they are), the low 1–3 bits of pointers are zero. You can store a small counter in those bits and increment on each update. Comparing the tagged word catches A→B→A′ cycles.
#include <inttypes.h>
struct tagged_ptr {
uintptr_t raw; // [ .. pointer .. | tag_low_bits ]
};
static inline struct tagged_ptr make_tagged(struct node *p, unsigned tag) {
return (struct tagged_ptr){ .raw = ((uintptr_t)p) | (tag & 0x3u) };
}
static inline struct node *tp_ptr(struct tagged_ptr t) {
return (struct node *)(t.raw & ~((uintptr_t)0x3u));
}
static inline unsigned tp_tag(struct tagged_ptr t) { return (unsigned)(t.raw & 0x3u); }
static inline struct tagged_ptr tp_bump(struct tagged_ptr t, struct node *newp) {
return make_tagged(newp, (tp_tag(t) + 1u) & 0x3u);
}
A tagged CAS uses _Atomic(uintptr_t) instead of _Atomic(struct node*). This defeats many ABA cases without widening the head to two words. Caveats: a 2-bit tag can still wrap; increase the tag width if alignment allows. Tagging helps correctness but does not decide the safety of free()—you still need hazard pointers or epochs to know when reclamation is safe.
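To make that concrete, here is a minimal sketch of a pop over a tagged head, assuming the struct node and tagged_ptr helpers above (tagged_head and pop_tagged are illustrative names, not part of the earlier code):
// Illustrative tagged-head pop. The head is a single word: pointer bits plus a
// small tag that is bumped on every successful update, so a recycled address
// no longer compares equal.
static _Atomic(uintptr_t) tagged_head = ATOMIC_VAR_INIT(0); // NULL pointer, tag 0

struct node *pop_tagged(void) {
    uintptr_t old = atomic_load_explicit(&tagged_head, memory_order_acquire);
    for (;;) {
        struct tagged_ptr cur = { .raw = old };
        struct node *p = tp_ptr(cur);
        if (!p) return NULL;                         // empty stack
        struct node *next = p->next;                 // still needs HP/EBR to be safe
        struct tagged_ptr want = tp_bump(cur, next); // new pointer, tag + 1
        if (atomic_compare_exchange_weak_explicit(
                &tagged_head, &old, want.raw,
                memory_order_acq_rel, memory_order_acquire))
            return p;                                // a reused address with a stale tag fails the CAS
        // CAS failed: old now holds the current tagged word; retry
    }
}
A push would bump the tag the same way. Note that the old->next read is still only safe once a reclamation scheme guarantees the node cannot have been freed.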
The concrete ABA timeline above points to a twofold fix: make equality checks robust (tagging) and defer reclamation until no readers can hold the pointer anymore (hazard pointers or epochs).
Memory ordering we’ll rely on
Lock-free structures live or die on ordering edges. We’ll use:
- memory_order_release when publishing a new head, so readers that see it (via acquire) also see the initialized node fields.
- memory_order_acquire when consuming the head, so subsequent reads (e.g., n->next, payload) observe what the publisher wrote.
- memory_order_acq_rel on a successful CAS, to both consume the old state and publish the new.
- For the failure order of compare_exchange_*, memory_order_acquire (or relaxed if you do a separate load) is typically enough.
This mirrors the producer/consumer handoff pattern and keeps us out of undefined-behavior territory.
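Stripped of the stack, that handoff looks like this. A minimal sketch with illustrative names (msg, ready): the release store publishes msg, and an acquire load that observes ready == 1 is guaranteed to also observe msg == 42.
#include <stdatomic.h>

static int msg;                                      // plain data, written before publication
static _Atomic int ready = ATOMIC_VAR_INIT(0);

void producer_publish(void) {
    msg = 42;                                                 // A: write payload
    atomic_store_explicit(&ready, 1, memory_order_release);   // B: publish; A is visible to observers of B
}

int consumer_try_read(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire))   // synchronizes-with B
        return msg;                                           // guaranteed to read 42
    return -1;                                                // not published yet
}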
Why volatile is not the fix
volatile talks to the compiler about I/O-like side effects. It does not make operations atomic, it does not create inter-thread happens-before edges, and it does not fix ABA. Use <stdatomic.h> and explicit memory orders; treat volatile as a tool for MMIO, not concurrency.
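A small contrast, with illustrative names: a stop flag shared between threads must be _Atomic, not volatile.
#include <stdatomic.h>
#include <stdbool.h>

volatile bool stop_v = false;                  // WRONG for threads: not atomic, no ordering;
                                               // concurrent access is a data race (UB in C11)

_Atomic bool stop_a = ATOMIC_VAR_INIT(false);  // OK: atomic, with explicit orders

void worker(void) {
    while (!atomic_load_explicit(&stop_a, memory_order_acquire)) {
        /* do work */
    }
}

void controller(void) {
    atomic_store_explicit(&stop_a, true, memory_order_release);
}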
The reclamation contract (what we must prove before free)
When we detach a node X from a structure, freeing X is safe only when both of the following hold:
1. No thread can still hold a pointer to X from a prior read.
2. No thread can obtain a pointer to X in the future from the shared structure (i.e., X is no longer reachable).
Pointer tagging helps ensure condition (2) isn’t subverted by ABA (we won’t “re-see” an old bit pattern by accident). Hazard pointers and epochs are strategies to prove (1) without global locks:
- Hazard pointers: readers publish the exact pointers they will dereference. Writers retire nodes and later scan the global hazard set; any node not listed is safe to free.
- Epoch-based reclamation: readers enter an epoch while they may hold raw pointers; writers free nodes after two epoch advances, guaranteeing all prior readers have exited.
We’ll implement both patterns on top of the Treiber stack and compare their ergonomics and performance.
Testable baseline: Treiber stack with a retire list (no frees yet)
Let’s establish a baseline “retire but don’t free” path; we’ll plug hazard/epoch policies into try_reclaim() later.
#include <stdatomic.h>
#include <stdlib.h>
struct retire_node { struct node *n; struct retire_node *next; };
static _Atomic(struct retire_node *) retire_head = ATOMIC_VAR_INIT(NULL);
static void retire(struct node *n) {
struct retire_node *r = malloc(sizeof *r);
r->n = n; r->next = NULL;
struct retire_node *old = atomic_load_explicit(&retire_head, memory_order_relaxed);
do { r->next = old; } while (!atomic_compare_exchange_weak_explicit(
&retire_head, &old, r, memory_order_release, memory_order_relaxed));
}
// Placeholder: hazard/epoch policies will scan retire list and free safe nodes.
static void try_reclaim(void) {
// no-op for now
}
struct node *safe_pop(void) {
struct node *x = pop();
if (x) { retire(x); try_reclaim(); }
return x;
}
Why this shape works for both approaches:
- Writers “retire” nodes immediately after detachment.
- A background or opportunistic try_reclaim() applies the policy: either scan hazards (hazard pointers) or check epochs and batch-free (epochs).
What’s next in the design
From here, the two reclamation families diverge in reader/writer responsibilities:
- Hazard pointers push work to readers (publish before deref; clear after). Writers scan retire lists against hazards. Low latency under contention but more per-read overhead and memory for hazard slots.
- Epoch-based reclamation keeps readers light (enter/exit epoch) and batches frees. Great throughput and simple fast-path; pauses can spike if epochs cannot advance (e.g., stalled threads).
We’ll structure the APIs so you can switch policies without touching the stack/queue logic.
A quick benchmarking lens
When we compare policies, measure both throughput and tail-latency:
- Throughput: pushes/pops per second at many threads.
- Pause behavior: p99 of individual pop latency under churn.
- Memory footprint: average and p99 of retired-but-not-freed nodes.
Microbenchmarks should pin threads, pre-allocate nodes to avoid allocator bias, and exercise contention patterns (SPSC, MPMC bursts). Keep the core invariant: correctness first, then speed.
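For example, on Linux you might pin each benchmark thread before it starts measuring. A sketch; pthread_setaffinity_np is a GNU extension, so this is not portable C11:
// Pin the calling thread to a given CPU (Linux, GNU extension).
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Returns 0 on success, an error number on failure
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}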
Hazard pointers in practice: a minimal, portable implementation
We’ll build a compact hazard-pointer manager you can drop into a C11 project. It assumes a bounded number of threads and allocates each thread a fixed number of hazard slots. Readers publish the pointer they intend to dereference, re-validate, then proceed. Writers retire nodes and occasionally scan the retire list against the global hazard set to free what’s safe.
Configuration and global state
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <limits.h>
// Tunables
#define HP_MAX_THREADS 128 // maximum threads that will call hp_thread_init
#define HP_SLOTS_PER_T 2 // hazards per thread (1–3 typical for stacks/queues)
#define HP_SCAN_THRESHOLD 64 // retired nodes to accumulate before a scan
// Global hazard table: each thread owns a contiguous block of slots
static _Atomic(struct node *) g_hazards[HP_MAX_THREADS * HP_SLOTS_PER_T];
// Monotonic thread id assignment
static _Atomic(unsigned) g_next_tid = ATOMIC_VAR_INIT(0);
// Thread-local bookkeeping
static _Thread_local unsigned hp_tid = UINT_MAX; // uninitialized sentinel
static _Thread_local unsigned hp_retired_since_scan = 0;
static inline unsigned hp_base_index(void) { return hp_tid * HP_SLOTS_PER_T; }
static inline _Atomic(struct node *) *hp_slot(unsigned i) {
return &g_hazards[hp_base_index() + i];
}
// Initialize hazard slots to NULL once per process (optional hygiene)
static void hp_global_init(void) {
for (size_t i = 0; i < (size_t)HP_MAX_THREADS * HP_SLOTS_PER_T; ++i) {
atomic_store_explicit(&g_hazards[i], NULL, memory_order_relaxed);
}
}
Thread registration and slot helpers
// Must be called by each participating thread before using hazard pointers
static bool hp_thread_init(void) {
if (hp_tid != UINT_MAX) return true; // already initialized
unsigned id = atomic_fetch_add_explicit(&g_next_tid, 1, memory_order_relaxed);
if (id >= HP_MAX_THREADS) return false; // out of slots
hp_tid = id;
for (unsigned i = 0; i < HP_SLOTS_PER_T; ++i) {
atomic_store_explicit(hp_slot(i), NULL, memory_order_relaxed);
}
return true;
}
static inline void hp_set(unsigned i, struct node *p) {
    // Release makes the publication visible to scanners that acquire-load.
    // Note: a release store does not order against the *later* load that
    // re-validates the shared pointer; callers add a seq_cst fence (see pop_hp).
    atomic_store_explicit(hp_slot(i), p, memory_order_release);
}
static inline void hp_clear(unsigned i) {
atomic_store_explicit(hp_slot(i), NULL, memory_order_release);
}
Retire and scan with hazard comparison
We’ll reuse the retire_head global list from earlier and implement a non-blocking scan: atomically detach the list, compare it against the current hazard set, free what’s not protected, and push the survivors back.
// From the earlier baseline, assumed visible in this translation unit:
//   struct retire_node { struct node *n; struct retire_node *next; };
//   static _Atomic(struct retire_node *) retire_head;
//   static void retire(struct node *n);
static void hp_scan(void) {
    // Temporary array for the hazard snapshot
    const size_t HCAP = (size_t)HP_MAX_THREADS * HP_SLOTS_PER_T;
    struct node **haz = malloc(HCAP * sizeof *haz);
    if (!haz) return; // allocation failed; try again on a later scan

    // Detach the current retire batch FIRST, then snapshot hazards. The order
    // matters: every node in this batch was unlinked before the detach, so any
    // hazard that validly protects it was published (and re-validated) before
    // that, and the snapshot below will observe it. Snapshotting hazards before
    // detaching would let a concurrently retired node slip past a freshly
    // published hazard.
    struct retire_node *list =
        atomic_exchange_explicit(&retire_head, NULL, memory_order_acq_rel);
    atomic_thread_fence(memory_order_seq_cst); // pairs with the readers' publish fence

    // Snapshot the hazard set
    size_t hlen = 0;
    for (size_t i = 0; i < HCAP; ++i) {
        struct node *p = atomic_load_explicit(&g_hazards[i], memory_order_acquire);
        if (p) haz[hlen++] = p;
    }

    struct retire_node *keep = NULL; // nodes still protected
    // For each retired node, free it only if no hazard slot points at it
    while (list) {
        struct retire_node *r = list; list = list->next; r->next = NULL;
        bool is_protected = false;
        for (size_t i = 0; i < hlen; ++i) {
            if (haz[i] == r->n) { is_protected = true; break; }
        }
        if (!is_protected) {
            free(r->n);
            free(r);
        } else {
            r->next = keep; keep = r;
        }
    }
    // Push back the kept nodes so a future scan can reconsider them
    while (keep) {
        struct retire_node *r = keep; keep = keep->next;
        struct retire_node *old = atomic_load_explicit(&retire_head, memory_order_relaxed);
        do { r->next = old; } while (!atomic_compare_exchange_weak_explicit(
            &retire_head, &old, r, memory_order_release, memory_order_relaxed));
    }
    free(haz);
}
Notes:
- Detaching the batch with atomic_exchange isolates a consistent set without blocking other threads from retiring more nodes; doing the detach before the hazard snapshot is what makes the comparison sound.
- Hazard loads use acquire (plus the pairing fence) to synchronize with readers’ publications.
- Freeing happens outside of any shared structure; freed nodes cannot reappear because they were unlinked earlier.
A hazard-protected pop
Publishing a hazard follows the “publish, then validate” dance: take a snapshot of head, publish it, re-read head to confirm it hasn’t changed, then proceed. If head changed, clear the slot and retry.
struct node *pop_hp(void) {
if (!hp_thread_init()) return NULL; // or abort: too many threads
for (;;) {
struct node *old = atomic_load_explicit(&head, memory_order_acquire);
if (!old) return NULL;
hp_set(0, old); // publish intent to dereference old
atomic_thread_fence(memory_order_seq_cst); // order the hazard store before the re-read below
// Re-validate that head still equals old under the hazard
if (atomic_load_explicit(&head, memory_order_acquire) != old) {
hp_clear(0);
continue; // changed under us; retry with a fresh candidate
}
struct node *next = old->next; // safe: old is protected
if (atomic_compare_exchange_weak_explicit(
&head, &old, next,
memory_order_acq_rel,
memory_order_acquire)) {
hp_clear(0); // stop protecting old
retire(old); // detach already done; retire for deferred free
if (++hp_retired_since_scan >= HP_SCAN_THRESHOLD) {
hp_scan(); hp_retired_since_scan = 0;
}
return old;
}
// CAS failed; hazard still points to an obsolete node; clear and retry
hp_clear(0);
}
}
Why this is correct:
- Any thread trying to free old must first find it on the retire list and compare it against the hazard set. Because the scan detaches the retire batch before snapshotting hazards, and we fence between publishing the hazard and re-validating head, a scanner that could free old is guaranteed to observe our hazard and keep old alive.
- The re-validation ensures we never dereference a node that was concurrently removed and replaced; if head changed, we loop before touching old.
Reader and writer responsibilities at a glance
Reader responsibilities:
- Initialize hazard-thread state once.
- Before dereferencing a shared pointer, publish it into a hazard slot, fence, re-validate the shared location, then proceed (illustrated in the peek sketch below).
- Clear the hazard slot when done to let future scans reclaim memory.
Writer responsibilities:
- After unlinking a node from the data structure, call retire(node).
- Periodically call hp_scan() (e.g., every HP_SCAN_THRESHOLD retires or on a timer) to free unprotected nodes.
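Here is the reader-side dance in isolation: a hazard-protected peek at the top value. A sketch; peek_hp is an illustrative helper built on the hp_* functions, the head variable, and the fence discussed above.
// Read the top node's value without popping it: publish, fence, re-validate,
// dereference, clear. Returns false if the stack is empty.
static bool peek_hp(int *out) {
    if (!hp_thread_init()) return false;
    for (;;) {
        struct node *top = atomic_load_explicit(&head, memory_order_acquire);
        if (!top) return false;                      // empty
        hp_set(0, top);                              // publish intent
        atomic_thread_fence(memory_order_seq_cst);   // order publish before the re-read
        if (atomic_load_explicit(&head, memory_order_acquire) != top) {
            hp_clear(0);                             // head moved; retry
            continue;
        }
        *out = top->value;                           // safe: top is protected
        hp_clear(0);
        return true;
    }
}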
Common pitfalls and guardrails
- Never mix atomic and non-atomic access to shared pointers. Use _Atomic(struct node*) consistently for head and similar shared fields.
- Don’t forget to re-validate after publishing the hazard. Publishing alone does not pin the snapshot you loaded if the shared pointer has already changed.
- Ensure HP_MAX_THREADS * HP_SLOTS_PER_T covers the worst-case concurrency; otherwise threads may fail to initialize hazard slots.
- Hazard slots are per-thread state. If a thread exits without clearing them, future scans still treat its last published pointer as protected. In long-lived services, either reuse threads or add a deregistration path.
Variants and micro-optimizations
- Use more than one hazard slot when traversing multi-pointer structures (e.g., linked queues where you protect both the current and next pointer during traversal).
- Replace the linear hazard search with a hash set or radix table when HP_MAX_THREADS is large. For the common case (hundreds of hazards), linear scans are cache-friendly and competitive.
- Batch retire() per thread and push to the global list via a single multi-node append to reduce contention (see the sketch after this list).
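A sketch of that last optimization, assuming struct retire_node, retire_head, and retire() from the baseline are visible (retire_batched and RETIRE_BATCH are illustrative names):
// Per-thread retire batching: collect nodes in a thread-local list and splice
// them into the global retire list with one CAS per batch instead of one CAS
// per node.
#define RETIRE_BATCH 32

static _Thread_local struct retire_node *batch_head = NULL;
static _Thread_local struct retire_node *batch_tail = NULL;
static _Thread_local unsigned batch_len = 0;

static void retire_batched(struct node *n) {
    struct retire_node *r = malloc(sizeof *r);
    if (!r) { retire(n); return; }           // allocation failed: fall back to the baseline path
    r->n = n; r->next = batch_head;
    batch_head = r;
    if (!batch_tail) batch_tail = r;         // first node in the batch is the tail
    if (++batch_len < RETIRE_BATCH) return;

    // Splice the whole private batch onto the global list with one CAS loop
    struct retire_node *old = atomic_load_explicit(&retire_head, memory_order_relaxed);
    do { batch_tail->next = old; } while (!atomic_compare_exchange_weak_explicit(
        &retire_head, &old, batch_head, memory_order_release, memory_order_relaxed));
    batch_head = batch_tail = NULL;
    batch_len = 0;
}
Contention on retire_head now scales with batches rather than individual retirements; a thread that stops retiring should flush its private batch on shutdown.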
Epoch-based reclamation (EBR): lightweight readers, batched frees
Hazard pointers make readers publish exact pointers. Epoch-based reclamation keeps readers lighter: they only announce “I may dereference raw pointers now” by entering the current epoch and announce they are safe again by exiting. Writers tag retired nodes with the current epoch and free them only after at least two global epoch advances so that all readers that could have seen them are gone.
Global state and thread records
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
#define EBR_MAX_THREADS 128
#define EBR_SCAN_THRESHOLD 128
static _Atomic(unsigned) ebr_global_epoch = ATOMIC_VAR_INIT(1); // start at 1
struct ebr_thread_rec {
_Atomic(unsigned) active; // 0 = quiescent, 1 = inside epoch
_Atomic(unsigned) epoch; // epoch value captured at entry
};
static struct ebr_thread_rec ebr_threads[EBR_MAX_THREADS];
static _Atomic(unsigned) ebr_next_tid = ATOMIC_VAR_INIT(0);
static _Thread_local unsigned ebr_tid = UINT_MAX;
static _Thread_local unsigned ebr_retired_since_scan = 0;
static inline bool ebr_thread_register(void) {
if (ebr_tid != UINT_MAX) return true;
unsigned id = atomic_fetch_add_explicit(&ebr_next_tid, 1, memory_order_relaxed);
if (id >= EBR_MAX_THREADS) return false;
ebr_tid = id;
atomic_store_explicit(&ebr_threads[id].active, 0u, memory_order_relaxed);
atomic_store_explicit(&ebr_threads[id].epoch, 0u, memory_order_relaxed);
return true;
}
static inline void ebr_enter(void) {
    if (!ebr_thread_register()) return; // handle error upstream
    unsigned e = atomic_load_explicit(&ebr_global_epoch, memory_order_acquire);
    atomic_store_explicit(&ebr_threads[ebr_tid].epoch, e, memory_order_release);
    atomic_store_explicit(&ebr_threads[ebr_tid].active, 1u, memory_order_release);
    // Order the active/epoch announcement before the shared-pointer loads that
    // follow; a release store alone does not order against later loads.
    atomic_thread_fence(memory_order_seq_cst);
}
static inline void ebr_exit(void) {
    if (ebr_tid == UINT_MAX) return; // never registered (registration failed)
    // Mark quiescent: once active is 0, scanners ignore this thread's epoch
    atomic_store_explicit(&ebr_threads[ebr_tid].active, 0u, memory_order_release);
    // Optionally refresh our recorded epoch so a later min computation is not
    // held back by a stale value
    unsigned e = atomic_load_explicit(&ebr_global_epoch, memory_order_relaxed);
    atomic_store_explicit(&ebr_threads[ebr_tid].epoch, e, memory_order_relaxed);
}
// Try to bump the global epoch if all active threads have caught up
static void ebr_maybe_advance(void) {
unsigned cur = atomic_load_explicit(&ebr_global_epoch, memory_order_acquire);
unsigned min_active = UINT_MAX;
for (unsigned i = 0; i < EBR_MAX_THREADS; ++i) {
if (atomic_load_explicit(&ebr_threads[i].active, memory_order_acquire)) {
unsigned te = atomic_load_explicit(&ebr_threads[i].epoch, memory_order_acquire);
if (te < min_active) min_active = te;
}
}
if (min_active >= cur) {
(void)atomic_compare_exchange_strong_explicit(
&ebr_global_epoch, &cur, cur + 1,
memory_order_acq_rel, memory_order_acquire);
}
}
Retire list tagged with epochs and a scanner
struct ebr_retire_node { struct node *n; unsigned epoch; struct ebr_retire_node *next; };
static _Atomic(struct ebr_retire_node *) ebr_retire_head = ATOMIC_VAR_INIT(NULL);
static void ebr_retire(struct node *n) {
struct ebr_retire_node *r = malloc(sizeof *r);
r->n = n;
r->epoch = atomic_load_explicit(&ebr_global_epoch, memory_order_relaxed);
struct ebr_retire_node *old = atomic_load_explicit(&ebr_retire_head, memory_order_relaxed);
do { r->next = old; } while (!atomic_compare_exchange_weak_explicit(
&ebr_retire_head, &old, r, memory_order_release, memory_order_relaxed));
}
static void ebr_scan(void) {
ebr_maybe_advance();
unsigned cur = atomic_load_explicit(&ebr_global_epoch, memory_order_acquire);
// Compute the minimum epoch among active threads; inactive threads don’t block
unsigned min_active = UINT_MAX;
for (unsigned i = 0; i < EBR_MAX_THREADS; ++i) {
if (atomic_load_explicit(&ebr_threads[i].active, memory_order_acquire)) {
unsigned te = atomic_load_explicit(&ebr_threads[i].epoch, memory_order_acquire);
if (te < min_active) min_active = te;
}
}
// Detach retire list batch
struct ebr_retire_node *list = atomic_exchange_explicit(&ebr_retire_head, NULL, memory_order_acq_rel);
struct ebr_retire_node *keep = NULL;
while (list) {
struct ebr_retire_node *r = list; list = list->next; r->next = NULL;
// Free if node was retired at least two epochs earlier and all active threads are past it
if (min_active != UINT_MAX && r->epoch + 2u <= min_active) {
free(r->n);
free(r);
} else if (r->epoch + 2u <= cur && min_active == UINT_MAX) {
// No active threads: safe to free if enough global time elapsed
free(r->n);
free(r);
} else {
r->next = keep; keep = r;
}
}
// Push back survivors
while (keep) {
struct ebr_retire_node *r = keep; keep = keep->next;
struct ebr_retire_node *old = atomic_load_explicit(&ebr_retire_head, memory_order_relaxed);
do { r->next = old; } while (!atomic_compare_exchange_weak_explicit(
&ebr_retire_head, &old, r, memory_order_release, memory_order_relaxed));
}
}
A pop using epoch-based reclamation
Readers enter/exit a read-side critical section around operations that may dereference raw pointers. No hazard publication is needed.
struct node *pop_ebr(void) {
ebr_enter();
for (;;) {
struct node *old = atomic_load_explicit(&head, memory_order_acquire);
if (!old) { ebr_exit(); return NULL; }
struct node *next = old->next; // safe while in critical section
if (atomic_compare_exchange_weak_explicit(
&head, &old, next,
memory_order_acq_rel, memory_order_acquire)) {
ebr_retire(old);
if (++ebr_retired_since_scan >= EBR_SCAN_THRESHOLD) {
ebr_scan(); ebr_retired_since_scan = 0;
}
ebr_exit();
return old;
}
}
}
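The same discipline covers read-only traversals. A sketch (contains_ebr is an illustrative helper): the whole walk sits inside one enter/exit pair, so every node reachable from head stays allocated until we exit.
// Search the stack for a value under EBR. All dereferences happen between
// ebr_enter() and ebr_exit(); nodes we can reach may be retired concurrently,
// but they cannot be freed until after we exit.
static bool contains_ebr(int value) {
    ebr_enter();
    bool found = false;
    for (struct node *p = atomic_load_explicit(&head, memory_order_acquire);
         p != NULL;
         p = p->next) {
        if (p->value == value) { found = true; break; }
    }
    ebr_exit();
    return found;
}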
When epochs shine (and when they bite)
Strengths:
- Ultra-light readers: two atomic stores on entry/exit, no hazard publication per pointer.
- Batching frees reduces allocator churn and can boost throughput.
- Fits traversal-heavy structures (lists, hash tables) with minimal per-step overhead.
Trade-offs:
- A stalled thread that never calls ebr_exit() can delay reclamation indefinitely, growing memory.
- Coarse timing: freeing waits for global epoch advances; worst-case latency can be larger than hazard pointers under contention.
- Requires disciplined read-side critical sections around every dereference of shared nodes.
Practical guardrails
- Keep read-side critical sections short; avoid blocking calls inside them.
- Ensure all participating threads are registered; provide a shutdown path that marks threads quiescent before joining (see the sketch after this list).
- Drive ebr_scan() and ebr_maybe_advance() periodically (thresholds or timers) to keep memory bounded.
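A sketch of that shutdown path (ebr_thread_shutdown is an illustrative helper; in this simple scheme thread ids are never recycled, so a production version would also return the slot to a free list):
// Per-thread shutdown hook for EBR: call it from a worker right before it
// exits so a parked record never holds back reclamation.
static void ebr_thread_shutdown(void) {
    if (ebr_tid == UINT_MAX) return;   // this thread never registered
    // Mark quiescent; scanners only consider active threads, so this is enough.
    atomic_store_explicit(&ebr_threads[ebr_tid].active, 0u, memory_order_release);
    // Refresh our recorded epoch for hygiene.
    unsigned e = atomic_load_explicit(&ebr_global_epoch, memory_order_relaxed);
    atomic_store_explicit(&ebr_threads[ebr_tid].epoch, e, memory_order_relaxed);
}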
Hazard pointers vs epochs: quick guidance
- Use hazard pointers when:
- You need predictable reclamation latency and can afford a couple of atomic stores per dereference.
- You have few outstanding hazards per thread and moderate thread counts.
- Use epochs when:
- Read paths are hot and must be minimal.
- You can guarantee timely quiescent states (no long-paused threads).
- Batching frees aligns with your allocator and workload.
A unified policy seam: swap HP and EBR without touching data-structure logic
Keep your stack/queue code agnostic to the reclamation policy by routing through a tiny interface. At compile time (or runtime with function pointers), choose HP or EBR. This keeps tests honest and refactors cheap.
// Reclamation policy interface (minimal)
enum rp_kind { RP_HP, RP_EBR };
struct rp_vtbl {
void (*thread_init)(void);
void (*enter)(void); // no-op for HP
void (*exit)(void); // no-op for HP
void (*protect)(unsigned slot, struct node *p); // HP only; no-op for EBR
void (*unprotect)(unsigned slot); // HP only; no-op for EBR
void (*retire)(struct node *n);
void (*scan)(void);
};
// HP-backed implementation
static void hp_thread_init_wrap(void){ (void)hp_thread_init(); }
static void hp_enter_noop(void){}
static void hp_exit_noop(void){}
static void hp_protect(unsigned i, struct node *p){ hp_set(i, p); atomic_thread_fence(memory_order_seq_cst); /* order publish before the caller's re-validation */ }
static void hp_unprotect(unsigned i){ hp_clear(i); }
static void hp_retire_wrap(struct node *n){ retire(n); }
static void hp_scan_wrap(void){ hp_scan(); }
static const struct rp_vtbl RP_HP_VTBL = {
hp_thread_init_wrap, hp_enter_noop, hp_exit_noop,
hp_protect, hp_unprotect, hp_retire_wrap, hp_scan_wrap
};
// EBR-backed implementation
static void ebr_thread_init_wrap(void){ (void)ebr_thread_register(); }
static void ebr_protect_noop(unsigned slot, struct node *p){ (void)slot; (void)p; }
static void ebr_unprotect_noop(unsigned slot){ (void)slot; }
static void ebr_retire_wrap(struct node *n){ ebr_retire(n); }
static void ebr_scan_wrap(void){ ebr_scan(); }
static const struct rp_vtbl RP_EBR_VTBL = {
ebr_thread_init_wrap, ebr_enter, ebr_exit,
ebr_protect_noop, ebr_unprotect_noop, ebr_retire_wrap, ebr_scan_wrap
};
Usage sketch in a generic pop_generic:
static inline struct node *pop_generic(const struct rp_vtbl *rp) {
rp->thread_init();
rp->enter();
for (;;) {
struct node *old = atomic_load_explicit(&head, memory_order_acquire);
if (!old) { rp->exit(); return NULL; }
rp->protect(0, old); // HP publishes; EBR no-op
if (rp == &RP_HP_VTBL && atomic_load_explicit(&head, memory_order_acquire) != old) {
rp->unprotect(0); // re-validate for HP
continue;
}
struct node *next = old->next;
if (atomic_compare_exchange_weak_explicit(
&head, &old, next, memory_order_acq_rel, memory_order_acquire)) {
rp->unprotect(0);
rp->retire(old);
rp->scan();
rp->exit();
return old;
}
rp->unprotect(0);
}
}
Testing and validation that catch real bugs
- Compile with sanitizers:
- ThreadSanitizer: -fsanitize=thread -fPIE -pie -pthread to catch racy non-atomic accesses around auxiliary state. Note: TSan understands C11 atomics.
- UndefinedBehaviorSanitizer: -fsanitize=undefined to catch undefined behavior the optimizer would otherwise exploit silently.
- Stress at -O3 -march=native, where optimizations surface UB quickly.
- Run on real cores; vary thread counts and affinity; test SPSC, MPSC, and MPMC patterns.
- Add assertions on core invariants: e.g., bounded retired-list length under steady load, non-negative queue length, node pointer non-NULL after successful acquire.
Minimal harness outline:
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdatomic.h>
static atomic_bool g_run = ATOMIC_VAR_INIT(true);
void *producer(void *arg) {
const struct rp_vtbl *rp = (const struct rp_vtbl *)arg;
rp->thread_init();
while (atomic_load_explicit(&g_run, memory_order_relaxed)) {
struct node *n = malloc(sizeof *n); n->value = 1; n->next = NULL;
push(n);
}
return NULL;
}
void *consumer(void *arg) {
const struct rp_vtbl *rp = (const struct rp_vtbl *)arg;
rp->thread_init();
uint64_t cnt = 0;
while (atomic_load_explicit(&g_run, memory_order_relaxed)) {
struct node *n = pop_generic(rp);
if (n) { ++cnt; }
}
printf("consumed=%llu\n", (unsigned long long)cnt);
return NULL;
}
Drive with HP or EBR and compare throughput and peak retired-node backlog.
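A sketch of the wiring; thread counts and the run window are illustrative, and the cast drops const only for the pthread API:
#include <unistd.h>   // sleep()

int main(int argc, char **argv) {
    // Pick the policy from the command line: "e" for EBR, anything else for HP
    const struct rp_vtbl *rp = (argc > 1 && argv[1][0] == 'e') ? &RP_EBR_VTBL
                                                               : &RP_HP_VTBL;
    pthread_t prod[4], cons[4];
    for (int i = 0; i < 4; ++i) pthread_create(&prod[i], NULL, producer, (void *)rp);
    for (int i = 0; i < 4; ++i) pthread_create(&cons[i], NULL, consumer, (void *)rp);

    sleep(5);                                               // run for a fixed window
    atomic_store_explicit(&g_run, false, memory_order_relaxed);

    for (int i = 0; i < 4; ++i) pthread_join(prod[i], NULL);
    for (int i = 0; i < 4; ++i) pthread_join(cons[i], NULL);
    return 0;
}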
Benchmarking methodology that surfaces the trade-offs
- Throughput: operations/sec for push+pop across threads.
- Latency: p50/p95/p99 of pop under load (with timing in the consumer loop).
- Memory: maximum retired-but-not-freed nodes (HP: pending after scans; EBR: awaiting epoch).
- Sensitivity:
- Vary HP_SCAN_THRESHOLD and EBR_SCAN_THRESHOLD to trade amortization against latency.
- Vary hazard slots per thread; measure the effect on scan time and cache behavior.
- Introduce stalled threads to test EBR’s worst case (epochs cannot advance) and HP’s robustness.
Interpretation:
- If pauses spike with EBR under stalled readers, prefer HP or enforce quiescent deadlines.
- If readers are extremely hot and predictable, EBR often wins on CPU with fewer atomic stores per dereference.
- For moderate cores and few hazards, HP’s scans are small and fast, delivering predictable free latency.
Memory ordering cheat sheet for these structures
- Loads of shared pointers: memory_order_acquire.
- Stores that publish new nodes: memory_order_release.
- Successful CAS that both consumes and publishes: memory_order_acq_rel.
- CAS failure order: memory_order_acquire (or relaxed if followed by a fresh acquire load).
- Hazard publication: release store followed by a seq_cst fence before re-validating; hazard scans load with memory_order_acquire after detaching the retire batch.
Production hardening checklist
- Prefer fixed-size per-thread caches for retired nodes to reduce malloc churn; free in batches.
- Bound retire-list growth; trigger scans via counts and timers.
- Provide thread deregistration for hazard pointers (clear slots) and EBR (mark quiescent) on shutdown.
- Tag pointers where address reuse is common; widen tags if alignment allows.
- Keep read-side critical sections (EBR) free of blocking calls and syscalls.
- Log metrics: retired backlog, scan duration, frees/sec, epoch advances/sec.
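For the metrics item, plain relaxed counters are usually enough. A sketch with illustrative names:
// Cheap global counters; relaxed atomics are fine because these are
// statistics, not synchronization.
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t stat_retired   = ATOMIC_VAR_INIT(0); // nodes retired
static _Atomic uint64_t stat_freed     = ATOMIC_VAR_INIT(0); // nodes actually freed
static _Atomic uint64_t stat_epoch_adv = ATOMIC_VAR_INIT(0); // successful epoch bumps

static inline void stat_inc(_Atomic uint64_t *c) {
    atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}

// Retired-but-not-freed backlog at this instant; export it and alert if it
// grows without bound.
static inline uint64_t stat_backlog(void) {
    return atomic_load_explicit(&stat_retired, memory_order_relaxed)
         - atomic_load_explicit(&stat_freed, memory_order_relaxed);
}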
Closing thoughts
Lock-free isn’t just CAS tricks—it’s a contract about visibility and lifetime. Tagged pointers curb ABA, while hazard pointers and epochs turn “when is it safe to free?” into enforceable policies. Pick the approach that matches your workload’s truth: if readers must be featherweight and you can guarantee quiescent progress, epochs are elegant and fast. If you need tight control over free latency in the face of slow or preempted threads, hazard pointers provide sharp guarantees with small, predictable costs. Either way, write down the happens-before edges, test under optimization, and keep your reclamation policy a swappable seam. That’s how you get both speed and sleep.