So you write fast C and it usually works—right up until it doesn’t. Threads deadlock “randomly.” A variable holds 0 when you swear you just wrote 42. A micro-optimization turns into a macro-catastrophe. The culprit is often the same invisible character in the story: the C memory model.
This post is a practical guide to what the language (and your compiler/CPU) actually promise, why some perfectly “reasonable” code is undefined, and how to think about ordering, visibility, and correctness. We’ll start from the ground reality of older “sequence point” rules, climb to C11’s formal sequencing and atomics, and keep our feet in production by showing how to avoid footguns.
If you care about low-latency systems, lock-free structures, or just not shipping heisenbugs, understanding the memory model is table stakes.
Why this matters (and why it’s weird)
Two uncomfortable truths about C:
- The compiler can reorder a shocking amount of code as long as observable behavior is preserved. This is the “as-if” rule.
- A data race in C is undefined behavior (UB). Once UB occurs, all bets are off—optimizers can (and do) assume the impossible never happens.
Add CPU-level reordering, caches, and cores and you get a world where naïve intuition breaks. The fix isn't "don't optimize"—it's learning the rules and using the right tools.
Ground rules: what C actually promises
- As-if rule: The compiler may transform code in any way that doesn’t change the program’s observable behavior (roughly, what a strictly conforming program could detect via I/O, volatile operations, etc.).
- Undefined behavior (UB): Executing UB means the program has no meaning in the language. Optimizations can assume UB never occurs when proving transformations safe.
- Unspecified vs implementation-defined: Order of evaluation of function arguments is unspecified; integer widths are implementation-defined. Learn the difference.
- volatile is not a concurrency primitive: volatile prevents certain optimizations around memory-mapped I/O. It does not establish inter-thread ordering or atomicity.
From sequence points to sequencing
Before C11, programmers talked about “sequence points”—moments where all side effects of previous evaluations are complete. Classic sequence points included the end of a full expression, the logical operators && and ||, the conditional operator ?:, and the comma operator. Violating the rules between these points produced UB.
C11 replaced that old model with a more precise vocabulary:
- sequenced-before / sequenced-after: A per-thread order within a single expression.
- unsequenced: Two side effects or a value computation and a side effect are not ordered relative to each other.
- indeterminately sequenced: One occurs before the other, but which one is not specified.
The short version: if two unsequenced side effects modify the same scalar object—or a side effect is unsequenced with a value computation that uses that scalar’s value—you have UB. If they are indeterminately sequenced, both orders are valid and the compiler may pick either.
The infamous footguns
int i = 0;
i = i++; // UB: i is modified and read without sequencing
int a = 1;
int b = a++ + a++; // UB: two unsequenced modifications of a
int f(int), g(int);
int x = f(1) + g(2); // OK, but order of f and g is unspecified
int arr[2] = {0,0};
int j = 0;
arr[j] = j++; // UB: unsequenced read/modify of j
Instead, make sequencing explicit:
int i = 0;
int t = i;
i = i + 1;
// or just ++i; in a separate statement
int a = 1;
int b = a + (a + 1);
++a; ++a; // or refactor to avoid reliance on interim values
int f(int), g(int);
int x = f(1);
x += g(2); // force a clear order
int arr[2] = {0,0};
int j = 0;
arr[j] = j;
j++;
Even in single-threaded code, compilers are allowed to rearrange as long as the as-if rule holds. In multi-threaded code without proper synchronization, they can and will exploit UB assumptions, often eliminating reads/writes you thought were “obvious.”
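As a concrete illustration, consider a flag polled in a loop. This is a hypothetical sketch, but it shows exactly the kind of transformation the as-if rule permits when the flag is a plain, non-atomic object:
int stop = 0; // written by another thread without synchronization: a data race, hence UB
void worker(void) {
    while (!stop) {
        /* do some work */
    }
}
// Because the loop body never writes stop and the language assumes no data race occurs,
// the compiler may hoist the load out of the loop and effectively compile this as:
//   if (!stop) { for (;;) { /* do some work */ } }
// turning a "temporary" poll into an infinite loop.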
Single-core intuition vs multicore reality
Consider this very common pattern:
#include <stdbool.h>
#include <stdio.h>
int data = 0;
bool done = false;
// Thread A
void producer(void) {
data = 42; // 1
done = true; // 2
}
// Thread B
void consumer(void) {
if (done) { // 3
// expect to see data == 42
printf("%d\n", data); // 4
}
}
It looks fine: write the data, then set the flag; reader checks the flag, then reads the data. But in C, absent atomics or locks, this has a data race—unsynchronized access to the same object (done, and possibly data) from two threads, where at least one access is a write. The behavior is undefined. Practically, you can observe:
- Reordering by the compiler: lines (1) and (2) might commute or get hoisted/sunk relative to other ops.
- CPU reordering and cache visibility issues across cores.
- The reader sees done == true yet still reads the old data value.
The fix is to use atomics with appropriate memory ordering or a lock. We’ll get into the “how” and “which order” next, but the key point is: without atomics or locks, the language gives you zero inter-thread guarantees.
Teaser: what memory orders are and why they exist
Starting in C11, we have <stdatomic.h> and a formal model with memory orders:
- memory_order_relaxed: atomicity without ordering/visibility guarantees beyond the modification order of the atomic itself.
- memory_order_acquire / memory_order_release / memory_order_acq_rel: establish one-way or two-way visibility edges between threads.
- memory_order_seq_cst: a stronger, global total order over all seq_cst operations.
You choose the weakest order that preserves correctness. Using seq_cst everywhere is simple but can be slower; using relaxed blindly is wrong. The art is mapping the correctness requirements (visibility + ordering) to the right order.
We’ll build up the mental model in steps: first sequencing and UB (this section), next the happens-before relation and acquire/release, then fences vs atomic read-modify-write (RMW), and finally performance trade-offs and patterns.
Reality check: the optimizer is not your enemy (but it’s not your friend)
The optimizer’s job is to preserve the program’s defined behavior and remove everything else. If your code relies on an evaluation order that the language doesn’t promise, or on racy communication between threads, the optimizer will “break” it—because it’s allowed to.
Some tips you can adopt immediately:
- Don’t write multiple side effects on the same scalar in a single expression. Prefer smaller, explicit statements.
- Never use volatile as a synchronization mechanism. Use atomics or locks.
- Assume function argument evaluation order is unspecified; avoid writing code that depends on it.
- In multi-threaded code, treat non-atomic shared reads/writes as a bug.
Establishing happens-before with atomics
C11 gives us <stdatomic.h>, a vocabulary of atomic types and operations, and selectable memory orders. The goal is simple: make your intended visibility and ordering explicit so both the compiler and the CPU cooperate.
Let’s fix the earlier handoff example properly using acquire/release:
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
int data = 0; // ordinary object written before publish
atomic_bool done = ATOMIC_VAR_INIT(false);
// Thread A
void producer(void) {
data = 42; // write the payload first
// publish with release: all prior writes in this thread become visible
atomic_store_explicit(&done, true, memory_order_release);
}
// Thread B
void consumer(void) {
// acquire pairs with release: if we see true, we see prior writes
if (atomic_load_explicit(&done, memory_order_acquire)) {
printf("%d\n", data); // guaranteed to print 42
}
}
Why this is correct:
- The release store on done prevents reordering of the data = 42 write after the publish.
- The acquire load on done prevents reordering of subsequent reads (the data read) before the check.
- Seeing done == true via acquire establishes a happens-before edge from the producer’s prior writes to the consumer’s subsequent reads, making the non-atomic data read defined and up-to-date.
This pattern—write data, then publish a flag with release; read flag with acquire, then read data—is the building block for many single-producer/single-consumer designs.
Choosing the right memory order
Pick the weakest order that still preserves correctness:
- memory_order_relaxed: atomicity only. No cross-thread ordering/visibility beyond the atomic’s own modification order. Great for counters/telemetry when exact inter-thread ordering isn’t needed.
- memory_order_release / memory_order_acquire: create one-way visibility. Use for handoff of data via flags or queues.
- memory_order_acq_rel: use on read-modify-write (RMW) ops when you need both sides (prior writes visible to others, and your subsequent reads not hoisted).
- memory_order_seq_cst: imposes a global order over all seq_cst ops. Easiest to reason about, sometimes slower. Use sparingly for global invariants or when debugging tricky races.
Relaxed counters (safe and fast)
#include <stdatomic.h>
#include <stdint.h>
_Atomic uint64_t packets = ATOMIC_VAR_INIT(0); // note: there is no standard atomic_uint64_t typedef
void on_packet(void) {
atomic_fetch_add_explicit(&packets, 1, memory_order_relaxed);
}
uint64_t snapshot(void) {
// A relaxed read is fine if we only need an approximate count
return atomic_load_explicit(&packets, memory_order_relaxed);
}
This preserves atomicity without imposing ordering that the algorithm doesn’t need.
Acquire-release handoff (the common case)
#include <stdatomic.h>
#include <stdbool.h>
struct message { int value; };
struct message slot; // written by producer
atomic_bool ready = ATOMIC_VAR_INIT(false);
void produce(int v) {
slot.value = v; // prepare payload
atomic_store_explicit(&ready, true, memory_order_release); // publish
}
bool consume(int *out) {
if (atomic_load_explicit(&ready, memory_order_acquire)) {
*out = slot.value; // safe and up-to-date
return true;
}
return false;
}
Acq_rel on RMW (locking counters, state machines)
#include <stdatomic.h>
#include <stdbool.h>
atomic_int state = ATOMIC_VAR_INIT(0);
// Advance state if it is exactly expected; both acquire and release effects apply
bool advance(int expected, int next) {
return atomic_compare_exchange_strong_explicit(
&state, &expected, next,
memory_order_acq_rel, // success
memory_order_relaxed); // failure: no state change, no need for fences
}
Pitfalls when using atomics
- Mixing atomic and non-atomic access on the same object is a data race. If a variable is ever accessed atomically in one thread, all accesses to that variable that can race must be atomic.
- Atomics don’t fix logic bugs. You still need proper protocols (e.g., single writer for non-atomic payload guarded by an atomic flag, or fully atomic payload updates if multiple writers exist).
- Alignment and type choice matter. Prefer the standard atomic_* typedefs (e.g., atomic_int, atomic_uintptr_t). If an implementation cannot make a type lock-free, it still provides correct semantics, but the performance characteristics may differ.
- volatile is still not synchronization. It doesn’t establish happens-before.
A note on fences
Atomic fences like atomic_thread_fence(memory_order_release) and atomic_thread_fence(memory_order_acquire) can order surrounding non-atomic operations without touching a specific atomic object. They’re useful when you need to separate data movement from the publishing atomic—or when coordinating across multiple atomics—but they are easy to misuse. Prefer acquire/release on the operations themselves unless you have a clear need for decoupling.
Fences vs atomic RMW in practice
There are two main ways to enforce ordering:
- Put the ordering on the operation itself (preferred): e.g., atomic_store_explicit(..., memory_order_release) or atomic_load_explicit(..., memory_order_acquire); on RMW ops use memory_order_acq_rel for success.
- Use a fence to order non-atomic operations around a later/earlier atomic. This is powerful when you must publish multiple fields with a single flag.
Publishing with a fence
#include <stdatomic.h>
struct payload { int a; int b; } p;
atomic_bool ready = ATOMIC_VAR_INIT(false);
void publish(int x, int y) {
p.a = x;
p.b = y;
atomic_thread_fence(memory_order_release); // order the prior writes
atomic_store_explicit(&ready, true, memory_order_relaxed); // ok: fence provides release
}
bool try_consume(struct payload *out) {
if (atomic_load_explicit(&ready, memory_order_acquire)) {
*out = p; // both fields are visible
return true;
}
return false;
}
The release fence ensures all prior ordinary writes become visible before the relaxed store sets the flag. The acquire load pairs with that fence via the flag.
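The mirror-image technique works on the consumer side too: a relaxed load of the flag followed by an acquire fence. A minimal sketch of that variant, reusing ready and p from the example above:
bool try_consume_fence(struct payload *out) {
    if (atomic_load_explicit(&ready, memory_order_relaxed)) {
        atomic_thread_fence(memory_order_acquire); // pairs with the producer's release fence/store
        *out = p; // fields written before the release fence are visible here
        return true;
    }
    return false;
}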
Atomic RMW carries ordering by itself
Read-modify-write operations (like fetch_add, exchange, compare_exchange) can carry acquire, release, or acq_rel semantics in a single step. This often simplifies reasoning compared to sprinkling fences.
#include <stdatomic.h>
atomic_int tickets = ATOMIC_VAR_INIT(0);
int acquire_ticket(void) {
// acq_rel ensures prior writes are published and subsequent reads are not hoisted
return atomic_fetch_add_explicit(&tickets, 1, memory_order_acq_rel);
}
Correct compare_exchange usage
compare_exchange_strong and compare_exchange_weak take two orders: success and failure. On failure, the object is not modified, but the function writes the current value into your expected parameter. Typical guidance:
- Use memory_order_acq_rel on success when the CAS protects both prior writes and subsequent reads.
- Use memory_order_relaxed on failure unless you rely on the value read on failure to observe prior writes (then memory_order_acquire).
- Prefer the weak variant in loops to allow the spurious failures that some platforms use for efficiency.
#include <stdatomic.h>
#include <stdbool.h>
typedef struct node { struct node *next; int value; } node_t;
// Head must be declared as _Atomic(node_t *) and passed by pointer
bool push(_Atomic(node_t *) *head, node_t *n) {
node_t *old = atomic_load_explicit(head, memory_order_relaxed);
do {
n->next = old;
} while (!atomic_compare_exchange_weak_explicit(
head, &old, n,
memory_order_acq_rel, // success
memory_order_relaxed)); // failure
return true;
}
Pitfall: using memory_order_relaxed for success in structures that depend on ordering frequently produces subtle bugs—subsequent readers may not see the linked node’s fields initialized before the CAS.
Safe publication (double-checked style) without locks
Publishing pointers to initialized objects is a common pattern. The key is making initialization complete-before publication and ensuring readers acquire before dereference.
#include <stdatomic.h>
#include <stdlib.h>
struct config { int x; int y; };
_Atomic(struct config *) g_cfg = ATOMIC_VAR_INIT(NULL);
struct config *get_cfg(void) {
struct config *c = atomic_load_explicit(&g_cfg, memory_order_acquire);
if (c) return c;
// Slow path: allocate and initialize
struct config *tmp = malloc(sizeof *tmp);
if (!tmp) return NULL; // allocation failed
tmp->x = 1; tmp->y = 2; // fully initialize
// Publish with release so readers acquire a fully-built object
struct config *expected = NULL;
if (atomic_compare_exchange_strong_explicit(
&g_cfg, &expected, tmp,
memory_order_release, // success
memory_order_relaxed)) // failure
return tmp;
// Another thread won the race; use the published one
free(tmp);
return atomic_load_explicit(&g_cfg, memory_order_acquire);
}
This version ensures that any reader that observes a non-NULL pointer via an acquire load also observes the fully initialized contents.
When seq_cst is warranted
Sequentially consistent ordering forces a single total order over all seq_cst operations across threads. It is not always slower, but it constrains compiler/CPU freedom. Use it when:
- You need a global notion of time for a small set of coordination variables (e.g., a global stop-the-world flag, feature gates).
- You are debugging a complex race and want to simplify reasoning temporarily.
- You must uphold cross-object invariants that become error-prone with only acq/rel (rare in performance-critical paths).
Example: a process-wide “shutdown now” flag that all workers check.
#include <stdatomic.h>
#include <stdbool.h>
atomic_bool shutdown_now = ATOMIC_VAR_INIT(false);
void request_shutdown(void) {
atomic_store_explicit(&shutdown_now, true, memory_order_seq_cst);
}
bool should_exit(void) {
return atomic_load_explicit(&shutdown_now, memory_order_seq_cst);
}
While acq/rel would often suffice here, seq_cst guarantees a consistent order of observations across threads, avoiding rare visibility puzzles in diagnostic code paths.
Performance notes
- Prefer acq/rel over seq_cst on hot paths unless you have a concrete reason for total ordering.
- Keep atomics narrow and localized; shard counters and aggregate periodically (see the sketch after this list).
- Minimize contention by using per-thread or per-core structures, then publish with a single release store.
- Measure. Real hardware and workloads often surprise intuition.
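A sketch of the sharding idea (the shard count and helper names are made up for illustration): each thread bumps its own relaxed counter, and a reader sums the shards when it needs an aggregate.
#include <stdatomic.h>
#include <stdint.h>
#define NSHARDS 16 // hypothetical: at least as many as your worker threads
static _Atomic uint64_t shard[NSHARDS];
// In production you would also pad each shard to its own cache line to avoid false sharing.
// Each thread increments only its own shard: no contention, relaxed is enough.
void count_event(unsigned thread_id) {
    atomic_fetch_add_explicit(&shard[thread_id % NSHARDS], 1, memory_order_relaxed);
}
// The aggregate is approximate by nature: shards keep moving while we sum them.
uint64_t total_events(void) {
    uint64_t sum = 0;
    for (int i = 0; i < NSHARDS; i++)
        sum += atomic_load_explicit(&shard[i], memory_order_relaxed);
    return sum;
}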
Designing with happens-before: a method
When correctness depends on cross-thread visibility, design the happens-before graph first, then choose minimal orders to implement it.
- List shared data and the invariants you need to hold when another thread observes a flag/state.
- Identify the publishing write(s) and the observing read(s).
- Add a release edge on the publisher and an acquire edge on the observer to transport visibility.
- If the same operation must both read prior state and publish new state atomically, use an acq_rel RMW.
- Use relaxed for everything that doesn’t participate in cross-thread ordering.
This keeps the hot path fast and the reasoning clean.
Case study: SPSC ring buffer (acquire/release only)
A single-producer/single-consumer queue can avoid locks with careful ordering. Each side updates its own index with release and reads the other’s index with acquire. Data slots themselves are ordinary (non-atomic) because the indices carry the ordering.
#include <stdatomic.h>
#include <stddef.h>
#include <stdbool.h>
#define CAP 1024
static int buf[CAP];
static atomic_size_t head = ATOMIC_VAR_INIT(0); // next to read (consumer-owned)
static atomic_size_t tail = ATOMIC_VAR_INIT(0); // next to write (producer-owned)
// Producer: returns false if full
bool enqueue(int v) {
size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
size_t h = atomic_load_explicit(&head, memory_order_acquire); // see latest space info
if (((t + 1) % CAP) == h) return false; // full
buf[t] = v; // write payload first
atomic_store_explicit(&tail, (t + 1) % CAP, memory_order_release); // publish
return true;
}
// Consumer: returns false if empty
bool dequeue(int *out) {
size_t h = atomic_load_explicit(&head, memory_order_relaxed);
size_t t = atomic_load_explicit(&tail, memory_order_acquire); // see published items
if (h == t) return false; // empty
*out = buf[h]; // safe due to acquire above
atomic_store_explicit(&head, (h + 1) % CAP, memory_order_release); // release consumption
return true;
}
Notes:
- Only the owner updates its index; the other side reads it. This avoids RMW and uses cheap load/store with acq/rel.
- The acquire load of the opposite index ensures slot contents are visible when consumed.
- The data array is ordinary memory. The indices carry the necessary ordering.
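To exercise the queue, a minimal two-thread harness might look like this (a sketch using POSIX threads, with error handling omitted; remember the design assumes exactly one producer and one consumer):
#include <pthread.h>
#include <stdio.h>
void *producer_thread(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        while (!enqueue(i)) { /* spin: queue full */ }
    return NULL;
}
void *consumer_thread(void *arg) {
    (void)arg;
    int v, received = 0;
    while (received < 100000) {
        if (dequeue(&v)) received++; // else spin: queue empty
    }
    printf("received %d items\n", received);
    return NULL;
}
int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer_thread, NULL);
    pthread_create(&c, NULL, consumer_thread, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}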
Debugging and validation
- Build with sanitizers:
  - Thread sanitizer (data races): -fsanitize=thread (Clang/GCC). Example: cc -O1 -g -fsanitize=thread -fPIE -pie -pthread q.c -o q.
  - Undefined behavior sanitizer: -fsanitize=undefined to catch unsequenced modifications and other UB at runtime.
- Test at high optimization: reproduce under -O3 -march=native, as optimizations expose latent issues.
- Stress on real cores: concurrency bugs are often schedule-sensitive; use many cores and randomized interleavings.
- Add assertions that check invariants at boundaries (e.g., index bounds, pointer non-NULL after acquire).
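For example, using the g_cfg publication from earlier, a debug build can assert the invariant right after the acquire load (a small sketch; the field values are the ones initialized in that example):
#include <assert.h>
void check_cfg_invariants(void) {
    struct config *c = atomic_load_explicit(&g_cfg, memory_order_acquire);
    if (c) {
        // If the acquire load saw the pointer, it must also see the initialized fields.
        assert(c->x == 1 && c->y == 2);
    }
}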
Common myths vs facts
- “volatile makes it thread-safe.” — False. It prevents some compiler optimizations and is for MMIO; it doesn’t create happens-before.
- “If it works at -O0, it’s correct.” — False. Optimizations legally reorder; only defined behavior is preserved.
- “Relaxed is dangerous and slow.” — False. Relaxed is the fastest; it’s dangerous only when misapplied. Use it where ordering isn’t required.
- “seq_cst is always too slow.” — Often false. Measure; many platforms implement seq_cst with a cost similar to acq/rel on some operations, but it does reduce reordering freedom.
Quick checklist
- Are any two threads accessing the same object with at least one write? If yes, use atomics or a lock.
- Do observers need to see a fully initialized object after a flag flips? Use release on publish, acquire on observe.
- Is a single op both reading old state and publishing new? Use acq_rel RMW.
- Can some counters be approximate? Use relaxed and aggregate.
- Have you validated with sanitizers and under -O3?
Closing thoughts
Correctness in concurrent C isn’t about memorizing incantations—it’s about mapping your invariants to explicit happens-before edges with the weakest necessary orders. Start from the rules, choose orders deliberately, and keep your design simple enough to prove. The payoff is software that is both fast and predictably correct.