You tuned a tight loop, vectorized a kernel, and still your program stutters like it’s stuck in molasses. The culprit is rarely your ALU—it’s almost always memory. Modern CPUs sprint only when you feed them with cache-friendly data and keep the cores from fighting over the same cache lines.
This post is a practical guide to writing C that plays nicely with the cache hierarchy: understanding lines and associativity, avoiding false sharing, choosing layouts (AoS vs SoA), and judiciously using prefetching. The goal is fast code that’s predictably fast under load.
Caches in 5 minutes (the mental model you actually use)
Your program sees a flat address space; your CPU sees a pyramid:
- L1 data cache: tiny (32–64 KiB), absurdly fast, private per core, line-based
- L2: bigger (256 KiB–2 MiB), still private
- L3: much larger (several MiB–tens of MiB), shared across cores
- DRAM: orders of magnitude slower; touching it in the hot path is a tax
Key terms you’ll reason about every day:
- Cache line: the unit of transfer/coherence (commonly 64 bytes). Touch one byte, fetch 64.
- Set associativity: each line maps to a limited number of “ways” in a set; conflict misses arise when hot lines collide.
- Write-allocate + write-back: stores typically pull a line into cache (write-allocate) and mark it dirty (write-back later).
- Store buffers: let the core continue past a store before it reaches L1; visibility to other cores is governed by coherence (and your synchronization).
Miss taxonomy (useful to categorize symptoms):
- Compulsory: the first touch; unavoidable unless prefetched or already cached by earlier work
- Capacity: working set > cache; fix with layout and blocking
- Conflict: mapping collisions; fix with padding, alignment, or different traversal
Performance rule of thumb: maximize sequential, stride-1 access and keep the hot working set inside L1/L2. When you can’t, structure accesses to hit each cache line once per pass.
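As a toy illustration (a sketch, not one of the post's later examples): summing a row-major matrix in row order is stride-1 and consumes every byte of each fetched line, while column order jumps a full row per access and usually misses.
#include <stddef.h>

// Row order: consecutive addresses, each 64-byte line is fully consumed
double sum_row_major(const double *m, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

// Column order: the stride is cols * sizeof(double) bytes, so each access
// typically pulls a fresh line and evicts ones you still need
double sum_col_major(const double *m, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}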
False sharing: when independent threads sabotage each other
False sharing happens when threads modify different variables that reside on the same cache line. Coherence traffic forces the line to bounce between cores, turning cheap per-core increments into a cache-churn festival.
Minimal reproduction:
#include <stdint.h>
#include <pthread.h>
#include <stdio.h>
struct counters {
volatile uint64_t a; // thread 0 writes a (volatile keeps the per-iteration stores in this demo)
volatile uint64_t b; // thread 1 writes b
};
static struct counters C = {0, 0};
static const uint64_t N = 100000000ull;
static void *t0(void *_) { for (uint64_t i = 0; i < N; ++i) C.a++; return NULL; }
static void *t1(void *_) { for (uint64_t i = 0; i < N; ++i) C.b++; return NULL; }
int main(void) {
pthread_t th0, th1;
pthread_create(&th0, NULL, t0, NULL);
pthread_create(&th1, NULL, t1, NULL);
pthread_join(th0, NULL);
pthread_join(th1, NULL);
printf("%llu %llu\n", (unsigned long long)C.a, (unsigned long long)C.b);
}
On most machines, a and b land on the same 64-byte line. Each increment invalidates the other core’s copy, causing the line to ping-pong across cores. Throughput plummets even though there is no logical sharing.
Fix: separate hot fields onto different lines
Align and pad per-thread/per-core state so hot fields don’t share a line.
#include <stdalign.h>
#include <stdint.h>
struct padded_u64 { alignas(64) uint64_t v; }; // member alignment makes the struct 64-byte aligned and 64 bytes in size
struct counters2 {
struct padded_u64 a; // 64-byte aligned, isolated
struct padded_u64 b; // 64-byte aligned, isolated
};
static struct counters2 C2 = { {0}, {0} };
If your compiler lacks alignas, use extensions:
struct padded_u64 { __attribute__((aligned(64))) uint64_t v; };
When building arrays of hot counters, pad the stride to a line-sized multiple:
#define CL 64u
struct counter_array {
alignas(CL) uint8_t bytes[CL]; // one full, line-aligned slot per entry (alignas from <stdalign.h>)
};
// use one counter per CPU/thread index; place the 8-byte value at a known offset
static inline uint64_t *counter_ptr(struct counter_array *arr, unsigned idx) {
return (uint64_t *)(arr[idx].bytes); // offset 0 keeps it simple
}
Caveats:
- Don’t over-pad cold fields. Pad only frequently written data that different threads update independently.
- Beware of structure packing/ABI: exported structs shouldn’t change layout across versions/protocols. Hide padding behind internal types.
Layout control: Array of Structs (AoS) vs Struct of Arrays (SoA)
Layout decides your stride and cache reuse. If you touch a few fields across many elements, SoA wins. If you touch many fields of one element at a time, AoS may be better.
AoS (convenient API, wider stride):
struct Particle { float x, y, z, mass; /* ... */ };
struct Particle *p; // array of N
// Update position (reads x,y,z only; mass is a passenger)
for (size_t i = 0; i < N; ++i) {
p[i].x += 0.01f; p[i].y += 0.01f; p[i].z += 0.01f;
}
SoA (narrow stride when touching a subset):
struct Particles {
float *x, *y, *z, *mass; // contiguous per field
};
struct Particles parts; // each field points to N contiguous floats
for (size_t i = 0; i < N; ++i) {
parts.x[i] += 0.01f; parts.y[i] += 0.01f; parts.z[i] += 0.01f;
}
SoA reduces bytes per cache line fetched when only a subset is hot. It also helps vectorization by giving the compiler/unit contiguous lanes. We’ll build on this with blocking/tiling later.
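A minimal allocation sketch for the SoA layout above (hypothetical helper; plain malloc, with the whole set released on any failure):
#include <stdlib.h>

// Allocate n elements per field for the Particles struct shown above.
// Returns 0 on success, -1 on allocation failure.
int particles_init(struct Particles *p, size_t n) {
    p->x = malloc(n * sizeof *p->x);
    p->y = malloc(n * sizeof *p->y);
    p->z = malloc(n * sizeof *p->z);
    p->mass = malloc(n * sizeof *p->mass);
    if (!p->x || !p->y || !p->z || !p->mass) {
        free(p->x); free(p->y); free(p->z); free(p->mass);
        return -1;
    }
    return 0;
}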
Prefetching: useful, but only with discipline
Hardware prefetchers love simple, predictable strides. Software prefetching helps when:
- Your access pattern is predictable to you but opaque to the hardware (e.g., pointer chasing with known future steps)
- You have enough independent work between hint and use to hide latency
Two common interfaces:
// GCC/Clang builtin: locality 0 (no reuse) to 3 (high reuse); rw=0 for read, 1 for write
__builtin_prefetch(addr, /*rw*/0, /*locality*/1);
// x86 intrinsics (needs immintrin.h)
_mm_prefetch((const char *)addr, _MM_HINT_T0); // L1
_mm_prefetch((const char *)addr, _MM_HINT_T1); // L2
_mm_prefetch((const char *)addr, _MM_HINT_T2); // L3
_mm_prefetch((const char *)addr, _MM_HINT_NTA); // streaming/no temporal locality
Pointer-chasing example with a prefetch distance:
struct Node { struct Node *next; int payload; };
int sum_list(struct Node *n) {
int s = 0;
while (n) {
struct Node *next = n->next;
if (next) __builtin_prefetch(next, 0, 1); // hint next node
s += n->payload;
n = next;
}
return s;
}
Rules of thumb:
- Choose a prefetch distance that gives 200–400 cycles of independent work before use on big cores; measure.
- Remove prefetch if it doesn’t help in your workload. Over-prefetching creates bandwidth pressure and cache pollution.
- Prefer algorithmic locality (SoA, blocking) before prefetch. Hints are last-mile tweaks.
A tiny checklist you can apply immediately
- Align and pad frequently written per-thread fields to 64 bytes (or your platform’s line size)
- Audit tight loops for stride; convert AoS→SoA where you touch a subset across many elements
- Measure before/after: look for reduced LLC misses and improved scaling across cores
- Treat prefetching as experimental: add, measure, keep or remove
We’ll build on these foundations with blocking/tiling, streaming stores, and NUMA-aware placement next—keeping the hot path lean and the cores out of each other’s way.
Blocking and tiling: eat one cache at a time
When your working set doesn’t fit in L1/L2, you don’t have to lose. You can process data in tiles that do fit, maximizing reuse before moving on. Classic example: matrix multiply.
Naïve traversal touches full rows/columns repeatedly, blowing the cache. Block it so each submatrix stays hot while you compute its contribution.
#include <stddef.h>
#define B 64 // tune per machine: choose so 3*B*B*sizeof(double) < ~1/2 of L2
void dgemm_blocked(size_t n, const double *A, const double *Bmat, double *C) {
for (size_t ii = 0; ii < n; ii += B)
for (size_t jj = 0; jj < n; jj += B)
for (size_t kk = 0; kk < n; kk += B) {
size_t iimax = ii + (B < n-ii ? B : n-ii);
size_t jjmax = jj + (B < n-jj ? B : n-jj);
size_t kkmax = kk + (B < n-kk ? B : n-kk);
for (size_t i = ii; i < iimax; ++i)
for (size_t k = kk; k < kkmax; ++k) {
double aik = A[i*n + k];
// Prefetch a future C row chunk to smooth write-allocate
__builtin_prefetch(&C[i*n + jj], 1, 1);
for (size_t j = jj; j < jjmax; ++j) {
C[i*n + j] += aik * Bmat[k*n + j];
}
}
}
}
Picking B:
- Rule of thumb for double: 3 * B^2 * 8 bytes < f * L2, with f ∈ [0.3, 0.6] to leave headroom
- Validate with perf: capacity misses should drop; IPC should rise
Tiling generalizes:
- Images: process tiles of WxH pixels that fit L1/L2
- Stencils: keep radius r halos in the tile; slide tiles with overlap
- Hash tables: batch probes by cache-index buckets to reduce thrash
Conflict misses and associativity traps
Even if data “fits,” it can still collide when many hot lines map to the same set. The symptom: performance craters only for certain sizes/strides (often powers of two).
Pathological stride example:
#include <stddef.h>
#include <stdint.h>
// Sum one byte per stride; power-of-two strides can keep landing in the same cache sets
uint64_t sum_stride(const uint8_t *buf, size_t len, size_t stride) {
uint64_t s = 0;
for (size_t i = 0; i < len; i += stride) {
s += buf[i];
}
return s;
}
If stride is a power-of-two multiple of the cache’s way size (cache size divided by associativity), successive accesses keep mapping to the same set. Fixes:
- Break the alignment with a small offset: e.g., stride += 64, or pad rows by +1..+16 bytes
- Align base pointers to 64 bytes to make padding predictable
- Change traversal order to touch adjacent lines before wrapping around
Row padding pattern (avoid row-to-row set conflicts):
enum { WIDTH = 1920, PAD = 16 }; // PAD breaks power-of-two stride
uint8_t img[(WIDTH + PAD) * 1080]; // single plane, padded rows
static inline uint8_t *row(uint8_t *base, size_t pitch, size_t y) {
return base + y * pitch;
}
// pitch = WIDTH + PAD; iterate x < WIDTH, but advance by pitch per row
Alignment and allocation that don’t fight the cache
Alignment helps both vector units and the cache hierarchy. Use 64-byte alignment for hot arrays and per-core state.
Portable allocation options:
#include <stdlib.h>
#include <errno.h>
// C11
void *p = aligned_alloc(64, size /* must be multiple of 64 */);
// POSIX
void *q = NULL; if (posix_memalign(&q, 64, size) != 0) { /* handle ENOMEM/EINVAL */ }
// Fallback (over-allocate + adjust); uintptr_t comes from <stdint.h>:
void *raw = malloc(size + 63 + sizeof(void*));
if (!raw) { /* handle allocation failure */ }
uintptr_t base = (uintptr_t)raw + sizeof(void*);
void *aligned = (void *)((base + 63u) & ~((uintptr_t)63));
((void**)aligned)[-1] = raw; // store the raw pointer so it can be freed later
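To release memory from the fallback path, recover the raw pointer stashed just below the aligned address (this matches the layout above):
static void aligned_free_fallback(void *aligned) {
    if (aligned) free(((void **)aligned)[-1]); // the original malloc() pointer
}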
Annotate objects you embed in structs:
struct ring_cursor { alignas(64) volatile unsigned head; volatile unsigned tail; };
Streaming stores (non-temporal writes)
When you write a large buffer once (e.g., image render, log encoder), filling cache with those lines hurts nearby hot data. Non-temporal stores bypass (or demote) caches and avoid read-for-ownership.
Two approaches:
#include <immintrin.h>
// Float example: write 16 floats per iteration without polluting caches
// (assumes dst is 16-byte aligned, as _mm_stream_ps requires)
void stream_store_float(float *dst, float value, size_t n) {
size_t i = 0;
__m128 v = _mm_set1_ps(value);
for (; i + 16 <= n; i += 16) {
_mm_stream_ps(dst + i + 0, v);
_mm_stream_ps(dst + i + 4, v);
_mm_stream_ps(dst + i + 8, v);
_mm_stream_ps(dst + i + 12, v);
}
for (; i < n; ++i) dst[i] = value; // tail
_mm_sfence(); // ensure visibility before consumers
}
// Clang builtin (GCC does not provide this one); works for integer types too
static inline void store_nt_u64(unsigned long long *p, unsigned long long v) {
__builtin_nontemporal_store(v, p);
}
Guidelines:
- Use only for write-once or write-rarely streams where re-reading is unlikely soon
- Ensure destination is 16/32/64-byte aligned for vectorized streams
- Finish with a fence (_mm_sfence) before making data visible to other threads/devices
Prefetch tuning beyond basics
Think of prefetch as a pipeline: issue hints d iterations ahead so lines arrive just-in-time.
// Lead/lag pipeline over a contiguous array
void saxpy_pf(size_t n, float a, const float *x, float *y) {
const size_t D = 128; // elements ahead; tune (bytes = D*sizeof(float))
for (size_t i = 0; i < n; ++i) {
if (i + D < n) {
__builtin_prefetch(&x[i + D], 0, 3);
__builtin_prefetch(&y[i + D], 1, 3);
}
y[i] = a * x[i] + y[i];
}
}
Advanced tips:
- Use T2/T1 hints (_MM_HINT_T2/T1) for farther targets, T0 as you near use
- For linked structures, prefetch the next node and its payload (next->payload) if known
- Don’t stack multiple overlapping prefetches per iteration unless you measured a win
Measure what matters (and pin it)
Use a single core to reduce noise and read cache counters to prove improvements.
# Pin to CPU 2 and collect key cache stats
taskset -c 2 perf stat -e \
cycles,instructions,cache-references,cache-misses,\
L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
./app
# Inspect where misses happen
taskset -c 2 perf record -e cache-misses,LLC-load-misses -g -- ./app
perf report --stdio | cat
What you want to see after fixes:
- Lower LLC-load-misses and cache-misses ratio
- Higher IPC (instructions per cycle)
- Flatter tail latency in repeated runs
A minimal timing harness (steady-state):
#include <time.h>
#include <stdint.h>
static uint64_t now_ns(void) {
struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
return (uint64_t)ts.tv_sec*1000000000ull + (uint64_t)ts.tv_nsec;
}
double bench(void (*fn)(void)) {
const int warm = 5, reps = 50; double best = 1e99;
for (int i = 0; i < warm + reps; ++i) {
uint64_t t0 = now_ns(); fn(); uint64_t t1 = now_ns();
if (i >= warm) { double ms = (t1 - t0) / 1e6; if (ms < best) best = ms; }
}
return best;
}
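Usage is a one-liner; wrap whatever you want to time in a void function (kernel_under_test is a placeholder name):
#include <stdio.h>

static void kernel_under_test(void) {
    // call the code path you want to time, e.g. dgemm_blocked(...) from the blocking section
}

int main(void) {
    printf("best: %.3f ms\n", bench(kernel_under_test));
}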
A practical checklist (round 2)
- Block/tiling loops so working sets fit in L1/L2; validate with counters
- Break power-of-two strides with padding; align hot arrays to 64 bytes
- Use non-temporal stores for write-once streams; fence before publish
- Treat prefetch as a pipeline; tune distance and hint level empirically
- Pin and measure with perf; improvements that don’t show up in the counters usually aren’t real wins
NUMA-aware placement: keep bytes near the core that uses them
On multi-socket systems, remote memory can cost 1.3–2.0× the latency of local memory and deliver less bandwidth. You don’t need a full NUMA-aware redesign to win; a few rules go a long way:
- First touch: memory is placed on the node of the thread that first writes it
- Pin threads: keep workers on specific nodes to preserve locality
- Shard state per node: aggregate periodically instead of sharing globally
Minimal libnuma usage (Linux):
#include <numa.h>
#include <pthread.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>
struct shard { alignas(64) unsigned long counters[16]; };
struct shard *alloc_shard_on(int node) {
if (numa_available() < 0) return NULL;
struct shard *s = numa_alloc_onnode(sizeof *s, node);
if (!s) return NULL;
for (int i = 0; i < 16; ++i) s->counters[i] = 0; // first-touch on node
return s;
}
static void *worker(void *arg) {
int node = (int)(uintptr_t)arg;
numa_run_on_node(node); // pin thread to node
struct shard *local = alloc_shard_on(node);
// ... do work using local->counters ...
(void)local; return NULL;
}
int main(void) {
pthread_t t0, t1;
pthread_create(&t0, NULL, worker, (void*)(uintptr_t)0);
pthread_create(&t1, NULL, worker, (void*)(uintptr_t)1);
pthread_join(t0, NULL); pthread_join(t1, NULL);
}
Fallback if libnuma is unavailable: pin with sched_setaffinity and rely on first-touch by initializing buffers on the owning thread.
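A minimal pinning sketch with sched_setaffinity (Linux-specific; pass whatever CPU number owns the buffer):
#define _GNU_SOURCE
#include <sched.h>

// Pin the calling thread to a single CPU; returns 0 on success.
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set); // pid 0 = calling thread
}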
Guidelines:
- Allocate and initialize per-node structures inside the thread bound to that node
- Avoid frequent cross-node writes; aggregate via one thread per node to a global snapshot
- For shared read-mostly data, interleave or replicate per node
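For the read-mostly case in the last bullet, libnuma can spread pages round-robin across nodes; a sketch:
#include <numa.h>
#include <stddef.h>

// Spread a large, read-mostly buffer across all NUMA nodes, page by page.
// Release it later with numa_free(ptr, bytes).
static void *alloc_interleaved(size_t bytes) {
    if (numa_available() < 0) return NULL;
    return numa_alloc_interleaved(bytes);
}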
Per-core sharding: histograms, counters, and queues that scale
Instead of a single hot counter, keep one per core and sum occasionally. Pad to cache lines and index by CPU.
#define _GNU_SOURCE // for sched_getcpu on glibc
#include <sched.h>
#include <stdalign.h>
#include <stdint.h>
#define CL 64u
struct counter { alignas(CL) unsigned long v; };
static struct counter per_core[256]; // large enough upper bound
static inline unsigned this_cpu(void) {
#if defined(__linux__)
return (unsigned)sched_getcpu() % 256; // clamp to the per_core array bound
#else
// Fallback: thread-local random shard; aggregate correctness still holds
static _Thread_local unsigned idx;
if (!idx) idx = (unsigned)(uintptr_t)&idx % 256;
return idx;
#endif
}
void inc_counter(void) { per_core[this_cpu()].v++; }
unsigned long snapshot(void) {
unsigned long sum = 0; for (int i = 0; i < 256; ++i) sum += per_core[i].v; return sum;
}
Use the same pattern for per-core histograms or small accumulators; flush to a shared structure during quiescent periods.
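The histogram version is the same trick with a small array per core (the bucket count here is arbitrary; this_cpu() is the helper from the counter example above):
#include <stdalign.h>

#define HIST_BUCKETS 64

// One row per core; alignas keeps each row starting on its own cache line.
struct per_core_hist { alignas(64) unsigned long bucket[HIST_BUCKETS]; };
static struct per_core_hist hist[256];

void record_sample(unsigned bucket_idx) {
    hist[this_cpu()].bucket[bucket_idx % HIST_BUCKETS]++;
}

void snapshot_hist(unsigned long out[HIST_BUCKETS]) {
    for (int b = 0; b < HIST_BUCKETS; ++b) out[b] = 0;
    for (int c = 0; c < 256; ++c)
        for (int b = 0; b < HIST_BUCKETS; ++b) out[b] += hist[c].bucket[b];
}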
Cache-friendly SPSC ring buffer (layout matters)
Separate producer and consumer indices so they don’t live on the same line. Keep the data array independent.
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#define CAP 1024
struct index64 { alignas(64) atomic_size_t v; char pad[64 - sizeof(atomic_size_t)]; };
static int buf[CAP];
static struct index64 head = { ATOMIC_VAR_INIT(0) }, tail = { ATOMIC_VAR_INIT(0) };
int enqueue(int x) {
size_t t = atomic_load_explicit(&tail.v, memory_order_relaxed);
size_t h = atomic_load_explicit(&head.v, memory_order_acquire);
if (((t + 1) % CAP) == h) return 0; // full
buf[t] = x; // write payload first
atomic_store_explicit(&tail.v, (t + 1) % CAP, memory_order_release);
return 1;
}
int dequeue(int *out) {
size_t h = atomic_load_explicit(&head.v, memory_order_relaxed);
size_t t = atomic_load_explicit(&tail.v, memory_order_acquire);
if (h == t) return 0; // empty
*out = buf[h];
atomic_store_explicit(&head.v, (h + 1) % CAP, memory_order_release);
return 1;
}
This avoids false sharing on the indices and gives each side its own line. For multi-producer/consumer, prefer queues designed for that pattern (and still pad hot fields).
Hash tables that don’t thrash
Open addressing with linear/robin-hood probing keeps probes on adjacent lines. Prefetch the next probe early.
#include <stdint.h>
#include <stddef.h>
struct entry { uint64_t key; uint64_t val; };
struct table { struct entry *e; size_t cap; };
// cap must be a power of two; key 0 is reserved to mean "empty"
static inline size_t h(uint64_t k, size_t cap) { return (k * 11400714819323198485ull) & (cap - 1); }
int lookup(const struct table *T, uint64_t key, uint64_t *out) {
size_t i = h(key, T->cap);
for (size_t step = 0; step < T->cap; ++step, i = (i + 1) & (T->cap - 1)) {
size_t next = (i + 1) & (T->cap - 1);
__builtin_prefetch(&T->e[next], 0, 1); // anticipate next bucket
uint64_t k = T->e[i].key;
if (k == 0) return 0; // empty slot
if (k == key) { *out = T->e[i].val; return 1; }
}
return 0;
}
Tips:
- Power-of-two capacity with good multiplicative hashing makes modulo cheap and preserves locality
- Group small buckets per cache line (e.g., 4 entries) to amortize tag checks per line; a sketch follows this list
- Pad/align the table base to 64 bytes
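A sketch of line-grouped buckets (four 8-byte keys plus four 8-byte values per 64-byte bucket; a hypothetical layout, separate from the table code above):
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 4 // 4*(8 B key) + 4*(8 B value) = 64 bytes = one line

struct bucket { alignas(64) uint64_t keys[SLOTS]; uint64_t vals[SLOTS]; };

// nbuckets must be a power of two; key 0 means "empty slot".
int grouped_lookup(const struct bucket *b, size_t nbuckets, uint64_t key, uint64_t *out) {
    size_t i = (key * 11400714819323198485ull) & (nbuckets - 1);
    for (size_t probe = 0; probe < nbuckets; ++probe, i = (i + 1) & (nbuckets - 1)) {
        for (int s = 0; s < SLOTS; ++s) {
            if (b[i].keys[s] == key) { *out = b[i].vals[s]; return 1; }
            if (b[i].keys[s] == 0) return 0; // hit an empty slot: key is absent
        }
    }
    return 0;
}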
Detecting and proving false sharing
You can see line ping-pong in hardware counters.
# System-wide cache-to-cache analysis (root often required)
sudo perf c2c record -a -- ./app
sudo perf c2c report | cat
# Per-process coherence stats (PMU event names vary by CPU; these are Intel examples)
perf stat -e cycles,instructions,LLC-loads,LLC-load-misses,\
mem_load_retired.l3_miss,offcore_response.demand_data_rd.llc_miss.local_dram \
./app
Signals of false sharing:
- A high count of HITM events (loads that hit a line modified in another core’s cache) or cache-to-cache transfers on nearby addresses
- Throughput improves dramatically after padding hot fields
Layout pitfalls: bitfields, packed structs, and friendly sizes
Bitfields look convenient but often generate read-modify-write cycles that stomp on neighboring fields sharing the same word/line.
Preferred alternatives:
// Instead of bitfields across hot flags, use explicit masks on an aligned word
struct flags64 { alignas(64) unsigned long w; }; // alignas from <stdalign.h>
enum { F_A = 1u<<0, F_B = 1u<<1 };
static inline void set_flag(struct flags64 *f, unsigned m) { f->w |= m; }
static inline int has_flag(const struct flags64 *f, unsigned m) { return (f->w & m) != 0; }
Packed structs (__attribute__((packed))) save space but can force unaligned accesses; in hot paths, prefer natural alignment and reorder fields to minimize padding (a small example follows the assert below). Lock in expectations with compile-time checks where available.
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
_Static_assert(sizeof(struct flags64) == 64, "flags64 not cache-line sized");
#endif
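And a small, hypothetical illustration of reordering fields to cut padding (widest members first usually minimizes the holes):
#include <stdint.h>

// 24 bytes on typical 64-bit ABIs: 7 pad bytes after tag, 4 after len
struct before { uint8_t tag; uint64_t id; uint32_t len; };

// 16 bytes: same fields, widest-first ordering leaves only 3 trailing pad bytes
struct after { uint64_t id; uint32_t len; uint8_t tag; };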
A practical checklist (round 3)
- Pin threads and use first-touch to keep memory near its worker; shard per NUMA node
- Shard per core for hot counters/histograms and aggregate periodically
- Separate hot indices/control words to their own cache lines in queues/rings
- Use open addressing with contiguous probes; prefetch the next bucket
- Replace bitfields in hot paths with masked word updates; avoid packed in hot structs
End-to-end case study: a cache-friendly 2D filter
Let’s turn knobs in one realistic pipeline: a 2D 3×3 convolution over an 8-bit grayscale image, producing a new image. We’ll start naïve, then apply layout, tiling, prefetch, and streaming stores.
Baseline (naïve AoS-like flat buffer, no tiling):
#include <stddef.h>
#include <stdint.h>
void conv3x3_naive(const uint8_t *in, uint8_t *out, size_t w, size_t h, ptrdiff_t stride) {
// stride == w for tightly packed rows; borders skipped for brevity
for (size_t y = 1; y + 1 < h; ++y) {
for (size_t x = 1; x + 1 < w; ++x) {
int sum = 0;
sum += 1*in[(y-1)*stride + (x-1)] + 2*in[(y-1)*stride + x] + 1*in[(y-1)*stride + (x+1)];
sum += 2*in[y*stride + (x-1)] + 4*in[y*stride + x] + 2*in[y*stride + (x+1)];
sum += 1*in[(y+1)*stride + (x-1)] + 2*in[(y+1)*stride + x] + 1*in[(y+1)*stride + (x+1)];
out[y*stride + x] = (uint8_t)(sum >> 4);
}
}
}
Issues:
- Poor reuse across rows/columns; each pixel load is used once
- Writes cause write-allocate; hot lines in other code get evicted
Blocked version (tile WxH; prefetch next row; avoid thrash with padded stride):
#include <immintrin.h>
void conv3x3_tiled(const uint8_t *in, uint8_t *out, size_t w, size_t h, ptrdiff_t pitch) {
const size_t TW = 128, TH = 64; // tune: make 3 rows × TW fit L1 with headroom
for (size_t ty = 1; ty + 1 < h; ty += TH) {
size_t ymax = (ty + TH + 1 < h) ? ty + TH : h - 1;
for (size_t tx = 1; tx + 1 < w; tx += TW) {
size_t xmax = (tx + TW + 1 < w) ? tx + TW : w - 1;
for (size_t y = ty; y < ymax; ++y) {
// prefetch input rows for the next iteration to hide row jumps
if (y + 2 < h) __builtin_prefetch(&in[(y+2)*pitch + tx], 0, 3);
for (size_t x = tx; x < xmax; ++x) {
int sum = 0;
const uint8_t *r0 = &in[(y-1)*pitch + x];
const uint8_t *r1 = &in[y*pitch + x];
const uint8_t *r2 = &in[(y+1)*pitch + x];
sum += 1*r0[-1] + 2*r0[0] + 1*r0[+1];
sum += 2*r1[-1] + 4*r1[0] + 2*r1[+1];
sum += 1*r2[-1] + 2*r2[0] + 1*r2[+1];
out[y*pitch + x] = (uint8_t)(sum >> 4);
}
}
}
}
}
Further wins:
- Vectorize across x with 16-byte/32-byte loads; keep three sliding row vectors
- Use non-temporal stores for large output images when the next stage won’t read soon
- Pad pitch by +16 to break set conflicts for wide power-of-two widths
Non-temporal write variant for the output row tail:
// Assumes dst is 16-byte aligned (required by _mm_stream_si128); src may be unaligned
static inline void store_row_nt(uint8_t *dst, const uint8_t *src, size_t n) {
size_t i = 0;
for (; i + 16 <= n; i += 16) {
__m128i v = _mm_loadu_si128((const __m128i *)(src + i));
_mm_stream_si128((__m128i *)(dst + i), v);
}
for (; i < n; ++i) dst[i] = src[i];
_mm_sfence();
}
Measure deltas with cache counters; expect lower LLC misses and improved IPC.
TLBs, page size, and why huge pages sometimes help
TLB misses stall even when caches are warm. Sequential scans over very large arrays can saturate the TLB.
Guidance:
- Keep working sets small via tiling first; TLB problems often disappear
- Consider huge pages (Linux madvise(MADV_HUGEPAGE) or explicit hugetlbfs) for large, linear buffers
- Beware of increased internal fragmentation; test both ways, since huge pages can hurt random access patterns
Linux sketches:
#include <sys/mman.h>
// Request THP for a region (transparent, best-effort)
void prefer_thp(void *p, size_t len) { madvise(p, len, MADV_HUGEPAGE); }
// Explicit huge page mapping (requires sysctl/config and privileges)
// mmap(..., MAP_HUGETLB | (2MB/1GB flags))
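A hedged sketch of the explicit-mapping route (Linux-only; assumes huge pages have been reserved, e.g. via vm.nr_hugepages, and uses the system’s default huge page size, usually 2 MiB):
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

// Map len bytes backed by huge pages; len should be a multiple of the huge page size.
// Returns NULL on failure (e.g., no huge pages reserved).
static void *map_huge(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}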
Store buffers, fences, and publication
Stores retire into store buffers before they reach L1. When publishing data to other threads, use proper memory order so readers see complete writes in the intended order.
Pattern refresher (data, then flag with release; reader checks with acquire):
#include <stdatomic.h>
struct payload { int a, b, c; } P;
atomic_int ready = ATOMIC_VAR_INIT(0);
void publish(int x, int y, int z) {
P.a = x; P.b = y; P.c = z;
atomic_store_explicit(&ready, 1, memory_order_release);
}
int consume(struct payload *out) {
if (atomic_load_explicit(&ready, memory_order_acquire)) { *out = P; return 1; }
return 0;
}
This ensures cache-visible ordering across cores. Combine with cache-friendly layout (pad hot fields, align) for fewer coherence stalls.
Tooling beyond perf: where to look next
- Intel VTune, AMD uProf: deeper breakdowns (bandwidth roofline, bad speculation)
- Linux perf mem: sampling memory accesses with latency attribution
- Flamegraphs + perf script + stackcollapse-perf.pl: visualize hot call stacks before/after layout changes
- macOS: Instruments (Time Profiler, Counters); DTrace where available
Common anti-patterns and safer alternatives
- Writing interleaved per-thread logs into a single array: shard per thread and concatenate; or use line-sized chunks per thread
- Frequent atomics on a shared counter in the hot path: per-core counters + periodic aggregation
- Bitfields/packed types in hot structs: align and use masks; keep hot fields in their own line
- Random access without structure: bucketize by locality; reorder work units to improve spatial/temporal reuse
Production checklist (final)
- Pick layouts that match access patterns (AoS vs SoA); tile to fit L1/L2
- Eliminate false sharing with line-sized padding and per-core sharding
- Break stride pathologies with small paddings and 64-byte alignment
- Use prefetch as a measured pipeline; remove if it doesn’t help
- Apply non-temporal stores for write-once streams; fence before publish
- On multi-socket boxes, pin and first-touch; replicate read-mostly state per node
- Measure with counters, not vibes; keep a small harness and baselines checked in
Closing thoughts
Cache-aware C isn’t a bag of tricks—it’s a way of shaping data and work so the hardware’s fast paths are the default path. Choose layouts that minimize bytes touched per result, keep hot state off your neighbors’ cache lines, process data in tiles that fit a cache, and prove improvements with counters. Do that, and your programs stop stuttering and start sprinting.