Cache-Aware C: False Sharing, Prefetching, and Layout Control

Published: September 14, 2015 · Updated: November 2, 2022 · 23 min read

You tuned a tight loop, vectorized a kernel, and still your program stutters like it’s stuck in molasses. The culprit is rarely your ALU—it’s almost always memory. Modern CPUs sprint only when you feed them with cache-friendly data and keep the cores from fighting over the same cache lines.

This post is a practical guide to writing C that plays nicely with the cache hierarchy: understanding lines and associativity, avoiding false sharing, choosing layouts (AoS vs SoA), and judiciously using prefetching. The goal is fast code that’s predictably fast under load.

Caches in 5 minutes (the mental model you actually use)

Your program sees a flat address space; your CPU sees a pyramid:

  • L1 data cache: tiny (32–64 KiB), absurdly fast, private per core, line-based
  • L2: bigger (256 KiB–2 MiB), still private
  • L3: much larger (several MiB–tens of MiB), shared across cores
  • DRAM: orders of magnitude slower; touching it in the hot path is a tax
graph TB
  subgraph "CPU Memory Hierarchy"
    CPU[CPU Core]
    L1["L1 Cache<br/>32-64 KiB<br/>~1 cycle<br/>Private"]
    L2["L2 Cache<br/>256 KiB-2 MiB<br/>~10 cycles<br/>Private"]
    L3["L3 Cache<br/>Several MiB<br/>~40 cycles<br/>Shared"]
    DRAM["DRAM<br/>GiB+<br/>~200+ cycles<br/>Shared"]
    CPU --> L1
    L1 --> L2
    L2 --> L3
    L3 --> DRAM
  end
  subgraph "Cache Line (64 bytes)"
    CL1[Byte 0-7]
    CL2[Byte 8-15]
    CL3[Byte 16-23]
    CL4[Byte 24-31]
    CL5[Byte 32-39]
    CL6[Byte 40-47]
    CL7[Byte 48-55]
    CL8[Byte 56-63]
  end
  subgraph "Performance Impact"
    P1["Touch 1 byte →<br/>Load 64 bytes"]
    P2["Sequential access →<br/>~1 cycle per byte"]
    P3["Random access →<br/>~200+ cycles per miss"]
  end
  style L1 fill:#e8f5e8
  style L2 fill:#fff3e0
  style L3 fill:#ffecb3
  style DRAM fill:#ffebee
  style P3 fill:#ffcdd2

Key terms you’ll reason about every day:

  • Cache line: the unit of transfer/coherence (commonly 64 bytes). Touch one byte, fetch 64.
  • Set associativity: each line maps to a limited number of “ways” in a set; conflict misses arise when hot lines collide (see the set-index sketch after this list).
  • Write-allocate + write-back: stores typically pull a line into cache (write-allocate) and mark it dirty (write-back later).
  • Store buffers: let the core continue past a store before it reaches L1; visibility to other cores is governed by coherence (and your synchronization).
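
To make “cache line” and “set” concrete, here’s a minimal sketch that maps an address to its line, set, and offset. The 32 KiB / 8-way / 64-byte geometry is an assumption for illustration; query your CPU (CPUID, sysfs) for the real numbers.

#include <stdint.h>
#include <stdio.h>

// Assumed L1 geometry: 32 KiB, 8-way, 64-byte lines => 64 sets.
enum { LINE = 64, WAYS = 8, L1_BYTES = 32 * 1024 };
enum { SETS = L1_BYTES / (LINE * WAYS) };

// Two hot addresses whose line numbers are equal modulo SETS compete for the
// same WAYS slots; a ninth such line evicts one of the eight already there.
static void where_does_it_go(const void *p) {
  uintptr_t a = (uintptr_t)p;
  printf("addr %p -> line %lu, set %lu, offset %lu\n", p,
         (unsigned long)(a / LINE),
         (unsigned long)((a / LINE) % SETS),
         (unsigned long)(a % LINE));
}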

Miss taxonomy (useful to categorize symptoms):

  • Compulsory: the first touch; unavoidable unless prefetched or already cached by earlier work
  • Capacity: working set > cache; fix with layout and blocking
  • Conflict: mapping collisions; fix with padding, alignment, or different traversal

Performance rule of thumb: maximize sequential, stride-1 access and keep the hot working set inside L1/L2. When you can’t, structure accesses to hit each cache line once per pass.
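
To see the rule in action, compare two ways of summing the same matrix: the row-major walk uses every byte of each fetched line, while the column-major walk with a wide row stride pays roughly one miss per element. The uint32_t element type is just for illustration.

#include <stddef.h>
#include <stdint.h>

// Stride-1: each 64-byte line serves 16 consecutive loads.
uint64_t sum_row_major(const uint32_t *m, size_t rows, size_t cols) {
  uint64_t s = 0;
  for (size_t r = 0; r < rows; ++r)
    for (size_t c = 0; c < cols; ++c)
      s += m[r * cols + c];
  return s;
}

// Stride of `cols` elements: for large cols, roughly one new line per load.
uint64_t sum_col_major(const uint32_t *m, size_t rows, size_t cols) {
  uint64_t s = 0;
  for (size_t c = 0; c < cols; ++c)
    for (size_t r = 0; r < rows; ++r)
      s += m[r * cols + c];
  return s;
}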

False sharing: when independent threads sabotage each other

False sharing happens when threads modify different variables that reside on the same cache line. Coherence traffic forces the line to bounce between cores, turning cheap per-core increments into a cache-churn festival.

sequenceDiagram
  participant C1 as Core 1
  participant CL as Cache Line (64 bytes)
  participant C2 as Core 2
  Note over C1, C2: False Sharing Problem
  Note over CL: [var_a][var_b] both in same line
  C1->>CL: Write var_a++
  Note over CL: Line owned by Core 1
  C2->>CL: Write var_b++
  Note over CL: Invalidate Core 1's copy
  CL-->>C1: Cache miss!
  CL->>C2: Transfer ownership
  C1->>CL: Write var_a++ again
  Note over CL: Invalidate Core 2's copy
  CL-->>C2: Cache miss!
  CL->>C1: Transfer ownership
  Note right of C2: Independent variables<br/>cause cache ping-pong<br/>Severe performance loss
  Note over C1, C2: Solution: Padding
  Note over CL: [var_a + padding][separate line][var_b + padding]
  C1->>CL: Write var_a++
  Note over CL: Core 1 owns line 1
  C2->>CL: Write var_b++
  Note over CL: Core 2 owns line 2
  Note right of C2: No conflicts!<br/>Each core has<br/>its own cache line

Minimal reproduction:

#include <stdint.h>
#include <pthread.h>
#include <stdio.h>
 
struct counters {
  uint64_t a; // thread 0 writes a
  uint64_t b; // thread 1 writes b
};
 
static struct counters C = {0, 0};
static const uint64_t N = 100000000ull;
 
static void *t0(void *_) { for (uint64_t i = 0; i < N; ++i) C.a++; return NULL; }
static void *t1(void *_) { for (uint64_t i = 0; i < N; ++i) C.b++; return NULL; }
 
int main(void) {
  pthread_t th0, th1;
  pthread_create(&th0, NULL, t0, NULL);
  pthread_create(&th1, NULL, t1, NULL);
  pthread_join(th0, NULL);
  pthread_join(th1, NULL);
  printf("%llu %llu\n", (unsigned long long)C.a, (unsigned long long)C.b);
}

On most machines, a and b land on the same 64-byte line. Each increment invalidates the other core’s copy, causing the line to ping-pong across cores. Throughput plummets even though there is no logical sharing.

Fix: separate hot fields onto different lines

Align and pad per-thread/per-core state so hot fields don’t share a line.

#include <stdalign.h>
#include <stdint.h>
 
struct padded_u64 { alignas(64) uint64_t v; }; // member alignment pads the struct to a full 64-byte line
 
struct counters2 {
  struct padded_u64 a;   // 64-byte aligned, isolated
  struct padded_u64 b;   // 64-byte aligned, isolated
};
 
static struct counters2 C2 = { {0}, {0} };

If your compiler lacks alignas, use extensions:

struct padded_u64 { __attribute__((aligned(64))) uint64_t v; };

When building arrays of hot counters, pad the stride to a line-sized multiple:

#define CL 64u
struct counter_array {
  alignas(CL) uint64_t value;          // the hot 8-byte counter, line-aligned
  uint8_t pad[CL - sizeof(uint64_t)];  // fill out the rest of the line
};
 
// use one counter per CPU/thread index; each entry owns a full cache line
static inline uint64_t *counter_ptr(struct counter_array *arr, unsigned idx) {
  return &arr[idx].value;
}

Caveats:

  • Don’t over-pad cold fields. Pad only frequently written data that different threads update independently.
  • Beware of structure packing/ABI: exported structs shouldn’t change layout across versions/protocols. Hide padding behind internal types (see the sketch after this list).
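
One way to honor the ABI caveat above (names here are illustrative): keep the exported struct stable and pad only an internal wrapper used on the hot path.

#include <stdalign.h>
#include <stdint.h>

// public_api.h: layout is part of the ABI; never pad here.
struct stats { uint64_t hits, misses; };

// internal: the hot, per-thread copy gets its own cache line.
// The member alignment rounds sizeof(struct stats_padded) up to 64.
struct stats_padded { alignas(64) struct stats s; };

// Copy the padded internal state into the stable public layout.
static inline struct stats stats_publish(const struct stats_padded *p) { return p->s; }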

Layout control: Array of Structs (AoS) vs Struct of Arrays (SoA)

Layout decides your stride and cache reuse. If you touch a few fields across many elements, SoA wins. If you touch many fields of one element at a time, AoS may be better.

graph TB subgraph "Array of Structs (AoS)" AoS1["Particle[0]<br/>{x, y, z, mass}"] AoS2["Particle[1]<br/>{x, y, z, mass}"] AoS3["Particle[2]<br/>{x, y, z, mass}"] AoS4["Particle[3]<br/>{x, y, z, mass}"] AoSMem["Memory Layout:<br/>[x0,y0,z0,mass0][x1,y1,z1,mass1][x2,y2,z2,mass2]..."] end subgraph "Struct of Arrays (SoA)" SoAX["X Array<br/>[x0, x1, x2, x3, ...]"] SoAY["Y Array<br/>[y0, y1, y2, y3, ...]"] SoAZ["Z Array<br/>[z0, z1, z2, z3, ...]"] SoAM["Mass Array<br/>[m0, m1, m2, m3, ...]"] end subgraph "Access Patterns" AP1["Update positions only:<br/>for(i) { x[i]+=dx; y[i]+=dy; z[i]+=dz; }"] AP2["Process one particle:<br/>distance = sqrt(p[i].x² + p[i].y² + p[i].z²)"] end subgraph "Cache Performance" Cache1["AoS: 16 bytes per element<br/>✓ Good for: single element<br/>✗ Bad for: field-only loops<br/>Wastes 4 bytes (mass) when updating position"] Cache2["SoA: 4 bytes per element per field<br/>✓ Good for: field-only loops<br/>✓ Perfect cache utilization<br/>✗ Bad for: mixed field access"] end AP1 --> Cache2 AP2 --> Cache1 style SoAX fill:#e8f5e8 style SoAY fill:#e8f5e8 style SoAZ fill:#e8f5e8 style Cache2 fill:#e8f5e8 style Cache1 fill:#fff3e0

AoS (convenient API, wider stride):

struct Particle { float x, y, z, mass; /* ... */ };
struct Particle *p; // array of N
 
// Update position (reads x,y,z only; mass is a passenger)
for (size_t i = 0; i < N; ++i) {
  p[i].x += 0.01f; p[i].y += 0.01f; p[i].z += 0.01f;
}

SoA (narrow stride when touching a subset):

struct Particles {
  float *x, *y, *z, *mass; // contiguous per field
};
 
struct Particles parts; // each field points to an N-element array of floats
 
for (size_t i = 0; i < N; ++i) {
  parts.x[i] += 0.01f; parts.y[i] += 0.01f; parts.z[i] += 0.01f;
}

SoA cuts the wasted bytes per fetched cache line when only a subset of fields is hot. It also helps vectorization by giving the compiler and the vector units contiguous lanes. We’ll build on this with blocking/tiling later.
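
For completeness, here’s one way to set up the SoA container from the snippet above; the 64-byte alignment is an extra assumption that also keeps vector loads happy.

#include <stdlib.h>

static int particles_init(struct Particles *ps, size_t n) {
  // C11 aligned_alloc wants the size to be a multiple of the alignment.
  size_t bytes = (n * sizeof(float) + 63) & ~(size_t)63;
  ps->x = aligned_alloc(64, bytes);
  ps->y = aligned_alloc(64, bytes);
  ps->z = aligned_alloc(64, bytes);
  ps->mass = aligned_alloc(64, bytes);
  if (ps->x && ps->y && ps->z && ps->mass) return 0;
  free(ps->x); free(ps->y); free(ps->z); free(ps->mass); // unwind on failure
  return -1;
}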

Prefetching: useful, but only with discipline

Hardware prefetchers love simple, predictable strides. Software prefetching helps when:

  • Your access pattern is predictable to you but opaque to the hardware (e.g., pointer chasing with known future steps)
  • You have enough independent work between hint and use to hide latency

Two common interfaces:

// GCC/Clang builtin: locality 0 (no reuse) to 3 (high reuse); rw=0 for read, 1 for write
__builtin_prefetch(addr, /*rw*/0, /*locality*/1);
 
// x86 intrinsics (needs immintrin.h)
_mm_prefetch((const char *)addr, _MM_HINT_T0);   // L1
_mm_prefetch((const char *)addr, _MM_HINT_T1);   // L2
_mm_prefetch((const char *)addr, _MM_HINT_T2);   // L3
_mm_prefetch((const char *)addr, _MM_HINT_NTA);  // streaming/no temporal locality

Pointer-chasing example with a prefetch distance:

struct Node { struct Node *next; int payload; };
 
int sum_list(struct Node *n) {
  int s = 0;
  while (n) {
    struct Node *next = n->next;
    if (next) __builtin_prefetch(next, 0, 1); // hint next node
    s += n->payload;
    n = next;
  }
  return s;
}

Rules of thumb:

  • Choose a prefetch distance that gives 200–400 cycles of independent work before use on big cores; measure.
  • Remove prefetch if it doesn’t help in your workload. Over-prefetching creates bandwidth pressure and cache pollution.
  • Prefer algorithmic locality (SoA, blocking) before prefetch. Hints are last-mile tweaks.

A tiny checklist you can apply immediately

  • Align and pad frequently written per-thread fields to 64 bytes (or your platform’s line size)
  • Audit tight loops for stride; convert AoS→SoA where you touch a subset across many elements
  • Measure before/after: look for reduced LLC misses and improved scaling across cores
  • Treat prefetching as experimental: add, measure, keep or remove

We’ll build on these foundations with blocking/tiling, streaming stores, and NUMA-aware placement next—keeping the hot path lean and the cores out of each other’s way.

Blocking and tiling: eat one cache at a time

When your working set doesn’t fit in L1/L2, you don’t have to give up locality. Process the data in tiles that do fit, maximizing reuse before moving on. Classic example: matrix multiply.

Naïve traversal touches full rows/columns repeatedly, blowing the cache. Block it so each submatrix stays hot while you compute its contribution.

#include <stddef.h>
 
#define B 64 // tune per machine: choose so 3*B*B*sizeof(double) < ~1/2 of L2
 
void dgemm_blocked(size_t n, const double *A, const double *Bmat, double *C) {
  for (size_t ii = 0; ii < n; ii += B)
  for (size_t jj = 0; jj < n; jj += B)
  for (size_t kk = 0; kk < n; kk += B) {
    size_t iimax = ii + (B < n-ii ? B : n-ii);
    size_t jjmax = jj + (B < n-jj ? B : n-jj);
    size_t kkmax = kk + (B < n-kk ? B : n-kk);
    for (size_t i = ii; i < iimax; ++i)
      for (size_t k = kk; k < kkmax; ++k) {
        double aik = A[i*n + k];
        // Prefetch a future C row chunk to smooth write-allocate
        __builtin_prefetch(&C[i*n + jj], 1, 1);
        for (size_t j = jj; j < jjmax; ++j) {
          C[i*n + j] += aik * Bmat[k*n + j];
        }
      }
  }
}

Picking B:

  • Rule of thumb for double: 3 * B^2 * 8 bytes < f * L2, with f ∈ [0.3, 0.6] to leave headroom (a small calculator follows this list)
  • Validate with perf: capacity misses should drop; IPC should rise
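
A small helper that turns the rule into a number; the f = 0.5 headroom factor is an assumption, and rounding down to a multiple of 8 just keeps vector tails tidy (link with -lm).

#include <math.h>
#include <stddef.h>

// 3 * B^2 * sizeof(double) <= f * L2  =>  B <= sqrt(f * L2 / 24)
static size_t pick_block_size(size_t l2_bytes) {
  const double f = 0.5; // headroom factor, assumed
  size_t b = (size_t)sqrt(f * (double)l2_bytes / (3.0 * sizeof(double)));
  return b & ~(size_t)7; // e.g., 1 MiB L2 -> ~147 -> 144
}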

Tiling generalizes:

  • Images: process tiles of WxH pixels that fit L1/L2
  • Stencils: keep radius r halos in the tile; slide tiles with overlap
  • Hash tables: batch probes by cache-index buckets to reduce thrash

Conflict misses and associativity traps

Even if data “fits,” it can still collide when many hot lines map to the same set. The symptom: performance craters only for certain sizes/strides (often powers of two).

Pathological stride example:

#include <stddef.h>
#include <stdint.h>
 
// Strided sum: certain strides make successive accesses alias the same cache sets
uint64_t sum_stride(const uint8_t *buf, size_t len, size_t stride) {
  uint64_t s = 0;
  for (size_t i = 0; i < len; i += stride) {
    s += buf[i];
  }
  return s;
}

If the stride is a large power of two (in the worst case a multiple of the cache size divided by its associativity), successive addresses map to the same set and keep evicting each other. Fixes:

  • Add a small prime/odd offset to break alignment: stride += 64 or pad rows by +1..+16 bytes
  • Align base pointers to 64 bytes to make padding predictable
  • Change traversal order to touch adjacent lines before wrapping around

Row padding pattern (avoid row-to-row set conflicts):

enum { WIDTH = 1920, PAD = 16 };           // PAD breaks power-of-two stride
uint8_t img[(WIDTH + PAD) * 1080];         // single plane, padded rows
 
static inline uint8_t *row(uint8_t *base, size_t pitch, size_t y) {
  return base + y * pitch;
}
 
// pitch = WIDTH + PAD; iterate x < WIDTH, but advance by pitch per row

Alignment and allocation that don’t fight the cache

Alignment helps both vector units and the cache hierarchy. Use 64-byte alignment for hot arrays and per-core state.

Portable allocation options:

#include <stdlib.h>
#include <stdint.h>
#include <errno.h>
 
// C11
void *p = aligned_alloc(64, size /* must be multiple of 64 */);
 
// POSIX
void *q = NULL; if (posix_memalign(&q, 64, size) != 0) { /* handle ENOMEM/EINVAL */ }
 
// Fallback (over-allocate + adjust):
void *raw = malloc(size + 63 + sizeof(void*));
uintptr_t base = (uintptr_t)raw + sizeof(void*);
void *aligned = (void *)((base + 63u) & ~((uintptr_t)63));
((void**)aligned)[-1] = raw; // store for free(): later call free(((void**)aligned)[-1])

Annotate objects you embed in structs:

struct ring_cursor { alignas(64) volatile unsigned head; volatile unsigned tail; };

Streaming stores (non-temporal writes)

When you write a large buffer once (e.g., image render, log encoder), filling cache with those lines hurts nearby hot data. Non-temporal stores bypass (or demote) caches and avoid read-for-ownership.

Two approaches:

#include <immintrin.h>
 
// Float example: write 16 floats per loop without polluting caches
void stream_store_float(float *dst, float value, size_t n) {
  size_t i = 0;
  __m128 v = _mm_set1_ps(value);
  for (; i + 16 <= n; i += 16) {
    _mm_stream_ps(dst + i +  0, v);
    _mm_stream_ps(dst + i +  4, v);
    _mm_stream_ps(dst + i +  8, v);
    _mm_stream_ps(dst + i + 12, v);
  }
  for (; i < n; ++i) dst[i] = value; // tail
  _mm_sfence(); // ensure visibility before consumers
}
 
// Clang builtin (GCC doesn’t provide it); works for integer types too
static inline void store_nt_u64(unsigned long long *p, unsigned long long v) {
  __builtin_nontemporal_store(v, p);
}

Guidelines:

  • Use only for write-once or write-rarely streams where re-reading is unlikely soon
  • Ensure destination is 16/32/64-byte aligned for vectorized streams
  • Finish with a fence (_mm_sfence) before making data visible to other threads/devices

Prefetch tuning beyond basics

Think of prefetch as a pipeline: issue hints d iterations ahead so lines arrive just-in-time.

// Lead/lag pipeline over a contiguous array
void saxpy_pf(size_t n, float a, const float *x, float *y) {
  const size_t D = 128; // elements ahead; tune (bytes = D*sizeof(float))
  for (size_t i = 0; i < n; ++i) {
    if (i + D < n) __builtin_prefetch(&x[i + D], 0, 3), __builtin_prefetch(&y[i + D], 1, 3);
    y[i] = a * x[i] + y[i];
  }
}

Advanced tips:

  • Use T2/T1 hints (_MM_HINT_T2/T1) for farther targets, T0 as you near use
  • For linked structures, prefetch the next node and its payload (next->payload) if known
  • Don’t stack multiple overlapping prefetches per iteration unless you measured a win

Measure what matters (and pin it)

Use a single core to reduce noise and read cache counters to prove improvements.

# Pin to CPU 2 and collect key cache stats
taskset -c 2 perf stat \
  -e cycles,instructions,cache-references,cache-misses \
  -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
  ./app
 
# Inspect where misses happen
taskset -c 2 perf record -e cache-misses,LLC-load-misses -g -- ./app
perf report --stdio | cat

What you want to see after fixes:

  • Lower LLC-load-misses and cache-misses ratio
  • Higher IPC (instructions per cycle)
  • Flatter tail latency in repeated runs

A minimal timing harness (steady-state):

#include <time.h>
#include <stdint.h>
 
static uint64_t now_ns(void) {
  struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec*1000000000ull + (uint64_t)ts.tv_nsec;
}
 
double bench(void (*fn)(void)) {
  const int warm = 5, reps = 50; double best = 1e99;
  for (int i = 0; i < warm + reps; ++i) {
    uint64_t t0 = now_ns(); fn(); uint64_t t1 = now_ns();
    if (i >= warm) { double ms = (t1 - t0) / 1e6; if (ms < best) best = ms; }
  }
  return best;
}
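
Usage sketch (kernel_under_test stands in for whatever you’re measuring):

#include <stdio.h>

static void kernel_under_test(void) { /* ... the code path to measure ... */ }

int main(void) {
  printf("best steady-state run: %.3f ms\n", bench(kernel_under_test));
}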

A practical checklist (round 2)

  • Block/tiling loops so working sets fit in L1/L2; validate with counters
  • Break power-of-two strides with padding; align hot arrays to 64 bytes
  • Use non-temporal stores for write-once streams; fence before publish
  • Treat prefetch as a pipeline; tune distance and hint level empirically
  • Pin and measure with perf; regressions without counter improvements usually aren’t wins

NUMA-aware placement: keep bytes near the core that uses them

On multi-socket systems, a remote-node memory access can cost 1.3–2.0× the latency of a local one, with correspondingly lower bandwidth. You don’t need a full NUMA-aware redesign to win—just a few rules:

  • First touch: memory is placed on the node of the thread that first writes it
  • Pin threads: keep workers on specific nodes to preserve locality
  • Shard state per node: aggregate periodically instead of sharing globally

Minimal libnuma usage (Linux):

#include <numa.h>      // link with -lnuma
#include <pthread.h>
#include <stdalign.h>  // alignas
#include <stdint.h>    // uintptr_t
#include <stdio.h>
 
struct shard { alignas(64) unsigned long counters[16]; };
 
struct shard *alloc_shard_on(int node) {
  if (numa_available() < 0) return NULL;
  struct shard *s = numa_alloc_onnode(sizeof *s, node);
  if (!s) return NULL;
  for (int i = 0; i < 16; ++i) s->counters[i] = 0; // first-touch on node
  return s;
}
 
static void *worker(void *arg) {
  int node = (int)(uintptr_t)arg;
  numa_run_on_node(node); // pin thread to node
  struct shard *local = alloc_shard_on(node);
  // ... do work using local->counters ...
  (void)local; return NULL;
}
 
int main(void) {
  pthread_t t0, t1;
  pthread_create(&t0, NULL, worker, (void*)(uintptr_t)0);
  pthread_create(&t1, NULL, worker, (void*)(uintptr_t)1);
  pthread_join(t0, NULL); pthread_join(t1, NULL);
}

Fallback if libnuma is unavailable: pin with sched_setaffinity and rely on first-touch by initializing buffers on the owning thread.
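
A minimal sketch of that fallback, assuming Linux (sched_setaffinity and the CPU_SET macros need _GNU_SOURCE): pin first, then allocate and zero the buffer so the first touch lands on the owning node.

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>

struct job { int cpu; size_t bytes; void *buf; };

static void *pinned_worker(void *arg) {
  struct job *j = arg;
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(j->cpu, &set);
  sched_setaffinity(0, sizeof set, &set);   // 0 = calling thread
  j->buf = malloc(j->bytes);
  if (j->buf) memset(j->buf, 0, j->bytes);  // first touch happens here, on this node
  /* ... do the work that uses j->buf ... */
  return NULL;
}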

Guidelines:

  • Allocate and initialize per-node structures inside the thread bound to that node
  • Avoid frequent cross-node writes; aggregate via one thread per node to a global snapshot
  • For shared read-mostly data, interleave or replicate per node (interleaving sketch below)
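
For the interleaving option in the last bullet, libnuma has a direct call; a sketch (link with -lnuma):

#include <numa.h>
#include <string.h>

// Spread pages of a large, read-mostly table across all nodes so no single
// memory controller becomes the hot spot.
void *alloc_interleaved_table(size_t bytes) {
  if (numa_available() < 0) return NULL;
  void *p = numa_alloc_interleaved(bytes);
  if (p) memset(p, 0, bytes); // touch to materialize pages
  return p;                   // release with numa_free(p, bytes)
}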

Per-core sharding: histograms, counters, and queues that scale

Instead of a single hot counter, keep one per core and sum occasionally. Pad to cache lines and index by CPU.

#define _GNU_SOURCE    // sched_getcpu (glibc)
#include <sched.h>
#include <stdalign.h>
#include <stdint.h>
 
#define CL 64u
struct counter { alignas(CL) unsigned long v; };
static struct counter per_core[256]; // large enough upper bound
 
static inline unsigned this_cpu(void) {
#if defined(__linux__)
  return (unsigned)sched_getcpu();
#else
  // Fallback: thread-local random shard; aggregate correctness still holds
  static _Thread_local unsigned idx;
  if (!idx) idx = (unsigned)(uintptr_t)&idx % 256;
  return idx;
#endif
}
 
void inc_counter(void) { per_core[this_cpu()].v++; } // plain increment: tolerate rare lost updates, or use atomics for exact counts
 
unsigned long snapshot(void) {
  unsigned long sum = 0; for (int i = 0; i < 256; ++i) sum += per_core[i].v; return sum;
}

Use the same pattern for per-core histograms or small accumulators; flush to a shared structure during quiescent periods.
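
A sketch of that pattern for a small histogram, reusing CL and this_cpu() from the counter example above; the bucket count is illustrative.

#define NBUCKETS 32
struct hist { alignas(CL) unsigned long b[NBUCKETS]; }; // one padded shard per core
static struct hist per_core_hist[256];

void hist_add(unsigned bucket) {
  if (bucket < NBUCKETS) per_core_hist[this_cpu()].b[bucket]++;
}

// Merge during a quiescent period (e.g., from a stats thread).
void hist_snapshot(unsigned long out[NBUCKETS]) {
  for (unsigned i = 0; i < NBUCKETS; ++i) out[i] = 0;
  for (int c = 0; c < 256; ++c)
    for (unsigned i = 0; i < NBUCKETS; ++i) out[i] += per_core_hist[c].b[i];
}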

Cache-friendly SPSC ring buffer (layout matters)

Separate producer and consumer indices so they don’t live on the same line. Keep the data array independent.

#include <stdatomic.h>
#include <stdalign.h>
#include <stddef.h>
#define CAP 1024
 
struct index64 { alignas(64) atomic_size_t v; char pad[64 - sizeof(atomic_size_t)]; };
static int buf[CAP];
static struct index64 head = { ATOMIC_VAR_INIT(0) }, tail = { ATOMIC_VAR_INIT(0) };
 
int enqueue(int x) {
  size_t t = atomic_load_explicit(&tail.v, memory_order_relaxed);
  size_t h = atomic_load_explicit(&head.v, memory_order_acquire);
  if (((t + 1) % CAP) == h) return 0; // full
  buf[t] = x; // write payload first
  atomic_store_explicit(&tail.v, (t + 1) % CAP, memory_order_release);
  return 1;
}
 
int dequeue(int *out) {
  size_t h = atomic_load_explicit(&head.v, memory_order_relaxed);
  size_t t = atomic_load_explicit(&tail.v, memory_order_acquire);
  if (h == t) return 0; // empty
  *out = buf[h];
  atomic_store_explicit(&head.v, (h + 1) % CAP, memory_order_release);
  return 1;
}

This avoids false sharing on the indices and gives each side its own line. For multi-producer/consumer, prefer queues designed for that pattern (and still pad hot fields).

Hash tables that don’t thrash

Open addressing with linear/robin-hood probing keeps probes on adjacent lines. Prefetch the next probe early.

#include <stdint.h>
#include <stddef.h>
 
struct entry { uint64_t key; uint64_t val; };
struct table { struct entry *e; size_t cap; };
 
static inline size_t h(uint64_t k, size_t cap) { return (k * 11400714819323198485ull) & (cap - 1); }
 
int lookup(const struct table *T, uint64_t key, uint64_t *out) {
  size_t i = h(key, T->cap);
  for (size_t step = 0; step < T->cap; ++step, i = (i + 1) & (T->cap - 1)) {
    size_t next = (i + 1) & (T->cap - 1);
    __builtin_prefetch(&T->e[next], 0, 1); // anticipate next bucket
    uint64_t k = T->e[i].key;
    if (k == 0) return 0;          // empty slot
    if (k == key) { *out = T->e[i].val; return 1; }
  }
  return 0;
}

Tips:

  • Power-of-two capacity with good multiplicative hashing makes modulo cheap and preserves locality
  • Group small buckets per cache line (e.g., 4 entries) to amortize tag checks per line (sketch after this list)
  • Pad/align the table base to 64 bytes
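
A sketch of the line-sized bucket idea from the list above: four 16-byte entries fill one 64-byte line, so a probe scans a whole group per line fetched (key 0 marks an empty slot, as in the lookup above).

#include <stdalign.h>
#include <stdint.h>

struct bucket_group {
  alignas(64) uint64_t keys[4]; // bytes 0-31 of the line
  uint64_t vals[4];             // bytes 32-63 of the line
};

// Returns 1 on hit, 0 on definite miss, -1 to continue probing the next group.
static int group_find(const struct bucket_group *g, uint64_t key, uint64_t *out) {
  for (int i = 0; i < 4; ++i) {
    if (g->keys[i] == key) { *out = g->vals[i]; return 1; }
    if (g->keys[i] == 0) return 0;
  }
  return -1;
}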

Detecting and proving false sharing

You can see line ping-pong in hardware counters.

# System-wide cache-to-cache analysis (root often required)
sudo perf c2c record -a -- ./app
sudo perf c2c report | cat
 
# Per-process coherence stats
# Per-process coherence stats (the last two event names are Intel-specific;
# not all PMU names are portable, check `perf list` on your machine)
perf stat -e cycles,instructions,LLC-loads,LLC-load-misses \
  -e mem_load_retired.l3_miss,offcore_response.demand_data_rd.llc_miss.local_dram \
  ./app | cat

Signals of false sharing:

  • A high number of HITM events (snoop hits to a line Modified in another core’s cache) or cache-to-cache transfers on addresses near each other
  • Throughput improves dramatically after padding hot fields

Layout pitfalls: bitfields, packed structs, and friendly sizes

Bitfields look convenient but often generate read-modify-write cycles that stomp on neighboring fields sharing the same word/line.

Preferred alternatives:

// Instead of bitfields across hot flags, use explicit masks on an aligned word
struct flags64 { alignas(64) unsigned long w; }; // member alignment pads the struct to a full line
 
enum { F_A = 1u<<0, F_B = 1u<<1 };
 
static inline void set_flag(struct flags64 *f, unsigned m) { f->w |= m; }
static inline int  has_flag(const struct flags64 *f, unsigned m) { return (f->w & m) != 0; }

Packed structs (__attribute__((packed))) save space but can force unaligned accesses; in hot paths, prefer natural alignment and reorder fields to minimize padding. Lock in expectations with compile-time checks where available.

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
_Static_assert(sizeof(struct flags64) == 64, "flags64 not cache-line sized");
#endif

A practical checklist (round 3)

  • Pin threads and use first-touch to keep memory near its worker; shard per NUMA node
  • Shard per core for hot counters/histograms and aggregate periodically
  • Separate hot indices/control words to their own cache lines in queues/rings
  • Use open addressing with contiguous probes; prefetch the next bucket
  • Replace bitfields in hot paths with masked word updates; avoid packed in hot structs

End-to-end case study: a cache-friendly 2D filter

Let’s turn knobs in one realistic pipeline: a 2D 3×3 convolution over an 8-bit grayscale image, producing a new image. We’ll start naïve, then apply layout, tiling, prefetch, and streaming stores.

Baseline (naïve AoS-like flat buffer, no tiling):

#include <stddef.h>
#include <stdint.h>
 
void conv3x3_naive(const uint8_t *in, uint8_t *out, size_t w, size_t h, ptrdiff_t stride) {
  // stride == w for tightly packed rows; borders skipped for brevity
  for (size_t y = 1; y + 1 < h; ++y) {
    for (size_t x = 1; x + 1 < w; ++x) {
      int sum = 0;
      sum += 1*in[(y-1)*stride + (x-1)] + 2*in[(y-1)*stride + x] + 1*in[(y-1)*stride + (x+1)];
      sum += 2*in[y*stride + (x-1)]     + 4*in[y*stride + x]     + 2*in[y*stride + (x+1)];
      sum += 1*in[(y+1)*stride + (x-1)] + 2*in[(y+1)*stride + x] + 1*in[(y+1)*stride + (x+1)];
      out[y*stride + x] = (uint8_t)(sum >> 4);
    }
  }
}

Issues:

  • Poor reuse across rows/columns; each pixel load is used once
  • Writes cause write-allocate; hot lines in other code get evicted

Blocked version (tile WxH; prefetch next row; avoid thrash with padded stride):

#include <immintrin.h>
 
void conv3x3_tiled(const uint8_t *in, uint8_t *out, size_t w, size_t h, ptrdiff_t pitch) {
  const size_t TW = 128, TH = 64; // tune: make 3 rows × TW fit L1 with headroom
  for (size_t ty = 1; ty + 1 < h; ty += TH) {
    size_t ymax = (ty + TH + 1 < h) ? ty + TH : h - 1;
    for (size_t tx = 1; tx + 1 < w; tx += TW) {
      size_t xmax = (tx + TW + 1 < w) ? tx + TW : w - 1;
      for (size_t y = ty; y < ymax; ++y) {
        // prefetch input rows for the next iteration to hide row jumps
        if (y + 2 < h) __builtin_prefetch(&in[(y+2)*pitch + tx], 0, 3);
        for (size_t x = tx; x < xmax; ++x) {
          int sum = 0;
          const uint8_t *r0 = &in[(y-1)*pitch + x];
          const uint8_t *r1 = &in[y*pitch + x];
          const uint8_t *r2 = &in[(y+1)*pitch + x];
          sum += 1*r0[-1] + 2*r0[0] + 1*r0[+1];
          sum += 2*r1[-1] + 4*r1[0] + 2*r1[+1];
          sum += 1*r2[-1] + 2*r2[0] + 1*r2[+1];
          out[y*pitch + x] = (uint8_t)(sum >> 4);
        }
      }
    }
  }
}

Further wins:

  • Vectorize across x with 16-byte/32-byte loads; keep three sliding row vectors
  • Use non-temporal stores for large output images when the next stage won’t read soon
  • Pad pitch by +16 to break set conflicts for wide powers-of-two widths

Non-temporal write variant for the output row tail:

// dst must be 16-byte aligned for _mm_stream_si128
static inline void store_row_nt(uint8_t *dst, const uint8_t *src, size_t n) {
  size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
    _mm_stream_si128((__m128i *)(dst + i), v);
  }
  for (; i < n; ++i) dst[i] = src[i];
  _mm_sfence();
}

Measure deltas with cache counters; expect lower LLC misses and improved IPC.

TLBs, page size, and why huge pages sometimes help

TLB misses stall even when caches are warm. Sequential scans over very large arrays can saturate the TLB.

Guidance:

  • Keep working sets small via tiling first; TLB problems often disappear
  • Consider using huge pages (Linux madvise(MADV_HUGEPAGE) or explicit hugetlbfs) for large, linear buffers
  • Beware of increased internal fragmentation; test both ways—huge pages can hurt random access patterns

Linux sketches:

#include <sys/mman.h>
// Request THP for a region (transparent, best-effort)
void prefer_thp(void *p, size_t len) { madvise(p, len, MADV_HUGEPAGE); }
 
// Explicit huge page mapping (requires sysctl/config and privileges)
// mmap(..., MAP_HUGETLB | (2MB/1GB flags))
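
// Sketch, not a complete recipe: an explicit huge-page mapping needs reserved
// huge pages (e.g., vm.nr_hugepages), a length that is a multiple of the huge
// page size, and MAP_HUGETLB from <sys/mman.h>; add a MAP_HUGE_2MB/MAP_HUGE_1GB
// flag where your headers provide one to select the page size explicitly.
void *map_huge(size_t len) {
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  return p == MAP_FAILED ? NULL : p; // NULL if no huge pages are available
}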

Store buffers, fences, and publication

Stores retire into store buffers before they reach L1. When publishing data to other threads, use proper memory order so readers see complete writes in the intended order.

Pattern refresher (data, then flag with release; reader checks with acquire):

#include <stdatomic.h>
 
struct payload { int a, b, c; } P;
atomic_int ready = ATOMIC_VAR_INIT(0);
 
void publish(int x, int y, int z) {
  P.a = x; P.b = y; P.c = z;
  atomic_store_explicit(&ready, 1, memory_order_release);
}
 
int consume(struct payload *out) {
  if (atomic_load_explicit(&ready, memory_order_acquire)) { *out = P; return 1; }
  return 0;
}

This ensures cache-visible ordering across cores. Combine with cache-friendly layout (pad hot fields, align) for fewer coherence stalls.

Tooling beyond perf: where to look next

  • Intel VTune, AMD uProf: deeper breakdowns (bandwidth roofline, bad speculation)
  • Linux perf mem: sampling memory accesses with latency attribution
  • Flamegraphs + perf script + stackcollapse-perf.pl: visualize hot call stacks before/after layout changes
  • macOS: Instruments (Time Profiler, Counters); DTrace where available

Common anti-patterns and safer alternatives

  • Writing interleaved per-thread logs into a single array: shard per thread and concatenate; or use line-sized chunks per thread
  • Frequent atomics on a shared counter in the hot path: per-core counters + periodic aggregation
  • Bitfields/packed types in hot structs: align and use masks; keep hot fields in their own line
  • Random access without structure: bucketize by locality; reorder work units to improve spatial/temporal reuse

Production checklist (final)

  • Pick layouts that match access patterns (AoS vs SoA); tile to fit L1/L2
  • Eliminate false sharing with line-sized padding and per-core sharding
  • Break stride pathologies with small paddings and 64-byte alignment
  • Use prefetch as a measured pipeline; remove if it doesn’t help
  • Apply non-temporal stores for write-once streams; fence before publish
  • On multi-socket boxes, pin and first-touch; replicate read-mostly state per node
  • Measure with counters, not vibes; keep a small harness and baselines checked in

Closing thoughts

Cache-aware C isn’t a bag of tricks—it’s a way of shaping data and work so the hardware’s fast paths are the default path. Choose layouts that minimize bytes touched per result, keep hot state off your neighbors’ cache lines, process data in tiles that fit a cache, and prove improvements with counters. Do that, and your programs stop stuttering and start sprinting.