You can’t optimize what you can’t see. In C—where a single branch misprediction can be the difference between “fast” and “oops”—good performance work starts with measurement. Not vibes, not guesses, and definitely not cargo-cult flags. Measurement. The tools are excellent: perf for low-overhead sampling and counters, eBPF for surgical tracing, and Flamegraphs for turning a million tiny samples into an “aha.”
This post is a practical path to actionable visibility:
- Understand sampling vs tracing and when to use each
- Get reliable call stacks (frame pointers vs DWARF unwind)
- Read the right counters (cycles, cache misses, branch misses, context switches)
- Generate Flamegraphs that point at the real hot paths
- Use eBPF to answer “why” without rewriting your app
No silver bullets—just repeatable workflows that keep you honest and ship wins.
What you should measure (and why)
Performance is a budget, not a vibe. Pick metrics that map to user experience and capacity planning:
- Throughput: requests/second, bytes/second, jobs/second
- Latency: P50/P95/P99 end-to-end, plus key internal steps
- CPU: cycles, IPC (instructions per cycle), stalled cycles
- Memory: LLC misses, DTLB/ITLB misses, cache hit ratios
- Syscalls and context switches: kernel overhead and scheduler churn
- I/O: read/write syscalls, bytes moved, short/partial ops, EAGAIN rates
When a number regresses, you want to answer two questions fast:
1) Where is the time going? 2) Why is the time going there?
Sampling Flamegraphs answer (1). Targeted tracing (eBPF) answers (2).
Sampling vs tracing (pick the right lens)
- Sampling (e.g., perf record at 99–999 Hz):
  - Low overhead; minimal code changes
  - Great for big-picture hotspots and steady-state CPU use
  - Produces call stacks you can aggregate into Flamegraphs
- Tracing (e.g., eBPF uprobes/kprobes):
  - Event-driven; you choose what to record
  - Higher fidelity for rare/short events, syscalls, or specific functions
  - Perfect for attribution (“which call sites trigger this path?”) and argument/result capture
Use sampling to find the mountain. Use tracing to map the trail.
Make stacks trustworthy (symbols, frame pointers, unwinding)
Flamegraphs are only as good as their stacks. Three rules make life easy:
- Build with symbols. Add -g to your CFLAGS for debug symbols (no need to drop -O3). Strip them in release artifacts if needed, but keep symbolized builds for profiling.
- Keep frame pointers on hot targets. On x86_64, compile with -fno-omit-frame-pointer so perf can unwind cheaply and reliably (--call-graph fp).
- If you can’t use frame pointers, use DWARF unwinding: perf record --call-graph dwarf. It’s more expensive but works without frame pointers.
Minimal build line you can adapt:
cc -O3 -g -fno-omit-frame-pointer -march=native -pipe -pthread app.c -o app
Quickstart: perf in five minutes
Warm up with three commands that cover 80% of cases.
- High-level CPU and memory story:
perf stat -d --repeat 3 ./app --args
Key fields to watch:
- cycles / instructions → IPC (higher is generally better; persistently low IPC means stalls)
- branches / branch-misses → misprediction tax
- cache-misses / LLC-load-misses → working set and locality pain
- context-switches / cpu-migrations → scheduler churn
- Record samples with call stacks:
perf record -F 199 -g --call-graph fp -- ./app --args
# If you lack frame pointers, prefer DWARF unwinding:
perf record -F 199 --call-graph dwarf -- ./app --args
- Triage in TUI, then produce a Flamegraph:
perf report
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Notes:
- Sampling frequency: 99–199 Hz for low overhead; 999 Hz when chasing tight inner loops (still usually fine). Always measure overhead.
- If your workload is server-like, profile under representative concurrency and input sizes; otherwise your “hot path” is a different program.
- For short-lived CLIs, wrap execution in a loop to collect enough samples:
perf record -F 199 -g -- sh -c 'for i in $(seq 1 50); do ./app --args; done'
Reading the tea leaves: what common counters mean
Perf counters look cryptic, but they are interpretable:
- Low IPC with high LLC misses: memory-bound; focus on layout, cache locality, and data movement
- High branch-misses: unpredictable branches; consider branchless transforms, better data ordering, or hints
- Many context switches: blocking or oversubscription; revisit threading and I/O models
- Syscalls per request very high: chatty I/O; batch with readv/writev, buffer, or use zero-copy paths
Tie observations to hypotheses you can test in minutes, not weeks.
A tiny C kernel to have something to look at
If you need a toy target, here’s a brutally simple hotspot to validate your setup:
#include <stddef.h>
#include <stdint.h>

static uint64_t sum_u32(const uint32_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

int main(void) {
    enum { N = 1 << 26 }; // ~64M elements, 256 MB of uint32_t
    static uint32_t buf[N];
    for (size_t i = 0; i < N; ++i) buf[i] = (uint32_t)i;
    return (int)(sum_u32(buf, N) & 0xFF);
}
Compile with symbols and frame pointers, then profile. You should see the loop dominate, with LLC misses if your memory can’t keep up.
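If you want to see the memory-bound case more starkly, here is a small variant (a sketch; the stride value is illustrative) that walks the same buf in a cache-hostile order so the hardware prefetcher can’t hide latency. Compare perf stat -d for both versions.
// Sketch: visit every element, but with a large stride between accesses.
// Expect lower IPC and more LLC-load-misses than the sequential loop above.
static uint64_t sum_u32_strided(const uint32_t *a, size_t n, size_t stride) {
    uint64_t s = 0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}
// e.g., call sum_u32_strided(buf, N, 4096) from main() instead of sum_u32(buf, N)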
Coming up next in this journey: turning samples into Flamegraphs you can read at a glance—and then using eBPF to answer “why this path, from which callers, with what arguments?”
From raw samples to Flamegraphs (end-to-end)
You’ve got perf.data. Now turn it into a picture your brain can parse in seconds.
Install FlameGraph tools (once)
git clone https://github.com/brendangregg/FlameGraph
export PATH="$PWD/FlameGraph:$PATH"
Capture, collapse, draw
# 1) Record with stacks
perf record -F 199 -g --call-graph fp -- ./app --args
# 2) Convert samples to folded stacks
perf script | stackcollapse-perf.pl > out.folded
# 3) Render the SVG
flamegraph.pl --title "app on-CPU" --color hot --width 1400 out.folded > flame.svg
Tips that save hours:
- If you see many [unknown] frames, ensure -g -fno-omit-frame-pointer or use --call-graph dwarf.
- If kernel frames dominate and are unmapped, run with privileges and ensure kernel.kptr_restrict=0 (or install debug symbols). Alternatively, restrict recording to user space (e.g., perf record -e cycles:u).
- Stacks from multiple processes/threads can be merged across runs; just append folded files before rendering.
Symbol resolution you’ll actually use
- Userspace: install -dbg packages or keep a symbolized build. perf buildid-cache -a can prime the cache; PERF_BUILDID_DIR controls where lookups happen. Modern distros also support debuginfod (set DEBUGINFOD_URLS) to auto-fetch symbols.
- Kernel: install kernel-debuginfo (or the distro equivalent) and mount debugfs (/sys/kernel/debug).
How to read Flamegraphs (fast and correctly)
- Width equals total time on-CPU across all samples for that stack. Fix wide regions first.
- Self time lives at the leaf. A wide parent but narrow leaf suggests time is spread across children—drill down before changing code.
- Kernel frames (e.g., __sched_text_start, tcp_sendmsg) mean you’re in the kernel; decide if that’s expected (I/O-heavy) or a sign of inefficiency (chatty syscalls, small writes, locks).
- Many tiny adjacent leaves along the same parent often indicate cold instruction cache or fragmented call sites; consider simplifying the hot path.
- Deep stacks with slim leaves are often structural (framework overhead). Attack with batching, fewer layers, or a fast path.
Common shapes and what they imply
- "Comb" under
malloc
/free
: allocator churn. Pool, reuse, or switch allocators. Considerposix_memalign
for alignment-sensitive code. - Wide
memcpy
/memmove
: memory-bound. Improve locality, reduce copies, usereadv
/writev
, or redesign data layout. futex
/__lll_lock_wait
: lock contention. Reduce critical sections; switch to per-core sharding; use lock-free where justified.- Branch-heavy leaf with high
branch-misses
: unpredictable conditionals. Reorder data, precompute hints, use branchless arithmetic.
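A minimal branchless sketch (function names are made up; confirm with perf stat that branch-misses actually drop and that runtime improves, since branchless code can lose when the branch is predictable):
#include <stddef.h>
#include <stdint.h>

// Branchy: with random input, the condition is unpredictable and pays the
// branch-miss tax on roughly half of the iterations.
uint64_t sum_if_branchy(const uint32_t *a, size_t n, uint32_t limit) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] < limit) s += a[i];
    return s;
}

// Branchless: turn the condition into a 0/1 multiplier and always do the add.
uint64_t sum_if_branchless(const uint32_t *a, size_t n, uint32_t limit) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * (uint64_t)(a[i] < limit);
    return s;
}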
Validate with code and counters
Pair the picture with numbers:
- Use perf stat -d -d on the narrowed workload to see IPC and cache behavior.
- Use line-level annotation to confirm the exact instructions doing work:
perf annotate --stdio --symbol your_hot_function | sed -n '1,120p'
If you built with -fno-omit-frame-pointer and -g, annotate will align source lines to hot instructions. Look for expensive loads, stores, and branches.
Differential Flamegraphs (prove the win)
Before shipping an optimization, compare before/after. In the diff graph, red marks stacks that grew (regressions) and blue marks stacks that shrank (improvements); you want red to be scarce and blue to be obvious.
# Baseline
perf record -F 199 -g -- ./app --args && perf script | stackcollapse-perf.pl > base.folded
# After change
perf record -F 199 -g -- ./app --args && perf script | stackcollapse-perf.pl > new.folded
# Diff and render
difffolded.pl base.folded new.folded > diff.folded
flamegraph.pl --title "diff" --color=hot diff.folded > diff.svg
Rules of thumb:
- Keep workload identical (data, concurrency, warmup). If it changes, the comparison lies.
- Prefer wall-clock–normalized runs (same duration or same work count) to avoid sampling bias.
- Use both the diff graph and perf stat deltas; a 10% IPC gain with flat runtime usually means you moved the bottleneck.
Pitfalls that make graphs lie
- Stripped or LTO’ed binaries without symbols: keep a symbolized build for profiling, or use debuginfod.
- Omitted frame pointers with fragile DWARF: prefer frame pointers on hot binaries.
- ASLR and containers: make sure perf can read /proc in the namespace; grant CAP_SYS_ADMIN or run with --privileged only in trusted environments.
- Short-lived processes: wrap in loops to collect enough samples.
- Mixed binaries (plugins): ensure all shared objects are built with matching unwind settings.
eBPF: answering “why” with surgical tracing
Sampling shows you where CPU burns time. eBPF shows you why: which syscalls, which arguments, which call sites, and how long they take—without recompiling your app or rebooting.
What eBPF can hook (and when to choose it)
- kprobes/kretprobes: dynamic hooks on kernel functions
- tracepoints: stable kernel ABI events (preferred where available)
- uprobes/uretprobes: dynamic hooks on userspace functions in any ELF (your service, libc, libssl, …)
- USDT: static user probes compiled into apps/libraries (stable names + arguments)
Pick tracepoints first (stable), then uprobes for user code you own, then kprobes when nothing else exposes what you need.
Safety, overhead, and permissions
- eBPF programs are verified and JITed; typical overhead for light event streams is a few percent.
- Always filter early (by PID/TGID, comm, cgroup, port) and keep maps bounded to prevent high-cardinality explosions.
- You’ll need privileges (root or CAP_BPF/CAP_SYS_ADMIN) and a new-ish kernel with BPF and BTF enabled.
Quick wins with bpftrace (zero build system changes)
Count syscalls by process name:
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
Per-process write(2) size histogram (the exit tracepoint carries the byte count returned; replace PID with your target process id):
sudo bpftrace -e '
tracepoint:syscalls:sys_exit_write /pid == PID/ { @bytes[comm] = hist(args->ret); }
'
Per-function latency histogram in your app via uprobes (entry/return timing):
# Replace ./app with full path; symbol must be visible (no inlining or strip)
sudo bpftrace -e '
  uprobe:./app:hot_func { @t[tid] = nsecs; }
  uretprobe:./app:hot_func /@t[tid]/ {
    @lat = hist((nsecs - @t[tid]) / 1000);
    delete(@t[tid]);
  }
'
Stack traces on expensive kernel events (e.g., unusually large read(2) returns):
sudo bpftrace -e '
  tracepoint:syscalls:sys_exit_read /args->ret > 1*1024*1024/ {
    printf("big read by %s pid=%d\n", comm, pid);
    printf("%s\n", kstack);
    printf("%s\n", ustack);
  }
'
Notes:
- Use absolute paths for uprobes; add -fno-omit-frame-pointer to your binary for better ustack quality.
- hist() produces log2 histograms; great for spotting bimodal size or latency distributions.
- Attach filters (/pid == XYZ/, /comm == "app"/) to avoid drowning the system.
USDT probes when you have them
Some runtimes/libraries expose USDT (DTrace-style) probes. They’re stable and carry typed arguments.
List and attach:
sudo bpftrace -l 'usdt:./app:*'
sudo bpftrace -e 'usdt:./app:request__start { @[probe] = count(); }'
If you own the C code, adding USDT probes via systemtap’s sys/sdt.h macros (or libstapsdt for probes generated at runtime) yields durable hooks that survive inlining and LTO; a minimal sketch follows.
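A sketch of hand-placed USDT probes with the sys/sdt.h macros (the provider name, probe names, and function are made up; the header usually comes from a systemtap-sdt development package):
#include <sys/sdt.h>  // DTRACE_PROBEn macros; compile to a single nop until a tracer attaches

// Hypothetical request handler instrumented with start/done probes.
void handle_request(long req_id, int route) {
    DTRACE_PROBE2(app, request__start, req_id, route);
    /* ... real work ... */
    DTRACE_PROBE2(app, request__done, req_id, route);
}
Once built, the probes should show up when you list usdt:./app:* as in the commands above.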
Turning questions into one-liners
- “Which call sites cause small writes?” Attach a uprobe/uretprobe to write in libc and print ustack() when ret < 512.
- “Why are we context switching so much?” Trace sched:sched_switch and aggregate by prev_comm/next_comm pairs.
- “What’s the tail of TLS handshake time?” Uprobes on SSL_do_handshake entry/return with a latency histogram.
A tiny bpftrace script: per-process write sizes (top-N)
#!/usr/bin/env bpftrace
BEGIN { printf("tracking write(2) sizes by PID...\n"); }
tracepoint:syscalls:sys_exit_write /args->ret > 0/ {
  @bytes[pid, args->ret] = count();
}
interval:s:5 {
  printf("\nTOP write sizes (pid,size -> count)\n");
  print(@bytes, 10);
  clear(@bytes);
}
Run for a short window in staging/production during a controlled test. Prefer short intervals and capped prints to keep overhead and log volume low.
libbpf/CO-RE when you need permanence
For sustained use in CI/CD or production, prefer libbpf CO-RE (Compile Once – Run Everywhere):
- Stable to kernel changes thanks to BTF type info
- Explicit maps, ring-buffers, and perf-buffers for controlled data flow
- Lower overhead than generic frontends; easier to version and test
Minimal outline (userland) you’ll see in most CO-RE samples:
// Sketch: open the compiled BPF object, load it, attach, then poll a ring buffer.
// Program/map/callback names ("handle_event", "events", on_event) come from your own prog.bpf.c.
struct bpf_object *obj = bpf_object__open_file("prog.bpf.o", NULL);
if (!obj || bpf_object__load(obj)) { /* handle error */ }
struct bpf_link *link = bpf_program__attach(bpf_object__find_program_by_name(obj, "handle_event"));
struct ring_buffer *rb = ring_buffer__new(bpf_map__fd(bpf_object__find_map_by_name(obj, "events")),
                                          on_event, NULL, NULL);
while (ring_buffer__poll(rb, 100 /* ms */) >= 0) { /* on_event aggregates and prints */ }
Use the official libbpf-bootstrap templates to scaffold projects quickly.
Production etiquette (so tracing doesn’t bite back)
- Time-box sessions and cap event rates. If counts soar, stop or add filters.
- Bound map keys (e.g., top-N with LRU) to avoid memory blow-ups.
- Prefer histograms and aggregated counters over per-event logs; export deltas periodically.
- Roll out tracing behind feature flags; collect build-id and version metadata for reproducibility.
- Document probe points and arguments in the repo so future you knows what “lat_us” meant.
Turn findings into wins (a practical optimization workflow)
You have a hot path and a hypothesis. Move deliberately:
- Form a specific hypothesis. “Batch small writes into writev to cut syscalls by 10x.”
- Create a minimal, reversible change. Feature flag it.
- Measure locally under a representative load:
  - perf stat -d --repeat 5 for counters
  - perf record -F 199 -g + Flamegraph for hotspot shifts
  - Targeted eBPF to confirm the mechanism (e.g., histogram of write sizes)
- Decide with data. If the intended metric doesn’t improve (or another regresses), revert or adjust.
- Land behind a guard, then validate in staging/canary with the exact same measurements.
Reproducible benchmarking (environment matters)
Noise can dwarf a 5–10% win. Control what you can:
- Pin CPUs and memory: taskset -c 2-3 or numactl --cpunodebind=0 --membind=0
- Keep frequency steady: use the performance governor during tests
sudo cpupower frequency-set -g performance
- Eliminate cross-talk: dedicate cores or a host; avoid background builds/scans
- Warm caches: run once to warm, then measure
- Fix data shape and size: performance is input-dependent
- Record the environment: kernel version, microcode, CPU model, governor, NUMA topology
For pinned measurements:
perf stat -C 2 -d -- ./app --args
Case studies (patterns that actually pay)
1) Syscall chatter → writev batching
Findings: Flamegraph shows wide write leaves; eBPF histograms show many sub-512B writes; perf stat shows high syscall counts and context switches.
Change: Aggregate adjacent buffers; flush with writev up to a byte budget (sketch below).
Result: Syscalls/request drop by 8–20x; perf stat shows fewer context switches; Flamegraph shifts heat from kernel code to your userland batching.
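A sketch of the batching side (struct, limits, and names are illustrative; a production version also has to handle short writes from writev and EAGAIN on non-blocking sockets):
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BATCH_MAX   64          // illustrative limits
#define BATCH_BYTES (64 * 1024)

struct batch {
    struct iovec iov[BATCH_MAX];
    int          cnt;
    size_t       bytes;
};

// One writev() for the whole queue instead of one write() per buffer.
static ssize_t batch_flush(int fd, struct batch *b) {
    ssize_t n = b->cnt ? writev(fd, b->iov, b->cnt) : 0;
    b->cnt = 0;
    b->bytes = 0;
    return n;
}

// Caller keeps buf alive until the next flush.
static ssize_t batch_add(int fd, struct batch *b, void *buf, size_t len) {
    b->iov[b->cnt].iov_base = buf;
    b->iov[b->cnt].iov_len  = len;
    b->cnt++;
    b->bytes += len;
    return (b->cnt == BATCH_MAX || b->bytes >= BATCH_BYTES) ? batch_flush(fd, b) : 0;
}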
2) Allocator churn → pooling
Findings: Wide malloc/free bands; TCMalloc/glibc allocator frames dominate.
Change: Introduce per-thread pools/arenas; reuse buffers; cap peak memory (sketch below).
Result: IPC rises, cycles fall; tail latency tightens as GC-like pauses vanish; eBPF shows sys_enter_brk/mmap events nearly disappear.
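A sketch of the pooling idea, assuming fixed-size buffers (a bounded per-thread free list; size and depth are illustrative):
#include <stdlib.h>

enum { BUF_SZ = 4096, POOL_DEPTH = 256 };   // illustrative

static _Thread_local void *pool[POOL_DEPTH];
static _Thread_local int   pool_top;

static void *buf_get(void) {
    if (pool_top > 0)
        return pool[--pool_top];   // hot path: reuse, no allocator call
    return malloc(BUF_SZ);         // cold path only
}

static void buf_put(void *p) {
    if (pool_top < POOL_DEPTH)
        pool[pool_top++] = p;      // keep for reuse, bounded
    else
        free(p);                   // cap peak memory
}
With the free list in place, the malloc/free comb should collapse in the Flamegraph, and the counters should confirm the allocator stops going to the kernel.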
3) Cache misses → layout and traversal
Findings: Low IPC with high LLC-load-misses; Flamegraph dominated by copy/parse loops.
Change: Switch AoS→SoA for hot fields (sketch below); precompute indexes; reduce pointer chasing; prefetch cautiously and measure.
Result: LLC misses drop; cycles per element fall; Flamegraph leaf narrows around the compute kernel instead of memory stalls.
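For the layout change, a sketch of the AoS→SoA transform on a hypothetical record type; the hot loop then touches only dense arrays, so each cache line carries useful data:
#include <stddef.h>
#include <stdint.h>

// AoS: summing price*qty drags the cold 48-byte tag through the cache too.
struct order_aos { uint64_t id; uint32_t price; uint32_t qty; char tag[48]; };

static uint64_t total_aos(const struct order_aos *o, size_t n) {
    uint64_t t = 0;
    for (size_t i = 0; i < n; ++i) t += (uint64_t)o[i].price * o[i].qty;
    return t;
}

// SoA: hot fields live in their own dense arrays; cold fields stay elsewhere.
struct orders_soa { uint32_t *price; uint32_t *qty; };

static uint64_t total_soa(const struct orders_soa *o, size_t n) {
    uint64_t t = 0;
    for (size_t i = 0; i < n; ++i) t += (uint64_t)o->price[i] * o->qty[i];
    return t;
}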
Make it part of CI (and don’t spam production)
- Create a perf harness script that:
  - Runs a fixed workload
  - Captures perf stat -d into artifacts
  - Optionally records samples and emits a Flamegraph SVG
  - Fails the job when a guarded metric regresses beyond a tolerance (with noise buffers)
- Nightly/stage jobs can run heavier profiles; PRs run lightweight sanity checks
- Store artifacts (SVGs, folded stacks, perf.data) to compare across commits
Example harness sketch:
#!/usr/bin/env bash
set -euo pipefail
W="./bench --rows 2000000 --threads 4"
perf stat -d --repeat 5 -- $W 2> perf.stat.txt || true
perf record -F 199 -g -- $W
perf script | stackcollapse-perf.pl > out.folded
flamegraph.pl --title "bench" out.folded > flame.svg
Guardrails: correctness first
- Always run unit/integration tests before trusting measurements
- Keep sanitizer builds separate from perf builds; use them to validate logic while iterating on perf with release flags
- Watch error rates alongside latency/throughput; a “speedup” that drops work is not a win
A printable checklist
- Build: -O3 -g -fno-omit-frame-pointer (symbols kept for profiling)
- Measure: perf stat -d then perf record -F 199 -g
- Visualize: perf script | stackcollapse-perf.pl | flamegraph.pl
- Decide: pick the widest relevant leaf you own; form a concrete hypothesis
- Verify: counters and eBPF confirm the intended mechanism changed
- Regressions: check P95/P99, not just averages; compare with differential Flamegraphs
- Rollout: guard, canary, track build-id; keep an escape hatch
Closing thoughts
Performance work is engineering, not art: measure, hypothesize, change, verify, repeat. Sampling shows you the mountain; eBPF explains the trail; Flamegraphs make it obvious where to dig. Keep your stacks trustworthy, your workloads representative, and your experiments reversible. The teams that win at performance aren’t magical—they’re disciplined.