You can’t optimize what you can’t see. In C—where a single branch misprediction can be the difference between “fast” and “oops”—good performance work starts with measurement. Not vibes, not guesses, and definitely not cargo-cult flags. Measurement. The tools are excellent: perf for low-overhead sampling and counters, eBPF for surgical tracing, and Flamegraphs for turning a million tiny samples into an “aha.”
This post is a practical path to actionable visibility:
- Understand sampling vs tracing and when to use each
- Get reliable call stacks (frame pointers vs DWARF unwind)
- Read the right counters (cycles, cache misses, branch misses, context switches)
- Generate Flamegraphs that point at the real hot paths
- Use eBPF to answer “why” without rewriting your app
No silver bullets—just repeatable workflows that keep you honest and ship wins.
What you should measure (and why)
Performance is a budget, not a vibe. Pick metrics that map to user experience and capacity planning:
- Throughput: requests/second, bytes/second, jobs/second
- Latency: P50/P95/P99 end-to-end, plus key internal steps
- CPU: cycles, IPC (instructions per cycle), stalled cycles
- Memory: LLC misses, DTLB/ITLB misses, cache hit ratios
- Syscalls and context switches: kernel overhead and scheduler churn
- I/O: read/write syscalls, bytes moved, short/partial ops, EAGAIN rates
When a number regresses, you want to answer two questions fast:
1) Where is the time going? 2) Why is the time going there?
Sampling Flamegraphs answer (1). Targeted tracing (eBPF) answers (2).
Sampling vs tracing (pick the right lens)
- Sampling (e.g., perf record at 99–999 Hz):
  - Low overhead; minimal code changes
  - Great for big-picture hotspots and steady-state CPU use
  - Produces call stacks you can aggregate into Flamegraphs
- Tracing (e.g., eBPF uprobes/kprobes):
  - Event-driven; you choose what to record
  - Higher fidelity for rare/short events, syscalls, or specific functions
  - Perfect for attribution (“which call sites trigger this path?”) and argument/result capture
Use sampling to find the mountain. Use tracing to map the trail.
Make stacks trustworthy (symbols, frame pointers, unwinding)
Flamegraphs are only as good as their stacks. Three rules make life easy:
- Build with symbols. Add -g to your CFLAGS for debug symbols (no need to drop -O3). Strip them in release artifacts if needed, but keep symbolized builds for profiling.
- Keep frame pointers on hot targets. On x86_64, compile with -fno-omit-frame-pointer so perf can unwind cheaply and reliably (--call-graph fp).
- If you can’t use frame pointers, use DWARF unwinding: perf record --call-graph dwarf. It’s more expensive but works without frame pointers.
Minimal build line you can adapt:
cc -O3 -g -fno-omit-frame-pointer -march=native -pipe -pthread app.c -o app
Quickstart: perf in five minutes
Warm up with three commands that cover 80% of cases.
- High-level CPU and memory story:
perf stat -d --repeat 3 ./app --args
Key fields to watch:
- cycles / instructions → IPC (higher is generally better; persistently low IPC means stalls)
- branches / branch-misses → misprediction tax
- cache-misses / LLC-load-misses → working set and locality pain
- context-switches / cpu-migrations → scheduler churn
- Record samples with call stacks:
perf record -F 199 -g --call-graph fp -- ./app --args
# If you lack frame pointers, prefer DWARF unwinding:
perf record -F 199 --call-graph dwarf -- ./app --args
- Triage in TUI, then produce a Flamegraph:
perf report
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Notes:
- Sampling frequency: 99–199 Hz for low overhead; 999 Hz when chasing tight inner loops (still usually fine). Always measure overhead.
- If your workload is server-like, profile under representative concurrency and input sizes; otherwise your “hot path” is a different program.
- For short-lived CLIs, wrap execution in a loop to collect enough samples:
perf record -F 199 -g -- sh -c 'for i in $(seq 1 50); do ./app --args; done'
Reading the tea leaves: what common counters mean
Perf counters look cryptic, but they are interpretable:
- Low IPC with high LLC misses: memory-bound; focus on layout, cache locality, and data movement
- High branch-misses: unpredictable branches; consider branchless transforms, better data ordering, or hints
- Many context switches: blocking or oversubscription; revisit threading and I/O models
- Syscalls per request very high: chatty I/O; batch with readv/writev, buffer, or use zero-copy paths
Tie observations to hypotheses you can test in minutes, not weeks.
A tiny C kernel to have something to look at
If you need a toy target, here’s a brutally simple hotspot to validate your setup:
#include <stddef.h>
#include <stdint.h>

static uint64_t sum_u32(const uint32_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

int main(void) {
    enum { N = 1 << 26 }; // ~64M elements, 256 MB of uint32_t
    static uint32_t buf[N];
    for (size_t i = 0; i < N; ++i) buf[i] = (uint32_t)i;
    return (int)(sum_u32(buf, N) & 0xFF);
}
Compile with symbols and frame pointers, then profile. You should see the loop dominate, with LLC misses if your memory can’t keep up.
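If you want to see the memory-bound case more starkly, here is a small variant (a sketch; the stride value is illustrative) that walks the same buf in a cache-hostile order so the hardware prefetcher can’t hide latency. Compare perf stat -d for both versions.
// Sketch: visit every element, but with a large stride between accesses.
// Expect lower IPC and more LLC-load-misses than the sequential loop above.
static uint64_t sum_u32_strided(const uint32_t *a, size_t n, size_t stride) {
    uint64_t s = 0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}
// e.g., call sum_u32_strided(buf, N, 4096) from main() instead of sum_u32(buf, N)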
Coming up next in this journey: turning samples into Flamegraphs you can read at a glance—and then using eBPF to answer “why this path, from which callers, with what arguments?”
From raw samples to Flamegraphs (end-to-end)
You’ve got perf.data. Now turn it into a picture your brain can parse in seconds.
Install FlameGraph tools (once)
git clone https://github.com/brendangregg/FlameGraph
export PATH="$PWD/FlameGraph:$PATH"
Capture, collapse, draw
# 1) Record with stacks
perf record -F 199 -g --call-graph fp -- ./app --args
# 2) Convert samples to folded stacks
perf script | stackcollapse-perf.pl > out.folded
# 3) Render the SVG
flamegraph.pl --title "app on-CPU" --color hot --width 1400 out.folded > flame.svg
Tips that save hours:
- If you see many [unknown] frames, ensure -g -fno-omit-frame-pointer or use --call-graph dwarf.
- If kernel frames dominate and are unmapped, run with privileges and ensure kernel.kptr_restrict=0 (or install debug symbols). Alternatively, restrict recording to user space (e.g., perf record -e cycles:u).
- Stacks from multiple processes/threads can be merged across runs; just append folded files before rendering.
Symbol resolution you’ll actually use
- Userspace: install -dbg packages or keep a symbolized build. perf buildid-cache -a can prime the cache; PERF_BUILDID_DIR controls where lookups happen. Modern distros also support debuginfod (set DEBUGINFOD_URLS) to auto-fetch symbols.
- Kernel: install kernel-debuginfo (or the distro equivalent) and mount debugfs (/sys/kernel/debug).
How to read Flamegraphs (fast and correctly)
- Width equals total time on-CPU across all samples for that stack. Fix wide regions first.
- Self time lives at the leaf. A wide parent but narrow leaf suggests time is spread across children—drill down before changing code.
- Kernel frames (e.g., __sched_text_start, tcp_sendmsg) mean you’re in the kernel; decide if that’s expected (I/O-heavy) or a sign of inefficiency (chatty syscalls, small writes, locks).
- Many tiny adjacent leaves along the same parent often indicate cold instruction cache or fragmented call sites; consider simplifying the hot path.
- Deep stacks with slim leaves are often structural (framework overhead). Attack with batching, fewer layers, or a fast path.
Common shapes and what they imply
- "Comb" under
malloc
/free
: allocator churn. Pool, reuse, or switch allocators. Considerposix_memalign
for alignment-sensitive code. - Wide
memcpy
/memmove
: memory-bound. Improve locality, reduce copies, usereadv
/writev
, or redesign data layout. futex
/__lll_lock_wait
: lock contention. Reduce critical sections; switch to per-core sharding; use lock-free where justified.- Branch-heavy leaf with high
branch-misses
: unpredictable conditionals. Reorder data, precompute hints, use branchless arithmetic.
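A minimal branchless sketch (function names are made up; confirm with perf stat that branch-misses actually drop and that runtime improves, since branchless code can lose when the branch is predictable):
#include <stddef.h>
#include <stdint.h>

// Branchy: with random input, the condition is unpredictable and pays the
// branch-miss tax on roughly half of the iterations.
uint64_t sum_if_branchy(const uint32_t *a, size_t n, uint32_t limit) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] < limit) s += a[i];
    return s;
}

// Branchless: turn the condition into a 0/1 multiplier and always do the add.
uint64_t sum_if_branchless(const uint32_t *a, size_t n, uint32_t limit) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * (uint64_t)(a[i] < limit);
    return s;
}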
Validate with code and counters
Pair the picture with numbers:
- Use perf stat -d -d on the narrowed workload to see IPC and cache behavior.
- Use line-level annotation to confirm the exact instructions doing work:
perf annotate --stdio --symbol your_hot_function | sed -n '1,120p'
If you built with -fno-omit-frame-pointer and -g, annotate will align source lines to hot instructions. Look for expensive loads, stores, and branches.
Differential Flamegraphs (prove the win)
Before shipping an optimization, compare before/after. In the diff graph, red marks stacks that grew (regressions) and blue marks stacks that shrank (improvements); you want red to be scarce and blue to be obvious.
# Baseline
perf record -F 199 -g -- ./app --args && perf script | stackcollapse-perf.pl > base.folded
# After change
perf record -F 199 -g -- ./app --args && perf script | stackcollapse-perf.pl > new.folded
# Diff and render
difffolded.pl base.folded new.folded > diff.folded
flamegraph.pl --title "diff" --color=hot diff.folded > diff.svg
Rules of thumb:
- Keep workload identical (data, concurrency, warmup). If it changes, the comparison lies.
- Prefer wall-clock–normalized runs (same duration or same work count) to avoid sampling bias.
- Use both the diff graph and perf stat deltas; a 10% IPC gain with flat runtime usually means you moved the bottleneck.
Pitfalls that make graphs lie
- Stripped or LTO’ed binaries without symbols: keep a symbolized build for profiling, or use debuginfod.
- Omitted frame pointers with fragile DWARF: prefer frame pointers on hot binaries.
- ASLR and containers: make sure perf can read /proc in the namespace; grant CAP_SYS_ADMIN or run with --privileged only in trusted environments.
- Short-lived processes: wrap in loops to collect enough samples.
- Mixed binaries (plugins): ensure all shared objects are built with matching unwind settings.
eBPF: answering “why” with surgical tracing
Sampling shows you where CPU burns time. eBPF shows you why: which syscalls, which arguments, which call sites, and how long they take—without recompiling your app or rebooting.
What eBPF can hook (and when to choose it)
- kprobes/kretprobes: dynamic hooks on kernel functions
- tracepoints: stable kernel ABI events (preferred where available)
- uprobes/uretprobes: dynamic hooks on userspace functions in any ELF (your service, libc, libssl, …)
- USDT: static user probes compiled into apps/libraries (stable names + arguments)
Pick tracepoints first (stable), then uprobes for user code you own, then kprobes when nothing else exposes what you need.
Safety, overhead, and permissions
- eBPF programs are verified and JITed; typical overhead for light event streams is a few percent.
- Always filter early (by PID/TGID, comm, cgroup, port) and keep maps bounded to prevent high-cardinality explosions.
- You’ll need privileges (root or CAP_BPF/CAP_SYS_ADMIN) and a new-ish kernel with BPF and BTF enabled.
Quick wins with bpftrace (zero build system changes)
Count syscalls by process name:
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
Per-process write(2) size histogram (the exit tracepoint carries the byte count returned; replace PID with your target process id):
sudo bpftrace -e '
tracepoint:syscalls:sys_exit_write /pid == PID/ { @bytes[comm] = hist(args->ret); }
'
Per-function latency histogram in your app via uprobes (entry/return timing):
# Replace ./app with full path; symbol must be visible (no inlining or strip)
sudo bpftrace -e '
  uprobe:./app:hot_func { @t[tid] = nsecs; }
  uretprobe:./app:hot_func /@t[tid]/ {
    @lat = hist((nsecs - @t[tid]) / 1000);
    delete(@t[tid]);
  }
'
Stack traces on expensive kernel events (e.g., unusually large read(2) returns):
sudo bpftrace -e '
  tracepoint:syscalls:sys_exit_read /args->ret > 1*1024*1024/ {
    printf("big read by %s pid=%d\n", comm, pid);
    printf("%s\n", kstack);
    printf("%s\n", ustack);
  }
'
Notes:
- Use absolute paths for uprobes; add -fno-omit-frame-pointer to your binary for better ustack quality.
- hist() produces log2 histograms; great for spotting bimodal size or latency distributions.
- Attach filters (/pid == XYZ/, /comm == "app"/) to avoid drowning the system.
USDT probes when you have them
Some runtimes/libraries expose USDT (DTrace-style) probes. They’re stable and carry typed arguments.
List and attach:
sudo bpftrace -l 'usdt:./app:*'
sudo bpftrace -e 'usdt:./app:request__start { @[probe] = count(); }'
If you own the C code, adding USDT probes via systemtap’s sys/sdt.h macros (or libstapsdt for probes generated at runtime) yields durable hooks that survive inlining and LTO; a minimal sketch follows.
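A sketch of hand-placed USDT probes with the sys/sdt.h macros (the provider name, probe names, and function are made up; the header usually comes from a systemtap-sdt development package):
#include <sys/sdt.h>  // DTRACE_PROBEn macros; compile to a single nop until a tracer attaches

// Hypothetical request handler instrumented with start/done probes.
void handle_request(long req_id, int route) {
    DTRACE_PROBE2(app, request__start, req_id, route);
    /* ... real work ... */
    DTRACE_PROBE2(app, request__done, req_id, route);
}
Once built, the probes should show up when you list usdt:./app:* as in the commands above.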
Turning questions into one-liners
- “Which call sites cause small writes?” Attach a uprobe/uretprobe to write in libc and print ustack() when ret < 512.
- “Why are we context switching so much?” Trace sched:sched_switch and aggregate by prev_comm/next_comm pairs.
- “What’s the tail of TLS handshake time?” Uprobes on SSL_do_handshake entry/return with a latency histogram.
A tiny bpftrace script: per-process write sizes (top-N)
#!/usr/bin/env bpftrace
BEGIN { printf("tracking write(2) sizes by PID...\n"); }
tracepoint:syscalls:sys_exit_write /args->ret > 0/ {
  @bytes[pid, args->ret] = count();
}
interval:s:5 {
  printf("\nTOP write sizes (pid,size -> count)\n");
  print(@bytes, 10);
  clear(@bytes);
}
Run for a short window in staging/production during a controlled test. Prefer short intervals and capped prints to keep overhead and log volume low.
libbpf/CO-RE when you need permanence
For sustained use in CI/CD or production, prefer libbpf CO-RE (Compile Once – Run Everywhere):
- Stable to kernel changes thanks to BTF type info
- Explicit maps, ring-buffers, and perf-buffers for controlled data flow
- Lower overhead than generic frontends; easier to version and test
Minimal outline (userland) you’ll see in most CO-RE samples:
// Sketch: open the compiled BPF object, load it, attach, then poll a ring buffer.
// Program/map/callback names ("handle_event", "events", on_event) come from your own prog.bpf.c.
struct bpf_object *obj = bpf_object__open_file("prog.bpf.o", NULL);
if (!obj || bpf_object__load(obj)) { /* handle error */ }
struct bpf_link *link = bpf_program__attach(bpf_object__find_program_by_name(obj, "handle_event"));
struct ring_buffer *rb = ring_buffer__new(bpf_map__fd(bpf_object__find_map_by_name(obj, "events")),
                                          on_event, NULL, NULL);
while (ring_buffer__poll(rb, 100 /* ms */) >= 0) { /* on_event aggregates and prints */ }
Use the official libbpf-bootstrap templates to scaffold projects quickly.
Production etiquette (so tracing doesn’t bite back)
- Time-box sessions and cap event rates. If counts soar, stop or add filters.
- Bound map keys (e.g., top-N with LRU) to avoid memory blow-ups.
- Prefer histograms and aggregated counters over per-event logs; export deltas periodically.
- Roll out tracing behind feature flags; collect build-id and version metadata for reproducibility.
- Document probe points and arguments in the repo so future you knows what “lat_us” meant.
Turn findings into wins (a practical optimization workflow)
You have a hot path and a hypothesis. Move deliberately:
- Form a specific hypothesis. “Batch small writes into writev to cut syscalls by 10x.”
- Create a minimal, reversible change. Feature flag it.
- Measure locally under a representative load:
  - perf stat -d --repeat 5 for counters
  - perf record -F 199 -g + Flamegraph for hotspot shifts
  - Targeted eBPF to confirm the mechanism (e.g., histogram of write sizes)
- Decide with data. If the intended metric doesn’t improve (or another regresses), revert or adjust.
- Land behind a guard, then validate in staging/canary with the exact same measurements.
Reproducible benchmarking (environment matters)
Noise can dwarf a 5–10% win. Control what you can:
- Pin CPUs and memory: taskset -c 2-3 or numactl --cpunodebind=0 --membind=0
- Keep frequency steady: use the performance governor during tests
sudo cpupower frequency-set -g performance
- Eliminate cross-talk: dedicate cores or a host; avoid background builds/scans
- Warm caches: run once to warm, then measure
- Fix data shape and size: performance is input-dependent
- Record the environment: kernel version, microcode, CPU model, governor, NUMA topology
For pinned measurements:
perf stat -C 2 -d -- ./app --args
Case studies (patterns that actually pay)
1) Syscall chatter → writev batching
Findings: Flamegraph shows wide write leaves; eBPF histograms show many sub-512B writes; perf stat shows high syscall counts and context switches.
Change: Aggregate adjacent buffers; flush with writev up to a byte budget (sketch below).
Result: Syscalls/request drop by 8–20x; perf stat shows fewer context switches; Flamegraph shifts heat from kernel code to your userland batching.
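A sketch of the batching side (struct, limits, and names are illustrative; a production version also has to handle short writes from writev and EAGAIN on non-blocking sockets):
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BATCH_MAX   64          // illustrative limits
#define BATCH_BYTES (64 * 1024)

struct batch {
    struct iovec iov[BATCH_MAX];
    int          cnt;
    size_t       bytes;
};

// One writev() for the whole queue instead of one write() per buffer.
static ssize_t batch_flush(int fd, struct batch *b) {
    ssize_t n = b->cnt ? writev(fd, b->iov, b->cnt) : 0;
    b->cnt = 0;
    b->bytes = 0;
    return n;
}

// Caller keeps buf alive until the next flush.
static ssize_t batch_add(int fd, struct batch *b, void *buf, size_t len) {
    b->iov[b->cnt].iov_base = buf;
    b->iov[b->cnt].iov_len  = len;
    b->cnt++;
    b->bytes += len;
    return (b->cnt == BATCH_MAX || b->bytes >= BATCH_BYTES) ? batch_flush(fd, b) : 0;
}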
2) Allocator churn → pooling
Findings: Wide malloc/free bands; TCMalloc/glibc allocator frames dominate.
Change: Introduce per-thread pools/arenas; reuse buffers; cap peak memory (sketch below).
Result: IPC rises, cycles fall; tail latency tightens as GC-like pauses vanish; eBPF shows sys_enter_brk/mmap events nearly disappear.
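A sketch of the pooling idea, assuming fixed-size buffers (a bounded per-thread free list; size and depth are illustrative):
#include <stdlib.h>

enum { BUF_SZ = 4096, POOL_DEPTH = 256 };   // illustrative

static _Thread_local void *pool[POOL_DEPTH];
static _Thread_local int   pool_top;

static void *buf_get(void) {
    if (pool_top > 0)
        return pool[--pool_top];   // hot path: reuse, no allocator call
    return malloc(BUF_SZ);         // cold path only
}

static void buf_put(void *p) {
    if (pool_top < POOL_DEPTH)
        pool[pool_top++] = p;      // keep for reuse, bounded
    else
        free(p);                   // cap peak memory
}
With the free list in place, the malloc/free comb should collapse in the Flamegraph, and the counters should confirm the allocator stops going to the kernel.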
3) Cache misses → layout and traversal
Findings: Low IPC with high LLC-load-misses; Flamegraph dominated by copy/parse loops.
Change: Switch AoS→SoA for hot fields (sketch below); precompute indexes; reduce pointer chasing; prefetch cautiously and measure.
Result: LLC misses drop; cycles per element fall; Flamegraph leaf narrows around the compute kernel instead of memory stalls.
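For the layout change, a sketch of the AoS→SoA transform on a hypothetical record type; the hot loop then touches only dense arrays, so each cache line carries useful data:
#include <stddef.h>
#include <stdint.h>

// AoS: summing price*qty drags the cold 48-byte tag through the cache too.
struct order_aos { uint64_t id; uint32_t price; uint32_t qty; char tag[48]; };

static uint64_t total_aos(const struct order_aos *o, size_t n) {
    uint64_t t = 0;
    for (size_t i = 0; i < n; ++i) t += (uint64_t)o[i].price * o[i].qty;
    return t;
}

// SoA: hot fields live in their own dense arrays; cold fields stay elsewhere.
struct orders_soa { uint32_t *price; uint32_t *qty; };

static uint64_t total_soa(const struct orders_soa *o, size_t n) {
    uint64_t t = 0;
    for (size_t i = 0; i < n; ++i) t += (uint64_t)o->price[i] * o->qty[i];
    return t;
}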
Make it part of CI (and don’t spam production)
- Create a perf harness script that:
  - Runs a fixed workload
  - Captures perf stat -d into artifacts
  - Optionally records samples and emits a Flamegraph SVG
  - Fails the job when a guarded metric regresses beyond a tolerance (with noise buffers)
- Nightly/stage jobs can run heavier profiles; PRs run lightweight sanity checks
- Store artifacts (SVGs, folded stacks, perf.data) to compare across commits
Example harness sketch:
#!/usr/bin/env bash
set -euo pipefail
W="./bench --rows 2000000 --threads 4"
perf stat -d --repeat 5 -- $W 2> perf.stat.txt || true
perf record -F 199 -g -- $W
perf script | stackcollapse-perf.pl > out.folded
flamegraph.pl --title "bench" out.folded > flame.svg
Guardrails: correctness first
- Always run unit/integration tests before trusting measurements
- Keep sanitizer builds separate from perf builds; use them to validate logic while iterating on perf with release flags
- Watch error rates alongside latency/throughput; a “speedup” that drops work is not a win
A printable checklist
- Build: -O3 -g -fno-omit-frame-pointer (symbols kept for profiling)
- Measure: perf stat -d then perf record -F 199 -g
- Visualize: perf script | stackcollapse-perf.pl | flamegraph.pl
- Decide: pick the widest relevant leaf you own; form a concrete hypothesis
- Verify: counters and eBPF confirm the intended mechanism changed
- Regressions: check P95/P99, not just averages; compare with differential Flamegraphs
- Rollout: guard, canary, track build-id; keep an escape hatch
Closing thoughts
Performance work is engineering, not art: measure, hypothesize, change, verify, repeat. Sampling shows you the mountain; eBPF explains the trail; Flamegraphs make it obvious where to dig. Keep your stacks trustworthy, your workloads representative, and your experiments reversible. The teams that win at performance aren’t magical—they’re disciplined.