You don’t have to choose between correct results and a fast service to get them. The trick is matching the profiler to the question, keeping overhead low enough for reality to shine through, and reading the output like an engineer—not a fortune teller.
This guide builds a production‑first profiling playbook for Python using four pillars that complement each other:
- py-spy for safe, low‑overhead CPU profiling and flamegraphs
- perf for system‑grade sampling and native visibility
- eBPF (bcc/bpftrace, continuous profilers) for off‑CPU and whole‑system views
- Interpreter‑level counters (e.g., tracing hooks, PEP 669 monitoring) when you need precise, semantic signals
We’ll start with a taxonomy and overhead budgets you can apply immediately, then get hands‑on with quick wins you can run today.
Profiling taxonomy (choose the right lens)
- Sampling (py-spy, perf, eBPF): periodic stack snapshots. Low overhead, great for truth in production. You get where time is burned on‑CPU; with the right events, you can also see off‑CPU stacks.
- Tracing (cProfile, yappi): record every call/return. High detail, higher overhead. Use in dev, tight scopes, or test environments.
- Interpreter counters/monitoring (sys.setprofile, sys.settrace, PEP 669's sys.monitoring): semantic hooks from the VM. Powerful for targeted instrumentation; budget carefully.
Overhead budgets that keep results honest
Practical guidance:
- Aim for single‑digit percent overhead for truth in prod. Prefer sampling (py-spy, perf, eBPF) unless you need exact counts.
- Use tracing profilers on small, representative runs or in CI to validate hypotheses, not on live hot paths; a quick way to measure what tracing costs on your own workload is sketched after this list.
- For async and I/O-heavy services, add off-CPU visibility (eBPF, perf sched, block I/O) or you'll misattribute time to the wrong functions.
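Here is that overhead check: a minimal sketch that times a CPU-bound function bare and again under cProfile. The workload function and iteration count are placeholders; swap in a representative slice of your own code.
# overhead_check.py - rough measurement of what cProfile's tracing costs on a sample workload
import cProfile
import time

def workload() -> int:
    # Stand-in for a real hot path; replace with a representative slice of your code.
    return sum(i * i for i in range(1_000_000))

def timed(fn) -> float:
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

if __name__ == "__main__":
    bare = timed(workload)

    pr = cProfile.Profile()
    pr.enable()
    traced = timed(workload)
    pr.disable()

    print(f"bare:   {bare:.3f}s")
    print(f"traced: {traced:.3f}s ({(traced / bare - 1) * 100:.0f}% overhead)")
If the overhead you measure is larger than the effect you are hunting, fall back to a sampler.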
Quick wins you can run today
1) Live top of Python hotspots (zero code changes)
py-spy top --pid <PID>
What you get: a moving “top” of Python frames consuming CPU across threads. Use this to confirm suspected hot paths before you record anything heavier.
2) Record a flamegraph you can share
py-spy record --rate 99 --pid <PID> -o profile.svg
- --rate 99: sample at ~99 Hz (a friendly default for web services).
- Open profile.svg in a browser; it's interactive.
Include idle/waiting stacks when investigating stalls:
py-spy record --idle --rate 99 --pid <PID> -o profile.svg
This adds stacks for threads that are sleeping or blocked (e.g., I/O), helping distinguish CPU from wait time.
3) System view with perf (Python symbols included)
On recent Python builds, enable Python function symbolization:
python -X perf your_app.py &
PERFPID=$!
perf record -F 999 -g -p $PERFPID -- sleep 30
perf report -g
This pairs system‑grade sampling with Python frame names so you see both native and Python time on the same call graph.
4) Off‑CPU (waiting) time, at a glance
If CPU looks fine but latency is high, sample waits/off‑CPU stacks. Two lightweight options:
- perf scheduling timeline
perf sched record -p <PID> -- sleep 10
perf sched timehist -p <PID> | less
- eBPF (bcc/bpftrace) off‑CPU profiles (advanced). This captures stacks where threads are not on‑CPU, surfacing mutex, I/O, and backpressure bottlenecks.
How to read a flamegraph without fooling yourself
Flamegraphs show where time accumulates on stacks. Width encodes time; height encodes call depth.
- Look for wide frames near the bottom: those are foundational hotspots.
- Prefer deltas: take a baseline, make one change, take another profile.
- Separate CPU vs off‑CPU: a wide frame in a CPU‑only profile is a compute hotspot; in an idle‑inclusive profile it may be waiting.
Ground rules for fair runs
- Pin scenarios: reproduce a realistic workload (QPS/concurrency, dataset size, cache state). Report the environment (CPU, Python version, OS).
- Warm vs cold: make it explicit. For CPU hotspots, measure warmed; for throughput limits, do both.
- Hold inputs constant; change one thing at a time. Prefer scripts over hand-driven runs (an environment-capture helper is sketched after this list).
- Budget: keep profiler overhead below the “signal” you’re trying to measure.
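And here is the environment-capture helper referenced above: a minimal sketch that dumps interpreter and host details to a JSON file you can keep next to the profile. The output filename is just an example.
# env_snapshot.py - capture host/interpreter details to keep next to a profile (filename is an example)
import json
import os
import platform
import sys

def environment_snapshot() -> dict:
    return {
        "python": sys.version,
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
    }

if __name__ == "__main__":
    snapshot = environment_snapshot()
    with open("profile.env.json", "w") as f:
        json.dump(snapshot, f, indent=2)
    print(json.dumps(snapshot, indent=2))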
Minimal, targeted interpreter hooks (when you need exact answers)
Sometimes you need precise counts or lifecycle signals the sampler won’t give you. Use interpreter hooks sparingly and locally.
Counting calls to a specific function without touching callers:
# count_calls.py
import sys

TARGET = ("my_module", "expensive_fn")
counts: dict[str, int] = {"calls": 0}

def _prof(frame, event, arg):
    if event == "call":
        code = frame.f_code
        if code.co_name == TARGET[1] and code.co_filename.endswith(TARGET[0].replace(".", "/") + ".py"):
            counts["calls"] += 1
    return _prof

def main() -> None:
    sys.setprofile(_prof)
    try:
        import my_module  # your program starts here
        my_module.run()
    finally:
        sys.setprofile(None)
    print("expensive_fn calls:", counts["calls"])

if __name__ == "__main__":
    main()
Use this pattern narrowly (specific modules/functions, short windows). For broader, lower-overhead monitoring hooks, modern Python exposes sys.monitoring (PEP 669), which we'll use later for focused, opt-in signals.
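For a taste of what that looks like, here is a minimal sys.monitoring sketch (Python 3.12+) that counts function entries in one module. The module filter is a placeholder; a real deployment would scope and tear down the callback just as carefully as the setprofile version above.
# monitoring_count.py - minimal PEP 669 call counter (Python 3.12+); the module filter is a placeholder
import sys
from collections import Counter

TOOL_ID = sys.monitoring.PROFILER_ID
counts: Counter[str] = Counter()

def on_py_start(code, instruction_offset):
    # Count entries into functions defined in the module we care about.
    if "my_module" in code.co_filename:
        counts[code.co_qualname] += 1

def start() -> None:
    sys.monitoring.use_tool_id(TOOL_ID, "call-counter")
    sys.monitoring.register_callback(TOOL_ID, sys.monitoring.events.PY_START, on_py_start)
    sys.monitoring.set_events(TOOL_ID, sys.monitoring.events.PY_START)

def stop() -> None:
    sys.monitoring.set_events(TOOL_ID, sys.monitoring.events.NO_EVENTS)
    sys.monitoring.register_callback(TOOL_ID, sys.monitoring.events.PY_START, None)
    sys.monitoring.free_tool_id(TOOL_ID)

if __name__ == "__main__":
    start()
    try:
        sum(i * i for i in range(1000))  # stand-in for your workload
    finally:
        stop()
    for name, n in counts.most_common(10):
        print(n, name)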
What you should have now:
- A clear map for choosing sampling vs tracing vs interpreter signals
- A safe default (py-spy) to get actionable CPU profiles in minutes
- First tools to separate CPU vs waiting time (perf/eBPF, py-spy --idle)
Next, we’ll go deeper on py-spy (practical flags, accuracy gotchas, and reading real service profiles), then layer in perf/eBPF and interpreter counters for full coverage.
py-spy in practice: flags that matter, results you can trust
py-spy is the fastest path to trustworthy CPU profiles in production. Use these options to get the right picture the first time.
Install and basic usage
pip install py-spy
# Attach to a running process
py-spy top --pid <PID>
# Record an interactive flamegraph
py-spy record --rate 99 --pid <PID> -d 60 -o profile.svg
Common options:
- -r / --rate <Hz>: sampling frequency (e.g., 50–250). Higher = finer detail, more overhead.
- -d <sec>: duration. Prefer 30–120s under representative load.
- --idle: include sleeping/blocked stacks (great for latency investigations).
- --native: include native frames (C extensions). Slightly higher overhead.
- --subprocesses: follow children spawned by the target.
- --format speedscope -o out.speedscope.json: export for speedscope/trace visualization.
- py-spy record -o out.svg -- python app.py: launch and profile a new process.
Attach and permissions (don’t get blocked)
Linux ptrace restrictions (Yama ptrace_scope) can prevent attaching to non-child processes. Fixes:
# Temporarily relax ptrace for this session (revert to 1 or original after)
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
# or
sudo sysctl kernel.yama.ptrace_scope=0
# If running inside a container, you may need extra caps
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ...
macOS: attach only to non-system Python builds; you typically need to run py-spy with sudo. System Integrity Protection (SIP) blocks profiling of protected binaries.
Windows: run an elevated shell (Administrator) to attach to other users’ processes.
Sampling recipes you’ll actually use
- CPU hotspots during a load test
py-spy record --rate 99 -d 120 --pid <PID> -o cpu-hotspots.svg
- Latency investigation (see waiting time)
py-spy record --idle --rate 99 -d 60 --pid <PID> -o latency.svg
- Native work in C extensions (NumPy/Pandas/etc.)
py-spy record --native --rate 199 -d 45 --pid <PID> -o native.svg
- Shareable, zoomable timeline with speedscope
py-spy record --rate 99 -d 60 --pid <PID> --format speedscope -o profile.speedscope.json
# Open with speedscope (CLI or web):
# npx speedscope profile.speedscope.json
Resolution vs overhead: pick the right point
Guidance:
- Start at 99 Hz for web services; drop to 50 Hz for very busy hosts; spike to 199 Hz briefly for micro‑hotspots.
- Prefer longer duration with moderate rate over ultra‑high rates: time dominates variance.
Following subprocesses and workers
Many apps fork worker processes (gunicorn, Celery, job runners). Follow them:
py-spy record --subprocesses --rate 99 -d 90 --pid <MASTER_PID> -o farm.svg
You’ll get a combined flamegraph across the process tree; large bottoms suggest shared bottlenecks (imports, serialization, DB clients).
Containers and Kubernetes
- Ensure the container has the SYS_PTRACE capability and a permissive seccomp profile.
- If attaching from the host, match PID namespaces (nsenter -t <container-pid> -m -p py-spy ...).
- In Kubernetes, consider a debug pod or sidecar with the needed caps; avoid adding ptrace to public-facing pods by default.
Reading results: a quick interpretation loop
Tips:
- Don’t chase narrow stacks at the top; fix wide bases first.
- If CPU looks balanced but latency is high, re‑record with
--idle
. - If frames are all native, re‑record with
--native
(or useperf
to see kernel/extension cost).
Accuracy and caveats
- Short functions may be under‑sampled; rely on sufficient duration, not just rate.
- GIL: py-spy shows Python time across threads, but in pure Python only one thread runs bytecode at a time; don't mistake a busy multi-threaded top for parallel compute.
- Native extensions often release the GIL; wide native frames are real CPU time and can run in parallel (a small demo follows this list).
- Very high rates on very hot processes can increase perturbation—prefer 60–120s at 99 Hz over 5s at 999 Hz.
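To see the difference for yourself, the sketch below runs the same thread count over a pure-Python loop and over zlib.compress, which releases the GIL while compressing (the assumption here is that your native workload behaves similarly). Profile it with py-spy while it runs: the pure-Python phase tops out near one core, while the zlib phase spreads across cores.
# gil_demo.py - contrast GIL-bound Python loops with native work that releases the GIL
import threading
import time
import zlib

DATA = b"x" * 4_000_000  # a few MB so compression dominates each call

def python_loop() -> None:
    s = 0
    for i in range(5_000_000):
        s += i * i

def native_compress() -> None:
    for _ in range(20):
        zlib.compress(DATA, level=6)  # zlib releases the GIL while compressing

def run_threads(target, n=4) -> float:
    threads = [threading.Thread(target=target) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"pure Python x4 threads: {run_threads(python_loop):.2f}s")
    print(f"zlib        x4 threads: {run_threads(native_compress):.2f}s")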
Minimal checklist before you hit record
- Representative load is running (real queries/traffic, warmed caches)
- Duration ≥ 30s (prefer 60–120s)
- Rate 50–199 Hz depending on host load
- Add --idle if investigating latency; --native if C code is involved
- Note environment: CPU, Python, OS; keep it next to the SVG/JSON
perf, properly: system‑grade sampling with Python frames
perf gives you low‑overhead, kernel‑to‑user visibility and is great when native code (extensions, syscalls, kernel) matters.
Enable Python symbolization
Recent CPython versions expose a stack trampoline so perf can label Python frames:
# Option A: enable via interpreter flag
python -X perf your_app.py &
PERFPID=$!
# Option B: env var
PYTHONPERFSUPPORT=1 python your_app.py &
Now record and report:
perf record -F 999 -g -p $PERFPID -- sleep 30
perf report -g
If your Python build lacks frame pointers, start the app with the DWARF-based mode instead (python -X perf_jit, Python 3.13+) and post-process the jitdump it emits:
perf record -F 999 -g --call-graph dwarf -p $PERFPID -- sleep 20
perf inject -i perf.data --jit -o perf.jit.data
perf report -g -i perf.jit.data
Tip: Verify frame pointers if you build Python yourself:
python -m sysconfig | grep -E "no-omit-frame-pointer|mno-omit-leaf-frame-pointer"
Useful perf views
- CPU profile on a PID (balanced default):
perf record -F 999 -g -p <PID> -- sleep 30 && perf report -g
- Whole system hot code (find noisy neighbors):
sudo perf top -g
- Scheduler timelines (who’s waiting vs running):
perf sched record -p <PID> -- sleep 10
perf sched timehist -p <PID> | less
- I/O latency hints (block layer):
sudo perf record -e block:block_rq_issue -e block:block_rq_complete -a -- sleep 10
sudo perf script | less
Where perf shines
- C extensions and native libraries (NumPy, crypto, compression)
- Kernel/syscall time (epoll, disk, networking)
- Correlating Python with native hot spots in one view
Quick workflow
Notes:
- Prefer a sustained load and a 20–60s capture. Use -F 4999 briefly for very bursty issues.
- For containers, run perf in the host namespace or grant the pod capabilities (privileged, or CAP_PERFMON on newer kernels).
eBPF for off‑CPU and continuous visibility
eBPF lets you sample stacks in‑kernel with very low overhead and capture where threads wait (mutex, I/O, scheduler). Great for latency hunts and always‑on profiles.
Off‑CPU with bpftrace (one‑liner)
sudo bpftrace -e 'tracepoint:sched:sched_switch /pid == <PID>/ { @offcpu[ustack] = count(); } interval:s:30 { exit(); }'
This tallies stacks observed when the target is switched off-CPU and exits after 30 seconds. Render the output with a flamegraph tool that accepts folded stacks, or browse counts directly.
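If you want a flamegraph from that output, a short converter helps. This is a sketch that assumes bpftrace's default map printout (a "@name[" line, indented frame lines, then "]: count") and emits the folded format flamegraph tools expect (semicolon-joined frames, root first, then a space and the count); adjust the parsing if your bpftrace version prints differently.
# stacks_to_folded.py - convert bpftrace @map[ustack] output to folded stacks
# Assumes the default bpftrace printout: "@name[", indented frame lines, then "]: <count>".
import sys

def convert(lines):
    frames: list[str] = []
    in_stack = False
    for line in lines:
        stripped = line.strip()
        if stripped.endswith("["):            # e.g. "@offcpu["
            frames, in_stack = [], True
        elif in_stack and stripped.startswith("]:"):
            count = int(stripped[2:].strip())
            # bpftrace prints the leaf frame first; folded format wants root first.
            yield ";".join(reversed(frames)) + f" {count}"
            in_stack = False
        elif in_stack and stripped:
            frames.append(stripped.split("+")[0])  # drop "+offset" suffixes

if __name__ == "__main__":
    for folded in convert(sys.stdin):
        print(folded)
Pipe the folded output into flamegraph.pl from the FlameGraph repo to render an SVG.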
CPU sampling with bpftrace
sudo bpftrace -e 'profile:hz:99 /pid == <PID>/ { @[ustack] = count(); } interval:s:30 { exit(); }'
You'll see user stacks at 99 Hz without instrumenting the app. Accuracy improves with frame pointers; Python 3.12's perf trampoline helps perf, not generic ustack, so prefer py-spy or perf for Python-named frames.
bcc “offcputime” (aggregated)
sudo offcputime-bpfcc -f -p <PID> 30 > offcpu.stacks
# -f emits folded stacks; 30 is the capture duration in seconds. Convert to a flamegraph if you have the FlameGraph scripts.
Containers / K8s
- eBPF typically requires privileged pods or an agent DaemonSet.
- For continuous profiling, deploy an eBPF agent rather than ad‑hoc commands.
When to reach for eBPF
- Latency spikes with low CPU usage
- Lock contention, queue backpressure, or I/O stalls suspected
- Always‑on, low‑overhead environment profiling across services
Interpreting off‑CPU profiles
Practical guidance:
- Pair off‑CPU with CPU flamegraphs to avoid shifting blame.
- If stacks are mostly library waits (e.g., DB client), instrument retries and pool limits; apply backpressure.
Interpreter-level counters: precise signals, scoped carefully
Tracing and monitoring hooks give semantic clarity when sampling can’t. Use them surgically to answer specific questions, and keep the blast radius small.
Deterministic CPU profiling of a code region (cProfile)
# region_profile.py
import cProfile, pstats, io

def work():
    # ... your hot function(s) ...
    return sum(i*i for i in range(500_000))

def profile_region() -> None:
    pr = cProfile.Profile()
    pr.enable()
    try:
        work()
    finally:
        pr.disable()
    s = io.StringIO()
    pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(20)
    print(s.getvalue())

if __name__ == "__main__":
    profile_region()
Use this when you need exact call counts and cumulative times in a bounded scope (tests, CI, dev runs). Avoid wrapping whole services.
Per-thread CPU vs wall time (yappi)
# yappi_profile.py
import time, yappi

def cpu_heavy():
    s = 0
    for i in range(3_000_000):
        s += i*i
    return s

def io_heavy():
    time.sleep(0.2)

if __name__ == "__main__":
    yappi.set_clock_type("cpu")  # or "wall"
    yappi.start()
    cpu_heavy(); io_heavy()
    yappi.stop()
    stats = yappi.get_func_stats()
    stats.sort("tsub").print_all()
Use wall clock for latency investigations; CPU clock to isolate compute.
Lightweight counters with tracing hooks (targeted)
# trace_count.py
import sys
from collections import Counter

counts: Counter[str] = Counter()

def tracer(frame, event, arg):
    if event == "call":
        name = f"{frame.f_code.co_filename}:{frame.f_code.co_name}"
        if "/my_service/handlers/" in frame.f_code.co_filename:
            counts[name] += 1
    return tracer

def main() -> None:
    sys.setprofile(tracer)
    try:
        # run a single request path or test suite here
        pass
    finally:
        sys.setprofile(None)
    for name, n in counts.most_common(15):
        print(n, name)

if __name__ == "__main__":
    main()
Scope these to a single request/test; remove after you’ve answered the question.
Memory growth checks (bonus)
# mem_check.py
import tracemalloc, time
tracemalloc.start()
# run workload
time.sleep(2)
current, peak = tracemalloc.get_traced_memory()
print(f"current={current/1e6:.1f}MB peak={peak/1e6:.1f}MB")
Great for catching memory regressions in CI.
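When the question is "what grew?", diff two tracemalloc snapshots. A minimal sketch; the workload function is a placeholder for the code path you suspect of leaking.
# mem_diff.py - diff tracemalloc snapshots to see which lines allocated the growth
import tracemalloc

def workload() -> list[bytes]:
    # Stand-in for the code path suspected of leaking.
    return [b"x" * 10_000 for _ in range(1_000)]

if __name__ == "__main__":
    tracemalloc.start(25)                  # keep up to 25 frames per allocation
    before = tracemalloc.take_snapshot()
    retained = workload()                  # keep a reference so the growth stays visible
    after = tracemalloc.take_snapshot()

    for stat in after.compare_to(before, "lineno")[:10]:
        print(stat)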
Case study: speeding up a slow FastAPI endpoint
Symptom: p95 latency at 450 ms under modest load, CPU ~35%, spikes during traffic bursts.
Approach:
- Confirm CPU hotspots
py-spy record --rate 99 -d 60 --pid <PID> -o api_cpu.svg
Observation: wide frames in JSON encoding and response model validation.
- Check waiting time
py-spy record --idle --rate 99 -d 60 --pid <PID> -o api_idle.svg
perf sched record -p <PID> -- sleep 20 && perf sched timehist -p <PID> | cat
Observation: periodic waits on DB client pool and TLS writes.
- Fixes
- Switch to an orjson response class; pre-serialize where possible (a sketch follows the case study).
- Raise the DB pool size and add per-request deadlines; propagate cancellation.
- Avoid re-validating large models on hot paths (cache results, or use .model_dump(mode="json")).
- Verify and iterate
py-spy record --rate 99 -d 60 --pid <PID> -o after.svg
Result: p95 ~210 ms (−53%), CPU ~28%, reduced off‑CPU waits.
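For reference, the orjson response-class change from the fixes above is small. A minimal sketch, assuming FastAPI with orjson installed; the endpoint and payload are illustrative.
# app_orjson.py - FastAPI app serialized with orjson (endpoint and payload are illustrative)
from fastapi import FastAPI
from fastapi.responses import ORJSONResponse

# Every JSON response now goes through orjson unless an endpoint overrides the class.
app = FastAPI(default_response_class=ORJSONResponse)

@app.get("/items")
def list_items():
    # Returning plain dicts avoids re-validating large response models on this hot path.
    return [{"id": i, "name": f"item-{i}"} for i in range(100)]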
Production playbook (apply today)
- CPU spike, unknown cause: py-spy 60–120s at 99 Hz; fix wide bottoms. If native heavy, confirm with perf.
- Latency spike, low CPU: py-spy --idle, perf sched timehist, and off-CPU stacks; add backpressure and deadlines.
- Native/C-extension suspicion: py-spy --native and perf; check BLAS/crypto/compression hotspots.
- Multi-process servers: py-spy --subprocesses; compare worker mixes.
- Memory creep: tracemalloc snapshots; diff peaks.
- CI guardrails: small cProfile/yappi runs for critical paths; assert budgets (a sketch follows).
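One way to wire that last item into CI: a small test that profiles a critical path and fails when it blows its budget. The sketch below uses plain cProfile and pstats; the function, call-count, and time budgets are placeholders you would tune to your own baseline.
# test_perf_budget.py - CI guardrail: fail if a critical path exceeds its budget (placeholder numbers)
import cProfile
import pstats

TIME_BUDGET_S = 0.5     # total profiled time allowed for the critical path
CALL_BUDGET = 200_000   # total function calls allowed (catches accidental call blowups)

def critical_path() -> int:
    # Stand-in for the code path you actually care about.
    return sum(i * i for i in range(200_000))

def test_critical_path_budget() -> None:
    pr = cProfile.Profile()
    pr.enable()
    critical_path()
    pr.disable()

    stats = pstats.Stats(pr)
    assert stats.total_tt <= TIME_BUDGET_S, f"critical path took {stats.total_tt:.3f}s"
    assert stats.total_calls <= CALL_BUDGET, f"critical path made {stats.total_calls} calls"

if __name__ == "__main__":
    test_critical_path_budget()
    print("budget OK")
Time-based assertions can be noisy on shared CI runners; the call-count budget is usually the more stable guardrail.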
Verification checklist (before you declare victory)
- Profiles taken under representative load (steady, warmed caches)
- Overhead within budget (sampling single‑digits; tracing only in small scopes)
- Baseline vs after: same inputs, one change at a time
- CPU and off‑CPU both inspected when latency matters
- Results reproduced on another host/container
Key takeaways
- Use sampling for truth in prod; tracing for answers in tight scopes.
- Separate CPU from waiting—off‑CPU tells you why latency explodes.
- Keep profiler overhead smaller than the effect you’re measuring.
- Pair py-spy with perf/eBPF to see Python and native/kernel paths together.
- Add lightweight checks in CI to catch regressions early.
References
- Python docs: perf profiling HOWTO — docs.python.org
- PEP 669: Low Impact Monitoring for CPython — peps.python.org/pep-0669
- py-spy — github.com/benfred/py-spy
- speedscope viewer — www.speedscope.app
- Brendan Gregg: Flame Graphs — brendangregg.com/flamegraphs
- bpftrace reference — bpftrace.org
- BCC tools — github.com/iovisor/bcc
- Linux perf wiki — perf.wiki.kernel.org
- perf sched timehist — brendangregg.com blog
- yappi profiler — github.com/sumerc/yappi
- cProfile and pstats — docs.python.org
- tracemalloc — docs.python.org
- Parca (continuous profiling) — parca.dev
- Scalene profiler — github.com/plasma-umass/scalene
- orjson — github.com/ijl/orjson
- asyncio TaskGroup — docs.python.org