You don’t have to choose between correct results and a fast service to get them. The trick is matching the profiler to the question, keeping overhead low enough for reality to shine through, and reading the output like an engineer—not a fortune teller.
This guide builds a production‑first profiling playbook for Python using four pillars that complement each other:
- py-spy for safe, low‑overhead CPU profiling and flamegraphs
- perf for system‑grade sampling and native visibility
- eBPF (bcc/bpftrace, continuous profilers) for off‑CPU and whole‑system views
- Interpreter‑level counters (e.g., tracing hooks, PEP 669 monitoring) when you need precise, semantic signals
We’ll start with a taxonomy and overhead budgets you can apply immediately, then get hands‑on with quick wins you can run today.
Profiling taxonomy (choose the right lens)
- Sampling (py-spy, perf, eBPF): periodic stack snapshots. Low overhead, great for truth in production. You get where time is burned on‑CPU; with the right events, you can also see off‑CPU stacks.
- Tracing (cProfile, yappi): record every call/return. High detail, higher overhead. Use in dev, tight scopes, or test environments.
- Interpreter counters/monitoring (sys.setprofile, sys.settrace, PEP 669's sys.monitoring): semantic hooks from the VM. Powerful for targeted instrumentation; budget carefully.
Overhead budgets that keep results honest
Practical guidance:
- Aim for single‑digit percent overhead for truth in prod. Prefer sampling (py-spy, perf, eBPF) unless you need exact counts.
- Use tracing profilers on small, representative runs or in CI to validate hypotheses, not on live hot paths; a quick way to measure what tracing costs on your own workload is sketched after this list.
- For async and I/O-heavy services, add off-CPU visibility (eBPF, perf sched, block I/O) or you'll misattribute time to the wrong functions.
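Here is that overhead check: a minimal sketch that times a CPU-bound function bare and again under cProfile. The workload function and iteration count are placeholders; swap in a representative slice of your own code.
# overhead_check.py - rough measurement of what cProfile's tracing costs on a sample workload
import cProfile
import time

def workload() -> int:
    # Stand-in for a real hot path; replace with a representative slice of your code.
    return sum(i * i for i in range(1_000_000))

def timed(fn) -> float:
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

if __name__ == "__main__":
    bare = timed(workload)

    pr = cProfile.Profile()
    pr.enable()
    traced = timed(workload)
    pr.disable()

    print(f"bare:   {bare:.3f}s")
    print(f"traced: {traced:.3f}s ({(traced / bare - 1) * 100:.0f}% overhead)")
If the overhead you measure is larger than the effect you are hunting, fall back to a sampler.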
Quick wins you can run today
1) Live top of Python hotspots (zero code changes)
py-spy top --pid <PID>
What you get: a moving “top” of Python frames consuming CPU across threads. Use this to confirm suspected hot paths before you record anything heavier.
2) Record a flamegraph you can share
py-spy record --rate 99 --pid <PID> -o profile.svg
- --rate 99: sample at ~99 Hz (a friendly default for web services).
- Open profile.svg in a browser; it's interactive.
Include idle/waiting stacks when investigating stalls:
py-spy record --idle --rate 99 --pid <PID> -o profile.svg
This adds stacks for threads that are sleeping or blocked (e.g., I/O), helping distinguish CPU from wait time.
3) System view with perf (Python symbols included)
On recent Python builds, enable Python function symbolization:
python -X perf your_app.py &
PERFPID=$!
perf record -F 999 -g -p $PERFPID -- sleep 30
perf report -g
This pairs system‑grade sampling with Python frame names so you see both native and Python time on the same call graph.
4) Off‑CPU (waiting) time, at a glance
If CPU looks fine but latency is high, sample waits/off‑CPU stacks. Two lightweight options:
- perf scheduling timeline
perf sched record -p <PID> -- sleep 10
perf sched timehist -p <PID> | less
- eBPF (bcc/bpftrace) off‑CPU profiles (advanced). This captures stacks where threads are not on‑CPU, surfacing mutex, I/O, and backpressure bottlenecks.
How to read a flamegraph without fooling yourself
Flamegraphs show where time accumulates on stacks. Width encodes time; height encodes call depth.
- Look for wide frames near the bottom: those are foundational hotspots.
- Prefer deltas: take a baseline, make one change, take another profile.
- Separate CPU vs off‑CPU: a wide frame in a CPU‑only profile is a compute hotspot; in an idle‑inclusive profile it may be waiting.
Ground rules for fair runs
- Pin scenarios: reproduce a realistic workload (QPS/concurrency, dataset size, cache state). Report the environment (CPU, Python version, OS).
- Warm vs cold: make it explicit. For CPU hotspots, measure warmed; for throughput limits, do both.
- Hold inputs constant; change one thing at a time. Prefer scripts over hand-driven runs (an environment-capture helper is sketched after this list).
- Budget: keep profiler overhead below the “signal” you’re trying to measure.
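And here is the environment-capture helper referenced above: a minimal sketch that dumps interpreter and host details to a JSON file you can keep next to the profile. The output filename is just an example.
# env_snapshot.py - capture host/interpreter details to keep next to a profile (filename is an example)
import json
import os
import platform
import sys

def environment_snapshot() -> dict:
    return {
        "python": sys.version,
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
    }

if __name__ == "__main__":
    snapshot = environment_snapshot()
    with open("profile.env.json", "w") as f:
        json.dump(snapshot, f, indent=2)
    print(json.dumps(snapshot, indent=2))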
Minimal, targeted interpreter hooks (when you need exact answers)
Sometimes you need precise counts or lifecycle signals the sampler won’t give you. Use interpreter hooks sparingly and locally.
Counting calls to a specific function without touching callers:
# count_calls.py
import sys

TARGET = ("my_module", "expensive_fn")
counts: dict[str, int] = {"calls": 0}

def _prof(frame, event, arg):
    if event == "call":
        code = frame.f_code
        if code.co_name == TARGET[1] and code.co_filename.endswith(TARGET[0].replace(".", "/") + ".py"):
            counts["calls"] += 1
    return _prof

def main() -> None:
    sys.setprofile(_prof)
    try:
        import my_module  # your program starts here
        my_module.run()
    finally:
        sys.setprofile(None)
    print("expensive_fn calls:", counts["calls"])

if __name__ == "__main__":
    main()
Use this pattern narrowly (specific modules/functions, short windows). For broader, lower-overhead monitoring hooks, modern Python exposes sys.monitoring (PEP 669), which we'll use later for focused, opt-in signals.
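For a taste of what that looks like, here is a minimal sys.monitoring sketch (Python 3.12+) that counts function entries in one module. The module filter is a placeholder; a real deployment would scope and tear down the callback just as carefully as the setprofile version above.
# monitoring_count.py - minimal PEP 669 call counter (Python 3.12+); the module filter is a placeholder
import sys
from collections import Counter

TOOL_ID = sys.monitoring.PROFILER_ID
counts: Counter[str] = Counter()

def on_py_start(code, instruction_offset):
    # Count entries into functions defined in the module we care about.
    if "my_module" in code.co_filename:
        counts[code.co_qualname] += 1

def start() -> None:
    sys.monitoring.use_tool_id(TOOL_ID, "call-counter")
    sys.monitoring.register_callback(TOOL_ID, sys.monitoring.events.PY_START, on_py_start)
    sys.monitoring.set_events(TOOL_ID, sys.monitoring.events.PY_START)

def stop() -> None:
    sys.monitoring.set_events(TOOL_ID, sys.monitoring.events.NO_EVENTS)
    sys.monitoring.register_callback(TOOL_ID, sys.monitoring.events.PY_START, None)
    sys.monitoring.free_tool_id(TOOL_ID)

if __name__ == "__main__":
    start()
    try:
        sum(i * i for i in range(1000))  # stand-in for your workload
    finally:
        stop()
    for name, n in counts.most_common(10):
        print(n, name)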
What you should have now:
- A clear map for choosing sampling vs tracing vs interpreter signals
- A safe default (py-spy) to get actionable CPU profiles in minutes
- First tools to separate CPU vs waiting time (perf/eBPF, py-spy --idle)
Next, we’ll go deeper on py-spy (practical flags, accuracy gotchas, and reading real service profiles), then layer in perf/eBPF and interpreter counters for full coverage.
py-spy in practice: flags that matter, results you can trust
py-spy is the fastest path to trustworthy CPU profiles in production. Use these options to get the right picture the first time.
Install and basic usage
pip install py-spy
# Attach to a running process
py-spy top --pid <PID>
# Record an interactive flamegraph
py-spy record --rate 99 --pid <PID> -d 60 -o profile.svg
Common options:
- -r / --rate <Hz>: sampling frequency (e.g., 50–250). Higher = finer detail, more overhead.
- -d <sec>: duration. Prefer 30–120s under representative load.
- --idle: include sleeping/blocked stacks (great for latency investigations).
- --native: include native frames (C extensions). Slightly higher overhead.
- --subprocesses: follow children spawned by the target.
- --format speedscope -o out.speedscope.json: export for speedscope/trace visualization.
- py-spy record -o out.svg -- python app.py: launch and profile a new process.
Attach and permissions (don’t get blocked)
Linux ptrace restrictions (Yama ptrace_scope) can prevent attaching to non-child processes. Fixes:
# Temporarily relax ptrace for this session (revert to 1 or original after)
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
# or
sudo sysctl kernel.yama.ptrace_scope=0
# If running inside a container, you may need extra caps
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ...
macOS: attach only to non-system Python builds; you typically need to run py-spy with sudo. System Integrity Protection (SIP) blocks profiling of protected binaries.
Windows: run an elevated shell (Administrator) to attach to other users’ processes.
Sampling recipes you’ll actually use
- CPU hotspots during a load test
py-spy record --rate 99 -d 120 --pid <PID> -o cpu-hotspots.svg
- Latency investigation (see waiting time)
py-spy record --idle --rate 99 -d 60 --pid <PID> -o latency.svg
- Native work in C extensions (NumPy/Pandas/etc.)
py-spy record --native --rate 199 -d 45 --pid <PID> -o native.svg
- Shareable, zoomable timeline with speedscope
py-spy record --rate 99 -d 60 --pid <PID> --format speedscope -o profile.speedscope.json
# Open with speedscope (CLI or web):
# npx speedscope profile.speedscope.json
Resolution vs overhead: pick the right point
Guidance:
- Start at 99 Hz for web services; drop to 50 Hz for very busy hosts; spike to 199 Hz briefly for micro‑hotspots.
- Prefer longer duration with moderate rate over ultra‑high rates: time dominates variance.
Following subprocesses and workers
Many apps fork worker processes (gunicorn, Celery, job runners). Follow them:
py-spy record --subprocesses --rate 99 -d 90 --pid <MASTER_PID> -o farm.svg
You’ll get a combined flamegraph across the process tree; large bottoms suggest shared bottlenecks (imports, serialization, DB clients).
Containers and Kubernetes
- Ensure the container has the SYS_PTRACE capability and a permissive seccomp profile.
- If attaching from the host, match PID namespaces (nsenter -t <container-pid> -m -p py-spy ...).
- In Kubernetes, consider a debug pod or sidecar with the needed caps; avoid adding ptrace to public-facing pods by default.
Reading results: a quick interpretation loop
Tips:
- Don’t chase narrow stacks at the top; fix wide bases first.
- If CPU looks balanced but latency is high, re‑record with
--idle
. - If frames are all native, re‑record with
--native
(or useperf
to see kernel/extension cost).
Accuracy and caveats
- Short functions may be under‑sampled; rely on sufficient duration, not just rate.
- GIL: py-spy shows Python time across threads, but in pure Python only one thread runs bytecode at a time; don't mistake a busy multi-threaded top for parallel compute.
- Native extensions often release the GIL; wide native frames are real CPU time and can run in parallel (a small demo follows this list).
- Very high rates on very hot processes can increase perturbation—prefer 60–120s at 99 Hz over 5s at 999 Hz.
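To see the difference for yourself, the sketch below runs the same thread count over a pure-Python loop and over zlib.compress, which releases the GIL while compressing (the assumption here is that your native workload behaves similarly). Profile it with py-spy while it runs: the pure-Python phase tops out near one core, while the zlib phase spreads across cores.
# gil_demo.py - contrast GIL-bound Python loops with native work that releases the GIL
import threading
import time
import zlib

DATA = b"x" * 4_000_000  # a few MB so compression dominates each call

def python_loop() -> None:
    s = 0
    for i in range(5_000_000):
        s += i * i

def native_compress() -> None:
    for _ in range(20):
        zlib.compress(DATA, level=6)  # zlib releases the GIL while compressing

def run_threads(target, n=4) -> float:
    threads = [threading.Thread(target=target) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"pure Python x4 threads: {run_threads(python_loop):.2f}s")
    print(f"zlib        x4 threads: {run_threads(native_compress):.2f}s")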
Minimal checklist before you hit record
- Representative load is running (real queries/traffic, warmed caches)
- Duration ≥ 30s (prefer 60–120s)
- Rate 50–199 Hz depending on host load
- Add --idle if investigating latency; --native if C code is involved
- Note environment: CPU, Python, OS; keep it next to the SVG/JSON
perf, properly: system‑grade sampling with Python frames
perf gives you low‑overhead, kernel‑to‑user visibility and is great when native code (extensions, syscalls, kernel) matters.
Enable Python symbolization
Recent CPython versions expose a stack trampoline so perf can label Python frames:
# Option A: enable via interpreter flag
python -X perf your_app.py &
PERFPID=$!
# Option B: env var
PYTHONPERFSUPPORT=1 python your_app.py &
Now record and report:
perf record -F 999 -g -p $PERFPID -- sleep 30
perf report -g
If your Python build lacks frame pointers, start the app with the DWARF-based mode instead (python -X perf_jit, Python 3.13+) and post-process the jitdump it emits:
perf record -F 999 -g --call-graph dwarf -p $PERFPID -- sleep 20
perf inject -i perf.data --jit -o perf.jit.data
perf report -g -i perf.jit.data
Tip: Verify frame pointers if you build Python yourself:
python -m sysconfig | grep -E "no-omit-frame-pointer|mno-omit-leaf-frame-pointer"
Useful perf views
- CPU profile on a PID (balanced default):
perf record -F 999 -g -p <PID> -- sleep 30 && perf report -g
- Whole system hot code (find noisy neighbors):
sudo perf top -g
- Scheduler timelines (who’s waiting vs running):
perf sched record -p <PID> -- sleep 10
perf sched timehist -p <PID> | less
- I/O latency hints (block layer):
sudo perf record -e block:block_rq_issue -e block:block_rq_complete -a -- sleep 10
sudo perf script | less
Where perf shines
- C extensions and native libraries (NumPy, crypto, compression)
- Kernel/syscall time (epoll, disk, networking)
- Correlating Python with native hot spots in one view
Quick workflow
Notes:
- Prefer a sustained load and a 20–60s capture. Use -F 4999 briefly for very bursty issues.
- For containers, run perf in the host namespace or grant the pod capabilities (privileged, or CAP_PERFMON on newer kernels).
eBPF for off‑CPU and continuous visibility
eBPF lets you sample stacks in‑kernel with very low overhead and capture where threads wait (mutex, I/O, scheduler). Great for latency hunts and always‑on profiles.
Off‑CPU with bpftrace (one‑liner)
sudo bpftrace -e 'tracepoint:sched:sched_switch /pid == <PID>/ { @offcpu[ustack] = count(); } interval:s:30 { exit(); }'
This tallies stacks observed when the target is switched off-CPU and exits after 30 seconds. Render the output with a flamegraph tool that accepts folded stacks, or browse counts directly.
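If you want a flamegraph from that output, a short converter helps. This is a sketch that assumes bpftrace's default map printout (a "@name[" line, indented frame lines, then "]: count") and emits the folded format flamegraph tools expect (semicolon-joined frames, root first, then a space and the count); adjust the parsing if your bpftrace version prints differently.
# stacks_to_folded.py - convert bpftrace @map[ustack] output to folded stacks
# Assumes the default bpftrace printout: "@name[", indented frame lines, then "]: <count>".
import sys

def convert(lines):
    frames: list[str] = []
    in_stack = False
    for line in lines:
        stripped = line.strip()
        if stripped.endswith("["):            # e.g. "@offcpu["
            frames, in_stack = [], True
        elif in_stack and stripped.startswith("]:"):
            count = int(stripped[2:].strip())
            # bpftrace prints the leaf frame first; folded format wants root first.
            yield ";".join(reversed(frames)) + f" {count}"
            in_stack = False
        elif in_stack and stripped:
            frames.append(stripped.split("+")[0])  # drop "+offset" suffixes

if __name__ == "__main__":
    for folded in convert(sys.stdin):
        print(folded)
Pipe the folded output into flamegraph.pl from the FlameGraph repo to render an SVG.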
CPU sampling with bpftrace
sudo bpftrace -e 'profile:hz:99 /pid == <PID>/ { @[ustack] = count(); } interval:s:30 { exit(); }'
You'll see user stacks at 99 Hz without instrumenting the app. Accuracy improves with frame pointers; Python 3.12's perf trampoline helps perf, not generic ustack, so prefer py-spy or perf for Python-named frames.
bcc “offcputime” (aggregated)
sudo offcputime-bpfcc -f -p <PID> 30 > offcpu.stacks
# -f emits folded stacks; 30 is the capture duration in seconds. Convert to a flamegraph if you have the FlameGraph scripts.
Containers / K8s
- eBPF typically requires privileged pods or an agent DaemonSet.
- For continuous profiling, deploy an eBPF agent rather than ad‑hoc commands.
When to reach for eBPF
- Latency spikes with low CPU usage
- Lock contention, queue backpressure, or I/O stalls suspected
- Always‑on, low‑overhead environment profiling across services
Interpreting off‑CPU profiles
Practical guidance:
- Pair off‑CPU with CPU flamegraphs to avoid shifting blame.
- If stacks are mostly library waits (e.g., DB client), instrument retries and pool limits; apply backpressure.
Interpreter-level counters: precise signals, scoped carefully
Tracing and monitoring hooks give semantic clarity when sampling can’t. Use them surgically to answer specific questions, and keep the blast radius small.
Deterministic CPU profiling of a code region (cProfile)
# region_profile.py
import cProfile, pstats, io

def work():
    # ... your hot function(s) ...
    return sum(i*i for i in range(500_000))

def profile_region() -> None:
    pr = cProfile.Profile()
    pr.enable()
    try:
        work()
    finally:
        pr.disable()
    s = io.StringIO()
    pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(20)
    print(s.getvalue())

if __name__ == "__main__":
    profile_region()
Use this when you need exact call counts and cumulative times in a bounded scope (tests, CI, dev runs). Avoid wrapping whole services.
Per-thread CPU vs wall time (yappi)
# yappi_profile.py
import time, yappi

def cpu_heavy():
    s = 0
    for i in range(3_000_000):
        s += i*i
    return s

def io_heavy():
    time.sleep(0.2)

if __name__ == "__main__":
    yappi.set_clock_type("cpu")  # or "wall"
    yappi.start()
    cpu_heavy(); io_heavy()
    yappi.stop()
    stats = yappi.get_func_stats()
    stats.sort("tsub").print_all()
Use wall clock for latency investigations; CPU clock to isolate compute.
Lightweight counters with tracing hooks (targeted)
# trace_count.py
import sys
from collections import Counter

counts: Counter[str] = Counter()

def tracer(frame, event, arg):
    if event == "call":
        name = f"{frame.f_code.co_filename}:{frame.f_code.co_name}"
        if "/my_service/handlers/" in frame.f_code.co_filename:
            counts[name] += 1
    return tracer

def main() -> None:
    sys.setprofile(tracer)
    try:
        # run a single request path or test suite here
        pass
    finally:
        sys.setprofile(None)
    for name, n in counts.most_common(15):
        print(n, name)

if __name__ == "__main__":
    main()
Scope these to a single request/test; remove after you’ve answered the question.
Memory growth checks (bonus)
# mem_check.py
import tracemalloc, time
tracemalloc.start()
# run workload
time.sleep(2)
current, peak = tracemalloc.get_traced_memory()
print(f"current={current/1e6:.1f}MB peak={peak/1e6:.1f}MB")
Great for catching memory regressions in CI.
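When the question is "what grew?", diff two tracemalloc snapshots. A minimal sketch; the workload function is a placeholder for the code path you suspect of leaking.
# mem_diff.py - diff tracemalloc snapshots to see which lines allocated the growth
import tracemalloc

def workload() -> list[bytes]:
    # Stand-in for the code path suspected of leaking.
    return [b"x" * 10_000 for _ in range(1_000)]

if __name__ == "__main__":
    tracemalloc.start(25)                  # keep up to 25 frames per allocation
    before = tracemalloc.take_snapshot()
    retained = workload()                  # keep a reference so the growth stays visible
    after = tracemalloc.take_snapshot()

    for stat in after.compare_to(before, "lineno")[:10]:
        print(stat)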
Case study: speeding up a slow FastAPI endpoint
Symptom: p95 latency at 450 ms under modest load, CPU ~35%, spikes during traffic bursts.
Approach:
- Confirm CPU hotspots
py-spy record --rate 99 -d 60 --pid <PID> -o api_cpu.svg
Observation: wide frames in JSON encoding and response model validation.
- Check waiting time
py-spy record --idle --rate 99 -d 60 --pid <PID> -o api_idle.svg
perf sched record -p <PID> -- sleep 20 && perf sched timehist -p <PID> | cat
Observation: periodic waits on DB client pool and TLS writes.
- Fixes
- Switch to an orjson response class; pre-serialize where possible (a sketch follows the case study).
- Raise the DB pool size and add per-request deadlines; propagate cancellation.
- Avoid re-validating large models on hot paths (cache results, or use .model_dump(mode="json")).
- Verify and iterate
py-spy record --rate 99 -d 60 --pid <PID> -o after.svg
Result: p95 ~210 ms (−53%), CPU ~28%, reduced off‑CPU waits.
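For reference, the orjson response-class change from the fixes above is small. A minimal sketch, assuming FastAPI with orjson installed; the endpoint and payload are illustrative.
# app_orjson.py - FastAPI app serialized with orjson (endpoint and payload are illustrative)
from fastapi import FastAPI
from fastapi.responses import ORJSONResponse

# Every JSON response now goes through orjson unless an endpoint overrides the class.
app = FastAPI(default_response_class=ORJSONResponse)

@app.get("/items")
def list_items():
    # Returning plain dicts avoids re-validating large response models on this hot path.
    return [{"id": i, "name": f"item-{i}"} for i in range(100)]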
Production playbook (apply today)
- CPU spike, unknown cause: py-spy 60–120s at 99 Hz; fix wide bottoms. If native heavy, confirm with perf.
- Latency spike, low CPU: py-spy --idle, perf sched timehist, and off-CPU stacks; add backpressure and deadlines.
- Native/C-extension suspicion: py-spy --native and perf; check BLAS/crypto/compression hotspots.
- Multi-process servers: py-spy --subprocesses; compare worker mixes.
- Memory creep: tracemalloc snapshots; diff peaks.
- CI guardrails: small cProfile/yappi runs for critical paths; assert budgets (a sketch follows).
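One way to wire that last item into CI: a small test that profiles a critical path and fails when it blows its budget. The sketch below uses plain cProfile and pstats; the function, call-count, and time budgets are placeholders you would tune to your own baseline.
# test_perf_budget.py - CI guardrail: fail if a critical path exceeds its budget (placeholder numbers)
import cProfile
import pstats

TIME_BUDGET_S = 0.5     # total profiled time allowed for the critical path
CALL_BUDGET = 200_000   # total function calls allowed (catches accidental call blowups)

def critical_path() -> int:
    # Stand-in for the code path you actually care about.
    return sum(i * i for i in range(200_000))

def test_critical_path_budget() -> None:
    pr = cProfile.Profile()
    pr.enable()
    critical_path()
    pr.disable()

    stats = pstats.Stats(pr)
    assert stats.total_tt <= TIME_BUDGET_S, f"critical path took {stats.total_tt:.3f}s"
    assert stats.total_calls <= CALL_BUDGET, f"critical path made {stats.total_calls} calls"

if __name__ == "__main__":
    test_critical_path_budget()
    print("budget OK")
Time-based assertions can be noisy on shared CI runners; the call-count budget is usually the more stable guardrail.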
Verification checklist (before you declare victory)
- Profiles taken under representative load (steady, warmed caches)
- Overhead within budget (sampling single‑digits; tracing only in small scopes)
- Baseline vs after: same inputs, one change at a time
- CPU and off‑CPU both inspected when latency matters
- Results reproduced on another host/container
Key takeaways
- Use sampling for truth in prod; tracing for answers in tight scopes.
- Separate CPU from waiting—off‑CPU tells you why latency explodes.
- Keep profiler overhead smaller than the effect you’re measuring.
- Pair py-spy with perf/eBPF to see Python and native/kernel paths together.
- Add lightweight checks in CI to catch regressions early.
References
- Python docs: perf profiling HOWTO — docs.python.org
- PEP 669: Low Impact Monitoring for CPython — peps.python.org/pep-0669
- py-spy — github.com/benfred/py-spy
- speedscope viewer — www.speedscope.app
- Brendan Gregg: Flame Graphs — brendangregg.com/flamegraphs
- bpftrace reference — bpftrace.org
- BCC tools — github.com/iovisor/bcc
- Linux perf wiki — perf.wiki.kernel.org
- perf sched timehist — brendangregg.com blog
- yappi profiler — github.com/sumerc/yappi
- cProfile and pstats — docs.python.org
- tracemalloc — docs.python.org
- Parca (continuous profiling) — parca.dev
- Scalene profiler — github.com/plasma-umass/scalene
- orjson — github.com/ijl/orjson
- asyncio TaskGroup — docs.python.org