The Specializing Interpreter in CPython 3.11+: Why Your Code Got Faster

Published: October 5, 2020 · 11 min read

Updated: June 15, 2024

If you upgraded from Python 3.10 to 3.11+ and saw your pure‑Python code get noticeably faster, you’re not imagining it. CPython 3.11 introduced a specializing adaptive interpreter that watches how your code runs, then reshapes hot bytecode on the fly—installing tiny inline caches and switching generic operations to type‑specific fast paths. This first part explains the mental model and how to observe it, so you can write code that specializes well.

What changed in 3.11 (and why it matters)

Interpreters pay a tax for dynamism: attribute lookups walk dictionaries, binary ops branch on operand types, and global lookups bounce through multiple namespaces. CPython 3.11 reduces that tax by:

  • Adaptive bytecode: hot instructions mutate into “adaptive” forms that gather runtime type info, then into specialized forms optimized for the observed types.
  • Inline caches: tiny per‑instruction caches stash results like attribute offsets or dict version stamps to avoid repeated lookups.
  • Superinstructions and dispatch tweaks: common instruction sequences collapse, cutting dispatch overhead.

The result: typical pure‑Python workloads run ~1.1×–1.6× faster than on 3.10, with a ~1.25× average speedup on the pyperformance suite. You don’t need to change your code; you just need to avoid patterns that defeat specialization.

flowchart LR
    A[Generic bytecode<br/>e.g. LOAD_ATTR, BINARY_OP] --> B{Hot?}
    B -- no --> A
    B -- yes --> C[Quickening:<br/>replace with *ADAPTIVE* variants + attach caches]
    C --> D{Stable types<br/>and shapes?}
    D -- yes --> E[Specialize:<br/>type-specific fast path e.g. LOAD_ATTR_INSTANCE_VALUE, BINARY_OP_ADD_INT]
    D -- no/changes --> F[Deopt:<br/>fall back to adaptive or generic]
    E -->|Shape change| F
    F --> C

Quickening, specialization, and deoptimization

Think of each hot instruction as a tiny JIT‑like state machine:

  1. Quicken: swap the generic opcode for an *_ADAPTIVE form that counts hits/misses and stores a small inline cache next to the instruction.
  2. Specialize: if the site sees the same operand kinds repeatedly (monomorphic or few‑morphic), replace the adaptive form with a specialized opcode that hard‑codes the fast path.
  3. Deopt: if reality changes (different types, mutated shapes, invalidated dict versions), fall back to adaptive/generic and try again (a toy sketch of this lifecycle follows).
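
A toy Python model may make the lifecycle concrete. This is an illustration only, not CPython internals: the names (SiteCache, MISS_LIMIT) are invented, and the real machinery lives in C inside the interpreter loop.

# Toy model of one bytecode site's cache lifecycle (illustrative only)
class SiteCache:
    MISS_LIMIT = 5                  # hypothetical deopt threshold

    def __init__(self):
        self.guard_type = None      # operand type we specialized for
        self.misses = 0

    def execute(self, a, b):
        if self.guard_type is None:             # quicken: learn a type
            self.guard_type = type(a)
        if type(a) is self.guard_type is type(b):
            return a + b                        # "specialized" fast path
        self.misses += 1                        # guard failed
        if self.misses >= self.MISS_LIMIT:      # deopt: forget, re-learn
            self.guard_type = None
            self.misses = 0
        return a + b                            # generic slow path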

Inline caches in practice

The biggest wins show up where Python spends time:

  • LOAD_ATTR: caches the strategy to fetch an attribute for a stable instance layout (e.g., offset into a struct for __slots__/dataclass with slots, or a dict version check for objects with unchanged maps).
  • LOAD_GLOBAL/LOAD_NAME: caches globals/builtins lookups using dictionary version stamps.
  • BINARY_OP: specializes common arithmetic and concatenation when operand types stabilize (ints, floats, strings, tuples).
  • Calls: selects faster call paths when signatures are simple and targets are direct.

sequenceDiagram
    participant VM as Interpreter
    participant IC as Inline Cache (per-site)
    participant Obj as Object Shape
    VM->>IC: Execute LOAD_ATTR x
    alt Cache empty
        VM->>Obj: Resolve attribute (slow path)
        Obj-->>VM: Value + guard (shape/version)
        VM->>IC: Install value access strategy
    else Cache valid
        VM->>IC: Check guard (shape/version)
        IC-->>VM: Fast load (offset/index)
    end

See it: disassembling caches and specializations

Python 3.11’s dis can show inline caches (show_caches=True) and the quickened, specialized instructions (adaptive=True). Warm a function to let the interpreter adapt, then disassemble.

import dis
 
def add2(a, b):
    return a + b
 
# Before warmup: generic bytecode
print(dis.code_info(add2))
dis.dis(add2, show_caches=True)
 
# Warm up with int arguments so BINARY_OP specializes
for _ in range(2000):
    add2(1, 2)
 
print("\nAfter warmup:\n")
dis.dis(add2, show_caches=True, adaptive=True)

You’ll see BINARY_OP in the generic view; after warmup, the adaptive view shows a specialized form such as BINARY_OP_ADD_INT with its cache entries. On attribute access sites, look for LOAD_ATTR_ADAPTIVE specializing into something like LOAD_ATTR_INSTANCE_VALUE with one or more inline cache rows.

Tip: CPython builds configured with --enable-pystats can report specialization hit/miss statistics; for most users, dis.dis(..., show_caches=True, adaptive=True) is enough to confirm specialization.

A minimal, fair benchmark you can trust

Use pyperf to stabilize measurements (CPU pinning, warmups, statistics). Compare 3.10 vs 3.11+ or observe the effect of patterns that help/hurt specialization.

# bench/specialize/addition.py
import pyperf
 
def add_loop(n: int) -> int:
    s = 0
    for i in range(n):
        s = s + i  # specializes to INT fast path
    return s
 
runner = pyperf.Runner()
runner.bench_func("add_loop", add_loop, 1_000_000)

Run it on multiple versions, record environment (CPU, OS, Python build), and report median with confidence intervals. Expect the loop to benefit more on 3.11+ when the site stays monomorphic.
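
To see the pattern effect rather than the version effect, benchmark a monomorphic loop against one that mixes types at the same site. A sketch (file name, data sizes, and benchmark names are illustrative):

# bench/specialize/mono_vs_mixed.py
import pyperf

def accumulate(data):
    s = 0
    for x in data:
        s = s + x        # a single BINARY_OP site; its history decides
    return s

ints = list(range(100_000))
mixed = [float(x) if x % 2 else x for x in ints]

runner = pyperf.Runner()
# pyperf runs each benchmark in its own worker processes, so the two
# call histories should not contaminate each other's caches.
runner.bench_func("stable_ints", accumulate, ints)
runner.bench_func("mixed_types", accumulate, mixed)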

Patterns that specialize well (and ones that don’t)

  • Stable shapes win: prefer dataclass(slots=True) or classes with __slots__ for objects you access in hot loops. Fewer megamorphic sites → better LOAD_ATTR specialization.
  • Keep names stable: module‑level constants and functions avoid churning globals/builtins caches.
  • Avoid megamorphic call sites: repeatedly calling many unrelated callables from the same site defeats specialization. Split hot paths or move dispatch out of the loop.
  • Minimize cross‑type arithmetic at a single site: e.g., don’t mix ints, floats, and Decimals at the same + in a hot loop.
  • Don’t fight the cache: mutating __getattr__/__getattribute__ or swapping __dict__ layouts in the hot path forces deopts.
# Good: stable shape & attribute access
from dataclasses import dataclass
 
@dataclass(slots=True)
class Point:
    x: int
    y: int
 
def norm1(p: Point) -> int:
    return p.x * p.x + p.y * p.y  # LOAD_ATTR specializes
 
# Risky: megamorphic site
def apply_all(funcs, x):
    out = x
    for f in funcs:  # many unrelated call targets at one site
        out = f(out) # harder to specialize
    return out

What to remember

  • Specialization is local: each bytecode site adapts to its own history. Keep sites monomorphic or few‑morphic.
  • Guards protect correctness: when shapes or dict versions change, caches deopt and retry. Your code should avoid needless invalidations in hot paths.
  • You don’t need to rewrite Python as C: just write predictable, stable code in critical loops.

Under the hood: globals/builtins caching, attribute fast paths, and arithmetic specialization

This section goes deeper into the mechanics and how to spot when caches help you—or why they occasionally drop back to slow paths.

Globals and builtins: dictionary versioning in action

Global and builtins lookups are guarded by dictionary “versions”. As long as the dictionary hasn’t been modified since the cache was filled (any insert, delete, or rebind bumps the version; see PEP 509), the cache remains valid and skips repeated hashtable lookups.

flowchart TB
    subgraph Namespace Dict
        K1["globals dict\nversion = v"]
        K2["builtins dict\nversion = b"]
    end
    L[LOAD_GLOBAL site\n(cache: name→index, versions v,b)] -->|check v,b| H{Valid?}
    H -- yes --> F[Fast path: direct value fetch]
    H -- no --> S[Slow path: dict lookup]
    S --> U[Update cache with new versions]
    U --> F

Two practical implications:

  • Keep hot names at module scope and avoid rebinding them in the hot loop; rebinding invalidates the cache.
  • Prefer importing names directly (from math import sqrt) so a hot call is a single cached LOAD_GLOBAL instead of LOAD_GLOBAL plus LOAD_ATTR; binding to a local inside the function goes further, giving LOAD_FAST (the cheapest load of all).
# Good: direct import; each call is one cached LOAD_GLOBAL
from math import sqrt
 
def hypot(xs):
    return sum(sqrt(x) for x in xs)
 
# Slower: attribute access pays LOAD_GLOBAL + LOAD_ATTR on every call
import math
def hypot2(xs):
    return sum(math.sqrt(x) for x in xs)
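
If the name is still too hot, bind it to a local inside the function so the loop uses LOAD_FAST. A minimal variant (hypot3 is our name):

import math

def hypot3(xs):
    sqrt = math.sqrt      # pay the attribute lookup once
    total = 0.0
    for x in xs:
        total += sqrt(x)  # LOAD_FAST for sqrt on every iteration
    return total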

Attribute access: stable shapes specialize best

Attribute lookups get per‑site inline caches that encode “how” to fetch a value (e.g., offset for __slots__, index in a shared‑key dict, or a method descriptor path). If the object’s layout and MRO remain stable, lookups become a couple of guarded pointer arithmetic steps.

from dataclasses import dataclass
 
@dataclass(slots=True)
class User:
    id: int
    name: str
 
def greet(u: User) -> str:
    return u.name.upper()  # LOAD_ATTR specializes; then method call specializes
 
# Warmup helps specialization
for _ in range(5000):
    greet(User(1, "Ana"))

If you later attach attributes dynamically or swap __getattribute__, the guard fails and the site deopts until it re‑learns a stable pattern.
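
You can watch a site lose its specialization by perturbing instance layout mid-stream. A sketch, assuming a 3.11 build (Box and unbox are our names; exact opcode names in the output vary by version):

import dis

class Box:                 # ordinary class: instances share a key layout
    def __init__(self, v):
        self.v = v

def unbox(b):
    return b.v

for _ in range(2000):
    unbox(Box(1))
dis.dis(unbox, show_caches=True, adaptive=True)  # expect a specialized LOAD_ATTR

b = Box(1)
b.extra = object()         # new attribute: this instance's layout diverges
for _ in range(10):
    unbox(b)
dis.dis(unbox, show_caches=True, adaptive=True)  # the site may deopt/re-adapt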

Arithmetic (BINARY_OP): common specializations

Sites stabilize quickly for:

  • Integer add/sub/mul, comparisons
  • Float arithmetic
  • String/bytes concatenation and repetition
  • Tuple/list concatenation in simple forms

You can observe specialization and deopt by alternating types:

import dis
 
def agg(a, b):
    return a + b
 
for _ in range(2000):
    agg(1, 2)  # int-int
dis.dis(agg, show_caches=True, adaptive=True)
 
# Now perturb types to force deopt/respecialize
for _ in range(10):
    agg(1.0, 2.0)  # float-float
for _ in range(2000):
    agg(1, 2)
print("\nAfter perturbation:\n")
dis.dis(agg, show_caches=True, adaptive=True)

Observability checklists

  • Use dis.dis(func, show_caches=True, adaptive=True) after a warmup to confirm specializations at call sites, attribute loads, and arithmetic.
  • Microbench with pyperf; always include warmups so caches and specializations settle before timing.

Small, actionable patterns

  • Bind hot globals to locals before loops (local = module.symbol) to convert LOAD_GLOBAL → LOAD_FAST.
  • Keep object layouts stable in hot code paths (__slots__, dataclasses with slots=True).
  • Avoid megamorphic dispatch at a single site; split mixed‑type work across distinct sites when practical.

Instruction dispatch and superinstructions (why some loops feel lighter)

Beyond caches, 3.11 reduces interpreter dispatch overhead. Some frequent instruction sequences are merged (“superinstructions”) so the VM does fewer dispatches per iteration.

flowchart LR
    I1[LOAD_FAST] --> I2[LOAD_FAST]
    I2 --> I3[BINARY_OP]
    I3 --> I4[STORE_FAST]
    subgraph 3.11
        S1[LOAD_FAST__LOAD_FAST] --> S2[BINARY_OP_ADD_INT]
        S2 --> S3[STORE_FAST]
    end

You can’t force a superinstruction directly, but you can help by keeping operations simple and type‑stable so specialization chooses the tightest variants.
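
You can look for fused forms in the adaptive disassembly after warmup. A sketch, assuming a 3.11 build (mul is our name; whether a superinstruction appears, and its exact name, varies by version):

import dis

def mul(a, b):
    return a * b

for _ in range(2000):       # let the code object warm up and quicken
    mul(3, 4)

# On 3.11 you may see a fused form such as LOAD_FAST__LOAD_FAST
# in place of two separate LOAD_FAST instructions.
dis.dis(mul, adaptive=True)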

Case study: speeding up an attribute‑heavy request path

Imagine a request handler that decodes JSON into dicts, then walks nested structures and computes a score.

# Before: dynamic dicts everywhere
def score(payload: dict) -> int:
    s = 0
    for item in payload["items"]:
        s += item["qty"] * item["price_cents"]
    return s
 
# After: stabilize shapes and names
from dataclasses import dataclass
 
@dataclass(slots=True)
class Item:
    qty: int
    price_cents: int
 
def score_fast(items: list[Item]) -> int:
    s = 0
    for it in items:
        s += it.qty * it.price_cents  # LOAD_ATTR + INT ops specialize
    return s

In practice: parse once into Item objects (or TypedDict → dataclass transition) at the boundary, keep the hot loop boring and stable. Expect attribute loads and arithmetic to stay on fast paths.
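
The boundary conversion itself can be a single comprehension. A sketch reusing Item from above (parse_items and decoded_json are our names):

def parse_items(payload: dict) -> list[Item]:
    # one pass of dynamic dict access, at the edge only
    return [Item(d["qty"], d["price_cents"]) for d in payload["items"]]

# usage: total = score_fast(parse_items(decoded_json))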

Case study: call sites and keyword churn

def transform(a: int, b: int, scale: int = 1) -> int:
    return (a + b) * scale
 
def hot_loop(xs):
    t = transform  # bind target
    s = 0
    for x in xs:
        s += t(x, x+1)          # specializes
    return s
 
def hot_loop_kw(xs):
    t = transform
    s = 0
    for x in xs:
        s += t(a=x, b=x+1)      # kwargs path; slower dispatch
    return s

If you need a constant keyword, pre‑shape with functools.partial outside the loop so the inner site stays positional.
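
For example, a sketch reusing transform from above (hot_loop_scaled is our name):

from functools import partial

def hot_loop_scaled(xs, scale):
    t = partial(transform, scale=scale)  # keyword bound once, outside the loop
    s = 0
    for x in xs:
        s += t(x, x + 1)                 # inner call site stays positional
    return s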

Production measurement playbook

  • Always warm up: run representative traffic or synthetic warmups so caches/specializations settle.
  • Record environment: Python version, build flags, CPU governor, container limits.
  • Compare shapes: disassemble before/after warmups to confirm specialization (dis.dis(func, show_caches=True, adaptive=True)); a snapshot helper is sketched after the diagram below.
  • Benchmark tiers: micro (pyperf), macro (end‑to‑end), and steady‑state latency.
  • Watch for deopts: performance cliffs often coincide with schema/name churn or kwargs introduced into hot calls.

sequenceDiagram
    participant CI as CI Job
    participant S as Service
    participant Bench as Bench Harness
    CI->>S: Deploy 3.10 → 3.11
    Bench->>S: Warmup traffic
    Bench->>S: Measure SR, p50/p95
    S-->>CI: Export disassembly snapshots
    CI-->>CI: Compare specialized sites
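
dis writes to any file-like object, which makes the “export disassembly snapshots” step easy to script. A minimal helper (dis_snapshot is our name):

import dis
import io

def dis_snapshot(func) -> str:
    # Render the adaptive disassembly to a string for diffing in CI.
    buf = io.StringIO()
    dis.dis(func, file=buf, show_caches=True, adaptive=True)
    return buf.getvalue()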

Troubleshooting matrix (symptom → likely cause → fix)

  • Hot loop shows no specialization in dis:
    • Likely cause: insufficient warmup or highly variable operand types.
    • Fix: run warmups; split mixed‑type work; bind globals to locals.
  • Attribute loads keep deopting:
    • Likely cause: dynamic attributes added, __getattr__/__getattribute__ interference, or a changing __dict__ layout.
    • Fix: use __slots__/dataclasses with slots=True; avoid dynamic mutation in hot code.
  • Calls don’t speed up:
    • Likely cause: *args/**kwargs in the hot path, or the call target changes frequently.
    • Fix: prefer positional calls; pre‑bind with partial; keep a stable function object at the site.
  • Regressions after a refactor:
    • Likely cause: moved constants/functions causing global/builtins cache invalidations.
    • Fix: re‑bind hot names to locals inside functions.

Design patterns that cooperate with specialization

  • Boring data at the boundary: convert unstructured payloads into stable, typed containers before hot loops.
  • Hoist dynamic behavior out of the loop: compute once, reuse inside.
  • Separate polymorphism: route different types through separate inner functions so each site specializes independently (see the sketch after this list).
  • Keep exception paths cold: exceptions are fast in 3.11, but raising in the hot path still defeats steady specialization.
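
“Separate polymorphism” in practice: give each type its own helper so each inner site stays monomorphic. A sketch (function names are ours):

def _sum_ints(xs) -> int:
    s = 0
    for x in xs:
        s += x        # int-only site: specializes cleanly
    return s

def _sum_floats(xs) -> float:
    s = 0.0
    for x in xs:
        s += x        # float-only site: specializes cleanly
    return s

def grand_total(ints, floats):
    return _sum_ints(ints) + _sum_floats(floats)  # dispatch once, outside the loops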

Where specialization won’t save you (and what to do instead)

Specialization accelerates Python execution, but some bottlenecks live elsewhere:

  • I/O bound paths: socket, disk, DB waits dominate; use appropriate concurrency (async, threads, or processes) and backpressure.
  • C extension heavy code: the interpreter is idle while native code runs; upgrades help less here. Optimize at the extension boundary and data movement.
  • Highly dynamic patterns: heavy use of *args/**kwargs, __getattr__/__getattribute__ indirection, or frequent mutation of module/class dictionaries cause repeated deopts.
  • Polymorphic cold code: if sites don’t get hot or never stabilize, you’ll see little benefit; focus on shaping the few hot loops instead.

Guidance: profile first, then make hot paths “boring” (stable types, stable shapes, stable names). Keep dynamism at the edges.
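
Profiling first needs nothing beyond the stdlib. A minimal sketch (workload stands in for your real code):

import cProfile
import pstats

def workload():
    return sum(i * i for i in range(1_000_000))  # stand-in for your hot path

cProfile.run("workload()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time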

Looking ahead: 3.12–3.13 refinements

Subsequent Python releases continue iterating on the specializing interpreter: broader opcode coverage, better caches for common patterns, and general dispatch improvements. Expect incremental wins from simply upgrading, and re‑validate your hot paths each release with the same dis + pyperf routine you used here.

timeline
    title Specialization evolution (high level)
    2021 : PEP 659 proposed
    2022 : Python 3.11 ships the specializing interpreter
    2023 : 3.12 expands/streamlines specializations
    2024–2025 : Ongoing refinements and coverage

Verify before you ship: a reproducible checklist

  • Document environment: CPU, OS, Python version/build flags, container limits.
  • Warmup scripts mirror production data shapes and sizes.
  • Capture pre/post disassembly with show_caches=True and adaptive=True for critical functions.
  • Track micro (pyperf) and macro (endpoint) results; include tails (p95/p99).
  • Guardrails in CI: prevent kwargs or dynamic attr injection from creeping into hot loops.
  • Re‑run on new Python releases; stash results alongside the code.

Closing thoughts

Specialization is a pragmatic, transparent speedup: keep your hot code predictable, and CPython will meet you halfway. Make shapes and names stable, simplify call sites, and confirm with dis + pyperf. Most teams get solid wins by applying these small, mechanical changes to a handful of loops.


References

  • PEP 659 — Specializing Adaptive Interpreter. https://peps.python.org/pep-0659/
  • What’s New In Python 3.11. https://docs.python.org/3/whatsnew/3.11.html
  • What’s New In Python 3.12 (performance notes). https://docs.python.org/3/whatsnew/3.12.html
  • What’s New In Python 3.13 (performance notes). https://docs.python.org/3/whatsnew/3.13.html
  • dis — Disassembler for Python bytecode (3.11+, show_caches). https://docs.python.org/3/library/dis.html
  • Python Performance (pyperformance). https://github.com/python/pyperformance
  • Faster CPython ideas (project notes). https://github.com/faster-cpython/ideas
  • CPython source — specialization code paths. https://github.com/python/cpython/blob/3.11/Python/specialize.c
  • CPython source — bytecode/opcode definitions. https://github.com/python/cpython/tree/3.11/Python
  • Python 3.11 release announcement. https://blog.python.org/2022/10/python-3110-is-now-available.html
  • TestDriven.io — Python 3.11 performance overview. https://testdriven.io/blog/python311/
  • Phoronix — Python 3.11 performance measurements. https://www.phoronix.com/review/python-311-performance
  • Andy Pearce — 3.11 performance improvements. https://www.andy-pearce.com/blog/posts/2022/Dec/whats-new-in-python-311-performance-improvements/
  • Playful Python — Specialising Adaptive Interpreter. https://www.playfulpython.com/python-3-11-specialising-adaptive-interpreter/
  • Real Python — Python 3.11 new features. https://realpython.com/python311-new-features/
  • Python Developer Guide — internals/bytecode. https://devguide.python.org/internals/bytecode/
  • Mark Shannon — Faster CPython updates (discussion). https://discuss.python.org/t/faster-cpython-project-updates/13121
  • dict versioning background (PEP 509). https://peps.python.org/pep-0509/
  • Microsoft DevBlog — Faster CPython collaboration. https://devblogs.microsoft.com/python/faster-cpython/