Python lets you ship ideas quickly. But when hot paths turn CPU‑bound, you need native speed without turning your codebase into a build‑system museum. This guide shows you the pragmatic ways to cross the boundary—Cython, cffi, HPy, and the raw C API—and how to choose between them with a bias for safety, packaging sanity, and maintainable performance.
We’ll set the mental model, give you a minimal C extension you can build today, and establish packaging/ABI ground rules so distribution doesn’t become the bottleneck.
## Who this is for
- Backend, data, and infra engineers who need predictable speedups with Python in production.
- Teams owning performance‑critical kernels (hashing, parsing, transforms) or binding existing C/C++/Rust libs.
- People who want portable wheels and a clean rollback story, not bespoke CI fires.
## Key takeaways
- Pick the path that matches your constraints, not hype: Cython for Python‑first kernels, cffi for wrapping existing C libraries, HPy for future‑proof portability and debug tooling, raw C API when you need ultimate control.
- Understand packaging and ABI early: abi3, Py_LIMITED_API, manylinux/musllinux, and auditwheel decide your release friction.
- Release the GIL in native hot loops and keep Python orchestration cheap. Design for zero‑copy boundaries via the buffer protocol where possible.
## What “fast” and “safe” actually mean here
- Fast: move Python out of inner loops, minimize crossings, batch work, use contiguous memory, and exploit the CPU (SIMD, cache‑friendly layouts) from native code.
- Safe: clear ownership, correct reference counting, no accidental copies, explicit GIL release around long native work, reproducible builds, and wheels that don’t break consumers.
## Packaging and ABI mental model (don’t skip this)
The performance win is worthless if distribution is painful. Anchor on four ideas:
- abi3 + Py_LIMITED_API: build a single wheel that works across Python 3.x minor versions on the same platform by targeting the stable C API subset. Ideal when you don’t need bleeding‑edge CPython internals.
- manylinux/musllinux: portable Linux wheels with audited dependencies. Use `auditwheel` to repair wheels and vendor shared libs appropriately.
- Universal vs CPython‑specific APIs: HPy’s Universal ABI aims to run across CPython, PyPy, and GraalPy; classic `#include <Python.h>` ties you to CPython’s ABI unless you restrict yourself to the stable subset.
- Build backends: prefer modern backends (`scikit-build-core`, `setuptools` with `pyproject.toml`, or `meson-python`). Keep builds declarative and cacheable in CI.
Minimal `pyproject.toml` variants you can adopt:
```toml
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "mincext"
version = "0.1.0"
requires-python = ">=3.8"
```

```toml
# scikit-build-core (great for C/C++/Fortran or when CMake is already in play)
[build-system]
requires = ["scikit-build-core>=0.7", "pybind11<3; python_version<'3.13'"]
build-backend = "scikit_build_core.build"

[project]
name = "mincext"
version = "0.1.0"
requires-python = ">=3.8"
```
With setuptools, define `Py_LIMITED_API` in your extension’s macros and tag the wheel via `bdist_wheel`’s `py_limited_api` option (e.g. `cp38`) to emit abi3 wheels. With scikit‑build‑core, pass the equivalent definitions in CMake or tool config.
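A sketch of the scikit‑build‑core side, assuming its `wheel.py-api` setting (check the docs for your scikit‑build‑core version):

```toml
# pyproject.toml (excerpt) — hedged sketch, not a drop-in config
[tool.scikit-build]
wheel.py-api = "cp38"  # tag wheels as abi3 for CPython 3.8+
# Your CMake code must still compile with Py_LIMITED_API set to the matching version.
```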
## A minimal C extension you can build and wheel today
This is the smallest useful CPython extension showing argument parsing, error handling, and return conversion. Keep it boring and explicit.
```c
// src/mincext.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>

// add(a: int, b: int) -> int
static PyObject *minc_add(PyObject *self, PyObject *args) {
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b)) {
        return NULL; // TypeError already set by PyArg_ParseTuple
    }
    long sum = a + b;
    return PyLong_FromLong(sum);
}

static PyMethodDef MincextMethods[] = {
    {"add", minc_add, METH_VARARGS, "Add two integers."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef mincextmodule = {
    PyModuleDef_HEAD_INIT,
    "mincext",                      // m_name
    "Example minimal C extension",  // m_doc
    -1,                             // m_size
    MincextMethods                  // m_methods
};

PyMODINIT_FUNC PyInit_mincext(void) { return PyModule_Create(&mincextmodule); }
```
Set up `setuptools` to compile it:
```python
# setup.py (simple, works with the pyproject build-system block above)
from setuptools import setup, Extension

ext = Extension(
    "mincext",
    sources=["src/mincext.c"],
    # Define Py_LIMITED_API to target abi3 (optional, restricts API surface)
    define_macros=[("Py_LIMITED_API", "0x03080000")],  # cp38+ stable ABI
)

setup(name="mincext", version="0.1.0", ext_modules=[ext])
```
Build and import:
```bash
python -m build   # or: python setup.py bdist_wheel
python -c "import mincext; print(mincext.add(2, 3))"
```
Linux portability tip: run `auditwheel show dist/*.whl` and, if needed, `auditwheel repair` to produce `manylinux` wheels.
### Releasing the GIL in native sections
Long‑running native code should release the GIL so other Python threads can run. Use these macros around code that does not touch Python objects:
```c
#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *minc_sum_array(PyObject *self, PyObject *args) {
    const unsigned char *buf;
    Py_ssize_t len;
    if (!PyArg_ParseTuple(args, "y#", &buf, &len)) return NULL; // y# borrows a read-only byte view
    unsigned long long acc = 0;
    Py_BEGIN_ALLOW_THREADS
    for (Py_ssize_t i = 0; i < len; ++i) acc += buf[i];
    Py_END_ALLOW_THREADS
    return PyLong_FromUnsignedLongLong(acc);
}
```
Rules of thumb:
- Only release the GIL when touching raw memory or external I/O; reacquire it before interacting with Python objects.
- Prefer chunked loops and clear cancellation points for responsiveness.
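For chunked loops with a cancellation point, one pattern (a sketch building on `minc_sum_array` above; the chunk size and function name are illustrative) briefly reacquires the GIL between slices so `KeyboardInterrupt` can propagate via `PyErr_CheckSignals`:

```c
// Hedged sketch: process a large buffer in chunks, checking for signals between slices.
#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *minc_sum_chunked(PyObject *self, PyObject *args) {
    const unsigned char *buf;
    Py_ssize_t len;
    if (!PyArg_ParseTuple(args, "y#", &buf, &len)) return NULL;

    unsigned long long acc = 0;
    int interrupted = 0;
    const Py_ssize_t CHUNK = 1 << 20; // 1 MiB per slice between cancellation checks

    Py_BEGIN_ALLOW_THREADS
    for (Py_ssize_t start = 0; start < len && !interrupted; start += CHUNK) {
        Py_ssize_t end = (start + CHUNK < len) ? start + CHUNK : len;
        for (Py_ssize_t i = start; i < end; ++i) acc += buf[i];

        Py_BLOCK_THREADS    // reacquire the GIL briefly...
        if (PyErr_CheckSignals() < 0) interrupted = 1; // ...so Ctrl-C can raise
        Py_UNBLOCK_THREADS  // release it again for the next chunk
    }
    Py_END_ALLOW_THREADS

    if (interrupted) return NULL; // exception already set by PyErr_CheckSignals
    return PyLong_FromUnsignedLongLong(acc);
}
```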
### Zero‑copy boundaries via the buffer protocol
Avoid extra copies by accepting objects that expose PEP 3118 buffers (e.g., `bytes`, `bytearray`, `memoryview`, NumPy arrays). Parse with `y*`/`y#` or obtain a `Py_buffer` and validate its layout.
```c
static PyObject *minc_scale_inplace(PyObject *self, PyObject *args) {
    PyObject *obj;
    int factor;
    if (!PyArg_ParseTuple(args, "Oi", &obj, &factor)) return NULL;

    Py_buffer view;
    // Ask for a writable, C-contiguous buffer with shape information filled in
    if (PyObject_GetBuffer(obj, &view, PyBUF_WRITABLE | PyBUF_C_CONTIGUOUS) < 0) return NULL;
    // Require 1D contiguous writable bytes
    if (view.ndim != 1 || view.itemsize != 1 || view.buf == NULL) {
        PyBuffer_Release(&view);
        PyErr_SetString(PyExc_ValueError, "need a writable 1D byte buffer");
        return NULL;
    }
    Py_BEGIN_ALLOW_THREADS
    for (Py_ssize_t i = 0; i < view.len; ++i)
        ((unsigned char *)view.buf)[i] *= (unsigned char)factor;
    Py_END_ALLOW_THREADS
    PyBuffer_Release(&view);
    Py_RETURN_NONE;
}
```
## When to reach for each tool (quick heuristics you can actually use)
- Cython: You own the algorithm and want incremental wins by typing hot loops. Great developer ergonomics; release the GIL in `with nogil` blocks; typed memoryviews map to buffers efficiently.
- cffi: You already have a C library and want a thin, Pythonic binding with minimal C glue. Use API mode for compiled bindings; ABI mode for late‑bound shared libs.
- HPy: You want portability across interpreters and a safer, modern API with debug tooling. Target the Universal ABI when feasible.
- Raw C API: You need exact control (custom types, vectorcall, fine‑tuned parsing) and accept CPython‑specific maintenance. Consider `Py_LIMITED_API` if you can live within the stable subset.
## Bench and safety checklist (apply before scaling up)
- Verify zero‑copy: assert the object exposes the expected buffer; measure copies with counters or memory bandwidth.
- Count crossings: batch work; prefer fewer, larger calls over many tiny ones (see the sketch after this list).
- Concurrency: release the GIL only in pure native sections; ensure thread‑safe access to any globals.
- Packaging: produce abi3/manylinux wheels where possible; run `auditwheel`/`delvewheel` in CI.
- Observability: record bytes processed, calls, error paths, and time per call. Fail closed on layout mismatches.
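To make the crossing cost concrete, here is a small, self-contained measurement (a sketch using `zlib.crc32` from the standard library as a stand-in kernel; your own extension will show the same shape of result):

```python
# Hedged sketch: many tiny boundary crossings vs. one batched call.
import timeit
import zlib

data = bytes(range(256)) * 4096  # ~1 MiB of input

def per_byte():
    # One Python->C crossing per byte: dominated by call overhead.
    crc = 0
    for b in data:
        crc = zlib.crc32(bytes([b]), crc)
    return crc

def batched():
    # A single crossing over the whole buffer.
    return zlib.crc32(data)

print("per-byte:", timeit.timeit(per_byte, number=1))
print("batched: ", timeit.timeit(batched, number=1))
```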
In the next sections, we’ll go deeper on Cython (typed memoryviews, `with nogil`, and layout contracts), cffi (ABI vs API modes and performance gotchas), HPy (handles, debug mode, and universal wheels), and round it out with a compact, production‑grade native module template covering vectorcall and custom types.
## Cython for Python‑first kernels (the sharp, productive path)
If you already own the hot loop in Python, Cython lets you keep Python syntax while compiling to C. The big wins come from: typed variables, typed memoryviews (zero‑copy views over buffers), avoiding Python object boxing in loops, and releasing the GIL around pure native work.
### Typed memoryviews: zero‑copy access to array data
Use memoryviews to read/write contiguous data without copying. The `::1` stride declaration asserts C‑contiguity, enabling the fastest pointer arithmetic.
```cython
# cykernels.pyx
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def sum_bytes(const unsigned char[::1] data):
    cdef Py_ssize_t i, n = data.shape[0]
    cdef unsigned long long acc = 0
    with cython.nogil:
        for i in range(n):
            acc += data[i]
    return acc

@cython.boundscheck(False)
@cython.wraparound(False)
def scale_inplace(unsigned char[::1] data, int factor) -> None:
    cdef Py_ssize_t i, n = data.shape[0]
    with cython.nogil:
        for i in range(n):
            data[i] = <unsigned char>((<int>data[i]) * factor)
```
Usage from Python (memoryviews accept anything exposing a buffer: `bytes`, `bytearray`, `memoryview`, NumPy arrays, Arrow buffers, etc.; in‑place functions need a writable buffer):
```python
import cykernels, numpy as np

arr = np.arange(10, dtype=np.uint8)
cykernels.scale_inplace(arr, 2)
print(arr.sum(), cykernels.sum_bytes(arr))
```
Notes:
- Disable bounds/wraparound checks in hot loops; keep them on in tests if needed.
- `with cython.nogil:` mirrors C’s `Py_BEGIN_ALLOW_THREADS`/`Py_END_ALLOW_THREADS`.
### Exposing C libraries via `.pxd` (no glue copies)
Declare external functions and structs once in a `.pxd`, then call them from `.pyx` with zero marshaling beyond pointer/primitive passing.
```cython
# fastops.pxd
from libc.stddef cimport size_t

cdef extern from "fastops.h":
    int saxpy(float *dst, const float *x, const float *y, float a, size_t n)
```

```cython
# fastops.pyx
cimport cython
from libc.stddef cimport size_t
from fastops cimport saxpy

@cython.boundscheck(False)
@cython.wraparound(False)
def saxpy_mv(float[::1] dst, const float[::1] x, const float[::1] y, float a) -> int:
    cdef Py_ssize_t n = dst.shape[0]
    cdef int rc
    if x.shape[0] != n or y.shape[0] != n:
        raise ValueError("length mismatch")
    with cython.nogil:
        rc = saxpy(&dst[0], &x[0], &y[0], a, <size_t>n)
    return rc
```
Build hint: add your C library and include paths in the extension’s `libraries`, `library_dirs`, and `include_dirs`. With `setuptools`, you can pass them via `Extension(...)` or in `setup.cfg`.
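A sketch of what that can look like (the `fastops` library name and the paths are placeholders for your environment):

```python
# setup.py (excerpt) — hedged sketch; adjust names and paths to your library
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "fastops",
    sources=["fastops.pyx"],
    libraries=["fastops"],                  # links -lfastops
    library_dirs=["/opt/fastops/lib"],      # placeholder path
    include_dirs=["/opt/fastops/include"],  # placeholder path
)

setup(ext_modules=cythonize([ext], language_level=3))
```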
### Extension types (`cdef class`) for low‑overhead objects
When you need stateful, hot objects (iterators, codecs), `cdef class` gives you C‑layout objects with Python interop.
```cython
# codec.pyx
cimport cython

cdef class XorCodec:
    cdef unsigned char key

    def __cinit__(self, int key):
        self.key = <unsigned char>key

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def apply_inplace(self, unsigned char[::1] buf) -> None:
        cdef Py_ssize_t i, n = buf.shape[0]
        with cython.nogil:
            for i in range(n):
                buf[i] ^= self.key
```
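Quick usage sketch (assuming the file compiles to a module named `codec`):

```python
import codec

data = bytearray(b"secret payload")
xc = codec.XorCodec(0x2A)
xc.apply_inplace(data)  # obfuscate in place
xc.apply_inplace(data)  # XOR again restores the original
print(data)             # bytearray(b'secret payload')
```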
Guidance:
- Methods that touch Python objects require the GIL; keep hot paths in `nogil` blocks touching only raw memory.
- Prefer memoryviews over the `np.ndarray` C API; it avoids an extra compile‑time dependency.
### Parallel loops with `prange` (opt‑in)
Use `cython.parallel.prange` to split loops across threads. It requires compiling with OpenMP (e.g., `-fopenmp` on GCC/Clang, `/openmp` on MSVC) and the matching link flags.
```cython
# parsum.pyx
from cython.parallel import prange
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def sum_u32(const unsigned int[::1] a):
    cdef Py_ssize_t i, n = a.shape[0]
    cdef unsigned long long acc = 0
    for i in prange(n, schedule='static', nogil=True):
        acc += a[i]  # in-place += is inferred as a parallel reduction
    return acc
```
Add platform‑specific OpenMP flags in your build config; fall back to single‑thread if flags aren’t available. Measure: not all kernels benefit from parallelism due to memory bandwidth.
### Error handling patterns that don’t surprise callers
- For functions returning C integers, annotate error returns (`cdef int fn(...) except -1`) so Python exceptions are raised when you return `-1`.
- For pointer returns, use `except NULL`.
- Prefer raising Python exceptions on contract violations (shape/type), not sentinel values.
```cython
cdef int div_floor(int a, int b) except -1:
    if b == 0:
        raise ZeroDivisionError()
    return a // b
```
### Compiler directives that matter in hot paths
- `@cython.boundscheck(False)`, `@cython.wraparound(False)`: remove per‑access checks.
- `@cython.cdivision(True)`: use C integer division semantics (no Python exceptions).
- `@cython.infer_types(True)`: let Cython infer more C types (validate in the generated C).
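These can also be set file‑wide with a `# cython: boundscheck=False, wraparound=False` header comment at the top of the `.pyx`, or globally at build time. A sketch of the build‑time form (directive names as above, module name from the earlier example):

```python
# setup.py (excerpt) — hedged sketch of module-wide directives
from Cython.Build import cythonize

ext_modules = cythonize(
    ["cykernels.pyx"],
    language_level=3,
    compiler_directives={
        "boundscheck": False,
        "wraparound": False,
        "cdivision": True,
    },
)
```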
### Minimal build configuration (setuptools)
Keep builds declarative via `pyproject.toml`, then cythonize in `setup.py` or `setup.cfg`.
```toml
[build-system]
requires = ["setuptools>=68", "wheel", "cython>=3.0"]
build-backend = "setuptools.build_meta"
```
```python
# setup.py (excerpt)
from setuptools import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "cykernels",
        sources=["cykernels.pyx"],
        extra_compile_args=["-O3"],
    ),
    Extension(
        "parsum",
        sources=["parsum.pyx"],
        extra_compile_args=["-O3", "-fopenmp"],
        extra_link_args=["-fopenmp"],
    ),
]

setup(
    name="cyexts",
    version="0.1.0",
    ext_modules=cythonize(extensions, language_level=3),
)
```
On macOS with Apple Clang, OpenMP requires installing `libomp` and passing `-Xpreprocessor -fopenmp` plus `-lomp`. Consider making OpenMP optional and feature‑gated.
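One way to feature‑gate it (a sketch; the `WITH_OPENMP` switch is illustrative, not a standard):

```python
# setup.py (excerpt) — hedged sketch of optional, per-platform OpenMP flags
import os
import sys

use_openmp = os.environ.get("WITH_OPENMP", "1") == "1"  # illustrative opt-out switch

compile_args, link_args = ["-O3"], []
if use_openmp:
    if sys.platform == "darwin":
        # Apple Clang: needs a separately installed libomp; paths vary by machine
        compile_args += ["-Xpreprocessor", "-fopenmp"]
        link_args += ["-lomp"]
    elif sys.platform.startswith("linux"):
        compile_args += ["-fopenmp"]
        link_args += ["-fopenmp"]
    elif sys.platform == "win32":
        compile_args += ["/openmp"]
# Without OpenMP, prange loops simply run single-threaded.
```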
## cffi: the fastest path to existing C libraries
If you already have a C library (or the platform provides one), `cffi` lets you call it with minimal glue and strong safety. Two ways to bind:
- ABI mode: parse C declarations and `dlopen()` an existing shared library at runtime. Zero compile step for your code; great for system libs.
- API (out‑of‑line) mode: you ship a tiny compiled helper module for stable imports and distribution.
### ABI mode: no compile step
```python
# abi_strlen.py
from cffi import FFI

ffi = FFI()
ffi.cdef("size_t strlen(const char *s);")

# On Linux, libc is usually available via None (the main program) or "c"
try:
    C = ffi.dlopen(None)
except OSError:
    C = ffi.dlopen("c")

s = b"hello world"
n = C.strlen(s)
print(n)  # 11
```
Zero‑copy tip: use `ffi.from_buffer` to pass Python buffers (e.g., `bytearray`, NumPy) without copying.
```python
# abi_memset.py
from cffi import FFI

ffi = FFI()
ffi.cdef("void *memset(void *s, int c, size_t n);")
C = ffi.dlopen(None)

buf = bytearray(8)
C.memset(ffi.from_buffer(buf), 0x7F, len(buf))
print(list(buf))  # [127, 127, ...]
```
### API (out‑of‑line) mode: compiled helper module
```python
# build_cadd.py
from cffi import FFI

ffi = FFI()
ffi.cdef("int add(int a, int b);")
ffi.set_source("_cadd", "int add(int a, int b){ return a + b; }")

if __name__ == "__main__":
    ffi.compile(verbose=True)
```

```python
# use_cadd.py
from _cadd import lib

print(lib.add(2, 3))  # 5
```
Guidance:
- Prefer API mode for distribution: you get a normal import target and can wheel it like any extension.
- Macros: cffi’s parser does not run the C preprocessor; replicate constants in `cdef` or compute them at runtime via helper C code in `set_source`.
- Callbacks: use `ffi.callback("int(int)")` to create a C function pointer from a Python callable, but keep callbacks rare on hot paths (see the sketch below).
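A minimal callback sketch in ABI mode, wrapping libc’s `qsort` (illustrative only; every comparison bounces through the interpreter, which is exactly why callbacks don’t belong on hot paths):

```python
# Hedged sketch: a Python comparator handed to qsort via ffi.callback.
from cffi import FFI

ffi = FFI()
ffi.cdef("void qsort(void *base, size_t nmemb, size_t size,"
         "           int (*compar)(const void *, const void *));")
C = ffi.dlopen(None)

@ffi.callback("int(const void *, const void *)")
def compare_ints(a, b):
    x = ffi.cast("int *", a)[0]
    y = ffi.cast("int *", b)[0]
    return (x > y) - (x < y)

arr = ffi.new("int[]", [5, 2, 9, 1])
C.qsort(arr, len(arr), ffi.sizeof("int"), compare_ints)
print(list(arr))  # [1, 2, 5, 9]
```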
### Packaging notes
- API mode emits a platform extension you can wheel. Audit external shared libs with `auditwheel` (Linux) or `delocate`/`delvewheel` (macOS/Windows).
- ABI mode has no extension to wheel; you ship pure Python but rely on the presence of the shared library at runtime.
## HPy: modern, portable C extensions with debuggable handles
The classic Python/C API exposes raw `PyObject *` pointers and reference counting: powerful but fragile, and CPython‑centric. HPy introduces a handle‑based API with a Universal ABI that runs across multiple interpreters (CPython, PyPy, GraalPy) and a Debug mode that catches leaks and lifetime mistakes.
### Minimal HPy module (universal)
```c
// myhpy.c
#include <hpy.h>

HPyDef_METH(forty_two, "forty_two", forty_two_impl, HPyFunc_NOARGS)
static HPy forty_two_impl(HPyContext *ctx, HPy self) {
    return HPyLong_FromLong(ctx, 42);
}

static HPyDef *module_defines[] = { &forty_two, NULL };

static HPyModuleDef moduledef = {
    .name = "myhpy",
    .doc = "HPy example",
    .size = -1,
    .defines = module_defines,
};

HPy_MODINIT(myhpy)
static HPy init_myhpy_impl(HPyContext *ctx) {
    return HPyModule_Create(ctx, &moduledef);
}
```
Build options in practice:
- Universal ABI: single binary per platform/arch works across supported Python versions/implementations.
- CPython ABI: for CPython‑specific integration when you need it.
- Enable Debug mode builds in CI to catch handle leaks and use‑after‑close.
### Handles, lifetime, and errors
- Treat every `HPy` like a borrowed resource; close what you open unless ownership is explicitly transferred by an API.
- Use the `HPyErr_*` helpers to raise errors; return `HPy_NULL` on failure for the appropriate signatures.
- Convert primitives via `HPyLong_FromLong`, `HPyFloat_FromDouble`, etc.; parse arguments with helpers or by reading from args tuples depending on the function kind.
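A small sketch of those rules in one function (same older-style `HPyDef_METH` form as the module above; exact macro and helper signatures vary across HPy releases, so treat this as illustrative rather than a drop-in):

```c
// Hedged sketch: error raising and primitive conversion with HPy handles.
HPyDef_METH(checked_double, "checked_double", checked_double_impl, HPyFunc_O)
static HPy checked_double_impl(HPyContext *ctx, HPy self, HPy arg) {
    long v = HPyLong_AsLong(ctx, arg);
    if (v == -1 && HPyErr_Occurred(ctx)) {
        return HPy_NULL;  // propagate the conversion error
    }
    if (v < 0) {
        HPyErr_SetString(ctx, ctx->h_ValueError, "expected a non-negative int");
        return HPy_NULL;
    }
    return HPyLong_FromLong(ctx, 2 * v);
}
```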
### Where HPy fits today
- Great for libraries that want portability beyond CPython and better safety during development.
- Plays well with alternative interpreters; can coexist with classic C API in staged migrations.
- Cython’s experimental HPy backend aims to let you keep the `.pyx` surface while targeting HPy under the hood (plan migrations, validate feature coverage).
### Packaging notes
- Use HPy’s build helpers (setuptools plugin) or integrate include paths with your backend (e.g., scikit‑build‑core). Produce universal and debug variants.
- Distribute wheels per platform as usual; universal refers to Python implementation ABI compatibility, not OS/arch.
## Choosing between cffi and HPy for bindings
- If the C surface is stable and you want minimal C glue, start with cffi (API mode) and ship wheels quickly.
- If you need a compiled module with better cross‑implementation support and safer C semantics, target HPy.
- For CPython‑only tight integrations or custom types, the classic C API still wins on control; pair it with abi3 when possible.
## Classic C API: precision tools (vectorcall, custom types, abi3)
When you need tight control—custom types, lowest overhead call sites, or specialized parsing—the classic C API delivers. Favor the stable ABI where you can, and adopt modern calling conventions to cut overhead.
### Vectorcall: faster calls with fewer temporaries
Vectorcall (PEP 590) avoids building argument tuples, letting the runtime pass argument pointers directly. You can use it via function flags (`METH_FASTCALL | METH_KEYWORDS`) or by implementing the vectorcall slot on a custom type.
Minimal custom type with vectorcall:
```c
// vecobj.c (excerpt)
#define PY_SSIZE_T_CLEAN
#include <Python.h>

typedef struct {
    PyObject_HEAD
    long factor;
    vectorcallfunc vectorcall;
} MultObject;

static PyObject *
Mult_vectorcall(PyObject *self, PyObject *const *args, size_t nargsf, PyObject *kwnames) {
    if (kwnames && PyTuple_GET_SIZE(kwnames) != 0) {
        PyErr_SetString(PyExc_TypeError, "no keyword arguments");
        return NULL;
    }
    Py_ssize_t nargs = PyVectorcall_NARGS(nargsf);
    if (nargs != 1) {
        PyErr_SetString(PyExc_TypeError, "expected 1 positional arg");
        return NULL;
    }
    long x = PyLong_AsLong(args[0]);
    if (x == -1 && PyErr_Occurred()) return NULL;
    long f = ((MultObject *)self)->factor;
    return PyLong_FromLong(f * x);
}

static int Mult_init(MultObject *self, PyObject *args, PyObject *kw) {
    long f = 1;
    static char *kwlist[] = {"factor", NULL};
    if (!PyArg_ParseTupleAndKeywords(args, kw, "|l", kwlist, &f)) return -1;
    self->factor = f;
    self->vectorcall = Mult_vectorcall; // set per-instance function pointer
    return 0;
}

static PyTypeObject MultType = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "mincext.Mult",
    .tp_basicsize = sizeof(MultObject),
    .tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_VECTORCALL,
    .tp_new = PyType_GenericNew,
    .tp_init = (initproc)Mult_init,
    .tp_call = PyVectorcall_Call, // required companion to the vectorcall slot
    .tp_vectorcall_offset = offsetof(MultObject, vectorcall),
};

static PyModuleDef mod = { PyModuleDef_HEAD_INIT, "mincext", 0, -1, 0 };

PyMODINIT_FUNC PyInit_mincext(void) {
    PyObject *m = PyModule_Create(&mod);
    if (!m) return NULL;
    if (PyType_Ready(&MultType) < 0) { Py_DECREF(m); return NULL; }
    Py_INCREF(&MultType);
    if (PyModule_AddObject(m, "Mult", (PyObject *)&MultType) < 0) {
        Py_DECREF(&MultType); Py_DECREF(m); return NULL;
    }
    return m;
}
```
Usage:
```python
from mincext import Mult

m = Mult(factor=3)
assert m(7) == 21
```
Notes:
- Set `tp_vectorcall_offset` and the per‑instance pointer in `__init__` (or a factory) so each instance carries the fastcall entry.
- If you only need a fast module function, use `METH_FASTCALL | METH_KEYWORDS` on `PyMethodDef` for a simpler path (see the sketch below).
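A sketch of that simpler path (positional‑only here, so plain `METH_FASTCALL`; add `METH_KEYWORDS` and a `kwnames` parameter if you need keywords):

```c
// Hedged sketch: a module-level fastcall function; no argument tuple is built per call.
static PyObject *
fast_add(PyObject *self, PyObject *const *args, Py_ssize_t nargs) {
    if (nargs != 2) {
        PyErr_SetString(PyExc_TypeError, "add() expects exactly 2 arguments");
        return NULL;
    }
    long a = PyLong_AsLong(args[0]);
    if (a == -1 && PyErr_Occurred()) return NULL;
    long b = PyLong_AsLong(args[1]);
    if (b == -1 && PyErr_Occurred()) return NULL;
    return PyLong_FromLong(a + b);
}

static PyMethodDef FastMethods[] = {
    {"add", (PyCFunction)(void (*)(void))fast_add, METH_FASTCALL, "Add two ints."},
    {NULL, NULL, 0, NULL}
};
```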
### Error handling and reference counting (the safe pattern)
- Prefer one exit path with `goto error;` and cleanups via `Py_XDECREF`/`Py_CLEAR`.
- Use `Py_SETREF(dst, src)`/`Py_XSETREF` when replacing owned references.
- Turn a borrowed reference into an owned one with `Py_NewRef(obj)` (3.10+); avoid stealing semantics unless documented.
```c
static PyObject *do_work(PyObject *self, PyObject *args) {
    PyObject *a = NULL, *b = NULL, *out = NULL;
    if (!PyArg_ParseTuple(args, "OO", &a, &b)) return NULL;
    out = PyTuple_Pack(2, a, b);
    if (!out) goto error;
    return out;
error:
    Py_XDECREF(out);
    return NULL;
}
```
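And a sketch of the reference‑replacement helpers on a hypothetical struct member (the `Holder` type here is illustrative):

```c
// Hedged sketch: replace an owned reference without leaking the old value.
typedef struct {
    PyObject_HEAD
    PyObject *payload; // owned reference (may be NULL)
} Holder;

static void Holder_set_payload(Holder *self, PyObject *value) {
    // Py_XSETREF stores the new strong reference first, then decrefs the old one.
    Py_XSETREF(self->payload, Py_NewRef(value));
}
```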
### Targeting the stable ABI (abi3)
Ship one wheel per platform that works across Python 3.x minor versions by limiting yourself to the stable API. Note that the vectorcall machinery above is not part of the older limited API, so abi3 builds should stick to limited‑API‑safe sources such as the minimal module from earlier.
Setuptools example:
```python
# setup.py
from setuptools import setup, Extension

ext = Extension(
    "mincext",
    sources=["src/mincext.c"],  # limited-API-safe sources only
    define_macros=[("Py_LIMITED_API", "0x03080000")],  # target 3.8+ stable ABI
)

setup(
    name="mincext",
    version="0.1.0",
    ext_modules=[ext],
    options={"bdist_wheel": {"py_limited_api": "cp38"}},
)
```
Or via `setup.cfg`:
```ini
[bdist_wheel]
py_limited_api = cp38
```
Packaging tips:
- Linux: `auditwheel show`/`auditwheel repair` on `dist/*.whl` to produce manylinux wheels.
- macOS: `delocate-listdeps`/`delocate-wheel` to vendor `.dylib`s.
- Windows: `delvewheel show`/`delvewheel repair` for `.dll`s.
## CI smoke tests that prevent surprises
- Matrix: OS × Python (min,max) for import test and a tiny call.
- Verify wheels are manylinux/musllinux where applicable.
- Run `pip install` against the built wheel in a clean venv, then `python -c "import pkg; print(pkg.__version__)"` (see the sketch below).
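A minimal smoke‑test sketch (paths and the `mincext` package name are illustrative):

```bash
# Hedged sketch: install the freshly built wheel into a throwaway venv and import it.
python -m venv /tmp/smoke
/tmp/smoke/bin/pip install --no-index --find-links dist mincext
/tmp/smoke/bin/python -c "import mincext; print(mincext.add(2, 3))"
```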
## Final production checklist
- Choose the thinnest boundary: buffers for raw data, structs by pointer, batch work.
- Release the GIL around pure native loops; reacquire before touching Python objects.
- Prefer abi3 unless you need unstable internals; audit wheels on CI.
- For existing C libraries, start with cffi (API mode) or HPy if you need portability and safety; drop to C API when you need full control.
## References
- PEP 590: Vectorcall — peps.python.org/pep-0590
- PEP 3118: Buffer Protocol — peps.python.org/pep-3118
- Stable ABI and Py_LIMITED_API — Python C API: Stable ABI
- Extending and Embedding the Python Interpreter — Python Docs
- Python/C API Reference — Python Docs
- Vectorcall Protocol docs — Python Docs
- Setuptools: Building abi3 wheels — Setuptools User Guide
- Python Packaging User Guide: Binary Extensions — PyPUG
- PEP 517/518: Build backends and pyproject — PEP 517, PEP 518
- manylinux policy (PEP 600) — peps.python.org/pep-0600
- musllinux policy (PEP 656) — peps.python.org/pep-0656
- auditwheel — PyPA auditwheel
- delocate (macOS wheel repair) — Delocate
- delvewheel (Windows wheel repair) — Delvewheel
- Cython: User Guide and `nogil` — Cython Docs
- cffi: Foreign Function Interface for Python — cffi Docs
- HPy: A better C API for Python — HPy Project
- scikit-build-core — Docs
- meson-python — Docs
- Buffer Protocol (how-to) — Python Docs