Building Fast Native Extensions: Cython, cffi, HPy, and a Tiny C-extension by Hand

Published: November 5, 2020 · 17 min read

Updated: November 21, 2024

Python lets you ship ideas quickly. But when hot paths turn CPU‑bound, you need native speed without turning your codebase into a build‑system museum. This guide shows you the pragmatic ways to cross the boundary—Cython, cffi, HPy, and the raw C API—and how to choose between them with a bias for safety, packaging sanity, and maintainable performance.

We’ll set the mental model, give you a minimal C extension you can build today, and establish packaging/ABI ground rules so distribution doesn’t become the bottleneck.

Who this is for

  • Backend, data, and infra engineers who need predictable speedups with Python in production.
  • Teams owning performance‑critical kernels (hashing, parsing, transforms) or binding existing C/C++/Rust libs.
  • People who want portable wheels and a clean rollback story, not bespoke CI fires.

Key takeaways

  • Pick the path that matches your constraints, not hype: Cython for Python‑first kernels, cffi for wrapping existing C libraries, HPy for future‑proof portability and debug tooling, raw C API when you need ultimate control.
  • Understand packaging and ABI early: abi3, Py_LIMITED_API, manylinux/musllinux, and auditwheel decide your release friction.
  • Release the GIL in native hot loops and keep Python orchestration cheap. Design for zero‑copy boundaries via the buffer protocol where possible.
```mermaid
flowchart TD
    A[What are you optimizing?] -->|Existing C/C++ lib| B[cffi or HPy]
    A -->|Python loop hot path| C[Cython]
    A -->|Tight control/CPython-only| D[C API by hand]
    B --> B1{Need portable wheels?}
    B1 -->|Yes, py3.x compatibility| B2[abi3 + auditwheel]
    B1 -->|No, platform-specific ok| B3[ABI mode + platform wheels]
    C --> C1{Cross-implementation future?}
    C1 -->|Yes| C2["Cython + HPy backend (where feasible)"]
    C1 -->|No| C3[Cython classic]
    D --> D1{Maintenance budget?}
    D1 -->|Low| E[Prefer HPy or Cython]
    D1 -->|High| F[C API + vectorcall + buffer]
```

What “fast” and “safe” actually mean here

  • Fast: move Python out of inner loops, minimize crossings, batch work, use contiguous memory, and exploit the CPU (SIMD, cache‑friendly layouts) from native code.
  • Safe: clear ownership, correct reference counting, no accidental copies, explicit GIL release around long native work, reproducible builds, and wheels that don’t break consumers.

Packaging and ABI mental model (don’t skip this)

The performance win is worthless if distribution is painful. Anchor on four ideas:

  • abi3 + Py_LIMITED_API: build a single wheel that works across Python 3.x minor versions on the same platform by targeting the stable C API subset. Ideal when you don’t need bleeding‑edge CPython internals.
  • manylinux/musllinux: portable Linux wheels with audited dependencies. Use auditwheel to repair wheels and vendor shared libs appropriately.
  • Universal vs CPython‑specific APIs: HPy’s Universal ABI aims to run across CPython, PyPy, GraalPy; classic #include <Python.h> ties you to CPython’s ABI unless you restrict to the stable subset.
  • Build backends: prefer modern backends (scikit-build-core, setuptools with pyproject.toml, or meson-python). Keep builds declarative and cacheable in CI.
```mermaid
graph LR
    subgraph Source
        S1[pyproject.toml]
        S2[src/native.c or .pyx]
        S3[headers/.h]
    end
    subgraph Build
        B1[Backend: setuptools / scikit-build-core / meson-python]
        B2[Compiler & linker]
        B3[auditwheel / delvewheel]
    end
    subgraph Artifacts
        W1[abi3 wheel]
        W2[platform wheel]
        W3[sdist]
    end
    S1 --> B1
    S2 --> B1
    S3 --> B1
    B1 --> B2 --> B3 --> W1
    B1 --> B2 --> W2
    B1 --> W3
```

Minimal pyproject.toml variants you can adopt:

# Variant A: setuptools (classic)
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "mincext"
version = "0.1.0"
requires-python = ">=3.8"

# Variant B: scikit-build-core (great for C/C++/Fortran or when CMake is already in play)
[build-system]
requires = ["scikit-build-core>=0.7", "pybind11<3; python_version<'3.13'"]
build-backend = "scikit_build_core.build"

[project]
name = "mincext"
version = "0.1.0"
requires-python = ">=3.8"

With setuptools, define the Py_LIMITED_API macro in your extension and set the bdist_wheel option py_limited_api = cp38 so the result is tagged abi3. With scikit‑build‑core, pass equivalent definitions in CMake or tool config.
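To sanity-check what you produced, it helps to read the wheel filename itself; the tags follow PEP 427. A small sketch, decomposing a hypothetical filename for the module above:

```python
# A hypothetical abi3 wheel filename, decomposed per PEP 427:
# {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl
fname = "mincext-0.1.0-cp38-abi3-manylinux_2_17_x86_64.whl"
name, version, py_tag, abi_tag, plat_tag = fname[: -len(".whl")].split("-", 4)

assert abi_tag == "abi3"   # one wheel covers CPython 3.8+ on this platform
assert py_tag == "cp38"    # the minimum interpreter version is encoded here
print(name, version, plat_tag)
```

If pip refuses to install a wheel, comparing these tags against the supported-tag list from pip debug --verbose is usually the fastest diagnosis.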


A minimal C extension you can build and wheel today

This is the smallest useful CPython extension showing argument parsing, error handling, and return conversion. Keep it boring and explicit.

// src/mincext.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
 
// add(a: int, b: int) -> int
static PyObject *minc_add(PyObject *self, PyObject *args) {
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b)) {
        return NULL; // TypeError already set by PyArg_ParseTuple
    }
    long sum = a + b;
    return PyLong_FromLong(sum);
}
 
static PyMethodDef MincextMethods[] = {
    {"add", minc_add, METH_VARARGS, "Add two integers."},
    {NULL, NULL, 0, NULL}
};
 
static struct PyModuleDef mincextmodule = {
    PyModuleDef_HEAD_INIT,
    "mincext",            // m_name
    "Example minimal C extension", // m_doc
    -1,                    // m_size
    MincextMethods         // m_methods
};
 
PyMODINIT_FUNC PyInit_mincext(void) { return PyModule_Create(&mincextmodule); }

Set up setuptools to compile it:

# setup.py (simple, works with the pyproject build-system block above)
from setuptools import setup, Extension
 
ext = Extension(
    "mincext",
    sources=["src/mincext.c"],
    # Define Py_LIMITED_API to target abi3 (optional, restricts API surface)
    define_macros=[("Py_LIMITED_API", "0x03080000")],  # cp38+ stable ABI
)
 
setup(name="mincext", version="0.1.0", ext_modules=[ext])

Build and import:

python -m build  # or: python setup.py bdist_wheel
python -c "import mincext; print(mincext.add(2, 3))"
Linux portability tip: run auditwheel show dist/*.whl and, if needed, auditwheel repair to produce manylinux wheels.

Releasing the GIL in native sections

Long‑running native code should release the GIL so other Python threads can run. Use these macros around code that does not touch Python objects:
#include <Python.h>
 
static PyObject *minc_sum_array(PyObject *self, PyObject *args) {
    const unsigned char *buf; Py_ssize_t len;
    if (!PyArg_ParseTuple(args, "y#", &buf, &len)) return NULL; // y# borrows a read-only view
    unsigned long long acc = 0;
    Py_BEGIN_ALLOW_THREADS
    for (Py_ssize_t i = 0; i < len; ++i) acc += buf[i];
    Py_END_ALLOW_THREADS
    return PyLong_FromUnsignedLongLong(acc);
}

Rules of thumb:

  • Only release the GIL when touching raw memory or external I/O; reacquire it before interacting with Python objects.
  • Prefer chunked loops and clear cancellation points for responsiveness.
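You can see the payoff from pure Python before writing any C: CPython's hashlib releases the GIL while hashing buffers larger than roughly 2 KiB, so hashing in threads genuinely overlaps. A small sketch:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# hashlib releases the GIL for updates on data larger than ~2047 bytes,
# so these four hashes can run on separate cores despite using threads.
chunks = [bytes([i]) * (1 << 20) for i in range(4)]  # 4 x 1 MiB

with ThreadPoolExecutor(max_workers=4) as ex:
    digests = list(ex.map(lambda c: hashlib.sha256(c).hexdigest(), chunks))

# Same answers as the sequential computation
assert digests == [hashlib.sha256(c).hexdigest() for c in chunks]
```

The same shape applies to your own extension: release around the long native section, and the Python-level threading code stays ordinary.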

Zero‑copy boundaries via the buffer protocol

Avoid extra copies by accepting objects that expose PEP 3118 buffers (e.g., bytes, bytearray, memoryview, NumPy arrays). Parse with y*/y# or obtain a Py_buffer and validate layout.

static PyObject *minc_scale_inplace(PyObject *self, PyObject *args) {
    Py_buffer view; int factor;
    // "w*" requests a writable buffer (read-only objects raise TypeError here)
    if (!PyArg_ParseTuple(args, "w*i", &view, &factor)) return NULL;
    // Require a 1D contiguous byte layout
    if (view.ndim != 1 || !view.buf) {
        PyBuffer_Release(&view);
        PyErr_SetString(PyExc_ValueError, "need a writable, contiguous 1D buffer");
        return NULL;
    }
    Py_BEGIN_ALLOW_THREADS
    for (Py_ssize_t i = 0; i < view.len; ++i)
        ((unsigned char *)view.buf)[i] *= (unsigned char)factor;
    Py_END_ALLOW_THREADS
    PyBuffer_Release(&view);
    Py_RETURN_NONE;
}
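The same zero-copy contract is easy to exercise from Python before wiring up C; memoryview exposes exactly the fields the extension checks (ndim, readonly, contiguity):

```python
data = bytearray(b"hello world")

view = memoryview(data)            # a view over the same buffer, no copy
assert view.ndim == 1 and view.c_contiguous and not view.readonly

view[0:5] = b"HELLO"               # writes go straight through to `data`
assert bytes(data) == b"HELLO world"

ro = memoryview(bytes(data))       # bytes exposes a read-only buffer,
assert ro.readonly                 # so a "w*"-parsing extension would reject it
```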

When to reach for each tool (quick heuristics you can actually use)

  • Cython: You own the algorithm and want incremental wins by typing hot loops. Great developer ergonomics; release the GIL in with nogil blocks; typed memoryviews map to buffers efficiently.
  • cffi: You already have a C library and want a thin, Pythonic binding with minimal C glue. Use API mode for compiled bindings; ABI mode for late‑bound shared libs.
  • HPy: You want portability across interpreters and a safer, modern API with debug tooling. Target the Universal ABI when feasible.
  • Raw C API: You need exact control (custom types, vectorcall, fine‑tuned parsing) and accept CPython‑specific maintenance. Consider Py_LIMITED_API if you can live within the stable subset.
```mermaid
sequenceDiagram
    participant Py as Python
    participant Ext as Extension Module
    participant CPU as Native Loop
    Py->>Ext: call fn(buf)
    Ext->>Ext: parse args (no copies)
    Ext-->>CPU: release GIL
    CPU-->>Ext: compute on raw memory
    Ext->>Py: reacquire GIL, return result
```

Bench and safety checklist (apply before scaling up)

  • Verify zero‑copy: assert object exposes expected buffer; measure copies with counters or memory bandwidth.
  • Count crossings: batch work; prefer fewer, larger calls over many tiny ones.
  • Concurrency: release the GIL only in pure native sections; ensure thread‑safe access to any globals.
  • Packaging: produce abi3/manylinux wheels where possible; run auditwheel/delvewheel in CI.
  • Observability: record bytes processed, calls, error paths, and time per call. Fail closed on layout mismatches.
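The crossing-count point is measurable without any native code at all; compare a per-element Python loop with one batched call whose loop runs in C:

```python
import time

data = bytes(range(256)) * 4096  # ~1 MiB of input

def sum_per_byte(buf):
    # one eval-loop iteration (and one boxed int) per element
    total = 0
    for b in buf:
        total += b
    return total

def sum_batched(buf):
    # a single boundary crossing; the loop runs inside sum()'s C code
    return sum(buf)

t0 = time.perf_counter(); slow = sum_per_byte(data); t1 = time.perf_counter()
fast = sum_batched(data); t2 = time.perf_counter()
assert slow == fast
print(f"per-byte: {t1 - t0:.4f}s  batched: {t2 - t1:.4f}s")
```

The exact ratio varies by machine, but the batched version wins by a wide margin for the same reason fewer extension calls win: less interpreter overhead per byte processed.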

In the next sections, we’ll go deeper on Cython (typed memoryviews, with nogil, and layout contracts), cffi (ABI vs API modes and performance gotchas), HPy (handles, debug mode, and universal wheels), and round it out with a compact, production‑grade native module template covering vectorcall and custom types.


Cython for Python‑first kernels (the sharp, productive path)

If you already own the hot loop in Python, Cython lets you keep Python syntax while compiling to C. The big wins come from: typed variables, typed memoryviews (zero‑copy views over buffers), avoiding Python object boxing in loops, and releasing the GIL around pure native work.

```mermaid
graph LR
    A[.pyx Cython source] --> B[C code generated]
    B --> C[Compiler & linker]
    C --> D[.so/.pyd extension]
    D --> E[import in Python]
```

Typed memoryviews: zero‑copy access to array data

Use memoryviews to read/write contiguous data without copying. The ::1 stride asserts C‑contiguity, enabling the fastest pointer arithmetic.

# cykernels.pyx
cimport cython
 
@cython.boundscheck(False)
@cython.wraparound(False)
def sum_bytes(const unsigned char[::1] data):  # returns the byte sum as a Python int
    cdef Py_ssize_t i, n = data.shape[0]
    cdef unsigned long long acc = 0
    with cython.nogil:
        for i in range(n):
            acc += data[i]
    return acc
 
@cython.boundscheck(False)
@cython.wraparound(False)
def scale_inplace(unsigned char[::1] data, int factor) -> None:
    cdef Py_ssize_t i, n = data.shape[0]
    with cython.nogil:
        for i in range(n):
            data[i] = <unsigned char>((<int>data[i]) * factor)

Usage from Python (works with bytes, bytearray, memoryview, NumPy arrays, Arrow buffers, etc.):

import cykernels, numpy as np
arr = np.arange(10, dtype=np.uint8)
cykernels.scale_inplace(arr, 2)
print(arr.sum(), cykernels.sum_bytes(arr))

Notes:

  • Disable bounds/wraparound checks in hot loops; keep them on in tests if needed.
  • with cython.nogil: mirrors C’s Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS.

Exposing C libraries via .pxd (no glue copies)

Declare external functions and structs once in a .pxd, then call them from .pyx with zero marshaling beyond pointer/primitive passing.

# fastops.pxd
cdef extern from "fastops.h":
    ctypedef unsigned long size_t
    int saxpy(float *dst, const float *x, const float *y, float a, size_t n)
# fastops.pyx
cimport cython
from fastops cimport saxpy
 
@cython.boundscheck(False)
@cython.wraparound(False)
def saxpy_mv(float[::1] dst, const float[::1] x, const float[::1] y, float a) -> int:
    cdef Py_ssize_t n = dst.shape[0]
    if x.shape[0] != n or y.shape[0] != n:
        raise ValueError("length mismatch")
    cdef int rc
    with cython.nogil:
        rc = saxpy(&dst[0], &x[0], &y[0], a, <unsigned long>n)
    return rc

Build hint: add your C library and include paths in the extension’s libraries, library_dirs, and include_dirs. With setuptools, you can pass them via Extension(...) or in setup.cfg.

Extension types (cdef class) for low‑overhead objects

When you need stateful, hot objects (iterators, codecs), cdef class gives C‑layout objects with Python interop.

# codec.pyx
cimport cython
 
cdef class XorCodec:
    cdef unsigned char key
 
    def __cinit__(self, int key):
        self.key = <unsigned char>key
 
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def apply_inplace(self, unsigned char[::1] buf) -> None:
        cdef Py_ssize_t i, n = buf.shape[0]
        with cython.nogil:
            for i in range(n):
                buf[i] ^= self.key

Guidance:

  • Methods that touch Python objects require the GIL; keep hot paths in nogil blocks touching only raw memory.
  • Prefer memoryviews over np.ndarray C‑API; it avoids an extra compile‑time dependency.

Parallel loops with prange (opt‑in)

Use cython.parallel.prange to split loops across threads. Requires compiling with OpenMP (e.g., -fopenmp on GCC/Clang, /openmp on MSVC) and linking flags.

# parsum.pyx
from cython.parallel import prange
cimport cython
 
@cython.boundscheck(False)
@cython.wraparound(False)
def sum_u32(const unsigned int[::1] a):
    cdef Py_ssize_t i, n = a.shape[0]
    cdef unsigned long long acc = 0
    # prange(nogil=True) releases the GIL itself; don't wrap it in another nogil block
    for i in prange(n, schedule='static', nogil=True):
        acc += a[i]
    return acc

Add platform‑specific OpenMP flags in your build config; fall back to single‑thread if flags aren’t available. Measure: not all kernels benefit from parallelism due to memory bandwidth.

Error handling patterns that don’t surprise callers

  • For functions returning C integers, annotate error returns: cdef int fn(...) except -1 so Python exceptions are raised when you return -1.
  • For pointer returns, use except NULL.
  • Prefer raising Python exceptions on contract violations (shape/type), not sentinel values.
cdef int div_floor(int a, int b) except -1:
    if b == 0:
        raise ZeroDivisionError()
    return a // b

Compiler directives that matter in hot paths

  • @cython.boundscheck(False), @cython.wraparound(False): remove per‑access checks.
  • @cython.cdivision(True): use C integer division semantics (no Python exceptions).
  • @cython.infer_types(True): let Cython infer more C types (validate in generated C).

Minimal build configuration (setuptools)

Keep builds declarative via pyproject.toml, then cythonize in setup.py or setup.cfg.

[build-system]
requires = ["setuptools>=68", "wheel", "cython>=3.0"]
build-backend = "setuptools.build_meta"
# setup.py (excerpt)
from setuptools import setup, Extension
from Cython.Build import cythonize
 
extensions = [
    Extension(
        "cykernels",
        sources=["cykernels.pyx"],
        extra_compile_args=["-O3"],
    ),
    Extension(
        "parsum",
        sources=["parsum.pyx"],
        extra_compile_args=["-O3", "-fopenmp"],
        extra_link_args=["-fopenmp"],
    ),
]
 
setup(
    name="cyexts",
    version="0.1.0",
    ext_modules=cythonize(extensions, language_level=3),
)

On macOS with Apple Clang, OpenMP requires installing libomp and passing -Xpreprocessor -fopenmp plus -lomp. Consider making OpenMP optional and feature‑gated.
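One way to keep OpenMP optional is to compute the flags at build time. A sketch for setup.py (the flag spellings are the usual ones per toolchain, but verify against your compiler; the libomp location under Homebrew may need explicit include/library paths):

```python
import sys

def openmp_args():
    """Return (extra_compile_args, extra_link_args) for OpenMP, or empty lists."""
    if sys.platform == "darwin":
        # Apple Clang lacks the -fopenmp driver flag; route it through the
        # preprocessor and link Homebrew's libomp explicitly.
        return ["-Xpreprocessor", "-fopenmp"], ["-lomp"]
    if sys.platform.startswith("linux"):
        return ["-fopenmp"], ["-fopenmp"]
    if sys.platform == "win32":
        return ["/openmp"], []  # MSVC
    return [], []  # unknown toolchain: build single-threaded

compile_args, link_args = openmp_args()
```

Feed compile_args/link_args into the Extension(...) entries above, and consider an environment variable to force the single-threaded fallback in constrained CI images.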

```mermaid
flowchart TD
    MV[Typed memoryview] --> P[Raw pointer math]
    P --> NOGIL[with nogil]
    NOGIL --> O3[Optimized native loop]
    O3 --> RET[Return small Python object]
    style MV fill:#e1f5fe
    style NOGIL fill:#fff3e0
    style O3 fill:#e8f5e8
```

cffi: the fastest path to existing C libraries

If you already have a C library (or the platform provides one), cffi lets you call it with minimal glue and strong safety. Two ways to bind:

  • ABI mode: parse C declarations and dlopen() an existing shared library at runtime. Zero compile step for your code; great for system libs.
  • API (out‑of‑line) mode: you ship a tiny compiled helper module for stable imports and distribution.
```mermaid
flowchart LR
    A[Your Python code] -->|ABI mode| B["ffi.cdef + dlopen(lib)"]
    A -->|API mode| C[ffi.set_source + compile]
    B --> D[Direct calls into libc]
    C --> E[_module.lib callable]
```

ABI mode: no compile step

# abi_strlen.py
from cffi import FFI
ffi = FFI()
ffi.cdef("size_t strlen(const char *s);")
 
# On Linux, libc is usually available via None (the main program) or "c"
try:
    C = ffi.dlopen(None)
except OSError:
    C = ffi.dlopen("c")
 
s = b"hello world"
n = C.strlen(s)
print(n)  # 11

Zero‑copy tip: use ffi.from_buffer to pass Python buffers (e.g., bytearray, NumPy) without copying.

# abi_memset.py
from cffi import FFI
ffi = FFI(); ffi.cdef("void *memset(void *s, int c, size_t n);")
C = ffi.dlopen(None)
buf = bytearray(8)
C.memset(ffi.from_buffer(buf), 0x7F, len(buf))
print(list(buf))  # [127, 127, ...]

API (out‑of‑line) mode: compiled helper module

# build_cadd.py
from cffi import FFI
ffi = FFI()
ffi.cdef("int add(int a, int b);")
ffi.set_source("_cadd", "int add(int a,int b){return a+b;}")
if __name__ == "__main__":
    ffi.compile(verbose=True)
# use_cadd.py
from _cadd import lib
print(lib.add(2, 3))  # 5

Guidance:

  • Prefer API mode for distribution: you get a normal import target and can wheel it like any extension.
  • Macros: the C preprocessor isn’t executed by cffi’s parser; replicate constants in cdef or compute them at runtime via helper C code in set_source.
  • Callbacks: use ffi.callback("int(int)") to create a C function pointer from a Python callable, but keep callbacks rare on hot paths.

Packaging notes

  • API mode emits a platform extension you can wheel. Audit external shared libs with auditwheel (Linux) or delocate/delvewheel (macOS/Windows).
  • ABI mode has no extension to wheel; you ship pure Python but rely on the presence of the shared library at runtime.

HPy: modern, portable C extensions with debugable handles

The classic Python/C API exposes raw PyObject * and reference counting—powerful but fragile and CPython‑centric. HPy introduces a handle‑based API with a Universal ABI that runs across multiple interpreters (CPython, PyPy, GraalPy) and a Debug mode that catches leaks and lifetime mistakes.

```mermaid
sequenceDiagram
    participant Py as Python
    participant Mod as HPy Module
    participant Ctx as HPyContext
    Py->>Mod: call
    Mod->>Ctx: create handles
    Ctx-->>Mod: return HPy objects via handles
    Mod->>Ctx: close handles
    Mod-->>Py: result
```

Minimal HPy module (universal)

// myhpy.c (written against an early HPy 0.x API; exact macro signatures vary between releases)
#include <hpy.h>
 
HPyDef_METH(forty_two, "forty_two", forty_two_impl, HPyFunc_NOARGS)
static HPy forty_two_impl(HPyContext *ctx, HPy self) {
    return HPyLong_FromLong(ctx, 42);
}
 
static HPyDef *module_defines[] = { &forty_two, NULL };
 
static HPyModuleDef moduledef = {
    .name = "myhpy",
    .doc = "HPy example",
    .size = -1,
    .defines = module_defines,
};
 
HPy_MODINIT(myhpy)
static HPy init_myhpy_impl(HPyContext *ctx) {
    return HPyModule_Create(ctx, &moduledef);
}

Build options in practice:

  • Universal ABI: single binary per platform/arch works across supported Python versions/implementations.
  • CPython ABI: for CPython‑specific integration when you need it.
  • Enable Debug mode builds in CI to catch handle leaks and use‑after‑close.

Handles, lifetime, and errors

  • Treat every HPy like a borrowed resource; close what you open unless ownership is explicitly transferred by an API.
  • Use HPyErr_* helpers to raise errors; return HPy_NULL on failure for appropriate signatures.
  • Convert primitives via HPyLong_FromLong, HPyFloat_FromDouble, etc.; parse arguments with helpers or by reading from args tuples depending on the function kind.

Where HPy fits today

  • Great for libraries that want portability beyond CPython and better safety during development.
  • Plays well with alternative interpreters; can coexist with classic C API in staged migrations.
  • Cython’s experimental HPy backend aims to let you keep .pyx surface while targeting HPy under the hood (plan migrations, validate feature coverage).
```mermaid
flowchart TD
    Start[Need portable C ext?] -->|Yes| U[HPy Universal]
    Start -->|CPython only & internals| CAPI[Classic C API]
    U --> DBG[Debug build in CI]
    DBG --> Ship[Release Universal wheel]
```

Packaging notes

  • Use HPy’s build helpers (setuptools plugin) or integrate include paths with your backend (e.g., scikit‑build‑core). Produce universal and debug variants.
  • Distribute wheels per platform as usual; universal refers to Python implementation ABI compatibility, not OS/arch.

Choosing between cffi and HPy for bindings

  • If the C surface is stable and you want minimal C glue, start with cffi (API mode) and ship wheels quickly.
  • If you need a compiled module with better cross‑implementation support and safer C semantics, target HPy.
  • For CPython‑only tight integrations or custom types, the classic C API still wins on control; pair it with abi3 when possible.

Classic C API: precision tools (vectorcall, custom types, abi3)

When you need tight control—custom types, lowest overhead call sites, or specialized parsing—the classic C API delivers. Favor the stable ABI where you can, and adopt modern calling conventions to cut overhead.

Vectorcall: faster calls with fewer temporaries

Vectorcall (PEP 590) avoids building argument tuples, letting the runtime pass pointers directly. You can use it via function flags (METH_FASTCALL | METH_KEYWORDS) or by implementing the vectorcall slot on a custom type.

Minimal custom type with vectorcall:

// vecobj.c (excerpt)
#define PY_SSIZE_T_CLEAN
#include <Python.h>
 
typedef struct {
    PyObject_HEAD
    long factor;
    vectorcallfunc vectorcall;
} MultObject;
 
static PyObject *
Mult_vectorcall(PyObject *self, PyObject *const *args, size_t nargsf, PyObject *kwnames) {
    if (kwnames && PyTuple_GET_SIZE(kwnames) != 0) {
        PyErr_SetString(PyExc_TypeError, "no keyword arguments");
        return NULL;
    }
    Py_ssize_t nargs = PyVectorcall_NARGS(nargsf);
    if (nargs != 1) {
        PyErr_SetString(PyExc_TypeError, "expected 1 positional arg");
        return NULL;
    }
    long x = PyLong_AsLong(args[0]);
    if (PyErr_Occurred()) return NULL;
    long f = ((MultObject *)self)->factor;
    return PyLong_FromLong(f * x);
}
 
static int Mult_init(MultObject *self, PyObject *args, PyObject *kw) {
    long f = 1;
    static char *kwlist[] = {"factor", NULL};
    if (!PyArg_ParseTupleAndKeywords(args, kw, "|l", kwlist, &f)) return -1;
    self->factor = f;
    self->vectorcall = Mult_vectorcall; // set per-instance function pointer
    return 0;
}
 
static PyTypeObject MultType = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "mincext.Mult",
    .tp_basicsize = sizeof(MultObject),
    // spelled _Py_TPFLAGS_HAVE_VECTORCALL on 3.8; public name since 3.9
    .tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_VECTORCALL,
    .tp_new = PyType_GenericNew,
    .tp_init = (initproc)Mult_init,
    .tp_vectorcall_offset = offsetof(MultObject, vectorcall),
    .tp_call = PyVectorcall_Call,  // required whenever HAVE_VECTORCALL is set
};
 
static PyModuleDef mod = { PyModuleDef_HEAD_INIT, "mincext", 0, -1, 0 };
 
PyMODINIT_FUNC PyInit_mincext(void) {
    PyObject *m = PyModule_Create(&mod);
    if (!m) return NULL;
    if (PyType_Ready(&MultType) < 0) { Py_DECREF(m); return NULL; }
    Py_INCREF(&MultType);
    if (PyModule_AddObject(m, "Mult", (PyObject *)&MultType) < 0) {
        Py_DECREF(&MultType); Py_DECREF(m); return NULL;
    }
    return m;
}

Usage:

from mincext import Mult
m = Mult(factor=3)
assert m(7) == 21

Notes:

  • Set tp_vectorcall_offset and the pointer in __init__ (or a factory) so each instance carries the fastcall entry.
  • If you only need a fast module function, use METH_FASTCALL | METH_KEYWORDS on PyMethodDef for a simpler path.

Error handling and reference counting (the safe pattern)

  • Prefer a single exit path (goto error;) with cleanup via Py_XDECREF/Py_CLEAR.
  • Use Py_SETREF(dst, src)/Py_XSETREF when replacing owned references.
  • Turn borrowed references into owned ones with Py_NewRef(obj) (3.10+); avoid stealing semantics unless documented.
static PyObject *do_work(PyObject *self, PyObject *args) {
    PyObject *a = NULL, *b = NULL, *out = NULL;
    if (!PyArg_ParseTuple(args, "OO", &a, &b)) return NULL;
    out = PyTuple_Pack(2, a, b);
    if (!out) goto error;
    return out;
error:
    Py_XDECREF(out);
    return NULL;
}

Targeting the stable ABI (abi3)

Ship one wheel per platform that works across Python 3.x minors by limiting to the stable API.

Setuptools example:

# setup.py
from setuptools import setup, Extension

# Caveat: the vectorcall example above relies on a static PyTypeObject and
# tp_vectorcall_offset, which the limited API does not permit (use
# PyType_FromSpec there, and skip vectorcall on pre-3.12 stable ABI).
# This block therefore wheels the minimal module from earlier.
ext = Extension(
    "mincext",
    sources=["src/mincext.c"],
    define_macros=[("Py_LIMITED_API", "0x03080000")],  # target 3.8+ stable ABI
)
 
setup(
    name="mincext",
    version="0.1.0",
    ext_modules=[ext],
    options={"bdist_wheel": {"py_limited_api": "cp38"}},
)

Or via setup.cfg:

[bdist_wheel]
py_limited_api = cp38

Packaging tips:

  • Linux: auditwheel show/repair dist/*.whl to produce manylinux wheels.
  • macOS: delocate-listdeps / delocate-wheel to vendor .dylibs.
  • Windows: delvewheel show / delvewheel repair for .dlls.

CI smoke tests that prevent surprises

  • Matrix: OS × Python (min,max) for import test and a tiny call.
  • Verify wheels are manylinux/musllinux where applicable.
  • Run pip install from the built wheel in a clean venv; python -c "import pkg; print(pkg.__version__)".
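A minimal cibuildwheel workflow covers most of that matrix; a sketch (action versions and the test command are illustrative, pin your own):

```yaml
# .github/workflows/wheels.yml
name: wheels
on: [push, pull_request]
jobs:
  build:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: pypa/cibuildwheel@v2.19
        env:
          CIBW_BUILD: "cp38-* cp312-*"   # min and max of the support matrix
          CIBW_TEST_COMMAND: "python -c \"import mincext; print(mincext.add(2, 3))\""
      - uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}
          path: wheelhouse/*.whl
```

cibuildwheel runs the auditwheel/delocate/delvewheel repair step for you on the respective platforms, so the wheelhouse artifacts are the same ones you would ship.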

Final production checklist

  • Choose the thinnest boundary: buffers for raw data, structs by pointer, batch work.
  • Release the GIL around pure native loops; reacquire before touching Python objects.
  • Prefer abi3 unless you need unstable internals; audit wheels on CI.
  • For existing C libraries, start with cffi (API mode) or HPy if you need portability and safety; drop to C API when you need full control.

References