You tried to make your C program “asynchronous,” flipped a flag, and suddenly your code either spins, blocks in weird places, or delivers callbacks in the wrong thread. Been there. The fix isn’t a magic API; it’s picking the right async model for your workload and then using it with boring discipline.
This post is a practical tour of three real-world approaches to async I/O in C:
- POSIX AIO: standardized, file-centric completion I/O
- io_uring: modern Linux completion queues with batching and low overhead
- Thread-pool async: wrap blocking I/O with worker threads and explicit cancellation
We’ll keep the vibe production-first: simple contracts, robust loops, deadline-aware cancellation, and APIs that are testable without a live kernel circus.
What “async” actually means here
Clarify the vocabulary so we don’t talk past each other:
- Blocking I/O: a syscall may park your calling thread until progress is possible. Simple to write; hard to scale.
- Nonblocking I/O (readiness): syscalls return immediately with EAGAIN when progress isn’t possible; you multiplex with epoll/kqueue. Great for many sockets; you manage partials and backpressure. See the groundwork in the I/O patterns and event-loop posts.
- Completion I/O: you submit work and later receive a completion (callback, signal, or queue entry) with the result. That’s POSIX AIO and io_uring territory.
- Thread-pool async: you keep the API “async” by doing blocking work elsewhere and reporting completion back. Works everywhere; you must own fairness, limits, and cancellation semantics.
Key design dimensions we’ll use throughout:
- Latency: median and tail; syscall/batching overheads; context switches
- Throughput: ops/s under load; batching; PCIe/storage queue utilization
- Complexity: API surface, footguns, and operational know-how
- Portability: which platforms you can ship today without conditional jungles
- Cancellation: precise, prompt, safe to reason about
Ground contract you cannot dodge
Before diving into completion APIs, remember what individual reads/writes actually promise:
- Success does not mean “all of it.” Reads/writes can be short; your code must loop.
- EINTR happens; retry-friendly loops are mandatory.
- On nonblocking fds, EAGAIN is a scheduling signal, not an error.
- EOF (read() returns 0) is a first-class outcome; treat it explicitly.
If those statements raise an eyebrow, pause and adopt the robust helpers and event-loop patterns from the foundational posts. Async doesn’t absolve you from those truths—it just changes how you’re notified.
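To make that concrete, here’s a minimal sketch of the kind of retry-friendly helper those posts build up (read_full is an illustrative name, not a library call):
#include <errno.h>
#include <unistd.h>
// Read exactly len bytes unless EOF or a hard error intervenes.
static ssize_t read_full(int fd, void *buf, size_t len) {
  size_t done = 0;
  while (done < len) {
    ssize_t r = read(fd, (char *)buf + done, len - done);
    if (r > 0) { done += (size_t)r; continue; }      // short read: keep looping
    if (r == 0) break;                               // EOF: report what we have
    if (errno == EINTR) continue;                    // interrupted: retry
    if (errno == EAGAIN || errno == EWOULDBLOCK)     // nonblocking fd: caller must wait for readiness
      break;
    return -1;                                       // hard error
  }
  return (ssize_t)done;
}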
POSIX AIO in practice (what it is and what it isn’t)
POSIX AIO is a standardized completion I/O API built around the struct aiocb control block and functions like aio_read(), aio_write(), aio_error(), aio_return(), and aio_suspend().
What it’s good at:
- Regular files on POSIX systems: overlapped reads/writes, especially when the underlying OS/device pipeline benefits from queueing requests.
- Batch submission via lio_listio() and waiting with aio_suspend() for multiple completions.
- Notification options: poll with aio_error(), block on a set with aio_suspend(), or request delivery via signals or SIGEV_THREAD callbacks.
What it’s not great at:
- Sockets and pipes vary by platform: some implementations don’t support them or degrade to thread-based emulation. Always verify your target OS semantics.
- Cancellation is best-effort: aio_cancel() may report that an operation is in progress and cannot be canceled; design for idempotent completions.
- Portability wrinkles: the API exists everywhere POSIX-ish, but performance and coverage differ notably between, say, Linux and the BSDs.
The core flow
- Fill an aiocb with the file descriptor, buffer, length, and offset. Optionally set a sigevent for notification.
- Call aio_read()/aio_write(); it returns immediately.
- Later, check completion via aio_error()/aio_return() or wait on multiple with aio_suspend().
Minimal example (polling for completion):
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int read_file_async(const char *path, off_t off, void *buf, size_t len) {
int fd = open(path, O_RDONLY | O_CLOEXEC);
if (fd < 0) return -1;
struct aiocb cb;
memset(&cb, 0, sizeof cb);
cb.aio_fildes = fd;
cb.aio_buf = buf;
cb.aio_nbytes = len;
cb.aio_offset = off;
if (aio_read(&cb) != 0) { close(fd); return -1; }
// Busy-wait/poll style (replace with aio_suspend for sets and timeouts)
for (;;) {
int e = aio_error(&cb);
if (e == 0) break; // completed
if (e == EINPROGRESS) continue; // not done yet
// error path
close(fd);
return -1;
}
ssize_t r = aio_return(&cb);
close(fd);
return (int)r; // bytes read or -1
}
Waiting on a set with a deadline is usually cleaner:
#include <time.h>
// Wait until any of the operations in cbv[0..n) completes or the timeout expires.
// Note: aio_suspend() takes a *relative* timeout, so derive it from your deadline.
// Returns the index of a completed aiocb, or -1 on timeout/error.
int wait_any(const struct aiocb *const cbv[], int n, const struct timespec *timeout) {
  int rc = aio_suspend(cbv, n, timeout); // 0 when at least one op has completed
  if (rc != 0) return -1; // errno = EAGAIN on timeout, EINTR if interrupted
  for (int i = 0; i < n; ++i) {
    if (cbv[i] && aio_error(cbv[i]) != EINPROGRESS) return i; // completed (success or error)
  }
  return -1; // none of ours completed
}
Batch submission with lio_listio():
// Submit a batch and wait for all to complete.
int read_many(struct aiocb **list, int n) {
if (lio_listio(LIO_WAIT, list, n, NULL) != 0) return -1; // synchronous wait mode
// In LIO_NOWAIT mode, you’d use aio_suspend or notifications
for (int i = 0; i < n; ++i) {
if (aio_error(list[i]) != 0) return -1;
ssize_t r = aio_return(list[i]);
if (r < 0 || (size_t)r != list[i]->aio_nbytes) return -1;
}
return 0;
}
Notifications and callbacks
struct sigevent lets you choose how the OS tells you about completion:
- SIGEV_NONE: you poll
- SIGEV_SIGNAL: deliver a signal with sigev_signo
- SIGEV_THREAD: invoke a user-supplied function on a system-managed thread
Signal delivery intertwines with your process-wide signal strategy. If you already centralize signals via sigaction, ppoll/pselect, and a self-pipe or eventfd (see robust I/O patterns), prefer polling/suspend over signal-driven callbacks to keep control flow testable.
Thread callbacks (SIGEV_THREAD) are convenient but easy to misuse: completions now run on foreign threads. Guard shared data with proper synchronization and keep handlers tiny.
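For completeness, a minimal sketch of arming a SIGEV_THREAD notification; on_aio_done and arm_thread_notification are illustrative names, and the callback deliberately does nothing beyond showing where your tiny handler would go:
#include <aio.h>
#include <signal.h>
#include <string.h>
static void on_aio_done(union sigval sv) {
  struct aiocb *cb = (struct aiocb *)sv.sival_ptr;
  // Runs on a system-managed thread: keep it tiny, e.g. push cb onto a
  // mutex-protected "done" list and wake the main loop.
  (void)cb;
}
static void arm_thread_notification(struct aiocb *cb) {
  memset(&cb->aio_sigevent, 0, sizeof cb->aio_sigevent);
  cb->aio_sigevent.sigev_notify = SIGEV_THREAD;
  cb->aio_sigevent.sigev_notify_function = on_aio_done;
  cb->aio_sigevent.sigev_value.sival_ptr = cb; // handed back to the callback
}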
Cancellation (best-effort, design accordingly)
aio_cancel(fd, aiocb*) attempts to cancel an outstanding operation. Outcomes include:
- Canceled before starting
- In progress (cannot cancel)
- Not found on that fd
Treat cancellations as advisory: make completion handlers idempotent, and tolerate a late completion after you thought you canceled. Always encode operation identity (e.g., a monotonically increasing token) so a late completion can be ignored safely by the consumer that requested cancellation.
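A small sketch of mapping aio_cancel()’s return values onto those outcomes (try_cancel is an illustrative helper):
#include <aio.h>
// Returns 1 if canceled, 0 if it already completed or is still running, -1 on error.
static int try_cancel(int fd, struct aiocb *cb) {
  switch (aio_cancel(fd, cb)) {
  case AIO_CANCELED:    return 1;  // canceled before it ran; aio_error() now reports ECANCELED
  case AIO_ALLDONE:     return 0;  // already completed; reap the result normally
  case AIO_NOTCANCELED: return 0;  // in progress; expect a (possibly late) completion
  default:              return -1; // bad fd or other error
  }
}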
Minimal cancellation-aware shape:
struct op {
volatile int cancelled; // set to 1 when cancel requested (a real atomic is safer in production)
// ... aiocb, buffer, bookkeeping
};
void on_complete(struct op *o, ssize_t result) {
if (o->cancelled) {
// Drop results; caller moved on
return;
}
// Deliver result
}
Practical guidance for POSIX AIO
- Prefer aio_suspend over signal-driven completion to keep control flow explicit and testable.
- Use fixed buffers and avoid stack-lifetime surprises; buffer lifetime must exceed the async op.
- Treat partial completions and short counts as routine; verify aio_return bytes.
- For time-bounded work, use deadline-based waits: compute a deadline once, then convert it to the relative timeout aio_suspend expects (sketch after this list).
- Maintain per-file or per-component in-flight caps; AIO queues can become a surprise memory sink.
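Since aio_suspend() takes a relative timeout, a small helper like this (remaining_until is an illustrative name) converts a precomputed CLOCK_MONOTONIC deadline into the timespec it expects:
#include <time.h>
static struct timespec remaining_until(struct timespec deadline) {
  struct timespec now, rel = {0, 0};
  clock_gettime(CLOCK_MONOTONIC, &now);
  rel.tv_sec = deadline.tv_sec - now.tv_sec;
  rel.tv_nsec = deadline.tv_nsec - now.tv_nsec;
  if (rel.tv_nsec < 0) { rel.tv_sec -= 1; rel.tv_nsec += 1000000000L; }
  if (rel.tv_sec < 0) { rel.tv_sec = 0; rel.tv_nsec = 0; } // already expired
  return rel;
}
// Usage: struct timespec rel = remaining_until(deadline);
//        aio_suspend(cbv, n, &rel);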
Where we’re headed next
We’ve set the mental model and covered POSIX AIO’s contract, flow, and pitfalls. Next we’ll dig into io_uring: submission/completion rings, batching, linked timeouts, and sharp but powerful edges—and then contrast all of this with a disciplined thread-pool async design that works everywhere.
io_uring: completion queues with real teeth
io_uring delivers kernel-backed submission and completion queues that drastically reduce syscall overhead and enable powerful batching/linking semantics. Unlike readiness APIs, you describe the operation up front (fd, buffers, offsets), submit SQEs (Submission Queue Entries), and later reap CQEs (Completion Queue Entries) with results.
High-level wins:
- Low overhead: shared memory rings; many submissions/completions per syscall
- Batching: submit a group in one go; amortize costs
- Chaining: link operations so one runs only after the previous completes
- Linked timeouts: attach a deadline to any operation without extra machinery
Constraints:
- Linux-only; requires relatively recent kernels for advanced ops
- Complexity: lifetime, buffer registration, and chaining semantics are sharp edges
- You own partials/backpressure semantics just like any I/O path
The mental model: SQ and CQ
- The Submission Queue (SQ) holds descriptors of operations you want the kernel to perform. You fill SQEs and submit.
- The Completion Queue (CQ) holds results. You poll/await and process CQEs, each carrying the user_data you attached to its SQE and a res status/byte-count.
You'll often use the excellent liburing helpers to avoid writing raw ring plumbing.
Minimal setup with liburing
#include <liburing.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>
struct io_uring ring;
static int ring_init(unsigned entries) {
int rc = io_uring_queue_init(entries, &ring, 0);
return rc; // 0 on success, -errno on failure
}
static void ring_close(void) {
io_uring_queue_exit(&ring);
}
Submitting a read with a linked timeout
This pattern is the workhorse for bounded-latency I/O: one SQE does the I/O, another SQE adds a timeout, and they’re linked so the timeout cancels the I/O if it doesn’t complete in time.
#include <time.h>
// Per-op context like this should live in stable storage (slab/pool) so it
// outlives the operation; pass its address (or an index) via user_data.
struct read_ctx {
  int fd;
  struct iovec iov;
};
// Submit one read with a linked timeout (milliseconds). Returns 0 on submission success.
// Note: buf must stay valid until the completion is reaped. io_uring_prep_read keeps the
// buffer and length inside the SQE, so no separate iovec has to outlive this call.
static int submit_read_with_timeout(int fd, void *buf, size_t len, off_t off, unsigned timeout_ms, uint64_t tag) {
  struct io_uring_sqe *sqe;
  struct __kernel_timespec ts = { .tv_sec = timeout_ms / 1000, .tv_nsec = (long long)(timeout_ms % 1000) * 1000000LL };
  // READ SQE
  sqe = io_uring_get_sqe(&ring);
  if (!sqe) return -1;
  io_uring_prep_read(sqe, fd, buf, (unsigned)len, off);
  io_uring_sqe_set_data64(sqe, tag); // attach user tag
  sqe->flags |= IOSQE_IO_LINK; // link the next SQE as a dependent
  // TIMEOUT SQE (linked); the kernel copies ts at submission time
  sqe = io_uring_get_sqe(&ring);
  if (!sqe) return -1;
  io_uring_prep_link_timeout(sqe, &ts, 0 /* flags */);
  io_uring_sqe_set_data64(sqe, tag ^ 0x1ULL); // distinguish the timeout completion
  int submitted = io_uring_submit(&ring);
  return submitted >= 2 ? 0 : -1;
}
// Block until at least one completion is available and reap a single CQE.
static int reap_once(uint64_t *out_tag, int *out_res) {
struct io_uring_cqe *cqe = NULL;
int rc = io_uring_wait_cqe(&ring, &cqe); // blocks until at least one CQE
if (rc != 0) return -1;
*out_tag = io_uring_cqe_get_data64(cqe);
*out_res = cqe->res; // bytes or -errno
io_uring_cqe_seen(&ring, cqe);
return 1;
}
Notes:
- The linked timeout completes with -ETIME if it fires. If the read finishes first, the kernel cancels the timeout internally and you won’t see it (or you’ll see it with -ECANCELED, depending on linkage semantics and kernel version). Always treat each CQE by inspecting res.
- Use distinct user_data values (e.g., tag vs. tag^1) to route completions.
- For large-scale systems, avoid heap allocations per op; embed small context in a slab and pass the pointer via user_data.
Writing with batching
Batching multiple SQEs reduces syscalls and improves throughput. Prepare a handful, then submit once.
static int submit_write_batch(int fd, struct iovec *iov, int iovcnt, off_t off, uint64_t base_tag) {
for (int i = 0; i < iovcnt; ++i) {
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) return -1;
io_uring_prep_writev(sqe, fd, &iov[i], 1, off);
io_uring_sqe_set_data64(sqe, base_tag + (uint64_t)i);
off += (off_t)iov[i].iov_len;
}
int submitted = io_uring_submit(&ring);
return submitted == iovcnt ? 0 : -1;
}
You then call io_uring_peek_cqe() in a loop to drain all available completions without blocking, or io_uring_wait_cqe() when you need to block for progress. Always check res for short writes and retry semantics.
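A sketch of that drain loop, assuming a handle_completion(tag, res) routine of your own and the ring from the setup above:
extern void handle_completion(uint64_t tag, int res); // assumed application routine
// Block for one completion, then drain everything else already available without blocking.
static int drain_cqes(void) {
  struct io_uring_cqe *cqe = NULL;
  int rc = io_uring_wait_cqe(&ring, &cqe); // blocks until at least one CQE
  if (rc != 0) return rc;                  // -errno
  do {
    uint64_t tag = io_uring_cqe_get_data64(cqe);
    int res = cqe->res;                    // bytes or -errno
    handle_completion(tag, res);
    io_uring_cqe_seen(&ring, cqe);
  } while (io_uring_peek_cqe(&ring, &cqe) == 0); // 0 means another CQE is ready
  return 0;
}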
Cancellation patterns that actually work
You generally have two robust levers:
- Linked timeouts (shown above): best default; self-contained and precise.
- Explicit cancel: prepare a cancel SQE targeting a specific user_data or fd when you must abort in-flight ops immediately (e.g., shutdown).
Example (target by user_data):
static int submit_cancel(uint64_t target_tag) {
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) return -1;
io_uring_prep_cancel64(sqe, target_tag, 0 /* flags */);
io_uring_sqe_set_data64(sqe, target_tag ^ 0xCA);
return io_uring_submit(&ring) >= 1 ? 0 : -1;
}
Expectations:
- Cancel returns a CQE with res indicating success (0), not found (-ENOENT), or “couldn’t cancel in time” style results. You may still receive the original op’s CQE later—design handlers to ignore late results via tags/state.
- For mass shutdown, a cancel-by-fd variant (where available) aborts all ops for a descriptor. Use carefully to avoid surprising peer state.
Registered buffers and files (when you’re chasing tail latency)
Registering buffers/files eliminates per-op pinning/lookup overheads:
- io_uring_register_buffers: pin user memory; later refer to it by buffer index
- io_uring_register_files: register fds; refer to them by file slot index
Tradeoffs: complexity and lifetime management increase. Only pull this lever after measuring overhead and confirming hotspots.
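A sketch under the assumption of a small, long-lived buffer pool (slab, register_slab, and submit_fixed_read are illustrative names; the ring comes from the setup above):
static char slab[8][64 * 1024]; // assumed pre-allocated, long-lived buffers
static int register_slab(void) {
  struct iovec regs[8];
  for (int i = 0; i < 8; ++i) {
    regs[i].iov_base = slab[i];
    regs[i].iov_len = sizeof slab[i];
  }
  return io_uring_register_buffers(&ring, regs, 8); // 0 on success, -errno otherwise
}
static int submit_fixed_read(int fd, int buf_index, size_t len, off_t off, uint64_t tag) {
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  if (!sqe) return -1;
  io_uring_prep_read_fixed(sqe, fd, slab[buf_index], (unsigned)len, off, buf_index);
  io_uring_sqe_set_data64(sqe, tag);
  return io_uring_submit(&ring) >= 1 ? 0 : -1;
}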
Footguns to avoid
- Lifetime: the buffer and any context referenced by an SQE must outlive the operation. Free only after you handle its CQE.
- Mixed ownership: don’t close a fd while ops are in flight unless you’re deliberately cancelling-by-fd and will drain completions.
- CQE starvation: always drain the CQ fully; if you only pop one CQE per wakeup, you can stall the ring under load.
- Partial completions: a read/write may return fewer bytes than requested; resubmit remaining spans as needed (sketch after this list).
- Error routing: res is -errno on failure. Log the specific negative code; don’t collapse it into a generic “I/O error.”
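As referenced above, a sketch of resubmitting the remaining span after a short read (struct span and resubmit_remaining are illustrative; per-op state lives wherever your user_data points):
struct span { int fd; char *buf; size_t len; off_t off; uint64_t tag; };
// Call with the positive byte count from the CQE; resubmits whatever is left.
static int resubmit_remaining(struct span *s, int completed_bytes) {
  s->buf += completed_bytes;
  s->len -= (size_t)completed_bytes;
  s->off += completed_bytes;
  if (s->len == 0) return 0; // fully done
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  if (!sqe) return -1;
  io_uring_prep_read(sqe, s->fd, s->buf, (unsigned)s->len, s->off);
  io_uring_sqe_set_data64(sqe, s->tag);
  return io_uring_submit(&ring) >= 1 ? 0 : -1;
}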
Thread-pool async: portable and predictable when done right
When portability or simplicity wins, you can get “async” by performing blocking I/O on worker threads and reporting completion back. The key is to make cancellation and deadlines explicit and to avoid oversubscription.
What you get:
- Works everywhere: files, sockets, oddball drivers, any BSD/Linux/macOS
- Straight-line code inside tasks: you can reuse robust blocking helpers
- Clear ownership: your code fully controls fairness, limits, and shutdown
What you must own:
- Thread count and fairness: don’t create more runnable I/O threads than cores unless they mostly block; isolate CPU-bound from I/O-bound pools
- Cancellation semantics: cooperative cancellation via tokens + cancelable waits
- Deadlines: convert time budgets into poll-based waits; don’t block indefinitely
A minimal fixed-size thread pool
We’ll sketch a small pool with a bounded job queue. Jobs carry a function pointer, a context pointer, and a cancellation flag. Workers pop jobs, check cancellation, run, and post completion.
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
typedef void (*job_fn)(void *arg);
struct job {
job_fn fn;
void *arg;
volatile int cancelled; // cooperative cancellation flag
struct job *next;
};
struct tpool {
pthread_t *threads;
int nthreads;
// simple singly-linked queue protected by a mutex+cond
struct job *head, *tail;
pthread_mutex_t mu;
pthread_cond_t cv;
int stopping; // pool is shutting down
};
static void *tpool_worker(void *arg) {
struct tpool *p = (struct tpool *)arg;
for (;;) {
pthread_mutex_lock(&p->mu);
while (!p->stopping && p->head == NULL) {
pthread_cond_wait(&p->cv, &p->mu);
}
if (p->stopping && p->head == NULL) { pthread_mutex_unlock(&p->mu); break; }
struct job *j = p->head; p->head = j->next; if (!p->head) p->tail = NULL;
pthread_mutex_unlock(&p->mu);
if (!j->cancelled) j->fn(j->arg);
free(j);
}
return NULL;
}
static bool tpool_init(struct tpool *p, int nthreads) {
memset(p, 0, sizeof *p);
p->nthreads = nthreads;
pthread_mutex_init(&p->mu, NULL);
pthread_cond_init(&p->cv, NULL);
p->threads = (pthread_t *)calloc((size_t)nthreads, sizeof *p->threads);
if (!p->threads) return false;
for (int i = 0; i < nthreads; ++i) {
if (pthread_create(&p->threads[i], NULL, tpool_worker, p) != 0) return false;
}
return true;
}
static void tpool_shutdown(struct tpool *p) {
pthread_mutex_lock(&p->mu); p->stopping = 1; pthread_cond_broadcast(&p->cv); pthread_mutex_unlock(&p->mu);
for (int i = 0; i < p->nthreads; ++i) pthread_join(p->threads[i], NULL);
free(p->threads);
pthread_mutex_destroy(&p->mu); pthread_cond_destroy(&p->cv);
}
static struct job *tpool_submit(struct tpool *p, job_fn fn, void *arg) {
struct job *j = (struct job *)calloc(1, sizeof *j);
if (!j) return NULL; j->fn = fn; j->arg = arg; j->next = NULL;
pthread_mutex_lock(&p->mu);
if (p->tail) p->tail->next = j; else p->head = j; p->tail = j;
pthread_cond_signal(&p->cv);
pthread_mutex_unlock(&p->mu);
return j;
}
// NOTE: once a worker frees the job, this handle dangles; a production pool
// would hand out refcounted handles instead of raw job pointers.
static void tpool_cancel(struct job *j) { if (j) j->cancelled = 1; }
This pool intentionally omits fancy features (work stealing, priorities). The point is: bounded, simple, and testable.
Cancelable, deadline-aware blocking I/O
Inside the worker, do not call plain read()
/write()
and hope. Use a poll-based helper that can be canceled cooperatively (e.g., with a self-pipe) and honors deadlines.
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>
struct cancel_fd { int r; int w; };
static int make_nonblocking(int fd) {
int fl = fcntl(fd, F_GETFL, 0); if (fl < 0) return -1;
return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}
static bool cancel_fd_init(struct cancel_fd *c) {
int fds[2]; if (pipe(fds) != 0) return false; (void)make_nonblocking(fds[0]); (void)make_nonblocking(fds[1]);
c->r = fds[0]; c->w = fds[1]; return true;
}
static void cancel_fd_signal(struct cancel_fd *c) { (void)write(c->w, "x", 1); }
static void cancel_fd_close(struct cancel_fd *c) { close(c->r); close(c->w); }
static int ms_left(struct timespec deadline) {
struct timespec now; clock_gettime(CLOCK_MONOTONIC, &now);
long ms = (long)((deadline.tv_sec - now.tv_sec) * 1000) + (long)((deadline.tv_nsec - now.tv_nsec) / 1000000);
if (ms < 0) return 0; if (ms > 0x3fffffff) return 0x3fffffff; return (int)ms;
}
// Read up to len bytes before deadline or cancellation. Returns bytes read (>=0), 0 on EOF, or -1 on error.
static ssize_t read_cancellable(int fd, void *buf, size_t len, struct timespec deadline, int cancel_rfd) {
size_t used = 0;
for (;;) {
ssize_t r = read(fd, (char *)buf + used, len - used);
if (r > 0) { used += (size_t)r; if (used == len) return (ssize_t)used; continue; }
if (r == 0) return (ssize_t)used; // EOF (a nonblocking fd with no data returns -1/EAGAIN, not 0)
if (errno == EINTR) continue;
if (errno == EAGAIN || errno == EWOULDBLOCK) {
struct pollfd pfds[2] = { { .fd = fd, .events = POLLIN }, { .fd = cancel_rfd, .events = POLLIN } };
int rdy = poll(pfds, 2, ms_left(deadline));
if (rdy == 0) return (ssize_t)used; // timeout; partial OK
if (rdy < 0 && errno == EINTR) continue;
if (rdy < 0) return -1; // error
if (pfds[1].revents) { errno = ECANCELED; return -1; }
continue; // readable
}
return -1; // hard error
}
}
This turns blocking I/O into a cancelable, deadline-aware loop. For writes, mirror the logic with partial sends and POLLOUT, as in the sketch below.
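A mirrored sketch for writes, reusing ms_left from above (write_cancellable is an illustrative name):
// Write up to len bytes before deadline or cancellation. Returns bytes written (>=0) or -1 on error.
static ssize_t write_cancellable(int fd, const void *buf, size_t len, struct timespec deadline, int cancel_rfd) {
  size_t sent = 0;
  for (;;) {
    ssize_t w = write(fd, (const char *)buf + sent, len - sent);
    if (w > 0) { sent += (size_t)w; if (sent == len) return (ssize_t)sent; continue; }
    if (w < 0 && errno == EINTR) continue;
    if (w < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      struct pollfd pfds[2] = { { .fd = fd, .events = POLLOUT }, { .fd = cancel_rfd, .events = POLLIN } };
      int rdy = poll(pfds, 2, ms_left(deadline));
      if (rdy == 0) return (ssize_t)sent; // deadline; partial OK
      if (rdy < 0 && errno == EINTR) continue;
      if (rdy < 0) return -1; // error
      if (pfds[1].revents) { errno = ECANCELED; return -1; }
      continue; // writable again
    }
    return -1; // hard error
  }
}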
Putting it together: an async read job
Define a small job context that owns the fd, buffer, deadline, and cancel handle. The job runs in the pool; the caller can cancel cooperatively and receives the result via a user-provided callback.
typedef void (*read_done_cb)(ssize_t result, void *udata);
struct async_read_job {
int fd; void *buf; size_t len; off_t off;
struct timespec deadline;
struct cancel_fd cancel;
read_done_cb cb; void *udata;
};
static void async_read_run(void *arg) {
struct async_read_job *j = (struct async_read_job *)arg;
// Optional: pre-position with lseek if off >= 0
if (j->off >= 0) (void)lseek(j->fd, j->off, SEEK_SET);
(void)make_nonblocking(j->fd);
ssize_t r = read_cancellable(j->fd, j->buf, j->len, j->deadline, j->cancel.r);
j->cb(r, j->udata);
cancel_fd_close(&j->cancel);
free(j);
}
// Submits an async read; returns a handle you can cancel.
struct job *async_read_submit(struct tpool *p, int fd, void *buf, size_t len, off_t off, int timeout_ms, read_done_cb cb, void *ud) {
struct async_read_job *j = (struct async_read_job *)calloc(1, sizeof *j);
if (!j) return NULL;
j->fd = fd; j->buf = buf; j->len = len; j->off = off; j->cb = cb; j->udata = ud;
clock_gettime(CLOCK_MONOTONIC, &j->deadline);
j->deadline.tv_sec += timeout_ms / 1000;
j->deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
if (j->deadline.tv_nsec >= 1000000000L) { j->deadline.tv_sec += 1; j->deadline.tv_nsec -= 1000000000L; }
if (!cancel_fd_init(&j->cancel)) { free(j); return NULL; }
struct job *h = tpool_submit(p, async_read_run, j);
if (!h) { cancel_fd_close(&j->cancel); free(j); }
return h;
}
// Cooperative cancellation: signal the cancel pipe; worker observes and exits early.
void async_read_cancel(struct job *h) { tpool_cancel(h); /* also signal in case waiting */ }
For cancellation to be prompt, call cancel_fd_signal(&job->cancel) when canceling (e.g., stash a map from job* to cancel_fd). The illustration above keeps code compact; in production you’ll wrap the handle to expose both cancel() and join()/on_complete().
Dispatching results back to your main loop
If your system uses a single-threaded event loop, you don’t want callbacks running on worker threads mutating shared state. A simple pattern:
- Worker threads push results onto a lock-free or mutex-protected queue and write a byte to a loop wakeup fd (eventfd or self-pipe)
- The loop reads and drains the completion queue, executing callbacks in the loop thread
This preserves single-threaded invariants while still offloading blocking I/O.
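A minimal sketch of that hand-off, reusing the headers from the pool code above; post_result, drain_results, and loop_wakeup are illustrative names, and the wakeup pipe’s read end is assumed nonblocking (e.g., via make_nonblocking):
struct result { ssize_t res; void *udata; struct result *next; };
static pthread_mutex_t result_mu = PTHREAD_MUTEX_INITIALIZER;
static struct result *result_q = NULL;
static int loop_wakeup[2]; // [0]=read end polled by the loop, [1]=write end for workers
static void post_result(ssize_t res, void *udata) { // called on a worker thread
  struct result *r = calloc(1, sizeof *r);
  if (!r) return;
  r->res = res; r->udata = udata;
  pthread_mutex_lock(&result_mu);
  r->next = result_q; result_q = r;
  pthread_mutex_unlock(&result_mu);
  (void)write(loop_wakeup[1], "x", 1); // wake the event loop
}
static void drain_results(void (*cb)(ssize_t, void *)) { // called on the loop thread
  char junk[64];
  while (read(loop_wakeup[0], junk, sizeof junk) > 0) {} // clear wakeups (nonblocking read end)
  pthread_mutex_lock(&result_mu);
  struct result *r = result_q; result_q = NULL;
  pthread_mutex_unlock(&result_mu);
  while (r) { struct result *n = r->next; cb(r->res, r->udata); free(r); r = n; }
}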
Tuning and pitfalls
- Right-size the pool: start with min(4, num_cores) for I/O-heavy tasks and measure. Keep separate pools for CPU-heavy and I/O-heavy work.
- Bound queues: reject/shed when the job queue grows beyond a threshold; surface backpressure instead of OOM.
- Cooperative cancellation only: POSIX thread cancellation is perilous. Prefer explicit tokens and cancelable waits.
- Resource lifetime: ensure buffers and fds outlive the job; free only after completion handling.
- Signals and SIGPIPE: adopt the same signal discipline as in the robust I/O post (ignore SIGPIPE, use SA_RESTART judiciously).
A thin, unified abstraction (pick a backend, keep the app simple)
You don’t want the rest of your codebase to care whether the engine is POSIX AIO, io_uring, or a thread pool. Standardize a small interface with consistent semantics, then plug in a backend per platform.
Design goals:
- Uniform result reporting: bytes on success, -errno on failure
- Explicit deadlines/cancellation
- Buffer/offset lifetime rules that are the same everywhere
- Pluggable backends with identical call sites
Minimal API shape
#include <stdint.h>
#include <sys/uio.h>
typedef uint64_t aio_token; // unique per submitted op
enum aio_kind { AIO_READ, AIO_WRITE };
struct aio_req {
enum aio_kind kind;
int fd;
struct iovec iov; // single-span for simplicity; extend to iovecs as needed
off_t offset; // -1 for current position
int timeout_ms; // 0 = no deadline
uint64_t user_tag; // echoed on completion
};
struct aio_cqe {
aio_token token;
uint64_t user_tag;
int res; // >=0 = bytes, <0 = -errno
};
struct aio;
// Create/destroy with a chosen backend ("io_uring", "posix_aio", "threads").
struct aio *aio_create(const char *backend, int queue_depth);
void aio_destroy(struct aio *A);
// Submit one request; returns 0 on success and sets *tok; negative errno on failure.
int aio_submit(struct aio *A, const struct aio_req *req, aio_token *tok);
// Best-effort cancel; returns 0 on submitted, -ENOENT if not found.
int aio_cancel(struct aio *A, aio_token tok);
// Reap at most cap completions; returns count (>=0) or -errno; non-blocking when timeout_ms==0.
int aio_reap(struct aio *A, struct aio_cqe *out, int cap, int timeout_ms);
Semantics:
- Offsets: if offset >= 0, the backend uses positioned I/O (pread/pwrite-style or preadv2/pwritev2 variants where applicable). Otherwise, it uses the current file position (seek once, then nonblocking/poll for the thread backend).
- Deadlines: if timeout_ms > 0, backends should enforce it (linked timeouts for io_uring; deadline-bounded waits for the others) and return -ETIME when expired.
- Cancellation: best-effort; late completions are possible. App code must be idempotent.
Adapter sketch: io_uring backend
// On create: io_uring_queue_init(queue_depth, &ring, flags)
// On submit: prep READV/WRITEV; set user_data=token; if timeout_ms>0, link a timeout SQE
// On reap: io_uring_peek_batch_cqe / io_uring_wait_cqe; fill aio_cqe
// On cancel: prep cancel64 with token
Notes:
- Register buffers/files after measuring; it’s a tail-latency lever.
- Drain CQEs fully per wakeup to avoid ring stalls.
Adapter sketch: POSIX AIO backend
// On create: allocate a table of aiocb slots and a wait set
// On submit: fill aiocb; aio_read/aio_write; remember token↔aiocb; start a timer thread or use aio_suspend with a global deadline wheel
// On reap: scan for completed via aio_error==0; aio_return; emit cqe
// On cancel: aio_cancel(fd, aiocb*) and mark token as cancelled
Notes:
- Use aio_suspend over signals for predictability; multiplex deadlines with a timer heap.
- Expect best-effort cancellation; return -ETIME from your own deadline logic even when the kernel later completes the op—drop it by token.
Adapter sketch: thread-pool backend
// On create: start fixed-size pool; create a completion queue and a wakeup fd
// On submit: package req in a job; worker performs cancelable reads/writes with poll; push cqe on completion queue
// On reap: drain completion queue; block with poll/select on the wakeup fd when timeout_ms>0
// On cancel: set job->cancel flag and signal its cancel pipe
Notes:
- Keep separate pools for CPU-bound and I/O-bound tasks to avoid head-of-line blocking.
- Bound the job queue; surface backpressure instead of unbounded memory growth.
Feature probing and selection
- Prefer io_uring when available: attempt io_uring_queue_init; on failure, fall back to POSIX AIO, then to threads.
- Allow runtime/environment overrides (e.g., AIO_BACKEND=threads) for diagnostics.
- Log the selected backend and queue depth on startup. A selection sketch follows.
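A sketch of that selection order, assuming aio_create returns NULL on failure (aio_create_auto is an illustrative wrapper):
#include <stdio.h>
#include <stdlib.h>
struct aio *aio_create_auto(int queue_depth) {
  const char *forced = getenv("AIO_BACKEND"); // diagnostics override
  if (forced) {
    struct aio *A = aio_create(forced, queue_depth);
    if (A) { fprintf(stderr, "aio: backend=%s depth=%d (forced)\n", forced, queue_depth); return A; }
  }
  static const char *order[] = { "io_uring", "posix_aio", "threads" };
  for (size_t i = 0; i < sizeof order / sizeof order[0]; ++i) {
    struct aio *A = aio_create(order[i], queue_depth);
    if (A) { fprintf(stderr, "aio: backend=%s depth=%d\n", order[i], queue_depth); return A; }
  }
  return NULL; // nothing usable
}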
Error normalization and observability
- Always surface -errno in res. Don’t translate; callers can switch on -EAGAIN, -EPIPE, -ECANCELED, -ETIME.
- Attach the original user_tag so higher layers can correlate with requests.
- Instrument: submissions, completions (bytes/errnos), time-in-queue, deadlines met/missed, cancels requested/satisfied.
Choosing wisely (quick guidance)
- Need Linux-only, high-throughput, low tail latency, and are comfortable with kernel specifics? io_uring.
- Need portability across BSD/macOS/Linux with decent file I/O completion and can tolerate implementation variance? POSIX AIO.
- Need universal coverage and simpler reasoning, with explicit fairness and deadlines? Thread-pool async (with disciplined cancelable waits).
Hybrid is common: io_uring for hot file/network paths on Linux; thread-pool for everything else and for portability.
Production checklist
- Define a single async API and hide backend differences behind adapters
- Require deadlines on every external I/O; return -ETIME when missed
- Make cancellation idempotent; tolerate late completions via tokens
- Drain completion queues fully; never starve your ring or result queues
- Bound in-flight ops and job queues; apply backpressure at admission
- Normalize errors as -errno; log with fd, offset, bytes, errno, latency
- Test with tiny buffers, forced EAGAIN, injected signals, and timeouts
- Measure under realistic concurrency; validate tail latency, not just p50
Closing thoughts
Async I/O in C isn’t about picking the fanciest API; it’s about enforcing simple, explicit contracts: bytes or -errno, deadlines that bite, and cancellation that’s safe to reason about. Wrap those contracts in a tiny abstraction, choose the best backend per platform, and your code stays small, predictable, and fast—without 3 a.m. incidents.