Async I/O in C: POSIX AIO vs io_uring vs Threads

Published: October 21, 2016 (8y ago) · 22 min read

Updated: December 9, 2022 (2y ago)

You tried to make your C program “asynchronous,” flipped a flag, and suddenly your code either spins, blocks in weird places, or delivers callbacks in the wrong thread. Been there. The fix isn’t a magic API; it’s picking the right async model for your workload and then using it with boring discipline.

This post is a practical tour of three real-world approaches to async I/O in C:

  • POSIX AIO: standardized, file-centric completion I/O
  • io_uring: modern Linux completion queues with batching and low overhead
  • Thread-pool async: wrap blocking I/O with worker threads and explicit cancellation
[Diagram: the three completion flows side by side. POSIX AIO: submit aio_read/aio_write, then aio_suspend/poll, then aio_error/aio_return, then handle completion. io_uring: fill SQE, io_uring_submit, io_uring_wait_cqe, process the CQE batch, io_uring_cqe_seen. Thread pool: submit job, worker picks it up, blocking I/O + poll, post completion, main thread processes. Summary: POSIX AIO is portable and standard with variable performance; io_uring is high-performance and low-overhead but Linux-only; a thread pool is universal, debuggable, and predictable with some resource overhead.]

We’ll keep the vibe production-first: simple contracts, robust loops, deadline-aware cancellation, and APIs that are testable without a live kernel circus.

What “async” actually means here

Clarify the vocabulary so we don’t talk past each other:

  • Blocking I/O: a syscall may park your calling thread until progress is possible. Simple to write; hard to scale.
  • Nonblocking I/O (readiness): syscalls return immediately with EAGAIN when progress isn’t possible; you multiplex with epoll/kqueue. Great for many sockets; you manage partials and backpressure. See the groundwork in the I/O patterns and event-loop posts.
  • Completion I/O: you submit work and later receive a completion (callback, signal, or queue entry) with the result. That’s POSIX AIO and io_uring territory.
  • Thread-pool async: you keep the API “async” by doing blocking work elsewhere and reporting completion back. Works everywhere; you must own fairness, limits, and cancellation semantics.

Key design dimensions we’ll use throughout:

  • Latency: median and tail; syscall/batching overheads; context switches
  • Throughput: ops/s under load; batching; PCIe/storage queue utilization
  • Complexity: API surface, footguns, and operational know-how
  • Portability: which platforms you can ship today without conditional jungles
  • Cancellation: precise, prompt, safe to reason about

Ground contract you cannot dodge

Before diving into completion APIs, remember what individual reads/writes actually promise:

  • Success does not mean “all of it.” Reads/writes can be short; your code must loop.
  • EINTR happens; retry-friendly loops are mandatory.
  • On nonblocking fds, EAGAIN is a scheduling signal, not an error.
  • EOF (read() returns 0) is a first-class outcome; treat it explicitly.

If those statements raise an eyebrow, pause and adopt the robust helpers and event-loop patterns from the foundational posts. Async doesn’t absolve you from those truths—it just changes how you’re notified.
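
As a refresher, here's what that contract looks like in code; a minimal sketch (the helper name read_full is ours) that loops on short reads, retries on EINTR, and treats EOF as a normal outcome:

#include <errno.h>
#include <unistd.h>
 
// Read exactly len bytes from a blocking fd unless EOF or a hard error intervenes.
// Returns bytes read (may be < len on early EOF) or -1 with errno set.
static ssize_t read_full(int fd, void *buf, size_t len) {
  size_t used = 0;
  while (used < len) {
    ssize_t r = read(fd, (char *)buf + used, len - used);
    if (r > 0) { used += (size_t)r; continue; } // short read: keep looping
    if (r == 0) break;                          // EOF: a first-class outcome
    if (errno == EINTR) continue;               // interrupted: retry
    // On a nonblocking fd, EAGAIN lands here; you'd poll and retry
    // instead of failing (see the event-loop posts).
    return -1;
  }
  return (ssize_t)used;
}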


POSIX AIO in practice (what it is and what it isn’t)

POSIX AIO is a standardized completion I/O API built around the struct aiocb control block and functions like aio_read(), aio_write(), aio_error(), aio_return(), and aio_suspend().

What it’s good at:

  • Regular files on POSIX systems: overlapped reads/writes, especially when the underlying OS/device pipeline benefits from queueing requests.
  • Batch submission via lio_listio() and waiting with aio_suspend() for multiple completions.
  • Notification options: poll with aio_error(), block on a set with aio_suspend(), or request delivery via signals or SIGEV_THREAD callbacks.

What it’s not great at:

  • Sockets and pipes vary by platform: some implementations don’t support them or degrade to thread-based emulation. Always verify your target OS semantics.
  • Cancellation is best-effort: aio_cancel() may report that an operation is in progress and cannot be canceled; design for idempotent completions.
  • Portability wrinkles: API exists everywhere POSIX-ish, but performance and coverage differ notably between, say, Linux and BSDs.

The core flow

  1. Fill an aiocb with the file descriptor, buffer, length, and offset. Optionally set a sigevent for notification.
  2. Call aio_read()/aio_write(); it returns immediately.
  3. Later, check completion via aio_error()/aio_return() or wait on multiple with aio_suspend().

Minimal example (polling for completion):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
 
int read_file_async(const char *path, off_t off, void *buf, size_t len) {
  int fd = open(path, O_RDONLY | O_CLOEXEC);
  if (fd < 0) return -1;
 
  struct aiocb cb;
  memset(&cb, 0, sizeof cb);
  cb.aio_fildes = fd;
  cb.aio_buf    = buf;
  cb.aio_nbytes = len;
  cb.aio_offset = off;
 
  if (aio_read(&cb) != 0) { close(fd); return -1; }
 
  // Busy-wait/poll style (replace with aio_suspend for sets and timeouts)
  for (;;) {
    int e = aio_error(&cb);
    if (e == 0) break;             // completed
    if (e == EINPROGRESS) continue; // not done yet
    // error path
    close(fd);
    return -1;
  }
 
  ssize_t r = aio_return(&cb);
  close(fd);
  return (int)r; // bytes read or -1
}

Waiting on a set with a bounded timeout is usually cleaner:

#include <time.h>
 
// Wait until any of the operations in cbv[0..n) completes or the timeout expires.
// Note: aio_suspend takes a *relative* timeout, not an absolute deadline.
// Returns index of a completed aiocb, or -1 on timeout/error.
int wait_any(const struct aiocb *const cbv[], int n, const struct timespec *timeout) {
  int rc = aio_suspend(cbv, n, timeout); // 0 on any completion
  if (rc != 0) return -1; // errno = EAGAIN on timeout, or EINTR if interrupted
  for (int i = 0; i < n; ++i) {
    if (cbv[i] && aio_error(cbv[i]) == 0) return i;
  }
  return -1; // none completed (timeout or interrupted)
}

Batch submission with lio_listio():

// Submit a batch and wait for all to complete.
int read_many(struct aiocb **list, int n) {
  if (lio_listio(LIO_WAIT, list, n, NULL) != 0) return -1; // synchronous wait mode
  // In LIO_NOWAIT mode, you’d use aio_suspend or notifications
  for (int i = 0; i < n; ++i) {
    if (aio_error(list[i]) != 0) return -1;
    ssize_t r = aio_return(list[i]);
    if (r < 0 || (size_t)r != list[i]->aio_nbytes) return -1; // error or short transfer (treated as failure here)
  }
  return 0;
}

Notifications and callbacks

struct sigevent lets you choose how the OS tells you about completion:

  • SIGEV_NONE: you poll
  • SIGEV_SIGNAL: deliver a signal with sigev_signo
  • SIGEV_THREAD: invoke a user-supplied function on a system-managed thread

Signal delivery intertwines with your process-wide signal strategy. If you already centralize signals via sigaction, ppoll/pselect, and a self-pipe or eventfd (see robust I/O patterns), prefer polling/suspend over signal-driven callbacks to keep control flow testable.

Thread-callbacks (SIGEV_THREAD) are convenient but easy to misuse: completions now run on foreign threads. Guard shared data with proper synchronization and keep handlers tiny.
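
If you do opt into SIGEV_THREAD, here's roughly what the wiring looks like; a minimal sketch where the handler on_aio_done and the arming helper are ours, and the handler only validates completion before handing results off:

#include <aio.h>
#include <signal.h>
 
// Completion handler: runs on a system-managed thread, so keep it tiny and
// hand results off to a queue your main loop drains.
static void on_aio_done(union sigval sv) {
  struct aiocb *cb = (struct aiocb *)sv.sival_ptr;
  if (aio_error(cb) == 0) {
    ssize_t n = aio_return(cb);
    (void)n;                                           // enqueue {cb, n} for the main loop here
  }
}
 
// Arm the callback before calling aio_read()/aio_write() on this aiocb.
static void arm_thread_callback(struct aiocb *cb) {
  cb->aio_sigevent.sigev_notify            = SIGEV_THREAD;
  cb->aio_sigevent.sigev_notify_function   = on_aio_done;
  cb->aio_sigevent.sigev_notify_attributes = NULL;     // default thread attributes
  cb->aio_sigevent.sigev_value.sival_ptr   = cb;       // context handed to the callback
}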

Cancellation (best-effort, design accordingly)

aio_cancel(fd, aiocb*) attempts to cancel an outstanding operation. Outcomes include:

  • Canceled before starting
  • In progress (cannot cancel)
  • Not found on that fd

Treat cancellations as advisory: make completion handlers idempotent, and tolerate a late completion after you thought you canceled. Always encode operation identity (e.g., a monotonically increasing token) so a late completion can be ignored safely by the consumer that requested cancellation.

Minimal cancellation-aware shape:

struct op {
  volatile int cancelled; // set to 1 when cancel requested
  // ... aiocb, buffer, bookkeeping
};
 
void on_complete(struct op *o, ssize_t result) {
  if (o->cancelled) {
    // Drop results; caller moved on
    return;
  }
  // Deliver result
}

Practical guidance for POSIX AIO

  • Prefer aio_suspend over signal-driven completion to keep control flow explicit and testable.
  • Use fixed buffers, avoid stack-lifetime surprises; lifetime must exceed the async op.
  • Treat partial completions and short counts as routine; verify aio_return bytes.
  • For time-bounded work, use deadline-based waits: track an absolute deadline, then convert it to the remaining relative timeout each time you call aio_suspend (which takes a relative timeout, not an absolute deadline); see the sketch after this list.
  • Maintain per-file or per-component in-flight caps; AIO queues can become a surprise memory sink.
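
Here's the deadline-to-timeout conversion the guidance above refers to; a minimal sketch assuming a CLOCK_MONOTONIC deadline (the helper name remaining_timeout is ours):

#include <time.h>
 
// Convert an absolute CLOCK_MONOTONIC deadline into the relative timeout that
// aio_suspend expects. Returns {0, 0} once the deadline has passed.
static struct timespec remaining_timeout(struct timespec deadline) {
  struct timespec now, rel = {0, 0};
  clock_gettime(CLOCK_MONOTONIC, &now);
  rel.tv_sec  = deadline.tv_sec - now.tv_sec;
  rel.tv_nsec = deadline.tv_nsec - now.tv_nsec;
  if (rel.tv_nsec < 0) { rel.tv_sec -= 1; rel.tv_nsec += 1000000000L; }
  if (rel.tv_sec < 0) { rel.tv_sec = 0; rel.tv_nsec = 0; }
  return rel;
}
 
// Usage: recompute the remainder before each wait_any() call and stop once it hits zero.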

Where we’re headed next

We’ve set the mental model and covered POSIX AIO’s contract, flow, and pitfalls. Next we’ll dig into io_uring: submission/completion rings, batching, linked timeouts, and sharp but powerful edges—and then contrast all of this with a disciplined thread-pool async design that works everywhere.


io_uring: completion queues with real teeth

io_uring delivers kernel-backed submission and completion queues that drastically reduce syscall overhead and enable powerful batching/linking semantics. Unlike readiness APIs, you describe the operation up front (fd, buffers, offsets), submit SQEs (Submission Queue Entries), and later reap CQEs (Completion Queue Entries) with results.

High-level wins:

  • Low overhead: shared memory rings; many submissions/completions per syscall
  • Batching: submit a group in one go; amortize costs
  • Chaining: link operations so one runs only after the previous completes
  • Linked timeouts: attach a deadline to any operation without extra machinery

Constraints:

  • Linux-only; requires relatively recent kernels for advanced ops
  • Complexity: lifetime, buffer registration, and chaining semantics are sharp edges
  • You own partials/backpressure semantics just like any I/O path

The mental model: SQ and CQ

  • The Submission Queue (SQ) holds descriptors of operations you want the kernel to perform. You fill SQEs and submit.
  • The Completion Queue (CQ) holds results. You poll/await and process CQEs, each carrying the user_data you attached to its SQE and a res status/byte-count.
[Diagram: io_uring ring architecture. The application fills SQEs (fd, op, buffers, user_data), including a linked-timeout SQE, and batch-submits with io_uring_submit(). The kernel processes operations asynchronously and posts completions (res, user_data) to the CQ, including -ETIME for a fired timeout. The application waits with io_uring_wait_cqe(), processes the CQE batch, and marks entries consumed with io_uring_cqe_seen(). Batched submission and completion reduce syscalls.]

You'll often use the excellent liburing helpers to avoid writing raw ring plumbing.

Minimal setup with liburing

#include <liburing.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>
 
struct io_uring ring;
 
static int ring_init(unsigned entries) {
  int rc = io_uring_queue_init(entries, &ring, 0);
  return rc; // 0 on success
}
 
static void ring_close(void) {
  io_uring_queue_exit(&ring);
}

Submitting a read with a linked timeout

This pattern is the workhorse for bounded-latency I/O: one SQE does the I/O, another SQE adds a timeout, and they’re linked so the timeout cancels the I/O if it doesn’t complete in time.

#include <time.h>
 
// If you defer submission (e.g., batch SQEs across calls) or run with SQPOLL,
// keep the iovec in a context like this so it outlives the call; the function
// below submits before returning, so its stack iovec is acceptable.
struct read_ctx {
  int fd;
  struct iovec iov;
};
 
// Submit one readv with a linked timeout (milliseconds). Returns 0 on submission success.
static int submit_read_with_timeout(int fd, void *buf, size_t len, off_t off, unsigned timeout_ms, uint64_t tag) {
  struct io_uring_sqe *sqe;
  struct __kernel_timespec ts = { .tv_sec = (time_t)(timeout_ms / 1000), .tv_nsec = (long)(timeout_ms % 1000) * 1000000L };
 
  // READV SQE
  struct iovec iov = { .iov_base = buf, .iov_len = len };
  sqe = io_uring_get_sqe(&ring);
  if (!sqe) return -1;
  io_uring_prep_readv(sqe, fd, &iov, 1, off);
  io_uring_sqe_set_data64(sqe, tag);              // attach user tag
  sqe->flags |= IOSQE_IO_LINK;                    // link next SQE as a dependent
 
  // TIMEOUT SQE (linked)
  sqe = io_uring_get_sqe(&ring);
  if (!sqe) return -1;
  io_uring_prep_link_timeout(sqe, &ts, 0 /* flags */);
  io_uring_sqe_set_data64(sqe, tag ^ 0x1ULL);     // distinguish timeout completion
 
  int submitted = io_uring_submit(&ring);
  return submitted >= 2 ? 0 : -1;
}
 
// Wait for at least one completion and drain any available.
static int reap_once(uint64_t *out_tag, int *out_res) {
  struct io_uring_cqe *cqe = NULL;
  int rc = io_uring_wait_cqe(&ring, &cqe); // blocks until at least one CQE
  if (rc != 0) return -1;
  *out_tag = io_uring_cqe_get_data64(cqe);
  *out_res = cqe->res;     // bytes or -errno
  io_uring_cqe_seen(&ring, cqe);
  return 1;
}

Notes:

  • The linked timeout completes with -ETIME if it fires. If the read finishes first, the kernel cancels the timeout internally and you won’t see it (or you’ll see it with -ECANCELED depending on linkage semantics and kernel version). Always treat each CQE by inspecting res.
  • Use distinct user_data (e.g., tag vs tag^1) to route completions.
  • For large-scale systems, avoid heap allocations per op; embed small context in a slab and pass the pointer via user_data.

Writing with batching

Batching multiple SQEs reduces syscalls and improves throughput. Prepare a handful, then submit once.

static int submit_write_batch(int fd, struct iovec *iov, int iovcnt, off_t off, uint64_t base_tag) {
  for (int i = 0; i < iovcnt; ++i) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) return -1;
    io_uring_prep_writev(sqe, fd, &iov[i], 1, off);
    io_uring_sqe_set_data64(sqe, base_tag + (uint64_t)i);
    off += (off_t)iov[i].iov_len;
  }
  int submitted = io_uring_submit(&ring);
  return submitted == iovcnt ? 0 : -1;
}

You then call io_uring_peek_cqe() in a loop to drain all available completions without blocking, or io_uring_wait_cqe() when you need to block for progress. Always check res for short writes and retry semantics.
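
A minimal drain loop along those lines might look like this, assuming the same global ring and a per-completion handler of your own:

// Drain every CQE currently available without blocking; route each by tag.
static int drain_completions(void (*handle)(uint64_t tag, int res)) {
  struct io_uring_cqe *cqe;
  int drained = 0;
  while (io_uring_peek_cqe(&ring, &cqe) == 0) {
    handle(io_uring_cqe_get_data64(cqe), cqe->res);   // res: bytes or -errno
    io_uring_cqe_seen(&ring, cqe);
    ++drained;
  }
  return drained;                                     // 0 means the CQ was empty
}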

Cancellation patterns that actually work

You generally have two robust levers:

  1. Linked timeouts (shown above): best default; self-contained and precise.
  2. Explicit cancel: prepare a cancel SQE targeting a specific user_data or fd when you must abort in-flight ops immediately (e.g., shutdown).

Example (target by user_data):

static int submit_cancel(uint64_t target_tag) {
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  if (!sqe) return -1;
  io_uring_prep_cancel64(sqe, target_tag, 0 /* flags */);
  io_uring_sqe_set_data64(sqe, target_tag ^ 0xCA);
  return io_uring_submit(&ring) >= 1 ? 0 : -1;
}

Expectations:

  • Cancel returns a CQE with res indicating success (0), not found (-ENOENT), or “couldn’t cancel in time” style results. You may still receive the original op’s CQE later—design handlers to ignore late results via tags/state.
  • For mass shutdown, a cancel-by-fd variant (where available) aborts all ops for a descriptor. Use carefully to avoid surprising peer state.

Registered buffers and files (when you’re chasing tail latency)

Registering buffers/files eliminates per-op pinning/lookup overheads:

  • io_uring_register_buffers: pin user memory; later refer by buffer index
  • io_uring_register_files: register fds; refer by file slot index

Tradeoffs: complexity and lifetime management increase. Only pull this lever after measuring overhead and confirming hotspots.
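
If you do pull the lever, the shape looks roughly like this; a sketch with one pinned buffer at index 0 (the buffer and helper names are ours):

// One pinned buffer at index 0; subsequent reads refer to it by that index.
static char fixed_buf[1 << 20];
 
static int register_fixed_buffer(void) {
  struct iovec reg = { .iov_base = fixed_buf, .iov_len = sizeof fixed_buf };
  return io_uring_register_buffers(&ring, &reg, 1);   // 0 on success, -errno on failure
}
 
static int submit_fixed_read(int fd, size_t len, off_t off, uint64_t tag) {
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  if (!sqe || len > sizeof fixed_buf) return -1;
  io_uring_prep_read_fixed(sqe, fd, fixed_buf, (unsigned)len, off, 0); // buf_index 0 = registered above
  io_uring_sqe_set_data64(sqe, tag);
  return io_uring_submit(&ring) >= 1 ? 0 : -1;
}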

Footguns to avoid

  • Lifetime: the buffer and any context referenced by an SQE must outlive the operation. Free only after you handle its CQE.
  • Mixed ownership: don’t close a fd while ops are in flight unless you’re deliberately cancelling-by-fd and will drain completions.
  • CQE starvation: always drain the CQ fully; if you only pop one CQE per wakeup, you can stall the ring under load.
  • Partial completions: a read/write may return fewer bytes than requested; resubmit remaining spans as needed.
  • Error routing: res is -errno on failure. Log the specific negative code; don’t collapse into generic “I/O error.”

Thread-pool async: portable and predictable when done right

When portability or simplicity wins, you can get “async” by performing blocking I/O on worker threads and reporting completion back. The key is to make cancellation and deadlines explicit and to avoid oversubscription.

What you get:

  • Works everywhere: files, sockets, oddball drivers, any BSD/Linux/macOS
  • Straight-line code inside tasks: you can reuse robust blocking helpers
  • Clear ownership: your code fully controls fairness, limits, and shutdown

What you must own:

  • Thread count and fairness: don’t create more runnable I/O threads than cores unless they mostly block; isolate CPU-bound from I/O-bound pools
  • Cancellation semantics: cooperative cancellation via tokens + cancelable waits
  • Deadlines: convert time budgets into poll-based waits; don’t block indefinitely

A minimal fixed-size thread pool

We’ll sketch a small pool with a bounded job queue. Jobs carry a function pointer, a context pointer, and a cancellation flag. Workers pop jobs, check cancellation, run, and post completion.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
 
typedef void (*job_fn)(void *arg);
 
struct job {
  job_fn fn;
  void  *arg;
  volatile int cancelled; // cooperative cancellation flag
  struct job *next;
};
 
struct tpool {
  pthread_t *threads;
  int        nthreads;
  // simple singly-linked queue protected by a mutex+cond
  struct job *head, *tail;
  pthread_mutex_t mu;
  pthread_cond_t  cv;
  int stopping; // pool is shutting down
};
 
static void *tpool_worker(void *arg) {
  struct tpool *p = (struct tpool *)arg;
  for (;;) {
    pthread_mutex_lock(&p->mu);
    while (!p->stopping && p->head == NULL) {
      pthread_cond_wait(&p->cv, &p->mu);
    }
    if (p->stopping && p->head == NULL) { pthread_mutex_unlock(&p->mu); break; }
    struct job *j = p->head; p->head = j->next; if (!p->head) p->tail = NULL;
    pthread_mutex_unlock(&p->mu);
    if (!j->cancelled) j->fn(j->arg);
    free(j);
  }
  return NULL;
}
 
static bool tpool_init(struct tpool *p, int nthreads) {
  memset(p, 0, sizeof *p);
  p->nthreads = nthreads;
  pthread_mutex_init(&p->mu, NULL);
  pthread_cond_init(&p->cv, NULL);
  p->threads = (pthread_t *)calloc((size_t)nthreads, sizeof *p->threads);
  if (!p->threads) return false;
  for (int i = 0; i < nthreads; ++i) {
    if (pthread_create(&p->threads[i], NULL, tpool_worker, p) != 0) return false;
  }
  return true;
}
 
static void tpool_shutdown(struct tpool *p) {
  pthread_mutex_lock(&p->mu); p->stopping = 1; pthread_cond_broadcast(&p->cv); pthread_mutex_unlock(&p->mu);
  for (int i = 0; i < p->nthreads; ++i) pthread_join(p->threads[i], NULL);
  free(p->threads);
  pthread_mutex_destroy(&p->mu); pthread_cond_destroy(&p->cv);
}
 
static struct job *tpool_submit(struct tpool *p, job_fn fn, void *arg) {
  struct job *j = (struct job *)calloc(1, sizeof *j);
  if (!j) return NULL; j->fn = fn; j->arg = arg; j->next = NULL;
  pthread_mutex_lock(&p->mu);
  if (p->tail) p->tail->next = j; else p->head = j; p->tail = j;
  pthread_cond_signal(&p->cv);
  pthread_mutex_unlock(&p->mu);
  return j;
}
 
static void tpool_cancel(struct job *j) { if (j) j->cancelled = 1; }

This pool intentionally omits fancy features (work stealing, priorities). The point is: bounded, simple, and testable.

Cancelable, deadline-aware blocking I/O

Inside the worker, do not call plain read()/write() and hope. Use a poll-based helper that can be canceled cooperatively (e.g., with a self-pipe) and honors deadlines.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>
 
struct cancel_fd { int r; int w; };
 
static int make_nonblocking(int fd) {
  int fl = fcntl(fd, F_GETFL, 0); if (fl < 0) return -1;
  return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}
 
static bool cancel_fd_init(struct cancel_fd *c) {
  int fds[2]; if (pipe(fds) != 0) return false; (void)make_nonblocking(fds[0]); (void)make_nonblocking(fds[1]);
  c->r = fds[0]; c->w = fds[1]; return true;
}
 
static void cancel_fd_signal(struct cancel_fd *c) { (void)write(c->w, "x", 1); }
 
static void cancel_fd_close(struct cancel_fd *c) { close(c->r); close(c->w); }
 
static int ms_left(struct timespec deadline) {
  struct timespec now; clock_gettime(CLOCK_MONOTONIC, &now);
  long ms = (long)((deadline.tv_sec - now.tv_sec) * 1000) + (long)((deadline.tv_nsec - now.tv_nsec) / 1000000);
  if (ms < 0) return 0; if (ms > 0x3fffffff) return 0x3fffffff; return (int)ms;
}
 
// Read up to len bytes before deadline or cancellation. Returns bytes read (>=0), 0 on EOF, or -1 on error.
static ssize_t read_cancellable(int fd, void *buf, size_t len, struct timespec deadline, int cancel_rfd) {
  size_t used = 0;
  for (;;) {
    ssize_t r = read(fd, (char *)buf + used, len - used);
    if (r > 0) { used += (size_t)r; if (used == len) return (ssize_t)used; continue; }
    if (r == 0) return (ssize_t)used; // EOF: return what we have
    if (errno == EINTR) continue;
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
      struct pollfd pfds[2] = { { .fd = fd, .events = POLLIN }, { .fd = cancel_rfd, .events = POLLIN } };
      int rdy = poll(pfds, 2, ms_left(deadline));
      if (rdy == 0) return (ssize_t)used; // timeout; partial OK
      if (rdy < 0 && errno == EINTR) continue;
      if (rdy < 0) return -1; // error
      if (pfds[1].revents) { errno = ECANCELED; return -1; }
      continue; // readable
    }
    return -1; // hard error
  }
}

This turns blocking I/O into a cancelable, deadline-aware loop. For writes, mirror the logic with partial sends and POLLOUT.
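
For completeness, a mirrored sketch for the write side, reusing ms_left from above; partial writes are retried until the deadline, cancellation, or a hard error:

// Write up to len bytes before deadline or cancellation.
// Returns bytes written (>= 0, possibly < len on timeout) or -1 on error/cancel.
static ssize_t write_cancellable(int fd, const void *buf, size_t len, struct timespec deadline, int cancel_rfd) {
  size_t sent = 0;
  while (sent < len) {
    ssize_t w = write(fd, (const char *)buf + sent, len - sent);
    if (w > 0) { sent += (size_t)w; continue; }          // partial write: keep going
    if (w == 0) { errno = EIO; return -1; }              // shouldn't happen with len > 0; treat as error
    if (errno == EINTR) continue;
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
      struct pollfd pfds[2] = { { .fd = fd, .events = POLLOUT }, { .fd = cancel_rfd, .events = POLLIN } };
      int rdy = poll(pfds, 2, ms_left(deadline));
      if (rdy == 0) return (ssize_t)sent;                // timeout; partial OK
      if (rdy < 0 && errno == EINTR) continue;
      if (rdy < 0) return -1;
      if (pfds[1].revents) { errno = ECANCELED; return -1; }
      continue;                                          // writable again
    }
    return -1;                                           // hard error
  }
  return (ssize_t)sent;
}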

Putting it together: an async read job

Define a small job context that owns the fd, buffer, deadline, and cancel handle. The job runs in the pool; the caller can cancel cooperatively and receives the result via a user-provided callback.

typedef void (*read_done_cb)(ssize_t result, void *udata);
 
struct async_read_job {
  int fd; void *buf; size_t len; off_t off;
  struct timespec deadline;
  struct cancel_fd cancel;
  read_done_cb cb; void *udata;
};
 
static void async_read_run(void *arg) {
  struct async_read_job *j = (struct async_read_job *)arg;
  // Optional: pre-position with lseek if off >= 0
  if (j->off >= 0) (void)lseek(j->fd, j->off, SEEK_SET);
  (void)make_nonblocking(j->fd);
  ssize_t r = read_cancellable(j->fd, j->buf, j->len, j->deadline, j->cancel.r);
  j->cb(r, j->udata);
  cancel_fd_close(&j->cancel);
  free(j);
}
 
// Submits an async read; returns a handle you can cancel.
struct job *async_read_submit(struct tpool *p, int fd, void *buf, size_t len, off_t off, int timeout_ms, read_done_cb cb, void *ud) {
  struct async_read_job *j = (struct async_read_job *)calloc(1, sizeof *j);
  if (!j) return NULL;
  j->fd = fd; j->buf = buf; j->len = len; j->off = off; j->cb = cb; j->udata = ud;
  clock_gettime(CLOCK_MONOTONIC, &j->deadline);
  j->deadline.tv_sec += timeout_ms / 1000;
  j->deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
  if (j->deadline.tv_nsec >= 1000000000L) { j->deadline.tv_sec += 1; j->deadline.tv_nsec -= 1000000000L; }
  if (!cancel_fd_init(&j->cancel)) { free(j); return NULL; }
  struct job *h = tpool_submit(p, async_read_run, j);
  if (!h) { cancel_fd_close(&j->cancel); free(j); }
  return h;
}
 
// Cooperative cancellation: signal the cancel pipe; worker observes and exits early.
void async_read_cancel(struct job *h) { tpool_cancel(h); /* also signal in case waiting */ }

For cancellation to be prompt, call cancel_fd_signal(&job->cancel) when canceling (e.g., stash a map from job* to cancel_fd). The illustration above keeps code compact; in production you’ll wrap the handle to expose both cancel() and join()/on_complete().

Dispatching results back to your main loop

If your system uses a single-threaded event loop, you don’t want callbacks running on worker threads mutating shared state. A simple pattern:

  • Worker threads push results onto a lock-free or mutex-protected queue and write a byte to a loop wakeup fd (eventfd or self-pipe)
  • The loop reads and drains the completion queue, executing callbacks in the loop thread

This preserves single-threaded invariants while still offloading blocking I/O.
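
A minimal sketch of that hand-off, assuming a Linux eventfd for the wakeup (a self-pipe works the same way); the completion struct and names here are ours:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <unistd.h>
 
struct completion { ssize_t result; void *udata; struct completion *next; };
 
static struct completion *result_q;                 // worker -> loop results (LIFO; keep a tail for FIFO)
static pthread_mutex_t result_mu = PTHREAD_MUTEX_INITIALIZER;
static int wake_fd;                                 // setup: wake_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
 
// Worker side: queue the result and nudge the loop.
static void post_completion(ssize_t result, void *udata) {
  struct completion *c = calloc(1, sizeof *c);
  if (!c) return;                                   // production: preallocate or count drops
  c->result = result; c->udata = udata;
  pthread_mutex_lock(&result_mu);
  c->next = result_q; result_q = c;
  pthread_mutex_unlock(&result_mu);
  uint64_t one = 1;
  (void)write(wake_fd, &one, sizeof one);           // wakes the loop's poll on wake_fd
}
 
// Loop side: called when wake_fd polls readable; runs callbacks on the loop thread.
static void drain_results(void (*cb)(ssize_t result, void *udata)) {
  uint64_t n;
  (void)read(wake_fd, &n, sizeof n);                // clear the eventfd counter
  pthread_mutex_lock(&result_mu);
  struct completion *c = result_q; result_q = NULL;
  pthread_mutex_unlock(&result_mu);
  while (c) { struct completion *next = c->next; cb(c->result, c->udata); free(c); c = next; }
}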

Tuning and pitfalls

  • Right-size the pool: start with min(4, num_cores) for I/O-heavy tasks and measure. Separate pools for CPU-heavy and I/O-heavy work.
  • Bound queues: reject/shed when the job queue grows beyond a threshold; surface backpressure instead of OOM.
  • Cooperative cancellation only: POSIX thread cancellation is perilous. Prefer explicit tokens and cancelable waits.
  • Resource lifetime: ensure buffers and fds outlive the job; free only after completion handling.
  • Signals and SIGPIPE: adopt the same signal discipline as in the robust I/O post (ignore SIGPIPE, use SA_RESTART judiciously).

A thin, unified abstraction (pick a backend, keep the app simple)

You don’t want the rest of your codebase to care whether the engine is POSIX AIO, io_uring, or a thread pool. Standardize a small interface with consistent semantics, then plug in a backend per platform.

Design goals:

  • Uniform result reporting: bytes on success, -errno on failure
  • Explicit deadlines/cancellation
  • Buffer/offset lifetime rules that are the same everywhere
  • Pluggable backends with identical call sites

Minimal API shape

#include <stdint.h>
#include <sys/uio.h>
 
typedef uint64_t aio_token; // unique per submitted op
 
enum aio_kind { AIO_READ, AIO_WRITE };
 
struct aio_req {
  enum aio_kind kind;
  int fd;
  struct iovec iov;   // single-span for simplicity; extend to iovecs as needed
  off_t offset;       // -1 for current position
  int timeout_ms;     // 0 = no deadline
  uint64_t user_tag;  // echoed on completion
};
 
struct aio_cqe {
  aio_token token;
  uint64_t  user_tag;
  int       res;      // >=0 = bytes, <0 = -errno
};
 
struct aio;
 
// Create/destroy with a chosen backend ("io_uring", "posix_aio", "threads").
struct aio *aio_create(const char *backend, int queue_depth);
void        aio_destroy(struct aio *A);
 
// Submit one request; returns 0 on success and sets *tok; negative errno on failure.
int         aio_submit(struct aio *A, const struct aio_req *req, aio_token *tok);
 
// Best-effort cancel; returns 0 on submitted, -ENOENT if not found.
int         aio_cancel(struct aio *A, aio_token tok);
 
// Reap at most cap completions; returns count (>=0) or -errno; non-blocking when timeout_ms==0.
int         aio_reap(struct aio *A, struct aio_cqe *out, int cap, int timeout_ms);

Semantics:

  • Offsets: if offset >= 0, the backend uses positioned I/O (pread()/pwrite()-style calls, or the SQE offset for io_uring). Otherwise, it uses the current file position (seek once, then nonblocking/poll for the thread backend).
  • Deadlines: if timeout_ms > 0, backends should enforce it (linked timeouts for io_uring; deadline in waits for others) and return -ETIME when expired.
  • Cancellation: best-effort; late completions are possible. App code must be idempotent.
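
To make the call-site shape concrete, here's a hypothetical consumer of this API; one deadline-bounded read followed by a reap loop:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
 
// Example call site against the unified API above.
static int demo_unified_read(struct aio *A, const char *path) {
  char buf[4096];
  int fd = open(path, O_RDONLY | O_CLOEXEC);
  if (fd < 0) return -errno;
 
  struct aio_req req = {
    .kind = AIO_READ, .fd = fd,
    .iov = { .iov_base = buf, .iov_len = sizeof buf },
    .offset = 0, .timeout_ms = 250, .user_tag = 42,
  };
  aio_token tok;
  int rc = aio_submit(A, &req, &tok);
  if (rc < 0) { close(fd); return rc; }
 
  struct aio_cqe cqe[16];
  int n = aio_reap(A, cqe, 16, 250);     // block up to 250 ms for completions
  for (int i = 0; i < n; ++i) {
    if (cqe[i].res == -ETIME)      { /* deadline missed; a late completion may still arrive: drop it by token */ }
    else if (cqe[i].res < 0)       { /* -errno from the backend */ }
    else                           { /* cqe[i].res bytes read, tagged cqe[i].user_tag */ }
  }
  close(fd);
  return n < 0 ? n : 0;
}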

Adapter sketch: io_uring backend

// On create: io_uring_queue_init(queue_depth, &ring, flags)
// On submit: prep READV/WRITEV; set user_data=token; if timeout_ms>0, link a timeout SQE
// On reap: io_uring_peek_batch_cqe / io_uring_wait_cqe; fill aio_cqe
// On cancel: prep cancel64 with token

Notes:

  • Register buffers/files after measuring; it’s a tail-latency lever.
  • Drain CQEs fully per wakeup to avoid ring stalls.
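
To make the reap line concrete, here's what this adapter's reap path could look like; a sketch where we assume user_data carries the token, the backend is reduced to the ring itself, and lookup_tag is a hypothetical token-to-tag table of ours:

// Sketch of the io_uring adapter's aio_reap implementation.
static int uring_reap(struct io_uring *ring, struct aio_cqe *out, int cap, int timeout_ms) {
  struct io_uring_cqe *cqes[64];
  if (cap > 64) cap = 64;
 
  if (timeout_ms > 0) {
    // Block until at least one completion or the timeout elapses.
    struct __kernel_timespec ts = { .tv_sec = timeout_ms / 1000, .tv_nsec = (timeout_ms % 1000) * 1000000LL };
    struct io_uring_cqe *first = NULL;
    int rc = io_uring_wait_cqe_timeout(ring, &first, &ts);
    if (rc == -ETIME) return 0;
    if (rc < 0) return rc;                      // -errno
  }
 
  unsigned n = io_uring_peek_batch_cqe(ring, cqes, (unsigned)cap);
  for (unsigned i = 0; i < n; ++i) {
    out[i].token    = io_uring_cqe_get_data64(cqes[i]);
    out[i].user_tag = lookup_tag(out[i].token); // hypothetical token -> user_tag table
    out[i].res      = cqes[i]->res;             // bytes or -errno
  }
  io_uring_cq_advance(ring, n);                 // mark the whole batch as seen
  return (int)n;
}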

Adapter sketch: POSIX AIO backend

// On create: allocate a table of aiocb slots and a wait set
// On submit: fill aiocb; aio_read/aio_write; remember token↔aiocb; start a timer thread or use aio_suspend with a global deadline wheel
// On reap: scan for completed via aio_error==0; aio_return; emit cqe
// On cancel: aio_cancel(fd, aiocb*) and mark token as cancelled

Notes:

  • Use aio_suspend over signals for predictability; multiplex deadlines with a timer heap.
  • Expect best-effort cancellation; return -ETIME from your own deadline logic even when the kernel later completes the op—drop it by token.

Adapter sketch: thread-pool backend

// On create: start fixed-size pool; create a completion queue and a wakeup fd
// On submit: package req in a job; worker performs cancelable reads/writes with poll; push cqe on completion queue
// On reap: drain completion queue; block with poll/select on the wakeup fd when timeout_ms>0
// On cancel: set job->cancel flag and signal its cancel pipe

Notes:

  • Keep separate pools for CPU-bound and I/O-bound tasks to avoid head-of-line blocking.
  • Bound the job queue; surface backpressure instead of unbounded memory growth.

Feature probing and selection

  • Prefer io_uring when available: attempt io_uring_queue_init; on failure, fall back to POSIX AIO, then to threads.
  • Allow runtime/environment overrides (e.g., AIO_BACKEND=threads) for diagnostics.
  • Log the selected backend and queue depth on startup.
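
A sketch of that selection order, assuming the aio_create factory above and the AIO_BACKEND environment override:

#include <stdio.h>
#include <stdlib.h>
 
// Pick a backend: honor an explicit override, otherwise probe in preference order.
static struct aio *aio_create_auto(int queue_depth) {
  const char *forced = getenv("AIO_BACKEND");                 // e.g. AIO_BACKEND=threads
  if (forced) {
    struct aio *A = aio_create(forced, queue_depth);          // honor the override or fail loudly
    if (!A) fprintf(stderr, "aio: forced backend %s unavailable\n", forced);
    return A;
  }
  const char *order[] = { "io_uring", "posix_aio", "threads" };
  for (size_t i = 0; i < sizeof order / sizeof order[0]; ++i) {
    struct aio *A = aio_create(order[i], queue_depth);
    if (A) {
      fprintf(stderr, "aio: backend=%s queue_depth=%d\n", order[i], queue_depth);
      return A;
    }
  }
  return NULL;                                                // nothing usable
}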

Error normalization and observability

  • Always surface -errno in res. Don’t translate; callers can switch on -EAGAIN, -EPIPE, -ECANCELED, -ETIME.
  • Attach the original user_tag so higher layers can correlate with requests.
  • Instrument: submissions, completions (bytes/errnos), time-in-queue, deadlines met/missed, cancels requested/satisfied.

Choosing wisely (quick guidance)

  • Need Linux-only, high-throughput, low tail latency, and are comfortable with kernel specifics? io_uring.
  • Need portability across BSD/macOS/Linux with decent file I/O completion and can tolerate implementation variance? POSIX AIO.
  • Need universal coverage and simpler reasoning, with explicit fairness and deadlines? Thread-pool async (with disciplined cancelable waits).

Hybrid is common: io_uring for hot file/network paths on Linux; thread-pool for everything else and for portability.


Production checklist

  • Define a single async API and hide backend differences behind adapters
  • Require deadlines on every external I/O; return -ETIME when missed
  • Make cancellation idempotent; tolerate late completions via tokens
  • Drain completion queues fully; never starve your ring or result queues
  • Bound in-flight ops and job queues; apply backpressure at admission
  • Normalize errors as -errno; log with fd, offset, bytes, errno, latency
  • Test with tiny buffers, forced EAGAIN, injected signals, and timeouts
  • Measure under realistic concurrency; validate tail latency, not just p50

Closing thoughts

Async I/O in C isn’t about picking the fanciest API; it’s about enforcing simple, explicit contracts: bytes or -errno, deadlines that bite, and cancellation that’s safe to reason about. Wrap those contracts in a tiny abstraction, choose the best backend per platform, and your code stays small, predictable, and fast—without 3 a.m. incidents.