So you called write() once and assumed all your data magically teleported to disk or across the network. Adorable.
In real systems the kernel is a strict bouncer. You cross into its world via syscalls, you’re handed integer tokens called file descriptors, and you follow rules about what “write succeeded” actually means. If you don’t internalize those rules, you’ll ship data truncation bugs, rare deadlocks, and the occasional 3 a.m. incident.
This post is your practical field guide: what syscalls really promise, how file descriptors behave, and how to build I/O loops that survive signals, short writes, and nonblocking chaos—without turning your codebase into spaghetti.
The boundary: what a syscall actually is
Calling read(), write(), accept(), or open() transfers control to the kernel. That transition:
- Switches privilege levels and executes kernel code on your behalf
- May block your thread until the resource is ready (unless you asked for nonblocking)
- Returns either a non-negative result or -1 with errno set
Return-value truth table, condensed:
- >= 0: success. For read(), it’s the byte count read. For write(), it’s the byte count written. For open(), it’s a new file descriptor.
- -1: failure. Inspect errno for why (e.g., EINTR, EAGAIN, EPIPE).
Key mindset: success does not mean “all of it.” It means “some of it.” Your code must be prepared to loop.
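As a minimal sketch of honoring that contract at a single call site (the helper name is ours, not a standard API):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// One read() call, interpreted strictly: >= 0 is success (0 means EOF),
// -1 sends you to errno. EINTR is retried; everything else is reported.
static ssize_t read_once(int fd, void *buf, size_t len) {
    for (;;) {
        ssize_t r = read(fd, buf, len);
        if (r >= 0) return r;         // r > 0: some bytes; r == 0: EOF
        if (errno == EINTR) continue; // interrupted before any transfer: retry
        fprintf(stderr, "read(fd=%d): %s\n", fd, strerror(errno));
        return -1;                    // EAGAIN, EPIPE, ... left to the caller
    }
}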
File descriptors: small integers, big responsibilities
File descriptors (FDs) are small integers indexing a per-process table of “open files.” An “open file” is a kernel object pointing to a driver/stream/regular file plus state (offset, flags, reference counts).
What an FD can represent:
- Regular files, directories (limited ops), character/block devices
- Pipes (pipe()), UNIX domain sockets, TCP/UDP sockets
- Eventfd, signalfd, epoll/kqueue descriptors, timerfd (Linux)
Semantics and lifecycle you must know:
- Creation: open(), socket(), accept(), pipe(), etc. On success you get a non-negative FD.
- Duplication: dup(), dup2(), dup3() create new FDs referencing the same open-file description (shared offset/flags). Useful for redirecting stdin/stdout/stderr or implementing tee-like behavior.
- Close: close(fd) decrements the reference count; when the last reference drops, the kernel releases the resource. Never leak FDs—long-running services will run out.
- Inheritance: after fork(), the child inherits copies of the parent’s FDs. After execve(), inherited FDs remain open unless marked CLOEXEC. Use O_CLOEXEC (on open/socket/accept4) or fcntl(F_SETFD, FD_CLOEXEC) to prevent accidental leaks into child processes.
- Per-FD flags: O_NONBLOCK governs blocking behavior; O_APPEND, O_SYNC, O_DIRECT and friends alter write semantics or caching. Set flags with open(..., O_NONBLOCK) or fcntl(F_SETFL, O_NONBLOCK).
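A minimal sketch tying those rules together (the path and helper name are illustrative): create the FD with CLOEXEC from the start, then adjust per-FD flags with fcntl().
#include <fcntl.h>
#include <unistd.h>

// Open with O_CLOEXEC atomically (no window where a concurrent fork/exec
// could inherit the FD), then enable O_NONBLOCK after the fact.
static int open_service_fd(const char *path) {
    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd == -1) return -1;
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fd); // handle closure on the error path, too
        return -1;
    }
    return fd;
}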
A quick correctness checklist for FDs
- Always set CLOEXEC when creating long-lived FDs in services
- Decide up front: blocking vs nonblocking
- Handle closure on all paths (including error paths)
- Don’t mix blocking reads with nonblocking writes on the same socket without understanding backpressure
Blocking vs nonblocking: what you actually asked the kernel to do
With blocking FDs (the default), read() and write() may stall your thread:
- read() blocks until at least one byte is available or EOF
- write() blocks until the kernel can accept at least one byte into its buffers
With nonblocking FDs (O_NONBLOCK):
- read() returns -1 with errno == EAGAIN (or EWOULDBLOCK) if no data is ready
- write() returns -1 with errno == EAGAIN if kernel buffers are full
Nonblocking buys you control: you integrate readiness APIs (epoll/kqueue/poll) and implement timeouts and backpressure explicitly. It also forces you to write robust loops.
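A compact sketch of that division of labor, assuming a nonblocking fd: wait for readiness with poll(), then attempt the transfer (fuller loops appear later in this post).
#include <errno.h>
#include <poll.h>
#include <unistd.h>

// Wait up to timeout_ms for data, then read once. Returns bytes read,
// 0 on timeout or EOF, -1 on error. EINTR handling is omitted for brevity.
static ssize_t read_ready(int fd, void *buf, size_t len, int timeout_ms) {
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int r = poll(&p, 1, timeout_ms);
    if (r <= 0) return r == 0 ? 0 : -1; // timeout or poll error
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0; // readiness raced with another consumer; treat as "not now"
    return n;     // > 0: bytes; 0: EOF; -1: hard error
}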
The small print: short reads, short writes, and EOF
On success, read(fd, buf, n) may return any 0 <= r <= n:
- r > 0: you got some bytes; loop if you need more
- r == 0: EOF for streams/files (peer closed for sockets)
On success, write(fd, buf, n) may return any 0 < w <= n:
- Regular files typically write fully, but partial writes can occur (signals, quotas, O_NONBLOCK, resource pressure)
- Pipes/sockets frequently produce short writes; treat any partial as routine
Errors that matter:
- EINTR: the syscall was interrupted by a signal before transferring any bytes; retry
- EAGAIN/EWOULDBLOCK: would block on nonblocking FDs; wait for readiness, then retry
- EPIPE: writing to a closed pipe/socket; the peer is gone (often accompanied by SIGPIPE unless suppressed)
If your code assumes “all bytes in one go,” it’s wrong by construction.
Robust I/O loops: minimal, correct, boring (the good kind)
Two foundational helpers cover 80% of production needs: “write all” and “read exactly N unless EOF.” They are EINTR-safe, handle partial transfers, and optionally integrate nonblocking retries.
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <unistd.h>

// Write the entire buffer (best-effort). Returns true on success, false on error.
// For nonblocking fds, the caller should ensure writability (e.g., epoll/kqueue) before calling.
bool write_all(int fd, const void *buf, size_t len) {
    const uint8_t *p = (const uint8_t *)buf;
    size_t remaining = len;
    while (remaining > 0) {
        ssize_t w = write(fd, p, remaining);
        if (w > 0) {
            p += (size_t)w;
            remaining -= (size_t)w;
            continue;
        }
        if (w == -1 && errno == EINTR) {
            continue; // interrupted, retry
        }
        if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            // Would block: the caller must wait for POLLOUT/EVFILT_WRITE then retry.
            return false;
        }
        return false; // other error (EPIPE, ENOSPC, etc.)
    }
    return true;
}
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <unistd.h>

// Read exactly N bytes unless EOF occurs first.
// Returns the number of bytes placed into buf (<= len). 0 means EOF from the start.
ssize_t read_exact(int fd, void *buf, size_t len) {
    uint8_t *p = (uint8_t *)buf;
    size_t total = 0;
    while (total < len) {
        ssize_t r = read(fd, p + total, len - total);
        if (r > 0) {
            total += (size_t)r;
            continue;
        }
        if (r == 0) {
            // EOF before we reached len
            return (ssize_t)total;
        }
        if (r == -1 && errno == EINTR) {
            continue; // interrupted, retry
        }
        if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            // Would block: the caller must wait for POLLIN/EVFILT_READ then retry.
            return (ssize_t)total;
        }
        return -1; // other error
    }
    return (ssize_t)total;
}
Notes:
- These helpers separate “transfer” from “readiness.” In a nonblocking design, you wait for readiness (epoll/kqueue), then call these until EAGAIN.
- For blocking FDs, these loop until completion or error, which is fine in simple CLI tools but can cause head-of-line blocking in servers.
- Suppress SIGPIPE on sockets (set SO_NOSIGPIPE on BSDs or use send(..., MSG_NOSIGNAL) on Linux) so a peer-close yields EPIPE instead of terminating the process.
Practical edge cases to cement your intuition
- read() on a TCP socket can return fewer bytes than requested even if the sender wrote a single, larger buffer. TCP is a byte stream; message boundaries don’t exist.
- write() to a pipe with an atomic write size limit (POSIX requires atomicity up to PIPE_BUF) may still short-write when O_NONBLOCK is set and buffers are tight.
- On regular files, a blocking write() is often full-sized, but a signal arriving mid-flight can cause EINTR without progress, or partial completion followed by an error. Loop regardless.
- Large buffers may be split due to kernel limits, cgroup I/O throttling, or filesystem peculiarities. Your loop doesn’t care—keep going until done.
Scatter/gather I/O that actually scales
Vectored I/O lets you move bytes from or to multiple non-contiguous buffers in a single syscall. Instead of stitching headers + payload into a temporary buffer (and copying), you describe them with an array of struct iovec and call writev() (or readv() for the reverse). Why this matters:
- Fewer syscalls: amortize syscall overhead under high throughput
- Fewer copies: keep data in place; better cache locality
- Cleaner code: describe segments declaratively
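For instance, a header-plus-payload send without a staging copy might look like the sketch below (a single call; looping on partials is covered just after). The 4-byte header size is illustrative.
#include <sys/uio.h>

// Describe two non-contiguous buffers and hand them to the kernel in one syscall.
// The return value is a single byte count that may land anywhere in the segments.
static ssize_t send_framed_once(int fd, const void *hdr, const void *payload, size_t paylen) {
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,     .iov_len = 4 },
        { .iov_base = (void *)payload, .iov_len = paylen },
    };
    return writev(fd, iov, 2);
}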
The catch: partial completion can split anywhere across your segments. You must advance the iovec array after each call.
Advancing iovecs after a partial
#include <stddef.h>
#include <sys/uio.h>

// Advance an iovec array by `bytes` consumed, mutating base/len and iovcnt.
// On return, *piov points to the first unconsumed segment and *piovcnt is updated.
static void advance_iovecs(struct iovec **piov, int *piovcnt, size_t bytes) {
    struct iovec *iov = *piov;
    int cnt = *piovcnt;
    size_t remain = bytes;
    while (cnt > 0 && remain > 0) {
        if (remain >= iov->iov_len) {
            remain -= iov->iov_len;
            ++iov;
            --cnt;
        } else {
            iov->iov_base = (char *)iov->iov_base + remain;
            iov->iov_len -= remain;
            remain = 0;
        }
    }
    *piov = iov;
    *piovcnt = cnt;
}
Robust writev: full send with retries
Two variants are useful in practice—one that returns early on EAGAIN for nonblocking designs, and one that waits up to a timeout.
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <sys/uio.h>
#include <unistd.h>

// Attempt to write all iovecs. Returns true when everything is written.
// On nonblocking fds: returns false with errno=EAGAIN when you should wait for POLLOUT.
bool writev_all_try(int fd, struct iovec *iov, int iovcnt) {
    while (iovcnt > 0) {
        ssize_t w = writev(fd, iov, iovcnt);
        if (w > 0) {
            advance_iovecs(&iov, &iovcnt, (size_t)w);
            continue;
        }
        if (w == -1 && errno == EINTR) {
            continue; // signal, retry
        }
        if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return false; // caller should poll for writable
        }
        return false; // hard error (EPIPE, ENOSPC, ...)
    }
    return true;
}
#include <time.h>

static int wait_writable(int fd, int timeout_ms) {
    struct pollfd p = { .fd = fd, .events = POLLOUT };
    for (;;) {
        int r = poll(&p, 1, timeout_ms);
        if (r > 0) return 1;  // ready
        if (r == 0) return 0; // timeout
        if (r < 0 && errno == EINTR) continue;
        return -1; // error
    }
}

// Deadline-based helper: returns milliseconds remaining, clamped to [0, 0x3fffffff]
static int ms_left(struct timespec deadline) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    long ms = (long)((deadline.tv_sec - now.tv_sec) * 1000)
            + (long)((deadline.tv_nsec - now.tv_nsec) / 1000000);
    if (ms < 0) return 0;
    if (ms > 0x3fffffff) return 0x3fffffff;
    return (int)ms;
}

// Write all iovecs before the deadline. Returns true on success, false on timeout/error.
bool writev_all_until(int fd, struct iovec *iov, int iovcnt, struct timespec deadline) {
    while (iovcnt > 0) {
        ssize_t w = writev(fd, iov, iovcnt);
        if (w > 0) {
            advance_iovecs(&iov, &iovcnt, (size_t)w);
            continue;
        }
        if (w == -1 && errno == EINTR) {
            continue;
        }
        if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            int left = ms_left(deadline);
            int wr = wait_writable(fd, left);
            if (wr == 1) continue;
            return false; // timeout or poll error
        }
        return false; // hard error
    }
    return true;
}
Signal nuance for sockets: avoid SIGPIPE on peer-close. Options include ignoring SIGPIPE process-wide (sa_handler = SIG_IGN via sigaction), using send()/sendmsg() with MSG_NOSIGNAL (Linux), or SO_NOSIGPIPE (macOS/BSD). With plain writev(), ignoring SIGPIPE is the simplest.
Robust readv: fill buffers, honor EOF and timeouts
readv() mirrors writev(): it can return fewer bytes than requested even when data exists, and it can split across segments. The loop looks similar.
static int wait_readable(int fd, int timeout_ms) {
    struct pollfd p = { .fd = fd, .events = POLLIN };
    for (;;) {
        int r = poll(&p, 1, timeout_ms);
        if (r > 0) return 1;  // ready
        if (r == 0) return 0; // timeout
        if (r < 0 && errno == EINTR) continue;
        return -1; // error
    }
}

// Read exactly the iovec payload or stop on EOF/timeout/error.
// Returns total bytes read (<= total iovec length), or -1 on error, 0 on immediate EOF.
ssize_t readv_exact_until(int fd, struct iovec *iov, int iovcnt, struct timespec deadline) {
    size_t consumed = 0;
    while (iovcnt > 0) {
        ssize_t r = readv(fd, iov, iovcnt);
        if (r > 0) {
            consumed += (size_t)r;
            advance_iovecs(&iov, &iovcnt, (size_t)r);
            continue;
        }
        if (r == 0) {
            return (ssize_t)consumed; // EOF
        }
        if (r == -1 && errno == EINTR) {
            continue;
        }
        if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            int left = ms_left(deadline);
            int rr = wait_readable(fd, left);
            if (rr == 1) continue;
            return (ssize_t)consumed; // timeout: return what we have
        }
        return -1; // hard error
    }
    return (ssize_t)consumed; // full success
}
Design choices to note:
- Timeout as a deadline, not a per-iteration slice: makes progress consistent under partial transfers
- On timeout, readv_exact_until returns the bytes gathered so far (like read_exact); callers can decide whether to fail or process partials
- EINTR is treated as a non-event—just retry
A small nonblocking pattern: toggling O_NONBLOCK
If you need time-bounded I/O on an FD you usually run in blocking mode, consider temporarily setting O_NONBLOCK around the operation so poll() can control waiting explicitly. Beware of races if other threads share the FD.
#include <fcntl.h>
#include <stdbool.h>

static bool set_nonblocking(int fd, bool on) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return false;
    int want = on ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    if (want == flags) return true;
    return fcntl(fd, F_SETFL, want) == 0;
}
Use with care in single-threaded tools; prefer dedicated nonblocking sockets in servers.
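Putting the toggle to work, a hedged sketch of a time-bounded read on an otherwise-blocking FD, reusing set_nonblocking from above and the wait_readable helper defined earlier:
#include <errno.h>
#include <unistd.h>

// Flip to nonblocking, read with a poll()-enforced budget, then restore.
// Returns bytes read, 0 on timeout or EOF, -1 on error. Not safe if another
// thread shares this FD: O_NONBLOCK lives on the open-file description.
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms) {
    if (!set_nonblocking(fd, true)) return -1;
    ssize_t result = -1;
    for (;;) {
        result = read(fd, buf, len);
        if (result >= 0) break; // bytes or EOF
        if (errno == EINTR) continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            int r = wait_readable(fd, timeout_ms); // budget reused whole per retry
            if (r == 1) continue;
            result = (r == 0) ? 0 : -1; // timeout or poll error
            break;
        }
        break; // hard error, result stays -1
    }
    (void)set_nonblocking(fd, false); // best-effort restore
    return result;
}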
Signals without surprises: making I/O signal-safe
Signals can interrupt syscalls, flipping success into -1 with errno == EINTR. Good news: if your loops already retry on EINTR, you’re most of the way there. A few additional practices make the system predictable.
- Prefer sigaction over signal() and set SA_RESTART when appropriate so many syscalls resume automatically. Still treat EINTR as routine.
- Ignore SIGPIPE globally or use MSG_NOSIGNAL/SO_NOSIGPIPE so a peer-close on sockets yields EPIPE instead of killing the process.
- For precise coordination between signals and timeouts, use ppoll/pselect with a signal mask to avoid classic races.
#include <signal.h>

static volatile sig_atomic_t g_got_sigint = 0;

static void on_sigint(int signo) { (void)signo; g_got_sigint = 1; }

static void install_signal_handlers(void) {
    // Ignore SIGPIPE so writes on closed sockets set EPIPE
    struct sigaction ign = {0};
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    ign.sa_flags = 0;
    sigaction(SIGPIPE, &ign, NULL);

    // Handle SIGINT and request graceful shutdown; SA_RESTART restarts many syscalls
    struct sigaction sa = {0};
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGINT, &sa, NULL);
}
Race-free timeouts with ppoll/pselect
poll() has a race: a signal can arrive after you check the flag but before you call poll(), leaving you blocked. ppoll/pselect solve this by atomically swapping the signal mask during the wait.
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <time.h>

// Wait for readiness while unblocking the provided signals during the wait.
// Returns 1 when any fd is ready, 0 on timeout, -1 on error.
static int wait_rw_with_mask(struct pollfd *fds, nfds_t nfds,
                             struct timespec *ts, const sigset_t *unblock) {
#if defined(_GNU_SOURCE) || defined(__linux__)
    // Linux: use ppoll
    for (;;) {
        int r = ppoll(fds, nfds, ts, unblock);
        if (r >= 0) return r;
        if (errno == EINTR) continue;
        return -1;
    }
#else
    // Portable fallback: temporarily set mask then poll; small race may remain
    sigset_t prev;
    pthread_sigmask(SIG_SETMASK, unblock, &prev);
    int r;
    for (;;) {
        r = poll(fds, nfds, ts ? (int)(ts->tv_sec * 1000 + ts->tv_nsec / 1000000) : -1);
        if (r >= 0) break;
        if (errno == EINTR) continue;
        break;
    }
    pthread_sigmask(SIG_SETMASK, &prev, NULL);
    return r;
#endif
}
Notes:
- On Linux, ppoll is the clean choice. On other platforms, consider pselect or the platform’s event loop, which usually integrates signal delivery.
- Use a deadline to compute struct timespec per wait.
Cancellation you can reason about: self-pipe/eventfd
Long waits should be cancelable. Two portable primitives make this easy:
- Self-pipe: create pipe(); include the read end in your poll set. To cancel, write a byte to the write end.
- Linux eventfd: cheaper, a 64-bit counter you can increment; also pollable (a sketch follows the pipe-based version below).
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

struct cancel_fd { int r; int w; };

static int make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

static bool cancel_fd_init(struct cancel_fd *c) {
    int fds[2];
    if (pipe(fds) != 0) return false;
    (void)make_nonblocking(fds[0]);
    (void)make_nonblocking(fds[1]);
    c->r = fds[0];
    c->w = fds[1];
    return true;
}

static void cancel_fd_signal(struct cancel_fd *c) {
    (void)write(c->w, "x", 1); // best-effort; nonblocking
}

static void cancel_fd_drain(struct cancel_fd *c) {
    char buf[64];
    while (read(c->r, buf, sizeof buf) > 0) {}
}
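The Linux eventfd variant mentioned above, as a minimal sketch (Linux-only; one FD instead of a pipe pair):
#ifdef __linux__
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

// A pollable 64-bit counter: watch the fd for POLLIN just like the pipe's read end.
static int cancel_efd_init(void) {
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); // -1 on failure
}

static void cancel_efd_signal(int efd) {
    uint64_t one = 1;
    (void)write(efd, &one, sizeof one); // adds to the counter; best-effort
}

static void cancel_efd_drain(int efd) {
    uint64_t val;
    (void)read(efd, &val, sizeof val); // reads and resets the counter
}
#endif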
Integrate into readiness waits by adding the cancel read FD to your poll set and returning a distinct status when it becomes readable.
enum wait_result { WAIT_READY = 1, WAIT_TIMEOUT = 0, WAIT_ERROR = -1, WAIT_CANCELLED = -2 };

static int io_wait_rw(int io_fd, short events, int cancel_fd, struct timespec *ts) {
    struct pollfd pfds[2];
    pfds[0].fd = io_fd;     pfds[0].events = events; pfds[0].revents = 0;
    pfds[1].fd = cancel_fd; pfds[1].events = POLLIN; pfds[1].revents = 0;
    for (;;) {
        int r = poll(pfds, 2, ts ? (int)(ts->tv_sec * 1000 + ts->tv_nsec / 1000000) : -1);
        if (r > 0) {
            if (pfds[1].revents) return WAIT_CANCELLED;
            if (pfds[0].revents) return WAIT_READY;
            continue;
        }
        if (r == 0) return WAIT_TIMEOUT;
        if (errno == EINTR) continue;
        return WAIT_ERROR;
    }
}
Framing reads: stop exactly at a delimiter
Many text protocols and CLI tools need to read until a delimiter (e.g., \n) with a time budget. The helper below accumulates into a caller-provided buffer, stops when the delimiter is found or capacity is reached, and respects a deadline. It returns the number of bytes stored (which may be partial on timeout/EOF) or -1 on error.
#include <string.h>

// Reads up to cap bytes or until delim is seen; returns count (>=0) or -1 on error.
// The returned count includes the delimiter if present. The buffer is not NUL-terminated.
ssize_t read_until_delim(int fd, char *buf, size_t cap, char delim, struct timespec deadline) {
    size_t used = 0;
    while (used < cap) {
        ssize_t r = read(fd, buf + used, cap - used);
        if (r > 0) {
            used += (size_t)r;
            // Only the newly read bytes can contain a new delimiter; scan just those
            char *pos = memchr(buf + used - (size_t)r, (unsigned char)delim, (size_t)r);
            if (pos) {
                return (ssize_t)(pos - buf + 1);
            }
            continue;
        }
        if (r == 0) {
            return (ssize_t)used; // EOF
        }
        if (errno == EINTR) {
            continue;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            // Wait for readability then retry
            int left_ms = ms_left(deadline);
            int rr = wait_readable(fd, left_ms);
            if (rr == 1) continue;
            return (ssize_t)used; // timeout or poll error: return what we have
        }
        return -1; // hard error
    }
    return (ssize_t)used; // buffer full without delimiter
}
Practical tips:
- Cap lines to a reasonable maximum and treat longer ones as an error to avoid unbounded memory usage.
- Reuse buffers between calls to reduce allocations and improve cache locality.
- For binary protocols, prefer length-prefix framing and use read_exact/readv_exact_until to collect exactly the announced size.
Backpressure: when the kernel says “not now”
EAGAIN is not an error—it’s a signal that downstream buffers are full. Good patterns:
- Treat EAGAIN as a scheduling event: stop writing, register interest in writability, and try again when notified
- Bound your userland output queue (bytes and messages). If limits are exceeded, shed work or apply upstream backpressure
- Prefer deadlines for end-to-end operations. A write that never makes progress should eventually time out and surface an error you can observe
Simple budgeted sender loop:
// Attempts to flush up to max_bytes from iovecs. Returns bytes flushed (>=0) or -1 on error.
ssize_t flush_budgeted(int fd, struct iovec *iov, int iovcnt, size_t max_bytes) {
    size_t sent = 0;
    while (iovcnt > 0 && sent < max_bytes) {
        size_t want = iov[0].iov_len;
        if (want > max_bytes - sent) want = max_bytes - sent;
        struct iovec tmp = { .iov_base = iov[0].iov_base, .iov_len = want };
        ssize_t w = writev(fd, &tmp, 1);
        if (w > 0) {
            sent += (size_t)w;
            advance_iovecs(&iov, &iovcnt, (size_t)w);
            continue;
        }
        if (w == -1 && errno == EINTR) continue;
        if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break; // yield
        return -1; // hard error
    }
    return (ssize_t)sent;
}
This pattern is friendly to event loops: you push a bit each wakeup and never monopolize the thread.
Length-prefixed framing: robust, binary-friendly
Length-prefixed messages avoid delimiter corner cases and make partials easy to manage. A minimal pair of helpers:
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>

struct msg { const void *data; uint32_t len; };

bool send_msg(int fd, const void *data, uint32_t len, struct timespec deadline) {
    uint32_t nlen = htonl(len);
    struct iovec iov[2] = {
        { .iov_base = &nlen, .iov_len = sizeof nlen },
        { .iov_base = (void *)data, .iov_len = len }
    };
    return writev_all_until(fd, iov, 2, deadline);
}

// Returns malloc'd buffer on success (caller frees) and sets *out_len; NULL on error/timeout/EOF.
void *recv_msg(int fd, uint32_t *out_len, struct timespec deadline) {
    uint32_t nlen = 0;
    // Read the 4-byte header; anything short of 4 bytes means EOF/timeout/error
    struct iovec hiov = { .iov_base = &nlen, .iov_len = sizeof nlen };
    ssize_t hdr = readv_exact_until(fd, &hiov, 1, deadline);
    if (hdr != (ssize_t)sizeof nlen) return NULL;
    uint32_t len = ntohl(nlen);
    void *buf = malloc(len);
    if (!buf) return NULL;
    struct iovec piov = { .iov_base = buf, .iov_len = len };
    ssize_t body = readv_exact_until(fd, &piov, 1, deadline);
    if (body != (ssize_t)len) { free(buf); return NULL; }
    *out_len = len;
    return buf;
}
Notes:
- Enforce sane maximums (e.g., reject len > 16 MiB) to avoid memory bombs; a sketch of this guard follows below
- Consider a small fixed-size header struct for versioning and checksums
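A hedged sketch of that maximum-length guard: recv_msg with the check applied before allocation (the 16 MiB cap is an arbitrary example, not a protocol constant).
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <time.h>

enum { MSG_MAX_LEN = 16 * 1024 * 1024 }; // example cap; size it to your protocol

// Like recv_msg, but refuses oversized frames before malloc can be weaponized.
void *recv_msg_bounded(int fd, uint32_t *out_len, struct timespec deadline) {
    uint32_t nlen = 0;
    struct iovec hiov = { .iov_base = &nlen, .iov_len = sizeof nlen };
    if (readv_exact_until(fd, &hiov, 1, deadline) != (ssize_t)sizeof nlen) return NULL;
    uint32_t len = ntohl(nlen);
    if (len > MSG_MAX_LEN) return NULL; // memory-bomb guard
    void *buf = malloc(len);
    if (!buf) return NULL;
    struct iovec piov = { .iov_base = buf, .iov_len = len };
    if (readv_exact_until(fd, &piov, 1, deadline) != (ssize_t)len) { free(buf); return NULL; }
    *out_len = len;
    return buf;
}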
Nonblocking integration sketch (epoll-style)
The exact event loop will vary, but the principles are consistent:
- Always drain reads until EAGAIN
- For writes, try to flush the queue; if incomplete, enable POLLOUT/EPOLLOUT and resume on notification
- Bound per-connection buffers and enforce deadlines
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

struct bufseg { struct iovec iov[4]; int iovcnt; };
struct conn {
    int fd;
    struct bufseg outq[64];
    int q_head, q_tail; // ring
};

static bool conn_flush(struct conn *c) {
    // Try to write from the head segment only to keep fairness
    while (c->q_head != c->q_tail) {
        struct bufseg *s = &c->outq[c->q_head];
        if (s->iovcnt == 0) { c->q_head = (c->q_head + 1) % 64; continue; }
        ssize_t w = writev(c->fd, s->iov, s->iovcnt);
        if (w > 0) {
            // advance_iovecs takes a pointer-to-pointer; compact the surviving
            // segments back into the fixed array so state persists across calls
            struct iovec *iov = s->iov;
            int cnt = s->iovcnt;
            advance_iovecs(&iov, &cnt, (size_t)w);
            memmove(s->iov, iov, (size_t)cnt * sizeof *iov);
            s->iovcnt = cnt;
            continue;
        }
        if (w == -1 && errno == EINTR) continue;
        if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) return false; // need POLLOUT
        return false; // hard error
    }
    return true; // queue empty
}

static void conn_on_readable(struct conn *c) {
    char buf[4096];
    for (;;) {
        ssize_t r = read(c->fd, buf, sizeof buf);
        if (r > 0) {
            // process buf[0..r)
            continue;
        }
        if (r == 0) { /* peer closed */ break; }
        if (r == -1 && errno == EINTR) continue;
        if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break; // drained
        break; // hard error
    }
}
This sketch purposely omits epoll setup/teardown; its goal is to emphasize the robust read-until-EAGAIN/write-drain-until-EAGAIN pattern and bounded queues.
Production checklist
- Set CLOEXEC on all long-lived FDs; audit child processes
- Decide blocking vs nonblocking per component; default to nonblocking in servers
- Always handle partials and EINTR
- Treat EAGAIN as backpressure; never spin—wait for readiness
- Bound buffers and enforce deadlines; surface timeouts as errors
- Suppress SIGPIPE for socket I/O; expect EPIPE
- Log with context: fd, peer address, bytes attempted/achieved, errno
- Test with socketpair(), pipe(), and tiny buffers; inject signals; simulate timeouts (a sketch of such a test follows)
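To make that last point concrete, a hedged test sketch: a socketpair whose send buffer is shrunk so write_all and read_exact actually face short transfers (the sizes and the clamped SO_SNDBUF minimum are kernel-dependent).
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a reader so the writer's tiny send buffer fills repeatedly,
// exercising the partial-write path in write_all.
static void test_partial_writes(void) {
    int sv[2];
    assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0);
    int tiny = 1; // the kernel clamps this upward, but it stays far below 64 KiB
    (void)setsockopt(sv[0], SOL_SOCKET, SO_SNDBUF, &tiny, sizeof tiny);

    static char out[1 << 16], in[1 << 16];
    memset(out, 'x', sizeof out);

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) { // child: drain the peer end
        close(sv[0]);
        ssize_t got = read_exact(sv[1], in, sizeof in);
        _exit(got == (ssize_t)sizeof in ? 0 : 1);
    }
    close(sv[1]);
    assert(write_all(sv[0], out, sizeof out)); // survives many short writes
    close(sv[0]);

    int status = 0;
    assert(waitpid(pid, &status, 0) == pid);
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
}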
Closing thoughts
Robust I/O in C isn’t about cleverness—it’s about discipline. Embrace the kernel’s contract: syscalls may be interrupted, reads and writes may be partial, and nonblocking is a conversation with the scheduler. Wrap these truths in small, boring helpers, add deadlines and backpressure, and your services will trade 3 a.m. incidents for predictable, observable behavior.