You built a socket server, it works on your laptop, then it faceplants in production the moment a thousand clients show up. Threads block. Buffers fill. Latency spikes. The fix isn’t a bigger server—it’s the right I/O model and a disciplined event loop.
This post is a practical field guide to Linux epoll and BSD kqueue: what they actually signal, how to choose level vs. edge triggering, and how to write handlers that scale without starving neighbors or spinning CPUs. We'll keep the vibe production-first: small, boring loops that survive signals, partial reads/writes, and backpressure.
Readiness I/O in one minute
Readiness APIs (epoll, kqueue, and even old poll) tell you "this file descriptor won't block for operation X right now." They do not transfer any bytes; that's still your job via read()/write()/recv()/send().
Two big consequences:
- You must still handle short reads/writes and EAGAIN.
- Readiness is a hint, not a contract for "all the data." After you act, conditions may change.
At scale, you make all sockets nonblocking, register interest in events, then react to notifications with small, deterministic handlers. Do the minimum per wakeup, re-arm interest correctly, and move on.
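Concretely, that setup looks something like the sketch below; the watch_socket helper and its epfd/fd parameters are illustrative, not from any library:
#include <fcntl.h>
#include <sys/epoll.h>
// Make an existing socket nonblocking and register read interest (level-triggered).
// epfd comes from epoll_create1(); error handling is trimmed to the essentials.
static int watch_socket(int epfd, int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) return -1;
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;      // "a read won't block right now"
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}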
Level-triggered vs edge-triggered: what’s the difference?
- Level-triggered (LT): “While the condition is true, I’ll keep notifying you.” If a socket is readable and you don’t drain it, you’ll be notified again and again.
- Edge-triggered (ET): “I’ll notify you on state CHANGE.” If a socket becomes readable, you get an event once. If you don’t drain it fully, you may not get another notification until more data arrives or state changes again.
ET can reduce wakeups under load, but it is unforgiving: handlers must drain until EAGAIN. LT is friendlier but can cause repeated notifications if you only nibble a few bytes per wakeup.
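As a small sketch, here is how the two registrations differ on epoll (kqueue's EV_CLEAR plays the ET role, as discussed later); the register_read helper is illustrative:
#include <sys/epoll.h>
// Register fd for read readiness; edge != 0 requests edge-triggered delivery.
// LT (edge == 0) keeps re-notifying while data remains unread; ET notifies on
// transitions, so the handler must drain until EAGAIN before returning.
static int register_read(int epfd, int fd, int edge) {
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | (edge ? EPOLLET : 0);
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}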
The golden rule for ET
Whether reading or writing: loop until the syscall returns -1 with errno == EAGAIN (or EWOULDBLOCK), then stop. Forget that, and you'll ship a stuck connection that never wakes up again.
Minimal, correct drain loops
#include <errno.h>
#include <unistd.h>
static void on_readable_et(int fd) {
char buf[4096];
for (;;) {
ssize_t r = read(fd, buf, sizeof buf);
if (r > 0) {
// process buf[0..r)
continue;
}
if (r == 0) {
// peer closed
break;
}
if (r == -1 && errno == EINTR) {
continue; // try again
}
if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
// drained for now
break;
}
break; // hard error
}
}
For writes, you attempt to flush pending buffers until EAGAIN, then enable writable notifications to resume later:
#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>
// Advance an iovec array by `bytes` consumed, mutating base/len and iovcnt.
static void advance_iovecs(struct iovec **piov, int *piovcnt, size_t bytes) {
struct iovec *iov = *piov; int cnt = *piovcnt; size_t left = bytes;
while (cnt > 0 && left > 0) {
if (left >= iov->iov_len) { left -= iov->iov_len; ++iov; --cnt; }
else { iov->iov_base = (char *)iov->iov_base + left; iov->iov_len -= left; left = 0; }
}
*piov = iov; *piovcnt = cnt;
}
static int flush_writeq_et(int fd, struct iovec *iov, int iovcnt) {
while (iovcnt > 0) {
ssize_t w = writev(fd, iov, iovcnt);
if (w > 0) { advance_iovecs(&iov, &iovcnt, (size_t)w); continue; }
if (w == -1 && errno == EINTR) { continue; }
if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
// Need to re-enable POLLOUT/EVFILT_WRITE and resume later
return 0; // partial
}
return -1; // hard error (EPIPE, etc.)
}
return 1; // fully flushed
}
Correctly arming and re-arming interest
Subtleties differ by API:
- epoll (Linux)
  - LT by default; add EPOLLET for ET.
  - Use EPOLLONESHOT if you want “notify once, you must re-arm explicitly.” Great for handoff across worker threads (see the re-arm sketch after this list).
  - Always register interest for the operations you intend to perform next (e.g., re-enable EPOLLOUT only if you still have bytes to flush).
- kqueue (BSD/macOS)
  - EVFILT_READ/EVFILT_WRITE are level-triggered; the filter semantics carry sizes (“how many bytes are ready”).
  - You almost always follow the same drain-until-EAGAIN pattern. Even though the filter is level-triggered, you must consume enough that the kernel won’t keep notifying you incessantly.
  - EV_CLEAR approximates edge-trigger behavior by clearing the state after delivery. Treat it with the same discipline as ET.
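On Linux, the one-shot re-arm step might look like this sketch; the rearm_oneshot name and the flag mix are assumptions tied to the handoff pattern described later:
#include <sys/epoll.h>
// With EPOLLONESHOT the kernel disables the fd after one delivery; once the
// handler finishes, re-arm with EPOLL_CTL_MOD, requesting only what you need next.
static int rearm_oneshot(int epfd, int fd, int want_write) {
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT | (want_write ? EPOLLOUT : 0);
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}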
Fairness: don’t monopolize the loop
Even in ET mode, never spin on a single hot socket while others starve. Bound the work per wakeup—e.g., cap processed bytes or messages—then yield. Re-arming ensures you’ll be called again promptly.
#define MAX_BYTES_PER_WAKE (64 * 1024)
static void on_readable_fair(int fd) {
char buf[4096]; size_t budget = MAX_BYTES_PER_WAKE;
for (;;) {
if (budget == 0) break; // yield to peers
ssize_t want = sizeof buf; if ((size_t)want > budget) want = (ssize_t)budget;
ssize_t r = read(fd, buf, (size_t)want);
if (r > 0) { budget -= (size_t)r; /* process */ continue; }
if (r == 0) break;
if (r == -1 && errno == EINTR) continue;
if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break;
break;
}
}
The combination of “drain-until-EAGAIN” and “bounded budget per wakeup” keeps throughput high and tail latency sane.
A quick correctness checklist you can adopt today
- Make every network socket nonblocking from creation.
- Choose ET only if you implement drain-until-EAGAIN rigorously.
- Re-enable POLLOUT/EVFILT_WRITE only when you have pending bytes.
- Bound per-connection output queues (bytes and messages) to avoid memory blowups.
- Budget work per wakeup; never monopolize the loop.
- Treat EINTR as routine; return to the loop, don’t crash.
We’ll build on these foundations with timers, one-shot handoff, cancellation, and backpressure strategies next—keeping the core invariant intact: do a little, do it right, and keep going.
Timers and deadlines that don’t lie
Time drives everything in a server: idle timeouts, request deadlines, retries, periodic housekeeping. The rules:
- Use a monotonic clock for scheduling (not wall time). Wall time can jump.
- Drive your poller’s wait timeout from the next due timer.
- Fire due timers before handling new readiness so you honor deadlines.
Timer primitives
- Linux: timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK|TFD_CLOEXEC) integrates cleanly with epoll.
- BSD/macOS: EVFILT_TIMER on kqueue creates one-shot or periodic timers that deliver kevents.
// Linux timerfd: one-shot timer; re-arm explicitly after each expiration
#include <sys/timerfd.h>
#include <stdint.h>
#include <unistd.h>
int make_timerfd(void) {
int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
return tfd; // add to epoll with EPOLLIN
}
void arm_timerfd_once(int tfd, long ms_from_now) {
struct itimerspec its = {0};
its.it_value.tv_sec = ms_from_now / 1000;
its.it_value.tv_nsec = (ms_from_now % 1000) * 1000000L;
(void)timerfd_settime(tfd, 0, &its, NULL);
}
void on_timerfd_ready(int tfd) {
uint64_t expirations = 0;
(void)read(tfd, &expirations, sizeof expirations); // drain
// run scheduled tasks; optionally re-arm
}
// BSD/macOS: kqueue EVFILT_TIMER (periodic example)
#include <sys/event.h>
void add_periodic_timer(int kq, int ident, int interval_ms) {
struct kevent ev;
EV_SET(&ev, (uintptr_t)ident, EVFILT_TIMER, EV_ADD | EV_ENABLE, 0, interval_ms, NULL);
kevent(kq, &ev, 1, NULL, 0, NULL);
}
A minimal timer manager (min-heap)
For many servers, a binary min-heap keyed by due time is enough: O(log n) insert/cancel, O(log n) pop. On each loop tick:
- Pop and run all timers whose due <= now.
- Compute the wait timeout as max(0, next_due - now) and pass it to epoll_wait/kevent.
#include <stdint.h>
#include <time.h>
typedef void (*timer_cb)(void *arg);
struct timer { uint64_t due_ns; timer_cb cb; void *arg; };
// Assume you have a binary heap implementation over `struct timer` with:
// heap_top_due_ns(), heap_push(), heap_pop(), heap_empty()
static uint64_t now_ns(void) {
struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
static int ms_until_next_timer(void) {
if (heap_empty()) return 1000; // default idle wait
uint64_t n = now_ns();
uint64_t due = heap_top_due_ns();
if (due <= n) return 0;
uint64_t delta_ns = due - n;
uint64_t ms = delta_ns / 1000000ull;
if (ms > 0x3fffffff) ms = 0x3fffffff;
return (int)ms;
}
static void run_due_timers(void) {
uint64_t n = now_ns();
while (!heap_empty() && heap_top_due_ns() <= n) {
struct timer t = heap_pop();
t.cb(t.arg);
}
}
Integrate by calling run_due_timers() before each wait, and use ms_until_next_timer() as your poll timeout.
Cancellation and user-triggered wakeups
Long waits should be cancelable (shutdowns, config reload, task preemption). Two patterns:
- Linux: eventfd added to epoll; a write increments the counter to wake the loop.
- BSD/macOS: EVFILT_USER allows userland-triggered kevents.
// Linux eventfd cancellation
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>
int make_cancel_fd(void) {
int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
return efd; // add to epoll with EPOLLIN
}
void signal_cancel(int efd) { uint64_t one = 1; (void)write(efd, &one, sizeof one); }
void on_cancel_ready(int efd) { uint64_t n; (void)read(efd, &n, sizeof n); /* drain */ }
// BSD/macOS EVFILT_USER cancellation
#include <sys/event.h>
void add_user_event(int kq, uintptr_t ident) {
struct kevent ev; EV_SET(&ev, ident, EVFILT_USER, EV_ADD | EV_ENABLE, 0, 0, NULL);
kevent(kq, &ev, 1, NULL, 0, NULL);
}
void trigger_user_event(int kq, uintptr_t ident) {
struct kevent ev; EV_SET(&ev, ident, EVFILT_USER, 0, NOTE_TRIGGER, 0, NULL);
kevent(kq, &ev, 1, NULL, 0, NULL);
}
Both approaches integrate like any other fd/filter, letting you break out of epoll_wait/kevent immediately to act on the signal.
Accept without the thundering herd
Multiple workers blocked on the same listen socket can all wake when a new connection arrives. Only one wins accept(); the rest stampede and go back to sleep: wasted wakeups and cache churn.
Mitigations:
- Linux EPOLLEXCLUSIVE on the listen fd to wake a single waiter.
- Use SO_REUSEPORT to spread connections across processes, each with its own listen socket.
- One-shot accept: handle one accept per wakeup and re-arm (EPOLLONESHOT, or EV_DISPATCH-like behavior), or funnel accepts through a dedicated thread.
// Linux: register listen fd with EPOLLEXCLUSIVE to reduce herd
#include <sys/epoll.h>
void add_listen_epoll_exclusive(int epfd, int lfd) {
struct epoll_event ev = {0};
ev.events = EPOLLIN | EPOLLEXCLUSIVE; // level-triggered is fine for listen
ev.data.fd = lfd;
epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);
}
void on_listen_ready(int epfd, int lfd) {
for (;;) {
int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
if (cfd >= 0) {
// register cfd for I/O
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = cfd };
epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &ev);
continue;
}
if (cfd == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break; // drained backlog
if (cfd == -1 && errno == EINTR) continue; // retry
break; // other error
}
}
On BSD/macOS, prefer a single acceptor (or per-process with SO_REUSEPORT) that dispatches accepted sockets to workers. If you must share, use EV_CLEAR/dispatch patterns and accept only a bounded number per wakeup.
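A bounded-accept sketch for the BSD side might look like this, assuming the listen fd is already nonblocking; the helper name and cap are illustrative:
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
// Accept at most max_accepts connections per wakeup so a hot listener can't
// monopolize the loop; any remaining backlog will trigger another event.
static void on_listen_ready_bounded(int lfd, int max_accepts) {
    for (int i = 0; i < max_accepts; ++i) {
        int cfd = accept(lfd, NULL, NULL);
        if (cfd >= 0) {
            int fl = fcntl(cfd, F_GETFL, 0);
            (void)fcntl(cfd, F_SETFL, fl | O_NONBLOCK);   // no accept4 on macOS
            (void)fcntl(cfd, F_SETFD, FD_CLOEXEC);
            // register cfd with kqueue or hand it to a worker loop
            continue;
        }
        if (errno == EINTR) continue;   // retry within the budget
        break;                          // EAGAIN (backlog drained) or hard error
    }
}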
Backpressure that actually works
Backpressure is a policy, not an afterthought. Goals:
- Don’t let one slow peer soak all memory.
- Keep the loop responsive under bursty load.
- Fail fast when you cannot make progress within deadlines.
Practical rules:
- Maintain per-connection output queues with byte/message caps; reject or shed when exceeded.
- Enable writable notifications only when the queue is non-empty; disable immediately when flushed.
- Push a bounded amount per wakeup to preserve fairness.
struct conn {
int fd; size_t out_bytes;
// your queue container here
int want_writable; // 0/1
};
static void maybe_toggle_writable(int epfd, struct conn *c) {
struct epoll_event ev = { .data.fd = c->fd };
ev.events = EPOLLIN | EPOLLET | (c->out_bytes > 0 ? EPOLLOUT : 0);
epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev);
}
static void on_writable(struct conn *c) {
// Try to flush from the head of the queue using writev; update out_bytes
// Stop on EAGAIN; when out_bytes drops to 0, disable EPOLLOUT via maybe_toggle_writable
}
At higher loads, add global budgets as well (e.g., stop accepting new connections or shed low-priority work when total queued bytes exceed a threshold). Tie send/receive deadlines to timers so stalled operations surface as actionable errors instead of hidden backlog.
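One way to express such a global budget, sketched with illustrative thresholds and a hypothetical g_total_queued_bytes counter:
#include <stdbool.h>
#include <stdint.h>
// Global admission gate with hysteresis; pause accepting above the high-water
// mark and resume only once queued bytes fall below the low-water mark.
#define QUEUE_HIGH_WATER (64u * 1024 * 1024)
#define QUEUE_LOW_WATER  (48u * 1024 * 1024)
static uint64_t g_total_queued_bytes;
static bool g_accepting = true;
static void update_admission(void) {
    if (g_accepting && g_total_queued_bytes > QUEUE_HIGH_WATER) {
        g_accepting = false;   // e.g., EPOLL_CTL_DEL the listen fd
    } else if (!g_accepting && g_total_queued_bytes < QUEUE_LOW_WATER) {
        g_accepting = true;    // re-add the listen fd
    }
}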
A portable poller abstraction (epoll/kqueue)
Portability doesn't mean lowest common denominator. You can expose a small, strong API and map efficiently to each platform.
#include <stdint.h>
enum pe { PE_NONE=0, PE_READ=1<<0, PE_WRITE=1<<1, PE_ONESHOT=1<<2, PE_EDGE=1<<3 };
struct pe_event { int fd; uint32_t events; void *udata; };
struct poller {
int backend_fd; // epoll fd or kqueue fd
};
int pe_init(struct poller *p);
void pe_close(struct poller *p);
int pe_add(struct poller *p, int fd, uint32_t events, void *udata);
int pe_mod(struct poller *p, int fd, uint32_t events, void *udata);
int pe_del(struct poller *p, int fd);
int pe_wait(struct poller *p, struct pe_event *out, int cap, int timeout_ms);
Implementation sketch:
#if defined(__linux__)
#include <sys/epoll.h>
#include <unistd.h>
static uint32_t to_epoll(uint32_t ev) {
uint32_t r = 0;
if (ev & PE_READ) r |= EPOLLIN;
if (ev & PE_WRITE) r |= EPOLLOUT;
if (ev & PE_EDGE) r |= EPOLLET;
if (ev & PE_ONESHOT) r |= EPOLLONESHOT;
return r;
}
int pe_init(struct poller *p) { p->backend_fd = epoll_create1(EPOLL_CLOEXEC); return p->backend_fd>=0?0:-1; }
void pe_close(struct poller *p) { if (p->backend_fd>=0) close(p->backend_fd); }
int pe_add(struct poller *p, int fd, uint32_t ev, void *ud) {
struct epoll_event ee = { .events = to_epoll(ev) };
ee.data.ptr = ud; // store connection pointer or token
return epoll_ctl(p->backend_fd, EPOLL_CTL_ADD, fd, &ee);
}
int pe_mod(struct poller *p, int fd, uint32_t ev, void *ud) {
struct epoll_event ee = { .events = to_epoll(ev) };
ee.data.ptr = ud;
return epoll_ctl(p->backend_fd, EPOLL_CTL_MOD, fd, &ee);
}
int pe_del(struct poller *p, int fd) { return epoll_ctl(p->backend_fd, EPOLL_CTL_DEL, fd, NULL); }
int pe_wait(struct poller *p, struct pe_event *out, int cap, int timeout_ms) {
struct epoll_event eev[1024]; if (cap > 1024) cap = 1024;
int n = epoll_wait(p->backend_fd, eev, cap, timeout_ms);
for (int i = 0; i < n; ++i) {
out[i].fd = -1; // fd not provided by epoll when using data.ptr
out[i].udata = eev[i].data.ptr;
uint32_t e = 0;
if (eev[i].events & (EPOLLIN)) e |= PE_READ;
if (eev[i].events & (EPOLLOUT)) e |= PE_WRITE;
out[i].events = e;
}
return n;
}
#else // kqueue family
#include <sys/event.h>
#include <unistd.h>
int pe_init(struct poller *p) { p->backend_fd = kqueue(); return p->backend_fd>=0?0:-1; }
void pe_close(struct poller *p) { if (p->backend_fd>=0) close(p->backend_fd); }
static int kq_apply(int kq, int fd, uint32_t ev, void *ud) {
struct kevent ch[2]; int n = 0;
uint16_t flags = EV_ADD | EV_ENABLE;
if (ev & PE_ONESHOT) flags |= EV_ONESHOT; // deliver once
if (ev & PE_EDGE) flags |= EV_CLEAR; // edge-like
if (ev & PE_READ) { EV_SET(&ch[n++], (uintptr_t)fd, EVFILT_READ, flags, 0, 0, ud); }
if (ev & PE_WRITE) { EV_SET(&ch[n++], (uintptr_t)fd, EVFILT_WRITE, flags, 0, 0, ud); }
return kevent(kq, ch, n, NULL, 0, NULL);
}
int pe_add(struct poller *p, int fd, uint32_t ev, void *ud) { return kq_apply(p->backend_fd, fd, ev, ud); }
int pe_mod(struct poller *p, int fd, uint32_t ev, void *ud) {
// Re-specify filters atomically; kqueue treats EV_ADD as upsert
return kq_apply(p->backend_fd, fd, ev, ud);
}
int pe_del(struct poller *p, int fd) {
struct kevent ch[2]; int n = 0;
EV_SET(&ch[n++], (uintptr_t)fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
EV_SET(&ch[n++], (uintptr_t)fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
return kevent(p->backend_fd, ch, n, NULL, 0, NULL);
}
int pe_wait(struct poller *p, struct pe_event *out, int cap, int timeout_ms) {
struct timespec ts = { .tv_sec = timeout_ms/1000, .tv_nsec = (timeout_ms%1000)*1000000L };
struct kevent kev[1024]; if (cap > 1024) cap = 1024;
int n = kevent(p->backend_fd, NULL, 0, kev, cap, timeout_ms>=0?&ts:NULL);
for (int i = 0; i < n; ++i) {
out[i].fd = (int)kev[i].ident;
out[i].udata = kev[i].udata;
uint32_t e = 0;
if (kev[i].filter == EVFILT_READ) e |= PE_READ;
if (kev[i].filter == EVFILT_WRITE) e |= PE_WRITE;
out[i].events = e;
}
return n;
}
#endif
This preserves powerful options (edge, oneshot) while keeping the call sites clean and testable.
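A call-site sketch, just to show the shape of the API; error handling and real dispatch are omitted:
// Minimal usage sketch of the pe_* API above.
void serve_once(int listen_fd) {
    struct poller p;
    if (pe_init(&p) != 0) return;
    (void)pe_add(&p, listen_fd, PE_READ, NULL);   // NULL udata marks the listener here
    struct pe_event evs[64];
    int n = pe_wait(&p, evs, 64, 1000);           // wait up to 1 second
    for (int i = 0; i < n; ++i) {
        if (evs[i].udata == NULL) { /* accept new connections */ continue; }
        if (evs[i].events & PE_READ) { /* drain reads for this connection */ }
        if (evs[i].events & PE_WRITE) { /* flush its output queue */ }
    }
    pe_close(&p);
}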
One-shot handoff with a worker pool
ET plus one-shot delivery lets you process a connection on exactly one worker at a time without locks. The flow:
- Register the fd with PE_EDGE | PE_ONESHOT for both read and write.
- Any worker woken by pe_wait takes the event and processes it with drain loops.
- If more work remains (e.g., unread bytes or unflushed output), re-arm with pe_mod(fd, ...) from that worker.
// After handling an event for connection `c`
uint32_t want = PE_EDGE | PE_ONESHOT | PE_READ;
if (c->out_bytes > 0) want |= PE_WRITE;
pe_mod(&loop->poller, c->fd, want, c);
This guarantees mutual exclusion at the event level without coarse-grained locks. On BSD, EV_ONESHOT auto-disables; re-add as needed.
Multi-reactor sharding (thread-per-core)
For very high fan-out, run one event loop per core and shard connections:
- A single acceptor thread accepts and dispatches fds to loops via lock-free queues and a wakeup fd (eventfd/EVFILT_USER per loop).
- Each loop owns its fds, timers, and buffers—no cross-loop contention.
struct loop {
struct poller p; int wake_fd; // eventfd or EVFILT_USER ident
// inbound queue of accepted fds
};
void loop_enqueue_fd(struct loop *L, int fd) {
// push to L->inbound queue
// signal L->wake_fd to break wait
}
#define EV_CAP 256 // per-wait event batch size (illustrative)
#define WAKE_UDATA ((void *)1) // sentinel udata registered for the wake fd (assumption)
static void loop_run(struct loop *L) {
struct pe_event evs[EV_CAP];
for (;;) {
run_due_timers();
int n = pe_wait(&L->p, evs, EV_CAP, ms_until_next_timer());
for (int i = 0; i < n; ++i) {
if (evs[i].udata == WAKE_UDATA) { /* drain inbound queue; register new fds */ continue; }
// handle connection events here
}
}
}
Sharding gives you linear scalability (until the NIC/PCIe/memory bus becomes the bottleneck) and simplifies ownership semantics.
Lifecycle and safety checklist (connection-level)
- Create sockets with SOCK_NONBLOCK | SOCK_CLOEXEC (or set via fcntl); a setup sketch follows this checklist.
- Set TCP_NODELAY judiciously for request/response protocols; leave Nagle enabled for streaming/bulk writes.
- Suppress SIGPIPE (ignore it or use MSG_NOSIGNAL/SO_NOSIGPIPE).
- Always handle EINTR/EAGAIN in I/O paths.
- Limit per-connection memory; pre-size buffers reasonably; recycle.
- Log with context: fd, peer, bytes attempted/achieved, errno, queue sizes, deadlines.
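A setup helper covering the first few items might look like this sketch; the exact TCP_NODELAY and SIGPIPE choices are assumptions you should adapt to your protocol:
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <signal.h>
#include <sys/socket.h>
// One possible per-connection setup; the TCP_NODELAY choice assumes a
// request/response protocol.
static int prep_connection(int fd) {
    int fl = fcntl(fd, F_GETFL, 0);
    if (fl == -1 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) == -1) return -1;
    (void)fcntl(fd, F_SETFD, FD_CLOEXEC);
    int one = 1;
    (void)setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
    return 0;
}
static void ignore_sigpipe(void) {
    (void)signal(SIGPIPE, SIG_IGN);   // or use MSG_NOSIGNAL / SO_NOSIGPIPE per send
}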
End-to-end deadlines (no zombie I/O)
Time-bound operations make overload survivable and bugs diagnosable. Give each request and connection a deadline and check it before expensive work.
#include <stdint.h>
#include <stdbool.h>
#include <time.h>
static uint64_t now_ms(void) {
struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
return (uint64_t)ts.tv_sec * 1000ull + (uint64_t)ts.tv_nsec/1000000ull;
}
struct request { uint64_t deadline_ms; /* ... */ };
static bool expired(uint64_t deadline_ms) { return now_ms() >= deadline_ms; }
// In handlers
if (expired(req->deadline_ms)) {
// respond with timeout / drop work; free buffers
}
Tie deadlines into your timer heap: when scheduling a request, push its deadline; when it fires, cancel outstanding I/O and clean up.
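Sketching that tie-in against the heap API assumed earlier; request_timed_out and the cleanup it performs are hypothetical:
// Push the request's deadline onto the timer heap defined earlier.
// heap_push(struct timer) is the assumed API from the min-heap section.
static void request_timed_out(void *arg) {
    struct request *req = arg;
    (void)req;
    // respond with a timeout or abort; free buffers; deregister the fd
}
static void schedule_deadline(struct request *req) {
    struct timer t = { .due_ns = req->deadline_ms * 1000000ull,
                       .cb = request_timed_out, .arg = req };
    heap_push(t);
}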
Protocol-friendly, partial-read state machines
Handlers must tolerate partial reads/writes. Model protocols as small state machines with explicit progress and bounds.
enum pstate { HDR, BODY, DONE, ERR };
struct parser {
enum pstate st;
size_t need; // bytes still needed in current phase
size_t have; // bytes collected in current buffer
char hdr[4096];
char *body; size_t body_cap;
};
static void on_read_chunk(struct parser *p, const char *data, size_t len) {
size_t i = 0;
while (i < len && p->st != DONE && p->st != ERR) {
if (p->st == HDR) {
// accumulate until "\r\n\r\n" (guard bounds)
if (p->have < sizeof p->hdr) p->hdr[p->have++] = data[i++]; else { p->st = ERR; break; }
// when terminator found, parse content-length → set p->need and allocate body
} else if (p->st == BODY) {
size_t can = p->need < (len - i) ? p->need : (len - i);
// copy 'can' bytes into body at offset (body_cap - need)
p->need -= can; i += can;
if (p->need == 0) p->st = DONE;
}
}
}
Write side mirrors this with an iovec queue; stop on EAGAIN and re-arm POLLOUT.
Error and hangup semantics you must honor
- Linux epoll: EPOLLERR/EPOLLHUP are level-triggered and may be delivered with or without EPOLLIN/EPOLLOUT. Treat them as readable/writable hints: attempt to drain (read returns 0 or an error), then close.
- BSD/macOS kqueue: EV_EOF on sockets indicates the peer closed; fflags may carry error info.
- Handle these in both read and write paths; don’t leave dead sockets in your set.
// epoll mask handling (ee is the struct epoll_event returned by epoll_wait)
uint32_t ev = ee.events;
if (ev & (EPOLLERR | EPOLLHUP)) {
// Try to drain reads and flush writes; then close
}
Tuning that actually moves the needle
- SO_REUSEPORT: scale accept across processes; measure fairness per core/NIC queue.
- Backlog: raise the listen backlog (somaxconn) to absorb bursts.
- Keepalive: enable TCP keepalive with sensible intervals; reap dead peers (a sketch follows the buffer example below).
- Nagle/coalescing: TCP_NODELAY for request/response; consider TCP_CORK/TCP_NOPUSH for large sequential writes.
- Buffers: right-size SO_SNDBUF/SO_RCVBUF; use smaller values in tests to force EAGAIN paths.
static void set_small_buffers(int fd) {
int sz = 4096; (void)setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof sz);
(void)setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof sz);
}
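A keepalive sketch using the Linux socket options (macOS spells the idle knob TCP_KEEPALIVE instead); the intervals are placeholders:
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
// SO_KEEPALIVE is portable; the idle/interval/count knobs below use the Linux
// names and are guarded so the sketch still compiles elsewhere.
static void enable_keepalive(int fd) {
    int on = 1, idle = 60, interval = 10, count = 3;
    (void)setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
#ifdef TCP_KEEPIDLE
    (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle);
    (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof interval);
    (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count);
#else
    (void)idle; (void)interval; (void)count;
#endif
}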
Overload protection and shedding
- Admission control: stop accepting when global queued bytes or active requests exceed thresholds; resume with hysteresis.
- Per-tenant caps: bound concurrent requests per client/IP.
- Deadlines + budget: drop or degrade when an operation exceeds its budget.
- Backoff signals: add jittered delays before re-enabling POLLOUT under extreme contention.
Observability: don’t fly blind
Track at minimum:
- Accept rate, active connections, close reasons (EOF, timeout, error code)
- Read/write bytes and ops; EAGAIN counts; queue sizes; dropped messages
- Latency histograms for critical ops (P50/P95/P99)
- Timer heap length and next-due skew (timer storms)
struct metrics { uint64_t accepts, closes, eagain_r, eagain_w; /* ... */ } M;
static inline void inc(uint64_t *c) { __atomic_add_fetch(c, 1, __ATOMIC_RELAXED); }
Test like production (but meaner)
- Use socketpair() and tiny buffers to force partials and EAGAIN (see the harness sketch after this list).
- Inject signals randomly; ensure EINTR paths are solid.
- Introduce latency/packet loss with tc netem (Linux) or PF rules.
- Observe with strace/dtruss, tcpdump/Wireshark, perf/Instruments.
- Fuzz parsers with truncated/overlong inputs; cap every length.
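A tiny harness sketch for the first item; the make_tight_pair name and 1 KiB buffers are arbitrary:
#include <fcntl.h>
#include <sys/socket.h>
// A socketpair with tiny, nonblocking buffers makes partial writes and EAGAIN
// easy to reproduce without a network.
static int make_tight_pair(int sv[2]) {
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return -1;
    int sz = 1024;
    for (int i = 0; i < 2; ++i) {
        (void)setsockopt(sv[i], SOL_SOCKET, SO_SNDBUF, &sz, sizeof sz);
        (void)setsockopt(sv[i], SOL_SOCKET, SO_RCVBUF, &sz, sizeof sz);
        int fl = fcntl(sv[i], F_GETFL, 0);
        (void)fcntl(sv[i], F_SETFL, fl | O_NONBLOCK);
    }
    return 0;
}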
Final production checklist
- Nonblocking everywhere; handle partial I/O and EINTR.
- Drain until EAGAIN; re-arm interest precisely.
- Bound per-connection and global memory; apply backpressure.
- Deadlines driven by a monotonic timer manager.
- Handle ERR/HUP/EOF paths; close decisively.
- Measure, log with context, and alert on backlogs/timeouts.
Wrap-up
Scaling C servers isn’t about a magic flag; it’s about disciplined readiness I/O, fair event handling, explicit time, and ruthless simplicity. With epoll/kqueue, drain-until-EAGAIN, precise re-arming, and concrete backpressure, you trade mystery stalls for predictable throughput and debuggable behavior. Ship the small, boring loops; they’re the ones that keep running at 2 a.m.