Epoll, Kqueue, and Event Loops: Scaling C Network Servers

Published: November 5, 2013

Updated: March 5, 2025

You built a socket server. It works on your laptop. Then it faceplants in production the moment a thousand clients show up. Threads block. Buffers fill. Latency spikes. The fix isn’t a bigger server—it’s the right I/O model and a disciplined event loop.

This post is a practical field guide to Linux epoll and BSD kqueue: what they actually signal, how to choose level vs edge triggering, and how to write handlers that scale without starving neighbors or spinning CPUs. We’ll keep the vibe production-first: small, boring loops that survive signals, partial reads/writes, and backpressure.

Readiness I/O in one minute

Readiness APIs (epoll, kqueue, and even old poll) tell you “this file descriptor won’t block for operation X right now.” They do not transfer any bytes—that’s still your job via read()/write()/recv()/send().

Two big consequences:

  • You must still handle short reads/writes and EAGAIN.
  • Readiness is a hint, not a contract for “all the data.” After you act, conditions may change.

At scale, you make all sockets nonblocking, register interest in events, then react to notifications with small, deterministic handlers. Do the minimum per wakeup, re-arm interest correctly, and move on.
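
For the “make it nonblocking” step on an existing fd, here is a minimal fcntl sketch (the helper name is illustrative; freshly created sockets can instead use SOCK_NONBLOCK, covered later):

#include <fcntl.h>
 
// Illustrative helper: flip an already-open fd to nonblocking mode.
static int make_nonblocking(int fd) {
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags == -1) return -1;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}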

stateDiagram-v2
    [*] --> Initialize
    Initialize --> RegisterEvents: Setup epoll/kqueue
    RegisterEvents --> WaitForEvents: epoll_wait/kevent
    WaitForEvents --> ProcessEvents: Events ready
    ProcessEvents --> HandleRead: EPOLLIN/EVFILT_READ
    ProcessEvents --> HandleWrite: EPOLLOUT/EVFILT_WRITE
    ProcessEvents --> HandleError: EPOLLERR/EV_ERROR
    HandleRead --> DrainSocket: Read until EAGAIN
    HandleWrite --> FlushBuffer: Write until EAGAIN
    HandleError --> CloseConnection: Error handling
    DrainSocket --> RearmEvents: Re-register interest
    FlushBuffer --> RearmEvents: Re-register interest
    CloseConnection --> WaitForEvents: Continue loop
    RearmEvents --> WaitForEvents: Next iteration
    WaitForEvents --> Shutdown: Exit signal
    Shutdown --> [*]
    note right of WaitForEvents
        Single thread blocks here until any fd has events
    end note
    note right of ProcessEvents
        Handle all ready events in one batch
    end note

Level-triggered vs edge-triggered: what’s the difference?

  • Level-triggered (LT): “While the condition is true, I’ll keep notifying you.” If a socket is readable and you don’t drain it, you’ll be notified again and again.
  • Edge-triggered (ET): “I’ll notify you on state CHANGE.” If a socket becomes readable, you get an event once. If you don’t drain it fully, you may not get another notification until more data arrives or state changes again.

ET can reduce wakeups under load but it is unforgiving: handlers must drain until EAGAIN. LT is friendlier but can cause repeated notifications if you only nibble a few bytes per wakeup.

sequenceDiagram
    participant App as Application
    participant LT as Level-Triggered
    participant ET as Edge-Triggered
    participant Socket as Socket Buffer
    Note over Socket: Data arrives
    Socket->>LT: Notify: READABLE
    Socket->>ET: Notify: READABLE (edge)
    App->>LT: Read 100 bytes
    LT->>Socket: 100 bytes read
    Note over Socket: 400 bytes remain
    LT-->>App: Still READABLE (level)
    Note over ET: No notification (no edge)
    App->>LT: Read remaining 400 bytes
    LT->>Socket: All data read
    Note over Socket: Buffer empty
    Note over Socket: More data arrives
    Socket->>LT: Notify: READABLE
    Socket->>ET: Notify: READABLE (new edge)
    Note right of LT: Keeps notifying<br/>while data exists
    Note right of ET: Only notifies<br/>on state change
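
In epoll, the only registration difference is a flag; here is a minimal sketch (watch_read is an illustrative helper; on kqueue the rough analogue is adding EV_CLEAR to the EVFILT_READ registration):

#include <sys/epoll.h>
 
// Same socket, two trigger modes: EPOLLET is the only difference.
static int watch_read(int epfd, int fd, int edge_triggered) {
  struct epoll_event ev = {0};
  ev.events = EPOLLIN | (edge_triggered ? EPOLLET : 0);
  ev.data.fd = fd;
  return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}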

The golden rule for ET

Whether reading or writing: loop until the syscall returns -1 with errno == EAGAIN (or EWOULDBLOCK), then stop. Forget that, and you’ll ship a stuck connection that never wakes up again.

Minimal, correct drain loops

#include <errno.h>
#include <unistd.h>
 
static void on_readable_et(int fd) {
  char buf[4096];
  for (;;) {
    ssize_t r = read(fd, buf, sizeof buf);
    if (r > 0) {
      // process buf[0..r)
      continue;
    }
    if (r == 0) {
      // peer closed
      break;
    }
    if (r == -1 && errno == EINTR) {
      continue; // try again
    }
    if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      // drained for now
      break;
    }
    break; // hard error
  }
}

For writes, you attempt to flush pending buffers until EAGAIN, then enable writable notifications to resume later:

#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>
 
// Advance an iovec array by `bytes` consumed, mutating base/len and iovcnt.
static void advance_iovecs(struct iovec **piov, int *piovcnt, size_t bytes) {
  struct iovec *iov = *piov; int cnt = *piovcnt; size_t left = bytes;
  while (cnt > 0 && left > 0) {
    if (left >= iov->iov_len) { left -= iov->iov_len; ++iov; --cnt; }
    else { iov->iov_base = (char *)iov->iov_base + left; iov->iov_len -= left; left = 0; }
  }
  *piov = iov; *piovcnt = cnt;
}
 
static int flush_writeq_et(int fd, struct iovec *iov, int iovcnt) {
  while (iovcnt > 0) {
    ssize_t w = writev(fd, iov, iovcnt);
    if (w > 0) { advance_iovecs(&iov, &iovcnt, (size_t)w); continue; }
    if (w == -1 && errno == EINTR) { continue; }
    if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      // Need to re-enable POLLOUT/EVFILT_WRITE and resume later
      return 0; // partial
    }
    return -1; // hard error (EPIPE, etc.)
  }
  return 1; // fully flushed
}

Correctly arming and re-arming interest

Subtleties differ by API; a re-arm sketch follows the list:

  • epoll (Linux)

    • LT by default; add EPOLLET for ET.
    • Use EPOLLONESHOT if you want “notify once, you must re-arm explicitly.” Great for handoff across worker threads.
    • Always register interest for the operations you intend to perform next (e.g., re-enable EPOLLOUT only if you still have bytes to flush).
  • kqueue (BSD/macOS)

    • EVFILT_READ/EVFILT_WRITE are level-triggered; the filter semantics carry sizes (“how many bytes are ready”).
    • You almost always follow the same drain-until-EAGAIN pattern. Even though the filter is level-triggered, you must consume enough that the kernel won’t keep notifying you incessantly.
    • EV_CLEAR approximates edge-trigger behavior by clearing the state after delivery. Treat it with the same discipline as ET.
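
To make the re-arm discipline concrete, a small sketch (helper names are illustrative): EPOLLONESHOT re-arming on Linux, and an EV_CLEAR/EV_DISPATCH registration on kqueue.

// Linux: with EPOLLONESHOT, interest is disabled after each delivery until you
// re-arm via EPOLL_CTL_MOD. Re-enable EPOLLOUT only while bytes are pending.
#include <sys/epoll.h>
 
static int rearm_oneshot(int epfd, int fd, int want_write, void *udata) {
  struct epoll_event ev = {0};
  ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT | (want_write ? EPOLLOUT : 0);
  ev.data.ptr = udata;
  return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
 
// BSD/macOS: EV_CLEAR gives edge-like delivery; EV_DISPATCH disables the
// filter after delivery until you re-enable it (closest analogue to one-shot
// handoff without deleting the filter).
#include <stdint.h>
#include <sys/event.h>
 
static int rearm_dispatch_read(int kq, int fd, void *udata) {
  struct kevent ev;
  EV_SET(&ev, (uintptr_t)fd, EVFILT_READ,
         EV_ADD | EV_ENABLE | EV_CLEAR | EV_DISPATCH, 0, 0, udata);
  return kevent(kq, &ev, 1, NULL, 0, NULL);
}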

Fairness: don’t monopolize the loop

Even in ET mode, never spin on a single hot socket while others starve. Bound the work per wakeup—e.g., cap processed bytes or messages—then yield. Re-arming ensures you’ll be called again promptly.

#define MAX_BYTES_PER_WAKE  (64 * 1024)
 
static void on_readable_fair(int fd) {
  char buf[4096]; size_t budget = MAX_BYTES_PER_WAKE; 
  for (;;) {
    if (budget == 0) break; // yield to peers
    ssize_t want = sizeof buf; if ((size_t)want > budget) want = (ssize_t)budget;
    ssize_t r = read(fd, buf, (size_t)want);
    if (r > 0) { budget -= (size_t)r; /* process */ continue; }
    if (r == 0) break;
    if (r == -1 && errno == EINTR) continue;
    if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break;
    break;
  }
}

The combination of “drain-until-EAGAIN” and “bounded budget per wakeup” keeps throughput high and tail latency sane.


A quick correctness checklist you can adopt today

  • Make every network socket nonblocking from creation.
  • Choose ET only if you implement drain-until-EAGAIN rigorously.
  • Re-enable POLLOUT/EVFILT_WRITE only when you have pending bytes.
  • Bound per-connection output queues (bytes and messages) to avoid memory blowups.
  • Budget work per wakeup; never monopolize the loop.
  • Treat EINTR as routine; return to the loop, don’t crash.

We’ll build on these foundations with timers, one-shot handoff, cancellation, and backpressure strategies next—keeping the core invariant intact: do a little, do it right, and keep going.

Timers and deadlines that don’t lie

Time drives everything in a server: idle timeouts, request deadlines, retries, periodic housekeeping. The rules:

  • Use a monotonic clock for scheduling (not wall time). Wall time can jump.
  • Drive your poller’s wait timeout from the next due timer.
  • Fire due timers before handling new readiness so you honor deadlines.

Timer primitives

  • Linux: timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK|TFD_CLOEXEC) integrates cleanly with epoll.
  • BSD/macOS: EVFILT_TIMER on kqueue creates one-shot or periodic timers that deliver kevents.
// Linux timerfd: one-shot timer; re-arm explicitly after each expiration
#include <sys/timerfd.h>
#include <stdint.h>
#include <unistd.h>
 
int make_timerfd(void) {
  int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
  return tfd; // add to epoll with EPOLLIN
}
 
void arm_timerfd_once(int tfd, long ms_from_now) {
  struct itimerspec its = {0};
  its.it_value.tv_sec = ms_from_now / 1000;
  its.it_value.tv_nsec = (ms_from_now % 1000) * 1000000L;
  (void)timerfd_settime(tfd, 0, &its, NULL);
}
 
void on_timerfd_ready(int tfd) {
  uint64_t expirations = 0;
  (void)read(tfd, &expirations, sizeof expirations); // drain
  // run scheduled tasks; optionally re-arm
}
// BSD/macOS: kqueue EVFILT_TIMER (periodic example)
#include <sys/event.h>
 
void add_periodic_timer(int kq, int ident, int interval_ms) {
  struct kevent ev;
  EV_SET(&ev, (uintptr_t)ident, EVFILT_TIMER, EV_ADD | EV_ENABLE, 0, interval_ms, NULL);
  kevent(kq, &ev, 1, NULL, 0, NULL);
}

A minimal timer manager (min-heap)

For many servers, a binary min-heap keyed by due time is enough: O(log n) insert/cancel, O(log n) pop. On each loop tick:

  1. Pop and run all timers whose due <= now.
  2. Compute wait timeout as max(0, next_due - now) and pass to epoll_wait/kevent.
#include <stdint.h>
#include <time.h>
 
typedef void (*timer_cb)(void *arg);
 
struct timer { uint64_t due_ns; timer_cb cb; void *arg; };
 
// Assume you have a binary heap implementation over `struct timer` with:
//   heap_top_due_ns(), heap_push(), heap_pop(), heap_empty()
 
static uint64_t now_ns(void) {
  struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
 
static int ms_until_next_timer(void) {
  if (heap_empty()) return 1000; // default idle wait
  uint64_t n = now_ns();
  uint64_t due = heap_top_due_ns();
  if (due <= n) return 0;
  uint64_t delta_ns = due - n;
  uint64_t ms = delta_ns / 1000000ull;
  if (ms > 0x3fffffff) ms = 0x3fffffff;
  return (int)ms;
}
 
static void run_due_timers(void) {
  uint64_t n = now_ns();
  while (!heap_empty() && heap_top_due_ns() <= n) {
    struct timer t = heap_pop();
    t.cb(t.arg);
  }
}

Integrate by calling run_due_timers() before each wait, and use ms_until_next_timer() as your poll timeout.
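
If you don’t already have one, here is a minimal sketch of the assumed heap API: a fixed-capacity array min-heap ordered by due_ns. A production version would grow dynamically and support cancellation by handle.

// Minimal sketch of the assumed heap helpers: fixed capacity, global storage,
// earliest timer at index 0. No cancellation support here.
#define TIMER_HEAP_CAP 1024
 
static struct timer heap[TIMER_HEAP_CAP];
static int heap_len = 0;
 
static int heap_empty(void) { return heap_len == 0; }
static uint64_t heap_top_due_ns(void) { return heap[0].due_ns; }
 
static void heap_swap(int a, int b) { struct timer t = heap[a]; heap[a] = heap[b]; heap[b] = t; }
 
static int heap_push(struct timer t) {
  if (heap_len == TIMER_HEAP_CAP) return -1;
  int i = heap_len++;
  heap[i] = t;
  while (i > 0) {                      // sift up toward the root
    int parent = (i - 1) / 2;
    if (heap[parent].due_ns <= heap[i].due_ns) break;
    heap_swap(i, parent); i = parent;
  }
  return 0;
}
 
static struct timer heap_pop(void) {
  struct timer top = heap[0];
  heap[0] = heap[--heap_len];
  int i = 0;
  for (;;) {                           // sift down to restore heap order
    int l = 2 * i + 1, r = 2 * i + 2, m = i;
    if (l < heap_len && heap[l].due_ns < heap[m].due_ns) m = l;
    if (r < heap_len && heap[r].due_ns < heap[m].due_ns) m = r;
    if (m == i) break;
    heap_swap(i, m); i = m;
  }
  return top;
}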

Cancellation and user-triggered wakeups

Long waits should be cancelable (shutdowns, config reload, task preemption). Two patterns:

  • Linux: eventfd added to epoll; write increments the counter to wake the loop.
  • BSD/macOS: EVFILT_USER allows userland-triggered kevents.
// Linux eventfd cancellation
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>
 
int make_cancel_fd(void) {
  int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
  return efd; // add to epoll with EPOLLIN
}
 
void signal_cancel(int efd) { uint64_t one = 1; (void)write(efd, &one, sizeof one); }
 
void on_cancel_ready(int efd) { uint64_t n; (void)read(efd, &n, sizeof n); /* drain */ }
// BSD/macOS EVFILT_USER cancellation
#include <sys/event.h>
void add_user_event(int kq, uintptr_t ident) {
  struct kevent ev; EV_SET(&ev, ident, EVFILT_USER, EV_ADD | EV_ENABLE, 0, 0, NULL);
  kevent(kq, &ev, 1, NULL, 0, NULL);
}
 
void trigger_user_event(int kq, uintptr_t ident) {
  struct kevent ev; EV_SET(&ev, ident, EVFILT_USER, 0, NOTE_TRIGGER, 0, NULL);
  kevent(kq, &ev, 1, NULL, 0, NULL);
}

Both approaches integrate like any other fd/filter, letting you break out of epoll_wait/kevent immediately to act on the signal.

Accept without the thundering herd

Multiple workers blocked on the same listen socket can all wake when a new connection arrives. Only one wins accept(), the rest stampede and go back to sleep: wasted wakeups and cache churn.

Mitigations:

  • Linux EPOLLEXCLUSIVE on the listen fd to wake a single waiter.
  • Use SO_REUSEPORT to spread connections across processes, each with its own listen socket.
  • One-shot accept: handle one accept per wakeup and re-arm (EPOLLONESHOT, or EV_DISPATCH-like behavior), or funnel accepts through a dedicated thread.
// Linux: register listen fd with EPOLLEXCLUSIVE to reduce herd
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h> // accept4 (with glibc, compile with _GNU_SOURCE defined)
 
void add_listen_epoll_exclusive(int epfd, int lfd) {
  struct epoll_event ev = {0};
  ev.events = EPOLLIN | EPOLLEXCLUSIVE; // level-triggered is fine for listen
  ev.data.fd = lfd;
  epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);
}
 
void on_listen_ready(int epfd, int lfd) {
  for (;;) {
    int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
    if (cfd >= 0) {
      // register cfd for I/O
      struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = cfd };
      epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &ev);
      continue;
    }
    if (cfd == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break; // drained backlog
    if (cfd == -1 && errno == EINTR) continue; // retry
    break; // other error
  }
}

On BSD/macOS, prefer a single acceptor (or per-process with SO_REUSEPORT) that dispatches accepted sockets to workers. If you must share, use EV_CLEAR/dispatch patterns and accept only a bounded number per wakeup.
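
A sketch of that bounded-accept pattern on kqueue (macOS has no accept4, so flags are set after accept; the per-wakeup cap is an arbitrary illustration). The listen fd is registered level-triggered here so the kernel keeps reporting a non-empty backlog even if you stop at the cap; with EV_CLEAR you would need to drain fully or self-notify.

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/event.h>
#include <sys/socket.h>
 
#define MAX_ACCEPTS_PER_WAKE 64
 
void add_listen_kqueue(int kq, int lfd) {
  struct kevent ev;
  EV_SET(&ev, (uintptr_t)lfd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
  kevent(kq, &ev, 1, NULL, 0, NULL);
}
 
void on_listen_ready_kq(int kq, int lfd) {
  for (int i = 0; i < MAX_ACCEPTS_PER_WAKE; ++i) {
    int cfd = accept(lfd, NULL, NULL);
    if (cfd >= 0) {
      fcntl(cfd, F_SETFL, fcntl(cfd, F_GETFL, 0) | O_NONBLOCK);
      fcntl(cfd, F_SETFD, FD_CLOEXEC);
      // register cfd for EVFILT_READ here, then hand off to a worker
      continue;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) break; // backlog drained
    if (errno == EINTR) continue;                       // retry
    break;                                              // other error
  }
}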

Backpressure that actually works

Backpressure is a policy, not an afterthought. Goals:

  • Don’t let one slow peer soak all memory.
  • Keep the loop responsive under bursty load.
  • Fail fast when you cannot make progress within deadlines.

Practical rules:

  • Maintain per-connection output queues with byte/message caps; reject or shed when exceeded.
  • Enable writable notifications only when the queue is non-empty; disable immediately when flushed.
  • Push a bounded amount per wakeup to preserve fairness.
struct conn {
  int fd; size_t out_bytes;
  // your queue container here
  int want_writable; // 0/1
};
 
static void maybe_toggle_writable(int epfd, struct conn *c) {
  struct epoll_event ev = { .data.fd = c->fd };
  ev.events = EPOLLIN | EPOLLET | (c->out_bytes > 0 ? EPOLLOUT : 0);
  epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev);
}
 
static void on_writable(struct conn *c) {
  // Try to flush from the head of the queue using writev; update out_bytes
  // Stop on EAGAIN; when out_bytes drops to 0, disable EPOLLOUT via maybe_toggle_writable
}

At higher loads, add global budgets as well (e.g., stop accepting new connections or shed low-priority work when total queued bytes exceed a threshold). Tie send/receive deadlines to timers so stalled operations surface as actionable errors instead of hidden backlog.
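
One way to wire up the global budget, sketched with illustrative names: count total queued output bytes and gate the listen socket with high/low watermarks, so accepting pauses under pressure and resumes with hysteresis.

// Sketch: global output-byte budget with hysteresis on the listen socket.
// g_total_out_bytes is updated wherever you enqueue/flush output; the
// thresholds and helper names are illustrative.
#include <sys/epoll.h>
 
#define OUT_BYTES_HIGH (64u * 1024u * 1024u)  // stop accepting above this
#define OUT_BYTES_LOW  (32u * 1024u * 1024u)  // resume accepting below this
 
static size_t g_total_out_bytes;
static int g_accepting = 1;
 
static void pause_accepting(int epfd, int lfd) {
  if (!g_accepting) return;
  epoll_ctl(epfd, EPOLL_CTL_DEL, lfd, NULL); // stop watching the listen fd
  g_accepting = 0;
}
 
static void resume_accepting(int epfd, int lfd) {
  if (g_accepting) return;
  struct epoll_event ev = {0};
  ev.events = EPOLLIN;
  ev.data.fd = lfd;
  epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);
  g_accepting = 1;
}
 
static void on_queued_bytes_changed(int epfd, int lfd) {
  if (g_total_out_bytes > OUT_BYTES_HIGH) pause_accepting(epfd, lfd);
  else if (g_total_out_bytes < OUT_BYTES_LOW) resume_accepting(epfd, lfd);
}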

A portable poller abstraction (epoll/kqueue)

Portability doesn't mean lowest common denominator. You can expose a small, strong API and map efficiently to each platform.

graph TB
    subgraph "Application Layer"
        A[Application Code]
        A --> B[Portable Poller API]
    end
    subgraph "Abstraction Layer"
        B --> C{Platform Detection}
    end
    subgraph "Linux Implementation"
        C -->|Linux| D[epoll_create1]
        D --> E[epoll_ctl ADD/MOD/DEL]
        E --> F[epoll_wait]
        F --> G[EPOLLIN/EPOLLOUT/EPOLLET]
    end
    subgraph "BSD Implementation"
        C -->|BSD/macOS| H[kqueue]
        H --> I[kevent registration]
        I --> J[kevent wait]
        J --> K[EVFILT_READ/WRITE]
    end
    subgraph "Fallback Implementation"
        C -->|Other| L[poll/select]
        L --> M[pollfd array]
        M --> N[poll wait]
        N --> O[POLLIN/POLLOUT]
    end
    style B fill:#e1f5fe
    style C fill:#fff3e0
    style G fill:#e8f5e8
    style K fill:#e8f5e8
    style O fill:#fff8e1
#include <stdint.h>
 
enum pe { PE_NONE=0, PE_READ=1<<0, PE_WRITE=1<<1, PE_ONESHOT=1<<2, PE_EDGE=1<<3 };
 
struct pe_event { int fd; uint32_t events; void *udata; };
 
struct poller {
  int backend_fd; // epoll fd or kqueue fd
};
 
int  pe_init(struct poller *p);
void pe_close(struct poller *p);
int  pe_add(struct poller *p, int fd, uint32_t events, void *udata);
int  pe_mod(struct poller *p, int fd, uint32_t events, void *udata);
int  pe_del(struct poller *p, int fd);
int  pe_wait(struct poller *p, struct pe_event *out, int cap, int timeout_ms);

Implementation sketch:

#if defined(__linux__)
#include <sys/epoll.h>
#include <unistd.h>
 
static uint32_t to_epoll(uint32_t ev) {
  uint32_t r = 0;
  if (ev & PE_READ)  r |= EPOLLIN;
  if (ev & PE_WRITE) r |= EPOLLOUT;
  if (ev & PE_EDGE)  r |= EPOLLET;
  if (ev & PE_ONESHOT) r |= EPOLLONESHOT;
  return r;
}
 
int pe_init(struct poller *p) { p->backend_fd = epoll_create1(EPOLL_CLOEXEC); return p->backend_fd>=0?0:-1; }
void pe_close(struct poller *p) { if (p->backend_fd>=0) close(p->backend_fd); }
 
int pe_add(struct poller *p, int fd, uint32_t ev, void *ud) {
  struct epoll_event ee = { .events = to_epoll(ev) };
  ee.data.ptr = ud; // store connection pointer or token
  return epoll_ctl(p->backend_fd, EPOLL_CTL_ADD, fd, &ee);
}
int pe_mod(struct poller *p, int fd, uint32_t ev, void *ud) {
  struct epoll_event ee = { .events = to_epoll(ev) };
  ee.data.ptr = ud;
  return epoll_ctl(p->backend_fd, EPOLL_CTL_MOD, fd, &ee);
}
int pe_del(struct poller *p, int fd) { return epoll_ctl(p->backend_fd, EPOLL_CTL_DEL, fd, NULL); }
 
int pe_wait(struct poller *p, struct pe_event *out, int cap, int timeout_ms) {
  struct epoll_event eev[1024]; if (cap > 1024) cap = 1024;
  int n = epoll_wait(p->backend_fd, eev, cap, timeout_ms);
  for (int i = 0; i < n; ++i) {
    out[i].fd = -1; // fd not provided by epoll when using data.ptr
    out[i].udata = eev[i].data.ptr;
    uint32_t e = 0;
    if (eev[i].events & EPOLLIN)  e |= PE_READ;
    if (eev[i].events & EPOLLOUT) e |= PE_WRITE;
    // Surface errors/hangups as readable+writable so handlers hit the error
    // on their next read/write and can close decisively.
    if (eev[i].events & (EPOLLERR | EPOLLHUP)) e |= PE_READ | PE_WRITE;
    out[i].events = e;
  }
  return n;
}
 
#else // kqueue family
#include <sys/event.h>
#include <unistd.h>
 
int pe_init(struct poller *p) { p->backend_fd = kqueue(); return p->backend_fd>=0?0:-1; }
void pe_close(struct poller *p) { if (p->backend_fd>=0) close(p->backend_fd); }
 
static int kq_apply(int kq, int fd, uint32_t ev, void *ud) {
  struct kevent ch[2]; int n = 0;
  uint16_t flags = EV_ADD | EV_ENABLE;
  if (ev & PE_ONESHOT) flags |= EV_ONESHOT;   // deliver once
  if (ev & PE_EDGE)    flags |= EV_CLEAR;     // edge-like
  if (ev & PE_READ)  { EV_SET(&ch[n++], (uintptr_t)fd, EVFILT_READ,  flags, 0, 0, ud); }
  if (ev & PE_WRITE) { EV_SET(&ch[n++], (uintptr_t)fd, EVFILT_WRITE, flags, 0, 0, ud); }
  return kevent(kq, ch, n, NULL, 0, NULL);
}
 
int pe_add(struct poller *p, int fd, uint32_t ev, void *ud) { return kq_apply(p->backend_fd, fd, ev, ud); }
int pe_mod(struct poller *p, int fd, uint32_t ev, void *ud) {
  // Re-specify filters atomically; kqueue treats EV_ADD as upsert
  return kq_apply(p->backend_fd, fd, ev, ud);
}
int pe_del(struct poller *p, int fd) {
  struct kevent ch[2]; int n = 0;
  EV_SET(&ch[n++], (uintptr_t)fd, EVFILT_READ,  EV_DELETE, 0, 0, NULL);
  EV_SET(&ch[n++], (uintptr_t)fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
  return kevent(p->backend_fd, ch, n, NULL, 0, NULL);
}
 
int pe_wait(struct poller *p, struct pe_event *out, int cap, int timeout_ms) {
  struct timespec ts = { .tv_sec = timeout_ms/1000, .tv_nsec = (timeout_ms%1000)*1000000L };
  struct kevent kev[1024]; if (cap > 1024) cap = 1024;
  int n = kevent(p->backend_fd, NULL, 0, kev, cap, timeout_ms>=0?&ts:NULL);
  for (int i = 0; i < n; ++i) {
    out[i].fd = (int)kev[i].ident;
    out[i].udata = kev[i].udata;
    uint32_t e = 0;
    if (kev[i].filter == EVFILT_READ)  e |= PE_READ;
    if (kev[i].filter == EVFILT_WRITE) e |= PE_WRITE;
    // Note: EV_EOF/EV_ERROR arrive in kev[i].flags; handlers should check them.
    out[i].events = e;
  }
  return n;
}
#endif

This preserves powerful options (edge, oneshot) while keeping the call sites clean and testable.

One-shot handoff with a worker pool

ET plus one-shot delivery lets you process a connection on exactly one worker at a time without locks. The flow:

  1. Register the fd with PE_EDGE | PE_ONESHOT for both read and write.
  2. Any worker woken by pe_wait takes the event, processes with drain loops.
  3. If more work remains (e.g., unread bytes or unflushed output), re-arm with pe_mod(fd, ...) from that worker.
// After handling an event for connection `c`
uint32_t want = PE_EDGE | PE_ONESHOT | PE_READ;
if (c->out_bytes > 0) want |= PE_WRITE;
pe_mod(&loop->poller, c->fd, want, c);

This guarantees mutual exclusion at the event level without coarse-grained locks. On BSD, EV_ONESHOT deletes the filter after its first delivery (EV_DISPATCH merely disables it), so re-add or re-enable as needed.

Multi-reactor sharding (thread-per-core)

For very high fan-out, run one event loop per core and shard connections:

  • A single acceptor thread accepts and dispatches fds to loops via lock-free queues and a wakeup fd (eventfd/EVFILT_USER per loop).
  • Each loop owns its fds, timers, and buffers—no cross-loop contention.
struct loop {
  struct poller p; int wake_fd; // eventfd or EVFILT_USER ident
  // inbound queue of accepted fds
};
 
void loop_enqueue_fd(struct loop *L, int fd) {
  // push to L->inbound queue
  // signal L->wake_fd to break wait
}
 
#define EV_CAP 256
static struct pe_event evs[EV_CAP];   // per-loop event batch
static char wake_sentinel;            // udata registered for this loop's wake fd
#define WAKE_UDATA ((void *)&wake_sentinel)
 
static void loop_run(struct loop *L) {
  for (;;) {
    run_due_timers();
    int n = pe_wait(&L->p, evs, EV_CAP, ms_until_next_timer());
    for (int i = 0; i < n; ++i) {
      if (evs[i].udata == WAKE_UDATA) { /* drain inbound queue; register new fds */ continue; }
      // handle connection events here
    }
  }
}

Sharding gives you linear scalability (until the NIC/PCIe/memory bus becomes the bottleneck) and simplifies ownership semantics.

Lifecycle and safety checklist (connection-level)

  • Create sockets with SOCK_NONBLOCK | SOCK_CLOEXEC (or set via fcntl); see the setup sketch after this list.
  • Enable TCP_NODELAY for latency-sensitive request/response protocols; leave Nagle’s algorithm on for bulk streaming writes.
  • Suppress SIGPIPE (ignore or use MSG_NOSIGNAL/SO_NOSIGPIPE).
  • Always handle EINTR/EAGAIN in I/O paths.
  • Limit per-connection memory; pre-size buffers reasonably; recycle.
  • Log with context: fd, peer, bytes attempted/achieved, errno, queue sizes, deadlines.
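
A sketch of per-connection setup following the checklist (setup_conn_socket is an illustrative helper; MSG_NOSIGNAL is the Linux per-send alternative to BSD’s SO_NOSIGPIPE):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
 
static void setup_conn_socket(int fd, int request_response) {
  int one = 1;
  if (request_response)
    (void)setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one); // latency over coalescing
  (void)setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof one);   // reap dead peers
#ifdef SO_NOSIGPIPE
  (void)setsockopt(fd, SOL_SOCKET, SO_NOSIGPIPE, &one, sizeof one);   // BSD/macOS
#endif
  // On Linux, pass MSG_NOSIGNAL to send()/sendmsg() instead,
  // or ignore SIGPIPE process-wide with signal(SIGPIPE, SIG_IGN).
}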

End-to-end deadlines (no zombie I/O)

Time-bound operations make overload survivable and bugs diagnosable. Give each request and connection a deadline and check it before expensive work.

#include <time.h>
#include <stdbool.h>
 
static uint64_t now_ms(void) {
  struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000ull + (uint64_t)ts.tv_nsec/1000000ull;
}
 
struct request { uint64_t deadline_ms; /* ... */ };
 
static bool expired(uint64_t deadline_ms) { return now_ms() >= deadline_ms; }
 
// In handlers
if (expired(req->deadline_ms)) {
  // respond with timeout / drop work; free buffers
}

Tie deadlines into your timer heap: when scheduling a request, push its deadline; when it fires, cancel outstanding I/O and clean up.
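
A sketch of that wiring, reusing struct timer, now_ns(), now_ms(), and the assumed heap_push() from the timer-manager section (on_deadline and schedule_deadline are illustrative):

// Illustrative: push a deadline onto the timer heap when a request starts.
// When it fires, surface a timeout instead of leaving zombie I/O around.
static void on_deadline(void *arg) {
  struct request *req = (struct request *)arg;
  // cancel outstanding I/O for this request, free buffers, log a timeout
  (void)req;
}
 
static int schedule_deadline(struct request *req, uint64_t timeout_ms) {
  req->deadline_ms = now_ms() + timeout_ms;
  struct timer t = { .due_ns = now_ns() + timeout_ms * 1000000ull,
                     .cb = on_deadline, .arg = req };
  return heap_push(t); // assumed signature: heap_push(struct timer)
}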

Protocol-friendly, partial-read state machines

Handlers must tolerate partial reads/writes. Model protocols as small state machines with explicit progress and bounds.

enum pstate { HDR, BODY, DONE, ERR };
 
struct parser {
  enum pstate st;
  size_t need;     // bytes still needed in current phase
  size_t have;     // bytes collected in current buffer
  char   hdr[4096];
  char  *body; size_t body_cap;
};
 
static void on_read_chunk(struct parser *p, const char *data, size_t len) {
  size_t i = 0;
  while (i < len && p->st != DONE && p->st != ERR) {
    if (p->st == HDR) {
      // accumulate until "\r\n\r\n" (guard bounds)
      if (p->have < sizeof p->hdr) p->hdr[p->have++] = data[i++]; else { p->st = ERR; break; }
      // when terminator found, parse content-length → set p->need and allocate body
    } else if (p->st == BODY) {
      size_t can = p->need < (len - i) ? p->need : (len - i);
      // copy 'can' bytes into body at offset (body_cap - need)
      p->need -= can; i += can;
      if (p->need == 0) p->st = DONE;
    }
  }
}

Write side mirrors this with an iovec queue; stop on EAGAIN and re-arm POLLOUT.

Error and hangup semantics you must honor

  • Linux epoll: EPOLLERR/EPOLLHUP are always reported, even if you never requested them, and may arrive with or without EPOLLIN/EPOLLOUT. Treat them as readable/writable hints: attempt to drain (read returns 0 or an error), then close.
  • BSD/macOS kqueue: EV_EOF on sockets indicates peer closed; fflags may carry error info.
  • Handle in both read and write paths; don’t leave dead sockets in your set.
// epoll mask handling
uint32_t ev = ee.events;
if (ev & (EPOLLERR | EPOLLHUP)) {
  // Try to drain reads and flush writes; then close
}

Tuning that actually moves the needle

  • SO_REUSEPORT: scale accept across processes; measure fairness per core/NIC queue.
  • Backlog: raise listen backlog (somaxconn) to absorb bursts.
  • Keepalive: enable TCP keepalive with sensible intervals; reap dead peers (sketch below).
  • Nagle/coalescing: TCP_NODELAY for request/response; consider TCP_CORK/TCP_NOPUSH for large sequential writes.
  • Buffers: right-size SO_SNDBUF/SO_RCVBUF; smaller in tests to force EAGAIN paths.
static void set_small_buffers(int fd) {
  int sz = 4096; (void)setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof sz);
  (void)setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof sz);
}
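
The keepalive knobs map to socket options; here is a sketch with placeholder values (TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are the Linux names, guarded since macOS spells the idle time TCP_KEEPALIVE):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
 
static void enable_keepalive(int fd) {
  int on = 1, idle = 60, intvl = 10, cnt = 6; // placeholder intervals
  (void)setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
#ifdef TCP_KEEPIDLE
  (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle);
#endif
#ifdef TCP_KEEPINTVL
  (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl);
#endif
#ifdef TCP_KEEPCNT
  (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt);
#endif
}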

Overload protection and shedding

  • Admission control: stop accepting when global queued bytes or active requests exceed thresholds; resume with hysteresis.
  • Per-tenant caps: bound concurrent requests per client/IP.
  • Deadlines + budget: drop or degrade when an operation exceeds its budget.
  • Backoff signals: add jittered delays before re-enabling POLLOUT under extreme contention.

Observability: don’t fly blind

Track at minimum:

  • Accept rate, active connections, close reasons (EOF, timeout, error code)
  • Read/write bytes and ops; EAGAIN counts; queue sizes; dropped messages
  • Latency histograms for critical ops (P50/P95/P99)
  • Timer heap length and next-due skew (timer storms)
struct metrics { uint64_t accepts, closes, eagain_r, eagain_w; /* ... */ } M;
static inline void inc(uint64_t *c) { __atomic_add_fetch(c, 1, __ATOMIC_RELAXED); }

Test like production (but meaner)

  • Use socketpair() and tiny buffers to force partials and EAGAIN (see the harness sketch after this list).
  • Inject signals randomly; ensure EINTR paths are solid.
  • Introduce latency/packet loss with tc netem (Linux) or PF rules.
  • Observe with strace/dtruss, tcpdump/Wireshark, perf/Instruments.
  • Fuzz parsers with truncated/overlong inputs; cap every length.
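
A tiny standalone harness in that spirit, with illustrative sizes: shrink SO_SNDBUF on one end of a socketpair, write until the kernel pushes back, and confirm your code sees EAGAIN instead of blocking.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
 
int main(void) {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return 1;
 
  int sz = 4096;                                  // tiny send buffer
  (void)setsockopt(sv[0], SOL_SOCKET, SO_SNDBUF, &sz, sizeof sz);
  fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL, 0) | O_NONBLOCK);
 
  char chunk[1024] = {0};
  size_t sent = 0;
  for (;;) {                                      // nobody reads sv[1], so this must hit EAGAIN
    ssize_t w = write(sv[0], chunk, sizeof chunk);
    if (w > 0) { sent += (size_t)w; continue; }
    if (w == -1 && errno == EINTR) continue;
    if (w == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break;
    return 1;                                     // unexpected error
  }
  printf("wrote %zu bytes before EAGAIN\n", sent);
  close(sv[0]); close(sv[1]);
  return 0;
}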

Final production checklist

  • Nonblocking everywhere; handle partial I/O and EINTR.
  • Drain until EAGAIN; re-arm interest precisely.
  • Bound per-connection and global memory; apply backpressure.
  • Deadlines driven by a monotonic timer manager.
  • Handle ERR/HUP/EOF paths; close decisively.
  • Measure, log with context, and alert on backlogs/timeouts.

Wrap-up

Scaling C servers isn’t about a magic flag—it’s about disciplined readiness I/O, fair event handling, explicit time, and ruthless simplicity. With epoll/kqueue, drain-until-EAGAIN, precise re-arming, and concrete backpressure, you trade mystery stalls for predictable throughput and debuggable behavior. Ship the small, boring loops; they’re the ones that keep running at 2 a.m.