Zero-Copy in C: sendfile, splice, vmsplice, and mmap

Published: March 29, 2013 · 24 min read

Updated: September 7, 2022

You can make CPUs fast or you can make them move bytes. Doing both at the same time is hard. Most production servers spend a shocking amount of time shuffling memory—copying from kernel to user, from one buffer to another, from one cache level to the next. Zero-copy techniques are about getting out of the way: keep data on the fast path (page cache, DMA, socket buffers) and avoid needless detours through user space.

This post is a grounded tour of Linux primitives that enable “zero-copy-ish” data movement: sendfile(2), splice(2), vmsplice(2), tee(2), and mmap(2). We’ll cover what they promise, where the copies still happen, and how to write robust loops that hit line rate without turning into a pile of edge cases.

Important mindset: zero-copy rarely means "no copies exist anywhere." It usually means "no copies in user space" and "no redundant copies between kernel subsystems." DMA still moves bytes to/from devices; the page cache still backs file data; socket buffers still exist. The win is cutting out the userland bounce buffers and system call overhead in the hot path.

graph TB
  subgraph "Traditional Copy (Multiple Hops)"
    T1[File on Disk] --> T2[Read to Kernel Buffer]
    T2 --> T3[Copy to User Buffer]
    T3 --> T4[Process/Modify]
    T4 --> T5[Send to Kernel Buffer]
    T5 --> T6[Copy to Socket Buffer]
    T6 --> T7[DMA to Network]
  end
  subgraph "Zero-Copy (Kernel Direct)"
    Z1[File on Disk] --> Z2[Page Cache]
    Z2 --> Z3[Socket Buffer Reference]
    Z3 --> Z4[DMA to Network]
  end
  subgraph "Techniques"
    S1["sendfile(): File → Socket<br/>splice(): Pipe-based transfers<br/>mmap(): Shared memory mapping<br/>vmsplice(): Userspace → Pipe"]
  end
  subgraph "Benefits"
    B1["• Fewer memory copies<br/>• Reduced CPU usage<br/>• Lower memory bandwidth<br/>• Better cache efficiency<br/>• Reduced context switches"]
  end
  style T3 fill:#ffebee
  style T5 fill:#ffebee
  style Z3 fill:#e8f5e8
  style B1 fill:#e8f5e8
  Note1["Traditional: 2+ copies<br/>through user space"]
  Note2["Zero-copy: Direct<br/>kernel-to-kernel transfer"]

What zero-copy actually means (and doesn’t)

  • Kernel can forward pages or pipe buffers directly between subsystems without materializing a user-space buffer.
  • User space sets up the transfer with a small number of syscalls; the kernel does the heavy lifting.
  • Sometimes the kernel shares page references (or pipe buffers) instead of copying data; sometimes it still copies but avoids extra crossings.
  • Disk and NIC DMA are still real. Expect copies between device and RAM unless you’re on specialized NICs with userspace stacks (DPDK, io_uring/ZC send on some NICs, etc.). We’ll stay focused on portable kernel interfaces.

The common data paths at a glance

User-copy path (classic):

  1. read(file, user_buf) ⇒ device→RAM (DMA) → page cache → copy into user_buf
  2. write(sock, user_buf) ⇒ copy from user_buf → socket send buffer → NIC (DMA)

Zero-copy-ish path (file→socket):

  • sendfile(sock, file, ...) ⇒ page cache pages forwarded into socket stack; user-level buffer is never touched.

Zero-copy-ish path (fd↔fd via pipe):

  • splice(fd_in, pipe) + splice(pipe, fd_out) ⇒ pipe-buffer forwarding between endpoints.

User-mapped path:

  • mmap(file) ⇒ map file pages into your address space; page faults pull pages; you may write()/send() from those addresses (still a user→socket copy unless paired with other tricks).

Page cache vs direct paths (why it matters)

Most practical zero-copy file→socket pipelines ride the page cache. When you sendfile() from a regular file to a TCP socket:

  • If the needed pages are not in memory, the kernel will schedule reads (readahead) to fill them.
  • The socket then references those pages as the payload source. Depending on kernel and offloads, the bytes are segmented/checksummed and sent to the NIC without userland ever seeing them.

Direct I/O (O_DIRECT) changes the rules by bypassing the page cache for the file side, but it complicates alignment and buffering. Classic sendfile() expects page-cache-backed files; pairing O_DIRECT with sendfile typically yields EINVAL or falls back. Keep sendfile() on cache-backed files unless you’re solving a very specific problem with direct I/O (we’ll return to this later when we discuss measurement and alignment constraints).

sendfile(2): the simplest zero-copy for file→socket

sendfile() moves data from a file descriptor to a socket descriptor inside the kernel. No user buffer, minimal syscalls, high throughput.

High-level behavior on Linux:

  • Source must be a file descriptor that supports mapping into the page cache (regular files).
  • Destination is usually a socket (AF_INET, AF_UNIX, etc.).
  • The kernel pulls file pages (with readahead) and attaches them to the socket’s send path.
  • The call may transfer fewer bytes than requested (short send) and must be looped.

Minimal robust loop (blocking socket):

#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>
 
// Send exactly 'count' bytes from 'in_fd' (file) to 'out_fd' (socket), starting at *p_off.
// Returns true on success, false on error. Updates *p_off.
bool sendfile_exact(int out_fd, int in_fd, off_t *p_off, size_t count) {
  size_t remaining = count;
  while (remaining > 0) {
    // Linux caps a single sendfile() call at about 2 GiB (0x7ffff000 bytes);
    // it simply returns short and this loop picks up the rest. Clamp to
    // SSIZE_MAX anyway so the requested size always fits the return type.
    size_t chunk = remaining;
    if (chunk > SSIZE_MAX) chunk = SSIZE_MAX;
    ssize_t n = sendfile(out_fd, in_fd, p_off, chunk);
    if (n > 0) {
      remaining -= (size_t)n;
      continue;
    }
    if (n == 0) {
      // EOF on file before sending 'count' bytes
      return false;
    }
    if (n == -1 && errno == EINTR) {
      continue; // interrupted, retry
    }
    return false; // EAGAIN on nonblocking, or hard error
  }
  return true;
}

Key points you must honor:

  • Short sends are normal. Loop until done; treat EINTR as a retry.
  • On nonblocking sockets, sendfile() returns a short count once the socket send buffer fills, or -1 with errno == EAGAIN if nothing could be sent. Track progress via off_t *offset and resume after the socket is writable again.
  • offset semantics: if offset is non-NULL, the file position of in_fd is not modified; the kernel uses and updates the pointed value. If offset is NULL, the kernel updates the file position of in_fd instead.

Nonblocking-friendly variant (returning progress):

#include <errno.h>
#include <limits.h>
#include <sys/sendfile.h>
 
// Attempts to send up to 'count' bytes. Returns bytes sent (>=0), or -1 on error.
// On EAGAIN/EWOULDBLOCK, returns the bytes sent so far (>=0) and sets errno to EAGAIN.
ssize_t sendfile_try(int out_fd, int in_fd, off_t *p_off, size_t count) {
  size_t remaining = count;
  size_t sent = 0;
  while (remaining > 0) {
    size_t chunk = remaining;
    if (chunk > SSIZE_MAX) chunk = SSIZE_MAX;
    ssize_t n = sendfile(out_fd, in_fd, p_off, chunk);
    if (n > 0) { sent += (size_t)n; remaining -= (size_t)n; continue; }
    if (n == 0) { return (ssize_t)sent; } // EOF
    if (errno == EINTR) { continue; }
    if (errno == EAGAIN || errno == EWOULDBLOCK) { return (ssize_t)sent; }
    return -1; // hard error
  }
  return (ssize_t)sent;
}
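
To drive sendfile_try to completion without a full event loop, here is a small sketch that parks in poll() whenever the socket is full. It reuses sendfile_try from above; send_file_pollwait is an illustrative name of our own.

#include <errno.h>
#include <poll.h>
#include <stdbool.h>

// Keep calling sendfile_try, sleeping in poll() whenever the socket is full.
bool send_file_pollwait(int sockfd, int filefd, off_t off, size_t len) {
  size_t left = len;
  while (left > 0) {
    errno = 0;
    ssize_t n = sendfile_try(sockfd, filefd, &off, left);
    if (n < 0) return false;          // hard error
    left -= (size_t)n;
    if (left == 0) break;
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
      // Socket full: wait until it accepts more data, then resume.
      struct pollfd pfd = { .fd = sockfd, .events = POLLOUT };
      if (poll(&pfd, 1, -1) < 0 && errno != EINTR) return false;
      continue;
    }
    return false; // EOF on the file before 'len' bytes were sent
  }
  return true;
}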

Where sendfile() shines

  • Static content servers: large files, media streaming, software distribution.
  • Any workload where the application does not need to inspect/transform payloads.
  • CPU offload from user space: fewer cache misses, fewer copies, fewer syscalls.

Common pitfalls and gotchas

  • Headers/trailers: Classic sendfile() is body-only. If you need to send headers + file + trailers without extra copies, consider writev() for small headers followed by sendfile(), or use sendfile variants/sendmsg with MSG_ZEROCOPY on capable stacks (beyond the scope here) or kernel TLS (KTLS). Keep headers small so the user-space copy cost is negligible.
  • Nonblocking semantics: expect EAGAIN. Integrate with your event loop; resume where you left off using offset.
  • File holes and sparse files: the kernel may synthesize runs of zeros; behavior is generally fine but measure.
  • Direct I/O: pairing O_DIRECT file descriptors with sendfile() is not generally supported; you’ll see EINVAL. Stick to page-cache-backed files here.
  • Cross-platform: BSD/macOS sendfile() signatures and semantics differ (e.g., headers/trailers support). The portability story is “same idea, different knobs.” The patterns here focus on Linux.

mmap(2): zero-copy reads into your address space (with caveats)

mmap() maps file pages directly into your process. Reads become page faults that pull data into RAM and mark those pages present. This can reduce copies for in-process consumers (you read directly out of the page cache) and improve spatial locality.

However, mmap is not a magic bullet for sending data over sockets: calling send() on an mmap’d address still copies from user space into the socket buffer. You didn’t eliminate the user→socket copy. mmap shines when your application must parse/scan/inspect the data without mutating it heavily. For true file→socket zero-copy, prefer sendfile.

Minimal mapping pattern:

#include <fcntl.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
 
struct map { const unsigned char *p; size_t len; };
 
static bool map_file_readonly(const char *path, struct map *m) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;
  struct stat st; if (fstat(fd, &st) != 0) { close(fd); return false; }
  if (st.st_size == 0) { m->p = NULL; m->len = 0; close(fd); return true; }
  void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);
  if (addr == MAP_FAILED) return false;
  m->p = (const unsigned char *)addr; m->len = (size_t)st.st_size; return true;
}
 
static void unmap_file(struct map *m) {
  if (m->p && m->len) munmap((void *)m->p, m->len);
  m->p = NULL; m->len = 0;
}

Use this when you need to parse or search large files efficiently. For pure file→socket transfer, sendfile is simpler and faster.
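
As a usage sketch, here is how you might count newline-delimited records straight out of the page cache, with no intermediate read() buffers (count_lines is our own example helper operating on the struct map above):

#include <string.h>

// Scan the mapping directly; faults pull pages in on demand.
static size_t count_lines(const struct map *m) {
  size_t lines = 0;
  const unsigned char *p = m->p;
  const unsigned char *end = m->p + m->len;
  while (p < end) {
    const unsigned char *nl = memchr(p, '\n', (size_t)(end - p));
    if (!nl) break;
    ++lines;
    p = nl + 1;
  }
  return lines;
}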

Measuring the win (preview)

You’ll validate zero-copy by observing:

  • Fewer syscalls per byte transferred (e.g., a single sendfile loop instead of read + write pairs).
  • Lower CPU usage per GB served.
  • Higher throughput for the same CPU budget, especially for large sequential files.

In the next sections we’ll dive into splice(2), vmsplice(2), and tee(2) for fd↔fd pipelines, along with alignment/offload nuances and a measurement checklist that counts copies and syscalls accurately.

The splice family: pipe buffers as the hub

splice(2), vmsplice(2), and tee(2) let the kernel shuttle bytes among file descriptors using pipe buffers as the interchange. One side of splice must be a pipe. The typical pattern for non-pipe endpoints (files, sockets) is:

  1. Create a pipe (pipe2(O_NONBLOCK|O_CLOEXEC)).
  2. splice(in_fd → pipe_w) to pull data into the pipe.
  3. splice(pipe_r → out_fd) to push data out.

This can avoid user-space copies entirely when both endpoints are kernel-managed (file, socket, device) and the pipe buffers can reference pages directly.

splice(2) fundamentals

  • One end must be a pipe; the other can be a file, socket, or another pipe.
  • Returns the number of bytes moved; short transfers are normal. Loop.
  • Common flags:
    • SPLICE_F_MOVE: request page-moving instead of copying when possible.
    • SPLICE_F_MORE: hint that more data will follow (helps coalescing on sockets).
    • SPLICE_F_NONBLOCK: act nonblocking; returns -1/EAGAIN quickly.

Minimal utility to move bytes between two non-pipe FDs via a pipe:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
 
// Moves up to 'limit' bytes from in_fd → out_fd using an internal pipe.
// Returns bytes moved (>=0) or -1 on error. Nonblocking-friendly: may return < limit.
ssize_t splice_pump(int in_fd, int out_fd, size_t limit) {
  int p[2];
  if (pipe2(p, O_NONBLOCK | O_CLOEXEC) != 0) return -1;
  int pr = p[0], pw = p[1];
  size_t total = 0;
  const size_t CHUNK = 1 << 20; // 1 MiB per inner transfer
  while (total < limit) {
    size_t want = limit - total;
    if (want > CHUNK) want = CHUNK;
    // Pull from in_fd into the pipe
    ssize_t n = splice(in_fd, NULL, pw, NULL, want,
                       SPLICE_F_MOVE | SPLICE_F_MORE | SPLICE_F_NONBLOCK);
    if (n == 0) break;            // EOF on input
    if (n < 0) {
      if (errno == EINTR) continue;
      if (errno == EAGAIN) break; // nothing readable right now
      close(pr); close(pw); return -1;
    }
    // Push everything we just pulled from the pipe to out_fd
    ssize_t left = n;
    while (left > 0) {
      ssize_t m = splice(pr, NULL, out_fd, NULL, (size_t)left,
                         SPLICE_F_MOVE | SPLICE_F_MORE | SPLICE_F_NONBLOCK);
      if (m > 0) { left -= m; total += (size_t)m; continue; }
      if (m < 0 && errno == EINTR) continue;
      if (m < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        // Output full. Note: 'left' bytes are stranded in this short-lived
        // pipe and lost on close; a long-lived per-connection pipe would be
        // kept open and drained once out_fd is writable again.
        close(pr); close(pw); return (ssize_t)total;
      }
      close(pr); close(pw); return -1; // hard error
    }
  }
  close(pr); close(pw);
  return (ssize_t)total;
}

Notes:

  • This helper is nonblocking-friendly: it stops on EAGAIN so you can integrate with your event loop.
  • For large transfers, reuse a long-lived pipe per connection to avoid repeated pipe2() overhead (one possible shape is sketched below).
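
A minimal sketch of that reuse, with illustrative names of our own (conn_pipe, buffered). Keeping the pipe open across EAGAIN also preserves any bytes already pulled in, avoiding the stranded-bytes hazard noted inside splice_pump:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

// A per-connection conveyor: create once, reuse for every transfer.
struct conn_pipe {
  int r, w;        // pipe ends; -1 until initialized
  size_t buffered; // bytes pulled into the pipe but not yet pushed out
};

static bool conn_pipe_init(struct conn_pipe *cp) {
  int p[2];
  if (pipe2(p, O_NONBLOCK | O_CLOEXEC) != 0) return false;
  cp->r = p[0]; cp->w = p[1]; cp->buffered = 0;
  return true;
}

static void conn_pipe_close(struct conn_pipe *cp) {
  if (cp->r >= 0) close(cp->r);
  if (cp->w >= 0) close(cp->w);
  cp->r = cp->w = -1;
  cp->buffered = 0;
}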

File→socket with splice: when not to use sendfile

sendfile() is the simplest file→socket path. Use splice when you need to interleave kernel-forwarded segments (e.g., mix a user header via vmsplice, then file bytes via splice) into a single socket stream without userland copies.

// Example: pump file→socket using splice (nonblocking-friendly)
ssize_t file_to_sock_splice(int sockfd, int filefd, off_t *p_off, size_t count) {
  // If p_off != NULL, emulate pread-like behavior: advance the off_t yourself.
  if (p_off) {
    // splice(2) itself accepts an off_in pointer for the file side, but our
    // splice_pump helper does not expose it, so position the fd with lseek().
    // This mutates the shared file position, so serialize access to filefd.
    off_t off = *p_off; size_t moved = 0;
    while (moved < count) {
      if (lseek(filefd, off, SEEK_SET) == (off_t)-1) return -1;
      ssize_t n = splice_pump(filefd, sockfd, count - moved);
      if (n <= 0) return moved > 0 ? (ssize_t)moved : n;
      moved += (size_t)n; off += (off_t)n;
    }
    *p_off = off; return (ssize_t)moved;
  }
  return splice_pump(filefd, sockfd, count);
}

If you don’t need headers/trailers or special routing, prefer sendfile. Otherwise, splice gives you a general fd↔fd conveyor belt.


vmsplice(2): inject user buffers into a pipe

vmsplice() maps user memory into a pipe as pipe buffers, potentially avoiding an extra copy. The kernel may “pin” pages and reference them until consumed downstream. Whether this is truly copy-free is implementation-dependent, but it reliably reduces user↔kernel crossings and avoids intermediate userland staging buffers.

Use cases:

  • Prepend small headers before a large file payload without building a big temporary buffer in user space.
  • Stitch multiple user buffers into a pipe, then splice to the socket in large bursts.

Minimal helper:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h> // vmsplice, SPLICE_F_GIFT
#include <sys/uio.h>
#include <unistd.h>
 
// Push iovecs into a pipe with vmsplice. Returns bytes queued (>=0) or -1.
// Caveat: the kernel may reference your pages until they are consumed
// downstream, so do not modify or free them until the pipe is drained.
ssize_t pipe_push_iov_vmsplice(int pipe_w, const struct iovec *iov, int iovcnt) {
  // SPLICE_F_GIFT: hint pages can be gifted to the pipe; the kernel may avoid
  // copying, but only for page-aligned, page-sized buffers, and the caller
  // must never touch gifted pages again.
  ssize_t n = vmsplice(pipe_w, iov, (unsigned)iovcnt, SPLICE_F_GIFT);
  if (n < 0 && errno == EINTR) return 0; // retry policy left to caller
  return n;
}

Headers + file body without user copies

Pattern: queue headers via vmsplice into the pipe, then splice the file body into the same pipe, then splice pipe→socket. The socket sees a single contiguous stream.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/uio.h>
#include <unistd.h>
 
bool send_header_and_file(int sockfd, const void *h1, size_t h1len,
                          const void *h2, size_t h2len,
                          int filefd, size_t file_len) {
  int p[2]; if (pipe2(p, O_NONBLOCK | O_CLOEXEC) != 0) return false;
  int pr = p[0], pw = p[1];
 
  struct iovec hdr[2] = { { .iov_base = (void*)h1, .iov_len = h1len },
                          { .iov_base = (void*)h2, .iov_len = h2len } };
  ssize_t h = pipe_push_iov_vmsplice(pw, hdr, 2);
  if (h < 0) { close(pr); close(pw); return false; }
 
  // Pull file bytes into pipe, possibly in chunks
  size_t moved = 0; const size_t CHUNK = 1 << 20;
  while (moved < file_len) {
    size_t want = file_len - moved; if (want > CHUNK) want = CHUNK;
    ssize_t n = splice(filefd, NULL, pw, NULL, want,
                       SPLICE_F_MORE | SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
    if (n > 0) { moved += (size_t)n; continue; }
    if (n == 0) break; // EOF
    if (errno == EINTR) continue;
    if (errno == EAGAIN) break; // caller should wait for POLLOUT on sock and retry later
    close(pr); close(pw); return false;
  }
 
  // Done filling: close the write end so a fully drained pipe reads as 0.
  // (With pw still open, an empty pipe would return EAGAIN instead.)
  close(pw);
 
  // Drain pipe → socket
  for (;;) {
    ssize_t m = splice(pr, NULL, sockfd, NULL, 1 << 20,
                       SPLICE_F_MOVE | SPLICE_F_MORE | SPLICE_F_NONBLOCK);
    if (m > 0) continue;
    if (m == 0) break; // pipe drained
    if (errno == EINTR) continue;
    if (errno == EAGAIN) break; // socket full; real code keeps pr open and resumes
    close(pr); return false;
  }
  close(pr); return true;
}

Guidance:

  • Keep headers relatively small. Even if the kernel copies them, the cost is negligible compared to the file body.
  • Treat EAGAIN as a scheduling event; re-arm POLLOUT and resume draining the pipe into the socket.
  • Reuse a per-connection pipe to amortize pipe2() cost and avoid descriptor churn.

tee(2): duplicate a pipe stream without copying

tee() clones the data in one pipe into another pipe by increasing references to the same pipe buffers. This lets you fan out a stream to multiple consumers (e.g., write once, send to two sockets) without copying payload bytes.

Sketch:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>
 
// Pull in_fd into pipe A, clone A→B with tee(), then drain each pipe to its socket.
bool fanout_two_sockets(int in_fd, int sock1, int sock2) {
  int pa[2], pb[2];
  if (pipe2(pa, O_NONBLOCK|O_CLOEXEC) || pipe2(pb, O_NONBLOCK|O_CLOEXEC)) return false;
  // Pull source fd → pa
  for (;;) {
    ssize_t n = splice(in_fd, NULL, pa[1], NULL, 1<<20, SPLICE_F_MORE|SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
    if (n <= 0) break;
    // Clone pa → pb
    ssize_t k = tee(pa[0], pb[1], (size_t)n, SPLICE_F_NONBLOCK);
    (void)k; // may be < n: a full version must tee() the remainder before
             // draining pa, or sock2 misses the uncloned bytes
    // Drain each pipe to its socket
    for (;;) {
      ssize_t a = splice(pa[0], NULL, sock1, NULL, 1<<20, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
      if (!(a > 0 || (a < 0 && errno == EINTR))) break;
    }
    for (;;) {
      ssize_t b = splice(pb[0], NULL, sock2, NULL, 1<<20, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
      if (!(b > 0 || (b < 0 && errno == EINTR))) break;
    }
  }
  close(pa[0]); close(pa[1]); close(pb[0]); close(pb[1]);
  return true;
}

Notes:

  • tee() requires both ends be pipes. It doesn’t advance the read head of the input pipe; you still need to drain it.
  • Backpressure applies independently per consumer. Bound queue lengths and resume when sockets are writable.

Coalescing and offloads: getting line-rate without tiny writes

Zero-copy primitives give you the big win by skipping user buffers, but you still need to respect how the TCP stack and NIC like to send bytes: in large, contiguous segments with checksums offloaded and minimal per-packet overhead.

Tools you can use:

  • SPLICE_F_MORE on splice: hints more data will follow, encouraging coalescing.
  • MSG_MORE on send/sendmsg: similar hint for non-splice paths.
  • TCP_CORK socket option: hold back small segments until you uncork or fill a full MSS. Useful when sending headers followed by a large body.
  • NIC offloads (checksum offload, TSO/GSO): the kernel and NIC cooperate to segment and checksum efficiently. You get these benefits automatically when you keep the path inside the kernel.

Corking pattern for mixed header + sendfile body:

#include <errno.h>
#include <netinet/tcp.h>
#include <stdbool.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
 
static void set_cork(int sockfd, int on) {
  int v = on ? 1 : 0; (void)setsockopt(sockfd, IPPROTO_TCP, TCP_CORK, &v, sizeof v);
}
 
bool send_with_cork(int sockfd, const void *hdr, size_t hdr_len, int filefd, off_t *off, size_t len) {
  set_cork(sockfd, 1);
  ssize_t h = send(sockfd, hdr, hdr_len, MSG_MORE);
  // Sketch: a short header send is treated as failure; real code would resume.
  if (h < 0 || (size_t)h != hdr_len) { set_cork(sockfd, 0); return false; }
  size_t moved = 0;
  while (moved < len) {
    ssize_t n = sendfile(sockfd, filefd, off, len - moved);
    if (n > 0) { moved += (size_t)n; continue; }
    if (n == 0) break;
    if (errno == EINTR) continue;
    if (errno == EAGAIN) { /* wait for writable then retry */ set_cork(sockfd, 0); return false; }
    set_cork(sockfd, 0); return false;
  }
  set_cork(sockfd, 0);
  return true;
}

Guidance:

  • Use corking only around small headers leading into large bodies. Leaving a socket corked too long increases tail latency.
  • Prefer SPLICE_F_MORE/MSG_MORE over corking for short-lived bursts; corking is a stronger control.

Pipe buffers: sizes, tuning, and memory pressure

Pipes are the hub for splice/vmsplice. Each pipe has a finite buffer capacity expressed as a number of pages. Under load, the defaults may limit throughput.

Practical knobs:

  • fcntl(pipe_fd, F_SETPIPE_SZ, bytes): request a larger pipe capacity. The kernel rounds the request up to a power-of-two number of pages; unprivileged requests above the system limit fail with EPERM.
  • System limits: /proc/sys/fs/pipe-max-size and per-user pages limits may cap your request.

Example setup:

#define _GNU_SOURCE
#include <fcntl.h>
 
static void maybe_grow_pipe(int pfd) {
  int want = 1 << 20; // 1 MiB
  int got = fcntl(pfd, F_SETPIPE_SZ, want);
  (void)got; // on success: the actual new capacity (>= want); -1 if refused
}

Guidance:

  • A single large pipe per connection is usually sufficient. Measure before growing; larger pipes consume more kernel memory under backpressure.
  • Respect backpressure. If splice to the socket returns EAGAIN, stop and re-arm writable notifications rather than spinning.

Page cache, readahead, and advice

For file→socket transfers, the page cache and readahead determine how smoothly pages arrive.

Useful calls:

  • posix_fadvise(fd, off, len, POSIX_FADV_SEQUENTIAL): hint sequential access.
  • posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED): nudge readahead to prefetch.
  • posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED): drop from cache after use for cold files to reduce cache pollution.
  • madvise(addr, len, MADV_SEQUENTIAL|MADV_WILLNEED|MADV_DONTNEED): similar hints for mmap paths.

Warmup sketch:

#include <fcntl.h>
 
static void warmup_sequential(int fd, off_t off, off_t len) {
  (void)posix_fadvise(fd, off, len, POSIX_FADV_SEQUENTIAL);
  (void)posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}

Guidance:

  • Do not overuse WILLNEED; you can evict hot data for other tenants. Use it for cold, one-shot transfers (a combined sketch follows this list).
  • For very hot content, the cache will stay warm without special hints.
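
Putting the hints together for a cold, one-shot transfer, here is a sketch that reuses the sendfile_exact helper from earlier (serve_cold_file is an illustrative name):

#include <fcntl.h>
#include <stdbool.h>

// Warm up, send, then evict: sensible only for files you won't serve again soon.
static bool serve_cold_file(int sockfd, int filefd, off_t off, size_t len) {
  (void)posix_fadvise(filefd, off, (off_t)len, POSIX_FADV_SEQUENTIAL);
  (void)posix_fadvise(filefd, off, (off_t)len, POSIX_FADV_WILLNEED);
  off_t pos = off;
  bool ok = sendfile_exact(sockfd, filefd, &pos, len);
  // Drops clean pages only; harmless if the file stays hot elsewhere.
  (void)posix_fadvise(filefd, off, (off_t)len, POSIX_FADV_DONTNEED);
  return ok;
}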

Transformations: when zero-copy meets TLS and compression

Transformations break pure zero-copy because bytes must be changed. Options to keep performance high:

  • Kernel TLS (KTLS): move encryption into the kernel so sendfile and friends can still operate on plaintext pages while the kernel encrypts on send. Availability depends on kernel and cipher suites.
  • Pre-compress or pre-encrypt static assets at rest and serve them as-is when clients accept them.
  • Accept a copy for the transformed portion (headers, small dynamic fragments) and keep the large body on the kernel path.

If KTLS isn’t available, a common compromise is: write small headers/trailers from user space, then sendfile the body. The overall CPU cost stays low.
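
For orientation only, here is a heavily hedged sketch of what enabling KTLS transmit can look like on kernels built with TLS support. The structs come from <linux/tls.h>; the key material (elided here) comes from your user-space TLS handshake, and enable_ktls_tx is an illustrative name:

#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282
#endif

// Sketch: attach the "tls" ULP, then install TX keys for AES-128-GCM.
static bool enable_ktls_tx(int sockfd, const unsigned char key[16],
                           const unsigned char iv[8],
                           const unsigned char salt[4],
                           const unsigned char rec_seq[8]) {
  if (setsockopt(sockfd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) != 0)
    return false; // kernel lacks kTLS (or the tls module isn't loaded)
  struct tls12_crypto_info_aes_gcm_128 ci;
  memset(&ci, 0, sizeof ci);
  ci.info.version = TLS_1_2_VERSION;
  ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
  memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
  memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
  memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
  memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
  // After this succeeds, sendfile() on sockfd transmits encrypted records.
  return setsockopt(sockfd, SOL_TLS, TLS_TX, &ci, sizeof ci) == 0;
}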

Robust nonblocking state machine (headers + file)

Integrate zero-copy into a readiness loop with a small state machine that handles headers via iovecs and the body via sendfile.

#include <sys/uio.h>
 
enum xfer_state { X_HDR, X_BODY, X_DONE, X_ERR };
 
struct xfer {
  enum xfer_state st;
  struct iovec hdr[4];
  int hdr_cnt;
  int sockfd;
  int filefd;
  off_t off;
  size_t body_left;
};
 
static void on_writable(struct xfer *x) {
  if (x->st == X_HDR) {
    while (x->hdr_cnt > 0) {
      ssize_t n = writev(x->sockfd, x->hdr, x->hdr_cnt);
      if (n > 0) {
        size_t used = (size_t)n;
        // Advance iovecs
        int i = 0; while (i < x->hdr_cnt && used >= x->hdr[i].iov_len) { used -= x->hdr[i].iov_len; ++i; }
        if (i > 0) { for (int k = 0; k + i < x->hdr_cnt; ++k) x->hdr[k] = x->hdr[k+i]; x->hdr_cnt -= i; }
        if (used > 0 && x->hdr_cnt > 0) { x->hdr[0].iov_base = (char*)x->hdr[0].iov_base + used; x->hdr[0].iov_len -= used; }
        continue;
      }
      if (n < 0 && errno == EINTR) continue;
      if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return;
      x->st = X_ERR; return;
    }
    x->st = X_BODY;
  }
  if (x->st == X_BODY) {
    while (x->body_left > 0) {
      ssize_t n = sendfile(x->sockfd, x->filefd, &x->off, x->body_left);
      if (n > 0) { x->body_left -= (size_t)n; continue; }
      if (n == 0) { x->st = X_DONE; return; }
      if (n < 0 && errno == EINTR) continue;
      if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return;
      x->st = X_ERR; return;
    }
    x->st = X_DONE;
  }
}

Notes:

  • This integrates cleanly with an event loop: enable POLLOUT/EPOLLOUT when state is not X_DONE/X_ERR, call on_writable on each wake (a minimal epoll driver is sketched below).
  • The header path uses writev because headers are typically small and benefit from a single syscall.
  • The body path uses sendfile for zero-copy. If you must mix in user buffers, switch to the pipe hub pattern from earlier.
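
A minimal epoll driver for that state machine might look like this; drive_transfer is an illustrative name, and a real server would multiplex many xfer objects instead of blocking on one:

#include <errno.h>
#include <stdbool.h>
#include <sys/epoll.h>

// Block on writability and advance one transfer to completion.
static bool drive_transfer(int epfd, struct xfer *x) {
  struct epoll_event ev = { .events = EPOLLOUT, .data.ptr = x };
  if (epoll_ctl(epfd, EPOLL_CTL_ADD, x->sockfd, &ev) != 0) return false;
  while (x->st != X_DONE && x->st != X_ERR) {
    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, -1);
    if (n < 0) { if (errno == EINTR) continue; break; }
    if (n == 1) on_writable((struct xfer *)out.data.ptr);
  }
  (void)epoll_ctl(epfd, EPOLL_CTL_DEL, x->sockfd, NULL);
  return x->st == X_DONE;
}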

Compatibility and graceful fallback

Not every fd pair supports splice/sendfile. Detect and fallback to read/write with robust loops when needed.

Error patterns to expect:

  • EINVAL: unsupported fd types or options (e.g., sendfile with O_DIRECT file).
  • ESPIPE: non-seekable file when an offset is required.
  • EOPNOTSUPP/ENOSYS: older kernels or specific filesystems.

Fallback sketch:

#include <errno.h>
#include <unistd.h>
 
static ssize_t fallback_copy(int out_fd, int in_fd, size_t count) {
  char buf[1<<16]; size_t total = 0;
  while (total < count) {
    size_t want = count - total; if (want > sizeof buf) want = sizeof buf;
    ssize_t r = read(in_fd, buf, want);
    if (r > 0) {
      size_t off = 0; while (off < (size_t)r) {
        ssize_t w = write(out_fd, buf + off, (size_t)r - off);
        if (w > 0) off += (size_t)w; else if (w < 0 && errno == EINTR) continue; else return (ssize_t)total;
      }
      total += (size_t)r; continue;
    }
    if (r == 0) break;
    if (r < 0 && errno == EINTR) continue;
    break;
  }
  return (ssize_t)total;
}
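
A sketch of the detection itself, first trying sendfile and dropping to the copy loop on the "unsupported" errnos listed above (copy_best_effort is an illustrative name):

#include <errno.h>
#include <sys/sendfile.h>

// Prefer the zero-copy path; fall back on fd pairs it can't handle.
static ssize_t copy_best_effort(int out_fd, int in_fd, off_t *off, size_t count) {
  ssize_t n = sendfile(out_fd, in_fd, off, count);
  if (n >= 0) return n;
  if (errno == EINVAL || errno == ENOSYS || errno == EOPNOTSUPP || errno == ESPIPE)
    return fallback_copy(out_fd, in_fd, count);
  return -1;
}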

Measuring copies and syscalls (trust, but verify)

Quick methods to validate improvements:

  • Syscall counting: strace -f -e trace=sendfile,splice,vmsplice,write,read -c your_server.
  • CPU accounting: perf stat -p <pid> during steady-state transfer.
  • Throughput: wrk/ab/iperf or a custom client; compare GB/s and CPU%.
  • Socket stats: enable TCP info logs or sample getsockopt(TCP_INFO) to observe retransmits and pacing.

Checklist for fair tests:

  • Warm the page cache or explicitly test cold-cache performance; report which you measured.
  • Bind to a single core to compare CPU cycles per byte; then scale cores.
  • Fix NIC and link speed; disable competing traffic.
  • Test small files and large files separately; zero-copy shines more as payload grows.

mmap writes, page faults, and consistency

mmap is also useful on the write side, but understand what’s actually happening:

  • MAP_PRIVATE (copy-on-write): your writes modify private pages; they do not hit the file. Useful for read-mostly parsing or transformations without persisting.
  • MAP_SHARED: your writes dirty the mapped pages; the kernel writes them back to the file later. Call msync(addr, len, MS_SYNC|MS_INVALIDATE) if you need durability/visibility guarantees at specific points.
  • Page faults: the first access to a page triggers a fault that pulls the page in; on write with MAP_PRIVATE, the kernel allocates a private copy (CoW). Those are still copies, just deferred and amortized at page granularity.

Implications:

  • mmap helps when you need random access or parsing without staging copies. It does not remove the user→socket copy when you later send() those bytes.
  • For write-heavy workloads, classic buffered I/O with writev and tuned batching may outperform mmap due to fewer minor faults and clearer backpressure points. Measure.

Minimal durability sync:

#include <stdbool.h>
#include <sys/mman.h>
 
static bool flush_mapped(void *addr, size_t len) {
  return msync(addr, len, MS_SYNC) == 0;
}
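
msync() requires a page-aligned address, so a usage sketch for an in-place record update rounds down first (update_and_flush and the record offsets are illustrative; it reuses flush_mapped from above):

#include <string.h>
#include <unistd.h>

// Modify bytes inside a MAP_SHARED mapping, then flush the touched pages.
static bool update_and_flush(unsigned char *base, size_t rec_off,
                             const void *rec, size_t rec_len) {
  memcpy(base + rec_off, rec, rec_len);
  size_t page = (size_t)sysconf(_SC_PAGESIZE);
  size_t start = rec_off & ~(page - 1); // round down to a page boundary
  return flush_mapped(base + start, rec_off + rec_len - start);
}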

Checksums: NIC offloads vs userland

With kernel-managed send paths (sendfile, splice → socket), you benefit from NIC offloads automatically where supported:

  • Checksum offload (CSUM): NIC computes TCP/UDP checksums.
  • TSO/GSO: kernel presents large skbs; NIC segments them to MSS-size frames.
  • LRO/GRO on receive: coalescing reduces per-packet overhead (for completeness).

If your protocol embeds its own application-level checksum over the payload, you still compute that in user space (or precompute/store alongside the file). Zero-copy doesn’t stop you from computing a header checksum over metadata while letting the body take the kernel path.

Verification tip (operational): inspect ethtool -k and ss -ti/nstat counters to confirm offloads are in effect; no code changes needed for the zero-copy APIs discussed here.

Edge cases and correctness pitfalls

  • Short transfers are the norm: every API here can return fewer bytes than requested. Loop until done.
  • EAGAIN is not an error: it’s a scheduling signal. Re-arm POLLOUT/EPOLLOUT and resume.
  • Peer close during transfer: splice/sendfile can return -1 with EPIPE or zero; handle EPOLLERR/EPOLLHUP by attempting to drain and then close.
  • Offsets and overflow: when mixing headers + file lengths, validate off_t arithmetic; guard against integer wrap on 32-bit systems (a guard sketch follows this list).
  • Descriptor lifetimes: keep the file and pipe fds alive until all splices complete; avoid closing the source fd while the kernel still references pages.
  • Direct I/O: avoid mixing O_DIRECT files with sendfile/splice. If you need direct I/O, use read/write with aligned buffers or io_uring and accept the copy.
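
A minimal guard, assuming builds with _FILE_OFFSET_BITS=64 so off_t is 64-bit (range_ok is an illustrative name):

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

// Reject offset/length pairs whose end would wrap past the signed 64-bit range.
static bool range_ok(off_t off, size_t len) {
  if (off < 0) return false;
  uint64_t end = (uint64_t)off + (uint64_t)len;
  return end >= (uint64_t)off && end <= (uint64_t)INT64_MAX;
}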

Observability: prove it in prod

Add lightweight metrics to catch regressions:

  • Bytes and ops by path: counters for sendfile bytes, splice bytes, fallback bytes.
  • EAGAIN rates and average resume latency per connection.
  • Per-connection pipe sizes and occupancy (sampled), to detect chronic backpressure.
  • Error tallies by errno for sendfile/splice/vmsplice.

Sketch:

struct zc_metrics {
  unsigned long long sf_bytes, sp_bytes, vmsp_bytes, fb_bytes;
  unsigned long long sf_calls, sp_calls, vmsp_calls, fb_calls;
  unsigned long long eagain_w, eagain_r;
};
 
static inline void add64(unsigned long long *c, unsigned long long v) {
  __atomic_add_fetch(c, v, __ATOMIC_RELAXED);
}
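
Wiring the counters in is a thin wrapper per path; for example, around sendfile (M is a hypothetical global instance, sendfile_counted an illustrative name):

#include <errno.h>
#include <sys/sendfile.h>

static struct zc_metrics M;

// Count bytes, calls, and EAGAIN events on the sendfile path.
static ssize_t sendfile_counted(int out_fd, int in_fd, off_t *off, size_t count) {
  ssize_t n = sendfile(out_fd, in_fd, off, count);
  add64(&M.sf_calls, 1);
  if (n > 0) add64(&M.sf_bytes, (unsigned long long)n);
  else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) add64(&M.eagain_w, 1);
  return n;
}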

Production checklist (printable)

  • Prefer sendfile for file→socket when you don’t need to transform the body.
  • Use splice+pipe when you must stitch sources (headers via vmsplice + file body) without user copies.
  • Treat EAGAIN as backpressure; integrate with an event loop; never spin.
  • Use SPLICE_F_MORE/MSG_MORE or transient TCP_CORK to coalesce small pieces into full segments.
  • Reuse a pipe per connection and tune pipe size only after measuring.
  • Warm cold files with posix_fadvise (SEQUENTIAL/WILLNEED) sparingly; consider DONTNEED after send.
  • Keep transformation work (TLS/compression) either in-kernel (KTLS) or limited to small fragments.
  • Log bytes moved and error codes per path; alert on fallback growth.

Closing thoughts

Zero-copy on Linux isn’t a single switch—it’s a set of disciplined patterns. The winning formula is simple:

  1. Keep big payloads on kernel-managed paths (sendfile, splice).
  2. Limit user-space to control, metadata, and small headers (writev, vmsplice).
  3. Respect backpressure, coalesce wisely, and measure everything.

Do that, and you’ll trade buffer shuffles for throughput, swap midnight copy-tuning for clear metrics, and make your CPUs do more interesting work than moving bytes from A to B.