You can make CPUs fast or you can make them move bytes. Doing both at the same time is hard. Most production servers spend a shocking amount of time shuffling memory—copying from kernel to user, from one buffer to another, from one cache level to the next. Zero-copy techniques are about getting out of the way: keep data on the fast path (page cache, DMA, socket buffers) and avoid needless detours through user space.
This post is a grounded tour of Linux primitives that enable “zero-copy-ish” data movement: sendfile(2), splice(2), vmsplice(2), tee(2), and mmap(2). We’ll cover what they promise, where the copies still happen, and how to write robust loops that hit line rate without turning into a pile of edge cases.
Important mindset: zero-copy rarely means "no copies exist anywhere." It usually means "no copies in user space" and "no redundant copies between kernel subsystems." DMA still moves bytes to/from devices; the page cache still backs file data; socket buffers still exist. The win is cutting out the userland bounce buffers and system call overhead in the hot path.
What zero-copy actually means (and doesn’t)
- Kernel can forward pages or pipe buffers directly between subsystems without materializing a user-space buffer.
- User space sets up the transfer with a small number of syscalls; the kernel does the heavy lifting.
- Sometimes the kernel shares page references (or pipe buffers) instead of copying data; sometimes it still copies but avoids extra crossings.
- Disk and NIC DMA are still real. Expect copies between device and RAM unless you’re on specialized NICs with userspace stacks (DPDK, io_uring/ZC send on some NICs, etc.). We’ll stay focused on portable kernel interfaces.
The common data paths at a glance
User-copy path (classic):
read(file, user_buf) ⇒ device→RAM (DMA) → page cache → copy into user_buf
write(sock, user_buf) ⇒ copy from user_buf → socket send buffer → NIC (DMA)

Zero-copy-ish path (file→socket):
sendfile(sock, file, ...) ⇒ page cache pages forwarded into the socket stack; a user-level buffer is never touched.

Zero-copy-ish path (fd↔fd via pipe):
splice(fd_in, pipe) + splice(pipe, fd_out) ⇒ pipe-buffer forwarding between endpoints.

User-mapped path:
mmap(file) ⇒ map file pages into your address space; page faults pull pages in; you may write()/send() from those addresses (still a user→socket copy unless paired with other tricks).
Page cache vs direct paths (why it matters)
Most practical zero-copy file→socket pipelines ride the page cache. When you sendfile() from a regular file to a TCP socket:
- If the needed pages are not in memory, the kernel will schedule reads (readahead) to fill them.
- The socket then references those pages as the payload source. Depending on kernel and offloads, the bytes are segmented/checksummed and sent to the NIC without userland ever seeing them.
Direct I/O (O_DIRECT) changes the rules by bypassing the page cache for the file side, but it complicates alignment and buffering. Classic sendfile() expects page-cache-backed files; pairing O_DIRECT with sendfile typically yields EINVAL or falls back. Keep sendfile() on cache-backed files unless you’re solving a very specific problem with direct I/O (we’ll return to this later when we discuss measurement and alignment constraints).
sendfile(2): the simplest zero-copy for file→socket
sendfile() moves data from a file descriptor to a socket descriptor inside the kernel. No user buffer, minimal syscalls, high throughput.
High-level behavior on Linux:
- Source must be a file descriptor that supports mapping into the page cache (regular files).
- Destination is usually a socket (AF_INET, AF_UNIX, etc.).
- The kernel pulls file pages (with readahead) and attaches them to the socket’s send path.
- The call may transfer fewer bytes than requested (short send) and must be looped.
Minimal robust loop (blocking socket):
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>
// Send exactly 'count' bytes from 'in_fd' (file) to 'out_fd' (socket), starting at *p_off.
// Returns true on success, false on error. Updates *p_off.
bool sendfile_exact(int out_fd, int in_fd, off_t *p_off, size_t count) {
size_t remaining = count;
while (remaining > 0) {
    // Linux transfers at most 0x7ffff000 bytes per sendfile() call anyway;
    // clamping to SSIZE_MAX keeps the request representable everywhere.
    size_t chunk = remaining;
    if (chunk > SSIZE_MAX) chunk = SSIZE_MAX;
ssize_t n = sendfile(out_fd, in_fd, p_off, chunk);
if (n > 0) {
remaining -= (size_t)n;
continue;
}
if (n == 0) {
// EOF on file before sending 'count' bytes
return false;
}
if (n == -1 && errno == EINTR) {
continue; // interrupted, retry
}
return false; // EAGAIN on nonblocking, or hard error
}
return true;
}
Key points you must honor:
- Short sends are normal. Loop until done; treat EINTR as a retry.
- On nonblocking sockets, sendfile() may return -1 with errno == EAGAIN after transferring some bytes. You must track progress via off_t *offset and resume after the socket is writable again.
- offset semantics: if offset is non-NULL, the file position of in_fd is not modified; the kernel uses and updates the pointed-to value. If offset is NULL, the kernel updates the file position of in_fd instead.
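These semantics are easy to check empirically. A minimal sketch (file→file sendfile() works on modern Linux, 2.6.33+; the function name and temp-file paths here are mine, purely for illustration):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

// Demonstrates sendfile() offset semantics: with a non-NULL offset pointer,
// the kernel advances *p_off but leaves in_fd's file position untouched.
// Returns 0 on success, -1 on any unexpected result.
int demo_sendfile_offset(void) {
    char src[] = "/tmp/zc_src_XXXXXX", dst[] = "/tmp/zc_dst_XXXXXX";
    int in_fd = mkstemp(src), out_fd = mkstemp(dst);
    if (in_fd < 0 || out_fd < 0) return -1;
    if (write(in_fd, "hello, sendfile", 15) != 15) return -1;
    if (lseek(in_fd, 0, SEEK_SET) != 0) return -1;   // position now 0

    off_t off = 0;
    ssize_t n = sendfile(out_fd, in_fd, &off, 15);   // file→file works on modern kernels
    unlink(src); unlink(dst);

    int ok = (n == 15)
          && (off == 15)                          // pointed-to value advanced
          && (lseek(in_fd, 0, SEEK_CUR) == 0);    // in_fd position unchanged
    close(in_fd); close(out_fd);
    return ok ? 0 : -1;
}
```

This is why resumable nonblocking loops carry an off_t of their own rather than trusting the file position.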
Nonblocking-friendly variant (returning progress):
#include <sys/sendfile.h>
// Attempts to send up to 'count' bytes. Returns bytes sent (>=0), or -1 on error.
// On EAGAIN/EWOULDBLOCK, returns the bytes sent so far (>=0) and sets errno to EAGAIN.
ssize_t sendfile_try(int out_fd, int in_fd, off_t *p_off, size_t count) {
size_t remaining = count;
size_t sent = 0;
while (remaining > 0) {
size_t chunk = remaining;
if (chunk > SSIZE_MAX) chunk = SSIZE_MAX;
ssize_t n = sendfile(out_fd, in_fd, p_off, chunk);
if (n > 0) { sent += (size_t)n; remaining -= (size_t)n; continue; }
if (n == 0) { return (ssize_t)sent; } // EOF
if (errno == EINTR) { continue; }
if (errno == EAGAIN || errno == EWOULDBLOCK) { return (ssize_t)sent; }
return -1; // hard error
}
return (ssize_t)sent;
}
Where sendfile() shines
- Static content servers: large files, media streaming, software distribution.
- Any workload where the application does not need to inspect/transform payloads.
- CPU offload from user space: fewer cache misses, fewer copies, fewer syscalls.
Common pitfalls and gotchas
- Headers/trailers: Classic sendfile() is body-only. If you need to send headers + file + trailers without extra copies, consider writev() for small headers followed by sendfile(), or use sendmsg() with MSG_ZEROCOPY on capable stacks (beyond the scope here) or kernel TLS (KTLS). Keep headers small so the user-space copy cost is negligible.
- Nonblocking semantics: expect EAGAIN. Integrate with your event loop; resume where you left off using offset.
- File holes and sparse files: the kernel may synthesize runs of zeros; behavior is generally fine, but measure.
- Direct I/O: pairing O_DIRECT file descriptors with sendfile() is not generally supported; you’ll see EINVAL. Stick to page-cache-backed files here.
- Cross-platform: BSD/macOS sendfile() signatures and semantics differ (e.g., headers/trailers support). The portability story is “same idea, different knobs.” The patterns here focus on Linux.
mmap(2): zero-copy reads into your address space (with caveats)
mmap() maps file pages directly into your process. Reads become page faults that pull data into RAM and mark those pages present. This can reduce copies for in-process consumers (you read directly out of the page cache) and improve spatial locality.
However, mmap is not a magic bullet for sending data over sockets: calling send() on an mmap’d address still copies from user space into the socket buffer. You didn’t eliminate the user→socket copy. mmap shines when your application must parse/scan/inspect the data without mutating it heavily. For true file→socket zero-copy, prefer sendfile.
Minimal mapping pattern:
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
struct map { const unsigned char *p; size_t len; };
static bool map_file_readonly(const char *path, struct map *m) {
int fd = open(path, O_RDONLY);
if (fd < 0) return false;
struct stat st; if (fstat(fd, &st) != 0) { close(fd); return false; }
if (st.st_size == 0) { m->p = NULL; m->len = 0; close(fd); return true; }
void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
if (addr == MAP_FAILED) return false;
m->p = (const unsigned char *)addr; m->len = (size_t)st.st_size; return true;
}
static void unmap_file(struct map *m) {
if (m->p && m->len) munmap((void *)m->p, m->len);
m->p = NULL; m->len = 0;
}
Use this when you need to parse or search large files efficiently. For pure file→socket transfer, sendfile is simpler and faster.
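As a usage sketch of the read path (self-contained rather than reusing the helper above, so it compiles on its own; find_byte_in_file is my name): map a file, scan it directly out of the page cache with memchr, unmap. No read() staging buffer is involved.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps a file read-only and returns the offset of the first 'needle' byte,
// or -1 if absent, empty, or on error. The scan runs directly over
// page-cache-backed memory.
long find_byte_in_file(const char *path, unsigned char needle) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping keeps the file alive; fd no longer needed
    if (addr == MAP_FAILED) return -1;
    const unsigned char *p = memchr(addr, needle, (size_t)st.st_size);
    long off = p ? (long)(p - (const unsigned char *)addr) : -1;
    munmap(addr, (size_t)st.st_size);
    return off;
}
```

Note that closing the fd right after mmap is safe: the mapping holds its own reference to the file.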
Measuring the win (preview)
You’ll validate zero-copy by observing:
- Fewer syscalls per byte transferred (e.g., a single sendfile loop instead of read + write pairs).
- Lower CPU usage per GB served.
- Higher throughput for the same CPU budget, especially for large sequential files.
In the next sections we’ll dive into splice(2), vmsplice(2), and tee(2) for fd↔fd pipelines, along with alignment/offload nuances and a measurement checklist that counts copies and syscalls accurately.
The splice family: pipe buffers as the hub
splice(2), vmsplice(2), and tee(2) let the kernel shuttle bytes among file descriptors using pipe buffers as the interchange. At least one side of a splice must be a pipe. The typical pattern for non-pipe endpoints (files, sockets) is:
- Create a pipe (pipe2(O_NONBLOCK|O_CLOEXEC)).
- splice(in_fd → pipe_w) to pull data into the pipe.
- splice(pipe_r → out_fd) to push data out.
This can avoid user-space copies entirely when both endpoints are kernel-managed (file, socket, device) and the pipe buffers can reference pages directly.
splice(2) fundamentals
- One end must be a pipe; the other can be a file, socket, or another pipe.
- Returns the number of bytes moved; short transfers are normal. Loop.
- Common flags:
  - SPLICE_F_MOVE: request page-moving instead of copying when possible.
  - SPLICE_F_MORE: hint that more data will follow (helps coalescing on sockets).
  - SPLICE_F_NONBLOCK: act nonblocking; returns -1/EAGAIN quickly.
Minimal utility to move bytes between two non-pipe FDs via a pipe:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
// Moves up to 'limit' bytes from in_fd → out_fd using an internal pipe.
// Returns bytes moved (>=0) or -1 on error. Nonblocking-friendly: may return < limit.
ssize_t splice_pump(int in_fd, int out_fd, size_t limit) {
int p[2];
if (pipe2(p, O_NONBLOCK | O_CLOEXEC) != 0) return -1;
ssize_t total = 0;
const size_t CHUNK = 1 << 20; // 1 MiB per inner transfer
int pr = p[0], pw = p[1];
  for (;;) {
    if ((size_t)total >= limit) break;
    size_t want = limit - (size_t)total;
    if (want > CHUNK) want = CHUNK;
    // Pull from in_fd into pipe
    ssize_t n = splice(in_fd, NULL, pw, NULL, want,
                       SPLICE_F_MOVE | SPLICE_F_MORE | SPLICE_F_NONBLOCK);
    if (n == 0) break; // EOF on input
    if (n < 0) {
      if (errno == EINTR) continue;
      if (errno == EAGAIN) break;        // input not ready; report progress
      close(pr); close(pw); return -1;   // hard error on the input side
    }
// Push from pipe to out_fd
ssize_t left = n;
while (left > 0) {
ssize_t m = splice(pr, NULL, out_fd, NULL, (size_t)left,
SPLICE_F_MOVE | SPLICE_F_MORE | SPLICE_F_NONBLOCK);
if (m > 0) { left -= m; total += m; continue; }
if (m < 0 && errno == EINTR) continue;
      if (m < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        // Couldn't push all bytes now. NOTE: closing the pipe below discards
        // any bytes still queued in it; production code should keep a
        // per-connection pipe alive and resume draining when writable.
        close(pr); close(pw); return total;
      }
// Hard error
close(pr); close(pw); return -1;
}
}
close(pr); close(pw); return total;
}
Notes:
- This helper is nonblocking-friendly: it stops on EAGAIN so you can integrate with your event loop.
- For large transfers, reuse a long-lived pipe per connection to avoid repeated pipe2() overhead.
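A sketch of that reuse pattern, assuming one pipe per connection (struct conn_pipe and both function names are mine): because the pipe outlives individual transfers, bytes stranded by EAGAIN stay queued for the next call instead of being discarded with a temporary pipe.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical per-connection pipe holder: created once, reused for every
// transfer on the connection, closed only when the connection dies.
struct conn_pipe { int r, w; };

int conn_pipe_init(struct conn_pipe *cp) {
    int p[2];
    if (pipe2(p, O_NONBLOCK | O_CLOEXEC) != 0) return -1;
    cp->r = p[0]; cp->w = p[1];
    return 0;
}

// Like splice_pump, but with a caller-owned pipe: no pipe2() per call, and
// leftovers on EAGAIN remain queued in the pipe for the next invocation.
ssize_t splice_pump_reuse(int in_fd, int out_fd, size_t limit,
                          const struct conn_pipe *cp) {
    size_t total = 0;
    while (total < limit) {
        ssize_t n = splice(in_fd, NULL, cp->w, NULL, limit - total,
                           SPLICE_F_MOVE | SPLICE_F_MORE | SPLICE_F_NONBLOCK);
        if (n == 0) break;                                  // EOF on input
        if (n < 0) { if (errno == EINTR) continue; break; } // EAGAIN or error
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(cp->r, NULL, out_fd, NULL, (size_t)left,
                               SPLICE_F_MOVE | SPLICE_F_MORE | SPLICE_F_NONBLOCK);
            if (m > 0) { left -= m; total += (size_t)m; continue; }
            if (m < 0 && errno == EINTR) continue;
            return (ssize_t)total;  // EAGAIN/error: leftovers stay queued in cp
        }
    }
    return (ssize_t)total;
}
```

On resume, drain cp->r first, then pull fresh bytes from in_fd.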
File→socket with splice: when not to use sendfile
sendfile() is the simplest file→socket path. Use splice when you need to interleave kernel-forwarded segments (e.g., mix a user header via vmsplice, then file bytes via splice) into a single socket stream without userland copies.
// Example: pump file→socket using splice via an internal pipe.
// splice() takes an explicit loff_t *off_in for the seekable side, so when
// p_off != NULL the file position of filefd is left untouched; no lseek()
// dance is needed.
ssize_t file_to_sock_splice(int sockfd, int filefd, off_t *p_off, size_t count) {
    int p[2];
    if (pipe2(p, O_NONBLOCK | O_CLOEXEC) != 0) return -1;
    loff_t off = p_off ? (loff_t)*p_off : 0;
    size_t moved = 0;
    while (moved < count) {
        ssize_t n = splice(filefd, p_off ? &off : NULL, p[1], NULL,
                           count - moved, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n == 0) break;                                  // EOF
        if (n < 0) { if (errno == EINTR) continue; break; } // EAGAIN or error
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, sockfd, NULL, (size_t)left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m < 0 && errno == EINTR) continue;
            if (m <= 0) { close(p[0]); close(p[1]); return -1; }
            left -= m; moved += (size_t)m;
        }
    }
    close(p[0]); close(p[1]);
    if (p_off) *p_off = (off_t)off;
    return (ssize_t)moved;
}
If you don’t need headers/trailers or special routing, prefer sendfile. Otherwise, splice gives you a general fd↔fd conveyor belt.
vmsplice(2): inject user buffers into a pipe
vmsplice() maps user memory into a pipe as pipe buffers, potentially avoiding an extra copy. The kernel may “pin” pages and reference them until consumed downstream. Whether this is truly copy-free is implementation-dependent, but it reliably reduces user↔kernel crossings and avoids intermediate userland staging buffers.
Use cases:
- Prepend small headers before a large file payload without building a big temporary buffer in user space.
- Stitch multiple user buffers into a pipe, then splice to the socket in large bursts.
Minimal helper:
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>
// Push iovecs into a pipe with vmsplice. Returns bytes queued (>=0) or -1.
ssize_t pipe_push_iov_vmsplice(int pipe_w, const struct iovec *iov, int iovcnt) {
    // SPLICE_F_GIFT: hint that the pages may be gifted to the pipe so the
    // kernel can avoid copying. The caller must not modify or reuse the
    // iovec buffers until downstream consumers have drained them.
    ssize_t n = vmsplice(pipe_w, iov, (unsigned)iovcnt, SPLICE_F_GIFT);
if (n < 0 && errno == EINTR) return 0; // retry policy left to caller
return n;
}
Headers + file body without user copies
Pattern: queue headers via vmsplice into the pipe, then splice the file body into the same pipe, then splice pipe→socket. The socket sees a single contiguous stream.
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/uio.h>
#include <unistd.h>
bool send_header_and_file(int sockfd, const void *h1, size_t h1len,
const void *h2, size_t h2len,
int filefd, size_t file_len) {
int p[2]; if (pipe2(p, O_NONBLOCK | O_CLOEXEC) != 0) return false;
int pr = p[0], pw = p[1];
struct iovec hdr[2] = { { .iov_base = (void*)h1, .iov_len = h1len },
{ .iov_base = (void*)h2, .iov_len = h2len } };
ssize_t h = pipe_push_iov_vmsplice(pw, hdr, 2);
if (h < 0) { close(pr); close(pw); return false; }
// Pull file bytes into pipe, possibly in chunks
size_t moved = 0; const size_t CHUNK = 1 << 20;
while (moved < file_len) {
size_t want = file_len - moved; if (want > CHUNK) want = CHUNK;
ssize_t n = splice(filefd, NULL, pw, NULL, want,
SPLICE_F_MORE | SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
if (n > 0) { moved += (size_t)n; continue; }
if (n == 0) break; // EOF
if (errno == EINTR) continue;
if (errno == EAGAIN) break; // caller should wait for POLLOUT on sock and retry later
close(pr); close(pw); return false;
}
// Drain pipe → socket
for (;;) {
ssize_t m = splice(pr, NULL, sockfd, NULL, 1 << 20,
SPLICE_F_MOVE | SPLICE_F_MORE | SPLICE_F_NONBLOCK);
if (m > 0) continue;
if (m == 0) break; // pipe drained
if (errno == EINTR) continue;
if (errno == EAGAIN) break; // socket full; resume later
close(pr); close(pw); return false;
}
close(pr); close(pw); return true;
}
Guidance:
- Keep headers relatively small. Even if the kernel copies them, the cost is negligible compared to the file body.
- Treat EAGAIN as a scheduling event; re-arm POLLOUT and resume draining the pipe into the socket.
- Reuse a per-connection pipe to amortize pipe2() cost and avoid descriptor churn.
tee(2): duplicate a pipe stream without copying
tee() clones the data in one pipe into another pipe by increasing references to the same pipe buffers. This lets you fan out a stream to multiple consumers (e.g., write once, send to two sockets) without copying payload bytes.
Sketch:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>
// Duplicate data from pipe A→B, then drain each pipe to its socket.
bool fanout_two_sockets(int in_fd, int sock1, int sock2) {
int pa[2], pb[2];
if (pipe2(pa, O_NONBLOCK|O_CLOEXEC) || pipe2(pb, O_NONBLOCK|O_CLOEXEC)) return false;
// Pull source fd → pa
for (;;) {
ssize_t n = splice(in_fd, NULL, pa[1], NULL, 1<<20, SPLICE_F_MORE|SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
if (n <= 0) break;
// Clone pa → pb
ssize_t k = tee(pa[0], pb[1], (size_t)n, SPLICE_F_NONBLOCK);
(void)k; // may be < n; loop to clone all available
// Drain each pipe to its socket
for (;;) {
ssize_t a = splice(pa[0], NULL, sock1, NULL, 1<<20, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
if (!(a > 0 || (a < 0 && errno == EINTR))) break;
}
for (;;) {
ssize_t b = splice(pb[0], NULL, sock2, NULL, 1<<20, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
if (!(b > 0 || (b < 0 && errno == EINTR))) break;
}
}
close(pa[0]); close(pa[1]); close(pb[0]); close(pb[1]);
return true;
}
Notes:
- tee() requires both ends to be pipes. It doesn’t advance the read head of the input pipe; you still need to drain it.
- Backpressure applies independently per consumer. Bound queue lengths and resume when sockets are writable.
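The “loop to clone all available” comment in the sketch above can be made concrete. A minimal helper (tee_all is my name) that clones exactly count bytes already queued in the source pipe, looping over short tee() returns:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Clone exactly 'count' bytes already queued in pipe 'src_r' into pipe
// 'dst_w', looping over short tee() returns. Does NOT consume from src_r;
// the caller still drains it. Returns 0 on success, -1 on error.
int tee_all(int src_r, int dst_w, size_t count) {
    size_t done = 0;
    while (done < count) {
        ssize_t k = tee(src_r, dst_w, count - done, 0);
        if (k > 0) { done += (size_t)k; continue; }
        if (k == 0) return -1;          // nothing left to clone
        if (errno == EINTR) continue;
        return -1;                      // EAGAIN (nonblocking pipes) or hard error
    }
    return 0;
}
```

With flags 0 this blocks if count exceeds what is queued, so only ask for bytes you know are in the pipe (e.g., the return value of the preceding splice).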
Coalescing and offloads: getting line-rate without tiny writes
Zero-copy primitives give you the big win by skipping user buffers, but you still need to respect how the TCP stack and NIC like to send bytes: in large, contiguous segments with checksums offloaded and minimal per-packet overhead.
Tools you can use:
- SPLICE_F_MORE on splice: hints more data will follow, encouraging coalescing.
- MSG_MORE on send/sendmsg: similar hint for non-splice paths.
- TCP_CORK socket option: hold back small segments until you uncork or fill a full MSS. Useful when sending headers followed by a large body.
- NIC offloads (checksum offload, TSO/GSO): the kernel and NIC cooperate to segment and checksum efficiently. You get these benefits automatically when you keep the path inside the kernel.
Corking pattern for mixed header + sendfile body:
#include <errno.h>
#include <netinet/tcp.h>
#include <stdbool.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
static void set_cork(int sockfd, int on) {
int v = on ? 1 : 0; (void)setsockopt(sockfd, IPPROTO_TCP, TCP_CORK, &v, sizeof v);
}
bool send_with_cork(int sockfd, const void *hdr, size_t hdr_len, int filefd, off_t *off, size_t len) {
set_cork(sockfd, 1);
    // Sketch: a production version must also loop on partial header sends.
    ssize_t h = send(sockfd, hdr, hdr_len, MSG_MORE);
    if (h < 0 || (size_t)h != hdr_len) { set_cork(sockfd, 0); return false; }
size_t moved = 0;
while (moved < len) {
ssize_t n = sendfile(sockfd, filefd, off, len - moved);
if (n > 0) { moved += (size_t)n; continue; }
if (n == 0) break;
if (errno == EINTR) continue;
if (errno == EAGAIN) { /* wait for writable then retry */ set_cork(sockfd, 0); return false; }
set_cork(sockfd, 0); return false;
}
set_cork(sockfd, 0);
return true;
}
Guidance:
- Use corking only around small headers leading into large bodies. Leaving a socket corked too long increases tail latency.
- Prefer SPLICE_F_MORE/MSG_MORE over corking for short-lived bursts; corking is a stronger control.
Pipe buffers: sizes, tuning, and memory pressure
Pipes are the hub for splice/vmsplice. Each pipe has a finite buffer capacity expressed as a number of pages. Under load, the defaults may limit throughput.
Practical knobs:
- fcntl(pipe_fd, F_SETPIPE_SZ, bytes): request a larger pipe capacity. The kernel rounds to a multiple of the page size and clamps to system limits.
- System limits: /proc/sys/fs/pipe-max-size and per-user pipe-buffer limits may cap your request.
Example setup:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
static void maybe_grow_pipe(int pfd) {
int want = 1 << 20; // 1 MiB
int got = fcntl(pfd, F_SETPIPE_SZ, want);
(void)got; // inspect in logs if you care
}
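To see what the kernel actually granted, the companion fcntl F_GETPIPE_SZ reads the capacity back. A small sketch (grow_pipe_reported is my name):

```c
#define _GNU_SOURCE
#include <fcntl.h>

// Try to grow a pipe toward 'want' bytes, then report the capacity the
// kernel actually granted (rounded to pages, clamped for unprivileged
// callers by /proc/sys/fs/pipe-max-size). Returns the capacity or -1.
int grow_pipe_reported(int pfd, int want) {
    if (fcntl(pfd, F_SETPIPE_SZ, want) < 0) {
        // Request too big or not permitted; fall through and report current size.
    }
    return fcntl(pfd, F_GETPIPE_SZ);
}
```

Logging the granted size is worth doing once at startup: silently running with a clamped pipe explains many “why is splice throughput capped?” mysteries.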
Guidance:
- A single large pipe per connection is usually sufficient. Measure before growing; larger pipes consume more kernel memory under backpressure.
- Respect backpressure. If splice to the socket returns EAGAIN, stop and re-arm writable notifications rather than spinning.
Page cache, readahead, and advice
For file→socket transfers, the page cache and readahead determine how smoothly pages arrive.
Useful calls:
- posix_fadvise(fd, off, len, POSIX_FADV_SEQUENTIAL): hint sequential access.
- posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED): nudge readahead to prefetch.
- posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED): drop from cache after use for cold files to reduce cache pollution.
- madvise(addr, len, MADV_SEQUENTIAL|MADV_WILLNEED|MADV_DONTNEED): similar hints for mmap paths.
Warmup sketch:
#include <fcntl.h>
static void warmup_sequential(int fd, off_t off, off_t len) {
(void)posix_fadvise(fd, off, len, POSIX_FADV_SEQUENTIAL);
(void)posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}
Guidance:
- Do not overuse WILLNEED; you can evict hot data for other tenants. Use it for cold, one-shot transfers.
- For very hot content, the cache will stay warm without special hints.
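The DONTNEED counterpart to the warmup helper, for one-shot cold transfers (a sketch; the function name is mine):

```c
#define _GNU_SOURCE
#include <fcntl.h>

// After a one-shot transfer of a cold file, drop its pages so they don't
// evict hotter data. This is only a hint; returns 0 on success (posix_fadvise
// returns an errno value rather than setting errno).
int cooldown_after_send(int fd, off_t off, off_t len) {
    return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}
```

Call it after the final sendfile/splice for the file completes; calling it while bytes are still queued just re-reads the pages later.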
Transformations: when zero-copy meets TLS and compression
Transformations break pure zero-copy because bytes must be changed. Options to keep performance high:
- Kernel TLS (KTLS): move encryption into the kernel so sendfile and friends can still operate on plaintext pages while the kernel encrypts on send. Availability depends on kernel and cipher suites.
- Pre-compress or pre-encrypt static assets at rest and serve them as-is when clients accept them.
- Accept a copy for the transformed portion (headers, small dynamic fragments) and keep the large body on the kernel path.
If KTLS isn’t available, a common compromise is: write small headers/trailers from user space, then sendfile the body. The overall CPU cost stays low.
Robust nonblocking state machine (headers + file)
Integrate zero-copy into a readiness loop with a small state machine that handles headers via iovecs and the body via sendfile.
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
enum xfer_state { X_HDR, X_BODY, X_DONE, X_ERR };
struct xfer {
enum xfer_state st;
struct iovec hdr[4];
int hdr_cnt;
int sockfd;
int filefd;
off_t off;
size_t body_left;
};
static void on_writable(struct xfer *x) {
if (x->st == X_HDR) {
while (x->hdr_cnt > 0) {
ssize_t n = writev(x->sockfd, x->hdr, x->hdr_cnt);
if (n > 0) {
size_t used = (size_t)n;
// Advance iovecs
int i = 0; while (i < x->hdr_cnt && used >= x->hdr[i].iov_len) { used -= x->hdr[i].iov_len; ++i; }
if (i > 0) { for (int k = 0; k + i < x->hdr_cnt; ++k) x->hdr[k] = x->hdr[k+i]; x->hdr_cnt -= i; }
if (used > 0 && x->hdr_cnt > 0) { x->hdr[0].iov_base = (char*)x->hdr[0].iov_base + used; x->hdr[0].iov_len -= used; }
continue;
}
if (n < 0 && errno == EINTR) continue;
if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return;
x->st = X_ERR; return;
}
x->st = X_BODY;
}
if (x->st == X_BODY) {
while (x->body_left > 0) {
ssize_t n = sendfile(x->sockfd, x->filefd, &x->off, x->body_left);
if (n > 0) { x->body_left -= (size_t)n; continue; }
if (n == 0) { x->st = X_DONE; return; }
if (n < 0 && errno == EINTR) continue;
if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return;
x->st = X_ERR; return;
}
x->st = X_DONE;
}
}
Notes:
- This integrates cleanly with an event loop: enable POLLOUT/EPOLLOUT while the state is not X_DONE/X_ERR; call on_writable on each wake.
- The header path uses writev because headers are typically small and benefit from a single syscall.
- The body path uses sendfile for zero-copy. If you must mix in user buffers, switch to the pipe hub pattern from earlier.
Compatibility and graceful fallback
Not every fd pair supports splice/sendfile. Detect and fall back to read/write with robust loops when needed.
Error patterns to expect:
- EINVAL: unsupported fd types or options (e.g., sendfile with an O_DIRECT file).
- ESPIPE: non-seekable file when an offset is required.
- EOPNOTSUPP/ENOSYS: older kernels or specific filesystems.
Fallback sketch:
#include <errno.h>
#include <unistd.h>
static ssize_t fallback_copy(int out_fd, int in_fd, size_t count) {
char buf[1<<16]; size_t total = 0;
while (total < count) {
size_t want = count - total; if (want > sizeof buf) want = sizeof buf;
ssize_t r = read(in_fd, buf, want);
if (r > 0) {
size_t off = 0; while (off < (size_t)r) {
ssize_t w = write(out_fd, buf + off, (size_t)r - off);
if (w > 0) off += (size_t)w; else if (w < 0 && errno == EINTR) continue; else return (ssize_t)total;
}
total += (size_t)r; continue;
}
if (r == 0) break;
if (r < 0 && errno == EINTR) continue;
break;
}
return (ssize_t)total;
}
Measuring copies and syscalls (trust, but verify)
Quick methods to validate improvements:
- Syscall counting: strace -f -e trace=sendfile,splice,vmsplice,write,read -c your_server.
- CPU accounting: perf stat -p <pid> during steady-state transfer.
- Throughput: wrk/ab/iperf or a custom client; compare GB/s and CPU%.
- Socket stats: enable TCP info logs or sample getsockopt(TCP_INFO) to observe retransmits and pacing.
Checklist for fair tests:
- Warm the page cache or explicitly test cold-cache performance; report which you measured.
- Bind to a single core to compare CPU cycles per byte; then scale cores.
- Fix NIC and link speed; disable competing traffic.
- Test small files and large files separately; zero-copy shines more as payload grows.
mmap writes, page faults, and consistency
mmap is also useful on the write side, but understand what’s actually happening:
- MAP_PRIVATE (copy-on-write): your writes modify private pages; they do not hit the file. Useful for read-mostly parsing or transformations without persisting.
- MAP_SHARED: your writes dirty the mapped pages; the kernel writes them back to the file later. Call msync(addr, len, MS_SYNC|MS_INVALIDATE) if you need durability/visibility guarantees at specific points.
- Page faults: the first access to a page triggers a fault that pulls the page in; on write with MAP_PRIVATE, the kernel may allocate a private copy (CoW). Those are copies, just deferred and often benefiting from larger granularities.
Implications:
- mmap helps when you need random access or parsing without staging copies. It does not remove the user→socket copy when you later send() those bytes.
- For write-heavy workloads, classic buffered I/O with writev and tuned batching may outperform mmap due to fewer minor faults and clearer backpressure points. Measure.
Minimal durability sync:
#include <sys/mman.h>
static bool flush_mapped(void *addr, size_t len) {
return msync(addr, len, MS_SYNC) == 0;
}
Checksums: NIC offloads vs userland
With kernel-managed send paths (sendfile, splice → socket), you benefit from NIC offloads automatically where supported:
- Checksum offload (CSUM): NIC computes TCP/UDP checksums.
- TSO/GSO: kernel presents large skbs; NIC segments them to MSS-size frames.
- LRO/GRO on receive: coalescing reduces per-packet overhead (for completeness).
If your protocol embeds its own application-level checksum over the payload, you still compute that in user space (or precompute/store alongside the file). Zero-copy doesn’t stop you from computing a header checksum over metadata while letting the body take the kernel path.
Verification tip (operational): inspect ethtool -k and ss -ti/nstat counters to confirm offloads are in effect; no code changes are needed for the zero-copy APIs discussed here.
Edge cases and correctness pitfalls
- Short transfers are the norm: every API here can return fewer bytes than requested. Loop until done.
- EAGAIN is not an error: it’s a scheduling signal. Re-arm POLLOUT/EPOLLOUT and resume.
- Peer close during transfer: splice/sendfile can return -1 with EPIPE or zero; handle EPOLLERR/EPOLLHUP by attempting to drain and then close.
- Offsets and overflow: when mixing headers + file lengths, validate off_t arithmetic; guard against integer wrap on 32-bit systems.
- Descriptor lifetimes: keep the file and pipe fds alive until all splices complete; avoid closing the source fd while the kernel still references pages.
- Direct I/O: avoid mixing O_DIRECT files with sendfile/splice. If you need direct I/O, use read/write with aligned buffers or io_uring and accept the copy.
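The offsets-and-overflow point can be made concrete with a checked-add guard (a sketch assuming a 64-bit off_t, i.e., a 64-bit Linux or -D_FILE_OFFSET_BITS=64; the helper name is mine):

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

// Checked offset + length addition: rejects negative inputs and wraparound
// before the values reach sendfile()/splice(). off_t is signed, so the test
// compares against the type's maximum. Assumes 64-bit off_t (LFS builds).
bool off_add_ok(off_t off, off_t len, off_t *out) {
    if (off < 0 || len < 0) return false;
    if (len > INT64_MAX - off) return false;   // would wrap
    *out = off + len;
    return true;
}
```

Run this once when a request arrives (header length + body length + starting offset), not inside the hot transfer loop.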
Observability: prove it in prod
Add lightweight metrics to catch regressions:
- Bytes and ops by path: counters for sendfile bytes, splice bytes, fallback bytes.
- EAGAIN rates and average resume latency per connection.
- Per-connection pipe sizes and occupancy (sampled), to detect chronic backpressure.
- Error tallies by errno for sendfile/splice/vmsplice.
Sketch:
struct zc_metrics {
unsigned long long sf_bytes, sp_bytes, vmsp_bytes, fb_bytes;
unsigned long long sf_calls, sp_calls, vmsp_calls, fb_calls;
unsigned long long eagain_w, eagain_r;
};
static inline void add64(unsigned long long *c, unsigned long long v) {
__atomic_add_fetch(c, v, __ATOMIC_RELAXED);
}
Production checklist (printable)
- Prefer sendfile for file→socket when you don’t need to transform the body.
- Use splice + pipe when you must stitch sources (headers via vmsplice + file body) without user copies.
- Treat EAGAIN as backpressure; integrate with an event loop; never spin.
- Use SPLICE_F_MORE/MSG_MORE or transient TCP_CORK to coalesce small pieces into full segments.
- Reuse a pipe per connection and tune pipe size only after measuring.
- Warm cold files with posix_fadvise (SEQUENTIAL/WILLNEED) sparingly; consider DONTNEED after send.
- Keep transformation work (TLS/compression) either in-kernel (KTLS) or limited to small fragments.
- Log bytes moved and error codes per path; alert on fallback growth.
Closing thoughts
Zero-copy on Linux isn’t a single switch—it’s a set of disciplined patterns. The winning formula is simple:
- Keep big payloads on kernel-managed paths (sendfile, splice).
- Limit user space to control, metadata, and small headers (writev, vmsplice).
- Respect backpressure, coalesce wisely, and measure everything.
Do that, and you’ll trade buffer shuffles for throughput, swap midnight copy-tuning for clear metrics, and make your CPUs do more interesting work than moving bytes from A to B.