You opened a file, called read() a few times, and bytes dutifully appeared. Magic? Not quite. The kernel’s page cache is doing the heavy lifting: turning slow storage into something that often feels like RAM, smoothing read bursts with readahead, and hiding writes behind lazy writeback.
This post is a practical tour of how the page cache interacts with your C code, how mmap changes the picture, and the rare cases where bypassing the cache is the right choice. We’ll keep it production-first: fewer myths, more behavior you can rely on.
The page cache in one minute
The page cache is a big, unified cache of file data in memory. It sits beneath the VFS and above block devices. Conceptually, each cached page covers one or more on-disk blocks; the kernel keeps those pages in memory as clean (matching disk) or dirty (newer than disk).
What this buys you:
- Hot files become RAM-speed after the first miss.
- Reads can be served without disk I/O when pages are cached.
- Writes can complete quickly by marking pages dirty and deferring the slow flush to background writeback.
Core ideas to anchor on:
- The cache is indexed by (inode, offset) → page. Same file, same bytes, shared by all processes.
- Regular read()/write() and mmap() both operate on the same underlying page cache; they’re two faces of the same mechanism.
- Dirty data isn’t durable until it’s written back. fsync()/fdatasync() are the contracts to force durability.
What actually happens on read()
When you call read(fd, buf, n) on a regular file:
- The kernel looks up the file’s cached pages covering your file offset range.
- If a page is present, it copies directly from the page cache into your buffer and advances the file offset.
- If a page is missing (cache miss), the kernel issues I/O to fill that page, potentially alongside neighbors via readahead, then copies to your buffer.
If you read sequentially, the kernel’s readahead heuristics will detect the pattern and fetch future pages in the background so your next read hits memory. If you jump around randomly, readahead backs off; most reads will need I/O.
Key properties:
- A successful read() returning r > 0 means r bytes were copied from the cache (after the kernel fetched them if needed). It does not promise that your future reads will be cached.
- Short reads are legal: you asked for n; the kernel gave you 0 < r <= n based on what was available without further blocking (or on hitting EOF). Loop until done.
A minimal, robust loop (from first principles):
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

ssize_t read_exact(int fd, void *buf, size_t len) {
    uint8_t *p = (uint8_t *)buf;
    size_t have = 0;
    while (have < len) {
        ssize_t r = read(fd, p + have, len - have);
        if (r > 0) { have += (size_t)r; continue; }
        if (r == 0) return (ssize_t)have;  // EOF: caller sees a short count
        if (errno == EINTR) continue;      // interrupted; try again
        return -1;                         // error (EIO, EINVAL, ...)
    }
    return (ssize_t)have;
}
Readahead: the quiet accelerator
Readahead is speculation done right. When the kernel notices you marching forward in a file, it starts fetching the next pages asynchronously. Your synchronous read() then hits memory instead of waiting on the disk.
Practical implications you can engineer around:
- Favor forward, contiguous reads when possible (parse headers first, then bodies in order). Random seeks crush readahead benefits.
- If you must sample sparsely, collect offsets and sort them to batch nearby reads together, turning random access into clustered access (see the sketch after this list).
- File systems and kernels tune readahead windows dynamically. Your best lever is access pattern, not magic flags.
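A minimal sketch of that sort-then-read idea, assuming fixed-size records and a caller-provided offset array (read_clustered and its layout are illustrative, not a library API). Note that results land in sorted-offset order, so the caller must map them back:
#include <stdlib.h>
#include <unistd.h>

static int cmp_off(const void *a, const void *b) {
    off_t x = *(const off_t *)a, y = *(const off_t *)b;
    return (x > y) - (x < y);
}

// Sort offsets so nearby reads batch together and readahead gets a chance.
int read_clustered(int fd, off_t *offs, size_t n, void *out, size_t reclen) {
    unsigned char *dst = out;
    qsort(offs, n, sizeof *offs, cmp_off);
    for (size_t i = 0; i < n; ++i) {
        ssize_t r = pread(fd, dst + i * reclen, reclen, offs[i]);
        if (r != (ssize_t)reclen) return -1; // short read or error
    }
    return 0;
}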
What actually happens on write()
write(fd, buf, n) on a regular file:
- Copies n bytes from your buffer into page cache pages covering the file range.
- Marks those pages dirty and updates metadata (size, mtime) in memory.
- Returns once the copy is done, often long before the disk has seen the data.
Later, background writeback threads flush dirty pages to storage based on thresholds and policies (dirty ratios, flusher duty cycles). Two calls give you durability control:
- fsync(fd): flushes file data and metadata to stable storage.
- fdatasync(fd): flushes data and the minimal metadata needed to access it; may skip timestamps.
The rule of thumb: no fsync/fdatasync, no durability guarantees. If you’re designing a WAL or database, you must arrange for explicit flushes at the right points (see the sketch below).
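A minimal durable-append sketch, assuming a single writer, an fd opened with O_APPEND, and that one fdatasync per record is acceptable (real WALs usually batch several records per flush):
#include <errno.h>
#include <unistd.h>

// Append one record and make it durable before returning.
int append_durable(int fd, const void *rec, size_t len) {
    const unsigned char *p = rec;
    size_t done = 0;
    while (done < len) {
        ssize_t w = write(fd, p + done, len - done);
        if (w > 0) { done += (size_t)w; continue; }
        if (w < 0 && errno == EINTR) continue;
        return -1;
    }
    return fdatasync(fd); // the record is not durable until this succeeds
}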
mmap: the other path to the same cache
mmap maps a file (or anonymous memory) into your address space. You then read and write it with ordinary loads and stores. Under the hood:
- The first time your code touches a not-yet-present page, the CPU raises a page fault.
- The kernel resolves it by bringing the corresponding file page into the page cache (if it isn’t already) and mapping it into your process’s page tables.
- Subsequent accesses are ordinary memory ops; the kernel is out of the way until the next fault or eviction.
Writes through a shared file mapping (MAP_SHARED) dirty the same page cache pages that write() would dirty. Later, writeback flushes them, or you can call msync() to push ranges proactively. Private mappings (MAP_PRIVATE) use copy-on-write: your writes go to anonymous private pages, not the file, unless you explicitly write them back via pwrite() or similar.
Minimal scanning example with mmap:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdint.h>

uint64_t sum_bytes(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 0; }
    size_t len = (size_t)st.st_size;
    const unsigned char *p = (const unsigned char *)mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 0; }
    uint64_t sum = 0;
    for (size_t i = 0; i < len; ++i) sum += p[i]; // page faults pull pages via the page cache
    munmap((void *)p, len);
    close(fd);
    return sum;
}
Why prefer mmap here?
- The kernel can drive readahead effectively: your sequential touches are literal virtual-memory accesses.
- You avoid a user-kernel copy per read(); the data is already in the page cache, mapped into your address space.
Why not always mmap?
- Error handling becomes page-fault handling. I/O errors surface asynchronously on access instead of as read() return codes.
- You must manage pointer lifetimes and munmap() carefully (no stray pointers past unmap).
- For write-heavy workloads that need clear durability points, write() + fsync() often makes the protocol clearer.
Semantics and coherence: mixing APIs safely
Because read()/write() and mmap(MAP_SHARED) touch the same page cache, they’re coherent with each other through that cache. A few rules to avoid surprises:
- If one process writes via write() and another reads via a shared mapping, the reader may see the change once the writer’s data is in the page cache (often immediately). There’s still no cross-process synchronization; use your own protocols for ordering and visibility.
- To push your own writes through a shared mapping to the file system, use msync(addr, len, MS_SYNC). To make write()-written data visible to your own mapping, no extra calls are needed beyond memory ordering in your program, but beware CPU caches on some architectures when using device DAX or non-coherent mappings.
- Private mappings (MAP_PRIVATE) are intentionally not coherent with file writes after the mapping is established; copy-on-write decouples them.
We’ll go deeper into flush semantics, msync modes, and pitfalls around partial-page writes next, but the key intuition stands: the page cache is the common ground.
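A single-process sanity check of that coherence, as a minimal sketch; it assumes an existing file at least one byte long, and elides error handling:
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// write() and a MAP_SHARED mapping observe the same page cache page.
void coherence_demo(const char *path) {
    int fd = open(path, O_RDWR);
    unsigned char *p = mmap(NULL, 1, PROT_READ, MAP_SHARED, fd, 0);
    unsigned char b = 'X';
    pwrite(fd, &b, 1, 0); // write through the syscall path...
    assert(p[0] == 'X');  // ...and observe it through the mapping
    munmap(p, 1);
    close(fd);
}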
mmap vs read/write: choosing the right tool
Both APIs can be fast. The choice comes down to ergonomics, error semantics, and access patterns.
- Prefer mmap when:
  - You scan large files sequentially or with predictable locality.
  - You benefit from zero-copy access (parsers, searchers) and can structure code to touch bytes once.
  - You want the kernel to drive readahead from your memory touches.
- Prefer read/pread when:
  - You need explicit, synchronous error returns per I/O call.
  - You operate on streaming interfaces or sockets (not mappable), or on files with simple buffered I/O patterns.
  - You want tight control over durability and write ordering with fsync checkpoints.
Small rule of thumb: mmap is phenomenal for read-mostly, high-locality workloads. read/pread shines when you need explicit control, portability of error semantics, or you’re writing a WAL/DB and must pin flush points.
Measuring what matters (preview)
Before you refactor a hot path, measure real workloads:
- Page-fault rate and readahead efficacy (look at perf page-fault samples, iostat, and filesystem metrics).
- Copy cost for read() into user buffers vs mmap’s zero-copy access.
- Tail latencies under cache misses: SSDs make misses cheap-ish but not free.
We’ll dive into madvise/posix_fadvise, O_DIRECT trade-offs, alignment constraints, and fsync semantics next. For now, keep the core mental model: there’s one shared page cache; your API choice decides how you traverse it and how you surface errors and durability.
Forcing the issue: msync, fsync, and what they really promise
When you write through a shared mapping (MAP_SHARED), your stores dirty page cache pages. To push them to storage you have two knobs:
- msync(addr, len, MS_SYNC): write back dirty pages in the given range and wait. MS_ASYNC schedules writeback but returns early. MS_INVALIDATE asks the kernel to drop cached state and reload from disk on next access (for multi-writer scenarios).
- fsync(fd)/fdatasync(fd): flush file data (and metadata, for fsync) to stable storage.
Rules of thumb that avoid 3 a.m. surprises:
- msync(MS_SYNC) ensures modified pages are written to the file. For full durability including metadata (e.g., size growth), pair it with fdatasync(fd)/fsync(fd) on the file descriptor.
- If you shrink or grow the file via ftruncate, call fsync after ftruncate to persist the size change, then msync/fdatasync for data.
- Avoid MS_INVALIDATE unless you truly want to throw away private changes and force a refetch; it can be surprising and is not a replacement for a coherence protocol.
Minimal write-through-mapping with explicit flush:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string.h>

int write_header(const char *path, const void *hdr, size_t len) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    // Ensure the file is large enough to back the mapping
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    memcpy(p, hdr, len);                 // dirties page cache pages
    if (msync(p, len, MS_SYNC) != 0) { /* handle EIO */ }
    (void)fdatasync(fd);                 // persist size + metadata changes
    munmap(p, len);
    close(fd);
    return 0;
}
Tell the kernel your intent: madvise and posix_fadvise
You can often get big wins by giving the kernel hints about how you will access the file or mapping.
Memory mapping hints (madvise):
- MADV_SEQUENTIAL: you’ll walk forward once; the kernel can be more aggressive with readahead and drop-behind.
- MADV_RANDOM: avoid aggressive readahead.
- MADV_WILLNEED: prefetch pages in the range soon.
- MADV_DONTNEED: you’re done with this range; drop clean pages and discard private COW pages.
#include <sys/mman.h>

void tune_mapping(void *addr, size_t len) {
    (void)madvise(addr, len, MADV_SEQUENTIAL); // declare a forward scan
    // Before a tight scan of a range: prefetch it
    (void)madvise(addr, len, MADV_WILLNEED);
    // After processing a window: release it to reduce cache pressure
    (void)madvise(addr, len, MADV_DONTNEED);
}
File descriptor hints (posix_fadvise):
- POSIX_FADV_SEQUENTIAL vs POSIX_FADV_RANDOM mirror the madvise intents for non-mapped I/O.
- POSIX_FADV_WILLNEED: initiate readahead.
- POSIX_FADV_DONTNEED: evict cached (clean) pages for the range; good for one-pass readers.
- POSIX_FADV_NOREUSE: data will be accessed once; implementations vary.
#include <fcntl.h>

void tune_fd(int fd, off_t off, off_t len) {
    (void)posix_fadvise(fd, off, len, POSIX_FADV_SEQUENTIAL);
    (void)posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}
These are hints, not contracts. Measure under your kernel/filesystem; some combos respond more visibly than others.
mmap pitfalls you should design around
mmap is powerful and sharp. Common failure modes:
- Truncation under your feet: if another process (or you) shrinks the file while you hold a mapping, touching addresses past the new EOF can raise SIGBUS. Harden by coordinating size changes or by mapping only stable regions.
- Growth isn’t automatic: if the file grows, your mapping doesn’t. You must munmap and mmap a larger region to see the new bytes.
- Partial-page writes: the last page of a file may include unrelated bytes you didn’t intend to change. When writing via mappings, update the exact subrange or pad your format to page boundaries.
- Error propagation: I/O errors surface on access (at fault time) rather than as a syscall return like read()’s. You’ll often see SIGBUS, or EIO from msync. Audit error-handling paths.
Minimal guard against SIGBUS on trailing bytes:
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_safe_tail(int fd, size_t *out_len) {
    struct stat st;
    if (fstat(fd, &st) != 0) return MAP_FAILED;
    size_t len = (size_t)st.st_size;
    if (len == 0) { *out_len = 0; return MAP_FAILED; } // nothing to map
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return MAP_FAILED;
    *out_len = len;
    return p;
}
Coordinate file size changes across writers/readers; don’t rely on luck.
Dirty throttling and background writeback
Dirty pages can’t grow without bound. Linux tracks dirty ratios/bytes and will throttle writers when thresholds are exceeded:
- Buffered write() may block once the dirty pool is too large, even though the page cache copy itself is fast.
- msync(MS_SYNC) obviously blocks until pages are written.
- Mapped writes can stall when flusher threads are saturated or the device is slow.
Symptoms you’ll observe:
- Latency spikes on writes after sustained throughput.
- CPU idle with I/O queues full; iostat shows high utilization and low KB/s when throttled by the device.
Operational guidance:
- Batch and align writes; prefer sequential access to help flusher throughput.
- Consider write-combining at the application layer to reduce the number of dirty pages (see the batching sketch after this list).
- Budget end-to-end: add deadlines and surface backpressure instead of letting dirty growth surprise you.
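A minimal batched-flush sketch along those lines; the 8 MiB threshold is an assumption to tune per device, and real systems usually flush on a timer as well:
#include <errno.h>
#include <unistd.h>

enum { FLUSH_BYTES = 8 * 1024 * 1024 }; // assumption: tune per device

// Stream bytes while flushing every FLUSH_BYTES, so the dirty pool stays
// bounded and backpressure shows up as explicit flush time, not a stall.
int write_batched(int fd, const unsigned char *p, size_t len) {
    size_t done = 0, since_flush = 0;
    while (done < len) {
        ssize_t w = write(fd, p + done, len - done);
        if (w < 0) { if (errno == EINTR) continue; return -1; }
        done += (size_t)w;
        since_flush += (size_t)w;
        if (since_flush >= FLUSH_BYTES) {
            if (fdatasync(fd) != 0) return -1;
            since_flush = 0;
        }
    }
    return fdatasync(fd);
}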
Bypassing the cache: when (and how) to use direct I/O
Sometimes the page cache gets in the way—think large one-off scans that evict hot working sets, or databases that maintain their own cache and want to avoid double-caching. Enter direct I/O.
On Linux, O_DIRECT asks the kernel to transfer data between your buffers and the block device without filling the page cache. Caveats:
- Alignment matters: file offsets, buffer addresses, and lengths generally must be multiples of the logical block size (often 512 or 4096 bytes). Violations yield EINVAL.
- Semantics vary by filesystem; some still touch the cache for metadata or small tail regions.
- You lose readahead and caching benefits; your code must batch and prefetch explicitly.
Minimal pattern with alignment helpers:
#define _GNU_SOURCE   // for O_DIRECT on Linux
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

int read_direct_4k(const char *path, void **out_buf, size_t *out_len) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) return -1;
    size_t len = 4096;            // multiple of the logical block size
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, len) != 0) { close(fd); return -1; }
    ssize_t r = pread(fd, buf, len, 0);
    if (r < 0) {
        // EINVAL often means misalignment: verify the filesystem/block size
        free(buf); close(fd); return -1;
    }
    *out_buf = buf;
    *out_len = (size_t)r;
    close(fd);
    return 0;
}
Platform note:
- macOS doesn’t support O_DIRECT; use fcntl(fd, F_NOCACHE, 1) to hint “don’t populate the cache.” Semantics differ, and some caching may still occur (see the sketch below).
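A minimal portable wrapper for that hint; a sketch, assuming you want a no-op on non-Apple platforms:
#include <fcntl.h>

// On macOS, ask the kernel not to cache pages for this descriptor.
// Advisory: semantics differ from O_DIRECT and some caching may remain.
int set_nocache(int fd) {
#if defined(__APPLE__)
    return fcntl(fd, F_NOCACHE, 1);
#else
    (void)fd;
    return 0; // no-op elsewhere
#endif
}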
Use cases that actually benefit:
- DB engines with their own buffer cache and eviction policy.
- Very large streaming reads that you don’t want polluting the cache (e.g., cold archival scans). For those, POSIX_FADV_DONTNEED on a normal read path can also be effective and simpler.
Buffered + direct I/O together: proceed with caution
Mixing buffered I/O (read/write) and direct I/O (O_DIRECT) on the same file region can produce surprising results:
- Reads through the page cache may return stale data if you wrote via direct I/O and the kernel didn’t invalidate cached pages for that range.
- Writes through the page cache might later overwrite blocks you wrote via direct I/O, depending on flush timing.
Conservative guidance that avoids footguns:
- Do not mix buffered and direct I/O concurrently on the same file and byte ranges. If you must, separate by ranges or time (and use POSIX_FADV_DONTNEED/msync(MS_INVALIDATE) carefully to purge the cache), and test on your target filesystem (see the purge sketch below).
- Prefer a single mode per file in a given process: either buffered or direct.
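A minimal purge helper for that time/range separation, a sketch assuming the buffered side is flushed first (POSIX_FADV_DONTNEED only evicts clean pages, hence the fdatasync):
#include <fcntl.h>
#include <unistd.h>

// Flush, then drop cached pages for a range before switching it to
// direct I/O. Advisory: verify behavior on your target filesystem.
int purge_range(int fd, off_t off, off_t len) {
    if (fdatasync(fd) != 0) return -1;
    return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}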
Streaming patterns that keep caches healthy
One-pass scans are great candidates for “drop-behind” so you don’t evict hot working sets.
Windowed reader with drop-behind:
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

enum { WIN = 8 * 1024 * 1024, CHUNK = 256 * 1024 };

int scan_drop_behind(int fd) {
    uint8_t buf[CHUNK];
    off_t off = 0, dropped = 0;
    for (;;) {
        ssize_t r = pread(fd, buf, sizeof buf, off);
        if (r > 0) {
            // process buf[0..r)
            off += r;
            if (off - dropped >= WIN) {
                (void)posix_fadvise(fd, dropped, off - dropped, POSIX_FADV_DONTNEED);
                dropped = off;
            }
            continue;
        }
        if (r == 0) break;            // EOF
        if (errno == EINTR) continue;
        return -1;                    // error
    }
    // Final drop-behind for the tail of the window
    (void)posix_fadvise(fd, dropped, off - dropped, POSIX_FADV_DONTNEED);
    return 0;
}
Notes:
- pread avoids races on the shared file offset if multiple threads scan different regions.
- Use a reasonable window (8–64 MiB) to avoid over-aggressive eviction while still protecting hot caches.
Equivalent drop-behind with mmap uses madvise(MADV_DONTNEED) after finishing each window, as in the sketch below.
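A minimal mmap-side counterpart, assuming the mapping base and window size are page-aligned (madvise requires a page-aligned address) and that the scan is read-only:
#include <stdint.h>
#include <sys/mman.h>

enum { MWIN = 8 * 1024 * 1024 }; // assumption: a multiple of the page size

// Sum a mapped file while releasing each finished window.
uint64_t scan_mmap_drop_behind(const unsigned char *p, size_t len) {
    uint64_t sum = 0;
    for (size_t w = 0; w < len; w += MWIN) {
        size_t n = (len - w < MWIN) ? (len - w) : MWIN;
        for (size_t i = 0; i < n; ++i) sum += p[w + i];
        (void)madvise((void *)(p + w), n, MADV_DONTNEED); // drop the finished window
    }
    return sum;
}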
Concurrency and correctness with offsets and locks
For multi-threaded I/O on the same file:
- Use pread/pwrite to avoid contending on the per-FD file offset.
- If multiple writers may touch overlapping regions, coordinate with record-level locks (application-level) or advisory file locks (fcntl(F_SETLK)), understanding that they serialize by region but don’t enforce cache-ordering semantics.
Minimal advisory write lock:
#include <fcntl.h>

int lock_region(int fd, off_t off, off_t len) {
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = off, .l_len = len };
    return fcntl(fd, F_SETLKW, &fl); // blocks until acquired
}

int unlock_region(int fd, off_t off, off_t len) {
    struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET, .l_start = off, .l_len = len };
    return fcntl(fd, F_SETLK, &fl);
}
Use file locks judiciously; many high-throughput designs prefer append-only logs with record headers (length + checksum) and idempotent recovery over fine-grained locking.
Appends and file tailing: choose APIs that fit the semantics
Tailing a growing file doesn’t fit mmap ergonomics well because the mapping length is fixed. Prefer buffered I/O:
- Open the file O_RDONLY and poll for readability (kqueue/epoll) or use filesystem-specific notifications (inotify/EVFILT_VNODE).
- read() until EAGAIN/EOF, then wait again.
Skeleton with poll:
#include <errno.h>
#include <poll.h>
#include <unistd.h>

void tail(int fd) {
    char buf[4096];
    struct pollfd p = { .fd = fd, .events = POLLIN };
    for (;;) {
        int pr = poll(&p, 1, -1);
        if (pr <= 0) continue;
        for (;;) {
            ssize_t r = read(fd, buf, sizeof buf);
            if (r > 0) { /* process */ continue; }
            if (r == 0) break;   // EOF for now (the file may grow later)
            if (errno == EINTR) continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK) break;
            return;              // error
        }
    }
}
If you must use mmap for structured parsing on a growing file, periodically fstat to detect growth, munmap/mmap a larger range, and be prepared for races with writers (a sketch of the remap step follows).
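A minimal remap-on-growth helper, assuming a single reader that tolerates racing writers (remap_if_grown is an illustrative name, not a system API):
#include <sys/mman.h>
#include <sys/stat.h>

// Remap when the file has grown; returns the (possibly new) base,
// the old base when unchanged, or MAP_FAILED on error.
void *remap_if_grown(int fd, void *old, size_t *len) {
    struct stat st;
    if (fstat(fd, &st) != 0) return MAP_FAILED;
    size_t newlen = (size_t)st.st_size;
    if (newlen <= *len) return old;       // no growth: keep the mapping
    if (old != NULL) munmap(old, *len);
    void *p = mmap(NULL, newlen, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { *len = 0; return MAP_FAILED; }
    *len = newlen;
    return p;
}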
Hints for huge pages, NUMA, and advanced advice (measure!)
- Transparent Huge Pages (THP) can reduce TLB pressure for anonymous memory. File-backed THP support exists on some kernels/filesystems but is nuanced. For file scans, measure before relying on THP effects.
- MADV_HUGEPAGE/MADV_NOHUGEPAGE can hint THP behavior for eligible mappings. Effects vary by kernel (see the sketch after this list).
- NUMA locality matters for mmap: the first-touch policy typically places pages on the node of the CPU that faults them. Consider pinning threads (e.g., pthread_setaffinity_np) during the initial scan to keep data local, or use mbind/numa_alloc_onnode-style APIs where appropriate.
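A minimal THP hint as a sketch; whether it helps, or applies to file-backed mappings at all, depends on your kernel, so measure:
#include <sys/mman.h>

// Ask for transparent huge pages on this mapping. Advisory only; the
// kernel may ignore it, and support varies by kernel and backing store.
int hint_hugepages(void *addr, size_t len) {
#ifdef MADV_HUGEPAGE
    return madvise(addr, len, MADV_HUGEPAGE);
#else
    (void)addr; (void)len;
    return 0;
#endif
}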
A simple, fair benchmark methodology
Comparing mmap vs read() requires care:
- Use the same kernel, filesystem, device, and file.
- Warm vs cold cache changes everything. Test both: once with the page cache warmed (run once, then again), and once cold (drop caches if you can, outside production): echo 3 > /proc/sys/vm/drop_caches (root only, Linux).
- Control CPU frequency scaling and background load; pin threads.
Sketch harness to time a scan with each API:
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

double time_read_scan(int fd);           // implement with a read loop
double time_mmap_scan(const char *path); // implement with an mmap loop

int main(int argc, char **argv) {
    (void)argc; (void)argv;
    // open the file; optionally prewarm by reading it once
    // run each variant multiple times; report min/median
    return 0;
}
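One possible body for the read-side timer (reusing now_ns above); a sketch assuming a 1 MiB chunk and a checksum so the compiler can’t elide the scan:
#include <stdlib.h>
#include <unistd.h>

double time_read_scan(int fd) {
    enum { CHUNK = 1 << 20 };
    unsigned char *buf = malloc(CHUNK);
    if (buf == NULL) return -1.0;
    volatile uint64_t sink = 0;  // defeat dead-code elimination
    uint64_t t0 = now_ns();
    off_t off = 0;
    for (;;) {
        ssize_t r = pread(fd, buf, CHUNK, off);
        if (r <= 0) break;       // EOF or error ends the scan
        for (ssize_t i = 0; i < r; ++i) sink += buf[i];
        off += r;
    }
    uint64_t t1 = now_ns();
    free(buf);
    (void)sink;
    return (double)(t1 - t0) / 1e9; // seconds
}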
Report both throughput (MiB/s) and tail latency for operations that matter (e.g., time to first byte, time to last byte). The right choice is workload-specific—measure with realistic data and access patterns.
Prealloc, sparsity, and truncation: shaping the file
Preallocating space avoids late failures and fragmentation, and helps writeback schedule large sequential I/O.
- posix_fallocate(fd, off, len): ensures space is allocated for the given range; subsequent writes won’t fail with ENOSPC for lack of blocks in that range. On some filesystems this creates unwritten extents that become initialized on first write.
- Sparse files: writing near end-of-file without touching the middle creates holes (reading holes yields zeros). To intentionally punch holes and reclaim space, use fallocate with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE (Linux).
#define _GNU_SOURCE   // for fallocate() and FALLOC_FL_* on Linux
#include <fcntl.h>
#include <errno.h>

int ensure_space(int fd, off_t off, off_t len) {
    int r = posix_fallocate(fd, off, len);
    return r == 0 ? 0 : -1; // r is an errno value on failure
}

#ifdef __linux__
int punch_hole(int fd, off_t off, off_t len) {
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len);
}
#endif
Prealloc pairs well with append-heavy logs and large file builders; it reduces extent churn and surprises during peak traffic.
Durability knobs and platform nuances
Durability is where details matter most.
- fdatasync vs fsync: fdatasync flushes file data and the minimal metadata required to access it; fsync also flushes metadata like timestamps. Many databases prefer fdatasync on hot paths.
- Open flags: O_DSYNC (data-sync) and O_SYNC (data + metadata) make each write behave as if followed by a sync. That simplifies logic and often reduces throughput; measure.
- Filesystem behavior: ext4’s default data=ordered journals metadata and writes data blocks before committing metadata; XFS and btrfs have different policies. Always test on the actual FS you deploy.
- syncfs(fd): flushes all dirty data on the filesystem containing fd. Useful for controlled batch flushes; beware the global impact.
- macOS: F_FULLFSYNC via fcntl(fd, F_FULLFSYNC) forces the drive to flush its caches to non-volatile media (stronger than fsync). It’s slower; use it only where required.
Minimal cross-platform flush helper:
#include <unistd.h>
#include <fcntl.h>

int flush_durable(int fd) {
#if defined(__APPLE__)
    // Try the full flush; fall back to fsync on volumes that reject it
    if (fcntl(fd, F_FULLFSYNC) == 0) return 0;
#endif
    return fsync(fd);
}
Decision guide you can apply today
- Reads are mostly sequential and large; you want simplicity: use buffered read/pread. Add POSIX_FADV_SEQUENTIAL and optional drop-behind.
- Reads are large and one-off, and you must protect hot caches: consider buffered reads with POSIX_FADV_DONTNEED after each window, or O_DIRECT if alignment constraints are acceptable and your FS supports it well.
- Reads are many and random across a hot set: use mmap or buffered I/O; measure both. Favor madvise(MADV_RANDOM) for mmap.
- Writes are append-only with periodic durability points: buffered write + fdatasync at checkpoints; preallocate space up front.
- Writes modify data in place and must be readable immediately by other processes: shared mmap + msync(MS_SYNC) on the touched ranges; consider buffered I/O if error handling via return codes is preferable.
- Multiple threads read/write different regions: use pread/pwrite to avoid file-offset races; avoid mixing buffered with direct I/O on the same ranges.
Troubleshooting and observability
Tools that reveal what’s happening:
- Syscall tracing: strace -T -e trace=read,write,fsync,fdatasync,msync,mmap,munmap (Linux) or dtruss (macOS). Look for short reads/writes, unexpected EINTR, and sync timing.
- Page faults: perf stat -e page-faults <cmd> and perf record -e page-faults; high minor-fault counts are normal with mmap scans; major faults indicate disk I/O.
- I/O pressure: iostat -xz 1, pidstat -d 1, iotop to watch queues and throughput.
- Block layer: blktrace/btt (Linux) for deep dives; advanced users can employ bpftrace to sample VFS/page cache events.
Quick bpftrace sampler (Linux):
bpftrace -e 'tracepoint:syscalls:sys_enter_read { @r[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@r[tid]/ { printf("read pid=%d ret=%d in %d us\n", pid, args->ret, (nsecs - @r[tid]) / 1000); delete(@r[tid]); }'
Interpretation pointers:
- If read() is fast but first access via mmap is slow, you’re seeing page faults fetch data; that’s expected. Warm the cache or switch APIs.
- If throughput collapses during writes, suspect dirty throttling or device limits; batch writes and monitor dirty ratios.
- If durability calls dominate tail latency, consider grouping with syncfs or moving fsyncs off the hot path (WAL + group-commit style).
Production checklist (printable)
- Choose one I/O mode per file/region; avoid mixing buffered and direct.
- Use pread/pwrite for concurrency; avoid shared-offset races.
- Exploit readahead: keep scans sequential or cluster seeks.
- Drop behind on one-pass scans (POSIX_FADV_DONTNEED or MADV_DONTNEED).
- Preallocate with posix_fallocate; manage sparsity intentionally.
- Budget durability with fdatasync/fsync/msync; know your filesystem’s guarantees.
- Add deadlines and backpressure; expect dirty throttling.
- Measure hot paths with warmed and cold caches; report throughput and tails.
Closing thoughts
There isn’t a single “fastest” I/O API—there’s a shared page cache and a set of contracts. Your job is to pick the API that best expresses your error handling, durability points, and access pattern, then give the kernel clear hints so it can help you. When you keep the mental model simple—one cache, two ways to traverse it, and rare cases for bypass—you’ll ship I/O that is both fast and predictably correct.