Page Cache, mmap, and When to Bypass It

Published: December 3, 2014

Updated: January 11, 2025

You opened a file, called read() a few times, and bytes dutifully appeared. Magic? Not quite. The kernel’s page cache is doing heavy lifting: turning slow storage into something that often feels like RAM, smoothing read bursts with readahead, and hiding writes behind lazy writeback.

This post is a practical tour of how the page cache interacts with your C code, how mmap changes the picture, and the rare cases where bypassing the cache is the right choice. We’ll keep it production-first: fewer myths, more behavior you can rely on.

The page cache in one minute

The page cache is a big, unified cache of file data in memory. It sits beneath the VFS and above block devices. Conceptually, each cached page covers one or more on-disk blocks; the kernel keeps those pages in memory as clean (matching disk) or dirty (newer than disk).

What this buys you:

  • Hot files become RAM-speed after the first miss.
  • Reads can be served without disk I/O when pages are cached.
  • Writes can complete quickly by marking pages dirty and deferring the slow flush to background writeback.

Core ideas to anchor on:

  • The cache is indexed by (inode, offset → page). Same file, same bytes, shared by all processes.
  • Regular read()/write() and mmap() both operate on the same underlying page cache. They're two faces of the same mechanism.
  • Dirty data isn't durable until it's written back. fsync()/fdatasync() are the contracts to force durability.
graph TB
  subgraph "Application Layer"
    A1[Process A: read/write]
    A2[Process B: mmap]
    A3[Process C: read/write]
  end
  subgraph "Virtual File System (VFS)"
    V1[File Descriptors]
    V2[Inode Operations]
  end
  subgraph "Page Cache"
    P1["File 1<br/>Pages 0-7<br/>(Clean)"]
    P2["File 1<br/>Pages 8-15<br/>(Dirty)"]
    P3["File 2<br/>Pages 0-3<br/>(Clean)"]
    P4[LRU Eviction<br/>Algorithm]
  end
  subgraph "Storage Layer"
    S1[Block Device]
    S2[Filesystem]
    S3[Actual Disk]
  end
  A1 --> V1
  A2 --> V2
  A3 --> V1
  V1 --> P1
  V1 --> P2
  V2 --> P1
  V2 --> P3
  P1 --> S1
  P2 -.->|Writeback| S1
  P3 --> S1
  P4 --> S2
  S1 --> S3
  style P1 fill:#e8f5e8
  style P2 fill:#fff3e0
  style P3 fill:#e8f5e8
  style A2 fill:#e1f5fe
  Note1["All processes share<br/>same cached pages<br/>for same file regions"]
  Note2["Dirty pages written<br/>back asynchronously<br/>or on fsync()"]

What actually happens on read()

When you call read(fd, buf, n) on a regular file:

  1. The kernel looks up the file’s cached pages covering your file offset range.
  2. If a page is present, it copies directly from the page cache into your buffer and advances the file offset.
  3. If a page is missing (cache miss), the kernel issues I/O to fill that page, potentially alongside neighbors via readahead, then copies to your buffer.

If you read sequentially, the kernel’s readahead heuristics will detect the pattern and fetch future pages in the background so your next read hits memory. If you jump around randomly, readahead backs off; most reads will need I/O.

Key properties:

  • A successful read() returning r > 0 means r bytes were copied from the cache (after the kernel fetched them if needed). It does not promise that your future reads will be cached.
  • Short reads are legal: you asked for n; the kernel may give you any 0 < r <= n based on what was available without further blocking (or on hitting EOF). Loop until done.

A minimal, robust loop (from first principles):

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
 
ssize_t read_exact(int fd, void *buf, size_t len) {
  uint8_t *p = (uint8_t *)buf;
  size_t have = 0;
  while (have < len) {
    ssize_t r = read(fd, p + have, len - have);
    if (r > 0) { have += (size_t)r; continue; }
    if (r == 0) return (ssize_t)have;      // EOF
    if (errno == EINTR) continue;          // try again
    return -1;                             // error (EIO, EINVAL, ...)
  }
  return (ssize_t)have;
}

Readahead: the quiet accelerator

Readahead is speculation done right. When the kernel notices you marching forward in a file, it starts fetching the next pages asynchronously. Your synchronous read() then hits memory instead of waiting on the disk.

Practical implications you can engineer around:

  • Favor forward, contiguous reads when possible (parse headers first, then bodies in order). Random seeks crush readahead benefits.
  • If you must sample sparsely, collect offsets and sort them to batch nearby reads together—turn random into clustered (see the sketch after this list).
  • File systems and kernels tune readahead windows dynamically. Your best lever is access pattern, not magic flags.
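A minimal sketch of that clustering trick, assuming you control a batch of sample offsets (the helper name and the fixed record length are illustrative):

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
 
static int cmp_off(const void *a, const void *b) {
  off_t x = *(const off_t *)a, y = *(const off_t *)b;
  return (x > y) - (x < y);
}
 
// Sort the sample offsets so nearby records are read together; forward,
// clustered access lets readahead help instead of fighting you.
void read_samples_clustered(int fd, off_t *offs, size_t n, void *buf, size_t reclen) {
  qsort(offs, n, sizeof *offs, cmp_off);
  for (size_t i = 0; i < n; ++i)
    (void)pread(fd, buf, reclen, offs[i]);
}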

What actually happens on write()

write(fd, buf, n) on a regular file:

  1. Copies n bytes from your buffer into page cache pages covering the file range.
  2. Marks those pages dirty and updates metadata (size, mtime) in memory.
  3. Returns once the copy is done—often long before the disk has seen the data.

Later, background writeback threads flush dirty pages to storage based on thresholds and policies (dirty ratios, flusher duty cycles). Two calls give you durability control:

  • fsync(fd): flushes file data and metadata to stable storage.
  • fdatasync(fd): flushes data and minimal metadata needed to access it; may skip timestamps.

The rule of thumb: no fsync/fdatasync, no durability guarantees. If you’re designing a WAL or database, you must arrange for explicit flushes at the right points.
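To make “explicit flushes at the right points” concrete, here is a minimal append-then-sync sketch (append_durable is our name; error handling is trimmed to the essentials):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>
 
// Append a record, then make it durable before acknowledging it upstream.
int append_durable(int fd, const void *rec, size_t len) {
  const uint8_t *p = rec;
  size_t done = 0;
  while (done < len) {
    ssize_t w = write(fd, p + done, len - done);
    if (w > 0) { done += (size_t)w; continue; }
    if (w < 0 && errno == EINTR) continue;   // retry interrupted writes
    return -1;
  }
  return fdatasync(fd); // data (and minimal metadata) reach stable storage
}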

mmap: the other path to the same cache

mmap maps a file (or anonymous memory) into your address space. You then read/write it with ordinary loads and stores. Under the hood:

  • The first time your code touches a not-yet-present page, the CPU raises a page fault.
  • The kernel resolves it by bringing the corresponding file page into the page cache (if it isn’t already) and mapping it into your process’s page tables.
  • Subsequent accesses are ordinary memory ops; the kernel is out of the way until the next fault or eviction.

Writes through a shared file mapping (MAP_SHARED) dirty the same page cache pages that write() would dirty. Later, writeback flushes them, or you can call msync() to push ranges proactively. Private mappings (MAP_PRIVATE) use copy-on-write: your writes affect anonymous private pages, not the file, unless you explicitly write back via pwrite or similar.

Minimal scanning example with mmap:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdint.h>
 
uint64_t sum_bytes(const char *path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return 0;
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 0; }
  size_t len = (size_t)st.st_size;
  const unsigned char *p = (const unsigned char *)mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { close(fd); return 0; }
  uint64_t sum = 0;
  for (size_t i = 0; i < len; ++i) sum += p[i]; // page faults pull pages in via the page cache
  munmap((void *)p, len);
  close(fd);
  return sum;
}

Why prefer mmap here?

  • The kernel can drive readahead perfectly: your sequential touches are literal virtual-memory accesses.
  • You avoid a user-kernel copy per read(); the data is already in the page cache mapped into your address space.

Why not always mmap?

  • Error handling becomes page-fault handling. I/O errors surface asynchronously on access instead of as read() return codes.
  • You must manage pointer lifetimes and munmap() carefully (no stray pointers past unmap).
  • For write-heavy workloads that need clear durability points, write() + fsync() often makes the protocol clearer.

Semantics and coherence: mixing APIs safely

Because read()/write() and mmap(MAP_SHARED) touch the same page cache, they’re coherent with each other through that cache. A few rules to avoid surprises:

  • If one process writes via write() and another reads via a shared mapping, the reader may see the change once the writer’s data is in the page cache (often immediately). There’s still no cross-process synchronization—use your own protocols for ordering and visibility.
  • To push your own writes through a shared mapping out to the file system, use msync(addr, len, MS_SYNC). In the other direction, write()d data is visible to your own mapping with no extra calls beyond ordinary memory ordering in your program, but beware CPU caches on some architectures when using device DAX or non-coherent mappings.
  • Private mappings (MAP_PRIVATE) are intentionally not coherent with file writes after the mapping is established—copy-on-write decouples them.

We’ll go deeper into flush semantics, msync modes, and pitfalls around partial-page writes next, but the key intuition stands: the page cache is the common ground.
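A tiny demonstration of that common ground, as a sketch: it assumes fd is open O_RDWR on a file at least one page long, and the function name is ours.

#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
 
// pwrite() and a MAP_SHARED mapping hit the same page cache pages, so the
// store is visible through the mapping with no extra syscalls.
void coherence_demo(int fd) {
  char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
  if (map == MAP_FAILED) return;
  (void)pwrite(fd, "hello", 5, 0);       // dirties a page cache page
  assert(memcmp(map, "hello", 5) == 0);  // same page, immediately visible
  munmap(map, 4096);
}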


mmap vs read/write: choosing the right tool

Both APIs can be fast. The choice comes down to ergonomics, error semantics, and access patterns.

graph TB
  subgraph "read/write Path"
    R1[Application Buffer] --> R2[read() syscall]
    R2 --> R3[VFS Layer]
    R3 --> R4[Page Cache Lookup]
    R4 -->|Hit| R5[Copy to User Buffer]
    R4 -->|Miss| R6[Load from Disk]
    R6 --> R7[Copy to User Buffer]
    R5 --> R8[Return to App]
    R7 --> R8
  end
  subgraph "mmap Path"
    M1[Virtual Address] --> M2[Page Fault]
    M2 --> M3[Page Cache Lookup]
    M3 -->|Hit| M4[Map Page to Process]
    M3 -->|Miss| M5[Load from Disk]
    M5 --> M6[Map Page to Process]
    M4 --> M7[Direct Memory Access]
    M6 --> M7
  end
  subgraph "Trade-offs"
    T1["read/write:<br/>✓ Explicit control<br/>✓ Clear error handling<br/>✓ Bounded memory<br/>✗ Copy overhead"]
    T2["mmap:<br/>✓ Zero-copy access<br/>✓ Shared pages<br/>✓ Lazy loading<br/>✗ SIGBUS on I/O errors<br/>✗ Virtual memory overhead"]
  end
  style R5 fill:#ffebee
  style R7 fill:#ffebee
  style M7 fill:#e8f5e8
  Note1["Copy required:<br/>kernel → userspace"]
  Note2["No copy:<br/>direct page access"]
  • Prefer mmap when:

    • You scan large files sequentially or with predictable locality.
    • You benefit from zero-copy access (parsers, searchers) and can structure code to touch bytes once.
    • You want the kernel to do perfect readahead from your memory touches.
  • Prefer read/pread when:

    • You need explicit, synchronous error returns per I/O call.
    • You operate on streaming interfaces or sockets (not mappable), or on files with simple buffered I/O patterns.
    • You want tight control over durability and write ordering with fsync checkpoints.

Small rule of thumb: mmap is phenomenal for read-mostly, high-locality workloads. read/pread shines when you need explicit control, portability of error semantics, or you’re writing a WAL/DB and must pin flush points.


Measuring what matters (preview)

Before you refactor a hot path, measure real workloads:

  • Page-fault rate and readahead efficacy (look at perf page-fault samples, iostat, and fs metrics).
  • Copy cost for read() into user buffers vs mmap’s zero-copy.
  • Tail latencies under cache miss: SSDs make misses cheap-ish but not free.

We’ll dive into madvise/posix_fadvise, O_DIRECT trade-offs, alignment constraints, and fsync semantics next. For now, keep the core mental model: there’s one shared page cache; your API choice decides how you traverse it and how you surface errors and durability.

Forcing the issue: msync, fsync, and what they really promise

When you write through a shared mapping (MAP_SHARED), your stores dirty page cache pages. To push them to storage you have two knobs:

  • msync(addr, len, MS_SYNC): write back dirty pages in the given range and wait. MS_ASYNC schedules writeback but returns early. MS_INVALIDATE asks the kernel to drop cache state and reload from disk on next access (for multi-writer scenarios).
  • fsync(fd)/fdatasync(fd): flush file data (and metadata, for fsync) to stable storage.

Rules of thumb that avoid 3 a.m. surprises:

  • msync(MS_SYNC) ensures modified pages are written to the file. For full durability including metadata (e.g., size growth), pair with fdatasync(fd)/fsync(fd) on the file descriptor.
  • If you grow or shrink the file via ftruncate, call fsync afterward to persist the size change, in addition to msync/fdatasync for the data itself.
  • Avoid MS_INVALIDATE unless you truly want to throw away private changes and force refetch; it can be surprising and is not a replacement for a coherence protocol.

Minimal write-through-mapping with explicit flush:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string.h>
 
int write_header(const char *path, const void *hdr, size_t len) {
  int fd = open(path, O_RDWR);
  if (fd < 0) return -1;
  // Ensure the file is large enough before mapping
  if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }
  void *p = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { close(fd); return -1; }
  memcpy(p, hdr, len);                       // dirties page cache pages
  int rc = msync(p, len, MS_SYNC);           // may surface EIO from writeback
  // Persist size + metadata changes
  if (rc == 0) rc = fdatasync(fd);
  munmap(p, len);
  close(fd);
  return rc;
}

Tell the kernel your intent: madvise and posix_fadvise

You can often get big wins by giving the kernel hints about how you will access the file or mapping.

Memory mapping hints (madvise):

  • MADV_SEQUENTIAL: you’ll walk forward once; kernel can be more aggressive with readahead and drop behind.
  • MADV_RANDOM: avoid aggressive readahead.
  • MADV_WILLNEED: prefetch pages in the range soon.
  • MADV_DONTNEED: you’re done with this range; drop clean pages and discard private COW pages.
#include <sys/mman.h>
 
void tune_mapping(void *addr, size_t len) {
  // Declare the overall pattern: one forward pass
  (void)madvise(addr, len, MADV_SEQUENTIAL);
  // Before a tight scan of a window, prefetch it
  (void)madvise(addr, len, MADV_WILLNEED);
  // After processing a window, release it to relieve cache pressure
  (void)madvise(addr, len, MADV_DONTNEED);
  // (In real code the last two target different subranges at different times.)
}

File descriptor hints (posix_fadvise):

  • POSIX_FADV_SEQUENTIAL vs POSIX_FADV_RANDOM mirror the intent of madvise for non-mapped I/O.
  • POSIX_FADV_WILLNEED: initiate readahead.
  • POSIX_FADV_DONTNEED: evict cached pages for the range (clean ones); good for one-pass readers.
  • POSIX_FADV_NOREUSE: data will be accessed once; implementations vary.
#include <fcntl.h>
 
void tune_fd(int fd, off_t off, off_t len) {
  (void)posix_fadvise(fd, off, len, POSIX_FADV_SEQUENTIAL);
  (void)posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}

These are hints, not contracts. Measure under your kernel/filesystem; some combos respond more visibly than others.

mmap pitfalls you should design around

mmap is powerful—and sharp. Common failure modes:

  • Truncation under your feet: if another process (or you) shrinks the file while you hold a mapping, touching addresses past the new EOF can raise SIGBUS. Harden by coordinating size changes or by only mapping stable regions.
  • Growth isn’t automatic: if the file grows, your mapping doesn’t. You must munmap and mmap a larger region to see new bytes.
  • Partial-page writes: the last page of a file may include unrelated bytes you didn’t intend to change. When writing via mappings, update the exact subrange or pad your format to page boundaries.
  • Error propagation: I/O errors surface on access (fault time) rather than on a syscall return like read(). You’ll often see SIGBUS or EIO from msync. Audit error handling paths.

Minimal guard against SIGBUS on trailing bytes:

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
 
void *map_safe_tail(int fd, size_t *out_len) {
  struct stat st;
  if (fstat(fd, &st) != 0) return MAP_FAILED;
  size_t len = (size_t)st.st_size;           // map only what exists right now
  if (len == 0) { *out_len = 0; return MAP_FAILED; } // zero-length mmap fails anyway
  void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) return MAP_FAILED;
  *out_len = len;
  return p;
}

Coordinate file size changes across writers/readers; don’t rely on luck.

Dirty throttling and background writeback

Dirty pages can’t grow without bound. Linux tracks dirty ratios/bytes and will throttle writers when thresholds are exceeded:

  • Buffered write() may block once the dirty pool is too large, even if the page cache copy is fast.
  • msync(MS_SYNC) obviously blocks until pages are written.
  • Mapped writes can stall when flusher threads are saturated or the device is slow.

Symptoms you’ll observe:

  • Latency spikes on writes after sustained throughput.
  • CPU idle with I/O queues full; iostat shows high util, low KB/s if throttled by device.

Operational guidance:

  • Batch and align writes; prefer sequential access to help flusher throughput.
  • Consider write-combining at the application layer to reduce the number of dirty pages.
  • Budget end-to-end: add deadlines and surface backpressure instead of letting dirty growth surprise you (one approach is sketched below).
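One Linux-specific way to act on that last point is to start writeback right behind your writes with sync_file_range. A sketch (the helper name is ours); note this schedules writeback but is not a durability barrier:

#define _GNU_SOURCE   // sync_file_range is Linux-specific
#include <fcntl.h>
#include <unistd.h>
 
// Write a chunk, then nudge the kernel to start writing it back now,
// so dirty pages don't pile up until global thresholds throttle us.
int write_then_writeback(int fd, const void *buf, size_t len, off_t off) {
  ssize_t w = pwrite(fd, buf, len, off);
  if (w < 0) return -1;
  return sync_file_range(fd, off, (off_t)w, SYNC_FILE_RANGE_WRITE);
}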

Bypassing the cache: when (and how) to use direct I/O

Sometimes the page cache gets in the way—think large one-off scans that evict hot working sets, or databases that maintain their own cache and want to avoid double-caching. Enter direct I/O.

On Linux, O_DIRECT asks the kernel to transfer data between your buffers and the block device without filling the page cache. Caveats:

  • Alignment matters: file offsets, buffer addresses, and lengths generally must be multiples of the logical block size (often 512 or 4096 bytes). Violations yield EINVAL.
  • Semantics vary by filesystem; some still touch the cache for metadata or small tail regions.
  • You lose readahead and caching benefits; your code must batch and prefetch explicitly.

Minimal pattern with alignment helpers:

#define _GNU_SOURCE   // O_DIRECT needs this on Linux
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
 
int read_direct_4k(const char *path, void **out_buf, size_t *out_len) {
  int fd = open(path, O_RDONLY | O_DIRECT);
  if (fd < 0) return -1;
  size_t len = 4096; // offset, length, and buffer address must be block-size multiples
  void *buf = NULL;
  if (posix_memalign(&buf, 4096, len) != 0) { close(fd); return -1; }
  ssize_t r = pread(fd, buf, len, 0);
  close(fd);
  if (r < 0) {
    // EINVAL usually means misalignment: verify the device's logical block size
    free(buf);
    return -1;
  }
  *out_buf = buf; *out_len = (size_t)r;
  return 0;
}

Platform note:

  • macOS doesn’t support O_DIRECT; use fcntl(fd, F_NOCACHE, 1) to hint “don’t populate cache.” Semantics differ, and some caching may still occur (a one-line helper is sketched below).
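A minimal sketch of that hint (the wrapper name is ours):

#include <fcntl.h>
 
// macOS: ask that subsequent I/O on fd not populate the unified buffer cache.
int set_nocache(int fd) { return fcntl(fd, F_NOCACHE, 1); }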

Use cases that actually benefit:

  • DB engines with their own buffer cache and eviction policy.
  • Very large streaming reads that you don’t want to pollute the cache with (e.g., cold archival scans). For those, POSIX_FADV_DONTNEED on a normal read path can also be effective and simpler.

Buffered + direct I/O together: proceed with caution

Mixing buffered I/O (read/write) and direct I/O (O_DIRECT) on the same file region can produce surprising results:

  • Reads through the page cache may return stale data if you wrote via direct I/O and the kernel didn’t invalidate cached pages for that range.
  • Writes through the page cache might later overwrite blocks you wrote via direct I/O, depending on flush timing.

Conservative guidance that avoids footguns:

  • Do not mix buffered and direct I/O concurrently on the same file and byte ranges. If you must, separate by ranges or time (and use POSIX_FADV_DONTNEED/msync(MS_INVALIDATE) carefully to purge cache), and test on your target filesystem.
  • Prefer a single mode per file for a given process: either buffered or direct.

Streaming patterns that keep caches healthy

One-pass scans are great candidates for “drop-behind” so you don’t evict hot working sets.

Windowed reader with drop-behind:

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
 
enum { WIN = 8 * 1024 * 1024, CHUNK = 256 * 1024 };
 
int scan_drop_behind(int fd) {
  uint8_t buf[CHUNK];
  off_t off = 0; off_t dropped = 0; ssize_t r;
  for (;;) {
    r = pread(fd, buf, sizeof buf, off);
    if (r > 0) {
      // process buf[0..r)
      off += r;
      if (off - dropped >= WIN) {
        (void)posix_fadvise(fd, dropped, off - dropped, POSIX_FADV_DONTNEED);
        dropped = off;
      }
      continue;
    }
    if (r == 0) break; // EOF
    if (r < 0 && errno == EINTR) continue;
    return -1; // error
  }
  // Final drop-behind
  (void)posix_fadvise(fd, dropped, off - dropped, POSIX_FADV_DONTNEED);
  return 0;
}

Notes:

  • pread avoids races on the shared file offset if multiple threads scan different regions.
  • Use a reasonable window (8–64 MiB) to avoid over-aggressive eviction while still protecting hot caches.

Equivalent drop-behind with mmap uses madvise(MADV_DONTNEED) after finishing each window.
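A sketch of that mmap flavor (drop_window and its parameters are illustrative; the offset and length should be page-aligned):

#include <sys/mman.h>
 
// Release a finished window of a read-only mapping; 'base' is the mapping start.
void drop_window(void *base, size_t start, size_t len) {
  (void)madvise((char *)base + start, len, MADV_DONTNEED);
}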

Concurrency and correctness with offsets and locks

For multi-threaded I/O on the same file:

  • Use pread/pwrite to avoid contending on the per-FD file offset.
  • If multiple writers may touch overlapping regions, coordinate with record-level locks (application-level) or advisory file locks (fcntl(F_SETLK)), understanding they serialize by region but don’t enforce ordering semantics for caches.

Minimal advisory write lock:

#include <fcntl.h>
 
int lock_region(int fd, off_t off, off_t len) {
  struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = off, .l_len = len };
  return fcntl(fd, F_SETLKW, &fl); // blocks until acquired
}
 
int unlock_region(int fd, off_t off, off_t len) {
  struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET, .l_start = off, .l_len = len };
  return fcntl(fd, F_SETLK, &fl);
}

Use file locks judiciously; many high-throughput designs prefer append-only logs with record headers (length + checksum) and idempotent recovery over fine-grained locking.
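For illustration, such a record frame might look like this (field names and sizes are our assumption, not a standard):

#include <stdint.h>
 
// Hypothetical framing for an append-only log record: the header lets
// recovery detect a torn tail (length or checksum mismatch) and stop replay.
struct rec_hdr {
  uint32_t len;   // payload bytes that follow the header
  uint32_t crc;   // checksum over the payload (e.g., CRC-32)
};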

Appends and file tailing: choose APIs that fit the semantics

Tailing a growing file doesn’t fit mmap ergonomics well because the mapping length is fixed. Prefer buffered I/O:

  • Open the file O_RDONLY and poll for readability (kqueue/epoll) or use filesystem-specific notifications (inotify/EVFILT_VNODE).
  • read() until EAGAIN/EOF, then wait again.

Skeleton with poll:

#include <errno.h>
#include <poll.h>
#include <unistd.h>
 
void tail(int fd) {
  // Plain files always poll readable, and EAGAIN requires O_NONBLOCK;
  // pair this with inotify/EVFILT_VNODE wakeups for real tailing.
  char buf[4096]; struct pollfd p = { .fd = fd, .events = POLLIN };
  for (;;) {
    int pr = poll(&p, 1, -1);
    if (pr <= 0) continue;
    for (;;) {
      ssize_t r = read(fd, buf, sizeof buf);
      if (r > 0) { /* process */ continue; }
      if (r == 0) break; // EOF for now (the file may grow later)
      if (errno == EINTR) continue;
      if (errno == EAGAIN || errno == EWOULDBLOCK) break;
      return; // error
    }
  }
}

If you must use mmap for structured parsing on a growing file, periodically fstat to detect growth, munmap/mmap a larger range, and be prepared for races with writers.
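A sketch of that remap-on-growth dance (the helper name is ours; callers must drop pointers into the old mapping before calling it):

#include <sys/mman.h>
#include <sys/stat.h>
 
// Remap when the file has grown; returns the (possibly new) mapping.
void *remap_if_grown(int fd, void *old, size_t *len) {
  struct stat st;
  if (fstat(fd, &st) != 0 || (size_t)st.st_size <= *len) return old; // no growth
  if (old != NULL && munmap(old, *len) != 0) return NULL;
  void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) return NULL;
  *len = (size_t)st.st_size;
  return p;
}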

Hints for huge pages, NUMA, and advanced advice (measure!)

  • Transparent Huge Pages (THP) can reduce TLB pressure for anonymous memory. File-backed THP support exists on some kernels/filesystems but is nuanced. For file scans, measure before relying on THP effects.
  • MADV_HUGEPAGE/MADV_NOHUGEPAGE can hint THP behavior for eligible mappings. Effects vary by kernel.
  • NUMA locality matters for mmap: the first-touch policy typically places pages on the node of the CPU that faults them. Consider pinning threads (e.g., pthread_setaffinity_np) during the initial scan to keep data local (a minimal pinning helper is sketched below), or use mbind/numa_alloc_onnode-style APIs where appropriate.
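A minimal glibc-specific pinning sketch (choose the CPU to match your topology; the function name is ours):

#define _GNU_SOURCE   // cpu_set_t, CPU_SET, pthread_setaffinity_np (glibc)
#include <pthread.h>
#include <sched.h>
 
// Pin the calling thread to one CPU so its first-touch faults place pages
// on that CPU's NUMA node during the initial scan.
int pin_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}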

A simple, fair benchmark methodology

Comparing mmap vs read() requires care:

  • Use the same kernel, filesystem, device, and file.
  • Warm vs cold cache changes everything. Test both: once with page cache warmed (run once, then again), and once cold (drop caches if you can, outside production): echo 3 > /proc/sys/vm/drop_caches (root only, Linux).
  • Control CPU frequency scaling and background load; pin threads.

Sketch harness to time a scan with each API:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>
 
static uint64_t now_ns(void) {
  struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec*1000000000ull + (uint64_t)ts.tv_nsec;
}
 
static double time_read_scan(int fd) {             // buffered read() pass
  static char buf[1 << 20]; lseek(fd, 0, SEEK_SET);
  uint64_t t0 = now_ns();
  while (read(fd, buf, sizeof buf) > 0) { /* process buf here */ }
  return (double)(now_ns() - t0) / 1e9;
}
 
static double time_mmap_scan(int fd, size_t len) { // touch one byte per page
  const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) return 0.0;
  uint64_t t0 = now_ns(); volatile unsigned sum = 0;
  for (size_t i = 0; i < len; i += 4096) sum += p[i];
  double secs = (double)(now_ns() - t0) / 1e9;
  munmap((void *)p, len); (void)sum; return secs;
}
 
int main(int argc, char **argv) {
  if (argc < 2) return 1;
  int fd = open(argv[1], O_RDONLY);  // optionally prewarm by reading once
  struct stat st; fstat(fd, &st);
  // Run each variant multiple times, warmed and cold; report min/median
  printf("read %.3fs  mmap %.3fs\n", time_read_scan(fd), time_mmap_scan(fd, (size_t)st.st_size));
  return 0;
}

Report both throughput (MiB/s) and tail latency for operations that matter (e.g., time to first byte, time to last byte). The right choice is workload-specific—measure with realistic data and access patterns.

Prealloc, sparsity, and truncation: shaping the file

Preallocating space avoids late failures and fragmentation, and helps writeback schedule large sequential I/O.

  • posix_fallocate(fd, off, len): ensures space is allocated for the given range; subsequent writes won’t fail with ENOSPC for lack of blocks in that range. On some filesystems, this creates unwritten extents that become initialized on first write.
  • Sparse files: writing near end-of-file without touching middle creates holes (reading holes yields zeros). To intentionally punch holes and reclaim space, use fallocate with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE (Linux).
#define _GNU_SOURCE   // fallocate() is Linux-specific
#include <fcntl.h>
 
int ensure_space(int fd, off_t off, off_t len) {
  int r = posix_fallocate(fd, off, len);
  return r == 0 ? 0 : -1; // r is an errno value on failure
}
 
#ifdef __linux__
int punch_hole(int fd, off_t off, off_t len) {
  return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len);
}
#endif

Prealloc pairs well with append-heavy logs and large file builders; it reduces extent churn and surprises during peak traffic.

Durability knobs and platform nuances

Durability is where details matter most.

  • fdatasync vs fsync: fdatasync flushes file data and minimal metadata required to access it; fsync also flushes metadata like timestamps. Many databases prefer fdatasync on hot paths.
  • Open flags: O_DSYNC (data-sync) and O_SYNC (data + metadata) make each write behave as if followed by a sync—simplifies logic, often reduces throughput; measure.
  • Filesystem behavior: ext4’s default data=ordered journals metadata and writes data blocks before committing metadata; XFS and btrfs have different policies. Always test on the actual FS you deploy.
  • syncfs(fd): flushes all dirty data on the filesystem containing fd. Useful for controlled batch flushes; beware global impact.
  • macOS: F_FULLFSYNC via fcntl(fd, F_FULLFSYNC) forces the drive to flush caches to non-volatile media (stronger than fsync). It’s slower; use only where required.

Minimal cross-platform flush helper:

#include <unistd.h>
#include <fcntl.h>
 
int flush_durable(int fd) {
#if defined(__APPLE__)
  // F_FULLFSYNC returns a value other than -1 on success; fall back to
  // fsync on volumes that don't support it.
  if (fcntl(fd, F_FULLFSYNC) != -1) return 0;
#endif
  return fsync(fd);
}

Decision guide you can apply today

  • Reads are mostly sequential and large; you want simplicity: use buffered read/pread. Add POSIX_FADV_SEQUENTIAL and optional drop-behind.
  • Reads are large and one-off, and you must protect hot caches: consider buffered reads with POSIX_FADV_DONTNEED after each window, or O_DIRECT if alignment constraints are acceptable and your FS supports it well.
  • Reads are many and random across a hot set: use mmap or buffered I/O—measure both. Favor madvise(MADV_RANDOM) for mmap.
  • Writes are append-only with periodic durability points: buffered write + fdatasync at checkpoints; preallocate space up front.
  • Writes modify in-place and must be readable immediately by other processes: shared mmap + msync(MS_SYNC) on the touched ranges; consider buffered I/O if error handling via return codes is preferable.
  • Multiple threads read/write different regions: use pread/pwrite to avoid file-offset races; avoid mixing buffered with direct I/O on the same ranges.

Troubleshooting and observability

Tools that reveal what’s happening:

  • Syscall tracing: strace -T -e trace=read,write,fsync,fdatasync,msync,mmap,munmap (Linux) or dtruss (macOS). Look for short reads/writes, unexpected EINTR, and sync timing.
  • Page faults: perf stat -e page-faults <cmd> and perf record -e page-faults; high minor faults are normal with mmap scans; major faults indicate disk I/O.
  • I/O pressure: iostat -xz 1, pidstat -d 1, iotop to watch queues and throughput.
  • Block layer: blktrace/btt (Linux) for deep dives; advanced users can employ bpftrace to sample VFS/page cache events.

Quick bpftrace sampler (Linux):

bpftrace -e 'tracepoint:syscalls:sys_enter_read { @r[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@r[tid]/ { printf("read %d -> %ld in %ld us\n", pid, args->ret, (nsecs - @r[tid]) / 1000); delete(@r[tid]); }'

Interpretation pointers:

  • If read() is fast but first access via mmap is slow, you’re seeing page faults fetch data—expected. Warm the cache or switch API.
  • If throughput collapses during writes, suspect dirty throttling or device limits; batch writes and monitor dirty ratios.
  • If durability calls dominate tail latency, consider grouping with syncfs or moving fsyncs off the hot path (WAL + group-commit style).

Production checklist (printable)

  • Choose one I/O mode per file/region; avoid mixing buffered and direct.
  • Use pread/pwrite for concurrency; avoid shared-offset races.
  • Exploit readahead: keep scans sequential or cluster seeks.
  • Drop-behind on one-pass scans (POSIX_FADV_DONTNEED or MADV_DONTNEED).
  • Preallocate with posix_fallocate; manage sparsity intentionally.
  • Budget durability with fdatasync/fsync/msync; know your filesystem’s guarantees.
  • Add deadlines and backpressure; expect dirty throttling.
  • Measure hot paths with warmed and cold caches; report throughput and tails.

Closing thoughts

There isn’t a single “fastest” I/O API—there’s a shared page cache and a set of contracts. Your job is to pick the API that best expresses your error handling, durability points, and access pattern, then give the kernel clear hints so it can help you. When you keep the mental model simple—one cache, two ways to traverse it, and rare cases for bypass—you’ll ship I/O that is both fast and predictably correct.