You need writes that stay fast even when your dataset grows, reads that don’t thrash disks, and a design simple enough that you can reason about it at 2 a.m. That’s the promise of log-structured storage: append now, organize later. We’ll build up a practical mental model, then shape it into concrete C code you can compile—covering append-only segments, crash consistency, a minimal SSTable format, compaction policy, and Bloom filters that say “don’t bother hitting disk for this key.”
We’ll proceed in small, production-flavored steps: durable write path first, an immutable on-disk table you can scan and binary-search, and the scaffolding for compaction and filters. The endgame is an LSM-ish store that can take punches: sustained ingest, predictable recovery, and bounded memory.
Why log-structured storage exists (and still wins)
Random in-place updates are a losing battle for modern storage. Drives (HDD and SSD) reward sequential writes and punish scattered updates. The log-structured answer is elegant:
- Append new facts to a log/segment—sequential, cache-friendly, fsync-able
- Periodically merge immutable segments to purge old versions and reclaim space
- Maintain small auxiliary indexes (per-segment) so reads remain cheap
Writes become predictable, and garbage collection (compaction) moves the mess off the hot path.
The bird’s-eye architecture
- Memtable (in-memory, ordered map)
- WAL (append-only redo log for durability)
- Immutable SSTables on disk (sorted segments)
- Compaction (tiered/leveled) to merge and discard overwritten keys
- Bloom filters to avoid unnecessary disk probes
High-level flow:
- Client PUT(k,v) → write WAL → insert into memtable → ack
- Memtable fills → flush to SSTable file (sorted) with index + filter → drop oldest WAL entries
- Background: compact multiple SSTables into fewer, larger ones; maintain ordering and drop shadowed keys
Minimal data model and invariants
We’ll use a byte-oriented KV interface:
- Keys: opaque byte strings (lexicographically ordered)
- Values: opaque byte strings
- MVCC-lite: a later PUT for the same key supersedes earlier versions at read time
- Deletes: represented by a tombstone record; compaction can elide dead keys
Correctness invariants we’ll keep front-and-center:
- Durability: a committed PUT survives crashes (WAL discipline)
- Atomicity at record granularity: partial writes are detectable and ignored during recovery
- Order: keys within an SSTable are strictly increasing
- Integrity: blocks and files carry checksums
The write path you can trust (WAL + memtable)
On every PUT/DELETE:
- Serialize an intent record and append to the WAL
- fsync() or group-commit within a bounded window
- Apply to the memtable (an ordered structure—skiplist, treap, or array of runs)
If you crash after the fsync but before the memtable update, recovery replays the WAL into a fresh memtable and you’re back where you intended to be.
Record shape (on the wire and in WAL)
We’ll start with a compact, self-describing record:
#include <stdint.h>
enum rec_kind : uint8_t { REC_PUT = 1, REC_DEL = 2 }; // fixed underlying type is C23; on older compilers use a plain enum and store the kind in a uint8_t
// Lengths are varints; all CRCs are little-endian u32 of the preceding bytes in the record
// [kind:1][key_len:varint][val_len:varint (0 for DEL)][key...][value...][crc32:4]
Varints keep small keys/values tight; CRC lets us reject torn/partial appends after a crash.
A tiny varint and CRC toolkit
#include <stddef.h>
#include <stdint.h>
static size_t u64_varint_len(uint64_t x) {
size_t n = 1; while (x >= 0x80) { x >>= 7; ++n; } return n; }
static uint8_t *u64_varint_encode(uint8_t *p, uint64_t x) {
while (x >= 0x80) { *p++ = (uint8_t)(x | 0x80); x >>= 7; } *p++ = (uint8_t)x; return p; }
// Placeholder CRC32 (poly 0xEDB88320). In production, prefer a vetted impl or XXH3 for speed
static uint32_t crc32_update(uint32_t crc, const void *buf, size_t len) {
static uint32_t T[256]; static int init;
if (!init) { for (uint32_t i=0;i<256;i++){ uint32_t c=i; for(int k=0;k<8;k++) c=c&1?0xEDB88320u^(c>>1):c>>1; T[i]=c; } init=1; }
crc = ~crc; const uint8_t *p = (const uint8_t*)buf;
for (size_t i=0;i<len;i++) crc = T[(crc ^ p[i]) & 0xffu] ^ (crc >> 8);
return ~crc;
}
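The toolkit above only encodes. Later snippets decode varints inline; as a reusable sketch of the inverse (u64_varint_decode is our name, not part of the original toolkit), the decoder mirrors the encoder:
static size_t u64_varint_decode(const uint8_t *p, size_t avail, uint64_t *out) {
    // Decodes a base-128 varint from at most `avail` bytes.
    // Returns bytes consumed, or 0 if the buffer ends mid-varint.
    uint64_t v = 0; int shift = 0;
    for (size_t i = 0; i < avail && shift < 64; i++) {
        uint8_t b = p[i];
        v |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80)) { *out = v; return i + 1; }
        shift += 7;
    }
    return 0; // truncated input
}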
WAL append: atomic enough to sleep at night
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
struct wal {
int fd;
};
static bool write_all(int fd, const void *buf, size_t len) {
const uint8_t *p = (const uint8_t*)buf; size_t r = len;
while (r) { ssize_t w = write(fd, p, r); if (w > 0) { p += (size_t)w; r -= (size_t)w; continue; }
if (w < 0 && errno == EINTR) continue; return false; }
return true;
}
bool wal_append_put(struct wal *w, const void *k, size_t klen, const void *v, size_t vlen) {
uint8_t hdr[1 + 10 + 10]; // kind + max varints for key/val
uint8_t *p = hdr; *p++ = (uint8_t)REC_PUT; p = u64_varint_encode(p, (uint64_t)klen); p = u64_varint_encode(p, (uint64_t)vlen);
uint32_t crc = 0; crc = crc32_update(crc, hdr, (size_t)(p - hdr));
crc = crc32_update(crc, k, klen); crc = crc32_update(crc, v, vlen);
if (!write_all(w->fd, hdr, (size_t)(p - hdr))) return false;
if (!write_all(w->fd, k, klen)) return false;
if (!write_all(w->fd, v, vlen)) return false;
if (!write_all(w->fd, &crc, sizeof crc)) return false;
return true;
}
bool wal_append_del(struct wal *w, const void *k, size_t klen) {
uint8_t hdr[1 + 10 + 10]; uint8_t *p = hdr; *p++ = (uint8_t)REC_DEL; p = u64_varint_encode(p, (uint64_t)klen); p = u64_varint_encode(p, 0);
uint32_t crc = 0; crc = crc32_update(crc, hdr, (size_t)(p - hdr)); crc = crc32_update(crc, k, klen);
if (!write_all(w->fd, hdr, (size_t)(p - hdr))) return false;
if (!write_all(w->fd, k, klen)) return false;
if (!write_all(w->fd, &crc, sizeof crc)) return false;
return true;
}
bool wal_sync(struct wal *w) { return fsync(w->fd) == 0; }
This is intentionally small and boring: varint headers, a CRC tail, and a strict fsync() to form a durability boundary. Recovery scans until a CRC mismatch and stops—the last partial record is ignored.
In-memory memtable: ordered, bounded, and flushable
The memtable needs three properties:
- Ordered by key (so we can flush as a sorted run)
- Efficient inserts (we’ll keep it simple to start)
- Bounded memory with a clear flush condition (bytes or count)
Production designs use skiplists or balanced trees. We’ll start with a sortable vector of entries: append on insert, mark size, and when a configured byte budget is hit, sort and flush to disk.
struct slice { const uint8_t *ptr; size_t len; };
struct entry { uint8_t kind; struct slice key; struct slice val; };
static int cmp_slice(const void *a, const void *b) {
const struct entry *ea = (const struct entry*)a, *eb = (const struct entry*)b;
size_t n = ea->key.len < eb->key.len ? ea->key.len : eb->key.len;
int c = memcmp(ea->key.ptr, eb->key.ptr, n); if (c) return c;
return (ea->key.len < eb->key.len) ? -1 : (ea->key.len > eb->key.len);
}
During flush we sort by key, drop intra-batch duplicates keeping only the newest entry for each key, and stream the result to an SSTable writer. One caveat: qsort() is not guaranteed to be stable, so entries should carry an insertion sequence number used as a tie-breaker when keys compare equal; otherwise “last occurrence after sorting” may not mean “newest write.”
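As a concrete sketch of that sortable vector (struct memtable, mt_put, and mt_should_flush are illustrative names, assuming the struct entry above and a caller that keeps key/value bytes alive until flush):
struct memtable {
    struct entry *ents; size_t n, cap;
    size_t bytes, byte_budget; // flush once bytes >= byte_budget
};

static int mt_put(struct memtable *mt, uint8_t kind,
                  const uint8_t *k, size_t klen, const uint8_t *v, size_t vlen) {
    if (mt->n == mt->cap) {
        size_t nc = mt->cap ? mt->cap * 2 : 256;
        struct entry *ne = (struct entry*)realloc(mt->ents, nc * sizeof *ne);
        if (!ne) return 0;
        mt->ents = ne; mt->cap = nc;
    }
    // Append in arrival order; sorting happens once, at flush time
    mt->ents[mt->n++] = (struct entry){ .kind = kind, .key = { k, klen }, .val = { v, vlen } };
    mt->bytes += klen + vlen;
    return 1;
}

static int mt_should_flush(const struct memtable *mt) {
    return mt->bytes >= mt->byte_budget;
}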
Immutable SSTable: one file, many parts
An SSTable is a single immutable file composed of:
- Header: magic, version, options (block size, checksum kind)
- Data blocks: sorted records grouped into blocks with restart points for prefix compression (we’ll stub the restarts for now)
- Per-block index: smallest key → file offset
- Bloom filter region: optional, built over all keys in the file
- Footer: fixed-size trailer with pointers to the index and filter regions + a file-level checksum
We’ll keep the first pass deliberately straightforward: one “block” that is the whole file for small flushes, a simple sparse index every N records, and a stub Bloom filter we’ll fill in later.
struct sst_writer {
int fd;
uint64_t num_records;
uint64_t index_every; // write an index entry every N records
// You’d also track offsets to write footer later
};
bool sst_begin(struct sst_writer *w, int fd) {
w->fd = fd; w->num_records = 0; w->index_every = 64; // default sparse index
static const char magic[8] = { 'S','S','T','1','\0','\0','\0','\0' };
return write_all(fd, magic, sizeof magic);
}
static bool write_kv_record(int fd, const uint8_t *k, size_t klen, const uint8_t *v, size_t vlen) {
uint8_t hdr[10+10]; uint8_t *p = hdr; p = u64_varint_encode(p, klen); p = u64_varint_encode(p, vlen);
uint32_t crc = 0; crc = crc32_update(crc, hdr, (size_t)(p - hdr));
crc = crc32_update(crc, k, klen); crc = crc32_update(crc, v, vlen);
if (!write_all(fd, hdr, (size_t)(p - hdr))) return false;
if (!write_all(fd, k, klen)) return false; if (!write_all(fd, v, vlen)) return false;
return write_all(fd, &crc, sizeof crc);
}
bool sst_add(struct sst_writer *w, const uint8_t *k, size_t klen, const uint8_t *v, size_t vlen) {
// For simplicity, we write raw sorted key/value records back-to-back with a CRC per record
if (!write_kv_record(w->fd, k, klen, v, vlen)) return false;
w->num_records++;
// In a real build, you’d emit an index entry every index_every records: [key prefix][file offset]
return true;
}
bool sst_end(struct sst_writer *w) {
// Footer (placeholder): in later sections we’ll append index/filter offsets and a file checksum
static const char end[4] = { 'E','N','D','\n' };
return write_all(w->fd, end, sizeof end) && fsync(w->fd) == 0;
}
This “minimal SSTable” is enough to support a clean flush path and a trivial reader: open, sanity-check the header, then sequentially scan (or binary search with a separate index file). We’ll refine it with a sparse index and filter once we have flushes working.
Read path sketch (single SSTable + memtable)
Reads consult the freshest layer first:
- Memtable: exact match or miss
- Recent SSTable(s): search newest to oldest; stop on first hit (or tombstone)
Even this simple two-tier approach avoids most disk I/O under write-heavy workloads because recent keys live in the memtable or the newest flushed table. Filters cut further once we add them.
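Here is a sketch of that two-tier lookup as a single function. It leans on the sst_reader/sst_get interfaces we build later in this section and on a hypothetical mt_get() memtable probe with the same return convention (1=hit, 0=tombstone, -1=miss):
// 1=value found, 0=deleted (tombstone), -1=not found, <-1 on I/O error
static int kv_get(struct memtable *mt, struct sst_reader *tables, size_t ntables,
                  const uint8_t *k, uint32_t klen, uint8_t **out, uint32_t *outlen) {
    int rc = mt_get(mt, k, klen, out, outlen);   // hypothetical in-memory probe
    if (rc >= 0) return rc;                      // hit or tombstone in the memtable
    for (size_t i = 0; i < ntables; i++) {       // tables ordered newest first
        rc = sst_get(&tables[i], k, klen, out, outlen);
        if (rc >= 0 || rc < -1) return rc;       // found, tombstone, or error
    }
    return -1;                                   // not present anywhere
}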
Durability discipline and crash recovery
Recovery is your “I meant that” button:
- On start, rebuild the memtable by scanning WAL files in order
- Stop on first malformed/CRC-failing record; truncate tail
- If a flush completed (SSTable present with valid footer), you can discard WAL segments older than that flush’s durable point
This ensures that your read path sees a consistent latest state (memtable + immutable SSTables), and that a crash never leaves half-applied updates.
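A minimal sketch of that replay loop, assuming the WAL (or one segment of it) fits in a buffer and reusing the record layout and CRC from the write path; wal_replay, apply_fn, and the u64_varint_decode helper from earlier are our illustrative names:
typedef void (*apply_fn)(uint8_t kind, const uint8_t *k, size_t klen,
                         const uint8_t *v, size_t vlen, void *ctx);

// Replays records until a torn/corrupt one; returns the length of the valid prefix.
static size_t wal_replay(const uint8_t *buf, size_t len, apply_fn apply, void *ctx) {
    size_t off = 0;
    while (off < len) {
        size_t p = off;
        uint8_t kind = buf[p++];
        uint64_t klen = 0, vlen = 0;
        size_t c = u64_varint_decode(buf + p, len - p, &klen); if (!c) break; p += c;
        c = u64_varint_decode(buf + p, len - p, &vlen);        if (!c) break; p += c;
        if (klen > len - p) break;                 // truncated tail
        if (vlen > len - p - klen) break;
        if (len - p - klen - vlen < 4) break;
        const uint8_t *k = buf + p, *v = buf + p + klen;
        uint32_t crc = 0;
        crc = crc32_update(crc, buf + off, p - off);   // kind + varints
        crc = crc32_update(crc, k, (size_t)klen);
        crc = crc32_update(crc, v, (size_t)vlen);
        uint32_t stored; memcpy(&stored, v + vlen, 4);
        if (crc != stored) break;                      // torn or corrupt record: stop here
        apply(kind, k, (size_t)klen, v, (size_t)vlen, ctx);
        off = p + (size_t)klen + (size_t)vlen + 4;
    }
    return off; // caller truncates the WAL to this offset
}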
On compaction (preview)
Compaction is the garbage collector of log-structured storage. We’ll detail strategies later, but the basic job is:
- Merge several sorted runs → produce a new, larger sorted run
- Drop overwritten values and tombstoned keys
- Maintain per-level size/overlap guarantees
Two popular strategies you should know:
- Tiered: accumulate multiple same-size runs, then merge them wholesale into the next tier
- Leveled: maintain small L0, then L1..Ln with bounded total bytes per level; compactions target overlapping key ranges to keep read amplification low
We'll pick leveled as the default and explain the exact heuristics with code later—after we've proven the write/flush/read loop.
Bloom filters (preview)
Filters reduce disk probes for negative lookups: before touching an SSTable, consult its Bloom filter. If the filter says “definitely not,” you skip the file.
We’ll implement:
- A bitset stored in the SSTable
- k independent hash functions derived from 64-bit mixing
- Parameterization by bits-per-key to balance false positives vs space
Rule of thumb: the optimal number of hash functions is k ≈ (m/n) * ln 2, where m is the number of bits and n the number of keys.
Milestone checklist (where we are now)
- WAL: append + CRC + fsync
- Memtable: ordered buffer with flush trigger
- SSTable writer: minimal sorted run with per-record checksums
- Recovery plan: scan WAL, rebuild memtable, ignore torn tails
Next, we’ll wire the flush path end-to-end (memtable → SSTable), add a sparse index for binary search, and layer in a working Bloom filter to cut misses. Then compaction policy and range merges to keep read amplification under control.
Wiring the flush path (memtable → SSTable)
A flush turns the in-memory batch into one immutable, sorted run on disk with a tiny index and an optional Bloom filter. Steps:
- Sort entries by key
- Deduplicate within the batch, keeping only the newest entry per key
- Stream records to the SSTable writer (PUT → key+value, DEL → tombstone with empty value)
- Every N records, capture an index entry: the current file offset and the corresponding key
- Add keys to the Bloom filter as you go
- After data, write the index region, then the filter region, then a footer with offsets
We’ll define a simple footer so a reader can find the index/filter quickly:
// Footer layout at end of file:
// [index_off:varint][index_len:varint][filter_off:varint][filter_len:varint][magic:8]
static const char FOOTER_MAGIC[8] = { 'S','S','T','F','O','O','T','1' };
Bloom filter: compact and fast
We’ll use double hashing to derive k indices from two 64-bit hashes. A simple FNV-1a 64 for h1 and a splitmix64-based variant for h2 keeps implementation small and portable.
#include <stdint.h>
#include <string.h>
static uint64_t fnv1a64(const void *buf, size_t len) {
const uint8_t *p = (const uint8_t*)buf; uint64_t h = 1469598103934665603ull;
for (size_t i=0;i<len;i++) { h ^= p[i]; h *= 1099511628211ull; }
return h;
}
static uint64_t splitmix64(uint64_t x) {
x += 0x9e3779b97f4a7c15ull; x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
x = (x ^ (x >> 27)) * 0x94d049bb133111ebull; return x ^ (x >> 31);
}
struct bloom { uint32_t m_bits; uint32_t k; uint8_t *bits; };
static void bloom_set(struct bloom *b, uint64_t i) {
i %= b->m_bits; size_t byte = (size_t)(i >> 3); uint8_t mask = (uint8_t)(1u << (i & 7));
b->bits[byte] |= mask;
}
static int bloom_get(const struct bloom *b, uint64_t i) {
i %= b->m_bits; size_t byte = (size_t)(i >> 3); uint8_t mask = (uint8_t)(1u << (i & 7));
return (b->bits[byte] & mask) != 0;
}
static void bloom_add(struct bloom *b, const void *key, size_t klen) {
uint64_t h1 = fnv1a64(key, klen); uint64_t h2 = splitmix64(h1);
for (uint32_t i=0;i<b->k;i++) bloom_set(b, h1 + i * h2);
}
static int bloom_maybe_contains(const struct bloom *b, const void *key, size_t klen) {
uint64_t h1 = fnv1a64(key, klen); uint64_t h2 = splitmix64(h1);
for (uint32_t i=0;i<b->k;i++) if (!bloom_get(b, h1 + i * h2)) return 0; return 1;
}
We size the filter by bits-per-key, m = ceil(n * bits_per_key), and choose k = max(1, round((m/n) * ln 2)). With the default of 10 bits per key that gives k ≈ 7 and roughly a 1% false-positive rate.
Sparse index format
We’ll append an index region that’s just a sequence of entries:
[key_len:varint][key_bytes...][data_offset:varint]
It’s simple, compact, and easy to binary search in memory after loading.
Putting it together: flush implementation
#include <fcntl.h>
#include <stdlib.h>
struct dynbuf { uint8_t *p; size_t len, cap; };
static void db_init(struct dynbuf *b){ b->p=NULL; b->len=0; b->cap=0; }
static int db_reserve(struct dynbuf *b, size_t add){ size_t need=b->len+add; if(need<=b->cap) return 1; size_t nc=b->cap?b->cap*2:1024; while(nc<need) nc*=2; uint8_t *np=(uint8_t*)realloc(b->p,nc); if(!np) return 0; b->p=np; b->cap=nc; return 1; }
static int db_put(struct dynbuf *b, const void *p, size_t n){ if(!db_reserve(b,n)) return 0; memcpy(b->p+b->len,p,n); b->len+=n; return 1; }
struct index_entry { uint64_t off; uint32_t klen; const uint8_t *k; };
struct flush_cfg { double bits_per_key; uint32_t index_every; };
static size_t dedupe_in_place(struct entry *a, size_t n) {
if (n == 0) return 0; size_t w = 0; // assumes 'a' is sorted by key
for (size_t i = 0; i < n; ) {
size_t j = i + 1; while (j < n && cmp_slice(&a[i], &a[j]) == 0) j++;
a[w++] = a[j-1]; i = j; // keep the last occurrence per key (only the newest if equal keys kept insertion order or carry a seq tie-breaker)
}
return w;
}
int flush_memtable_to_sstable(const char *path, struct entry *ents, size_t n, struct flush_cfg cfg) {
// 1) sort by key
qsort(ents, n, sizeof *ents, cmp_slice);
// 2) dedupe (keep newest per key)
n = dedupe_in_place(ents, n);
int fd = open(path, O_CREAT|O_TRUNC|O_WRONLY, 0644);
if (fd < 0) return -1;
struct sst_writer w; if (!sst_begin(&w, fd)) { close(fd); return -1; }
// Prepare filter
uint64_t m = (uint64_t) ((double)n * (cfg.bits_per_key <= 0 ? 10.0 : cfg.bits_per_key));
if (m < 8) m = 8; // at least one byte
struct bloom bf = { .m_bits = (uint32_t)m, .k = 0, .bits = NULL };
bf.k = (uint32_t)((double)bf.m_bits / (double)(n? n:1) * 0.6931471805599453); // ln 2
if (bf.k == 0) bf.k = 1; if (bf.k > 15) bf.k = 15;
size_t bitbytes = ((size_t)bf.m_bits + 7u) >> 3; bf.bits = (uint8_t*)calloc(1, bitbytes);
if (!bf.bits) { close(fd); return -1; }
struct dynbuf index; db_init(&index);
// 3–5) stream data, build index+filter
uint64_t recno = 0; off_t data_off;
for (size_t i = 0; i < n; i++) {
// Index every cfg.index_every records
if ((recno % (cfg.index_every ? cfg.index_every : 64)) == 0) {
data_off = lseek(fd, 0, SEEK_CUR);
// encode [key_len][key][off]
uint8_t tmp[20]; uint8_t *p = tmp; p = u64_varint_encode(p, ents[i].key.len);
if (!db_put(&index, tmp, (size_t)(p - tmp))) goto fail;
if (!db_put(&index, ents[i].key.ptr, ents[i].key.len)) goto fail;
p = tmp; p = u64_varint_encode(p, (uint64_t)data_off);
if (!db_put(&index, tmp, (size_t)(p - tmp))) goto fail;
}
// Add to filter (include tombstones to avoid false negatives for shadowing)
bloom_add(&bf, ents[i].key.ptr, ents[i].key.len);
// Write record (DEL → empty value as tombstone)
const uint8_t *vp = ents[i].kind == REC_DEL ? (const uint8_t*)"" : ents[i].val.ptr;
size_t vlen = ents[i].kind == REC_DEL ? 0 : ents[i].val.len;
if (!sst_add(&w, ents[i].key.ptr, ents[i].key.len, vp, vlen)) goto fail;
recno++;
}
// 6) write index region
off_t index_off = lseek(fd, 0, SEEK_CUR);
if (!write_all(fd, index.p, index.len)) goto fail;
uint64_t index_len = (uint64_t)index.len;
// write filter region (format: [m_bits:varint][k:varint][bytes...])
uint8_t hdr[20]; uint8_t *ph = hdr; ph = u64_varint_encode(ph, (uint64_t)bf.m_bits); ph = u64_varint_encode(ph, (uint64_t)bf.k);
off_t filter_off = lseek(fd, 0, SEEK_CUR);
if (!write_all(fd, hdr, (size_t)(ph - hdr))) goto fail;
if (!write_all(fd, bf.bits, bitbytes)) goto fail;
uint64_t filter_len = (uint64_t)((ph - hdr) + bitbytes);
// footer
uint8_t ftr[64]; uint8_t *pf = ftr;
pf = u64_varint_encode(pf, (uint64_t)index_off);
pf = u64_varint_encode(pf, index_len);
pf = u64_varint_encode(pf, (uint64_t)filter_off);
pf = u64_varint_encode(pf, filter_len);
// write footer chunks and magic
if (!write_all(fd, ftr, (size_t)(pf - ftr))) goto fail;
if (!write_all(fd, FOOTER_MAGIC, sizeof FOOTER_MAGIC)) goto fail;
if (!sst_end(&w)) goto fail; // relies on the refined sst_end below (fsync only): any extra end marker would push the footer magic away from EOF
free(bf.bits); free(index.p); close(fd); return 0;
fail:
{
int err = errno; free(bf.bits); free(index.p); close(fd); errno = err; return -1;
}
}
Notes:
- We include tombstoned keys in the filter to avoid false negatives that would allow an older value to resurface from lower levels.
- Sparse index granularity (index_every) is a trade-off: more entries → faster seeks, larger index
- Footer magic at the very end simplifies discovery: read the last 8 bytes, then backtrack over the varints
SSTable reader: filter → index → data
Reads avoid I/O in three stages:
- Consult the Bloom filter. If negative, skip the file.
- Binary search the in-memory index to find the largest offset ≤ target key.
- Seek to that offset, then scan forward until the key is found or the next index boundary is crossed.
Footer and region loading
struct sst_index_entry { const uint8_t *key; uint32_t klen; uint64_t off; };
struct sst_reader {
int fd; uint8_t *index; size_t index_len; struct sst_index_entry *ents; size_t ents_n;
struct bloom bf; // owns bf.bits
};
static int read_tail(int fd, off_t file_size, void *buf, size_t len) {
if (lseek(fd, file_size - (off_t)len, SEEK_SET) < 0) return -1;
return (int)read(fd, buf, len) == (int)len ? 0 : -1;
}
static int sst_load_footer(int fd, off_t *index_off, uint64_t *index_len, off_t *filter_off, uint64_t *filter_len) {
// Read magic
off_t end = lseek(fd, 0, SEEK_END); if (end < 16) return -1;
char magic[8]; if (read_tail(fd, end, magic, sizeof magic) < 0) return -1;
if (memcmp(magic, FOOTER_MAGIC, 8) != 0) return -1;
// Read backwards the preceding varints: we’ll seek near the end and parse forward with a small buffer.
size_t back = 64; if (end < (off_t)(back + 8)) back = (size_t)end - 8;
if (lseek(fd, end - 8 - (off_t)back, SEEK_SET) < 0) return -1;
uint8_t tmp[128]; ssize_t n = read(fd, tmp, back); if (n <= 0) return -1;
// The last four varints before the magic are index_off, index_len, filter_off, filter_len.
// For simplicity, parse forward keeping a shift queue of the last four decodes; this assumes the
// read window happens to start on a varint boundary (a fixed-width footer would make this exact).
uint64_t vals[4] = {0}; int vi = 0; size_t i = 0; while (i < (size_t)n) {
uint64_t v = 0; int s = 0; for (;;){ if (i >= (size_t)n) return -1; uint8_t b = tmp[i++]; v |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; }
// shift queue
vals[0] = vals[1]; vals[1] = vals[2]; vals[2] = vals[3]; vals[3] = v; vi++;
}
if (vi < 4) return -1;
// the shift queue already holds the last four decodes in order
*index_off = (off_t)vals[0]; *index_len = vals[1];
*filter_off = (off_t)vals[2]; *filter_len = vals[3];
return 0;
}
static int sst_open(struct sst_reader *r, const char *path) {
memset(r, 0, sizeof *r); r->fd = open(path, O_RDONLY); if (r->fd < 0) return -1;
off_t io=0, fo=0; uint64_t il=0, fl=0; if (sst_load_footer(r->fd, &io, &il, &fo, &fl) != 0) goto fail;
// Load index region into memory
r->index_len = (size_t)il; r->index = (uint8_t*)malloc(r->index_len); if (!r->index) goto fail;
if (lseek(r->fd, io, SEEK_SET) < 0) goto fail; if ((size_t)read(r->fd, r->index, r->index_len) != r->index_len) goto fail;
// Parse index into entries (pointers into r->index)
size_t cap = 16; r->ents = (struct sst_index_entry*)malloc(cap * sizeof *r->ents); r->ents_n = 0;
size_t p = 0; while (p < r->index_len) {
// key_len
uint64_t klen = 0; int s = 0; do { uint8_t b = r->index[p++]; klen |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; } while (p < r->index_len);
const uint8_t *k = r->index + p; p += (size_t)klen;
// off
uint64_t off = 0; s = 0; do { uint8_t b = r->index[p++]; off |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; } while (p < r->index_len);
if (r->ents_n == cap) { cap *= 2; r->ents = (struct sst_index_entry*)realloc(r->ents, cap * sizeof *r->ents); }
r->ents[r->ents_n++] = (struct sst_index_entry){ .key = k, .klen = (uint32_t)klen, .off = off };
}
// Load filter
// Load filter region: [m_bits:varint][k:varint][bytes...]
if (lseek(r->fd, fo, SEEK_SET) < 0) goto fail;
uint8_t hdr[20]; ssize_t rd = read(r->fd, hdr, sizeof hdr); if (rd <= 0) goto fail;
size_t hp = 0; uint64_t m_bits = 0, kh = 0; int sft = 0;
do { if (hp >= (size_t)rd) goto fail; uint8_t b = hdr[hp++]; m_bits |= (uint64_t)(b & 0x7f) << sft; if (!(b & 0x80)) break; sft += 7; } while (1);
sft = 0;
do { if (hp >= (size_t)rd) goto fail; uint8_t b = hdr[hp++]; kh |= (uint64_t)(b & 0x7f) << sft; if (!(b & 0x80)) break; sft += 7; } while (1);
r->bf.m_bits = (uint32_t)m_bits; r->bf.k = (uint32_t)kh;
size_t bbytes = ((size_t)r->bf.m_bits + 7u) >> 3; r->bf.bits = (uint8_t*)malloc(bbytes);
if (!r->bf.bits) goto fail;
// The fixed 20-byte read may have overshot the two varints; reposition to the start of the bitmap
if (lseek(r->fd, fo + (off_t)hp, SEEK_SET) < 0) goto fail;
if ((size_t)read(r->fd, r->bf.bits, bbytes) != bbytes) goto fail;
return 0;
fail:
{ int err = errno; if (r->fd>=0) close(r->fd); free(r->index); free(r->ents); free(r->bf.bits); errno = err; return -1; }
}
static void sst_close(struct sst_reader *r) {
if (r->fd>=0) close(r->fd); free(r->index); free(r->ents); free(r->bf.bits); memset(r,0,sizeof*r);
}
Lookup using filter + sparse index
static int key_cmp_mem(const uint8_t *a, uint32_t alen, const uint8_t *b, uint32_t blen) {
size_t n = alen < blen ? alen : blen; int c = memcmp(a, b, n); if (c) return c; return (alen < blen) ? -1 : (alen > blen);
}
static int64_t index_floor(const struct sst_reader *r, const uint8_t *key, uint32_t klen) {
// binary search for largest idx s.t. ents[idx].key <= key
int64_t lo = 0, hi = (int64_t)r->ents_n - 1, ans = -1;
while (lo <= hi) {
int64_t mid = (lo + hi) >> 1; int c = key_cmp_mem(r->ents[mid].key, r->ents[mid].klen, key, klen);
if (c <= 0) { ans = mid; lo = mid + 1; } else { hi = mid - 1; }
}
return ans;
}
// Returns: 1=found PUT (value returned via out/outlen), 0=found DEL (tombstone), -1=not found, -2/-3 on error
int sst_get(struct sst_reader *r, const uint8_t *key, uint32_t klen, uint8_t **out, uint32_t *outlen) {
// 1) Bloom filter
if (!bloom_maybe_contains(&r->bf, key, klen)) return -1; // definitely not present
// 2) index floor
int64_t idx = index_floor(r, key, klen); off_t off = 8; /* data region starts after the 8-byte header magic */ if (idx >= 0) off = (off_t)r->ents[idx].off;
if (lseek(r->fd, off, SEEK_SET) < 0) return -2;
// 3) scan forward (bounded). Stop when we pass the key or hit EOF/next block boundary (not tracked here)
for (;;) {
// read [klen][vlen]
uint8_t hdr[20]; ssize_t n = read(r->fd, hdr, sizeof hdr); if (n <= 0) return -1; size_t p = 0;
// decode klen
uint64_t kl=0; int s=0; do { uint8_t b = hdr[p++]; kl |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; } while (p < (size_t)n);
// decode vlen
uint64_t vl=0; s=0; do { if (p >= (size_t)n) { return -2; } uint8_t b = hdr[p++]; vl |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; } while (p < (size_t)n);
// reposition to the start of the key (the fixed-size header read may have overshot), then read key, value, crc
if (lseek(r->fd, (off_t)p - (off_t)n, SEEK_CUR) < 0) return -2;
uint8_t *kbuf = (uint8_t*)malloc((size_t)kl); if (!kbuf) return -3; if ((size_t)read(r->fd, kbuf, (size_t)kl) != (size_t)kl) { free(kbuf); return -2; }
uint8_t *vbuf = (uint8_t*)malloc((size_t)vl); if (!vbuf && vl) { free(kbuf); return -3; }
if (vl && (size_t)read(r->fd, vbuf, (size_t)vl) != (size_t)vl) { free(kbuf); free(vbuf); return -2; }
uint32_t crc=0; if ((size_t)read(r->fd, &crc, sizeof crc) != sizeof crc) { free(kbuf); free(vbuf); return -2; }
// verify crc
uint8_t htmp[20]; uint8_t *hp = htmp; hp = u64_varint_encode(hp, kl); hp = u64_varint_encode(hp, vl);
uint32_t calc = 0; calc = crc32_update(calc, htmp, (size_t)(hp - htmp)); calc = crc32_update(calc, kbuf, (size_t)kl); calc = crc32_update(calc, vbuf, (size_t)vl);
if (calc != crc) { free(kbuf); free(vbuf); return -2; }
int c = key_cmp_mem(kbuf, (uint32_t)kl, key, klen);
if (c == 0) {
// tombstone if vl==0
if (vl == 0) { free(kbuf); free(vbuf); return 0; }
*out = vbuf; *outlen = (uint32_t)vl; free(kbuf); return 1;
} else if (c > 0) {
// passed the key
free(kbuf); free(vbuf); return -1;
}
free(kbuf); free(vbuf);
}
}
This reader is intentionally minimal. In a real build you’ll group records into blocks, add restart points for prefix compression, and bound scans to a block to keep worst-case latency tight.
Practical budgets and knobs
- Bits per key: start with 10 (≈1% FPR). If your negative lookups dominate and I/O is precious, spend up to 12–14.
- Index stride: 64–128 records per entry is a good default. If your values are small and tables are large, tighten the stride.
- WAL fsync: use group commit with a short timer (e.g., 1–5 ms) under load to amortize durability without lying (see the sketch after this list)
- Flush thresholds: cap by bytes in memtable (e.g., 64–256 MiB) and by number of entries to avoid pathological huge flushes.
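To make the group-commit knob concrete, here is a minimal pthread-based sketch (struct group_commit, gc_put, and the leader/follower protocol are illustrative, not an API from the text): writers append under the lock, then one of them volunteers to fsync() on behalf of everyone appended so far. A timed variant would have the leader wait a few milliseconds before syncing to gather a larger batch.
#include <pthread.h>
struct group_commit {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    struct wal     *wal;
    uint64_t        appended;   // sequence of the last appended record (starts at 0)
    uint64_t        durable;    // sequence covered by the last completed fsync
    bool            syncing;    // a leader is currently inside fsync()
};

static bool gc_put(struct group_commit *g, const void *k, size_t klen,
                   const void *v, size_t vlen) {
    pthread_mutex_lock(&g->mu);
    bool ok = wal_append_put(g->wal, k, klen, v, vlen);
    uint64_t my_seq = 0;
    if (ok) my_seq = ++g->appended;
    while (ok && g->durable < my_seq) {
        if (!g->syncing) {
            // Become the leader: sync everything appended so far
            g->syncing = true;
            uint64_t upto = g->appended;
            pthread_mutex_unlock(&g->mu);       // let others keep appending while we sync
            bool sok = wal_sync(g->wal);
            pthread_mutex_lock(&g->mu);
            g->syncing = false;
            if (sok) g->durable = upto; else ok = false;
            pthread_cond_broadcast(&g->cv);     // wake the batch we just covered
        } else {
            pthread_cond_wait(&g->cv, &g->mu);  // another thread will sync for us
        }
    }
    pthread_mutex_unlock(&g->mu);
    return ok;
}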
Compaction that actually reclaims space (k-way merge)
Compaction merges several sorted runs into one, dropping overwritten values and honoring tombstones. The core algorithm is a k-way merge with a min-heap on keys and a tie-breaker on “newness.”
Key rules:
- Among duplicate keys across inputs, keep only the newest record
- If the newest record is a tombstone (value length 0):
- Preserve it when compacting into a non-bottom level (so it can hide older values below)
- Drop it at the bottom-most level (space reclaim)
We represent “newness” with a per-file generation counter (monotonic increasing at flush/creation time). Newer generation wins ties.
Refinement: finalize with a footer, not an ad-hoc tail marker
Earlier we wrote an END\n marker. Because the reader locates the footer by expecting its magic in the last 8 bytes of the file, the flush path above already depends on nothing following that magic, so we refine the writer to rely solely on the footer magic at EOF. Updated end routine:
bool sst_end(struct sst_writer *w) {
// Footer + magic must already be written by the caller (flush/compaction path)
return fsync(w->fd) == 0;
}
SSTable iterator (sequential reader over data region)
struct sst_iter {
int fd; off_t off; off_t data_end;
uint8_t *k; uint32_t klen; uint8_t *v; uint32_t vlen;
int eof;
};
static int sst_iter_open(struct sst_iter *it, const char *path) {
memset(it, 0, sizeof *it); it->fd = open(path, O_RDONLY); if (it->fd < 0) return -1;
off_t io=0, fo=0; uint64_t il=0, fl=0; if (sst_load_footer(it->fd, &io, &il, &fo, &fl) != 0) { close(it->fd); return -1; }
it->off = 8; // after 8-byte header magic written by sst_begin
it->data_end = io; it->eof = it->off >= it->data_end; return 0;
}
static void sst_iter_close(struct sst_iter *it) {
if (it->fd>=0) close(it->fd); free(it->k); free(it->v); memset(it,0,sizeof*it);
}
static int sst_iter_next(struct sst_iter *it) {
if (it->eof) return 0; free(it->k); free(it->v); it->k = it->v = NULL; it->klen = it->vlen = 0;
if (lseek(it->fd, it->off, SEEK_SET) < 0) return -1;
uint8_t hdr[20]; ssize_t n = read(it->fd, hdr, sizeof hdr); if (n <= 0) { it->eof = 1; return 0; }
size_t p=0; uint64_t kl=0; int s=0; do { uint8_t b = hdr[p++]; kl |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s+=7; } while (p < (size_t)n);
uint64_t vl=0; s=0; do { uint8_t b = hdr[p++]; vl |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s+=7; } while (p < (size_t)n);
// reposition to just past the varint header (the fixed-size read above may have overshot)
if (lseek(it->fd, it->off + (off_t)p, SEEK_SET) < 0) return -1;
it->k = (uint8_t*)malloc((size_t)kl); it->v = (uint8_t*)(vl? malloc((size_t)vl) : NULL);
if (!it->k || (vl && !it->v)) return -1;
if ((size_t)read(it->fd, it->k, (size_t)kl) != (size_t)kl) return -1;
if (vl && (size_t)read(it->fd, it->v, (size_t)vl) != (size_t)vl) return -1;
uint32_t crc=0; if ((size_t)read(it->fd, &crc, sizeof crc) != sizeof crc) return -1;
uint8_t htmp[20]; uint8_t *hp = htmp; hp = u64_varint_encode(hp, kl); hp = u64_varint_encode(hp, vl);
uint32_t calc=0; calc = crc32_update(calc, htmp, (size_t)(hp - htmp)); calc = crc32_update(calc, it->k, (size_t)kl); calc = crc32_update(calc, it->v, (size_t)vl);
if (calc != crc) return -1;
it->klen = (uint32_t)kl; it->vlen = (uint32_t)vl;
it->off = lseek(it->fd, 0, SEEK_CUR);
if (it->off >= it->data_end) it->eof = 1; return 1;
}
K-way merge with heap and tombstones
struct heap_item { struct sst_iter *it; uint64_t gen; };
static int heap_less(const struct heap_item *a, const struct heap_item *b) {
int c = key_cmp_mem(a->it->k, a->it->klen, b->it->k, b->it->klen);
if (c != 0) return c < 0; // smaller key first
// Same key: newer (higher gen) first so we can drop older ones quickly
return a->gen > b->gen;
}
struct heap { struct heap_item *v; size_t n, cap; };
static void hswap(struct heap_item *a, struct heap_item *b){ struct heap_item t=*a; *a=*b; *b=t; }
static void hpush(struct heap *h, struct heap_item x){ if(h->n==h->cap){ h->cap=h->cap?h->cap*2:8; h->v=realloc(h->v,h->cap*sizeof*h->v); } size_t i=h->n++; h->v[i]=x; while(i&&heap_less(&h->v[i],&h->v[(i-1)/2])){ hswap(&h->v[i],&h->v[(i-1)/2]); i=(i-1)/2; }}
static int hpop(struct heap *h, struct heap_item *out){ if(!h->n) return 0; *out=h->v[0]; size_t i=0; h->v[0]=h->v[--h->n]; while(1){ size_t l=2*i+1,r=l+1,sm=i; if(l<h->n && heap_less(&h->v[l],&h->v[sm])) sm=l; if(r<h->n && heap_less(&h->v[r],&h->v[sm])) sm=r; if(sm==i) break; hswap(&h->v[i],&h->v[sm]); i=sm; } return 1; }
struct sst_input { const char *path; uint64_t gen; };
int compact_kway(struct sst_input *in, size_t nin,
const char *out_path, int bottom_level,
struct flush_cfg cfg) {
// Open iters and prime heap
struct sst_iter *its = calloc(nin, sizeof *its); struct heap hp = {0};
if (!its) return -1;
for (size_t i=0;i<nin;i++) {
if (sst_iter_open(&its[i], in[i].path) != 0) { for (size_t j=0;j<i;j++) sst_iter_close(&its[j]); free(its); free(hp.v); return -1; }
if (sst_iter_next(&its[i]) > 0) hpush(&hp, (struct heap_item){ .it=&its[i], .gen=in[i].gen });
}
// Output writer
int fd = open(out_path, O_CREAT|O_TRUNC|O_WRONLY, 0644);
if (fd < 0) { for (size_t i=0;i<nin;i++) sst_iter_close(&its[i]); free(its); free(hp.v); return -1; }
struct sst_writer w;
if (!sst_begin(&w, fd)) { close(fd); for (size_t i=0;i<nin;i++) sst_iter_close(&its[i]); free(its); free(hp.v); return -1; }
// Prepare index and filter as in flush
struct dynbuf index; db_init(&index);
// To size Bloom, we approximate n by sum of inputs (overestimates fine)
size_t approx_n = 0; for (size_t i=0;i<nin;i++) approx_n += 1024; // placeholder unless tracked per file
uint64_t m = (uint64_t)((double)(approx_n?approx_n:1) * (cfg.bits_per_key<=0?10.0:cfg.bits_per_key)); if (m < 8) m = 8;
struct bloom bf = { .m_bits=(uint32_t)m, .k=0, .bits=NULL };
bf.k = (uint32_t)((double)bf.m_bits / (double)(approx_n?approx_n:1) * 0.6931471805599453); if (bf.k==0) bf.k=1; if (bf.k>15) bf.k=15;
size_t bitbytes = ((size_t)bf.m_bits + 7u) >> 3; bf.bits = (uint8_t*)calloc(1, bitbytes);
uint64_t recno = 0; int rc = -1;
if (!bf.bits) goto out;
while (hp.n) {
// Pull smallest key; gather all with same key
struct heap_item top; hpop(&hp, &top);
struct sst_iter *lead = top.it; uint64_t lead_gen = top.gen;
// Consume and discard older dupes of the same key in heap
struct heap_item stash[32]; size_t sn=0;
while (hp.n) {
struct heap_item nxt = hp.v[0];
if (key_cmp_mem(lead->k, lead->klen, nxt.it->k, nxt.it->klen) != 0) break;
// same key: pop and stash (older than lead due to heap_less)
hpop(&hp, &nxt); if (sn<32) stash[sn++] = nxt; else if (sst_iter_next(nxt.it) > 0) hpush(&hp, nxt); // never strand an iterator if the stash fills
}
// Decide what to output based on lead
int is_tomb = (lead->vlen == 0);
if (!is_tomb || !bottom_level) {
// Index stride
if ((recno % (cfg.index_every ? cfg.index_every : 64)) == 0) {
off_t data_off = lseek(fd, 0, SEEK_CUR);
uint8_t tmp[20]; uint8_t *p = tmp; p = u64_varint_encode(p, lead->klen);
if (!db_put(&index, tmp, (size_t)(p - tmp))) goto out;
if (!db_put(&index, lead->k, lead->klen)) goto out;
p = tmp; p = u64_varint_encode(p, (uint64_t)data_off);
if (!db_put(&index, tmp, (size_t)(p - tmp))) goto out;
}
bloom_add(&bf, lead->k, lead->klen);
if (!sst_add(&w, lead->k, lead->klen, lead->v, lead->vlen)) goto out;
recno++;
}
// Advance lead iterator
if (sst_iter_next(lead) > 0) hpush(&hp, (struct heap_item){ .it=lead, .gen=lead_gen });
// Advance any stashed iters and push
for (size_t i=0;i<sn;i++) if (sst_iter_next(stash[i].it) > 0) hpush(&hp, stash[i]);
}
// Write index and filter, then footer+magic
off_t index_off = lseek(fd, 0, SEEK_CUR); if (!write_all(fd, index.p, index.len)) goto out;
uint64_t index_len = (uint64_t)index.len;
uint8_t hdr[20]; uint8_t *ph = hdr; ph = u64_varint_encode(ph, (uint64_t)bf.m_bits); ph = u64_varint_encode(ph, (uint64_t)bf.k);
off_t filter_off = lseek(fd, 0, SEEK_CUR);
if (!write_all(fd, hdr, (size_t)(ph - hdr))) goto out; if (!write_all(fd, bf.bits, bitbytes)) goto out;
uint64_t filter_len = (uint64_t)((ph - hdr) + bitbytes);
uint8_t ftr[64]; uint8_t *pf = ftr;
pf = u64_varint_encode(pf, (uint64_t)index_off);
pf = u64_varint_encode(pf, index_len);
pf = u64_varint_encode(pf, (uint64_t)filter_off);
pf = u64_varint_encode(pf, filter_len);
if (!write_all(fd, ftr, (size_t)(pf - ftr))) goto out; if (!write_all(fd, FOOTER_MAGIC, sizeof FOOTER_MAGIC)) goto out;
if (!sst_end(&w)) goto out;
rc = 0;
out:
free(hp.v); for (size_t i=0;i<nin;i++) sst_iter_close(&its[i]); free(its);
free(bf.bits); free(index.p); close(fd); return rc;
}
This merge is stable, memory-bounded, and linear in total input size plus a log k heap factor. It preserves tombstones above the bottom level, ensuring deletes remain effective across deeper runs.
Levels, manifest, and atomic installs
Compaction outputs must become visible atomically. A simple approach:
- Name files L<level>-<gen>.sst (e.g., L1-000123.sst)
- Maintain a MANIFEST text file listing active tables, one per line: L <level> <gen> <filename>
- On install: write a new manifest to a temp file (MANIFEST.tmp), fsync(), then rename() over the old MANIFEST and fsync() its directory
Readers open tables named in the current MANIFEST. On startup, load the manifest, open SSTables newest-to-oldest within a level, and compose the search order from L0 downward.
static int manifest_install(const char *dir, const char *new_manifest_tmp) {
// 1) fsync tmp manifest
int fd = open(new_manifest_tmp, O_RDONLY); if (fd<0) return -1; if (fsync(fd) != 0) { close(fd); return -1; } close(fd);
// 2) rename into place
char dst[1024]; snprintf(dst, sizeof dst, "%s/MANIFEST", dir);
if (rename(new_manifest_tmp, dst) != 0) return -1;
// 3) fsync directory for metadata durability
int dfd = open(dir, O_RDONLY); if (dfd<0) return -1; int rc = fsync(dfd); close(dfd); return rc;
}
Atomic rename plus directory fsync() is the portable pattern for metadata durability on POSIX systems.
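For completeness, a sketch of the matching loader (manifest_load and struct manifest_entry are our illustrative names), parsing the one-table-per-line format described above and skipping malformed lines:
struct manifest_entry { int level; uint64_t gen; char name[256]; };

// Parses "L <level> <gen> <filename>" lines into out[]; returns the entry count,
// or -1 if the manifest cannot be opened.
static int manifest_load(const char *dir, struct manifest_entry *out, int max_out) {
    char path[1024]; snprintf(path, sizeof path, "%s/MANIFEST", dir);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int n = 0; char line[512];
    while (n < max_out && fgets(line, sizeof line, f)) {
        int level; unsigned long long gen; char name[256];
        if (sscanf(line, "L %d %llu %255s", &level, &gen, name) == 3) {
            out[n].level = level; out[n].gen = (uint64_t)gen;
            snprintf(out[n].name, sizeof out[n].name, "%s", name);
            n++;
        }
    }
    fclose(f);
    return n;
}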
Background worker, triggers, and backpressure
You don’t compact on the foreground thread. A background worker:
- Watches level sizes and overlap; when a level exceeds its byte budget, picks a victim range and runs compact_kway
- Installs the output atomically, then deletes obsolete inputs
- Applies backpressure by slowing new flushes when L0 is too deep (e.g., > 8 files)
Trigger heuristics (leveled):
- Set per-level max bytes: L1 = 256 MiB, growing by ~10x per level
- When a level exceeds its budget, choose an input file and the overlapping files in the next level; compact that range only
- Keep L0 shallow by compacting aggressively into L1
Minimal loop sketch:
void compaction_daemon(void) {
for (;;) {
// evaluate level sizes & overlaps → pick a job or sleep
struct job j; if (!choose_job(&j)) { sleep_ms(50); continue; }
// run compaction to temp file
if (compact_kway(j.inputs, j.nin, j.tmp_out, j.bottom_level, j.cfg) == 0) {
// atomically install: new manifest with j.tmp_out added and inputs removed
install_compaction(&j);
// delete old files after install
gc_inputs(&j);
}
}
}
Backpressure rule of thumb: if L0 count exceeds a threshold, slow commits (e.g., delay group-commit timer or block new flushes briefly) to give the daemon time to catch up without unlimited memory growth.
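One possible scoring rule behind choose_job (struct level_stats and the budget math are illustrative, not the article's API): score L0 by file count because its tables overlap, score deeper levels by bytes over budget, and compact whichever level is most over the line.
struct level_stats { uint64_t bytes; int files; };

// Returns the level most in need of compaction, or -1 if everything is within budget.
static int pick_compaction_level(const struct level_stats *lv, int nlevels,
                                 uint64_t l1_budget_bytes, int l0_file_cap) {
    double best_score = 1.0; int best = -1;
    double s0 = (double)lv[0].files / (double)l0_file_cap;     // L0: depth, not bytes
    if (s0 > best_score) { best_score = s0; best = 0; }
    uint64_t budget = l1_budget_bytes;
    for (int i = 1; i < nlevels; i++) {
        double s = (double)lv[i].bytes / (double)budget;
        if (s > best_score) { best_score = s; best = i; }
        budget *= 10;                                          // ~10x growth per level
    }
    return best;
}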
Range queries, iterators, and multi-level reads
Point lookups are only half the story; you also need ordered scans. A production read path composes multiple iterators—memtable, L0 files (newest→oldest), then L1..Ln—and merges them with duplicate suppression and tombstone handling.
Core idea: a heap of iterators ordered by current key, with a “seen” tracker by key so only the newest value wins. The compaction heap we built is the same pattern.
struct level_reader { struct sst_reader rdr; /* plus bounds for range */ };
// Open N sources: 1 memtable iterator + M SSTable iterators
// Push first key from each into a heap; pop smallest key, suppress duplicates by preferring newer source order
// Tombstones: skip output but still suppress older versions from deeper levels
To bound file touches per lookup, keep L0 shallow and maintain non-overlapping ranges at L1+. For scans, stop early when keys pass the requested end.
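The same heap machinery from compaction can drive a merging scan. Here is a sketch (scan_tables and the emit callback are our names) that reuses sst_iter, the heap helpers, and key_cmp_mem from above; it emits the newest non-tombstone version of each key and stops once keys reach end_key. A fuller version would also seek each iterator to the range start via the sparse index instead of scanning from the beginning.
typedef void (*emit_fn)(const uint8_t *k, uint32_t klen,
                        const uint8_t *v, uint32_t vlen, void *ctx);

// Merging scan over several tables; `gen` encodes newness (higher wins), as in compaction.
static int scan_tables(struct sst_input *in, size_t nin,
                       const uint8_t *end_key, uint32_t end_klen,
                       emit_fn emit, void *ctx) {
    struct sst_iter *its = calloc(nin, sizeof *its);
    struct heap hp = {0};
    if (!its) return -1;
    for (size_t i = 0; i < nin; i++) {
        if (sst_iter_open(&its[i], in[i].path) != 0) continue;   // skip unreadable inputs
        if (sst_iter_next(&its[i]) > 0)
            hpush(&hp, (struct heap_item){ .it = &its[i], .gen = in[i].gen });
    }
    while (hp.n) {
        struct heap_item top; hpop(&hp, &top);
        struct sst_iter *lead = top.it;                          // newest version of this key
        if (end_key && key_cmp_mem(lead->k, lead->klen, end_key, end_klen) >= 0) break;
        // Discard older versions of the same key from other inputs, advancing their iterators
        while (hp.n && key_cmp_mem(lead->k, lead->klen, hp.v[0].it->k, hp.v[0].it->klen) == 0) {
            struct heap_item dup; hpop(&hp, &dup);
            if (sst_iter_next(dup.it) > 0) hpush(&hp, dup);
        }
        if (lead->vlen != 0)                                     // tombstones suppress, never emit
            emit(lead->k, lead->klen, lead->v, lead->vlen, ctx);
        if (sst_iter_next(lead) > 0) hpush(&hp, top);
    }
    for (size_t i = 0; i < nin; i++) sst_iter_close(&its[i]);
    free(its); free(hp.v);
    return 0;
}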
Prefix seeks and restart points
Block-level restart points (every K records, store an absolute key) make prefix compression practical while keeping seeks O(log B) per block. For simplicity we avoided block encoding earlier; adding restarts is a compatible optimization that reduces I/O and memory footprint for big tables.
Snapshots and read-your-own-writes
If you need consistent snapshots across concurrent reads and writes, tag records with a monotonically increasing sequence number and only expose records with seq <= snapshot_seq to a given reader. Two practical designs:
- Global sequence counter persisted in WAL; each PUT/DEL increments it
- Record format v2 adds [seq:varint] after kind
Backwards compatibility: version tables with a header byte or magic; readers choose the decoder by version.
// v2 record header (WAL and SSTable data blocks):
// [kind:1][seq:varint][key_len:varint][val_len:varint][key][value][crc32]
Snapshots keep reads stable while compactions run; compaction may drop a shadowed version only once no active snapshot can still see it. A simple epoch-based snapshot registry lets the compactor compute a safe minimum snapshot.
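A sketch of such a registry (the names and the fixed 64-slot table are illustrative; sequences are assumed to start at 1 so that 0 can mean "free slot"): readers pin the sequence they read at, and the compactor asks for the minimum still pinned.
#include <pthread.h>
struct snap_registry {
    pthread_mutex_t mu;
    uint64_t seqs[64];      // active snapshot sequence numbers (0 = free slot)
    uint64_t next_seq;      // latest committed sequence, advanced by the write path
};

// Reader side: pin the current sequence; returns a slot handle, or -1 if the table is full.
static int snap_acquire(struct snap_registry *r, uint64_t *out_seq) {
    pthread_mutex_lock(&r->mu);
    for (int i = 0; i < 64; i++) {
        if (r->seqs[i] == 0) { r->seqs[i] = *out_seq = r->next_seq; pthread_mutex_unlock(&r->mu); return i; }
    }
    pthread_mutex_unlock(&r->mu);
    return -1;
}

static void snap_release(struct snap_registry *r, int slot) {
    pthread_mutex_lock(&r->mu); r->seqs[slot] = 0; pthread_mutex_unlock(&r->mu);
}

// Compactor side: versions shadowed at or below this sequence are safe to drop.
static uint64_t snap_min_active(struct snap_registry *r) {
    pthread_mutex_lock(&r->mu);
    uint64_t min = r->next_seq;
    for (int i = 0; i < 64; i++)
        if (r->seqs[i] && r->seqs[i] < min) min = r->seqs[i];
    pthread_mutex_unlock(&r->mu);
    return min;
}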
TTL and tombstone garbage collection
Sometimes deletes are time-based. You can encode an optional TTL in the value metadata or as a separate attribute alongside the tombstone. During bottom-level compaction:
- Drop keys with expired TTLs
- Elide tombstones older than a retention window so space gets reclaimed deterministically
Be explicit about wall-clock vs monotonic time in tests to avoid flakiness.
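A sketch of the drop decision during a bottom-level compaction, assuming an expiry timestamp travels in the value metadata and tombstones record when they were written (the field names and the exact encoding are assumptions, not part of the formats above):
// Decide whether a record can be dropped outright at the bottom level.
// expiry_unix == 0 means "no TTL"; times are seconds since the epoch.
static int can_drop_at_bottom(int is_tombstone, uint64_t expiry_unix,
                              uint64_t written_unix, uint64_t now_unix,
                              uint64_t tombstone_retention_s) {
    if (is_tombstone)
        // Elide tombstones only once they are older than the retention window
        return now_unix - written_unix >= tombstone_retention_s;
    if (expiry_unix != 0 && expiry_unix <= now_unix)
        return 1;   // TTL expired
    return 0;       // live value: keep it
}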
Verification, checksums, and corruption handling
We already checksum each record; you can also:
- Add per-block checksums (one CRC per data block) to catch burst corruptions
- Append a file-level checksum in the footer over the data+index+filter regions
- On read, verify checksums only when a block is touched (lazy), and sample verification in background to detect bit-rot
Recovery strategy:
- If an SSTable is corrupt: quarantine the file, rebuild by compacting from higher levels if possible, and alert
- If WAL tail is corrupt: truncate to last valid record (CRC match) and continue
Benchmarking and tuning methodology
- Microbenchmarks: measure PUT throughput vs flush size, group-commit interval, and fsync cadence
- Read probes: random gets with adjustable negative/positive ratio to calibrate Bloom bits/key
- Scan benchmarks: sustained range scans with varying key sizes and value sizes
- Stress: overload L0 to validate backpressure and confirm compaction stabilizes
What to record:
- Write/read throughput and tail latencies (P50/P95/P99)
- Bloom FPR (observed), index hit rate, average bytes read per get
- Compaction bytes written/read, time per job, and queue length
- WAL sync interval, group size, and stall time due to backpressure
Observability you’ll actually use
Expose cheap counters and gauges; keep atomic increments relaxed and sample histograms in the app layer:
struct metrics {
uint64_t puts, dels, gets, scans;
uint64_t sstable_gets, sstable_skipped_by_bloom;
uint64_t wal_bytes, data_bytes, compaction_in_bytes, compaction_out_bytes;
uint64_t bloom_queries, bloom_negatives;
};
static inline void inc_relaxed(uint64_t *c) { __atomic_add_fetch(c, 1, __ATOMIC_RELAXED); }
Add periodic summaries and lightweight tracing around compaction jobs so production incidents come with context, not guesswork.
File format evolution without pain
Bake versioning into the file header and footer, and keep old readers working:
- Header magic: "SST1\0\0\0\0" + version byte
- Footer magic: "SSTFOOT1" + optional checksum
- Feature flags: bitfield indicating presence of restarts, block checksums, or seq numbers
Readers choose the slow-but-correct path for unknown flags. Writers only emit features after all readers can understand them.
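A sketch of the flag check on the read side (the flag names and SST_SUPPORTED_FEATURES mask are hypothetical): the caller either falls back to the conservative decoder for unknown flags or refuses the table when running strict.
// Hypothetical feature bits carried in the header/footer of newer format versions.
enum sst_feature {
    SST_FEAT_RESTARTS  = 1u << 0,   // block restart points
    SST_FEAT_BLOCK_CRC = 1u << 1,   // per-block checksums
    SST_FEAT_SEQNUM    = 1u << 2,   // v2 records carrying sequence numbers
};
#define SST_SUPPORTED_FEATURES (SST_FEAT_RESTARTS | SST_FEAT_BLOCK_CRC | SST_FEAT_SEQNUM)

// Returns nonzero if the table uses flags this reader does not understand.
static int sst_unknown_features(uint32_t flags) {
    return (flags & ~(uint32_t)SST_SUPPORTED_FEATURES) != 0;
}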
Production checklist (printable)
- WAL group commit with fsync budget; truncate to last valid record on crash
- Per-record CRCs; optional per-block and file checksums
- Bloom filters at all levels; bits/key tuned from observed FPR
- Sparse index sized for median seek budget
- L0 depth cap and backpressure on flush when exceeded
- Leveled compaction with per-level byte budgets and range selection
- Atomic manifest installs (rename + directory fsync)
- Observability: counters, compaction job logs, tail latencies
- Fault injection tests: power-cut, partial writes, torn WAL tails
Closing thoughts
Log-structured storage trades in-place updates for append-and-organize. With a disciplined WAL, sorted immutable runs, Bloom filters to say “no” early, and compaction to keep overlap in check, you get predictable write throughput and competitive read latencies—even as your dataset grows. The patterns here are small and boring by design; wire them together carefully, measure, and you’ll have a store that keeps working at 2 a.m. when it matters.