You need writes that stay fast even when your dataset grows, reads that don’t thrash disks, and a design simple enough that you can reason about it at 2 a.m. That’s the promise of log-structured storage: append now, organize later. We’ll build up a practical mental model, then shape it into concrete C code you can compile—covering append-only segments, crash consistency, a minimal SSTable format, compaction policy, and Bloom filters that say “don’t bother hitting disk for this key.”
We’ll proceed in small, production-flavored steps: durable write path first, an immutable on-disk table you can scan and binary-search, and the scaffolding for compaction and filters. The endgame is an LSM-ish store that can take punches: sustained ingest, predictable recovery, and bounded memory.
Why log-structured storage exists (and still wins)
Random in-place updates are a losing battle for modern storage. Drives (HDD and SSD) reward sequential writes and punish scattered updates. The log-structured answer is elegant:
- Append new facts to a log/segment—sequential, cache-friendly, fsync-able
- Periodically merge immutable segments to purge old versions and reclaim space
- Maintain small auxiliary indexes (per-segment) so reads remain cheap
Writes become predictable, and garbage collection (compaction) moves the mess off the hot path.
The bird’s-eye architecture
- Memtable (in-memory, ordered map)
- WAL (append-only redo log for durability)
- Immutable SSTables on disk (sorted segments)
- Compaction (tiered/leveled) to merge and discard overwritten keys
- Bloom filters to avoid unnecessary disk probes
High-level flow:
- Client PUT(k,v) → write WAL → insert into memtable → ack
- Memtable fills → flush to SSTable file (sorted) with index + filter → drop oldest WAL entries
- Background: compact multiple SSTables into fewer, larger ones; maintain ordering and drop shadowed keys
Minimal data model and invariants
We’ll use a byte-oriented KV interface:
- Keys: opaque byte strings (lexicographically ordered)
- Values: opaque byte strings
- MVCC-lite: a later PUT for the same key supersedes earlier versions at read time
- Deletes: represented by a tombstone record; compaction can elide dead keys
Correctness invariants we’ll keep front-and-center:
- Durability: a committed PUT survives crashes (WAL discipline)
- Atomicity at record granularity: partial writes are detectable and ignored during recovery
- Order: keys within an SSTable are strictly increasing
- Integrity: blocks and files carry checksums
The write path you can trust (WAL + memtable)
On every PUT/DELETE:
- Serialize an intent record and append to the WAL
- fsync() or group-commit within a bounded window
- Apply to the memtable (an ordered structure—skiplist, treap, or array of runs)
If you crash after the fsync but before the memtable update, recovery replays the WAL into a fresh memtable and you’re back where you intended to be.
Record shape (on the wire and in WAL)
We’ll start with a compact, self-describing record:
#include <stdint.h>
enum rec_kind : uint8_t { REC_PUT = 1, REC_DEL = 2 }; // fixed underlying type is C23; on older compilers use a plain enum and store the kind in a uint8_t
// Lengths are varints; all CRCs are little-endian u32 of the preceding bytes in the record
// [kind:1][key_len:varint][val_len:varint (0 for DEL)][key...][value...][crc32:4]
Varints keep small keys/values tight; CRC lets us reject torn/partial appends after a crash.
A tiny varint and CRC toolkit
#include <stddef.h>
#include <stdint.h>
static size_t u64_varint_len(uint64_t x) {
size_t n = 1; while (x >= 0x80) { x >>= 7; ++n; } return n; }
static uint8_t *u64_varint_encode(uint8_t *p, uint64_t x) {
while (x >= 0x80) { *p++ = (uint8_t)(x | 0x80); x >>= 7; } *p++ = (uint8_t)x; return p; }
// Placeholder CRC32 (poly 0xEDB88320). In production, prefer a vetted impl or XXH3 for speed
static uint32_t crc32_update(uint32_t crc, const void *buf, size_t len) {
static uint32_t T[256]; static int init;
if (!init) { for (uint32_t i=0;i<256;i++){ uint32_t c=i; for(int k=0;k<8;k++) c=c&1?0xEDB88320u^(c>>1):c>>1; T[i]=c; } init=1; }
crc = ~crc; const uint8_t *p = (const uint8_t*)buf;
for (size_t i=0;i<len;i++) crc = T[(crc ^ p[i]) & 0xffu] ^ (crc >> 8);
return ~crc;
}
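The toolkit above only encodes. Later snippets decode varints inline; as a reusable sketch of the inverse (u64_varint_decode is our name, not part of the original toolkit), the decoder mirrors the encoder:
static size_t u64_varint_decode(const uint8_t *p, size_t avail, uint64_t *out) {
    // Decodes a base-128 varint from at most `avail` bytes.
    // Returns bytes consumed, or 0 if the buffer ends mid-varint.
    uint64_t v = 0; int shift = 0;
    for (size_t i = 0; i < avail && shift < 64; i++) {
        uint8_t b = p[i];
        v |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80)) { *out = v; return i + 1; }
        shift += 7;
    }
    return 0; // truncated input
}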
WAL append: atomic enough to sleep at night
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
struct wal {
int fd;
};
static bool write_all(int fd, const void *buf, size_t len) {
const uint8_t *p = (const uint8_t*)buf; size_t r = len;
while (r) { ssize_t w = write(fd, p, r); if (w > 0) { p += (size_t)w; r -= (size_t)w; continue; }
if (w < 0 && errno == EINTR) continue; return false; }
return true;
}
bool wal_append_put(struct wal *w, const void *k, size_t klen, const void *v, size_t vlen) {
uint8_t hdr[1 + 10 + 10]; // kind + max varints for key/val
uint8_t *p = hdr; *p++ = (uint8_t)REC_PUT; p = u64_varint_encode(p, (uint64_t)klen); p = u64_varint_encode(p, (uint64_t)vlen);
uint32_t crc = 0; crc = crc32_update(crc, hdr, (size_t)(p - hdr));
crc = crc32_update(crc, k, klen); crc = crc32_update(crc, v, vlen);
if (!write_all(w->fd, hdr, (size_t)(p - hdr))) return false;
if (!write_all(w->fd, k, klen)) return false;
if (!write_all(w->fd, v, vlen)) return false;
if (!write_all(w->fd, &crc, sizeof crc)) return false;
return true;
}
bool wal_append_del(struct wal *w, const void *k, size_t klen) {
uint8_t hdr[1 + 10 + 10]; uint8_t *p = hdr; *p++ = (uint8_t)REC_DEL; p = u64_varint_encode(p, (uint64_t)klen); p = u64_varint_encode(p, 0);
uint32_t crc = 0; crc = crc32_update(crc, hdr, (size_t)(p - hdr)); crc = crc32_update(crc, k, klen);
if (!write_all(w->fd, hdr, (size_t)(p - hdr))) return false;
if (!write_all(w->fd, k, klen)) return false;
if (!write_all(w->fd, &crc, sizeof crc)) return false;
return true;
}
bool wal_sync(struct wal *w) { return fsync(w->fd) == 0; }
This is intentionally small and boring: varint headers, a CRC tail, and a strict fsync() to form a durability boundary. Recovery scans until a CRC mismatch and stops—the last partial record is ignored.
In-memory memtable: ordered, bounded, and flushable
The memtable needs three properties:
- Ordered by key (so we can flush as a sorted run)
- Efficient inserts (we’ll keep it simple to start)
- Bounded memory with a clear flush condition (bytes or count)
Production designs use skiplists or balanced trees. We’ll start with a sortable vector of entries: append on insert, mark size, and when a configured byte budget is hit, sort and flush to disk.
struct slice { const uint8_t *ptr; size_t len; };
struct entry { uint8_t kind; struct slice key; struct slice val; };
static int cmp_slice(const void *a, const void *b) {
const struct entry *ea = (const struct entry*)a, *eb = (const struct entry*)b;
size_t n = ea->key.len < eb->key.len ? ea->key.len : eb->key.len;
int c = memcmp(ea->key.ptr, eb->key.ptr, n); if (c) return c;
return (ea->key.len < eb->key.len) ? -1 : (ea->key.len > eb->key.len);
}
During flush we sort by key, drop intra-batch duplicates keeping only the newest entry for each key, and stream the result to an SSTable writer. One caveat: qsort() is not guaranteed to be stable, so entries should carry an insertion sequence number used as a tie-breaker when keys compare equal; otherwise “last occurrence after sorting” may not mean “newest write.”
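As a concrete sketch of that sortable vector (struct memtable, mt_put, and mt_should_flush are illustrative names, assuming the struct entry above and a caller that keeps key/value bytes alive until flush):
struct memtable {
    struct entry *ents; size_t n, cap;
    size_t bytes, byte_budget; // flush once bytes >= byte_budget
};

static int mt_put(struct memtable *mt, uint8_t kind,
                  const uint8_t *k, size_t klen, const uint8_t *v, size_t vlen) {
    if (mt->n == mt->cap) {
        size_t nc = mt->cap ? mt->cap * 2 : 256;
        struct entry *ne = (struct entry*)realloc(mt->ents, nc * sizeof *ne);
        if (!ne) return 0;
        mt->ents = ne; mt->cap = nc;
    }
    // Append in arrival order; sorting happens once, at flush time
    mt->ents[mt->n++] = (struct entry){ .kind = kind, .key = { k, klen }, .val = { v, vlen } };
    mt->bytes += klen + vlen;
    return 1;
}

static int mt_should_flush(const struct memtable *mt) {
    return mt->bytes >= mt->byte_budget;
}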
Immutable SSTable: one file, many parts
An SSTable is a single immutable file composed of:
- Header: magic, version, options (block size, checksum kind)
- Data blocks: sorted records grouped into blocks with restart points for prefix compression (we’ll stub the restarts for now)
- Per-block index: smallest key → file offset
- Bloom filter region: optional, built over all keys in the file
- Footer: fixed-size trailer with pointers to the index and filter regions + a file-level checksum
We’ll keep the first pass deliberately straightforward: one “block” that is the whole file for small flushes, a simple sparse index every N records, and a stub Bloom filter we’ll fill in later.
struct sst_writer {
int fd;
uint64_t num_records;
uint64_t index_every; // write an index entry every N records
// You’d also track offsets to write footer later
};
bool sst_begin(struct sst_writer *w, int fd) {
w->fd = fd; w->num_records = 0; w->index_every = 64; // default sparse index
static const char magic[8] = { 'S','S','T','1','\0','\0','\0','\0' };
return write_all(fd, magic, sizeof magic);
}
static bool write_kv_record(int fd, const uint8_t *k, size_t klen, const uint8_t *v, size_t vlen) {
uint8_t hdr[10+10]; uint8_t *p = hdr; p = u64_varint_encode(p, klen); p = u64_varint_encode(p, vlen);
uint32_t crc = 0; crc = crc32_update(crc, hdr, (size_t)(p - hdr));
crc = crc32_update(crc, k, klen); crc = crc32_update(crc, v, vlen);
if (!write_all(fd, hdr, (size_t)(p - hdr))) return false;
if (!write_all(fd, k, klen)) return false; if (!write_all(fd, v, vlen)) return false;
return write_all(fd, &crc, sizeof crc);
}
bool sst_add(struct sst_writer *w, const uint8_t *k, size_t klen, const uint8_t *v, size_t vlen) {
// For simplicity, we write raw sorted key/value records back-to-back with a CRC per record
if (!write_kv_record(w->fd, k, klen, v, vlen)) return false;
w->num_records++;
// In a real build, you’d emit an index entry every index_every records: [key prefix][file offset]
return true;
}
bool sst_end(struct sst_writer *w) {
// Footer (placeholder): in later sections we’ll append index/filter offsets and a file checksum
static const char end[4] = { 'E','N','D','\n' };
return write_all(w->fd, end, sizeof end) && fsync(w->fd) == 0;
}
This “minimal SSTable” is enough to support a clean flush path and a trivial reader: open, sanity-check the header, then sequentially scan (or binary search with a separate index file). We’ll refine it with a sparse index and filter once we have flushes working.
Read path sketch (single SSTable + memtable)
Reads consult the freshest layer first:
- Memtable: exact match or miss
- Recent SSTable(s): search newest to oldest; stop on first hit (or tombstone)
Even this simple two-tier approach avoids most disk I/O under write-heavy workloads because recent keys live in the memtable or the newest flushed table. Filters cut further once we add them.
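Here is a sketch of that two-tier lookup as a single function. It leans on the sst_reader/sst_get interfaces we build later in this section and on a hypothetical mt_get() memtable probe with the same return convention (1=hit, 0=tombstone, -1=miss):
// 1=value found, 0=deleted (tombstone), -1=not found, <-1 on I/O error
static int kv_get(struct memtable *mt, struct sst_reader *tables, size_t ntables,
                  const uint8_t *k, uint32_t klen, uint8_t **out, uint32_t *outlen) {
    int rc = mt_get(mt, k, klen, out, outlen);   // hypothetical in-memory probe
    if (rc >= 0) return rc;                      // hit or tombstone in the memtable
    for (size_t i = 0; i < ntables; i++) {       // tables ordered newest first
        rc = sst_get(&tables[i], k, klen, out, outlen);
        if (rc >= 0 || rc < -1) return rc;       // found, tombstone, or error
    }
    return -1;                                   // not present anywhere
}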
Durability discipline and crash recovery
Recovery is your “I meant that” button:
- On start, rebuild the memtable by scanning WAL files in order
- Stop on first malformed/CRC-failing record; truncate tail
- If a flush completed (SSTable present with valid footer), you can discard WAL segments older than that flush’s durable point
This ensures that your read path sees a consistent latest state (memtable + immutable SSTables), and that a crash never leaves half-applied updates.
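A minimal sketch of that replay loop, assuming the WAL (or one segment of it) fits in a buffer and reusing the record layout and CRC from the write path; wal_replay, apply_fn, and the u64_varint_decode helper from earlier are our illustrative names:
typedef void (*apply_fn)(uint8_t kind, const uint8_t *k, size_t klen,
                         const uint8_t *v, size_t vlen, void *ctx);

// Replays records until a torn/corrupt one; returns the length of the valid prefix.
static size_t wal_replay(const uint8_t *buf, size_t len, apply_fn apply, void *ctx) {
    size_t off = 0;
    while (off < len) {
        size_t p = off;
        uint8_t kind = buf[p++];
        uint64_t klen = 0, vlen = 0;
        size_t c = u64_varint_decode(buf + p, len - p, &klen); if (!c) break; p += c;
        c = u64_varint_decode(buf + p, len - p, &vlen);        if (!c) break; p += c;
        if (klen > len - p) break;                 // truncated tail
        if (vlen > len - p - klen) break;
        if (len - p - klen - vlen < 4) break;
        const uint8_t *k = buf + p, *v = buf + p + klen;
        uint32_t crc = 0;
        crc = crc32_update(crc, buf + off, p - off);   // kind + varints
        crc = crc32_update(crc, k, (size_t)klen);
        crc = crc32_update(crc, v, (size_t)vlen);
        uint32_t stored; memcpy(&stored, v + vlen, 4);
        if (crc != stored) break;                      // torn or corrupt record: stop here
        apply(kind, k, (size_t)klen, v, (size_t)vlen, ctx);
        off = p + (size_t)klen + (size_t)vlen + 4;
    }
    return off; // caller truncates the WAL to this offset
}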
On compaction (preview)
Compaction is the garbage collector of log-structured storage. We’ll detail strategies later, but the basic job is:
- Merge several sorted runs → produce a new, larger sorted run
- Drop overwritten values and tombstoned keys
- Maintain per-level size/overlap guarantees
Two popular strategies you should know:
- Tiered: accumulate multiple same-size runs, then merge them wholesale into the next tier
- Leveled: maintain small L0, then L1..Ln with bounded total bytes per level; compactions target overlapping key ranges to keep read amplification low
We'll pick leveled as the default and explain the exact heuristics with code later—after we've proven the write/flush/read loop.
Bloom filters (preview)
Filters reduce disk probes for negative lookups: before touching an SSTable, consult its Bloom filter. If the filter says “definitely not,” you skip the file.
We’ll implement:
- A bitset stored in the SSTable
- k independent hash functions derived from 64-bit mixing
- Parameterization by bits-per-key to balance false positives vs space
Rule of thumb: the optimal number of hash functions is k ≈ (m/n) * ln 2, where m is the number of bits and n the number of keys.
Milestone checklist (where we are now)
- WAL: append + CRC + fsync
- Memtable: ordered buffer with flush trigger
- SSTable writer: minimal sorted run with per-record checksums
- Recovery plan: scan WAL, rebuild memtable, ignore torn tails
Next, we’ll wire the flush path end-to-end (memtable → SSTable), add a sparse index for binary search, and layer in a working Bloom filter to cut misses. Then compaction policy and range merges to keep read amplification under control.
Wiring the flush path (memtable → SSTable)
A flush turns the in-memory batch into one immutable, sorted run on disk with a tiny index and an optional Bloom filter. Steps:
- Sort entries by key
- Deduplicate within the batch, keeping only the newest entry per key
- Stream records to the SSTable writer (PUT → key+value, DEL → tombstone with empty value)
- Every N records, capture an index entry: the current file offset and the corresponding key
- Add keys to the Bloom filter as you go
- After data, write the index region, then the filter region, then a footer with offsets
We’ll define a simple footer so a reader can find the index/filter quickly:
// Footer layout at end of file:
// [index_off:varint][index_len:varint][filter_off:varint][filter_len:varint][magic:8]
static const char FOOTER_MAGIC[8] = { 'S','S','T','F','O','O','T','1' };
Bloom filter: compact and fast
We’ll use double hashing to derive k indices from two 64-bit hashes. A simple FNV-1a 64 for h1 and a splitmix64-based variant for h2 keeps implementation small and portable.
#include <stdint.h>
#include <string.h>
static uint64_t fnv1a64(const void *buf, size_t len) {
const uint8_t *p = (const uint8_t*)buf; uint64_t h = 1469598103934665603ull;
for (size_t i=0;i<len;i++) { h ^= p[i]; h *= 1099511628211ull; }
return h;
}
static uint64_t splitmix64(uint64_t x) {
x += 0x9e3779b97f4a7c15ull; x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
x = (x ^ (x >> 27)) * 0x94d049bb133111ebull; return x ^ (x >> 31);
}
struct bloom { uint32_t m_bits; uint32_t k; uint8_t *bits; };
static void bloom_set(struct bloom *b, uint64_t i) {
i %= b->m_bits; size_t byte = (size_t)(i >> 3); uint8_t mask = (uint8_t)(1u << (i & 7));
b->bits[byte] |= mask;
}
static int bloom_get(const struct bloom *b, uint64_t i) {
i %= b->m_bits; size_t byte = (size_t)(i >> 3); uint8_t mask = (uint8_t)(1u << (i & 7));
return (b->bits[byte] & mask) != 0;
}
static void bloom_add(struct bloom *b, const void *key, size_t klen) {
uint64_t h1 = fnv1a64(key, klen); uint64_t h2 = splitmix64(h1);
for (uint32_t i=0;i<b->k;i++) bloom_set(b, h1 + i * h2);
}
static int bloom_maybe_contains(const struct bloom *b, const void *key, size_t klen) {
uint64_t h1 = fnv1a64(key, klen); uint64_t h2 = splitmix64(h1);
for (uint32_t i=0;i<b->k;i++) if (!bloom_get(b, h1 + i * h2)) return 0; return 1;
}
We size the filter by bits-per-key, m = ceil(n * bits_per_key), and choose k = max(1, round((m/n) * ln 2)). With the default of 10 bits per key that gives k ≈ 7 and roughly a 1% false-positive rate.
Sparse index format
We’ll append an index region that’s just a sequence of entries:
[key_len:varint][key_bytes...][data_offset:varint]
It’s simple, compact, and easy to binary search in memory after loading.
Putting it together: flush implementation
#include <fcntl.h>
#include <stdlib.h>
struct dynbuf { uint8_t *p; size_t len, cap; };
static void db_init(struct dynbuf *b){ b->p=NULL; b->len=0; b->cap=0; }
static int db_reserve(struct dynbuf *b, size_t add){ size_t need=b->len+add; if(need<=b->cap) return 1; size_t nc=b->cap?b->cap*2:1024; while(nc<need) nc*=2; uint8_t *np=(uint8_t*)realloc(b->p,nc); if(!np) return 0; b->p=np; b->cap=nc; return 1; }
static int db_put(struct dynbuf *b, const void *p, size_t n){ if(!db_reserve(b,n)) return 0; memcpy(b->p+b->len,p,n); b->len+=n; return 1; }
struct index_entry { uint64_t off; uint32_t klen; const uint8_t *k; };
struct flush_cfg { double bits_per_key; uint32_t index_every; };
static size_t dedupe_in_place(struct entry *a, size_t n) {
if (n == 0) return 0; size_t w = 0; // assumes 'a' is sorted by key
for (size_t i = 0; i < n; ) {
size_t j = i + 1; while (j < n && cmp_slice(&a[i], &a[j]) == 0) j++;
a[w++] = a[j-1]; i = j; // keep the last occurrence per key (only the newest if equal keys kept insertion order or carry a seq tie-breaker)
}
return w;
}
int flush_memtable_to_sstable(const char *path, struct entry *ents, size_t n, struct flush_cfg cfg) {
// 1) sort by key
qsort(ents, n, sizeof *ents, cmp_slice);
// 2) dedupe (keep newest per key)
n = dedupe_in_place(ents, n);
int fd = open(path, O_CREAT|O_TRUNC|O_WRONLY, 0644);
if (fd < 0) return -1;
struct sst_writer w; if (!sst_begin(&w, fd)) { close(fd); return -1; }
// Prepare filter
uint64_t m = (uint64_t) ((double)n * (cfg.bits_per_key <= 0 ? 10.0 : cfg.bits_per_key));
if (m < 8) m = 8; // at least one byte
struct bloom bf = { .m_bits = (uint32_t)m, .k = 0, .bits = NULL };
bf.k = (uint32_t)((double)bf.m_bits / (double)(n? n:1) * 0.6931471805599453); // ln 2
if (bf.k == 0) bf.k = 1; if (bf.k > 15) bf.k = 15;
size_t bitbytes = ((size_t)bf.m_bits + 7u) >> 3; bf.bits = (uint8_t*)calloc(1, bitbytes);
if (!bf.bits) { close(fd); return -1; }
struct dynbuf index; db_init(&index);
// 3–5) stream data, build index+filter
uint64_t recno = 0; off_t data_off;
for (size_t i = 0; i < n; i++) {
// Index every cfg.index_every records
if ((recno % (cfg.index_every ? cfg.index_every : 64)) == 0) {
data_off = lseek(fd, 0, SEEK_CUR);
// encode [key_len][key][off]
uint8_t tmp[20]; uint8_t *p = tmp; p = u64_varint_encode(p, ents[i].key.len);
if (!db_put(&index, tmp, (size_t)(p - tmp))) goto fail;
if (!db_put(&index, ents[i].key.ptr, ents[i].key.len)) goto fail;
p = tmp; p = u64_varint_encode(p, (uint64_t)data_off);
if (!db_put(&index, tmp, (size_t)(p - tmp))) goto fail;
}
// Add to filter (include tombstones to avoid false negatives for shadowing)
bloom_add(&bf, ents[i].key.ptr, ents[i].key.len);
// Write record (DEL → empty value as tombstone)
const uint8_t *vp = ents[i].kind == REC_DEL ? (const uint8_t*)"" : ents[i].val.ptr;
size_t vlen = ents[i].kind == REC_DEL ? 0 : ents[i].val.len;
if (!sst_add(&w, ents[i].key.ptr, ents[i].key.len, vp, vlen)) goto fail;
recno++;
}
// 6) write index region
off_t index_off = lseek(fd, 0, SEEK_CUR);
if (!write_all(fd, index.p, index.len)) goto fail;
uint64_t index_len = (uint64_t)index.len;
// write filter region (format: [m_bits:varint][k:varint][bytes...])
uint8_t hdr[20]; uint8_t *ph = hdr; ph = u64_varint_encode(ph, (uint64_t)bf.m_bits); ph = u64_varint_encode(ph, (uint64_t)bf.k);
off_t filter_off = lseek(fd, 0, SEEK_CUR);
if (!write_all(fd, hdr, (size_t)(ph - hdr))) goto fail;
if (!write_all(fd, bf.bits, bitbytes)) goto fail;
uint64_t filter_len = (uint64_t)((ph - hdr) + bitbytes);
// footer
uint8_t ftr[64]; uint8_t *pf = ftr;
pf = u64_varint_encode(pf, (uint64_t)index_off);
pf = u64_varint_encode(pf, index_len);
pf = u64_varint_encode(pf, (uint64_t)filter_off);
pf = u64_varint_encode(pf, filter_len);
// write footer chunks and magic
if (!write_all(fd, ftr, (size_t)(pf - ftr))) goto fail;
if (!write_all(fd, FOOTER_MAGIC, sizeof FOOTER_MAGIC)) goto fail;
if (!sst_end(&w)) goto fail; // relies on the refined sst_end below (fsync only): any extra end marker would push the footer magic away from EOF
free(bf.bits); free(index.p); close(fd); return 0;
fail:
{
int err = errno; free(bf.bits); free(index.p); close(fd); errno = err; return -1;
}
}
Notes:
- We include tombstoned keys in the filter to avoid false negatives that would allow an older value to resurface from lower levels.
- Sparse index granularity (index_every) is a trade-off: more entries → faster seeks, larger index
- Footer magic at the very end simplifies discovery: read the last 8 bytes, then backtrack over the varints
SSTable reader: filter → index → data
Reads avoid I/O in three stages:
- Consult the Bloom filter. If negative, skip the file.
- Binary search the in-memory index to find the largest offset ≤ target key.
- Seek to that offset, then scan forward until the key is found or the next index boundary is crossed.
Footer and region loading
struct sst_index_entry { const uint8_t *key; uint32_t klen; uint64_t off; };
struct sst_reader {
int fd; uint8_t *index; size_t index_len; struct sst_index_entry *ents; size_t ents_n;
struct bloom bf; // owns bf.bits
};
static int read_tail(int fd, off_t file_size, void *buf, size_t len) {
if (lseek(fd, file_size - (off_t)len, SEEK_SET) < 0) return -1;
return (int)read(fd, buf, len) == (int)len ? 0 : -1;
}
static int sst_load_footer(int fd, off_t *index_off, uint64_t *index_len, off_t *filter_off, uint64_t *filter_len) {
// Read magic
off_t end = lseek(fd, 0, SEEK_END); if (end < 16) return -1;
char magic[8]; if (read_tail(fd, end, magic, sizeof magic) < 0) return -1;
if (memcmp(magic, FOOTER_MAGIC, 8) != 0) return -1;
// Read backwards the preceding varints: we’ll seek near the end and parse forward with a small buffer.
size_t back = 64; if (end < (off_t)(back + 8)) back = (size_t)end - 8;
if (lseek(fd, end - 8 - (off_t)back, SEEK_SET) < 0) return -1;
uint8_t tmp[128]; ssize_t n = read(fd, tmp, back); if (n <= 0) return -1;
// The last four varints before the magic are index_off, index_len, filter_off, filter_len.
// For simplicity, parse forward keeping a shift queue of the last four decodes; this assumes the
// read window happens to start on a varint boundary (a fixed-width footer would make this exact).
uint64_t vals[4] = {0}; int vi = 0; size_t i = 0; while (i < (size_t)n) {
uint64_t v = 0; int s = 0; for (;;){ if (i >= (size_t)n) return -1; uint8_t b = tmp[i++]; v |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; }
// shift queue
vals[0] = vals[1]; vals[1] = vals[2]; vals[2] = vals[3]; vals[3] = v; vi++;
}
if (vi < 4) return -1;
// the shift queue already holds the last four decodes in order
*index_off = (off_t)vals[0]; *index_len = vals[1];
*filter_off = (off_t)vals[2]; *filter_len = vals[3];
return 0;
}
static int sst_open(struct sst_reader *r, const char *path) {
memset(r, 0, sizeof *r); r->fd = open(path, O_RDONLY); if (r->fd < 0) return -1;
off_t io=0, fo=0; uint64_t il=0, fl=0; if (sst_load_footer(r->fd, &io, &il, &fo, &fl) != 0) goto fail;
// Load index region into memory
r->index_len = (size_t)il; r->index = (uint8_t*)malloc(r->index_len); if (!r->index) goto fail;
if (lseek(r->fd, io, SEEK_SET) < 0) goto fail; if ((size_t)read(r->fd, r->index, r->index_len) != r->index_len) goto fail;
// Parse index into entries (pointers into r->index)
size_t cap = 16; r->ents = (struct sst_index_entry*)malloc(cap * sizeof *r->ents); r->ents_n = 0;
size_t p = 0; while (p < r->index_len) {
// key_len
uint64_t klen = 0; int s = 0; do { uint8_t b = r->index[p++]; klen |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; } while (p < r->index_len);
const uint8_t *k = r->index + p; p += (size_t)klen;
// off
uint64_t off = 0; s = 0; do { uint8_t b = r->index[p++]; off |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; } while (p < r->index_len);
if (r->ents_n == cap) { cap *= 2; r->ents = (struct sst_index_entry*)realloc(r->ents, cap * sizeof *r->ents); }
r->ents[r->ents_n++] = (struct sst_index_entry){ .key = k, .klen = (uint32_t)klen, .off = off };
}
// Load filter
// Load filter region: [m_bits:varint][k:varint][bytes...]
if (lseek(r->fd, fo, SEEK_SET) < 0) goto fail;
uint8_t hdr[20]; ssize_t rd = read(r->fd, hdr, sizeof hdr); if (rd <= 0) goto fail;
size_t hp = 0; uint64_t m_bits = 0, kh = 0; int sft = 0;
do { if (hp >= (size_t)rd) goto fail; uint8_t b = hdr[hp++]; m_bits |= (uint64_t)(b & 0x7f) << sft; if (!(b & 0x80)) break; sft += 7; } while (1);
sft = 0;
do { if (hp >= (size_t)rd) goto fail; uint8_t b = hdr[hp++]; kh |= (uint64_t)(b & 0x7f) << sft; if (!(b & 0x80)) break; sft += 7; } while (1);
r->bf.m_bits = (uint32_t)m_bits; r->bf.k = (uint32_t)kh;
size_t bbytes = ((size_t)r->bf.m_bits + 7u) >> 3; r->bf.bits = (uint8_t*)malloc(bbytes);
if (!r->bf.bits) goto fail;
// The fixed 20-byte read may have overshot the two varints; reposition to the start of the bitmap
if (lseek(r->fd, fo + (off_t)hp, SEEK_SET) < 0) goto fail;
if ((size_t)read(r->fd, r->bf.bits, bbytes) != bbytes) goto fail;
return 0;
fail:
{ int err = errno; if (r->fd>=0) close(r->fd); free(r->index); free(r->ents); free(r->bf.bits); errno = err; return -1; }
}
static void sst_close(struct sst_reader *r) {
if (r->fd>=0) close(r->fd); free(r->index); free(r->ents); free(r->bf.bits); memset(r,0,sizeof*r);
}
Lookup using filter + sparse index
static int key_cmp_mem(const uint8_t *a, uint32_t alen, const uint8_t *b, uint32_t blen) {
size_t n = alen < blen ? alen : blen; int c = memcmp(a, b, n); if (c) return c; return (alen < blen) ? -1 : (alen > blen);
}
static int64_t index_floor(const struct sst_reader *r, const uint8_t *key, uint32_t klen) {
// binary search for largest idx s.t. ents[idx].key <= key
int64_t lo = 0, hi = (int64_t)r->ents_n - 1, ans = -1;
while (lo <= hi) {
int64_t mid = (lo + hi) >> 1; int c = key_cmp_mem(r->ents[mid].key, r->ents[mid].klen, key, klen);
if (c <= 0) { ans = mid; lo = mid + 1; } else { hi = mid - 1; }
}
return ans;
}
// Returns: 1=found PUT (value returned via out/outlen), 0=found DEL (tombstone), -1=not found, -2/-3 on error
int sst_get(struct sst_reader *r, const uint8_t *key, uint32_t klen, uint8_t **out, uint32_t *outlen) {
// 1) Bloom filter
if (!bloom_maybe_contains(&r->bf, key, klen)) return -1; // definitely not present
// 2) index floor
int64_t idx = index_floor(r, key, klen); off_t off = 8; /* data region starts after the 8-byte header magic */ if (idx >= 0) off = (off_t)r->ents[idx].off;
if (lseek(r->fd, off, SEEK_SET) < 0) return -2;
// 3) scan forward (bounded). Stop when we pass the key or hit EOF/next block boundary (not tracked here)
for (;;) {
// read [klen][vlen]
uint8_t hdr[20]; ssize_t n = read(r->fd, hdr, sizeof hdr); if (n <= 0) return -1; size_t p = 0;
// decode klen
uint64_t kl=0; int s=0; do { uint8_t b = hdr[p++]; kl |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; } while (p < (size_t)n);
// decode vlen
uint64_t vl=0; s=0; do { if (p >= (size_t)n) { return -2; } uint8_t b = hdr[p++]; vl |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s += 7; } while (p < (size_t)n);
// reposition to the start of the key (the fixed-size header read may have overshot), then read key, value, crc
if (lseek(r->fd, (off_t)p - (off_t)n, SEEK_CUR) < 0) return -2;
uint8_t *kbuf = (uint8_t*)malloc((size_t)kl); if (!kbuf) return -3; if ((size_t)read(r->fd, kbuf, (size_t)kl) != (size_t)kl) { free(kbuf); return -2; }
uint8_t *vbuf = (uint8_t*)malloc((size_t)vl); if (!vbuf && vl) { free(kbuf); return -3; }
if (vl && (size_t)read(r->fd, vbuf, (size_t)vl) != (size_t)vl) { free(kbuf); free(vbuf); return -2; }
uint32_t crc=0; if ((size_t)read(r->fd, &crc, sizeof crc) != sizeof crc) { free(kbuf); free(vbuf); return -2; }
// verify crc
uint8_t htmp[20]; uint8_t *hp = htmp; hp = u64_varint_encode(hp, kl); hp = u64_varint_encode(hp, vl);
uint32_t calc = 0; calc = crc32_update(calc, htmp, (size_t)(hp - htmp)); calc = crc32_update(calc, kbuf, (size_t)kl); calc = crc32_update(calc, vbuf, (size_t)vl);
if (calc != crc) { free(kbuf); free(vbuf); return -2; }
int c = key_cmp_mem(kbuf, (uint32_t)kl, key, klen);
if (c == 0) {
// tombstone if vl==0
if (vl == 0) { free(kbuf); free(vbuf); return 0; }
*out = vbuf; *outlen = (uint32_t)vl; free(kbuf); return 1;
} else if (c > 0) {
// passed the key
free(kbuf); free(vbuf); return -1;
}
free(kbuf); free(vbuf);
}
}
This reader is intentionally minimal. In a real build you’ll group records into blocks, add restart points for prefix compression, and bound scans to a block to keep worst-case latency tight.
Practical budgets and knobs
- Bits per key: start with 10 (≈1% FPR). If your negative lookups dominate and I/O is precious, spend up to 12–14.
- Index stride: 64–128 records per entry is a good default. If your values are small and tables are large, tighten the stride.
- WAL fsync: use group commit with a short timer (e.g., 1–5 ms) under load to amortize durability without lying (see the sketch after this list)
- Flush thresholds: cap by bytes in memtable (e.g., 64–256 MiB) and by number of entries to avoid pathological huge flushes.
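To make the group-commit knob concrete, here is a minimal pthread-based sketch (struct group_commit, gc_put, and the leader/follower protocol are illustrative, not an API from the text): writers append under the lock, then one of them volunteers to fsync() on behalf of everyone appended so far. A timed variant would have the leader wait a few milliseconds before syncing to gather a larger batch.
#include <pthread.h>
struct group_commit {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    struct wal     *wal;
    uint64_t        appended;   // sequence of the last appended record (starts at 0)
    uint64_t        durable;    // sequence covered by the last completed fsync
    bool            syncing;    // a leader is currently inside fsync()
};

static bool gc_put(struct group_commit *g, const void *k, size_t klen,
                   const void *v, size_t vlen) {
    pthread_mutex_lock(&g->mu);
    bool ok = wal_append_put(g->wal, k, klen, v, vlen);
    uint64_t my_seq = 0;
    if (ok) my_seq = ++g->appended;
    while (ok && g->durable < my_seq) {
        if (!g->syncing) {
            // Become the leader: sync everything appended so far
            g->syncing = true;
            uint64_t upto = g->appended;
            pthread_mutex_unlock(&g->mu);       // let others keep appending while we sync
            bool sok = wal_sync(g->wal);
            pthread_mutex_lock(&g->mu);
            g->syncing = false;
            if (sok) g->durable = upto; else ok = false;
            pthread_cond_broadcast(&g->cv);     // wake the batch we just covered
        } else {
            pthread_cond_wait(&g->cv, &g->mu);  // another thread will sync for us
        }
    }
    pthread_mutex_unlock(&g->mu);
    return ok;
}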
Compaction that actually reclaims space (k-way merge)
Compaction merges several sorted runs into one, dropping overwritten values and honoring tombstones. The core algorithm is a k-way merge with a min-heap on keys and a tie-breaker on “newness.”
Key rules:
- Among duplicate keys across inputs, keep only the newest record
- If the newest record is a tombstone (value length 0):
- Preserve it when compacting into a non-bottom level (so it can hide older values below)
- Drop it at the bottom-most level (space reclaim)
We represent “newness” with a per-file generation counter (monotonic increasing at flush/creation time). Newer generation wins ties.
Refinement: finalize with a footer, not an ad-hoc tail marker
Earlier we wrote an END\n marker. Because the reader locates the footer by expecting its magic in the last 8 bytes of the file, the flush path above already depends on nothing following that magic, so we refine the writer to rely solely on the footer magic at EOF. Updated end routine:
bool sst_end(struct sst_writer *w) {
// Footer + magic must already be written by the caller (flush/compaction path)
return fsync(w->fd) == 0;
}
SSTable iterator (sequential reader over data region)
struct sst_iter {
int fd; off_t off; off_t data_end;
uint8_t *k; uint32_t klen; uint8_t *v; uint32_t vlen;
int eof;
};
static int sst_iter_open(struct sst_iter *it, const char *path) {
memset(it, 0, sizeof *it); it->fd = open(path, O_RDONLY); if (it->fd < 0) return -1;
off_t io=0, fo=0; uint64_t il=0, fl=0; if (sst_load_footer(it->fd, &io, &il, &fo, &fl) != 0) { close(it->fd); return -1; }
it->off = 8; // after 8-byte header magic written by sst_begin
it->data_end = io; it->eof = it->off >= it->data_end; return 0;
}
static void sst_iter_close(struct sst_iter *it) {
if (it->fd>=0) close(it->fd); free(it->k); free(it->v); memset(it,0,sizeof*it);
}
static int sst_iter_next(struct sst_iter *it) {
if (it->eof) return 0; free(it->k); free(it->v); it->k = it->v = NULL; it->klen = it->vlen = 0;
if (lseek(it->fd, it->off, SEEK_SET) < 0) return -1;
uint8_t hdr[20]; ssize_t n = read(it->fd, hdr, sizeof hdr); if (n <= 0) { it->eof = 1; return 0; }
size_t p=0; uint64_t kl=0; int s=0; do { uint8_t b = hdr[p++]; kl |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s+=7; } while (p < (size_t)n);
uint64_t vl=0; s=0; do { uint8_t b = hdr[p++]; vl |= (uint64_t)(b & 0x7f) << s; if (!(b & 0x80)) break; s+=7; } while (p < (size_t)n);
// reposition to just past the varint header (the fixed-size read above may have overshot)
if (lseek(it->fd, it->off + (off_t)p, SEEK_SET) < 0) return -1;
it->k = (uint8_t*)malloc((size_t)kl); it->v = (uint8_t*)(vl? malloc((size_t)vl) : NULL);
if (!it->k || (vl && !it->v)) return -1;
if ((size_t)read(it->fd, it->k, (size_t)kl) != (size_t)kl) return -1;
if (vl && (size_t)read(it->fd, it->v, (size_t)vl) != (size_t)vl) return -1;
uint32_t crc=0; if ((size_t)read(it->fd, &crc, sizeof crc) != sizeof crc) return -1;
uint8_t htmp[20]; uint8_t *hp = htmp; hp = u64_varint_encode(hp, kl); hp = u64_varint_encode(hp, vl);
uint32_t calc=0; calc = crc32_update(calc, htmp, (size_t)(hp - htmp)); calc = crc32_update(calc, it->k, (size_t)kl); calc = crc32_update(calc, it->v, (size_t)vl);
if (calc != crc) return -1;
it->klen = (uint32_t)kl; it->vlen = (uint32_t)vl;
it->off = lseek(it->fd, 0, SEEK_CUR);
if (it->off >= it->data_end) it->eof = 1; return 1;
}
K-way merge with heap and tombstones
struct heap_item { struct sst_iter *it; uint64_t gen; };
static int heap_less(const struct heap_item *a, const struct heap_item *b) {
int c = key_cmp_mem(a->it->k, a->it->klen, b->it->k, b->it->klen);
if (c != 0) return c < 0; // smaller key first
// Same key: newer (higher gen) first so we can drop older ones quickly
return a->gen > b->gen;
}
struct heap { struct heap_item *v; size_t n, cap; };
static void hswap(struct heap_item *a, struct heap_item *b){ struct heap_item t=*a; *a=*b; *b=t; }
static void hpush(struct heap *h, struct heap_item x){ if(h->n==h->cap){ h->cap=h->cap?h->cap*2:8; h->v=realloc(h->v,h->cap*sizeof*h->v); } size_t i=h->n++; h->v[i]=x; while(i&&heap_less(&h->v[i],&h->v[(i-1)/2])){ hswap(&h->v[i],&h->v[(i-1)/2]); i=(i-1)/2; }}
static int hpop(struct heap *h, struct heap_item *out){ if(!h->n) return 0; *out=h->v[0]; size_t i=0; h->v[0]=h->v[--h->n]; while(1){ size_t l=2*i+1,r=l+1,sm=i; if(l<h->n && heap_less(&h->v[l],&h->v[sm])) sm=l; if(r<h->n && heap_less(&h->v[r],&h->v[sm])) sm=r; if(sm==i) break; hswap(&h->v[i],&h->v[sm]); i=sm; } return 1; }
struct sst_input { const char *path; uint64_t gen; };
int compact_kway(struct sst_input *in, size_t nin,
const char *out_path, int bottom_level,
struct flush_cfg cfg) {
// Open iters and prime heap
struct sst_iter *its = calloc(nin, sizeof *its); struct heap hp = {0};
if (!its) return -1;
for (size_t i=0;i<nin;i++) {
if (sst_iter_open(&its[i], in[i].path) != 0) { for (size_t j=0;j<i;j++) sst_iter_close(&its[j]); free(its); free(hp.v); return -1; }
if (sst_iter_next(&its[i]) > 0) hpush(&hp, (struct heap_item){ .it=&its[i], .gen=in[i].gen });
}
// Output writer
int fd = open(out_path, O_CREAT|O_TRUNC|O_WRONLY, 0644);
if (fd < 0) { for (size_t i=0;i<nin;i++) sst_iter_close(&its[i]); free(its); free(hp.v); return -1; }
struct sst_writer w;
if (!sst_begin(&w, fd)) { close(fd); for (size_t i=0;i<nin;i++) sst_iter_close(&its[i]); free(its); free(hp.v); return -1; }
// Prepare index and filter as in flush
struct dynbuf index; db_init(&index);
// To size Bloom, we approximate n by sum of inputs (overestimates fine)
size_t approx_n = 0; for (size_t i=0;i<nin;i++) approx_n += 1024; // placeholder unless tracked per file
uint64_t m = (uint64_t)((double)(approx_n?approx_n:1) * (cfg.bits_per_key<=0?10.0:cfg.bits_per_key)); if (m < 8) m = 8;
struct bloom bf = { .m_bits=(uint32_t)m, .k=0, .bits=NULL };
bf.k = (uint32_t)((double)bf.m_bits / (double)(approx_n?approx_n:1) * 0.6931471805599453); if (bf.k==0) bf.k=1; if (bf.k>15) bf.k=15;
size_t bitbytes = ((size_t)bf.m_bits + 7u) >> 3; bf.bits = (uint8_t*)calloc(1, bitbytes);
uint64_t recno = 0; int rc = -1;
if (!bf.bits) goto out;
while (hp.n) {
// Pull smallest key; gather all with same key
struct heap_item top; hpop(&hp, &top);
struct sst_iter *lead = top.it; uint64_t lead_gen = top.gen;
// Consume and discard older dupes of the same key in heap
struct heap_item stash[32]; size_t sn=0;
while (hp.n) {
struct heap_item nxt = hp.v[0];
if (key_cmp_mem(lead->k, lead->klen, nxt.it->k, nxt.it->klen) != 0) break;
// same key: pop and stash (older than lead due to heap_less)
hpop(&hp, &nxt); if (sn<32) stash[sn++] = nxt; else if (sst_iter_next(nxt.it) > 0) hpush(&hp, nxt); // never strand an iterator if the stash fills
}
// Decide what to output based on lead
int is_tomb = (lead->vlen == 0);
if (!is_tomb || !bottom_level) {
// Index stride
if ((recno % (cfg.index_every ? cfg.index_every : 64)) == 0) {
off_t data_off = lseek(fd, 0, SEEK_CUR);
uint8_t tmp[20]; uint8_t *p = tmp; p = u64_varint_encode(p, lead->klen);
if (!db_put(&index, tmp, (size_t)(p - tmp))) goto out;
if (!db_put(&index, lead->k, lead->klen)) goto out;
p = tmp; p = u64_varint_encode(p, (uint64_t)data_off);
if (!db_put(&index, tmp, (size_t)(p - tmp))) goto out;
}
bloom_add(&bf, lead->k, lead->klen);
if (!sst_add(&w, lead->k, lead->klen, lead->v, lead->vlen)) goto out;
recno++;
}
// Advance lead iterator
if (sst_iter_next(lead) > 0) hpush(&hp, (struct heap_item){ .it=lead, .gen=lead_gen });
// Advance any stashed iters and push
for (size_t i=0;i<sn;i++) if (sst_iter_next(stash[i].it) > 0) hpush(&hp, stash[i]);
}
// Write index and filter, then footer+magic
off_t index_off = lseek(fd, 0, SEEK_CUR); if (!write_all(fd, index.p, index.len)) goto out;
uint64_t index_len = (uint64_t)index.len;
uint8_t hdr[20]; uint8_t *ph = hdr; ph = u64_varint_encode(ph, (uint64_t)bf.m_bits); ph = u64_varint_encode(ph, (uint64_t)bf.k);
off_t filter_off = lseek(fd, 0, SEEK_CUR);
if (!write_all(fd, hdr, (size_t)(ph - hdr))) goto out; if (!write_all(fd, bf.bits, bitbytes)) goto out;
uint64_t filter_len = (uint64_t)((ph - hdr) + bitbytes);
uint8_t ftr[64]; uint8_t *pf = ftr;
pf = u64_varint_encode(pf, (uint64_t)index_off);
pf = u64_varint_encode(pf, index_len);
pf = u64_varint_encode(pf, (uint64_t)filter_off);
pf = u64_varint_encode(pf, filter_len);
if (!write_all(fd, ftr, (size_t)(pf - ftr))) goto out; if (!write_all(fd, FOOTER_MAGIC, sizeof FOOTER_MAGIC)) goto out;
if (!sst_end(&w)) goto out;
rc = 0;
out:
free(hp.v); for (size_t i=0;i<nin;i++) sst_iter_close(&its[i]); free(its);
free(bf.bits); free(index.p); close(fd); return rc;
}
This merge is stable, memory-bounded, and linear in total input size plus a log k heap factor. It preserves tombstones above the bottom level, ensuring deletes remain effective across deeper runs.
Levels, manifest, and atomic installs
Compaction outputs must become visible atomically. A simple approach:
- Name files L<level>-<gen>.sst (e.g., L1-000123.sst)
- Maintain a MANIFEST text file listing active tables, one per line: L <level> <gen> <filename>
- On install: write a new manifest to a temp file (MANIFEST.tmp), fsync(), then rename() over the old MANIFEST and fsync() its directory
Readers open tables named in the current MANIFEST. On startup, load the manifest, open SSTables newest-to-oldest within a level, and compose the search order from L0 downward.
static int manifest_install(const char *dir, const char *new_manifest_tmp) {
// 1) fsync tmp manifest
int fd = open(new_manifest_tmp, O_RDONLY); if (fd<0) return -1; if (fsync(fd) != 0) { close(fd); return -1; } close(fd);
// 2) rename into place
char dst[1024]; snprintf(dst, sizeof dst, "%s/MANIFEST", dir);
if (rename(new_manifest_tmp, dst) != 0) return -1;
// 3) fsync directory for metadata durability
int dfd = open(dir, O_RDONLY); if (dfd<0) return -1; int rc = fsync(dfd); close(dfd); return rc;
}
Atomic rename plus directory fsync() is the portable pattern for metadata durability on POSIX systems.
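For completeness, a sketch of the matching loader (manifest_load and struct manifest_entry are our illustrative names), parsing the one-table-per-line format described above and skipping malformed lines:
struct manifest_entry { int level; uint64_t gen; char name[256]; };

// Parses "L <level> <gen> <filename>" lines into out[]; returns the entry count,
// or -1 if the manifest cannot be opened.
static int manifest_load(const char *dir, struct manifest_entry *out, int max_out) {
    char path[1024]; snprintf(path, sizeof path, "%s/MANIFEST", dir);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int n = 0; char line[512];
    while (n < max_out && fgets(line, sizeof line, f)) {
        int level; unsigned long long gen; char name[256];
        if (sscanf(line, "L %d %llu %255s", &level, &gen, name) == 3) {
            out[n].level = level; out[n].gen = (uint64_t)gen;
            snprintf(out[n].name, sizeof out[n].name, "%s", name);
            n++;
        }
    }
    fclose(f);
    return n;
}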
Background worker, triggers, and backpressure
You don’t compact on the foreground thread. A background worker:
- Watches level sizes and overlap; when a level exceeds its byte budget, picks a victim range and runs compact_kway
- Installs the output atomically, then deletes obsolete inputs
- Applies backpressure by slowing new flushes when L0 is too deep (e.g., > 8 files)
Trigger heuristics (leveled):
- Set per-level max bytes: L1 = 256 MiB, growing by ~10x per level
- When a level exceeds its budget, choose an input file and the overlapping files in the next level; compact that range only
- Keep L0 shallow by compacting aggressively into L1
Minimal loop sketch:
void compaction_daemon(void) {
for (;;) {
// evaluate level sizes & overlaps → pick a job or sleep
struct job j; if (!choose_job(&j)) { sleep_ms(50); continue; }
// run compaction to temp file
if (compact_kway(j.inputs, j.nin, j.tmp_out, j.bottom_level, j.cfg) == 0) {
// atomically install: new manifest with j.tmp_out added and inputs removed
install_compaction(&j);
// delete old files after install
gc_inputs(&j);
}
}
}
Backpressure rule of thumb: if L0 count exceeds a threshold, slow commits (e.g., delay group-commit timer or block new flushes briefly) to give the daemon time to catch up without unlimited memory growth.
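One possible scoring rule behind choose_job (struct level_stats and the budget math are illustrative, not the article's API): score L0 by file count because its tables overlap, score deeper levels by bytes over budget, and compact whichever level is most over the line.
struct level_stats { uint64_t bytes; int files; };

// Returns the level most in need of compaction, or -1 if everything is within budget.
static int pick_compaction_level(const struct level_stats *lv, int nlevels,
                                 uint64_t l1_budget_bytes, int l0_file_cap) {
    double best_score = 1.0; int best = -1;
    double s0 = (double)lv[0].files / (double)l0_file_cap;     // L0: depth, not bytes
    if (s0 > best_score) { best_score = s0; best = 0; }
    uint64_t budget = l1_budget_bytes;
    for (int i = 1; i < nlevels; i++) {
        double s = (double)lv[i].bytes / (double)budget;
        if (s > best_score) { best_score = s; best = i; }
        budget *= 10;                                          // ~10x growth per level
    }
    return best;
}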
Range queries, iterators, and multi-level reads
Point lookups are only half the story; you also need ordered scans. A production read path composes multiple iterators—memtable, L0 files (newest→oldest), then L1..Ln—and merges them with duplicate suppression and tombstone handling.
Core idea: a heap of iterators ordered by current key, with a “seen” tracker by key so only the newest value wins. The compaction heap we built is the same pattern.
struct level_reader { struct sst_reader rdr; /* plus bounds for range */ };
// Open N sources: 1 memtable iterator + M SSTable iterators
// Push first key from each into a heap; pop smallest key, suppress duplicates by preferring newer source order
// Tombstones: skip output but still suppress older versions from deeper levels
To bound file touches per lookup, keep L0 shallow and maintain non-overlapping ranges at L1+. For scans, stop early when keys pass the requested end.
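The same heap machinery from compaction can drive a merging scan. Here is a sketch (scan_tables and the emit callback are our names) that reuses sst_iter, the heap helpers, and key_cmp_mem from above; it emits the newest non-tombstone version of each key and stops once keys reach end_key. A fuller version would also seek each iterator to the range start via the sparse index instead of scanning from the beginning.
typedef void (*emit_fn)(const uint8_t *k, uint32_t klen,
                        const uint8_t *v, uint32_t vlen, void *ctx);

// Merging scan over several tables; `gen` encodes newness (higher wins), as in compaction.
static int scan_tables(struct sst_input *in, size_t nin,
                       const uint8_t *end_key, uint32_t end_klen,
                       emit_fn emit, void *ctx) {
    struct sst_iter *its = calloc(nin, sizeof *its);
    struct heap hp = {0};
    if (!its) return -1;
    for (size_t i = 0; i < nin; i++) {
        if (sst_iter_open(&its[i], in[i].path) != 0) continue;   // skip unreadable inputs
        if (sst_iter_next(&its[i]) > 0)
            hpush(&hp, (struct heap_item){ .it = &its[i], .gen = in[i].gen });
    }
    while (hp.n) {
        struct heap_item top; hpop(&hp, &top);
        struct sst_iter *lead = top.it;                          // newest version of this key
        if (end_key && key_cmp_mem(lead->k, lead->klen, end_key, end_klen) >= 0) break;
        // Discard older versions of the same key from other inputs, advancing their iterators
        while (hp.n && key_cmp_mem(lead->k, lead->klen, hp.v[0].it->k, hp.v[0].it->klen) == 0) {
            struct heap_item dup; hpop(&hp, &dup);
            if (sst_iter_next(dup.it) > 0) hpush(&hp, dup);
        }
        if (lead->vlen != 0)                                     // tombstones suppress, never emit
            emit(lead->k, lead->klen, lead->v, lead->vlen, ctx);
        if (sst_iter_next(lead) > 0) hpush(&hp, top);
    }
    for (size_t i = 0; i < nin; i++) sst_iter_close(&its[i]);
    free(its); free(hp.v);
    return 0;
}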
Prefix seeks and restart points
Block-level restart points (every K records, store an absolute key) make prefix compression practical while keeping seeks O(log B) per block. For simplicity we avoided block encoding earlier; adding restarts is a compatible optimization that reduces I/O and memory footprint for big tables.
Snapshots and read-your-own-writes
If you need consistent snapshots across concurrent reads and writes, tag records with a monotonically increasing sequence number and only expose records with seq <= snapshot_seq to a given reader. Two practical designs:
- Global sequence counter persisted in WAL; each PUT/DEL increments it
- Record format v2 adds [seq:varint] after kind
Backwards compatibility: version tables with a header byte or magic; readers choose the decoder by version.
// v2 record header (WAL and SSTable data blocks):
// [kind:1][seq:varint][key_len:varint][val_len:varint][key][value][crc32]
Snapshots keep reads stable while compactions run; compaction may drop a shadowed version only once no active snapshot can still see it. A simple epoch-based snapshot registry lets the compactor compute a safe minimum snapshot.
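A sketch of such a registry (the names and the fixed 64-slot table are illustrative; sequences are assumed to start at 1 so that 0 can mean "free slot"): readers pin the sequence they read at, and the compactor asks for the minimum still pinned.
#include <pthread.h>
struct snap_registry {
    pthread_mutex_t mu;
    uint64_t seqs[64];      // active snapshot sequence numbers (0 = free slot)
    uint64_t next_seq;      // latest committed sequence, advanced by the write path
};

// Reader side: pin the current sequence; returns a slot handle, or -1 if the table is full.
static int snap_acquire(struct snap_registry *r, uint64_t *out_seq) {
    pthread_mutex_lock(&r->mu);
    for (int i = 0; i < 64; i++) {
        if (r->seqs[i] == 0) { r->seqs[i] = *out_seq = r->next_seq; pthread_mutex_unlock(&r->mu); return i; }
    }
    pthread_mutex_unlock(&r->mu);
    return -1;
}

static void snap_release(struct snap_registry *r, int slot) {
    pthread_mutex_lock(&r->mu); r->seqs[slot] = 0; pthread_mutex_unlock(&r->mu);
}

// Compactor side: versions shadowed at or below this sequence are safe to drop.
static uint64_t snap_min_active(struct snap_registry *r) {
    pthread_mutex_lock(&r->mu);
    uint64_t min = r->next_seq;
    for (int i = 0; i < 64; i++)
        if (r->seqs[i] && r->seqs[i] < min) min = r->seqs[i];
    pthread_mutex_unlock(&r->mu);
    return min;
}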
TTL and tombstone garbage collection
Sometimes deletes are time-based. You can encode an optional TTL in the value metadata or as a separate attribute alongside the tombstone. During bottom-level compaction:
- Drop keys with expired TTLs
- Elide tombstones older than a retention window so space gets reclaimed deterministically
Be explicit about wall-clock vs monotonic time in tests to avoid flakiness.
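A sketch of the drop decision during a bottom-level compaction, assuming an expiry timestamp travels in the value metadata and tombstones record when they were written (the field names and the exact encoding are assumptions, not part of the formats above):
// Decide whether a record can be dropped outright at the bottom level.
// expiry_unix == 0 means "no TTL"; times are seconds since the epoch.
static int can_drop_at_bottom(int is_tombstone, uint64_t expiry_unix,
                              uint64_t written_unix, uint64_t now_unix,
                              uint64_t tombstone_retention_s) {
    if (is_tombstone)
        // Elide tombstones only once they are older than the retention window
        return now_unix - written_unix >= tombstone_retention_s;
    if (expiry_unix != 0 && expiry_unix <= now_unix)
        return 1;   // TTL expired
    return 0;       // live value: keep it
}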
Verification, checksums, and corruption handling
We already checksum each record; you can also:
- Add per-block checksums (one CRC per data block) to catch burst corruptions
- Append a file-level checksum in the footer over the data+index+filter regions
- On read, verify checksums only when a block is touched (lazy), and sample verification in background to detect bit-rot
Recovery strategy:
- If an SSTable is corrupt: quarantine the file, rebuild by compacting from higher levels if possible, and alert
- If WAL tail is corrupt: truncate to last valid record (CRC match) and continue
Benchmarking and tuning methodology
- Microbenchmarks: measure PUT throughput vs flush size, group-commit interval, and fsync cadence
- Read probes: random gets with adjustable negative/positive ratio to calibrate Bloom bits/key
- Scan benchmarks: sustained range scans with varying key sizes and value sizes
- Stress: overload L0 to validate backpressure and confirm compaction stabilizes
What to record:
- Write/read throughput and tail latencies (P50/P95/P99)
- Bloom FPR (observed), index hit rate, average bytes read per get
- Compaction bytes written/read, time per job, and queue length
- WAL sync interval, group size, and stall time due to backpressure
Observability you’ll actually use
Expose cheap counters and gauges; keep atomic increments relaxed and sample histograms in the app layer:
struct metrics {
uint64_t puts, dels, gets, scans;
uint64_t sstable_gets, sstable_skipped_by_bloom;
uint64_t wal_bytes, data_bytes, compaction_in_bytes, compaction_out_bytes;
uint64_t bloom_queries, bloom_negatives;
};
static inline void inc_relaxed(uint64_t *c) { __atomic_add_fetch(c, 1, __ATOMIC_RELAXED); }
Add periodic summaries and lightweight tracing around compaction jobs so production incidents come with context, not guesswork.
File format evolution without pain
Bake versioning into the file header and footer, and keep old readers working:
- Header magic: "SST1\0\0\0\0" + version byte
- Footer magic: "SSTFOOT1" + optional checksum
- Feature flags: bitfield indicating presence of restarts, block checksums, or seq numbers
Readers choose the slow-but-correct path for unknown flags. Writers only emit features after all readers can understand them.
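A sketch of the flag check on the read side (the flag names and SST_SUPPORTED_FEATURES mask are hypothetical): the caller either falls back to the conservative decoder for unknown flags or refuses the table when running strict.
// Hypothetical feature bits carried in the header/footer of newer format versions.
enum sst_feature {
    SST_FEAT_RESTARTS  = 1u << 0,   // block restart points
    SST_FEAT_BLOCK_CRC = 1u << 1,   // per-block checksums
    SST_FEAT_SEQNUM    = 1u << 2,   // v2 records carrying sequence numbers
};
#define SST_SUPPORTED_FEATURES (SST_FEAT_RESTARTS | SST_FEAT_BLOCK_CRC | SST_FEAT_SEQNUM)

// Returns nonzero if the table uses flags this reader does not understand.
static int sst_unknown_features(uint32_t flags) {
    return (flags & ~(uint32_t)SST_SUPPORTED_FEATURES) != 0;
}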
Production checklist (printable)
- WAL group commit with fsync budget; truncate to last valid record on crash
- Per-record CRCs; optional per-block and file checksums
- Bloom filters at all levels; bits/key tuned from observed FPR
- Sparse index sized for median seek budget
- L0 depth cap and backpressure on flush when exceeded
- Leveled compaction with per-level byte budgets and range selection
- Atomic manifest installs (rename + directory fsync)
- Observability: counters, compaction job logs, tail latencies
- Fault injection tests: power-cut, partial writes, torn WAL tails
Closing thoughts
Log-structured storage trades in-place updates for append-and-organize. With a disciplined WAL, sorted immutable runs, Bloom filters to say “no” early, and compaction to keep overlap in check, you get predictable write throughput and competitive read latencies—even as your dataset grows. The patterns here are small and boring by design; wire them together carefully, measure, and you’ll have a store that keeps working at 2 a.m. when it matters.