Key-value separation & why compaction shouldn't touch values

The problem

Compaction is the silent tax you pay on every read.

Classical LSM trees — the storage engine inside RocksDB, Cassandra, LevelDB, and a dozen others — rewrite your values during compaction. That rewrite is what makes their P99 graph look like a sawblade.

Here's the shape of the problem. An LSM tree absorbs writes into a fast in-memory level (the memtable), then flushes that level to disk as an immutable SSTable. To keep read amplification bounded, the engine periodically merges SSTables — that's compaction. And compaction has to rewrite every byte that lives in the SSTables being merged: keys and values.

If your values are 128 bytes and your keys are 16 bytes, every byte of key that compaction rewrites drags 8 bytes of unchanged value along with it: an 8× rewrite tax on data that didn't change. Measured end to end across the lifetime of a value, this overhead is called write amplification, and it sits between 6× and 12× for tuned RocksDB and Cassandra deployments. It's the reason teams burn through NVMe every few months, and the reason your P99 spikes at the worst possible moment.
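
To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. The function name and the assumption that each compaction pass copies the whole record are ours, for illustration only.

```python
def rewrite_cost(key_bytes: int, value_bytes: int, passes: int = 1) -> dict:
    """Bytes rewritten for one record when compaction copies it `passes` times."""
    key_total = key_bytes * passes
    value_total = value_bytes * passes
    return {
        "key_bytes": key_total,
        "value_bytes": value_total,
        # How many value bytes are rewritten per key byte.
        "value_to_key_ratio": value_total / key_total,
        # Fraction of all rewritten bytes that are value bytes.
        "value_share": value_total / (key_total + value_total),
    }


cost = rewrite_cost(key_bytes=16, value_bytes=128)
print(cost["value_to_key_ratio"])      # 8.0
print(round(cost["value_share"], 3))   # 0.889
```

In other words, with these sizes roughly 89% of the bytes a compaction moves are values that did not change.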

VeltrixDB makes one architectural decision that eliminates this entirely: we never put values inside the LSM. The index — which is what gets compacted — only ever sees 24-byte pointers.

Engine	Storage model	Write amplification	P99 stability
RocksDB	LSM, values in SSTables	6–10×	sawblade
Cassandra	LSM, values in SSTables	8–12×	sawblade
LevelDB	LSM, values in SSTables	10–15×	sawblade
VeltrixDB	KV separation · WiscKey-style	1.0×	flat

The benchmark we publish on the performance page measures write amplification of 1.0× sustained over a 60-minute window. That's not theoretical — that's the actual ratio of bytes_written_to_nvme over bytes_written_by_clients, scraped from /metrics.
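
As a sketch of how that ratio can be computed, the snippet below takes two samples of the counters named above and divides the deltas. The `Sample` type and the sample values are made up for illustration; only the two counter names come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    bytes_written_to_nvme: int
    bytes_written_by_clients: int


def write_amplification(start: Sample, end: Sample) -> float:
    """Device bytes written per client byte written between two samples."""
    client = end.bytes_written_by_clients - start.bytes_written_by_clients
    device = end.bytes_written_to_nvme - start.bytes_written_to_nvme
    if client <= 0:
        raise ValueError("no client writes in the window")
    return device / client


# Hypothetical counter values taken 60 minutes apart.
t0 = Sample(bytes_written_to_nvme=5_000_000_000,
            bytes_written_by_clients=5_000_000_000)
t1 = Sample(bytes_written_to_nvme=905_000_000_000,
            bytes_written_by_clients=905_000_000_000)
print(write_amplification(t0, t1))   # 1.0
```

Using deltas rather than raw counters keeps the ratio tied to the measurement window instead of the lifetime of the process.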

The idea

WiscKey, production-hardened.

In 2016, a paper out of Wisconsin proposed the separation we use today. We took the idea, kept its core insight, and rebuilt the engine end-to-end to make it survive in multi-cloud production.

The WiscKey paper (FAST '16) made one elegant observation: if you separate keys from values, only the keys have to participate in compaction. Values can live in an append-only log on disk — they're written once, never rewritten, and only touched again when garbage collection decides to reclaim space.

The original WiscKey paper proved it worked in a research setting. VeltrixDB takes the idea further — we re-architected it around three production realities the paper sidestepped:

Sharded clusters · the paper assumed a single node; we shard the key space across 1024 logical shards, each with its own VLog
Modern kernels · io_uring didn't exist when WiscKey was published; we exploit it heavily to drive down per-op overhead
Caching · the paper had a basic LRU; we replaced it with LIRS, which dramatically improves hit rate under skewed access

What lives in the index

Each VeltrixDB shard maintains an adaptive radix tree (ART) index — entirely in RAM. The leaves of the ART don't store values; they store 24-byte VLog pointers. The pointer encodes which VLog segment file the value lives in, the byte offset within that file, and the value's length.

Index entry layout · 24 bytes

struct · segment_id : u32 · offset : u64 · length : u32 · flags : u32 · checksum : u32
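
A minimal sketch of that layout using Python's `struct` module. The field order follows the layout above; the little-endian byte order and the choice of CRC32 over the first 20 bytes as the checksum are assumptions made for the example, not a statement about the on-disk format.

```python
import struct
import zlib

# Little-endian, no padding: u32 + u64 + u32 + u32 + u32 = 24 bytes.
POINTER = struct.Struct("<IQIII")
_BODY = struct.Struct("<IQII")   # every field except the checksum (20 bytes)


def pack_pointer(segment_id: int, offset: int, length: int, flags: int = 0) -> bytes:
    """Serialize a VLog pointer and append a CRC32 of the preceding fields."""
    body = _BODY.pack(segment_id, offset, length, flags)
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_pointer(raw: bytes) -> tuple:
    """Parse a 24-byte pointer, rejecting it if the checksum does not match."""
    segment_id, offset, length, flags, checksum = POINTER.unpack(raw)
    if zlib.crc32(raw[:_BODY.size]) != checksum:
        raise ValueError("pointer checksum mismatch")
    return segment_id, offset, length, flags


raw = pack_pointer(segment_id=7, offset=4096, length=128)
print(len(raw))              # 24
print(unpack_pointer(raw))   # (7, 4096, 128, 0)
```

One detail worth noting: a C-style struct with natural alignment would insert padding between a leading u32 and the u64 that follows it, so hitting exactly 24 bytes implies a packed or reordered in-memory layout.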

At 24 bytes each, the pointers for 1 billion keys consume ~24 GB of RAM per shard set, before counting the ART's inner nodes, which is why the honest fit check leads with the RAM budget. The trade-off is that compaction never has to do anything other than merge these tiny, fixed-size pointer records. That's why our compaction is cheap and our P99 stays flat.
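
The RAM figure is simple multiplication. A quick sketch (the function name is ours; it counts only the 24-byte pointers, not the tree's inner nodes or the keys themselves):

```python
POINTER_BYTES = 24


def pointer_ram_bytes(num_keys: int) -> int:
    """RAM taken by the leaf pointers alone."""
    return num_keys * POINTER_BYTES


ram = pointer_ram_bytes(1_000_000_000)
print(ram / 1e9)                 # 24.0  (decimal GB)
print(round(ram / 2**30, 2))     # 22.35 (GiB)
```

Treat the result as a floor when sizing a node, since the real index needs more than the pointers.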

Write path

Value lands on NVMe before ack.

A write isn't acknowledged until the value is durable on disk. We do that with a group-commit WAL fsync and O_DIRECT VLog appends. Those two steps dominate a write path that costs roughly 135 microseconds end to end.

When a write enters VeltrixDB it follows five stages. Each one has been tuned to within an inch of its life:

Write path · App → durable ack

01 · App → WAL ring · ~12 µs
02 · WAL fsync · group-commit batch · ~80 µs
03 · VLog append · O_DIRECT · ~30 µs
04 · Index pointer update · ART insert · ~8 µs
05 · Cache warm → ack · ~5 µs
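
Adding up the stage budgets gives the end-to-end figure. A small sketch using the approximate numbers from the five stages above (the stage labels are shortened for the example):

```python
STAGES_US = [
    ("App -> WAL ring", 12),
    ("WAL fsync (group-commit batch)", 80),
    ("VLog append (O_DIRECT)", 30),
    ("Index pointer update (ART insert)", 8),
    ("Cache warm -> ack", 5),
]

total = sum(us for _, us in STAGES_US)
# The two stages that actually wait on the device.
durable = sum(us for name, us in STAGES_US
              if name.startswith(("WAL fsync", "VLog append")))

print(total)                       # 135
print(durable)                     # 110
print(round(durable / total, 2))   # 0.81
```

About four fifths of the budget is spent waiting on NVMe, which is why the next sections focus on the WAL and the VLog.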

Group-commit WAL

The WAL ring is a lock-free queue per shard. Writers append their record and park on a futex. A dedicated commit thread wakes every 50µs (or earlier, if the batch fills), issues a single fsync, and wakes every parked writer at once.

This means the fsync cost (the floor of disk durability, typically 60-100 µs on a healthy NVMe) is amortized across every concurrent writer in the batch. Each writer still waits for its batch's fsync to complete; what shrinks is the device cost per write. At our test load of 2M ops/sec per node, the average batch contains ~95 writes. So 80 µs of fsync becomes ~0.84 µs per write.
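
The batching idea can be sketched in a few lines. This is a single-threaded model under stated assumptions: the class and method names are hypothetical, `commit()` stands in for the timer-driven commit thread, and there is no lock-free ring or futex parking here.

```python
import os
import tempfile


class GroupCommitWAL:
    """Buffers records and makes a whole batch durable with one fsync."""

    def __init__(self, path: str, max_batch: int = 128):
        self._f = open(path, "ab")
        self._pending = []
        self._max_batch = max_batch
        self.fsyncs = 0   # number of fsync calls issued
        self.acked = 0    # number of records made durable

    def append(self, record: bytes) -> None:
        self._pending.append(record)
        if len(self._pending) >= self._max_batch:
            self.commit()   # batch filled before the timer fired

    def commit(self) -> int:
        """Flush the current batch; returns how many writers can be woken."""
        if not self._pending:
            return 0
        self._f.write(b"".join(self._pending))
        self._f.flush()
        os.fsync(self._f.fileno())   # one fsync covers the whole batch
        self.fsyncs += 1
        n = len(self._pending)
        self.acked += n
        self._pending.clear()
        return n

    def close(self) -> None:
        self.commit()
        self._f.close()


with tempfile.TemporaryDirectory() as tmp:
    wal = GroupCommitWAL(os.path.join(tmp, "shard-0.wal"))
    for i in range(95):
        wal.append(b"record-%d\n" % i)
    wal.commit()
    print(wal.fsyncs, wal.acked)   # 1 95
    wal.close()

print(round(80 / 95, 2))           # 0.84 (µs of fsync per write)
```

The point of the sketch is the ratio: one fsync, 95 acknowledgements.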

VLog append

Once the WAL acknowledges, the value gets streamed into the VLog — an append-only segment file on NVMe, written with O_DIRECT | O_SYNC. Bypassing the page cache here is deliberate: the page cache adds non-determinism (eviction noise, dirty-page writeback storms) that we don't want anywhere near the latency-critical path.

Each VLog segment is 1 GB by default. When a segment fills, we close it for writes and the GC subsystem can later reclaim it. Segments are sealed with a checksum trailer and never modified after seal.
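
A simplified sketch of an append-only segment with a checksum trailer. It uses ordinary buffered file I/O rather than O_DIRECT, and the class name, the CRC32 trailer format, and the file name are assumptions for illustration, not the real segment format.

```python
import os
import struct
import tempfile
import zlib


class VLogSegment:
    """Append-only value log segment that becomes immutable once sealed."""

    def __init__(self, path: str, segment_id: int, capacity: int = 1 << 30):
        self.segment_id = segment_id
        self.capacity = capacity   # 1 GB by default, as in the text
        self.size = 0
        self.sealed = False
        self._crc = 0
        self._f = open(path, "w+b")

    def append(self, value: bytes) -> tuple:
        """Write a value at the tail; return (segment_id, offset, length)."""
        if self.sealed:
            raise RuntimeError("segment is sealed")
        if self.size + len(value) > self.capacity:
            raise BufferError("segment full")
        offset = self.size
        self._f.seek(offset)
        self._f.write(value)
        self._crc = zlib.crc32(value, self._crc)   # running checksum
        self.size += len(value)
        return (self.segment_id, offset, len(value))

    def seal(self) -> None:
        """Write the checksum trailer, sync, and refuse further appends."""
        self._f.seek(self.size)
        self._f.write(struct.pack("<I", self._crc))
        self._f.flush()
        os.fsync(self._f.fileno())
        self.sealed = True

    def read(self, offset: int, length: int) -> bytes:
        self._f.seek(offset)
        return self._f.read(length)

    def close(self) -> None:
        self._f.close()


with tempfile.TemporaryDirectory() as tmp:
    seg = VLogSegment(os.path.join(tmp, "000007.vlog"), segment_id=7, capacity=1024)
    p1 = seg.append(b"hello")
    p2 = seg.append(b"world!")
    seg.seal()
    print(p1, p2)                    # (7, 0, 5) (7, 5, 6)
    print(seg.read(p2[1], p2[2]))    # b'world!'
    seg.close()
```

The tuple returned by `append` carries the same three facts the index pointer needs: which segment, where in it, and how long.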

Index update

Finally, the 24-byte pointer is inserted into the ART. Writers take fine-grained per-node latches and publish their changes RCU-style, so concurrent inserts on different key prefixes don't block each other. Read traffic continues to see the previous version of the affected node until the insert is published.

Read path

io_uring SQPOLL — zero syscalls on the hot path.

When a read misses the cache and has to go to NVMe, we want exactly one thing to happen: the kernel hands the value back. No syscall trap. No context switch. No page cache interference.

The traditional read path on Linux costs you at minimum a syscall (read(2) or pread(2)), a kernel-mode trap, a context switch, and — under contention — a wait on the page cache. None of that is free. At the latency budgets we operate in, each layer is measurable.

io_uring SQPOLL mode

In SQPOLL mode, io_uring starts a kernel thread (which can be pinned to a CPU core) whose job is to poll a shared ring buffer for I/O requests. Our user-space code writes a submission entry into the ring; the kernel thread picks it up within microseconds, dispatches the I/O to the NVMe queue, and writes the completion entry back into the ring. As long as the poller stays awake (it sleeps after a configurable idle period and then needs one io_uring_enter call to wake it), our process never traps into kernel mode.

The benefit isn't just the saved syscall — it's that the entire latency floor drops. On our hardware, the difference between pread() and io_uring SQPOLL for a 128-byte NVMe read is around 4-7µs of wall-clock time, and far more under contention.

O_DIRECT — out of the page cache

We open the VLog with O_DIRECT. This tells the kernel to bypass the page cache entirely — the read goes from NVMe straight into our user-space buffer with no intermediate copy. Two consequences:

Deterministic latency · no page cache evictions, no surprise jitter from a noisy tenant doing a large sequential scan
Less RAM pressure · we choose what to cache; the kernel doesn't fight us by caching VLog blocks we'll never re-read
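
One practical consequence of O_DIRECT is alignment: offsets and lengths generally have to be multiples of the device's logical block size, so a 128-byte value is fetched as part of a whole block. A sketch of the window arithmetic, assuming a 4096-byte block (the function name and the block size are illustrative):

```python
def aligned_window(offset: int, length: int, block: int = 4096) -> tuple:
    """Return (read_start, read_size, skip) for a block-aligned read.

    `skip` is where the requested bytes begin inside the aligned buffer.
    """
    start = (offset // block) * block                 # round down
    end = -(-(offset + length) // block) * block      # round up
    return start, end - start, offset - start


print(aligned_window(offset=10_000, length=128))   # (8192, 4096, 1808)
print(aligned_window(offset=4_090, length=128))    # (0, 8192, 4090)
```

The second call shows a value that straddles a block boundary: the read grows to two blocks even though only 128 bytes are wanted.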

1024 parallel queues

Each shard owns its own io_uring instance and its own NVMe submission queue. On a 1024-shard cluster, that's 1024 dedicated, kernel-polled queues. From the kernel's vantage point we look less like a database and more like a parallel storage controller.

Syscalls / read · 0 · SQPOLL kernel thread polls
Parallel queues · 1024 · one per shard
P99 NVMe read · 4.9 ms · under sustained writes

The cache

LIRS — because LRU isn't smart enough.

94% of reads in steady state are served from RAM. The reason isn't more cache — it's a better eviction policy than LRU.

Most databases use LRU (Least Recently Used) for their block cache. LRU has one famous failure mode: a large cold scan walks through every block once, evicting the hot working set on its way through. Your "hot" cache is now full of cold data you'll never read again.

VeltrixDB uses LIRS (Low Inter-reference Recency Set). LIRS tracks blocks in two structures: a recency stack that identifies LIR blocks (proven hot) and a small queue of resident HIR blocks (probationary). A block has to demonstrate reuse before it gets promoted out of probation, which means a single-pass scan can't pollute the resident set.
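
The scan-pollution failure mode and the probation idea can be shown with a toy comparison. To be clear, the second class below is a deliberately simplified two-segment policy (closer to segmented LRU than to full LIRS, which ranks blocks by inter-reference recency). The class names, sizes, and workload are ours, for illustration.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key) -> bool:
        """Return True on a hit; on a miss, admit the key and evict the oldest."""
        if key in self.data:
            self.data.move_to_end(key)
            return True
        self.data[key] = True
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)
        return False


class ProbationCache:
    """New keys start on probation; a second access promotes them."""

    def __init__(self, capacity: int, probation_size: int):
        self.protected_cap = capacity - probation_size
        self.probation_cap = probation_size
        self.protected = OrderedDict()
        self.probation = OrderedDict()

    def access(self, key) -> bool:
        if key in self.protected:
            self.protected.move_to_end(key)
            return True
        if key in self.probation:
            # Reuse demonstrated: promote out of probation.
            del self.probation[key]
            self.protected[key] = True
            if len(self.protected) > self.protected_cap:
                demoted, _ = self.protected.popitem(last=False)
                self._admit(demoted)
            return True
        self._admit(key)
        return False

    def _admit(self, key) -> None:
        self.probation[key] = True
        if len(self.probation) > self.probation_cap:
            self.probation.popitem(last=False)


def hot_hits_after_scan(cache, hot: int = 80, scan: int = 1000) -> int:
    for k in range(hot):            # warm up: touch each hot key twice
        cache.access(("hot", k))
        cache.access(("hot", k))
    for k in range(scan):           # one-pass cold scan
        cache.access(("cold", k))
    return sum(cache.access(("hot", k)) for k in range(hot))


print(hot_hits_after_scan(LRUCache(100)))             # 0
print(hot_hits_after_scan(ProbationCache(100, 20)))   # 80
```

Under plain LRU the scan flushes every hot key; with a probation area the scan only churns through the 20 probationary slots and the 80 hot keys survive.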

Why this matters in production

Consider a session store: 90% of requests hit the last hour of sessions; 10% are background jobs reading older history. With LRU, the background jobs would constantly evict hot sessions and your cache hit rate would collapse to ~60%. With LIRS, the background blocks live in HIR, get evicted on their second non-use, and the hot set stays resident. Measured hit rate in our 1B-key benchmark: 94.2% with no warm-up babysitting.

One tuning knob. Small hot keys (≤ 256 B) get priority-2 eviction resistance — they stay in the LIR stack even during sustained cold scans. This is the feature that lets a fraud-scoring workload survive a midnight ETL job sharing the same cluster.

Garbage collection

GC runs on its own queue, never blocks reads.

Tombstones, deletes, and overwrites leave dead values in the VLog. We reclaim that space — but we do it on a completely separate I/O queue so user requests never wait behind cleanup.

VLog garbage collection runs in three modes, configurable per shard:

Lazy · pick the segment with the highest dead-byte ratio, copy live records forward, free the segment. Default mode.
Throttled · same as lazy but with a configurable IOPS ceiling — useful for cost-sensitive deployments on slower NVMe
Aggressive · target a specific free-space watermark; ramp up parallelism until the watermark is met
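
The lazy mode's core loop (pick the segment with the highest dead-byte ratio, copy live records forward, free the segment) can be sketched as follows. The `Segment` type, the in-memory record maps, and the index shape are simplifications invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    segment_id: int
    records: dict = field(default_factory=dict)    # key -> value bytes
    live_keys: set = field(default_factory=set)    # keys the index still points at

    def dead_ratio(self) -> float:
        total = sum(len(v) for v in self.records.values())
        if total == 0:
            return 0.0
        live = sum(len(v) for k, v in self.records.items() if k in self.live_keys)
        return (total - live) / total


def pick_victim(segments: list) -> Segment:
    """Lazy mode: choose the sealed segment with the most dead bytes by ratio."""
    return max(segments, key=lambda s: s.dead_ratio())


def collect(victim: Segment, active: Segment, index: dict) -> int:
    """Copy live records into `active`, repoint the index, free the victim."""
    moved = 0
    for key in sorted(victim.live_keys):
        value = victim.records[key]
        active.records[key] = value
        active.live_keys.add(key)
        index[key] = active.segment_id
        moved += len(value)
    victim.records.clear()
    victim.live_keys.clear()
    return moved


seg_a = Segment(1, {"a": b"x" * 100, "b": b"y" * 100}, {"a"})   # 50% dead
seg_b = Segment(2, {"c": b"z" * 100, "d": b"w" * 300}, {"c"})   # 75% dead
active = Segment(9)
index = {"a": 1, "c": 2}

victim = pick_victim([seg_a, seg_b])
moved = collect(victim, active, index)
print(victim.segment_id, moved, index)   # 2 100 {'a': 1, 'c': 9}
```

Only the 100 live bytes are copied; the 300 dead bytes are reclaimed without ever being read back into the write path.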

The crucial design decision: GC submits its I/O on a separate io_uring instance from the read path. Even if GC is moving gigabytes per second, the NVMe queue depth seen by user reads is unaffected. On Linux, this is enforced via cgroup v2 io.weight — GC's cgroup is given a 10× lower weight than the user-traffic cgroup, so on contention, reads always win.

This is the single biggest difference between our P99 graph and an LSM-based engine's. RocksDB compaction shares the same I/O scheduler as the read path — when it kicks off, reads queue behind it. VeltrixDB GC literally cannot interfere with reads. Watch a 60-minute window on the cluster dashboard and look for the missing sawtooth.

Putting it together

Four decisions, one outcome.

No single trick gets you to 4.9ms P99 on a 1-billion-key working set with values on disk. It's the compound effect of four design decisions that respect the boundary between user code and the kernel.

Each decision is independently boring. Together they are why the latency graph is flat:

Key-value separation · compaction touches 24-byte pointers, never the value blob
io_uring SQPOLL + O_DIRECT · the kernel never gets in the way of a hot-path read
LIRS cache · cold scans can't evict the hot set
Isolated GC I/O queue · cleanup is invisible to user latency

That's the entire engineering bet. Whether it's worth it for your workload depends on whether you live in the failure mode the design eliminates — sawtooth P99 from compaction interference. If you do, the 1B-key benchmark is the cheapest 18 minutes of reading you'll do this quarter.

Citation: WiscKey (Lu et al., FAST '16) · Linux io_uring docs · LIRS (Jiang & Zhang, SIGMETRICS '02)
