42 pages of how the storage engine actually works — block-by-block diagrams, the durability proof, the sharding model, and the failure modes we test in chaos engineering. Written for the engineer who'll be on call for this in production.
A first-principles walk through the storage layer, the I/O model, the sharding scheme, and the proofs behind a 99.999% durability SLA on commodity NVMe.
We're closed source — for reasons that are honest and uninteresting. But we shipped a 42-page architectural document so that no CTO ever has to take "trust us" for an answer. If your security-review team wants the engineer who wrote a section on a call, we'll send them. This is the document that makes that conversation possible.
The full case for WiscKey-style separation — index layout, VLog segment format, pointer encoding, and the production-hardening changes we made over the original paper.
How we structure the submission and completion queues per shard, why we use O_DIRECT, and the benchmarks comparing io_uring vs pread at our P99 targets.
Group-commit WAL, replication invariants, the actual probability calculus behind the published durability number, and the conditions under which it does not hold.
Why 1024, how we rebalance without read interruption, the operator's reshard procedure, and the cap on individual shard ownership during partition events.
The LIRS algorithm in detail, the priority-2 eviction-resistance tweak for small hot keys, and the measured hit-rate uplift on Zipfian workloads vs LRU.
The full failure-injection matrix — single-node loss, full-zone outage, NVMe bit-flip, kernel hang, network partition. Every test case, the expected behaviour, and the RTO/RPO numbers we hit.
Two pulled-out figures from the document — the write path, and the failure-injection matrix. The full PDF has 18 diagrams of this density.
The five-stage write pipeline. Each write lands in a per-shard ring buffer; a dedicated commit thread wakes every 50µs (or on batch-full), issues a single fsync, and unparks every blocked writer at once. The 80µs fsync cost is divided across the entire batch — at 2M ops/sec, that's roughly 0.85µs per write for full durability.
The full chapter covers the lock-free queue design, the futex-park behaviour, the WAL crash-recovery protocol, and a proof that the WAL never loses an acknowledged write under any single-node failure.
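To make the group-commit mechanism concrete, here is an added, simplified model of it in Python. It is not code from the engine or from the PDF: the class, the record format and the timing constants are assumptions chosen to mirror the description above (per-shard buffer, a commit thread that wakes on a timer or when the batch fills, one simulated fsync per batch, all parked writers released together). The `amortized_fsync_cost` helper is the plain ratio of fsync cost to batch size; it ignores queueing and wake-up overhead, so it gives 0.8µs for the page's numbers rather than the 0.85µs the document reports.

```python
import threading
import time
from collections import deque


class GroupCommitter:
    """Hypothetical group-commit loop: one fsync per batch, all writers unparked at once."""

    def __init__(self, wake_interval_s=50e-6, batch_limit=64, fsync_cost_s=80e-6):
        self.wake_interval_s = wake_interval_s   # timer tick of the commit thread
        self.batch_limit = batch_limit           # "batch-full" early wake threshold
        self.fsync_cost_s = fsync_cost_s         # simulated fsync latency
        self._pending = deque()                  # stands in for the per-shard ring buffer
        self._cv = threading.Condition()
        self._stop = False
        self.durable = []                        # records covered by a completed (simulated) fsync
        self.fsync_count = 0
        self.batch_sizes = []
        self._thread = threading.Thread(target=self._commit_loop, daemon=True)
        self._thread.start()

    def submit(self, record):
        """Writer side: enqueue, then park until the batch holding this record is fsynced."""
        done = threading.Event()
        with self._cv:
            self._pending.append((record, done))
            if len(self._pending) >= self.batch_limit:
                self._cv.notify()                # batch-full wakes the committer early
        done.wait()                              # park (futex-style) until acknowledged

    def _commit_loop(self):
        while True:
            with self._cv:
                # Sleep until the next tick unless a full batch or close() wakes us first.
                if len(self._pending) < self.batch_limit and not self._stop:
                    self._cv.wait(timeout=self.wake_interval_s)
                batch = list(self._pending)
                self._pending.clear()
                stop = self._stop
            if batch:
                self._fsync(batch)
            if stop:
                return

    def _fsync(self, batch):
        time.sleep(self.fsync_cost_s)            # one fsync paid once for the whole batch
        self.fsync_count += 1
        self.batch_sizes.append(len(batch))
        for record, done in batch:
            self.durable.append(record)
            done.set()                           # unpark every blocked writer at once

    def close(self):
        with self._cv:
            self._stop = True
            self._cv.notify()
        self._thread.join()


def amortized_fsync_cost(fsync_cost_s, wake_interval_s, ops_per_sec):
    """Per-write share of one fsync when a batch collects wake_interval_s worth of writes."""
    batch_size = ops_per_sec * wake_interval_s
    return fsync_cost_s / batch_size


def run_demo(n_writers=256, wake_interval_s=50e-6, batch_limit=64, fsync_cost_s=80e-6):
    """Drive the committer with concurrent writers; return durability and batching stats."""
    committer = GroupCommitter(wake_interval_s, batch_limit, fsync_cost_s)
    writers = [threading.Thread(target=committer.submit, args=(f"rec-{i}",))
               for i in range(n_writers)]
    for w in writers:
        w.start()
    for w in writers:
        w.join()                                 # every writer returned => acknowledged
    committer.close()
    return {
        "durable": sorted(committer.durable),
        "fsyncs": committer.fsync_count,
        "max_batch": max(committer.batch_sizes),
    }


if __name__ == "__main__":
    stats = run_demo()
    print(f"{len(stats['durable'])} writes acknowledged with {stats['fsyncs']} fsyncs "
          f"(largest batch {stats['max_batch']})")
    print(f"ratio model: {amortized_fsync_cost(80e-6, 50e-6, 2_000_000) * 1e6:.2f} us per write")
```

The invariant worth checking is the one the chapter proves: a writer's `submit` returns only after the fsync that covers its record has completed, so every acknowledged record is in `durable`. The batch count varies with scheduling, which is why the timer-and-threshold pair exists: the timer bounds latency for a lone writer, the threshold bounds batch size under load.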
The chaos-engineering matrix lives in chapter 7. We run each of these injections nightly against a production-equivalent cluster. The chapter documents the exact fault-injection spec, the expected user-observable behaviour, and the recovery time we measured over a 90-day window.
The point of publishing this isn't to say nothing ever fails — it's to say we know what does, and we test for it. When you have an incident, the failure mode you'll hit is almost certainly already in the matrix.
Every section, every figure number, every page. If a chapter is what you're after, we'll send you the chapter — you don't have to take the whole PDF.
A taste of what's in the chaos chapter — every one of these is exercised against a production-equivalent cluster every night, and the results are published to the same Prometheus we expose on customer clusters.
We kill -9 a healthy node at random under sustained 2M ops/sec load. Expected: reads continue from replica within 8 seconds, no acknowledged writes lost. Measured RTO: 8.4s P99.
iptables-drop all traffic from a zone for 30 minutes. Expected: traffic shifts to remaining zones, write quorum maintained on 2/3 zones. Measured RTO: 22s · zero data loss.
Inject a deliberate bit-flip into a sealed VLog segment. Expected: checksum trips on next read, value re-fetched from replica, bad segment quarantined. Detection < 1s.
Stall a CPU running the io_uring SQPOLL thread for 30s. Expected: liveness probe trips, the node is auto-ejected from the read pool, traffic re-routes to healthy nodes. Ejection in 12s.
Cut QUIC replication between two regions for 5 minutes. Expected: both regions continue serving local reads & writes, conflict resolution applies on heal. RPO < 30s, no acknowledged writes lost.
Force every shard into aggressive GC simultaneously while serving peak load. Expected: GC respects its cgroup weight, P99 reads stay under 8ms. Measured P99 during storm: 6.4ms.
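The bit-flip row above (checksum trips on next read, value re-fetched from replica, bad segment quarantined) is the one most easily shown in code. The following is an added example, not the engine's VLog code: the `VLOG` magic, the `[len][payload]` record layout and the CRC32 trailer are a constructed format for illustration, and `read_with_repair` is a hypothetical reader that models the detect-then-refetch path.

```python
import struct
import zlib

MAGIC = b"VLOG"   # constructed header for this example; not the real segment format


class SegmentCorrupt(Exception):
    """Raised when a sealed segment fails its integrity check."""


def seal_segment(records):
    """Serialize records as [u32 len][payload] and append a CRC32 trailer over the body."""
    body = b"".join(struct.pack("<I", len(r)) + r for r in records)
    trailer = struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)
    return MAGIC + body + trailer


def verify_segment(segment):
    """Return the body if header and CRC check out; raise SegmentCorrupt otherwise."""
    if len(segment) < len(MAGIC) + 4 or segment[:len(MAGIC)] != MAGIC:
        raise SegmentCorrupt("bad header")
    body, trailer = segment[len(MAGIC):-4], segment[-4:]
    expected = struct.unpack("<I", trailer)[0]
    actual = zlib.crc32(body) & 0xFFFFFFFF
    if actual != expected:
        raise SegmentCorrupt(f"crc mismatch: stored {expected:#010x}, computed {actual:#010x}")
    return body


def read_records(segment):
    """Verify the whole segment first, then parse the records out of the body."""
    body = verify_segment(segment)
    records, pos = [], 0
    while pos < len(body):
        (length,) = struct.unpack_from("<I", body, pos)
        pos += 4
        if pos + length > len(body):
            raise SegmentCorrupt("record runs past end of body")
        records.append(body[pos:pos + length])
        pos += length
    return records


def flip_bit(segment, bit_index):
    """Fault injection: return a copy of segment with one bit inverted."""
    buf = bytearray(segment)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def read_with_repair(local_segment, replica_segment, quarantine):
    """Serve from local; on a checksum trip, quarantine the local copy and serve from the replica.

    Returns (records, source) where source is "local" or "replica".
    """
    try:
        return read_records(local_segment), "local"
    except SegmentCorrupt as err:
        quarantine.append({"reason": str(err), "size": len(local_segment)})
        return read_records(replica_segment), "replica"


if __name__ == "__main__":
    good = seal_segment([b"alpha", b"bravo"])
    bad = flip_bit(good, 40)            # flips a bit inside the first length field
    q = []
    records, source = read_with_repair(bad, good, q)
    print(source, records, q)          # replica [b'alpha', b'bravo'] [{'reason': 'crc mismatch: ...', ...}]
```

CRC32 catches every single-bit error and almost all random multi-bit corruption, which is exactly what the NVMe bit-flip injection exercises. It is not a tamper check: anyone who can rewrite the body can recompute the trailer, so where the threat model includes modification of data at rest rather than silent corruption, the integrity tag would need to be a keyed MAC or a signature rather than a plain checksum.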
One email gets you the 42-page PDF — including the chapter you skipped. If you want a specific chapter, mention it in the body and we'll send just that.