VeltrixDB Technical Whitepaper — Architecture, Durability & Failure Modes

Why we wrote it

Open source, documented to the last byte.

VeltrixDB is open source under Apache 2.0 — every line is on GitHub. But reading 40k lines of Go isn't how most teams evaluate a database, so we also shipped this 42-page architectural document. It's the fastest way for a CTO or security-review team to understand exactly how the engine works — and the source is right there when they want to go deeper.

01 Storage engine

Key–value separation, end to end.

How VeltrixDB separates keys from values — index layout, VLog segment format, pointer encoding, and why compaction that never touches values is the only way to guarantee 1.0× write amplification at scale.

Chapter 1–2 · 14 pages
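As an illustration of the pointer-encoding idea, here is a minimal Go sketch of a fixed-width value pointer that an index entry could store in place of the value itself. The field names (`SegmentID`, `Offset`, `Length`) and the 16-byte little-endian layout are assumptions made for this example, not the VeltrixDB on-disk format, which the whitepaper documents.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// ValuePointer is a hypothetical index payload: it locates a value in the
// value log instead of embedding the value in the index.
type ValuePointer struct {
	SegmentID uint32 // which VLog segment holds the value (assumed field)
	Offset    uint64 // byte offset inside that segment (assumed field)
	Length    uint32 // value length in bytes (assumed field)
}

const pointerSize = 16 // 4 + 8 + 4 bytes, an assumed layout

// Encode packs the pointer into a fixed 16-byte little-endian record.
func (p ValuePointer) Encode() []byte {
	buf := make([]byte, pointerSize)
	binary.LittleEndian.PutUint32(buf[0:4], p.SegmentID)
	binary.LittleEndian.PutUint64(buf[4:12], p.Offset)
	binary.LittleEndian.PutUint32(buf[12:16], p.Length)
	return buf
}

// DecodePointer is the inverse of Encode and rejects records of the wrong size.
func DecodePointer(buf []byte) (ValuePointer, error) {
	if len(buf) != pointerSize {
		return ValuePointer{}, errors.New("value pointer must be 16 bytes")
	}
	return ValuePointer{
		SegmentID: binary.LittleEndian.Uint32(buf[0:4]),
		Offset:    binary.LittleEndian.Uint64(buf[4:12]),
		Length:    binary.LittleEndian.Uint32(buf[12:16]),
	}, nil
}

func main() {
	p := ValuePointer{SegmentID: 7, Offset: 4096, Length: 512}
	q, err := DecodePointer(p.Encode())
	fmt.Println(q == p, err)
}
```

Because the index only ever moves these small fixed-size records, a compaction pass over the index never has to rewrite the values they point to.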

02 Kernel I/O

io_uring SQPOLL on the hot path.

How we structure the submission and completion queues per shard, why we use O_DIRECT, and the benchmarks comparing io_uring vs pread at our P99 targets.

Chapter 3 · 6 pages

03 Durability

The fsync chain and the 99.999% SLA proof.

Group-commit WAL, replication invariants, the actual probability calculus behind the published durability number, and the conditions under which it does not hold.

Chapter 4 · 5 pages

04 Sharding & placement

1024 shards · consistent hashing · zone-aware.

Why 1024, how we rebalance without read interruption, the operator's reshard procedure, and the cap on individual shard ownership during partition events.

Chapter 5 · 6 pages
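A rough Go sketch of how a fixed shard count and a consistent-hash ring can fit together: keys hash to one of 1024 shards, and shards are placed on nodes through a ring of virtual points. The FNV-1a hash, the node names, and the virtual-node count are assumptions for illustration; zone-awareness and the ownership cap described in the chapter are left out.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const numShards = 1024 // shard count stated in the text

// hash64 uses FNV-1a; the choice of hash function is an assumption.
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// ShardFor maps a key to one of the 1024 shards.
func ShardFor(key string) int {
	return int(hash64(key) % numShards)
}

// Ring is a minimal consistent-hash ring that places shards on nodes.
type Ring struct {
	points []uint64          // sorted virtual-node positions
	owner  map[uint64]string // position -> node name
}

// NewRing places vnodes virtual points on the ring for every node.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint64]string)}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hash64(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// NodeFor returns the node that owns a shard: the first ring point at or
// after the shard's hash, wrapping around at the end of the ring.
func (r *Ring) NodeFor(shard int) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hash64(fmt.Sprintf("shard-%d", shard))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c"}, 64)
	s := ShardFor("user:42")
	fmt.Printf("key user:42 -> shard %d -> %s\n", s, ring.NodeFor(s))
}
```

The property that matters for rebalancing is that removing a node only moves the shards that node owned; every other shard keeps its owner.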

05 Cache (LIRS)

Why LRU lies — and what we replaced it with.

The LIRS algorithm in detail, the priority-2 eviction-resistance tweak for small hot keys, and the measured hit-rate uplift on Zipfian workloads vs LRU.

Chapter 6 · 4 pages

06 Failure modes

What we test in chaos engineering.

The full failure-injection matrix — single-node loss, full-zone outage, NVMe bit-flip, kernel hang, network partition. Every test case, the expected behaviour, and the RTO/RPO numbers we hit.

Chapter 7–8 · 7 pages

A sneak peek

The kind of detail you need before you commit.

Two pulled-out figures from the document — the write path, and the failure-injection matrix. The full PDF has 18 diagrams of this density.

Figure 03 · Write path

Group-commit WAL with amortized fsync.

The five-stage write pipeline. Application writes land in a per-shard ring buffer; a dedicated commit thread wakes every 50µs (or when the batch fills), issues a single fsync, and unparks every blocked writer at once. The 80µs fsync cost is divided across the entire batch: at 2M ops/sec, that's roughly 0.85µs per write for full durability.

The full chapter covers the lock-free queue design, the futex-park behaviour, the WAL crash-recovery protocol, and a proof that the WAL never loses an acknowledged write under any single-node failure.

FIG 03 · WRITE PATH · POST-ACK · P99 0.21ms

01 · 12µs · App → WAL ring
02 · 80µs · WAL fsync (group commit)
03 · 30µs · VLog append (O_DIRECT)
04 · 8µs · Index pointer (ART insert)
05 · 5µs · Cache warm → ack

Figure 14 · Chaos matrix

Every failure mode we test, and the SLO it owns.

The chaos-engineering matrix lives in chapter 7. We run each of these injections nightly against a production-equivalent cluster. The chapter documents the exact fault-injection spec, the expected user-observable behaviour, and the recovery time we measured over a 90-day window.

The point of publishing this isn't to say nothing ever fails — it's to say we know what does, and we test for it. When you have an incident, the failure mode you'll hit is almost certainly already in the matrix.

FIG 14 · FAILURE MATRIX · RTO < 30s

single node · kill -9 → RTO 8s
full AZ · network drop → RTO 22s
NVMe corrupt · bit-flip inject → detect <1s
kernel hang · CPU stall → eject 12s
cross-region · QUIC partition → RPO <30s

Table of contents

All 42 pages, in order.

Every section, every figure number, every page. If a chapter is what you're after, we'll send you the chapter — you don't have to take the whole PDF.

VeltrixDB · Technical Reference · v0.9.5

46 pages · 20 figures · 10 chapters

00 · Preface & notation · Audience, scope, and how to read this paper · p. 1
01 · The case for key-value separation · Why LSM compaction is the silent tail-latency killer · p. 4
02 · Storage engine internals · VLog segment format, ART index, garbage collection · p. 9
03 · Kernel I/O: io_uring SQPOLL + O_DIRECT · Submission queues, completion polling, syscall accounting · p. 16
04 · Durability & the 99.999% SLA proof · Group-commit WAL, replication invariants, probability calculus · p. 22
05 · Sharding & placement · 1024-shard consistent hash, zone-awareness, online reshard · p. 26
06 · LIRS cache & small-hot-key resistance · The eviction algorithm and the priority-2 tuning · p. 31
07 · Chaos engineering & failure modes · Fault-injection matrix, RTO/RPO numbers, runbooks · p. 34
08 · Security model · mTLS, AES-256, BYOK, key rotation, audit log · p. 38
09 · Engine improvements, 2026 Q2 · Polling I/O mode, cold-aware GC, dedicated compaction core, eBPF throttle, cloud tiering · p. 39
10 · Appendix: wire protocol & metrics reference · Binary protocol spec and the 50+ Prometheus metrics · p. 45

Chapter 7 · sample

Six failure modes we test nightly.

A taste of what's in the chaos chapter — every one of these is exercised against a production-equivalent cluster every night, and the results are published to the same Prometheus we expose on customer clusters.

Single-node loss

We kill -9 a healthy node at random under sustained 2M ops/sec load. Expected: reads continue from replica within 8 seconds, no acknowledged writes lost. Measured RTO: 8.4s P99.

Full availability-zone outage

iptables-drop all traffic from a zone for 30 minutes. Expected: traffic shifts to remaining zones, write quorum maintained on 2/3 zones. Measured RTO: 22s · zero data loss.
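The quorum condition in this test can be stated as a small Go predicate. It assumes a strict-majority rule, which gives 2 of 3 zones as in the text; the function name and the zone labels are hypothetical.

```go
package main

import "fmt"

// HasWriteQuorum reports whether the zones that acknowledged a write form a
// strict majority of all zones (an assumed rule: 2 of 3, 3 of 5, and so on).
func HasWriteQuorum(ackedZones map[string]bool, totalZones int) bool {
	acks := 0
	for _, ok := range ackedZones {
		if ok {
			acks++
		}
	}
	return acks >= totalZones/2+1
}

func main() {
	// zone-b is unreachable, as in the full-zone outage test.
	acks := map[string]bool{"zone-a": true, "zone-b": false, "zone-c": true}
	fmt.Println(HasWriteQuorum(acks, 3))
}
```

With one zone dropped, the two remaining zones still satisfy the predicate, so writes keep being acknowledged; losing a second zone would not.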

NVMe corruption

Inject a deliberate bit-flip into a sealed VLog segment. Expected: checksum trips on next read, value re-fetched from replica, bad segment quarantined. Detection < 1s.

Kernel CPU stall

Stall a CPU running the io_uring SQPOLL thread for 30s. Expected: liveness probe trips, the node is auto-ejected from the read pool, traffic re-routes to healthy nodes. Ejection in 12s.

Cross-region partition

Cut QUIC replication between two regions for 5 minutes. Expected: both regions continue serving local reads & writes, conflict resolution applies on heal. RPO < 30s, no acknowledged writes lost.

Compaction storm

Force every shard into aggressive GC simultaneously while serving peak load. Expected: GC respects its cgroup weight, P99 reads stay under 8ms. Measured P99 during storm: 6.4ms.

Chapter 9 · sample

Five features, shipped in Q2 2026.

Chapter 9 documents five features that shipped together in the v0.9.5 build cycle — each one targeting a specific latency or cost failure mode we observed in production. This is a sample of that chapter.

Zero-interrupt polling I/O

VeltrixDB's I/O layer switched to polling mode — the engine monitors the completion queue continuously rather than waiting for kernel interrupts. Under sustained write load, this eliminates wakeup overhead entirely. Measured: 0 syscalls per read op.

Cold-aware garbage collection

The GC engine now ranks VLog entries by access temperature. Hot data is never relocated during a GC pass. Cold entries compact first. Result: GC runs without competing for the NVMe bandwidth live reads need. GC emergency runs: 0 in 60-min stress.

Dedicated compaction core

The compaction thread is pinned to a reserved CPU core on Linux. It never contends with request-handling threads for scheduler time. The intermittent P99 spikes under heavy write pressure — a known failure mode on shared-core designs — are eliminated. P99 jitter: ±3% under load.

eBPF-backed GC throttle

A built-in BPF program watches CPU throttle counters in real time and signals the GC scheduler before a throttle event reaches the write path. Hard stalls become gradual backpressure. 0 hard stalls in 60-min sustained write stress.

Cloud cold-tier offload

VeltrixDB's tiering engine moves data that has been inactive for more than 5 minutes to S3, GCS, or Azure, rate-limited to 20 MB/s. The NVMe index stays hot. The GC engine is aware of tiered entries and never relocates them. Cold storage cost: 10× lower than NVMe.

—

Combined benchmark result

All five improvements together, verified against a 200K-key sustained write benchmark: 110k ops/s sustained · P99 162ms per batch · density 160 B/record · 0 GC emergency runs. Both packing density and GC gates pass.

The VeltrixDB whitepaper.

The extreme point-lookup engine — architecture & durability.