wal-db is a write-ahead log primitive for Rust storage engines. It is the durability substrate underneath every database, transaction system, and distributed log in the portfolio — lsm-db, txn-db, raft-io, and Hive DB all build on it. The append path is lock-free, durability is explicit and platform-correct on Linux, macOS, and Windows, recovery is provable from a torn write, and concurrent commits coalesce into a single fsync.
A WAL is the workhorse no database can avoid: every state change is appended to a durable log before it is acknowledged, and the log is the source of truth used to rebuild state after a crash. Most Rust databases ship their WAL privately inside the engine; wal-db publishes it as a clean, composable primitive so multiple storage engines (LSM, B-tree, document store) can share a single, well-tested implementation.
The common case is four calls — open, append, sync, iter. The core is synchronous; async is left to the consumer, where it belongs.
MSRV is 1.85+ (Rust 2024 edition). Lock-free append. Group commit. Explicit fsync. Crash-safe recovery.
Status: 1.0, stable. The public API is frozen until 2.0 and the on-disk format is frozen for the 1.x line. The full feature set (lock-free append, group commit, segment rotation, suffix and prefix compaction) is hardened with a fuzz harness, loom model checks, adversarial recovery tests, injected I/O-failure tests, and property tests, and measured against a hand-rolled WAL (benchmarks). See `CHANGELOG.md` for detail.
- Append-only durable log of arbitrary byte records
- Lock-free multi-writer append: many threads append at once with no global lock
- Group commit: concurrent `sync` calls coalesce into one fsync, amortising the durability cost
- Segment rotation: optionally stripe the log across bounded segment files for bounded recovery and archival
- Explicit durability barriers: `append` is in-memory-fast; `sync` is the durability point
- Platform-correct flush: `fdatasync` on Linux, `FlushFileBuffers` on Windows, `fcntl(F_FULLFSYNC)` on macOS
- Torn-write detection: a CRC32C checksum per record; recovery stops at the first damaged record
- Self-healing recovery: a torn tail from a crash mid-append is truncated on open, leaving a clean boundary
- Fuzz-hardened recovery: arbitrary bytes never panic or over-allocate; a continuous `cargo-fuzz` harness proves it
- Recovery policies: stop at the first damaged record, or skip past it for forensic partial recovery
- LSN seeking & truncation: replay from any LSN (`iter_from`); drop everything after one (`truncate_after`) or, on a segmented log, reclaim everything before one (`truncate_before`) for compaction
- Iterator-based replay: walk the log forward to rebuild state
- Typed records (optional): serialise any value via `pack-io` behind a feature; the byte-record API is unchanged when off
- Pluggable storage backend: file-backed by default; injectable for in-memory testing and custom stores
Two operations, two distinct guarantees. Confusing them is the single most common way to lose data with a WAL, so wal-db keeps them explicit:
- `append` returns when the record is in the operating system's page cache. A crash after `append` but before `sync` may lose that record.
- `sync` returns only when every record appended before it is on stable storage and will survive a power loss.
That flush is not the same call on every platform, and getting it wrong is silent:
| Platform | Durability call |
|---|---|
| Linux | fdatasync |
| Windows | FlushFileBuffers |
| macOS | fcntl(F_FULLFSYNC) — not plain fsync, which leaves data in the drive's write cache |
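The split between the two guarantees can be sketched with nothing but the standard library. This is an illustration under stated assumptions, not wal-db's implementation: `open_log`, `append`, and `sync` are hypothetical helpers, and which syscall `File::sync_data` issues on each platform is a detail of the standard library rather than something this README specifies.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: open (or create) a file in append mode.
fn open_log(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Returns once the bytes are handed to the OS (page cache).
/// Fast, but a power loss after this call may still lose the record.
fn append(log: &mut File, record: &[u8]) -> io::Result<()> {
    log.write_all(record)
}

/// The durability barrier: asks the OS to flush the file's data
/// and returns only after that request completes.
fn sync(log: &File) -> io::Result<()> {
    log.sync_data()
}
```

A caller that acknowledges a write after `append` alone is relying on the page cache; only after `sync` returns has the flush been requested and completed.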
```toml
[dependencies]
wal-db = "1.0"
```

```rust
use wal_db::Wal;
# fn apply(_lsn: wal_db::Lsn, _bytes: &[u8]) -> Result<(), wal_db::WalError> { Ok(()) }

// Open (or create) the log.
let wal = Wal::open("/var/lib/myapp/app.wal")?;

// Append returns once the record is in the OS page cache. It does not flush.
let lsn = wal.append(b"a state change")?;

// Sync is the durability barrier: it returns once the record is on stable storage.
wal.sync()?;

// On restart, replay the log from the start to rebuild state.
for entry in wal.iter()? {
    let entry = entry?;
    apply(entry.lsn(), entry.data())?;
}
```

Every record carries a CRC32C checksum over its own bytes. On open, the log scans forward and stops at the first record that is incomplete or fails its checksum (a torn write left by a crash mid-append) and truncates that tail. The records before it are kept; the next append continues from a clean boundary with no gap in the sequence numbers. A corrupt length prefix can never trigger a wild allocation: lengths are validated against the configured maximum before a single payload byte is read.
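The scan described above can be sketched in a few lines. The frame layout here (a 4-byte little-endian length, a 4-byte little-endian CRC32C, then the payload) is an assumption made for illustration; the real layout is whatever `docs/ON_DISK_FORMAT.md` specifies. `frame` and `scan` are hypothetical names, and the checksum is a plain bitwise CRC32C rather than an optimised one.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Assumed header: payload length, then checksum, both u32 little-endian.
const HEADER: usize = 8;

fn read_u32(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Frames one payload in the assumed layout.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Scans forward from offset 0 and stops at the first incomplete or
/// damaged record. Returns each intact record with its start offset,
/// plus the clean end offset a recovering log would truncate to.
fn scan(log: &[u8], max_record: usize) -> (Vec<(usize, &[u8])>, usize) {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= HEADER {
        let len = read_u32(log, pos) as usize;
        let crc = read_u32(log, pos + 4);
        // Validate the length before touching any payload bytes.
        if len > max_record || len > log.len() - pos - HEADER {
            break;
        }
        let payload = &log[pos + HEADER..pos + HEADER + len];
        if crc32c(payload) != crc {
            break;
        }
        records.push((pos, payload));
        pos += HEADER + len;
    }
    (records, pos)
}
```

Because the length is checked against both `max_record` and the bytes actually remaining, a corrupted prefix ends the scan instead of driving a large read or allocation.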
```rust
use wal_db::Wal;
# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
// After a crash, reopening the log truncates any torn tail automatically.
let wal = Wal::open(&path)?;

// Iteration yields a Result per record; a damaged record surfaces once, then ends.
for entry in wal.iter()? {
    match entry {
        Ok(record) => { /* apply record.data() at record.lsn() */ }
        Err(e) => eprintln!("recovery stopped: {e}"),
    }
}
# Ok(())
# }
```

Tunables live on `WalConfig`, a builder passed to `Wal::open_with`:
```rust
use wal_db::{Wal, WalConfig};
# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
let config = WalConfig::new().with_max_record_size(1024 * 1024); // cap records at 1 MiB
let wal = Wal::open_with(&path, config)?;
# let _ = wal;
# Ok(())
# }
```

`Wal` is built for many writers. `append` is lock-free: each call reserves its byte range with a single atomic step (that range's start offset is the record's LSN), then writes its record without blocking the others. Share one `Wal` behind an `Arc` and append from every thread.
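The reservation step can be sketched with a single `fetch_add` on an atomic tail offset. This is a minimal illustration, not wal-db's code: `Reserver` and `reserve_concurrently` are hypothetical names, and the sketch only hands out ranges without writing any bytes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical reservation counter: the log's next free byte offset.
struct Reserver {
    tail: AtomicU64,
}

impl Reserver {
    fn new() -> Self {
        Self { tail: AtomicU64::new(0) }
    }

    /// Reserves `len` bytes in one atomic step and returns the start
    /// offset of the range, which doubles as the record's LSN.
    fn reserve(&self, len: u64) -> u64 {
        self.tail.fetch_add(len, Ordering::Relaxed)
    }

    fn tail(&self) -> u64 {
        self.tail.load(Ordering::Relaxed)
    }
}

/// Spawns `threads` writers that each reserve `per_thread` ranges of
/// `len` bytes. Returns every LSN handed out (sorted) and the final tail.
fn reserve_concurrently(threads: u64, per_thread: u64, len: u64) -> (Vec<u64>, u64) {
    let reserver = Arc::new(Reserver::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let reserver = Arc::clone(&reserver);
            thread::spawn(move || {
                (0..per_thread).map(|_| reserver.reserve(len)).collect::<Vec<u64>>()
            })
        })
        .collect();
    let mut lsns: Vec<u64> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    lsns.sort_unstable();
    (lsns, reserver.tail())
}
```

Each `fetch_add` returns a distinct starting offset, so no two writers ever hold overlapping ranges, whatever order the threads run in.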
Durability is where threads cooperate. When several call sync at once they coalesce into a single fsync — group commit — so the cost of making data durable is amortised across everyone committing together rather than paid N times. append_and_sync does an append and a group-commit-aware sync in one call:
```rust
use std::sync::Arc;
use std::thread;
use wal_db::{MemStore, Wal};
# fn main() -> Result<(), wal_db::WalError> {
let wal = Arc::new(Wal::with_store(MemStore::new())?);

let workers: Vec<_> = (0..4)
    .map(|t| {
        let wal = Arc::clone(&wal);
        thread::spawn(move || {
            for i in 0..100 {
                // Each thread appends and commits; the fsyncs coalesce.
                wal.append_and_sync(format!("worker {t} record {i}").as_bytes()).unwrap();
            }
        })
    })
    .collect();

for w in workers {
    w.join().unwrap();
}
assert_eq!(wal.iter()?.count(), 400);
# Ok(())
# }
```

LSNs are byte offsets. The LSN returned by `append` is the record's position in the log: monotonic and unique, but not consecutive. The first record is `0`; the next sits at its end. This is what lets the append path reserve with a single atomic and never reorder. See `docs/ON_DISK_FORMAT.md`.
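The group-commit handshake can be sketched with a mutex and a condition variable: one caller becomes the flusher, and everyone whose records are covered by that flush returns without issuing their own. This is a hedged sketch of the general technique, not wal-db's implementation. `GroupCommit`, `sync_to`, and `stats` are hypothetical names, the flush is a caller-supplied closure standing in for the real fsync, and `append` here takes the lock for brevity rather than being lock-free.

```rust
use std::sync::{Condvar, Mutex};

struct State {
    appended: u64,  // end offset of everything appended so far
    durable: u64,   // end offset covered by the last completed flush
    flushing: bool, // true while one thread is running the flush
    flushes: u64,   // number of flushes issued, for observation
}

struct GroupCommit {
    state: Mutex<State>,
    flushed: Condvar,
}

impl GroupCommit {
    fn new() -> Self {
        Self {
            state: Mutex::new(State { appended: 0, durable: 0, flushing: false, flushes: 0 }),
            flushed: Condvar::new(),
        }
    }

    /// Records `len` appended bytes and returns the new end offset.
    fn append(&self, len: u64) -> u64 {
        let mut s = self.state.lock().unwrap();
        s.appended += len;
        s.appended
    }

    /// Returns once everything up to `target` is covered by a flush.
    fn sync_to(&self, target: u64, flush: impl Fn()) {
        let mut s = self.state.lock().unwrap();
        while s.durable < target {
            if s.flushing {
                // Another thread is flushing; wait and re-check.
                s = self.flushed.wait(s).unwrap();
                continue;
            }
            // Become the flusher for everything appended so far.
            s.flushing = true;
            let covers = s.appended;
            drop(s);
            flush(); // stand-in for the real fsync, run outside the lock
            s = self.state.lock().unwrap();
            s.durable = covers;
            s.flushing = false;
            s.flushes += 1;
            self.flushed.notify_all();
        }
    }

    /// Returns (appended, durable, flushes).
    fn stats(&self) -> (u64, u64, u64) {
        let s = self.state.lock().unwrap();
        (s.appended, s.durable, s.flushes)
    }
}
```

A thread that arrives while a flush is in progress waits for it; if that flush did not cover its record, it becomes the next flusher and carries along everyone who appended in the meantime. How many flushes are saved depends on timing, so the count is not deterministic.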
Wal::open uses the file-backed FileStore. Any type implementing the WalStore trait can stand in — an in-memory store for tests, or an alternative storage layer. The crate ships MemStore for the in-memory case:
```rust
use wal_db::{MemStore, Wal};
# fn main() -> Result<(), wal_db::WalError> {
let wal = Wal::with_store(MemStore::new())?;
let lsn = wal.append(b"no filesystem involved")?;
assert_eq!(lsn.get(), 0);
# Ok(())
# }
```

By default a log is a single file. For bounded recovery time and archival, stripe it across fixed-size segment files in a directory instead, using `Wal::open_segmented`. The log stays one continuous byte stream; records span segment boundaries freely (the same scheme PostgreSQL uses), so nothing about the API or the records changes:
```rust
use wal_db::Wal;
# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
// 16 MiB segments. Old, superseded segment files can be archived or pruned.
let wal = Wal::open_segmented(dir.path(), 16 * 1024 * 1024)?;
wal.append(b"striped across files")?;
wal.sync()?;
# Ok(())
# }
```

By default a record is bytes. With the `pack-io` feature, a record can be any type that derives `Serialize`/`Deserialize`: `append_typed` writes it, `Record::decode` reads it back. The derives come from the re-exported `wal_db::pack_io`, so no extra dependency is needed.
```toml
[dependencies]
wal-db = { version = "1.0", features = ["pack-io"] }
```

```rust
use wal_db::{MemStore, Wal};
use wal_db::pack_io::{Serialize, Deserialize};

#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Event { id: u64, name: String }

# fn main() -> Result<(), wal_db::WalError> {
let wal = Wal::with_store(MemStore::new())?;
wal.append_typed(&Event { id: 1, name: "start".into() })?;

let event: Event = wal.iter()?.next().unwrap()?.decode()?;
assert_eq!(event, Event { id: 1, name: "start".into() });
# Ok(())
# }
```

`Wal::open` always truncates a torn tail so the append boundary is clean. For corruption inside an already-recovered log (bit rot, say), a `WalConfig` recovery policy controls how iteration reacts:
```rust
use wal_db::{RecoveryPolicy, Wal, WalConfig};
# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
// Default: stop at the first damaged record. Or skip past it for partial recovery:
let config = WalConfig::new().with_recovery_policy(RecoveryPolicy::SkipBadRecords);
let wal = Wal::open_with(&path, config)?;

for entry in wal.iter()? {
    match entry {
        Ok(record) => { /* use it */ }
        Err(e) => eprintln!("skipped a damaged record: {e}"), // iteration continues
    }
}
# Ok(())
# }
```

An LSN is a byte offset, so replaying from a checkpoint is O(1): `iter_from` starts at the LSN instead of scanning from the beginning. `truncate_after` drops everything after a record (rolling back a tail, the way a Raft log does on a conflict), and on a segmented log `truncate_before` reclaims everything before a record (prefix compaction once it has been applied and flushed elsewhere). Both preserve the LSNs of surviving records:
```rust
use wal_db::Wal;
# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
let wal = Wal::open(&path)?;
let _ = wal.append(b"applied")?;
let checkpoint = wal.append(b"also applied")?;
let _ = wal.append(b"not yet applied")?;

// Replay only what came at or after the checkpoint.
for entry in wal.iter_from(checkpoint)? { let _ = entry?; }

// Or roll back: keep up to the checkpoint, drop the rest (made durable).
wal.truncate_after(checkpoint)?;
# Ok(())
# }
```

The core is synchronous on purpose: a WAL's calls map to blocking syscalls (write, fsync), and a runtime is the consumer's choice, not the library's. From an async context, offload to a blocking pool:
```rust
let wal = wal.clone(); // Arc<Wal>
let lsn = tokio::task::spawn_blocking(move || wal.append_and_sync(b"record")).await??;
```

Numbers from the criterion suite on the development machine, 256-byte records. They are honest measurements, not marketing: the commit figures are bounded by this machine's fsync latency and scale with faster storage and more writers. Full detail and method in `docs/BENCHMARKS.md`.
| Benchmark | Result | What it measures |
|---|---|---|
| LSN reservation | ~4 ns | the single atomic that allocates an LSN and reserves a byte range |
| `append/single` | ~105 ns | the lock-free hot path: reserve, frame, write one record to memory, no syscall |
| `append/multi (8, file)` | ~160 K/s | file-backed multi-writer append, syscall-bound (one pwrite each) |
| `commit/group (8 writers)` | ~1.9× a hand-rolled inline WAL | concurrent append-and-sync; group commit coalesces the fsyncs |
| `recovery/replay (10k)` | ~215 K records/s | reopen and replay a file-backed log |
A file-backed append is syscall-bound, not lock-bound — the pwrite the durability contract requires dominates the negligible commit-watermark lock — so the throughput lever is group commit, which beats the inline WAL an engine hand-rolls before it has batching. Run them yourself:
```sh
cargo bench --bench wal_bench   # append, commit, recovery, reservation
cargo bench --bench compare     # wal-db vs a hand-rolled inline WAL
```

| Example | Run | Shows |
|---|---|---|
| `basic` | `cargo run --example basic` | the four-call API: open, append, sync, replay |
| `recovery` | `cargo run --example recovery` | a simulated torn write and self-healing recovery |
| `concurrent` | `cargo run --example concurrent` | many writers, one log, group commit |
| `checkpoint` | `cargo run --example checkpoint` | replay from a checkpoint (`iter_from`) and truncate back to one (`truncate_after`) |
| `typed` | `cargo run --example typed --features pack-io` | typed records via `pack-io` |
```sh
cargo test --all-features                          # unit, integration, doc tests
cargo test --test torn_write                       # torn-write recovery property test
cargo test --test durability                       # durability across a real process restart
cargo test --test segmented                        # segment rotation and spanning records
RUSTFLAGS="--cfg loom" cargo test --test loom_wal  # model-checked concurrency
cargo +nightly fuzz run recover                    # fuzz the recovery path
cargo bench --bench wal_bench                      # append and commit throughput
```

The loom run model-checks the lock-free append and the group-commit handshake: it explores every meaningful thread interleaving and asserts no overlapping records, no reorder, and at most one fsync per syncer. The fuzz run feeds arbitrary bytes to the recovery path and proves it never panics or over-allocates.
wal-db is the durability substrate. It is consumed by:
- `lsm-db`: memtable durability
- `txn-db`: transaction log
- `raft-io`: Raft log persistence
- Hive DB: primary write-ahead log
It stays foreign-compatible: usable standalone in any project that needs a durable append-only log.
Tier 1 Support:
- Linux (x86_64, aarch64): `fdatasync`
- macOS (x86_64, Apple Silicon): `fcntl(F_FULLFSYNC)` for true durability
- Windows (x86_64): `FlushFileBuffers`
Durability semantics are equivalent across platforms; the CI matrix runs the full suite — including the cross-process durability test — on each.
Before opening a PR, cargo fmt --all, cargo clippy --all-targets --all-features -- -D warnings, and cargo test --all-features must be clean. Any change touching the durability path requires a torn-write recovery test and a benchmark.
Licensed under either of
- Apache License, Version 2.0 — see LICENSE-APACHE
- MIT License — see LICENSE-MIT
at your option.