fix(storage): harden blockchain crash-recovery against panics by orrinfrazier · Pull Request #5 · orrinfrazier/cuprate

orrinfrazier · 2026-05-24T21:56:24Z

Closes #15

Summary

The fjall rebuild path (rebuild_fjall_database, invoked by make_consistent() after an unclean shutdown during DB init) panicked on corrupt tape data instead of returning a clean error — the exact case a rebuild exists for. This makes the recovery path panic-free end-to-end and strengthens the consistency check so torn writes that leave equal counts are detected. Implements issues/S04-2-crash-recovery.md (A-storage, I-panic, P-high).
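As a rough illustration of why comparing counts alone can miss a torn write, the sketch below models the strengthened check on a plain struct. `Snapshot`, `Verdict`, and `check` are hypothetical names invented for this example; the real `make_consistent` reads these values from the database tables and is not reproduced here.

```rust
/// Hypothetical O(1) snapshot of the table state inspected at startup.
/// Field names mirror the tables named in this PR; the struct itself is an
/// assumption made for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snapshot {
    pub block_infos_len: u64,
    pub block_heights_len: u64,
    pub tx_infos_len: u64,
    pub tx_ids_len: u64,
    /// Height that the top block's hash maps to, if that record exists.
    pub top_hash_height: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Consistent,
    NeedsRebuild,
}

/// Decide whether a rebuild is needed from the snapshot alone.
pub fn check(s: &Snapshot) -> Verdict {
    // Length checks: each table pair must agree.
    if s.block_infos_len != s.block_heights_len || s.tx_infos_len != s.tx_ids_len {
        return Verdict::NeedsRebuild;
    }
    // Tail-record check: the top hash must map to the top height.
    // `checked_sub` yields `None` for an empty chain instead of underflowing.
    let expected_top = s.block_infos_len.checked_sub(1);
    if s.top_hash_height != expected_top {
        return Verdict::NeedsRebuild;
    }
    Verdict::Consistent
}
```

In this model, a snapshot whose lengths all match but whose top hash points at a stale height is still flagged for rebuild, which is the equal-counts case the summary refers to.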

Why

The write path commits with Persistence::Buffer and there is no periodic fsync (durability only on graceful Drop), so post-crash rebuilds are expected to be common — the recovery path must not crash the node. Errors surface to the user as DATABASE_CORRUPT_MSG via init_with_pool → .context(...) in cuprated.

Changes

- Propagate errors instead of .unwrap() in the rebuild tx iterator: thread a fallible Item = DbResult<Cow<Transaction<Pruned>>> through the shared add_block_to_dynamic_tables (the hot write path wraps items in Ok(...)).
- Harden get_block (called per height during rebuild): checked_sub + try_from mapped to BlockchainError::Corrupt, and ok_or on blob_tape_len, so there is no unwrap/expect/underflow/OOM on corrupt indices.
- Bound the rebuild tx-blob allocation against the pruned_blobs tape length before allocating (prevents OOM from a corrupt pruned_size).
- Replace the height assert! in add_block_to_dynamic_tables with an Err return.
- Strengthen make_consistent: in addition to block_infos↔block_heights, also check tx_infos↔tx_ids length and a tail-record (top block hash → top height) sanity check; each triggers a rebuild. Adds BlockchainError::Corrupt for unrecoverable states. Stays O(1)/O(log n) at startup.
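The pattern behind the first three changes can be sketched on an in-memory byte slice. This is a minimal illustration under stated assumptions, not the code in this PR: `read_blob` and `total_blob_len` are hypothetical helpers, and the `BlockchainError` and `DbResult` definitions here are local stand-ins (only the `Corrupt` variant is named in the PR).

```rust
// In the prelude since edition 2021; imported explicitly so the sketch also
// compiles on older editions.
use std::convert::TryFrom;

/// Local stand-in for the crate's error type (assumption).
#[derive(Debug, PartialEq, Eq)]
pub enum BlockchainError {
    Corrupt,
}

pub type DbResult<T> = Result<T, BlockchainError>;

/// Copy `len` bytes starting at `offset` out of `tape`.
/// Both values are treated as untrusted: the range is validated against the
/// tape length before anything is allocated.
pub fn read_blob(tape: &[u8], offset: u64, len: u64) -> DbResult<Vec<u8>> {
    let start = usize::try_from(offset).map_err(|_| BlockchainError::Corrupt)?;
    let len = usize::try_from(len).map_err(|_| BlockchainError::Corrupt)?;
    // checked_add: a huge offset or length cannot wrap around.
    let end = start.checked_add(len).ok_or(BlockchainError::Corrupt)?;
    // `get` returns None when the range runs past the end of the tape.
    let bytes = tape.get(start..end).ok_or(BlockchainError::Corrupt)?;
    Ok(bytes.to_vec())
}

/// Consume a fallible iterator, returning the first error via `?` instead of
/// unwrapping each item.
pub fn total_blob_len<I>(items: I) -> DbResult<usize>
where
    I: Iterator<Item = DbResult<Vec<u8>>>,
{
    let mut total = 0usize;
    for item in items {
        let blob = item?;
        total = total
            .checked_add(blob.len())
            .ok_or(BlockchainError::Corrupt)?;
    }
    Ok(total)
}
```

A caller on an infallible path can feed the same consumer by wrapping each item in `Ok(...)`, which matches how the PR describes the hot write path.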

How this was built

STIR pipeline with --multi-ai (Codex for test/impl) and --consensus review. The original fix hardened only the three named .unwrap() lines; a 3-way cross-family consensus review (Claude + Gemini + Codex, opus adjudicator) caught that the goal wasn't met end-to-end — the rebuild loop also calls get_block (its own underflow/OOM) and add_block_to_dynamic_tables (an assert!). All four findings (2 consensus, 1 majority, 1 solo) were fixed and confirmed by a fresh re-review.

Testing

cargo test -p cuprate-blockchain — 5 lib tests + 10 doc-tests pass, 0 ignored:

- rebuild_returns_err_on_corrupt_tape (reopen-based): writes a synthetic chain, drops the DB (which flushes to disk), corrupts the on-disk pruned_blobs file, reopens, forces a rebuild, and asserts that make_consistent() returns Err rather than panicking. Reopening is required because Persistence::Buffer keeps data in the live in-process reader, so same-process truncation isn't observed.
- make_consistent_detects_txid_mismatch, make_consistent_noop_on_consistent_db, and add_block_to_dynamic_tables_propagates_err cover tx_id-mismatch detection, the no-op path on a consistent DB, and unit-level error propagation.
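The shape of the reopen-based test (write, corrupt on disk, read back, expect `Err`) can be shown with the standard library alone. The sketch below is an analogy, not the actual test: the length-prefixed record format and the helpers `write_record`, `truncate`, and `read_record` are assumptions made for illustration and do not reflect the real tape layout.

```rust
use std::convert::TryFrom;
use std::fs::{self, OpenOptions};
use std::io;
use std::path::Path;

fn corrupt() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, "corrupt record")
}

/// Write one record: 8-byte little-endian length, then the payload.
pub fn write_record(path: &Path, payload: &[u8]) -> io::Result<()> {
    let mut buf = (payload.len() as u64).to_le_bytes().to_vec();
    buf.extend_from_slice(payload);
    fs::write(path, buf)
}

/// Simulate a torn write by cutting the file down to `new_len` bytes.
pub fn truncate(path: &Path, new_len: u64) -> io::Result<()> {
    OpenOptions::new().write(true).open(path)?.set_len(new_len)
}

/// Re-read the record from disk, returning an error (never panicking) when
/// the header or payload is shorter than the stored length claims.
pub fn read_record(path: &Path) -> io::Result<Vec<u8>> {
    let bytes = fs::read(path)?;
    let mut header = [0u8; 8];
    header.copy_from_slice(bytes.get(..8).ok_or_else(corrupt)?);
    let len = usize::try_from(u64::from_le_bytes(header)).map_err(|_| corrupt())?;
    let end = 8usize.checked_add(len).ok_or_else(corrupt)?;
    Ok(bytes.get(8..end).ok_or_else(corrupt)?.to_vec())
}
```

A test built on these helpers would write a record, call `truncate` with a length shorter than the full record, and assert that `read_record` returns an `InvalidData` error.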

cargo clippy -p cuprate-blockchain --all-targets --all-features -- -D warnings and cargo fmt --check clean.

Follow-up (non-blocking, pre-existing — not introduced here)

ops/block.rs block_info.mining_tx_index + 1 / *block_height + 1 are raw additions on tape data reachable during rebuild; they would overflow-panic only in debug/overflow-checked builds (wrap harmlessly in release). Worth a checked-arithmetic pass in a future hardening change.
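A checked-arithmetic pass of the kind suggested here might look like the following sketch. `next_index` is a hypothetical helper, the `u64` type is an assumption about the fields involved, and `BlockchainError` is a local stand-in for the crate's type.

```rust
/// Local stand-in for the crate's error type (assumption).
#[derive(Debug, PartialEq, Eq)]
pub enum BlockchainError {
    Corrupt,
}

/// Increment a value read from the tape, mapping overflow to an error
/// instead of panicking (debug builds) or wrapping (release builds).
pub fn next_index(value: u64) -> Result<u64, BlockchainError> {
    value.checked_add(1).ok_or(BlockchainError::Corrupt)
}
```

With this helper, `next_index(u64::MAX)` returns `Err(BlockchainError::Corrupt)`, whereas a raw `+ 1` on the same value would panic in an overflow-checked build and wrap to 0 otherwise.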

Out of scope

The issue's open question (Buffer-only durability vs. a periodic SyncAll task) is a durability-design decision, not part of this panic fix.

The fjall rebuild path (rebuild_fjall_database, invoked by make_consistent after an unclean shutdown) panicked on corrupt tape data instead of returning a clean error, which is exactly the case a rebuild exists for. Make the recovery path panic-free end-to-end and strengthen the consistency check.

- Propagate errors via `?` instead of `.unwrap()` in the rebuild tx iterator; thread a fallible `Item = DbResult<Cow<Transaction<Pruned>>>` through the shared add_block_to_dynamic_tables (hot write path wraps items in Ok).
- Harden get_block (called per height during rebuild): checked_sub + try_from mapped to BlockchainError::Corrupt, and ok_or on blob_tape_len (no unwrap/expect/underflow/OOM on corrupt indices).
- Bound the rebuild tx-blob allocation against the pruned_blobs tape length before allocating (prevents OOM from a corrupt pruned_size).
- Replace the height assert! in add_block_to_dynamic_tables with an Err return.
- Strengthen make_consistent: also check tx_infos<->tx_ids length and a tail-record (top block hash -> top height) sanity check; each triggers a rebuild. Add BlockchainError::Corrupt for unrecoverable states.
- Tests: reopen-based corrupt-tape recovery test (drop -> corrupt on-disk pruned_blobs -> reopen -> rebuild returns Err, no panic), tx_id-mismatch detection, no-op-on-consistent-db, and unit-level error propagation.

Errors surface as DATABASE_CORRUPT_MSG via init_with_pool in cuprated.

Implements issues/S04-2-crash-recovery.md

github-actions Bot added A-storage labels May 24, 2026

orrinfrazier removed A-dependency labels May 24, 2026

orrinfrazier wants to merge 1 commit into main from fix/crash-recovery-rebuild-panic
