Skip to content

fix(storage): harden blockchain crash-recovery against panics#5

Open
orrinfrazier wants to merge 1 commit into
mainfrom
fix/crash-recovery-rebuild-panic
Open

fix(storage): harden blockchain crash-recovery against panics#5
orrinfrazier wants to merge 1 commit into
mainfrom
fix/crash-recovery-rebuild-panic

Conversation

@orrinfrazier
Copy link
Copy Markdown
Owner

@orrinfrazier orrinfrazier commented May 24, 2026

Closes #15

Summary

The fjall rebuild path (rebuild_fjall_database, invoked by make_consistent() after an unclean shutdown during DB init) panicked on corrupt tape data instead of returning a clean error — the exact case a rebuild exists for. This makes the recovery path panic-free end-to-end and strengthens the consistency check so torn writes that leave equal counts are detected. Implements issues/S04-2-crash-recovery.md (A-storage, I-panic, P-high).

Why

The write path commits with Persistence::Buffer and there is no periodic fsync (durability only on graceful Drop), so post-crash rebuilds are expected to be common — the recovery path must not crash the node. Errors surface to the user as DATABASE_CORRUPT_MSG via init_with_pool.context(...) in cuprated.

Changes

  • Propagate errors instead of .unwrap() in the rebuild tx iterator — thread a fallible Item = DbResult<Cow<Transaction<Pruned>>> through the shared add_block_to_dynamic_tables (the hot write path wraps items in Ok(...)).
  • Harden get_block (called per height during rebuild): checked_sub + try_from mapped to BlockchainError::Corrupt, and ok_or on blob_tape_len — no unwrap/expect/underflow/OOM on corrupt indices.
  • Bound the rebuild tx-blob allocation against the pruned_blobs tape length before allocating (prevents OOM from a corrupt pruned_size).
  • Replace the height assert! in add_block_to_dynamic_tables with an Err return.
  • Strengthen make_consistent: in addition to block_infosblock_heights, also check tx_infostx_ids length and a tail-record (top block hash → top height) sanity check; each triggers a rebuild. Adds BlockchainError::Corrupt for unrecoverable states. Stays O(1)/O(log n) at startup.

How this was built

STIR pipeline with --multi-ai (Codex for test/impl) and --consensus review. The original fix hardened only the three named .unwrap() lines; a 3-way cross-family consensus review (Claude + Gemini + Codex, opus adjudicator) caught that the goal wasn't met end-to-end — the rebuild loop also calls get_block (its own underflow/OOM) and add_block_to_dynamic_tables (an assert!). All four findings (2 consensus, 1 majority, 1 solo) were fixed and confirmed by a fresh re-review.

Testing

cargo test -p cuprate-blockchain — 5 lib tests + 10 doc-tests pass, 0 ignored:

  • rebuild_returns_err_on_corrupt_tape — reopen-based: writes a synthetic chain, drops the DB (flushes to disk), corrupts the on-disk pruned_blobs file, reopens, forces a rebuild, asserts make_consistent() returns Err (not a panic). (Reopen is required: Persistence::Buffer keeps data in the live in-process reader, so same-process truncation isn't observed.)
  • make_consistent_detects_txid_mismatch, make_consistent_noop_on_consistent_db, add_block_to_dynamic_tables_propagates_err.

cargo clippy -p cuprate-blockchain --all-targets --all-features -- -D warnings and cargo fmt --check clean.

Follow-up (non-blocking, pre-existing — not introduced here)

  • ops/block.rs block_info.mining_tx_index + 1 / *block_height + 1 are raw additions on tape data reachable during rebuild; they would overflow-panic only in debug/overflow-checked builds (wrap harmlessly in release). Worth a checked-arithmetic pass in a future hardening change.

Out of scope

The issue's open question (Buffer-only durability vs. a periodic SyncAll task) is a durability-design decision, not part of this panic fix.

The fjall rebuild path (rebuild_fjall_database, invoked by make_consistent
after an unclean shutdown) panicked on corrupt tape data instead of returning
a clean error — the exact case a rebuild exists for. Make the recovery path
panic-free end-to-end and strengthen the consistency check.

- Propagate errors via `?` instead of `.unwrap()` in the rebuild tx iterator;
  thread a fallible `Item = DbResult<Cow<Transaction<Pruned>>>` through the
  shared add_block_to_dynamic_tables (hot write path wraps items in Ok).
- Harden get_block (called per height during rebuild): checked_sub + try_from
  mapped to BlockchainError::Corrupt, and ok_or on blob_tape_len (no unwrap/
  expect/underflow/OOM on corrupt indices).
- Bound the rebuild tx-blob allocation against the pruned_blobs tape length
  before allocating (prevents OOM from a corrupt pruned_size).
- Replace the height assert! in add_block_to_dynamic_tables with an Err return.
- Strengthen make_consistent: also check tx_infos<->tx_ids length and a
  tail-record (top block hash -> top height) sanity check; each triggers a
  rebuild. Add BlockchainError::Corrupt for unrecoverable states.
- Tests: reopen-based corrupt-tape recovery test (drop -> corrupt on-disk
  pruned_blobs -> reopen -> rebuild returns Err, no panic), tx_id-mismatch
  detection, no-op-on-consistent-db, and unit-level error propagation.

Errors surface as DATABASE_CORRUPT_MSG via init_with_pool in cuprated.

Implements issues/S04-2-crash-recovery.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

storage: harden blockchain crash-recovery (no unwrap in rebuild_fjall_database)

1 participant