fix(storage): harden blockchain crash-recovery against panics#5
Open
orrinfrazier wants to merge 1 commit into
Open
fix(storage): harden blockchain crash-recovery against panics#5orrinfrazier wants to merge 1 commit into
orrinfrazier wants to merge 1 commit into
Conversation
The fjall rebuild path (rebuild_fjall_database, invoked by make_consistent after an unclean shutdown) panicked on corrupt tape data instead of returning a clean error — the exact case a rebuild exists for. Make the recovery path panic-free end-to-end and strengthen the consistency check. - Propagate errors via `?` instead of `.unwrap()` in the rebuild tx iterator; thread a fallible `Item = DbResult<Cow<Transaction<Pruned>>>` through the shared add_block_to_dynamic_tables (hot write path wraps items in Ok). - Harden get_block (called per height during rebuild): checked_sub + try_from mapped to BlockchainError::Corrupt, and ok_or on blob_tape_len (no unwrap/ expect/underflow/OOM on corrupt indices). - Bound the rebuild tx-blob allocation against the pruned_blobs tape length before allocating (prevents OOM from a corrupt pruned_size). - Replace the height assert! in add_block_to_dynamic_tables with an Err return. - Strengthen make_consistent: also check tx_infos<->tx_ids length and a tail-record (top block hash -> top height) sanity check; each triggers a rebuild. Add BlockchainError::Corrupt for unrecoverable states. - Tests: reopen-based corrupt-tape recovery test (drop -> corrupt on-disk pruned_blobs -> reopen -> rebuild returns Err, no panic), tx_id-mismatch detection, no-op-on-consistent-db, and unit-level error propagation. Errors surface as DATABASE_CORRUPT_MSG via init_with_pool in cuprated. Implements issues/S04-2-crash-recovery.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #15
Summary
The fjall rebuild path (
rebuild_fjall_database, invoked bymake_consistent()after an unclean shutdown during DB init) panicked on corrupt tape data instead of returning a clean error — the exact case a rebuild exists for. This makes the recovery path panic-free end-to-end and strengthens the consistency check so torn writes that leave equal counts are detected. Implementsissues/S04-2-crash-recovery.md(A-storage, I-panic, P-high).Why
The write path commits with
Persistence::Bufferand there is no periodic fsync (durability only on gracefulDrop), so post-crash rebuilds are expected to be common — the recovery path must not crash the node. Errors surface to the user asDATABASE_CORRUPT_MSGviainit_with_pool→.context(...)incuprated.Changes
.unwrap()in the rebuild tx iterator — thread a fallibleItem = DbResult<Cow<Transaction<Pruned>>>through the sharedadd_block_to_dynamic_tables(the hot write path wraps items inOk(...)).get_block(called per height during rebuild):checked_sub+try_frommapped toBlockchainError::Corrupt, andok_oronblob_tape_len— no unwrap/expect/underflow/OOM on corrupt indices.pruned_blobstape length before allocating (prevents OOM from a corruptpruned_size).assert!inadd_block_to_dynamic_tableswith anErrreturn.make_consistent: in addition toblock_infos↔block_heights, also checktx_infos↔tx_idslength and a tail-record (top block hash → top height) sanity check; each triggers a rebuild. AddsBlockchainError::Corruptfor unrecoverable states. Stays O(1)/O(log n) at startup.How this was built
STIR pipeline with
--multi-ai(Codex for test/impl) and--consensusreview. The original fix hardened only the three named.unwrap()lines; a 3-way cross-family consensus review (Claude + Gemini + Codex, opus adjudicator) caught that the goal wasn't met end-to-end — the rebuild loop also callsget_block(its own underflow/OOM) andadd_block_to_dynamic_tables(anassert!). All four findings (2 consensus, 1 majority, 1 solo) were fixed and confirmed by a fresh re-review.Testing
cargo test -p cuprate-blockchain— 5 lib tests + 10 doc-tests pass, 0 ignored:rebuild_returns_err_on_corrupt_tape— reopen-based: writes a synthetic chain, drops the DB (flushes to disk), corrupts the on-diskpruned_blobsfile, reopens, forces a rebuild, assertsmake_consistent()returnsErr(not a panic). (Reopen is required:Persistence::Bufferkeeps data in the live in-process reader, so same-process truncation isn't observed.)make_consistent_detects_txid_mismatch,make_consistent_noop_on_consistent_db,add_block_to_dynamic_tables_propagates_err.cargo clippy -p cuprate-blockchain --all-targets --all-features -- -D warningsandcargo fmt --checkclean.Follow-up (non-blocking, pre-existing — not introduced here)
ops/block.rsblock_info.mining_tx_index + 1/*block_height + 1are raw additions on tape data reachable during rebuild; they would overflow-panic only in debug/overflow-checked builds (wrap harmlessly in release). Worth a checked-arithmetic pass in a future hardening change.Out of scope
The issue's open question (Buffer-only durability vs. a periodic
SyncAlltask) is a durability-design decision, not part of this panic fix.