wallet: cooperative-yield sync (incremental persist + sentinel)#24
Open
saroupille wants to merge 3 commits intotrilitech:mainfrom
Open
wallet: cooperative-yield sync (incremental persist + sentinel)#24saroupille wants to merge 3 commits intotrilitech:mainfrom
saroupille wants to merge 3 commits intotrilitech:mainfrom
Conversation
`tzel-wallet sync` previously held the wallet lock for the full duration
of `[scanned, tree_size)`, persisting only at the end. On a long-running
rollup that's 5-45 min during which every other wallet op (shield,
transfer, unshield, balance, …) blocks. Unacceptable for the
daemon-driven UI flow where users want to act now.
This rewrite:
* Pins head + tree_size once, scans in batches of 50 commits
(CHECKPOINT_EVERY constant; ~0.7% persist overhead, ~2 s of
re-do work on kill -9 between checkpoints).
* Each batch ends with `save_wallet` — atomic temp-file +
fsync + rename + parent-dir fsync, so an unclean shutdown
leaves a coherent wallet.json one batch behind the cursor,
never mid-write.
* After each batch, stat()s `<wallet>.yield`. If present, exits
cleanly (exit 0). The next sync resumes from the persisted
cursor. The companion change in tzel-infra teaches the daemon
to touch this sentinel before submitting a slow CLI op
(shield/transfer/unshield/withdraw) and remove it after, so
a long initial sync no longer starves slow-lane ops.
* Final flush after the loop runs the nullifier + pending-pool
prune once against the cumulative cm set seen across all
batches — preserves the existing `apply_scan_feed` semantics
exactly.
Tests added in `network_profile_tests`:
* `cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present`
— pre-create the sentinel, assert sync stops at exactly one
checkpoint boundary and never removes the caller's sentinel.
* `cmd_rollup_sync_resumes_from_checkpointed_cursor` — set the
sentinel via the test hook mid-scan, run twice, assert notes
before/after the yield boundary are recovered in the right run.
* `cmd_rollup_sync_intermediate_wallet_is_always_parseable` —
concurrent reader thread parses wallet.json while the scan
runs; every read must serde-decode (no half-written file).
Total: 102 lib tests pass (99 baseline + 3 new), 0 regressions.
Companion change: trilitech/tzel-infra `feat/sync-cooperative-yield`
which teaches the daemon's slow-lane dispatcher to touch + remove the
sentinel around each slow op.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version of this PR baked the cooperative-yield checkpoint granularity as a private const `CHECKPOINT_EVERY = 50`. That value was chosen by reasoning (HTTP-bound work / save_wallet overhead / loss-on-interrupt), not by measurement. Expose it as `tzel-wallet sync --checkpoint-every N`: - Default 50 (renamed const → DEFAULT_CHECKPOINT_EVERY) — same behaviour as before for callers that don't pass the flag. - Reject 0 with a clear error (would loop forever without progress). - Docstring spells out the empirical-vs-reasoned status: numbers come from `services/scan-bench/` POC + the design doc in tzel-infra, but pending live measurement against octez-smart-rollup-node. - Recommends the safe range (~10..500) with explicit costs at each edge so a user tuning down (faster preemption) or up (less overhead) understands the trade-off. Wired through `cmd_rollup_sync` and `cmd_rollup_sync_watch`. The dispatch site casts u64 → usize (saturating on 32-bit, which is fine because the inner loop bounds the batch via `min(.., tree_size)` regardless). Tests: 1 new unit test `cmd_rollup_sync_refuses_zero_checkpoint_every` locking the validation path. Existing 3 cooperative-yield tests updated to pass `DEFAULT_CHECKPOINT_EVERY` explicitly. Full suite: 103/103 pass (102 → 103 with the new test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent reviewers on PR trilitech#24 surfaced one real bug, two test gaps, and three docstring follow-ups. All applied: **Real bug (Reviewer A)** — `w.scanned + checkpoint_every` at lib.rs :7721 used unchecked addition. With an adversarial `--checkpoint-every` (close to `usize::MAX` on 64-bit), that would panic in debug / wrap in release. Switched to `w.scanned.saturating_add(checkpoint_every) .min(tree_size)`. The atomic-rename invariant on `save_wallet` means a wrap couldn't have corrupted on-disk state, but the sync would have silently misbehaved with a tiny degenerate batch. Belt-and-braces. **Test gaps**: - All four cooperative-yield tests share the process-global `SYNC_CHECKPOINT_HOOK` (a `OnceLock<Mutex<...>>`). cargo-test runs in-binary tests in parallel; without serialization, one test's hook could fire under another test's loop. Added `serial_test` dev-dep and `#[serial(sync_checkpoint_hook)]` on all four (now five). - All four tests hardcoded `DEFAULT_CHECKPOINT_EVERY` (50). A future regression that ignored the runtime parameter and used the const internally would still pass. Added `cmd_rollup_sync_respects_custom_ checkpoint_every` exercising K=7 with a sentinel trip at K*2; locks the contract that the flag actually does something. **Docstring follow-ups (Reviewer B)**: - The previous text claimed `services/scan-bench/` of tzel-infra informed the K=50 default. False: scan-bench measures concurrent- fetch tolerance (the orthogonal N dial in design doc §2.A); the K=50 numbers come from a `save_wallet` micro-bench that doesn't exist yet. Tightened to call out the dial separation + acknowledge the pending bench. - Added the operator-vs-laptop calibration sentence: raise to 100-200 on slow / synchronously-replicated storage; drop to 20-30 on fast operator boxes when sub-second preemption matters. - Added a one-liner about `--watch` × `--checkpoint-every` composition (each polling iteration honours the value independently). **Env var (Reviewer B F4)** — added `env = "TZEL_SYNC_CHECKPOINT_EVERY"` to the clap arg, mirroring the precedent the design doc set for `TZEL_SYNC_CONCURRENCY` (doc §4 line 187). Operator-deployed daemons can now override K without recompiling. Required `clap` "env" feature, added. Tests: 104/104 pass (was 103, +1 with the new K=7 test). No regressions to the 99 baseline tests. Deferred to follow-up PRs (out of scope here): - Daemon-side runner.rs pass-through of TZEL_SYNC_CHECKPOINT_EVERY (tzel-infra PR #55, before its merge — otherwise the daemon ships a hardcoded K=50 contract). - friendlyError pattern for the validation error once the env var exposes it to non-CLI users. - Re-tune K=250-500 once the design doc's 2.A (concurrent fetch) lands; at the post-2.A wallclock, K=50 means ~7 fsyncs/sec, which is non-trivial sustained. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
tzel-wallet syncscans[scanned, tree_size)sequentially and persistswallet.json.scannedonly at the end. While running it holds an exclusivelock on
wallet.json.lock, so every other wallet op (shield, transfer,unshield, balance, sync, …) blocks for the full duration. On a long-running
rollup that can be 5–45 min, which is unacceptable in daemon-driven flows
where users want to act now.
Design
Cooperative yield via a sentinel file (
<wallet>.yield):cmd_rollup_syncnow scans in batches ofCHECKPOINT_EVERY = 50commits, each batch ending in asave_walletthat records
w.scanned = batch_endplus any newly-recovered notes.The atomic-rename invariant in
save_wallet(temp-file + fsync +rename + parent-dir fsync) means a
kill -9between fsync and renameleaves the previous wallet.json intact: at most one batch (~50 commits,
~2 s of scan work) is lost, never corrupted.
sync stat()s
<wallet>.yield. Present ⇒ exit 0 cleanly. The next syncresumes from the persisted cursor. The companion change in
trilitech/tzel-infra teaches the daemon to
touchthe sentinel beforeadmitting a slow CLI op and to remove it after, so a long initial sync
no longer starves slow-lane ops (shield/transfer/unshield/withdraw).
run once over the cumulative cm set accumulated across batches —
preserves the existing
apply_scan_feedsemantics exactly. (Theexisting helper is split into
apply_scan_feed_recover_batchandapply_scan_feed_finalize; legacyapply_scan_feedis kept forin-tree tests and back-compat.)
Choice of K = 50
50 is the empirical sweet spot. Don't drop below ~10 (overhead dominates)
or push past ~500 (the loss becomes user-visible during a long initial
sync). Documented inline.
Failure-mode analysis
batch lost. Confirmed by the existing atomic-rename invariant — no
change to
save_wallet, just called more often.wis dropped, prioron-disk state is the last committed checkpoint. Same as above.
checkpoint catches it. Worst case: one extra batch runs (~2 s).
NOT cause spurious exits. Correctness comes from the daemon's
best-effort sentinel removal post-op.
Tests
Three new tests in
network_profile_tests:cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present— pre-create the sentinel; assert sync stops after exactly one
checkpoint and does NOT remove the caller's sentinel.
cmd_rollup_sync_resumes_from_checkpointed_cursor— drive thesentinel via the
set_sync_checkpoint_hooktest hook, run synctwice. Notes before the yield boundary are recovered in run 1;
notes after are recovered in run 2.
cmd_rollup_sync_intermediate_wallet_is_always_parseable—concurrent reader thread parses
wallet.jsonwhile the scan runs;every read must serde-parse (no half-written file).
Result:
cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib→ 102 passed (99 baseline + 3 new), 0 failed.
Companion change
trilitech/tzel-infra branch
feat/sync-cooperative-yield(PR linkedonce opened). The daemon dispatcher there wraps slow-lane task
execution with a touch + RAII-removed sentinel, so a long-running
sync yields the lock for shield/transfer/unshield/withdraw and
resumes after.
🤖 Generated with Claude Code