wallet: cooperative-yield sync (incremental persist + sentinel)#24

Open
saroupille wants to merge 3 commits into trilitech:main from saroupille:feat/sync-cooperative-yield

Conversation

@saroupille
Collaborator

Problem

tzel-wallet sync scans [scanned, tree_size) sequentially and persists
wallet.json.scanned only at the end. While running it holds an exclusive
lock on wallet.json.lock, so every other wallet op (shield, transfer,
unshield, balance, sync, …) blocks for the full duration. On a long-running
rollup that can be 5–45 min, which is unacceptable in daemon-driven flows
where users want to act now.

Design

Cooperative yield via a sentinel file (<wallet>.yield):

  1. Incremental persist. cmd_rollup_sync now scans in batches of
    CHECKPOINT_EVERY = 50 commits, each batch ending in a save_wallet
    that records w.scanned = batch_end plus any newly-recovered notes.
    The atomic-rename invariant in save_wallet (temp-file + fsync +
    rename + parent-dir fsync) means a kill -9 between fsync and rename
    leaves the previous wallet.json intact: at most one batch (~50 commits,
    ~2 s of scan work) is lost, never corrupted.
  2. Sentinel check at every checkpoint. After persisting a batch, the
    sync stat()s <wallet>.yield. Present ⇒ exit 0 cleanly. The next sync
    resumes from the persisted cursor. The companion change in
    trilitech/tzel-infra teaches the daemon to touch the sentinel before
    admitting a slow CLI op and to remove it after, so a long initial sync
    no longer starves slow-lane ops (shield/transfer/unshield/withdraw).
  3. Final flush at end-of-loop. Nullifier prune + pending-pool prune
    run once over the cumulative cm set accumulated across batches —
    preserves the existing apply_scan_feed semantics exactly. (The
    existing helper is split into apply_scan_feed_recover_batch and
    apply_scan_feed_finalize; legacy apply_scan_feed is kept for
    in-tree tests and back-compat.)
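The shape of steps (1) and (2) can be sketched as follows. This is a hypothetical distillation, not the PR's code: only `CHECKPOINT_EVERY`, the persisted `scanned` cursor, and the `.yield` sentinel come from the description above; `yield_requested`, `sync_in_batches`, and the exact sentinel filename derivation are invented for illustration, and the fetch/persist steps are elided.

```rust
use std::path::Path;

const CHECKPOINT_EVERY: usize = 50;

// Hypothetical: checks for the sentinel next to the wallet file.
// (Exact naming of `<wallet>.yield` relative to wallet.json is an assumption.)
fn yield_requested(wallet_path: &Path) -> bool {
    wallet_path.with_extension("yield").exists()
}

// Hypothetical sketch of the checkpointed scan loop.
fn sync_in_batches(mut scanned: usize, tree_size: usize, wallet_path: &Path) -> usize {
    while scanned < tree_size {
        let batch_end = scanned.saturating_add(CHECKPOINT_EVERY).min(tree_size);
        // ... fetch commits in [scanned, batch_end), recover notes ...
        scanned = batch_end;
        // ... save_wallet: atomically persist w.scanned = batch_end ...
        if yield_requested(wallet_path) {
            break; // exit 0; the next sync resumes from the persisted cursor
        }
    }
    scanned
}
```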

Choice of K = 50

| K | persist overhead (1 `save_wallet` / K HTTP fetches) | loss on kill -9, per batch | sentinel cost |
|-----|--------|-----------|---------------------|
| 1   | ~33%   | <1 commit | sub-ms (negligible) |
| 10  | ~6%    | <1 s      | sub-ms              |
| 50  | ~0.7%  | ~2 s      | sub-ms              |
| 200 | ~0.2%  | ~10 s     | sub-ms              |
| 500 | <0.1%  | ~25 s     | sub-ms              |

50 is the sweet spot in this analysis (reasoned rather than measured; the
follow-up commits spell out that status). Don't drop below ~10 (persist
overhead dominates) or push past ~500 (the loss on interrupt becomes
user-visible during a long initial sync). Documented inline.

Failure-mode analysis

  • kill -9 between fsync and rename: previous wallet.json wins; one
    batch lost. Confirmed by the existing atomic-rename invariant — no
    change to save_wallet, just called more often.
  • kill -9 mid-batch HTTP fetch: the in-RAM w is dropped, prior
    on-disk state is the last committed checkpoint. Same as above.
  • Sentinel races a checkpoint boundary: if the sentinel is created
    just after a checkpoint's stat(), the next checkpoint catches it.
    Worst case: one extra batch runs (~2 s).
  • fs error on stat(): treated as "not present" so a flaky FS does
    NOT cause spurious exits. Correctness comes from the daemon's
    best-effort sentinel removal post-op.
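The "fs error ⇒ not present" policy from the last bullet reduces to a one-line check. A minimal sketch (the function name is hypothetical; the real call site is not shown in this PR):

```rust
use std::path::Path;

// Any stat() failure — NotFound, EACCES, EIO from a flaky mount — reads
// as "no yield requested", so a broken filesystem cannot cause spurious
// mid-sync exits. Correctness still holds because the daemon removes the
// sentinel best-effort after each slow op.
fn sentinel_present(path: &Path) -> bool {
    std::fs::metadata(path).is_ok()
}
```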

Tests

Three new tests in network_profile_tests:

  1. cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present
    — pre-create the sentinel; assert sync stops after exactly one
    checkpoint and does NOT remove the caller's sentinel.
  2. cmd_rollup_sync_resumes_from_checkpointed_cursor — drive the
    sentinel via the set_sync_checkpoint_hook test hook, run sync
    twice. Notes before the yield boundary are recovered in run 1;
    notes after are recovered in run 2.
  3. cmd_rollup_sync_intermediate_wallet_is_always_parseable
    — a concurrent reader thread parses wallet.json while the scan runs;
    every read must serde-parse (no half-written file).
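The shape of test (3) can be reproduced with std only. In this sketch the names are invented, the "parse" is stubbed down to a complete-JSON-object check (the real test serde-parses), and the writer imitates the atomic temp-file + rename persist:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::thread;

// A reader re-reads the wallet file in a loop while the writer rewrites it
// atomically. Because rename() replaces the file atomically, the reader
// must always observe a complete document, never a partial write.
fn run_reader_writer_demo() -> bool {
    let dir = std::env::temp_dir();
    let path = dir.join("gr_demo_wallet.json");
    let tmp = dir.join("gr_demo_wallet.json.tmp");
    fs::write(&path, "{\"scanned\":0}").unwrap();
    let reader_path = path.clone();
    let reader = thread::spawn(move || {
        (0..200).all(|_| {
            let s = fs::read_to_string(&reader_path).unwrap();
            s.starts_with('{') && s.ends_with('}') // never half-written
        })
    });
    for i in 0..200 {
        let mut f = File::create(&tmp).unwrap();
        write!(f, "{{\"scanned\":{i}}}").unwrap();
        f.sync_all().unwrap();
        fs::rename(&tmp, &path).unwrap(); // atomic replace
    }
    reader.join().unwrap()
}
```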

Result: cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib
→ 102 passed (99 baseline + 3 new), 0 failed.

Companion change

trilitech/tzel-infra branch feat/sync-cooperative-yield (PR linked
once opened). The daemon dispatcher there wraps slow-lane task
execution with a touch + RAII-removed sentinel, so a long-running
sync yields the lock for shield/transfer/unshield/withdraw and
resumes after.

🤖 Generated with Claude Code

Tzel-infra wallet team and others added 3 commits May 9, 2026 08:43
`tzel-wallet sync` previously held the wallet lock for the full duration
of `[scanned, tree_size)`, persisting only at the end. On a long-running
rollup that's 5-45 min during which every other wallet op (shield,
transfer, unshield, balance, …) blocks. Unacceptable for the
daemon-driven UI flow where users want to act now.

This rewrite:

  * Pins head + tree_size once, scans in batches of 50 commits
    (CHECKPOINT_EVERY constant; ~0.7% persist overhead, ~2 s of
    re-do work on kill -9 between checkpoints).
  * Each batch ends with `save_wallet` — atomic temp-file +
    fsync + rename + parent-dir fsync, so an unclean shutdown
    leaves a coherent wallet.json one batch behind the cursor,
    never mid-write.
  * After each batch, stat()s `<wallet>.yield`. If present, exits
    cleanly (exit 0). The next sync resumes from the persisted
    cursor. The companion change in tzel-infra teaches the daemon
    to touch this sentinel before submitting a slow CLI op
    (shield/transfer/unshield/withdraw) and remove it after, so
    a long initial sync no longer starves slow-lane ops.
  * Final flush after the loop runs the nullifier + pending-pool
    prune once against the cumulative cm set seen across all
    batches — preserves the existing `apply_scan_feed` semantics
    exactly.
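The atomic persist sequence the second bullet relies on looks roughly like this. This is a sketch under assumptions: `save_atomically` is an invented name (the real `save_wallet` is pre-existing code not shown in this PR), and the temp-file naming is illustrative:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// temp-file + fsync + rename + parent-dir fsync: an unclean shutdown at
// any point leaves either the old or the new file, never a partial one.
fn save_atomically(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("json.tmp"); // e.g. wallet.json.tmp
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // fsync: data durable before it becomes visible
    fs::rename(&tmp, path)?; // atomic swap: readers see old or new, never partial
    if let Some(dir) = path.parent() {
        // fsync the parent directory so the rename itself survives a crash
        // (Unix-specific; opening a directory like this fails on Windows).
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```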

Tests added in `network_profile_tests`:

  * `cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present`
    — pre-create the sentinel, assert sync stops at exactly one
    checkpoint boundary and never removes the caller's sentinel.
  * `cmd_rollup_sync_resumes_from_checkpointed_cursor` — set the
    sentinel via the test hook mid-scan, run twice, assert notes
    before/after the yield boundary are recovered in the right run.
  * `cmd_rollup_sync_intermediate_wallet_is_always_parseable` —
    concurrent reader thread parses wallet.json while the scan
    runs; every read must serde-decode (no half-written file).

Total: 102 lib tests pass (99 baseline + 3 new), 0 regressions.

Companion change: trilitech/tzel-infra `feat/sync-cooperative-yield`
which teaches the daemon's slow-lane dispatcher to touch + remove the
sentinel around each slow op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version of this PR baked the cooperative-yield checkpoint
granularity as a private const `CHECKPOINT_EVERY = 50`. That value was
chosen by reasoning (HTTP-bound work / save_wallet overhead /
loss-on-interrupt), not by measurement.

Expose it as `tzel-wallet sync --checkpoint-every N`:

- Default 50 (renamed const → DEFAULT_CHECKPOINT_EVERY) — same
  behaviour as before for callers that don't pass the flag.
- Reject 0 with a clear error (would loop forever without progress).
- Docstring spells out the empirical-vs-reasoned status: numbers come
  from `services/scan-bench/` POC + the design doc in tzel-infra, but
  pending live measurement against octez-smart-rollup-node.
- Recommends the safe range (~10..500) with explicit costs at each
  edge so a user tuning down (faster preemption) or up (less overhead)
  understands the trade-off.

Wired through `cmd_rollup_sync` and `cmd_rollup_sync_watch`. The dispatch
site casts u64 → usize (saturating on 32-bit, which is fine because
the inner loop bounds the batch via `min(.., tree_size)` regardless).
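The reject-0 validation plus the u64 → usize cast reduce to a few lines. A hypothetical distillation (the function name and error wording are invented; the actual clap wiring lives in the diff):

```rust
// Validates --checkpoint-every and narrows it to usize.
fn parse_checkpoint_every(n: u64) -> Result<usize, String> {
    if n == 0 {
        // K = 0 would make batch_end == scanned: an infinite loop, no progress.
        return Err("--checkpoint-every must be at least 1".into());
    }
    // Saturates on 32-bit targets; harmless because the inner loop clamps
    // the batch end with min(.., tree_size) anyway.
    Ok(usize::try_from(n).unwrap_or(usize::MAX))
}
```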

Tests: 1 new unit test `cmd_rollup_sync_refuses_zero_checkpoint_every`
locking the validation path. Existing 3 cooperative-yield tests
updated to pass `DEFAULT_CHECKPOINT_EVERY` explicitly. Full suite:
103/103 pass (102 → 103 with the new test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent reviewers on PR trilitech#24 surfaced one real bug, two test
gaps, and three docstring follow-ups. All applied:

**Real bug (Reviewer A)** — `w.scanned + checkpoint_every` at
lib.rs:7721 used unchecked addition. With an adversarial `--checkpoint-every`
(close to `usize::MAX` on 64-bit), that would panic in debug / wrap
in release. Switched to `w.scanned.saturating_add(checkpoint_every)
.min(tree_size)`. The atomic-rename invariant on `save_wallet` means
a wrap couldn't have corrupted on-disk state, but the sync would have
silently misbehaved with a tiny degenerate batch. Belt-and-braces.
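The one-line fix, shown in isolation (names as in the text above; the surrounding loop is elided):

```rust
// Previously `scanned + checkpoint_every`: a panic in debug builds, or a
// wrap to a tiny degenerate batch in release, when checkpoint_every is
// adversarially large. saturating_add caps the sum at usize::MAX and the
// min() clamp bounds the batch at the tree size.
fn batch_end(scanned: usize, checkpoint_every: usize, tree_size: usize) -> usize {
    scanned.saturating_add(checkpoint_every).min(tree_size)
}
```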

**Test gaps**:
- All four cooperative-yield tests share the process-global
  `SYNC_CHECKPOINT_HOOK` (a `OnceLock<Mutex<...>>`). cargo-test runs
  in-binary tests in parallel; without serialization, one test's hook
  could fire under another test's loop. Added `serial_test` dev-dep
  and `#[serial(sync_checkpoint_hook)]` on all four (now five).
- All four tests hardcoded `DEFAULT_CHECKPOINT_EVERY` (50). A future
  regression that ignored the runtime parameter and used the const
  internally would still pass. Added
  `cmd_rollup_sync_respects_custom_checkpoint_every`, exercising K=7
  with a sentinel trip at K*2; this locks the contract that the flag
  actually does something.

**Docstring follow-ups (Reviewer B)**:
- The previous text claimed `services/scan-bench/` of tzel-infra
  informed the K=50 default. False: scan-bench measures concurrent-
  fetch tolerance (the orthogonal N dial in design doc §2.A); the K=50
  numbers come from a `save_wallet` micro-bench that doesn't exist
  yet. Tightened to call out the dial separation + acknowledge the
  pending bench.
- Added the operator-vs-laptop calibration sentence: raise to 100-200
  on slow / synchronously-replicated storage; drop to 20-30 on fast
  operator boxes when sub-second preemption matters.
- Added a one-liner about `--watch` × `--checkpoint-every`
  composition (each polling iteration honours the value
  independently).

**Env var (Reviewer B F4)** — added `env = "TZEL_SYNC_CHECKPOINT_EVERY"`
to the clap arg, mirroring the precedent the design doc set for
`TZEL_SYNC_CONCURRENCY` (doc §4 line 187). Operator-deployed daemons
can now override K without recompiling. Required `clap` "env"
feature, added.

Tests: 104/104 pass (was 103, +1 with the new K=7 test). No
regressions to the 99 baseline tests.

Deferred to follow-up PRs (out of scope here):
- Daemon-side runner.rs pass-through of TZEL_SYNC_CHECKPOINT_EVERY
  (tzel-infra PR #55, before its merge — otherwise the daemon ships a
  hardcoded K=50 contract).
- friendlyError pattern for the validation error once the env var
  exposes it to non-CLI users.
- Re-tune K=250-500 once the design doc's 2.A (concurrent fetch)
  lands; at the post-2.A wallclock, K=50 means ~7 fsyncs/sec, which
  is non-trivial sustained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>