wallet: cooperative-yield sync (incremental persist + sentinel)#24

Open
saroupille wants to merge 3 commits into trilitech:main from saroupille:feat/sync-cooperative-yield

Conversation

@saroupille
Collaborator

Problem

tzel-wallet sync scans [scanned, tree_size) sequentially and persists
wallet.json.scanned only at the end. While running it holds an exclusive
lock on wallet.json.lock, so every other wallet op (shield, transfer,
unshield, balance, sync, …) blocks for the full duration. On a long-running
rollup that can be 5–45 min, which is unacceptable in daemon-driven flows
where users want to act now.

Design

Cooperative yield via a sentinel file (<wallet>.yield):

  1. Incremental persist. cmd_rollup_sync now scans in batches of
    CHECKPOINT_EVERY = 50 commits, each batch ending in a save_wallet
    that records w.scanned = batch_end plus any newly-recovered notes.
    The atomic-rename invariant in save_wallet (temp-file + fsync +
    rename + parent-dir fsync) means a kill -9 between fsync and rename
    leaves the previous wallet.json intact: at most one batch (~50 commits,
    ~2 s of scan work) is lost, never corrupted.
  2. Sentinel check at every checkpoint. After persisting a batch, the
    sync stat()s <wallet>.yield. Present ⇒ exit 0 cleanly. The next sync
    resumes from the persisted cursor. The companion change in
    trilitech/tzel-infra teaches the daemon to touch the sentinel before
    admitting a slow CLI op and to remove it after, so a long initial sync
    no longer starves slow-lane ops (shield/transfer/unshield/withdraw).
  3. Final flush at end-of-loop. Nullifier prune + pending-pool prune
    run once over the cumulative cm set accumulated across batches —
    preserves the existing apply_scan_feed semantics exactly. (The
    existing helper is split into apply_scan_feed_recover_batch and
    apply_scan_feed_finalize; legacy apply_scan_feed is kept for
    in-tree tests and back-compat.)
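The shape of steps (1) and (2) can be sketched as follows. This is a hypothetical distillation, not the PR's code: only `CHECKPOINT_EVERY`, the persisted `scanned` cursor, and the `.yield` sentinel come from the description above; `yield_requested`, `sync_in_batches`, and the exact sentinel filename derivation are invented for illustration, and the fetch/persist steps are elided.

```rust
use std::path::Path;

const CHECKPOINT_EVERY: usize = 50;

// Hypothetical: checks for the sentinel next to the wallet file.
// (Exact naming of `<wallet>.yield` relative to wallet.json is an assumption.)
fn yield_requested(wallet_path: &Path) -> bool {
    wallet_path.with_extension("yield").exists()
}

// Hypothetical sketch of the checkpointed scan loop.
fn sync_in_batches(mut scanned: usize, tree_size: usize, wallet_path: &Path) -> usize {
    while scanned < tree_size {
        let batch_end = scanned.saturating_add(CHECKPOINT_EVERY).min(tree_size);
        // ... fetch commits in [scanned, batch_end), recover notes ...
        scanned = batch_end;
        // ... save_wallet: atomically persist w.scanned = batch_end ...
        if yield_requested(wallet_path) {
            break; // exit 0; the next sync resumes from the persisted cursor
        }
    }
    scanned
}
```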

Choice of K = 50

| K | persist overhead (1 `save_wallet` / K HTTP fetches) | loss on kill -9, per batch | sentinel cost |
|-----|--------|-----------|---------------------|
| 1   | ~33%   | <1 commit | sub-ms (negligible) |
| 10  | ~6%    | <1 s      | sub-ms              |
| 50  | ~0.7%  | ~2 s      | sub-ms              |
| 200 | ~0.2%  | ~10 s     | sub-ms              |
| 500 | <0.1%  | ~25 s     | sub-ms              |

50 is the sweet spot in this analysis (reasoned rather than measured; the
follow-up commits spell out that status). Don't drop below ~10 (persist
overhead dominates) or push past ~500 (the loss on interrupt becomes
user-visible during a long initial sync). Documented inline.

Failure-mode analysis

  • kill -9 between fsync and rename: previous wallet.json wins; one
    batch lost. Confirmed by the existing atomic-rename invariant — no
    change to save_wallet, just called more often.
  • kill -9 mid-batch HTTP fetch: the in-RAM w is dropped, prior
    on-disk state is the last committed checkpoint. Same as above.
  • Sentinel races a checkpoint boundary: if the sentinel is created
    just after a checkpoint's stat(), the next checkpoint catches it.
    Worst case: one extra batch runs (~2 s).
  • fs error on stat(): treated as "not present" so a flaky FS does
    NOT cause spurious exits. Correctness comes from the daemon's
    best-effort sentinel removal post-op.
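The "fs error ⇒ not present" policy from the last bullet reduces to a one-line check. A minimal sketch (the function name is hypothetical; the real call site is not shown in this PR):

```rust
use std::path::Path;

// Any stat() failure — NotFound, EACCES, EIO from a flaky mount — reads
// as "no yield requested", so a broken filesystem cannot cause spurious
// mid-sync exits. Correctness still holds because the daemon removes the
// sentinel best-effort after each slow op.
fn sentinel_present(path: &Path) -> bool {
    std::fs::metadata(path).is_ok()
}
```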

Tests

Three new tests in network_profile_tests:

  1. cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present
    — pre-create the sentinel; assert sync stops after exactly one
    checkpoint and does NOT remove the caller's sentinel.
  2. cmd_rollup_sync_resumes_from_checkpointed_cursor — drive the
    sentinel via the set_sync_checkpoint_hook test hook, run sync
    twice. Notes before the yield boundary are recovered in run 1;
    notes after are recovered in run 2.
  3. cmd_rollup_sync_intermediate_wallet_is_always_parseable
    — a concurrent reader thread parses wallet.json while the scan runs;
    every read must serde-parse (no half-written file).
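The shape of test (3) can be reproduced with std only. In this sketch the names are invented, the "parse" is stubbed down to a complete-JSON-object check (the real test serde-parses), and the writer imitates the atomic temp-file + rename persist:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::thread;

// A reader re-reads the wallet file in a loop while the writer rewrites it
// atomically. Because rename() replaces the file atomically, the reader
// must always observe a complete document, never a partial write.
fn run_reader_writer_demo() -> bool {
    let dir = std::env::temp_dir();
    let path = dir.join("gr_demo_wallet.json");
    let tmp = dir.join("gr_demo_wallet.json.tmp");
    fs::write(&path, "{\"scanned\":0}").unwrap();
    let reader_path = path.clone();
    let reader = thread::spawn(move || {
        (0..200).all(|_| {
            let s = fs::read_to_string(&reader_path).unwrap();
            s.starts_with('{') && s.ends_with('}') // never half-written
        })
    });
    for i in 0..200 {
        let mut f = File::create(&tmp).unwrap();
        write!(f, "{{\"scanned\":{i}}}").unwrap();
        f.sync_all().unwrap();
        fs::rename(&tmp, &path).unwrap(); // atomic replace
    }
    reader.join().unwrap()
}
```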

Result: cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib
→ 102 passed (99 baseline + 3 new), 0 failed.

Companion change

trilitech/tzel-infra branch feat/sync-cooperative-yield (PR linked
once opened). The daemon dispatcher there wraps slow-lane task
execution with a touch + RAII-removed sentinel, so a long-running
sync yields the lock for shield/transfer/unshield/withdraw and
resumes after.

🤖 Generated with Claude Code

Tzel-infra wallet team and others added 3 commits May 9, 2026 08:43
`tzel-wallet sync` previously held the wallet lock for the full duration
of `[scanned, tree_size)`, persisting only at the end. On a long-running
rollup that's 5-45 min during which every other wallet op (shield,
transfer, unshield, balance, …) blocks. Unacceptable for the
daemon-driven UI flow where users want to act now.

This rewrite:

  * Pins head + tree_size once, scans in batches of 50 commits
    (CHECKPOINT_EVERY constant; ~0.7% persist overhead, ~2 s of
    re-do work on kill -9 between checkpoints).
  * Each batch ends with `save_wallet` — atomic temp-file +
    fsync + rename + parent-dir fsync, so an unclean shutdown
    leaves a coherent wallet.json one batch behind the cursor,
    never mid-write.
  * After each batch, stat()s `<wallet>.yield`. If present, exits
    cleanly (exit 0). The next sync resumes from the persisted
    cursor. The companion change in tzel-infra teaches the daemon
    to touch this sentinel before submitting a slow CLI op
    (shield/transfer/unshield/withdraw) and remove it after, so
    a long initial sync no longer starves slow-lane ops.
  * Final flush after the loop runs the nullifier + pending-pool
    prune once against the cumulative cm set seen across all
    batches — preserves the existing `apply_scan_feed` semantics
    exactly.
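The atomic persist sequence the second bullet relies on looks roughly like this. This is a sketch under assumptions: `save_atomically` is an invented name (the real `save_wallet` is pre-existing code not shown in this PR), and the temp-file naming is illustrative:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// temp-file + fsync + rename + parent-dir fsync: an unclean shutdown at
// any point leaves either the old or the new file, never a partial one.
fn save_atomically(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("json.tmp"); // e.g. wallet.json.tmp
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // fsync: data durable before it becomes visible
    fs::rename(&tmp, path)?; // atomic swap: readers see old or new, never partial
    if let Some(dir) = path.parent() {
        // fsync the parent directory so the rename itself survives a crash
        // (Unix-specific; opening a directory like this fails on Windows).
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```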

Tests added in `network_profile_tests`:

  * `cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present`
    — pre-create the sentinel, assert sync stops at exactly one
    checkpoint boundary and never removes the caller's sentinel.
  * `cmd_rollup_sync_resumes_from_checkpointed_cursor` — set the
    sentinel via the test hook mid-scan, run twice, assert notes
    before/after the yield boundary are recovered in the right run.
  * `cmd_rollup_sync_intermediate_wallet_is_always_parseable` —
    concurrent reader thread parses wallet.json while the scan
    runs; every read must serde-decode (no half-written file).

Total: 102 lib tests pass (99 baseline + 3 new), 0 regressions.

Companion change: trilitech/tzel-infra `feat/sync-cooperative-yield`
which teaches the daemon's slow-lane dispatcher to touch + remove the
sentinel around each slow op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version of this PR baked the cooperative-yield checkpoint
granularity as a private const `CHECKPOINT_EVERY = 50`. That value was
chosen by reasoning (HTTP-bound work / save_wallet overhead /
loss-on-interrupt), not by measurement.

Expose it as `tzel-wallet sync --checkpoint-every N`:

- Default 50 (renamed const → DEFAULT_CHECKPOINT_EVERY) — same
  behaviour as before for callers that don't pass the flag.
- Reject 0 with a clear error (would loop forever without progress).
- Docstring spells out the empirical-vs-reasoned status: numbers come
  from `services/scan-bench/` POC + the design doc in tzel-infra, but
  pending live measurement against octez-smart-rollup-node.
- Recommends the safe range (~10..500) with explicit costs at each
  edge so a user tuning down (faster preemption) or up (less overhead)
  understands the trade-off.

Wired through `cmd_rollup_sync` and `cmd_rollup_sync_watch`. The dispatch
site casts u64 → usize (saturating on 32-bit, which is fine because
the inner loop bounds the batch via `min(.., tree_size)` regardless).
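The reject-0 validation plus the u64 → usize cast reduce to a few lines. A hypothetical distillation (the function name and error wording are invented; the actual clap wiring lives in the diff):

```rust
// Validates --checkpoint-every and narrows it to usize.
fn parse_checkpoint_every(n: u64) -> Result<usize, String> {
    if n == 0 {
        // K = 0 would make batch_end == scanned: an infinite loop, no progress.
        return Err("--checkpoint-every must be at least 1".into());
    }
    // Saturates on 32-bit targets; harmless because the inner loop clamps
    // the batch end with min(.., tree_size) anyway.
    Ok(usize::try_from(n).unwrap_or(usize::MAX))
}
```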

Tests: 1 new unit test `cmd_rollup_sync_refuses_zero_checkpoint_every`
locking the validation path. Existing 3 cooperative-yield tests
updated to pass `DEFAULT_CHECKPOINT_EVERY` explicitly. Full suite:
103/103 pass (102 → 103 with the new test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent reviewers on PR trilitech#24 surfaced one real bug, two test
gaps, and three docstring follow-ups. All applied:

**Real bug (Reviewer A)** — `w.scanned + checkpoint_every` at
lib.rs:7721 used unchecked addition. With an adversarial `--checkpoint-every`
(close to `usize::MAX` on 64-bit), that would panic in debug / wrap
in release. Switched to `w.scanned.saturating_add(checkpoint_every)
.min(tree_size)`. The atomic-rename invariant on `save_wallet` means
a wrap couldn't have corrupted on-disk state, but the sync would have
silently misbehaved with a tiny degenerate batch. Belt-and-braces.
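The one-line fix, shown in isolation (names as in the text above; the surrounding loop is elided):

```rust
// Previously `scanned + checkpoint_every`: a panic in debug builds, or a
// wrap to a tiny degenerate batch in release, when checkpoint_every is
// adversarially large. saturating_add caps the sum at usize::MAX and the
// min() clamp bounds the batch at the tree size.
fn batch_end(scanned: usize, checkpoint_every: usize, tree_size: usize) -> usize {
    scanned.saturating_add(checkpoint_every).min(tree_size)
}
```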

**Test gaps**:
- All four cooperative-yield tests share the process-global
  `SYNC_CHECKPOINT_HOOK` (a `OnceLock<Mutex<...>>`). cargo-test runs
  in-binary tests in parallel; without serialization, one test's hook
  could fire under another test's loop. Added `serial_test` dev-dep
  and `#[serial(sync_checkpoint_hook)]` on all four (now five).
- All four tests hardcoded `DEFAULT_CHECKPOINT_EVERY` (50). A future
  regression that ignored the runtime parameter and used the const
  internally would still pass. Added
  `cmd_rollup_sync_respects_custom_checkpoint_every`, exercising K=7
  with a sentinel trip at K*2; this locks the contract that the flag
  actually does something.

**Docstring follow-ups (Reviewer B)**:
- The previous text claimed `services/scan-bench/` of tzel-infra
  informed the K=50 default. False: scan-bench measures concurrent-
  fetch tolerance (the orthogonal N dial in design doc §2.A); the K=50
  numbers come from a `save_wallet` micro-bench that doesn't exist
  yet. Tightened to call out the dial separation + acknowledge the
  pending bench.
- Added the operator-vs-laptop calibration sentence: raise to 100-200
  on slow / synchronously-replicated storage; drop to 20-30 on fast
  operator boxes when sub-second preemption matters.
- Added a one-liner about `--watch` × `--checkpoint-every`
  composition (each polling iteration honours the value
  independently).

**Env var (Reviewer B F4)** — added `env = "TZEL_SYNC_CHECKPOINT_EVERY"`
to the clap arg, mirroring the precedent the design doc set for
`TZEL_SYNC_CONCURRENCY` (doc §4 line 187). Operator-deployed daemons
can now override K without recompiling. Required `clap` "env"
feature, added.

Tests: 104/104 pass (was 103, +1 with the new K=7 test). No
regressions to the 99 baseline tests.

Deferred to follow-up PRs (out of scope here):
- Daemon-side runner.rs pass-through of TZEL_SYNC_CHECKPOINT_EVERY
  (tzel-infra PR #55, before its merge — otherwise the daemon ships a
  hardcoded K=50 contract).
- friendlyError pattern for the validation error once the env var
  exposes it to non-CLI users.
- Re-tune K=250-500 once the design doc's 2.A (concurrent fetch)
  lands; at the post-2.A wallclock, K=50 means ~7 fsyncs/sec, which
  is non-trivial sustained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>