Skip to content

feat(client): resumable single-node (non-merkle) uploads (per-wave cached proofs)#88

Open
grumbach wants to merge 5 commits into
WithAutonomi:mainfrom
grumbach:grumbach/resumable-single-payment
Open

feat(client): resumable single-node (non-merkle) uploads (per-wave cached proofs)#88
grumbach wants to merge 5 commits into
WithAutonomi:mainfrom
grumbach:grumbach/resumable-single-payment

Conversation

@grumbach
Copy link
Copy Markdown
Contributor

Summary

Mirrors PR #84 (resumable merkle) for the regular (single-node, non-merkle) payment path. Persists each wave's per-chunk PaymentProof bytes to disk after batch_pay confirms, before the wave's PUT phase. If the upload dies mid-file (network flake, slow close-K, client crash, Ctrl-C), the next attempt loads the cached proofs and skips quote + pay for any chunk whose address matches the current encryption.

Asked by Nic on the PR #84 follow-up: "is this already in for regular payments?" — answer was no, this PR fixes that.

Why

Single-node uploads break the file into payment waves. Each wave is one EVM transaction that produces N per-chunk payment proofs. Before this change, those proofs lived only in process memory. A partial-upload failure meant every wave already paid for was unrecoverable — the user had to re-quote and re-pay every chunk on the next attempt.

Live merkle uploads on prod last week burned ~2.78 ANT on a single failed 730 MB attempt because the merkle path had the same disease; PR #84 fixed it. This PR closes the parallel hole for regular uploads, which is what most files <64 chunks use.

Design

New ant-core/src/data/client/cached_single module:

  • try_append_wave(file_path, new_proofs, storage_cost, gas_cost) — called once per successfully paid wave, before the PUT phase. Adds the wave's (addr, proof_bytes) entries to the on-disk receipt and updates cumulative cost figures. The whole receipt is rewritten on each append (bounded by chunk count).
  • try_load_for_file(file_path) — called once at the top of the upload. Returns the cached receipt (if any).
  • try_delete_for_file(file_path) — called after full-file success.
  • cleanup_outdated() — opportunistic GC on every load.

Storage layout: <data_dir>/payments/single/<ts>_<file_hash>. The single/ subdirectory keeps this cache from colliding on filename with cached_merkle (<data_dir>/payments/).

Expiry: 24 h, matching QUOTE_MAX_AGE_SECS in ant-node. After that, storers reject the cached proof even if the file is otherwise resumable, so keeping the cache wouldn't help.

The wire-up:

  • batch_upload_chunks_with_events gains a resume_key: Option<&str> parameter. None (used by direct callers without a stable file identity) opts out of caching.
  • Inside the wave loop, prepared chunks are split into "already paid in a previous attempt" (PaidChunk built from cached proof + freshly-quoted peers) and "needs payment" (sent to batch_pay). Cached chunks bypass the EVM entirely.
  • After every successful batch_pay, the newly paid wave's proofs are appended to the on-disk receipt.
  • file.rs::upload_with_options deletes the cache after full file success on both the merkle-fallback path and the regular path.

Failure-mode tolerance

All public-facing API (try_* variants) swallows IO and serialization errors with a warn! log. A busted cache directory must never prevent a real upload from running. At worst, the user re-pays.

Tests

  • Unit: file_hash_key stability, expired/fresh filename detection, malformed filename safety, multi-wave roundtrip save → load → delete with cumulative cost summing.
  • cargo build --release --bin ant clean.

What this does NOT do

  • Cross-machine resume. The cache is per-user-per-machine. Out of scope.
  • Per-chunk progress tracking. On resume, the storer returns AlreadyExists cheaply for already-stored chunks, but the PUT request still goes over the wire. A future iteration could persist a stored: HashSet<XorName> alongside the receipt.
  • Resume after the 24 h QUOTE_MAX_AGE_SECS window. Cache files past that are GC'd automatically.

Note on caller surface

Client::batch_upload_chunks keeps its existing signature (no resume_key) — the public direct-batch API doesn't have a stable file identity to key on, so it stays opt-out by default. File-level callers (upload_with_options) pass the file path string as the key.

grumbach added 4 commits May 13, 2026 11:57
… from cached per-wave proofs)

Mirrors PR WithAutonomi#84's merkle-resume design for the regular payment path:
persist each wave's per-chunk PaymentProof bytes to disk after
batch_pay confirms, before the wave's PUT phase. If the upload dies
mid-file (network flake, slow close-K, client crash, Ctrl-C), the
next attempt loads the cached proofs and skips quote+pay for any
chunk whose address matches the current encryption.

Storage layout: <data_dir>/payments/single/<ts>_<file_hash> via the
new ant-core::data::client::cached_single module. Subdirectory keeps
single-node and merkle caches from colliding on filename. Receipts
expire after 24h to match QUOTE_MAX_AGE_SECS in ant-node.

Threaded a resume_key: Option<&str> through
batch_upload_chunks_with_events and upload_waves_single so callers
that don't have a stable file identity (the direct batch API at the
public surface) opt out by passing None.

Wave-level append rather than full rewrite per chunk so the cost
of caching scales with chunk count, not chunk count squared.

Failure-mode tolerance: every save/load/delete is wrapped in try_*
that logs but never bubbles up — a busted cache never blocks a real
upload.

Asked by Nic on PR WithAutonomi#84 follow-up.
Adversarial review of WithAutonomi#88 found a dozen money-loss paths in the
single-node resume cache. This is the consolidated fix.

Atomicity & concurrency
- write_receipt_atomic: <path>.tmp + BufWriter::into_inner check +
  flush + sync_all + rename + parent-dir fsync. Replaces the prior
  truncate-then-write that lost paid waves on crash or concurrent CLI.
- ReceiptLock: per-key fs2 exclusive lock on a .lock sidecar guards
  append/drop/delete. Two concurrent ant-file-upload invocations on
  the same path now serialize at the receipt boundary instead of
  last-writer-wins on the proof set.
- recover_orphaned_tmps: under the lock, recover the newest readable
  <ts>_<key>.tmp left by a crash between sync_all and rename. Unlink
  older .tmp siblings (their content is a subset by the load-extend-
  write invariant) and any corrupt .tmp.
- dedupe_canonical_receipts: pick newest canonical by ts-prefix and
  unlink older siblings. Prevents the non-deterministic resolution
  that first-match iteration would yield.

Stale-proof handling (no remote text trust)
- prune_locally_expired_proofs in batch.rs decodes each cached
  PaymentProof and drops entries whose quote.timestamp is past the
  storer-side QUOTE_MAX_AGE_SECS budget minus a 5-minute safety
  margin. Replaces the prior substring match on storer error text,
  which a Byzantine peer could spoof to force double-payment.
- drop_proofs_for_file now takes &[(addr, expected_bytes)] and does
  compare-and-swap on the on-disk bytes. A concurrent re-pay's fresh
  proof is never clobbered by a stale prune list computed earlier.

Schema, cost, and key stability
- SingleNodeReceipt gains a version: u8 field. read_receipt rejects
  versions above SCHEMA_VERSION = 1 as unreadable.
- storage_cost_atto now sums as ant_protocol::evm::Amount (U256)
  instead of u128, so very large uploads don't silently saturate.
- find_existing's unreadable arm unlinks the corrupt file instead of
  letting it occupy the directory for 24 h.
- file.rs canonicalizes the cache key (with display-string fallback)
  so ./foo and /abs/foo hit the same receipt.
- batch.rs seeds total_storage/total_gas from the loaded receipt so
  the returned tally reflects this-file-total, not just freshly-paid.
- delete_for_file also unlinks matching .tmp residue so a future
  upload of the same path can't resurrect a deleted receipt.

Tests
- 24 new tests in cached_single + 5 in batch covering atomicity, lock
  exclusion (2/8 reproducible regression without the lock), tmp
  recovery + dedupe, CAS-on-bytes drop, schema-version rejection,
  cost-overflow safety, canonical dedupe, unreadable auto-unlink,
  and concurrent drop+append consistency.
- proof_is_safely_fresh tests for the local pre-flight expiry check.

Verification: 280 lib tests pass, clippy -D warnings clean, fmt clean,
release build clean.
CI uses Rust 1.95 which promotes clippy::unnecessary_sort_by to deny;
local was 1.94 and didn't fire. Rewrite the two descending-sort calls
in recover_orphaned_tmps and dedupe_canonical_receipts to use
sort_by_key + std::cmp::Reverse (the lint's suggested form.
EOF
)
@Nic-dorman
Copy link
Copy Markdown
Contributor

Nic-dorman commented May 13, 2026

Code review

TL;DR

Strong, well-tested PR. The on-disk cache module is essentially production-grade: atomic writes, advisory locking, CAS-on-drop for TOCTOU, schema versioning, orphaned-tmp recovery, dedup of canonicals, U256 cumulative cost summing. Tests cover the actual failure modes (concurrent appends, torn .tmp, corrupt receipts, future schemas, CAS races). I'd merge it, with a couple of questions and a small list of follow-ups.

Correctness verification I ran

  • ant-protocol pinned at 93e63b8a exposes payment::deserialize_proof, ProofOfPayment.peer_quotes, and PaymentQuote { content, timestamp, price, rewards_address, pub_key, signature } exactly as the new tests use them — the test code in batch.rs:1056 will compile.
  • The 24h figure mirrors QUOTE_MAX_AGE_SECS in WithAutonomi/ant-node/src/payment/verifier.rs:48 (const QUOTE_MAX_AGE_SECS: u64 = 86_400;). Grounded.
  • Mirrors the design from feat(client): resumable merkle upload (auto-load cached payment receipt) #84 (resumable merkle), which is already merged.

Things I like

  • cached_single.rs:153-200prune_locally_expired_proofs is decision-pure. The previous merkle iteration trusted substring-matching on storer error text; that's spoofable by a Byzantine storer into a double-payment. Local decode + drop is the right pattern. The "future-dated quote treated as unsafe" branch is defensive and cheap (batch.rs:920-927).
  • cached_single.rs:148-189 — CAS-on-drop with the observed bytes. Without it, a stale-list computed at load time can clobber a fresh proof appended mid-prune. The dedicated test drop_proofs_skips_drop_if_bytes_have_changed (cached_single.rs:1186-1217) directly exercises the regression. This is the kind of thing that's invisible until it costs someone real money.
  • cached_single.rs:430-501 — Two-pass .tmp recovery + canonical dedup. Newest-readable wins, older .tmps unlinked, then dedupe_canonical_receipts ensures find_existing can't load a stale sibling. The hazard the comment calls out (filesystem-order iteration choosing the older canonical) is real and tested.
  • cached_single.rs:603-628write_receipt_atomic does the right sequence: tmp + flush + sync_all + rename + parent dir fsync. Holding the BufWriter in a named local and explicitly checking into_inner() for flush failure is correct — the prior pattern (inline BufWriter argument) does swallow flush errors. Worth keeping.
  • cached_single.rs:113-127 — U256 cumulative cost summing. Test wave_cost_above_u128_max_does_not_silently_drop_cumulative (cached_single.rs:1780) catches the saturation regression. Realistic for high-ANT pricing on multi-TB uploads.
  • batch.rs:541-570 — partition into needs_pay / cached_paid + persist immediately after batch_pay is the correct ordering: cache lands before PUT starts, so a crash mid-PUT is recoverable. Matches the docstring's promise.

Questions / things to confirm before merge

Q1. Storer acceptance of a stale proof against a re-quoted close group.
On resume, prepare_wave re-quotes each chunk → fresh quoted_peers for the PUT phase. The cached proof_bytes carries the OLD wave's peer_quotes. The new PUT then sends an old proof to a (possibly different) close group. If a storer not in proof.peer_quotes rejects the proof, partial resumes on networks with peer churn will still re-pay (the cache helped only for the overlap). I assume this was verified for the merkle PR (#84), but it's worth a one-line confirmation from someone who knows the verifier semantics. If acceptance is "any one storer's quote in the proof is enough," everything is fine. If it's "this storer's quote must be in the proof," resumes degrade gracefully but won't always avoid re-pay.

Q2. Sync I/O on the tokio runtime. try_append_wave / try_load_for_file / try_drop_proofs_for_file are called from inside batch_upload_chunks_with_events (an async fn). They do flock(2), sync_all, rename, and a parent-dir fsync — each of these can stall the runtime for tens to hundreds of ms on a slow filesystem. Two concurrent uploads to the same key serialize on the lock entirely on the runtime thread. A tokio::task::spawn_blocking wrapper around each try_* call would prevent any chance of starving other tasks scheduled on the same worker. Not a correctness bug — performance/responsiveness only. Worth a follow-up issue.

Smaller things

  • cached_single.rs:362-378cleanup_outdated does not expire .tmp siblings. Filter at cached_single.rs:393 returns false for .tmp/.lock. If a .tmp is orphaned for a key that's never loaded again, it sits forever. Minor disk leak (bounded by aborted uploads × small proof size). One-line fix: in is_expired_entry, treat .tmp files with an old timestamp prefix as reapable. Not a blocker.
  • file.rs:1297 and :1389cached_merkle is not cleaned on a successful wave-batch path (and vice versa). If the user's previous attempt was merkle (cached) and the retry succeeds via wave-batch (or vice versa), the other-mode cache lingers for 24h/7d until expiry. Not a correctness issue; tiny disk + tiny extra scan on the next upload. Acceptable for now.
  • cached_single.rs:626-628 — directory fsync is a no-op on Windows. File::open(parent_dir) fails without FILE_FLAG_BACKUP_SEMANTICS. Comment mentions macOS and Linux but not Windows. NTFS journals directory metadata so durability is still fine in practice — but the comment is slightly misleading. One-line clarification would be nice.
  • batch.rs:586-602 — pruned cached chunks' costs are not subtracted from cached_storage/cached_gas. This is intentional and correct (the user did spend that money on chain), but it means a resumed upload that has to re-pay every chunk because of pruning will report prior_total + new_total, double-charging the pruned addresses in the "total reported cost" view. That matches actual wallet spend, just worth noting in commit message or a comment.
  • cached_single.rs:259-303cached_paid vec is appended after paid_chunks. No ordering invariant breaks because store_paid_chunks_with_events operates on the set, but if any future code grows a "chunks in input order" assumption it will trip on this. Document or add a // order is fresh-paid then cached note inline.
  • batch.rs:541cached_proofs.get(&prep.address).cloned() clones the proof bytes even though we know the address is unique in this wave (unique_chunks dedupes earlier). A cached_proofs.remove(&prep.address) would save the clone. Sub-millisecond. Take it or leave it.

What I didn't verify

  • I did not check out the PR locally and run cargo check --all-features --all-targets or cargo clippy. The author's commit 45d287b is a clippy fix for the same module, which suggests the local build is clean.
  • I didn't run any of the new tests; many touch the global config::data_dir()/payments/single/ and can interact under cargo test. The authors clearly noticed this (the dedupe_canonical_receipts test uses recent timestamps "so a concurrent test's cleanup_outdated doesn't reap our hand-written receipts"), but the suite would be more hermetic if it used a per-test temp dir via TempDir. Optional cleanup.

Verdict

Approve with the question on storer acceptance answered. Two follow-ups worth filing:

  1. Wrap try_* cache calls in spawn_blocking (perf / responsiveness).
  2. Either share a temp-dir override for tests or accept the current global-dir style with eyes open.

…review

1. Receipt filename TTL was keyed on the FIRST wave's timestamp. A
   wave paid at T0+23h50m got dropped wholesale at T0+24h via
   cleanup_outdated even though the late wave's proof was only 10
   minutes old. Fix: rotate the canonical filename to <now>_<key>
   on every successful append_wave, so the on-disk TTL tracks
   "time since most recent paid wave" instead of "time since
   first wave". The receipt survives as long as it keeps being
   used; stale individual proofs are still pruned by the
   per-quote.timestamp check in batch.rs.

2. proof_is_safely_fresh rejected ANY future-dated quote, but the
   ant-node verifier tolerates up to QUOTE_FUTURE_SKEW_TOLERANCE_SECS
   = 300s of forward skew. A client clock 60s slow would prune
   perfectly fresh proofs and force re-payment. Fix: mirror the
   300s tolerance in CACHED_PROOF_FUTURE_SKEW_TOLERANCE_SECS and
   thread max_future_skew through proof_is_safely_fresh.

3. file_hash_key used std::collections::hash_map::DefaultHasher
   whose output is explicitly NOT stable across rustc releases.
   User pays on binary A, upgrades within 24h, retries on binary
   B, cache miss, re-pay. Fix: BLAKE3 of the canonical path
   string, truncated to 128 bits. Adds blake3 = "1" to ant-core.

4. dedupe_canonical_receipts was content-blind: it unlinked older
   siblings purely by timestamp without merging their proofs.
   Pre-existing duplicates from buggier earlier binaries or
   manual file recovery could hold proofs only in the older
   sibling. Blind unlink stranded those payments. Fix: union all
   readable siblings' proofs into the newest, sum costs, take
   min(first_pay_timestamp), atomically rewrite the winner, then
   unlink the rest.

3 new tests:
- file_hash_key_uses_stable_digest_across_invocations pins the
  expected BLAKE3 digest so a future regression to a non-stable
  hash fails loudly.
- append_wave_rotates_filename_so_late_waves_dont_age_out verifies
  the canonical filename's timestamp tracks the latest wave (not
  the first) and that proofs survive the rotation.
- duplicate_canonical_receipts_are_merged_then_older_unlinked
  replaces the prior lossy test, asserts older sibling's proof
  is merged into the winner and costs are summed correctly.

Verification: 283 lib tests pass (was 280; +3), clippy
-D warnings clean, fmt clean, release build clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants