Split chunk GET timeout from store timeout #78
Conversation
Force-pushed from 8bcdc8b to cdb0945
grumbach left a comment
Approving the split — independent chunk_get_timeout_secs is the right shape and removes the implicit coupling between PUT and GET timeouts.
One thing to flag: dropping the non-merkle STORE_RESPONSE_TIMEOUT from 30s to 10s in this PR is fine for that path (fail fast + retry, as the description says), but the merkle path reads the same store_timeout_secs config and ends up bound by it.
I've just opened #83 which bumps the default store_timeout_secs to 270s = storer-side CLOSENESS_LOOKUP_TIMEOUT (240s) + 30s padding. That's what made merkle uploads pass at scale on a 210-node 7-region testnet (0/22 -> 8/8). The two PRs touch the same constant — order matters.
Either:
- merge #78 first; my #83 rebases on top and bumps the default further.
- merge #83 first; #78 rebases on top, keeps its non-merkle hardcoded value, picks up my new default elsewhere.
Approving this one, will reconcile #83 once one of us has landed. This is #5 in the merge chain (transport -> core -> protocol -> node -> client).
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads `store_timeout_secs`. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent. This commit:

- Introduces a dedicated `DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270` const
- Adds `merkle_store_timeout_secs: u64` to `ClientConfig` (default 270)
- Routes only the merkle PUT path (`store_response_timeout_for_proof`) to the new field
- Leaves `DEFAULT_STORE_TIMEOUT_SECS` at 10 (matches current main behavior); the chunk GET path keeps reading `store_timeout_secs` unchanged
- Updates doc comments to be honest about what each knob actually governs (`store_timeout_secs` now governs only the GET path and any direct readers, not non-merkle PUTs, which use the `STORE_RESPONSE_TIMEOUT` const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78, which adds a dedicated `chunk_get_timeout_secs` field. After both land, the three timeout regions (merkle PUT / non-merkle PUT / GET) will be cleanly separated.
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous calls are rate-limited at 60/hour shared per IP; macOS runners hit this regularly and fail every E2E and Merkle E2E job with `curl: (56) ... 403`. Passing the workflow's GITHUB_TOKEN authenticates the call, raising the cap to 1000/hour per token. Same fix Mick's PR WithAutonomi#78 will want.
PUTs and GETs have different payload directions and performance profiles, so a single shared timeout was a poor fit. Adds `chunk_get_timeout_secs` to `ClientConfig` (default 10s) and a matching `--chunk-get-timeout-secs` CLI flag, while keeping `store_timeout_secs` for the PUT path. Also bumps the non-merkle store-response timeout from 5s to 10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bump related package versions in Cargo.lock
`chunk_put_with_proof` was deep-copying every chunk's content into a fresh `Vec<u8>` via `Bytes::to_vec()` before stuffing it into a `ChunkPutRequest`. Since `chunk_put_to_close_group` spawns one of these per recipient (CLOSE_GROUP_MAJORITY ≈ 5) and the AIMD store controller caps concurrent chunks at 64, peak in-flight chunk content reached ~64 × 5 × 4 MB ≈ 1.25 GB just from these copies: enough to OOM-kill `ant` on a 4 GB client VM partway through a 300 MB-or-larger upload.

Heaptrack against a clean-exit 20 MB upload (release `ant` on a 4 GB VM) showed 285 MB peak in `alloc::raw_vec::RawVecInner::finish_grow`, with 168 MB of that consumed via `ChunkMessage::encode` → `chunk_put_to_close_group`, i.e. the per-recipient copy + encode loop.

ant-protocol now holds `ChunkPutRequest::content` as `bytes::Bytes` (see jacderida/ant-protocol#1), so the caller can pass the refcounted `Bytes` straight through. Each peer's spawned task now shares the single 4 MB backing buffer instead of holding an independent copy. Wire format is unchanged.

The patch.crates-io stanza pins ant-protocol to the perf branch commit so the build resolves the Bytes-typed field. The pin should be removed once a crates.io release containing the protocol change is published.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a merkle batch upload fails partway through (network flake, slow close-K, client crash), the on-chain payment is already spent and the proofs needed to re-attempt the store are lost too; the user has to pay again from scratch.

This change persists the MerkleBatchPaymentResult to disk immediately after the on-chain payment confirms, then re-loads it on the next upload of the same file path. The cache is keyed by a hash of the source path; a successful upload deletes the cache, a partial failure leaves it for the next attempt to pick up. Files older than the on-chain payment expiration (7 days) are GC'd opportunistically.

The library handles save/load/delete transparently: no CLI flag and no app-level change needed. If the cached receipt doesn't match the current file content (file edited between attempts), the cache is discarded and the user pays fresh.

Foundation laid by adding Serialize/Deserialize to MerkleBatchPaymentResult and threading the on-chain payment timestamp through. The new module also handles its own failure modes defensively: any IO/serialization error is logged but never bubbled up to break the upload itself. Cache misses are silent.
Force-pushed from a92b3ae to f03b91b
`MerkleBatchPaymentResult.proofs` is keyed by `[u8; 32]`, which JSON cannot represent as a map key: `serde_json::to_writer` errors with "key must be a string", causing the roundtrip_save_load_delete unit test to fail. Switch to rmp-serde (already a dep) for save/load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds `chunk_get_timeout_secs` client config value and a hidden `--chunk-get-timeout-secs` CLI flag.
- Scopes `store_timeout_secs` to chunk store/PUT operations while preserving default config behavior unless a CLI override is provided.

Testing