Split chunk GET timeout from store timeout #78
Conversation
Force-pushed from 8bcdc8b to cdb0945
grumbach left a comment
Approving the split — independent chunk_get_timeout_secs is the right shape and removes the implicit coupling between PUT and GET timeouts.
One thing to flag: dropping the non-merkle STORE_RESPONSE_TIMEOUT from 30s to 10s in this PR is fine for that path (fail fast + retry, as the description says), but the merkle path reads the same store_timeout_secs config and ends up bound by it.
I've just opened #83 which bumps the default store_timeout_secs to 270s = storer-side CLOSENESS_LOOKUP_TIMEOUT (240s) + 30s padding. That's what made merkle uploads pass at scale on a 210-node 7-region testnet (0/22 -> 8/8). The two PRs touch the same constant — order matters.
Either:
- merge #78 first; my #83 rebases on top and bumps the default further.
- merge #83 first; #78 rebases on top, keeps its non-merkle hardcoded value, picks up my new default elsewhere.
Approving this one, will reconcile #83 once one of us has landed. This is #5 in the merge chain (transport -> core -> protocol -> node -> client).
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads `store_timeout_secs`. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent. This commit:

- Introduces a dedicated `DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270` const
- Adds `merkle_store_timeout_secs: u64` to `ClientConfig` (default 270)
- Routes only the merkle PUT path (`store_response_timeout_for_proof`) to the new field
- Leaves `DEFAULT_STORE_TIMEOUT_SECS` at 10 (matches current main behavior); the chunk GET path keeps reading `store_timeout_secs` unchanged
- Updates doc comments to be honest about what each knob actually governs (`store_timeout_secs` now governs only the GET path and any direct readers, not non-merkle PUTs, which use the `STORE_RESPONSE_TIMEOUT` const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78, which adds a dedicated `chunk_get_timeout_secs` field. After both land, the three timeout regions (merkle PUT / non-merkle PUT / GET) will be cleanly separated.
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous calls are rate-limited at 60/hour shared per IP; macOS runners hit this regularly and fail every E2E and Merkle E2E job with `curl: (56) ... 403`. Passing the workflow's GITHUB_TOKEN authenticates the call, raising the cap to 1000/hour per token. Same fix Mick's PR WithAutonomi#78 will want.
PUTs and GETs have different payload directions and performance profiles, so a single shared timeout was a poor fit. Adds `chunk_get_timeout_secs` to `ClientConfig` (default 10s) and a matching `--chunk-get-timeout-secs` CLI flag, while keeping `store_timeout_secs` for the PUT path. Also bumps the non-merkle store-response timeout from 5s to 10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bump related package versions in Cargo.lock
`chunk_put_with_proof` was deep-copying every chunk's content into a fresh `Vec<u8>` via `Bytes::to_vec()` before stuffing it into a `ChunkPutRequest`. Since `chunk_put_to_close_group` spawns one of these per recipient (CLOSE_GROUP_MAJORITY ≈ 5) and the AIMD store controller caps concurrent chunks at 64, peak in-flight chunk content reached ~64 × 5 × 4 MB ≈ 1.25 GB just from these copies: enough to OOM-kill `ant` on a 4 GB client VM partway through a 300 MB-or-larger upload.

Heaptrack against a clean-exit 20 MB upload (release `ant` on a 4 GB VM) showed 285 MB peak in `alloc::raw_vec::RawVecInner::finish_grow`, with 168 MB of that consumed via `ChunkMessage::encode` → `chunk_put_to_close_group`, i.e. the per-recipient copy + encode loop.

ant-protocol now holds `ChunkPutRequest::content` as `bytes::Bytes` (see jacderida/ant-protocol#1), so the caller can pass the refcounted `Bytes` straight through. Each peer's spawned task now shares the single 4 MB backing buffer instead of holding an independent copy. Wire format is unchanged.

The patch.crates-io stanza pins ant-protocol to the perf branch commit so the build resolves the Bytes-typed field. The pin should be removed once a crates.io release containing the protocol change is published.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a merkle batch upload fails partway through (network flake, slow close-K, client crash), the on-chain payment is already spent and the proofs needed to re-attempt the store are lost too; the user has to pay again from scratch.

This change persists the MerkleBatchPaymentResult to disk immediately after the on-chain payment confirms, then re-loads it on the next upload of the same file path. The cache is keyed by a hash of the source path; a successful upload deletes the cache, a partial failure leaves it for the next attempt to pick up. Files older than the on-chain payment expiration (7 days) are GC'd opportunistically.

The library handles save/load/delete transparently: no CLI flag and no app-level change needed. If the cached receipt doesn't match the current file content (file edited between attempts), the cache is discarded and the user pays fresh.

Foundation laid by adding Serialize/Deserialize to MerkleBatchPaymentResult and threading the on-chain payment timestamp through. The new module also handles its own failure modes defensively: any IO/serialization error is logged but never bubbled up to break the upload itself. Cache misses are silent.
Force-pushed from a92b3ae to f03b91b
`MerkleBatchPaymentResult.proofs` is keyed by `[u8; 32]`, which JSON cannot represent as a map key: `serde_json::to_writer` errors with "key must be a string", causing the roundtrip_save_load_delete unit test to fail. Switch to rmp-serde (already a dep) for save/load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds `chunk_get_timeout_secs` client config value and a hidden `--chunk-get-timeout-secs` CLI flag.
- Scopes `store_timeout_secs` to chunk store/PUT operations while preserving default config behavior unless a CLI override is provided.

Testing