NAR transfers via NarRequest / NarPush currently can't resume after an interruption: a network glitch or reconnect mid-transfer forces a restart from byte 0.
Proposal
Additive proto changes (no breaking changes to existing messages):
- NarStreamHeader { path_hash, total_bytes, content_hash_sha256 }: sent before chunks; receiver sizes storage and verifies the full content hash on completion.
- NarRequestResume { path_hash, received_bytes }: pull side requests resume from offset.
- NarPushResume { path_hash, received_bytes }: push side announces buffered prefix.
- Receivers persist incoming chunks to a *.partial file keyed by path_hash; survives process restart.
- Senders stay stateless; seek to the requested offset on resume.
- Bounded .partial lifetime: GC after a configurable TTL.
Server and worker both gain support.
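The receiver and sender behaviour described above could look roughly like the following. This is a minimal sketch, not the planned implementation: the helper names (partial_path, received_bytes, send_from, receive, gc_partials), the spool directory layout, the hex encoding of content_hash_sha256, and the use of file mtime for the TTL are all assumptions made for illustration. Only the message fields come from the proposal.

```python
import hashlib
import os
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator, List, Optional


@dataclass
class NarStreamHeader:
    path_hash: str
    total_bytes: int
    content_hash_sha256: str  # assumed to be a hex digest; wire encoding is not specified


def partial_path(spool: Path, path_hash: str) -> Path:
    """Hypothetical layout: one <path_hash>.partial file per in-flight transfer."""
    return spool / f"{path_hash}.partial"


def received_bytes(spool: Path, path_hash: str) -> int:
    """Offset to report in NarRequestResume / NarPushResume (0 if nothing is buffered)."""
    p = partial_path(spool, path_hash)
    return p.stat().st_size if p.exists() else 0


def send_from(nar: bytes, offset: int, chunk_size: int = 4) -> Iterator[bytes]:
    """Stateless sender: start at the requested offset and yield the remaining chunks."""
    for i in range(offset, len(nar), chunk_size):
        yield nar[i:i + chunk_size]


def receive(spool: Path, header: NarStreamHeader, chunks: Iterable[bytes]) -> Path:
    """Append chunks to the .partial file, then verify size and full content hash."""
    p = partial_path(spool, header.path_hash)
    with open(p, "ab") as f:  # append mode keeps any prefix from an earlier attempt
        for chunk in chunks:
            f.write(chunk)
    size = p.stat().st_size
    if size < header.total_bytes:
        # Stream ended early: keep the .partial file so the transfer can resume.
        raise ConnectionError("transfer incomplete; resume from received_bytes()")
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    if size != header.total_bytes or digest != header.content_hash_sha256:
        p.unlink()  # corrupt data cannot be resumed; next attempt starts at byte 0
        raise ValueError("size or content hash mismatch; partial file discarded")
    final = spool / f"{header.path_hash}.nar"
    os.replace(p, final)
    return final


def gc_partials(spool: Path, ttl_seconds: float, now: Optional[float] = None) -> List[Path]:
    """Delete .partial files whose last modification is older than the TTL."""
    now = time.time() if now is None else now
    removed = []
    for p in spool.glob("*.partial"):
        if now - p.stat().st_mtime > ttl_seconds:
            p.unlink()
            removed.append(p)
    return removed
```

In this sketch the resume offset is derived from the size of the .partial file, so the receiver needs no extra bookkeeping to survive a restart, and the sender only needs the offset from the resume message.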
Why
- Workers recovering from flaky networks don't redo large transfers.
- Prerequisite for the upcoming gradient-proxy work: the proxy will need the same resume semantics for both its proxy↔server and proxy↔worker links.
- General resilience improvement; no user-facing API change.
Out of scope
- User-initiated pause/resume.
- Multi-source / parallel downloads.
Also
- Reconnection flow: when networking drops, the worker should not abort long-running builds; instead it should reconnect, re-authenticate, and continue where it stopped.
- Better upload stability: if any output upload fails, the worker should retry the entire upload sequence up to 3 times, re-requesting presigned URLs / NarPush.
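The upload retry could be sketched as below. This is an illustration only: upload_outputs, request_target and upload are hypothetical names, "up to 3 times" is read here as three attempts in total, and treating OSError as the retryable failure is an assumption.

```python
from typing import Callable, Optional, Sequence

MAX_ATTEMPTS = 3  # assumption: "up to 3 times" means 3 attempts in total


def upload_outputs(
    outputs: Sequence[str],
    request_target: Callable[[str], str],
    upload: Callable[[str, str], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> int:
    """Run the whole upload sequence; on any failure restart it from the first output.

    Returns the number of attempts used and re-raises the last error if all fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[OSError] = None
    for attempt in range(1, max_attempts + 1):
        try:
            for out in outputs:
                # Re-request a fresh presigned URL / NarPush on every attempt.
                target = request_target(out)
                upload(out, target)
            return attempt
        except OSError as err:  # assumed retryable error class (includes ConnectionError)
            last_error = err
    assert last_error is not None
    raise last_error
```

Restarting the whole sequence is simpler than tracking per-output state, and with resumable NarPush in place a repeated push would only need to send the bytes the receiver does not already hold.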
Design doc and implementation plan to follow as a separate spec.