Eliminate HEAD requests during downloads, especially for faster transfers of small files #363
crowecawcaw wants to merge 4 commits into boto:develop
Conversation
aemous
left a comment
I left one nit. Note that I am not the primary reviewer for this PR; the primary review is still pending.
Looks like downloads may be failing when the object has a size of 0 bytes. The error I'm seeing is InvalidRange. Can you confirm whether you are able to reproduce this, and look into resolving it? InvalidRange occurs when the range we are requesting has no overlap with the object itself. For 0-byte objects, this will always be the case if we are requesting the first 0-x bytes, since there are no bytes at all. The only feasible patch I can imagine here would be a try-except clause that handles InvalidRange by performing a non-ranged GET, which may be undesirable. If you have a better alternative that doesn't require an extra request, that would be preferred.
Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
I see the same issue. I added an integ test, fixed the unit test to mirror the same behavior, then fixed the issue. Instead of sending a second GET request for an object we know is empty, it now returns the empty content directly.
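A rough sketch of the fix described above, assuming botocore-style error responses (the `ClientError` class and `get_object` callable here are simplified stand-ins, not the PR's actual code): since the first ranged GET can only fail with InvalidRange when the object is empty, the downloader can synthesize an empty result instead of issuing a second request.

```python
class ClientError(Exception):
    """Simplified stand-in for botocore.exceptions.ClientError."""

    def __init__(self, error_response):
        super().__init__(error_response)
        self.response = error_response


def get_first_chunk(get_object, bucket, key, chunk_size):
    """Issue the first ranged GET; on InvalidRange (0-byte object),
    return empty content instead of sending a second, non-ranged GET."""
    try:
        return get_object(
            Bucket=bucket, Key=key, Range=f"bytes=0-{chunk_size - 1}"
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "InvalidRange":
            raise
        # Only a 0-byte object can make the first range invalid,
        # so we already know the full content: nothing.
        return {"Body": b"", "ContentLength": 0}


# Fake S3 GET for a 0-byte object, for illustration only:
def fake_get_object(**kwargs):
    if "Range" in kwargs:
        raise ClientError({"Error": {"Code": "InvalidRange"}})
    return {"Body": b"not reached"}


resp = get_first_chunk(fake_get_object, "my-bucket", "empty.txt", 8 * 1024 * 1024)
```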
@crowecawcaw Can you push a revision that satisfies the 'Lint code' GH Action? In case you cannot see the Action details, I pasted it here: |
This PR optimizes downloads through the TransferManager by removing the upfront HEAD request. Previously, every download issued a HEAD request to determine the object size before starting the GET request. This change eliminates that extra round trip by extracting metadata from the first GET response instead.
For small files, download time is dominated by request latency, so eliminating one of the two requests cuts download time by roughly 50%. For large files, the effect is less noticeable because download time is dominated by transfer time and because there are multiple chunks to download. In both cases, we save the cost of the HEAD request.
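For example, the full object size can be recovered from the first ranged GET's response headers instead of a HEAD request (a sketch; the header names follow S3's response shape, but the helper name is illustrative):

```python
def total_size_from_first_get(response):
    """Derive the full object size from the first GET response.

    A ranged GET returns a ContentRange header like
    "bytes 0-8388607/52428800"; the value after "/" is the total
    object size. If S3 returned the whole object in one response
    (no ContentRange), ContentLength already is the size.
    """
    content_range = response.get("ContentRange")
    if content_range is not None:
        return int(content_range.rsplit("/", 1)[1])
    return response["ContentLength"]


# A ranged response for an 8 MiB chunk of a 50 MiB object:
large = total_size_from_first_get(
    {"ContentRange": "bytes 0-8388607/52428800", "ContentLength": 8388608}
)
# A small object served whole, with no ContentRange header:
small = total_size_from_first_get({"ContentLength": 1024})
```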
What Changed
• Removed HEAD requests: Downloads now start immediately with a ranged GET request for the first chunk
• Dynamic size detection: Extract object size and ETag from the first GET response headers (ContentRange or ContentLength)
• Dynamic chunk scheduling: After the first chunk completes, schedule additional chunks only if the object is larger than the chunk size
• Simplified code flow: Consolidated download logic into a single path instead of branching on size upfront
• ETag consistency for all chunks: When an ETag is available (either pre-provided via a subscriber, e.g. from a prior HEAD request by the AWS CLI, or extracted from the first GET response), all subsequent ranged GET requests include an IfMatch header with that ETag. This ensures S3 rejects the request if the object changes mid-download. The first GET request also includes IfMatch if an ETag was pre-provided before the download started.
Testing
Unit, functional, and integ tests pass. I also added a new script to benchmark downloading many small files. For downloading 1000 1kB files on my laptop, the total duration dropped 41%, from 15.0s to 8.9s.
Backward Compatibility
External API unchanged. All download methods have the same signatures.
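To illustrate the dynamic chunk scheduling and ETag consistency described under "What Changed", here is a hedged sketch (the helper name and request-dict shape are mine, not the PR's actual code) of how the remaining ranged GETs could be built once the first chunk has revealed the total size and ETag, each carrying IfMatch so S3 fails the request if the object changes mid-download:

```python
def remaining_chunk_requests(total_size, chunk_size, etag):
    """Build ranged GET parameters for every chunk after the first.

    Returns an empty list when the object fit in the first chunk.
    """
    requests = []
    for start in range(chunk_size, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        params = {"Range": f"bytes={start}-{end}"}
        if etag is not None:
            # Reject the chunk if the object was replaced mid-download.
            params["IfMatch"] = etag
        requests.append(params)
    return requests


# A 20 MiB object with 8 MiB chunks needs two more requests after chunk 1:
reqs = remaining_chunk_requests(20 * 1024 * 1024, 8 * 1024 * 1024, '"abc123"')
```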
Flow Diagrams
Before (with HEAD request)
```mermaid
flowchart TD
    A[HEAD Request] --> C{Size < 8MB?}
    C -->|Yes| D[GET Request]
    C -->|No| E[Multiple GET Requests]
    D --> F[Complete]
    E --> F
```
After (no HEAD request)
```mermaid
flowchart TD
    A[GET First Chunk] --> B{Size < 8MB?}
    B -->|Yes| C[Complete]
    B -->|No| D[GET Remaining Chunks]
    D --> C
```
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.