
Eliminate HEAD requests during downloads, especially for faster transfers of small files#363

Open
crowecawcaw wants to merge 4 commits into boto:develop from crowecawcaw:callback

Conversation


@crowecawcaw crowecawcaw commented Nov 21, 2025

This PR optimizes downloads through the TransferManager by removing the upfront HEAD request. Previously, every download issued a HEAD request to determine object size before starting the GET request. This
change eliminates that extra round-trip by extracting metadata from the first GET response instead.

For small files, the download time is dominated by request latency, so eliminating one of the two requests results in a ~50% download time reduction. For large files, the effect is less noticeable because the download time is dominated by the transfer time and because there are multiple chunks to download. In both cases, we save the cost of the HEAD request.
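The ~50% figure for small files follows from a simple cost model: each request costs one round trip, and transfer time is negligible at small sizes. A rough illustration (the numbers below are hypothetical, not from the benchmark):

```python
def download_time(num_requests, rtt_s, size_bytes, bandwidth_bytes_per_s):
    """Toy model: total time = per-request round trips + transfer time."""
    return num_requests * rtt_s + size_bytes / bandwidth_bytes_per_s

# 1 kB object, 50 ms round trip, 100 Mbit/s (~12.5 MB/s) link
before = download_time(2, 0.050, 1024, 12.5e6)  # HEAD + GET
after = download_time(1, 0.050, 1024, 12.5e6)   # single GET
# Transfer time (~0.08 ms) is negligible here, so dropping the
# HEAD roughly halves the total.
```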

What Changed

Removed HEAD requests: Downloads now start immediately with a ranged GET request for the first chunk
Dynamic size detection: Extract object size and ETag from the first GET response headers (ContentRange or ContentLength)
Dynamic chunk scheduling: After the first chunk completes, schedule additional chunks only if the object is larger than the chunk size
Simplified code flow: Consolidated download logic into a single path instead of branching on size upfront
ETag consistency for all chunks: When an ETag is available (either pre-provided via a subscriber, e.g. from a prior HEAD request by the AWS CLI, or extracted from the first GET response), all subsequent ranged GET requests include an IfMatch header with that ETag. This ensures S3 rejects the request if the object changes mid-download. The first GET request also includes IfMatch if an ETag was pre-provided before the download started.
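As a sketch of the size-detection step, here is one way the total object size can be derived from the first ranged GET response (a hypothetical helper for illustration; the actual implementation lives in s3transfer's download path):

```python
import re


def total_size_from_first_get(headers):
    # A partially satisfied ranged GET returns ContentRange, e.g.
    # "bytes 0-8388607/52428800"; the total size follows the slash.
    content_range = headers.get('ContentRange')
    if content_range:
        return int(re.match(r'bytes \d+-\d+/(\d+)', content_range).group(1))
    # The object fit in one chunk: the full body came back, so
    # ContentLength is the total size.
    return int(headers['ContentLength'])
```

If the derived size exceeds the chunk size, the remaining ranged GETs are scheduled; otherwise the download is already complete.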

Testing

Unit, functional, and integ tests pass. I also added a new script to benchmark downloading many small files. For downloading 1000 1 kB files on my laptop, the total duration dropped 41%, from 15.0s to 8.9s.

Backward Compatibility

External API unchanged. All download methods have the same signatures.

Flow Diagrams

Before (with HEAD request)

flowchart TD
    A[HEAD Request] --> C{Size < 8MB?}
    C -->|Yes| D[GET Request]
    C -->|No| E[Multiple GET Requests]
    D --> F[Complete]
    E --> F

After (no HEAD request)

flowchart TD
    A[GET First Chunk] --> B{Size < 8MB?}
    B -->|Yes| C[Complete]
    B -->|No| D[GET Remaining Chunks]
    D --> C

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@aemous aemous left a comment


I left 1 nit. Note that I am not the primary reviewer for this PR, and we are still pending the primary review.

@aemous
Contributor

aemous commented Mar 6, 2026

Looks like downloads may be failing when the object has a size of 0 bytes. The error I'm seeing is

botocore.exceptions.ClientError: An error occurred (InvalidRange) when calling the GetObject operation: The requested range is not satisfiable

Can you confirm whether you are able to reproduce this, and look into resolving it?

InvalidRange occurs when the range we are requesting has no overlap with the object itself. For 0 byte objects, this will always be the case if we are requesting the first 0-x bytes, since there are no bytes at all. The only feasible patch I can imagine here would be a try-except clause that handles InvalidRange by performing a non-ranged GET, which may be undesirable. If you have a better alternative for resolving this that doesn't involve needing an extra request, that would be preferred.
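The try-except fallback described above could look roughly like this. This is a sketch only, using a stand-in exception type; real code would catch botocore's `ClientError` and inspect the error code:

```python
class S3ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError in this sketch."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def get_first_chunk(client, bucket, key, chunk_size):
    try:
        # Optimistic ranged GET for the first chunk.
        return client.get_object(
            Bucket=bucket, Key=key, Range=f'bytes=0-{chunk_size - 1}')
    except S3ClientError as e:
        if e.code == 'InvalidRange':
            # A zero-byte object rejected the range: retry without one.
            return client.get_object(Bucket=bucket, Key=key)
        raise
```

As noted, this costs an extra request for every empty object, which is why an alternative that avoids the retry would be preferable.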

Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
@crowecawcaw
Author

I see the same issue. I added an integ test, fixed the unit test to mirror the same behavior, then fixed the issue. Instead of sending a second GET request for an object we know is empty, the download just returns the empty content.
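A sketch of that approach, again with a stand-in exception type (the real change lives in s3transfer's download internals):

```python
class S3ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError in this sketch."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def first_chunk_body(client, bucket, key, chunk_size):
    try:
        response = client.get_object(
            Bucket=bucket, Key=key, Range=f'bytes=0-{chunk_size - 1}')
        return response['Body']
    except S3ClientError as e:
        if e.code == 'InvalidRange':
            # Only an empty object makes bytes=0-N unsatisfiable, so
            # return empty content without issuing a second request.
            return b''
        raise
```

This keeps the single-request fast path for every non-empty object and adds no extra round trip for empty ones.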

@aemous
Contributor

aemous commented Mar 12, 2026

@crowecawcaw Can you push a revision that satisfies the 'Lint code' GH Action?

In case you cannot see the Action details, I pasted it here:

pre-commit hook(s) made changes.
If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
diff --git a/scripts/performance/time-batch-download.py b/scripts/performance/time-batch-download.py
index 93e8a2a..1cf04fc 100755
--- a/scripts/performance/time-batch-download.py
+++ b/scripts/performance/time-batch-download.py
@@ -2,10 +2,12 @@
 """Direct timing of batch downloads without shell wrapper."""
 
 import argparse
-import tempfile
 import shutil
+import tempfile
 import time
+
 from botocore.session import get_session
+
 from s3transfer.manager import TransferManager
 
 
@@ -20,13 +22,13 @@ def main():
     parser.add_argument('--file-size', type=int, required=True)
     parser.add_argument('--s3-bucket', required=True)
     args = parser.parse_args()
-    
+
     session = get_session()
     client = session.create_client('s3')
-    
+
     tempdir = tempfile.mkdtemp()
     s3_keys = []
-    
+
     try:
         # Upload files
         print(f"Uploading {args.file_count} files...")
@@ -37,7 +39,7 @@ def main():
                 s3_key = f"perf_test_{i}"
                 manager.upload(file_path, args.s3_bucket, s3_key)
                 s3_keys.append(s3_key)
-        
+
         # Download files
         print(f"Downloading {args.file_count} files...")
         start_time = time.time()
@@ -46,9 +48,9 @@ def main():
                 download_path = f"{tempdir}/download_{i}"
                 manager.download(args.s3_bucket, s3_key, download_path)
         duration = time.time() - start_time
-        
+
         print(f"Download duration: {duration:.2f} seconds")
-        
+
         # Cleanup
         for s3_key in s3_keys:
             client.delete_object(Bucket=args.s3_bucket, Key=s3_key)
