
feat(aws_tools): add batch S3 uploader/downloader (s3_files_uploader, s3_files_download)#3276

Merged
crazywoola merged 2 commits into langgenius:main from leoou331:add-s3-batch-tools
Jun 11, 2026

Conversation

@leoou331

Contributor

What this PR does

Follow-up to #3273 (which added the single-file s3_file_uploader / s3_file_download). This PR adds batch counterparts so a single workflow node can process N files in one invocation, instead of forcing users to wrap the single-file tools in an Iteration node.

  • s3_files_uploader — takes input_files (type: files) from an upstream node and uploads each to a configurable S3 bucket with optional per-object presigned GET URLs.
  • s3_files_download — takes s3_uris (type: array of s3://bucket/key) and emits one Dify file blob per success in input order, plus structured metadata for downstream nodes.

The single-file tools are unchanged.

Why

For many real workflows (Frame Extractor, Nova Canvas with numberOfImages > 1, multi-file user uploads from a Start node), the natural input is N files at once. Wrapping the single-file tools in an Iteration node works but:

  • adds a node and complicates the graph,
  • recreates the boto3 client on every iteration (extra latency),
  • offers no first-class way to surface per-file outcomes in one structured payload.

A typical batch workflow now looks like:

[Start: file-list input] -> [s3_files_uploader] -> ... -> [s3_files_download] -> [LLM/Code/...]

Tool surface

s3_files_uploader

| Param | Type | Required | Notes |
| --- | --- | --- | --- |
| input_files | files | yes | Bound to an upstream array[file] / file-list variable. |
| bucket_name | string | yes | Without s3://. |
| key_prefix | string | no | Folder-style prefix; final key per file is {prefix}/{filename}. |
| aws_region / aws_access_key_id / aws_secret_access_key / aws_session_token | string | no | Same per-tool overrides / STS support as the single-file tool. |
| generate_presign_url | boolean | no | Default false; produces a presigned GET URL per object. |
| presign_expiry | number | no | Default 3600 seconds. |

Outputs: json = {count, ok, failed, results: [{index, bucket_name, object_key, s3_uri, presigned_url?, presign_expiry?, status: "ok"|"failed", error?}, ...]} plus a per-line text summary (one line per entry: presigned URL when available, else s3_uri, else FAILED [i]: <error>).
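
As an illustration of the output shape described above, here is a minimal sketch of how per-entry results could be folded into the `{count, ok, failed, results}` payload and the per-line text summary. The function name `summarize_results` and the sample entries are hypothetical and not taken from the PR's code.

```python
from typing import Any


def summarize_results(results: list[dict[str, Any]]) -> tuple[dict[str, Any], str]:
    """Build the aggregate json payload and a one-line-per-entry text summary."""
    ok = sum(1 for r in results if r.get("status") == "ok")
    payload = {
        "count": len(results),
        "ok": ok,
        "failed": len(results) - ok,
        "results": results,
    }
    lines = []
    for r in results:
        if r.get("status") == "ok":
            # Prefer the presigned URL when one exists, else fall back to s3_uri.
            lines.append(r.get("presigned_url") or r["s3_uri"])
        else:
            lines.append(f"FAILED [{r['index']}]: {r.get('error', 'unknown error')}")
    return payload, "\n".join(lines)


# Hypothetical sample data: one success, one failure.
sample = [
    {"index": 0, "status": "ok", "s3_uri": "s3://my-bucket/in/a.txt"},
    {"index": 1, "status": "failed", "error": "AccessDenied"},
]
payload, text = summarize_results(sample)
```

Keeping the counters derived from `results` means the three numbers cannot drift out of sync with the per-entry records.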

s3_files_download

| Param | Type | Required | Notes |
| --- | --- | --- | --- |
| s3_uris | array (LLM-fillable) | yes | List of s3://bucket/key. |
| aws_region / aws_access_key_id / aws_secret_access_key / aws_session_token | string | no | Same overrides as uploader. |

Outputs: json = {count, ok, failed, results: [{index, s3_uri, bucket, key, content_type, content_length, etag, last_modified, filename, status, error?}, ...]}, a per-line text summary (bucket / key / content_length for each success, FAILED [i] <s3_uri>: <error> for each failure), and one Dify file blob per successful URI in input order (failed entries simply produce no blob; downstream nodes correlate via results).
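
To make the `s3://bucket/key` input format concrete, the following is a rough sketch of URI parsing with per-entry validation. `parse_s3_uri` and `plan_downloads` are hypothetical helper names, and the actual tool may validate URIs differently.

```python
def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key); raise ValueError if malformed."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket or not key:
        raise ValueError(f"expected s3://bucket/key, got {uri!r}")
    return bucket, key


def plan_downloads(s3_uris: list[str]) -> list[dict]:
    """Validate every URI up front; a bad URI marks only its own entry as failed."""
    results = []
    for index, uri in enumerate(s3_uris):
        entry = {"index": index, "s3_uri": uri}
        try:
            entry["bucket"], entry["key"] = parse_s3_uri(uri)
            entry["status"] = "ok"
        except ValueError as exc:
            entry["status"] = "failed"
            entry["error"] = str(exc)
        results.append(entry)
    return results


plan = plan_downloads(["s3://my-bucket/path/to/a.txt", "https://example.com/a.txt"])
```

Because `index` is recorded on every entry, downstream nodes can still map each result back to its position in the input list when some entries produce no blob.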

Design choices

  • No object_key override on the batch uploader. A single override cannot apply to N files. The final key per file is always derived from the file's own filename (or a UUID fallback) and optionally prepended with key_prefix. The batch uploader auto-disambiguates duplicate filenames in the same batch (image.png / image-1.png / image-2.png) so concurrent upstream branches with identical filenames don't silently overwrite each other.
  • Per-entry failure isolation. A single bad file/URI does not abort the batch — status=ok|failed (+ error) is captured per entry in results[]. The whole invocation only emits a top-level error message when every entry fails, so downstream nodes still see a clear failure signal in the all-failed case.
  • Same inline credential helpers as the single-file tools. The batch tools reuse the same _resolve_aws_credentials / _build_boto3_client_kwargs shape introduced in #3273; no shared utils/ module is introduced. The boto3 client is created once per _invoke call and reused across batch entries within that one call.

Files

tools/aws/
├── manifest.yaml                          # 0.0.27 -> 0.0.28
├── README.md                              # +2 Features lines
├── provider/aws_tools.yaml                # +2 lines (register both new tools)
└── tools/
    ├── s3_files_uploader.py               # 251 lines
    ├── s3_files_uploader.yaml             # 134 lines
    ├── s3_files_download.py               # 209 lines
    └── s3_files_download.yaml             # 80 lines

Validation

Static:

  • python -m py_compile on both .py files — ✅
  • yaml.safe_load on the new + modified yaml files — ✅
  • Verified extra.python.source paths resolve to existing files — ✅
  • black --check -l 100 and ruff check both clean on the new files — ✅
  • Confirmed label / description languages match the rest of the plugin (en_US / zh_Hans / pt_BR) — ✅
  • No new dependencies (boto3/botocore already pinned in pyproject.toml from #3273) — ✅

Mock unit tests (10/10 pass): basic batch upload with presign, duplicate-filename dedup, partial upload failure (ClientError), all-fail top-level error, empty input, presign error isolation (upload OK + presign fails), basic batch download, partial download failure (NoSuchKey), invalid URI yields no blob, empty s3_uris.

End-to-end on Dify 1.14.2 Community Edition + real S3 (cn-northwest-1):

  1. Built a .difypkg from tools/aws/ on this branch, installed on a self-hosted instance.
  2. Imported a workflow [Start (file-list) -> s3_files_uploader -> Code (extract URIs) -> s3_files_download -> Code (summary) -> End].
  3. Triggered via Service API with three files at once: text/plain (a.txt, 40 B), image/png (100×100 RGBA, 220 B), and application/pdf (doc.pdf, 540 B).
  4. Result: status = succeeded, all 6 steps green, elapsed ~1.2 s. upload_uris and download_keys matched input order; download_sizes = [40, 220, 540].
  5. Pulled all three objects back from S3 with aws s3 cp and compared SHA-256 — byte-for-byte identical for all 3 files:
    • a.txt c377b72e7343c1642a35c7ff5108fef6c14fc5fba3aecce89c18d9ac526e4de8
    • img.png 5a2fe18dbec51b2426f8aa31f6424b0efff246497646f1aa2314abe8d09b7aec
    • doc.pdf 7b6fed1b75159c5cbc633e04f9011a1a9e4f22efce2621b8e14646064cf8c6fa
  6. With generate_presign_url=true, every entry's presigned_url was fetched via curl and produced byte-identical content to the local source.
  7. Partial-failure run (2 valid + 1 bogus s3:// URI) on the download tool: returned count=3, ok=2, failed=1 with a NoSuchKey error string for the bogus URI, and exactly 2 file blobs yielded in input order.
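
The byte-for-byte check in step 5 can be reproduced with a few lines of Python instead of a shell one-liner. This is a generic sketch using only the standard library; the helper names are hypothetical and the paths passed in would be the local source file and the copy pulled back from S3.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(source_path: str, roundtrip_path: str) -> bool:
    """True when both files hash to the same SHA-256 digest."""
    return sha256_of(source_path) == sha256_of(roundtrip_path)
```

Reading in chunks keeps memory use flat regardless of file size, which matters more for the large-file follow-up than for the small fixtures used here.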

Out of scope (kept for a follow-up)

  • Multipart upload / streaming for large files
  • Any change to existing s3_operator / s3_file_uploader / s3_file_download
  • Shared utils/ module

… s3_files_download)

Add batch counterparts to the new s3_file_uploader / s3_file_download tools
introduced by this PR, so a single workflow node can process N files in one
invocation instead of forcing users to wrap the single-file tools in an
Iteration node.

New tools (tools/aws/tools/):
- s3_files_uploader (input_files: files; key_prefix; per-entry presigned URL)
- s3_files_download (s3_uris: array; emits one Dify file blob per success)

Behavior:
- Per-entry failure isolation: a single bad file/URI does not abort the
  batch; status='ok'|'failed' (+ error) is captured per entry in
  results[]. The whole invocation only emits a top-level error message
  when *every* entry fails.
- Uploader auto-disambiguates duplicate filenames in the same batch
  (image.png, image-1.png, image-2.png) so concurrent upstream branches
  with identical filenames do not silently overwrite each other.
- Downloader yields blobs in input order; failed entries simply produce
  no blob and downstream nodes can correlate via results[].
- Both tools reuse the same inline credential helpers as the single-file
  versions (no shared utils/ module introduced).

Registry / metadata:
- tools/aws/manifest.yaml: 0.0.27 -> 0.0.28
- tools/aws/provider/aws_tools.yaml: register both new tool yaml files
- tools/aws/README.md: add Features lines for the batch variants

Validation:
* python -m py_compile + yaml.safe_load on the new files: clean
* black --check -l 100 + ruff check: clean
* 10 mock-boto3 unit tests covering basic batch, dedup, partial failure
  (ClientError), all-fail top-level error, empty input, presign error
  isolation, partial download (NoSuchKey), invalid URI: 10/10 pass
* End-to-end on Dify 1.14.2 Community Edition + real S3 (cn-northwest-1):
  - Workflow [Start file-list -> s3_files_uploader -> extract URIs ->
    s3_files_download -> summarize -> End], 3 files (txt 40B + png 220B
    + pdf 540B), elapsed ~1.2s, all 6 steps succeeded
  - SHA-256 round-trip byte-identical for all 3 files (pulled back via
    aws s3 cp and compared with the source)
  - Each presigned URL returned by the uploader fetched via curl and
    verified byte-identical to the source
  - Partial-failure run (2 valid + 1 bogus s3:// URI): downloader
    returned count=3, ok=2, failed=1 with a NoSuchKey error string for
    the bogus URI and exactly 2 file blobs yielded in input order

Out of scope (kept for a follow-up):
- Multipart upload / streaming for large files
- Any change to existing s3_operator / s3_file_uploader / s3_file_download
- Shared utils/ module
@dosubot (bot) added the size:XL and enhancement labels Jun 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces two new tools to the AWS plugin: AWS S3 Batch File Uploader and AWS S3 Batch File Download, enabling multi-file S3 operations with per-file failure isolation in a single invocation. The review feedback suggests optimizing memory usage in the batch download tool by yielding file blobs immediately rather than buffering them in memory, which helps prevent Out-Of-Memory (OOM) errors. Additionally, it recommends catching general exceptions during presigned URL generation in the batch uploader to ensure that secondary presigning failures do not incorrectly fail the entire upload process.


- s3_files_download: yield each blob inline during the download loop
  instead of buffering all file bytes in memory and yielding at the
  end. The buffer-then-flush pattern made peak RSS scale with N x
  file_size, which can trip the Dify plugin container's 256 MB memory
  limit on a batch of large files. Inline yield keeps peak RSS bounded
  by a single file's size.

- s3_files_uploader: catch a broader Exception around
  generate_presigned_url instead of just ClientError. The presign call
  is a client-side helper that can raise ParamValidationError, other
  BotoCoreError subclasses, or unrelated runtime errors; letting any
  of those propagate would fail the whole batch even though the
  upload itself already succeeded. We still record the message in
  entry['presign_error'] so the result remains observable.

Validation:
- python -m py_compile + black --check -l 100 + ruff check: clean
- Mock unit tests (10/10): still pass
- End-to-end re-run on Dify 1.14.2 + real S3 (cn-northwest-1):
  - 3-file batch: succeeded, sizes [40, 220, 540] match, order preserved
  - Partial-failure download (2 valid + 1 bogus s3:// URI): count=3,
    ok=2, failed=1, files yielded in input order ['a.txt', 'img.png']
    (failed entry produces no blob, downstream nodes correlate via
    json results)
@leoou331

Contributor Author

Thanks @gemini-code-assist for the review! Both points adopted in a68d6c11 (just pushed):

| # | Severity | File | Change |
| --- | --- | --- | --- |
| 1 | high | s3_files_download.py | Yield each blob inline during the download loop instead of buffering. Peak RSS is now bounded by a single file's size (O(file_size)) instead of O(N × file_size), so a batch of large files no longer risks tripping the plugin container's 256 MB memory limit. |
| 2 | medium | s3_files_uploader.py | Catch broader Exception around generate_presigned_url (with ClientError downcast for structured error messages). The presign call can raise ParamValidationError / other BotoCoreError subclasses / unrelated runtime errors; since the upload itself already succeeded, we record the failure on the entry as presign_error rather than failing the whole batch. |
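
For readers unfamiliar with the pattern, the first change (yielding inline rather than buffering) can be sketched with a plain generator. This is not the tool's code: `download_inline` is a hypothetical name, and `fetch` stands in for whatever call retrieves an object's bytes (here a dict lookup, so the example runs without AWS).

```python
from typing import Callable, Iterator


def download_inline(
    s3_uris: list[str],
    fetch: Callable[[str], bytes],
    results: list[dict],
) -> Iterator[bytes]:
    """Yield each blob as soon as it is fetched; record per-entry status in results."""
    for index, uri in enumerate(s3_uris):
        entry = {"index": index, "s3_uri": uri}
        try:
            body = fetch(uri)
        except Exception as exc:
            # A failed entry yields no blob; the batch carries on.
            entry.update(status="failed", error=str(exc))
            results.append(entry)
            continue
        entry.update(status="ok", content_length=len(body))
        results.append(entry)
        # Hand the bytes downstream now; nothing is accumulated across iterations.
        yield body


# Stand-in object store so the sketch runs locally.
store = {"s3://b/a.txt": b"aaa", "s3://b/c.txt": b"cc"}
results: list[dict] = []
uris = ["s3://b/a.txt", "s3://b/missing", "s3://b/c.txt"]
blobs = list(download_inline(uris, store.__getitem__, results))
```

In the sketch only the current `body` is referenced by the generator at any time, which is what bounds peak memory by a single file's size as long as the consumer does not keep the blobs itself (the `list(...)` call above does, purely for demonstration).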

Re-validation after the fix:

  • python -m py_compile + black --check -l 100 + ruff check: clean
  • 10/10 mock unit tests still pass
  • End-to-end on Dify 1.14.2 + real S3 (cn-northwest-1):
    • 3-file batch: succeeded, sizes [40, 220, 540] match, order preserved.
    • Partial-failure download (2 valid + 1 bogus s3:// URI): count=3, ok=2, failed=1, exactly 2 file blobs yielded in input order (['a.txt', 'img.png']), failed entry produces no blob and downstream nodes correlate via json.results.
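
The second change (presign isolation in the uploader) might look roughly like the sketch below. `presign_entry` and the two stub client classes are hypothetical; the only real API assumed is the boto3 S3 client method `generate_presigned_url(ClientMethod, Params=..., ExpiresIn=...)`, which the stubs imitate so the example runs without boto3. The ClientError downcast mentioned in the reply is omitted.

```python
def presign_entry(client, entry: dict, expiry: int = 3600) -> dict:
    """Attach a presigned GET URL to an already-uploaded entry, never raising."""
    try:
        entry["presigned_url"] = client.generate_presigned_url(
            "get_object",
            Params={"Bucket": entry["bucket_name"], "Key": entry["object_key"]},
            ExpiresIn=expiry,
        )
        entry["presign_expiry"] = expiry
    except Exception as exc:  # deliberately broad: the upload itself succeeded
        entry["presign_error"] = str(exc)
    return entry


class WorkingClient:
    """Stub that imitates a client able to presign."""

    def generate_presigned_url(self, method, Params=None, ExpiresIn=3600):
        return f"https://example.invalid/{Params['Bucket']}/{Params['Key']}?X-Expires={ExpiresIn}"


class BrokenClient:
    """Stub that fails with a non-ClientError exception."""

    def generate_presigned_url(self, method, Params=None, ExpiresIn=3600):
        raise RuntimeError("boom")


good = presign_entry(WorkingClient(), {"bucket_name": "b", "object_key": "k", "status": "ok"})
bad = presign_entry(BrokenClient(), {"bucket_name": "b", "object_key": "k", "status": "ok"})
```

Note that `status` stays `"ok"` in both cases: a presign failure is reported alongside the successful upload rather than replacing it.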

@leoou331 leoou331 deployed to tools/aws June 11, 2026 11:34 — with GitHub Actions Active
@dosubot (bot) added the lgtm label Jun 11, 2026
@crazywoola crazywoola merged commit 48e8212 into langgenius:main Jun 11, 2026
3 checks passed