Skip to content

fix(storage): guard downloads against SSRF to non-public hosts (GHSA-hjhx, code-only)#460

Merged
padak merged 1 commit into
mainfrom
fix/storage-download-ssrf-guard
Jun 23, 2026
Merged

fix(storage): guard downloads against SSRF to non-public hosts (GHSA-hjhx, code-only)#460
padak merged 1 commit into
mainfrom
fix/storage-download-ssrf-guard

Conversation

@padak

@padak padak commented Jun 22, 2026

Copy link
Copy Markdown
Member

Summary

Fixes M8 from the 2026-06-12 security audit (private advisory GHSA-hjhx-mx7m-8xx2) — an SSRF in the Storage download path.

Every file/table download fetches a URL that comes from a Storage API response (the presigned url, the sliced-file manifest URL, and per-slice cloud URLs). A malicious or compromised response could point one of these at the cloud instance-metadata endpoint (169.254.169.254) or localhost. The download clients already carry no Storage token and don't follow redirects (good — no credential leak, no redirect pivot), but the residual SSRF still writes internal/credential data into the user's download file.

Fix

New _assert_safe_download_url(url) resolves the host and refuses loopback / link-local / reserved / multicast / unspecified targets, called before each of the three download fetch sites:

  • download_file (non-sliced presigned URL)
  • _prepare_sliced_download (the sliced-file manifest fetch — shared by both sliced variants)
  • stream_to_file (the per-slice chokepoint — covers every slice download)

BYOC-safe by design

RFC1918 private ranges (10/8, 172.16/12, 192.168/16) are deliberately allowed. Keboola runs in customer private clouds (BYOC / private-tenant), where storage is legitimately served from private endpoints — blocking those would break real deployments. The high-value SSRF target (instance metadata) is link-local (169.254.0.0/16), not private, so blocking link-local + loopback + reserved protects the crown jewel without breaking BYOC.

⚠️ Code-only PR

Touches only client.py + test_client.pyno version bump / changelog entry (added at the next release). Same rationale as #422: main releases ~daily and a version-bumping PR conflicts on pyproject.toml/changelog.py against every release; code-only keeps this conflict-free.

Tests

TestAssertSafeDownloadUrl in test_client.py (10 tests, all offline — IP literals / localhost): rejects metadata endpoint, loopback (v4/v6/name), unspecified, and host-less URLs; allows public + RFC1918-private (BYOC) hosts. Full suite green: 4176 passed, 135 skipped; lint/format/ty clean.

Audit progress

Open after this: M7 (plaintext-on-encrypt warning — 3 services), M9 (npm --ignore-scripts), M10 (version regex), + M1/M5 residuals. All will be authored code-only.


Open in Devin Review

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.

Open in Devin Review

@padak padak left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of #460 — fix(storage): guard downloads against SSRF to non-public hosts (GHSA-hjhx, code-only)

Generated by kbagent-pr-reviewer subagent. Verdict and findings below
are advisory; the human author retains every veto. CI-coverable issues
(lint, format, tests) are confirmed via make check, not duplicated here.

Summary

This PR adds _assert_safe_download_url() — a host-resolution + IP-classification guard that fires before each of the three download fetch sites in client.py, protecting against SSRF to cloud instance-metadata endpoints and localhost. The fix is narrowly scoped (two files, 101 lines added, 0 deleted), the logic is correct for the BYOC design intent described in the PR, and make check passes clean (4176 passed, 8 skipped). The conventional-commit prefix and description are accurate. The single finding worth surfacing is a non-blocking test coverage gap: the RFC 6598 Carrier-Grade NAT block (100.64.0.0/10) is not caught by any of the five ip_address predicates and is not tested. This block is unlikely to host cloud instance metadata in practice, but it is technically an unguarded public-IP look-alike that a DNS-rebind attack could exploit. Everything else checked out. Verdict: APPROVE.

Verdict

  • Verdict: APPROVE
  • Blocking findings: 0
  • Non-blocking findings: 1
  • Nits: 2

Blocking findings

(none)

Non-blocking findings

[NB-1] src/keboola_agent_cli/client.py:3106 — RFC 6598 Carrier-Grade NAT block (100.64.0.0/10) passes the guard unchecked

_assert_safe_download_url checks is_loopback, is_link_local, is_reserved, is_multicast, and is_unspecified. In Python's ipaddress module, 100.64.0.1.is_reserved returns False (RFC 6598 CGNAT is not classified as "reserved" per IANA — it is assigned, just not globally routable). An attacker who controls a DNS name in the Storage API response could point it at a CGNAT address and bypass the guard. 100.64.0.0/10 is not the cloud-metadata crown jewel (169.254.169.254 is link-local and already blocked), but it is a non-public range that cloud providers occasionally use for inter-service communication. The fix is a single or ip.is_global inversion (if not ip.is_global and not ip.is_private) — but ip.is_private has its own quirks in Python 3.12 (covers RFC1918 + loopback + link-local), so the cleaner approach is to add ipaddress.ip_network("100.64.0.0/10") to an explicit allowance / block list, or add a _is_blocked_range helper that explicitly enumerates the non-routable space. As accepted by the author, this does not block the current BYOC use-case (RFC1918 is allowed), so it is non-blocking, but it is a residual gap in the SSRF surface.

Nits

  • [NIT-1] src/keboola_agent_cli/client.py:3121ipaddress, socket, and urlparse are imported inside the function body on every call. These are stdlib modules with negligible import cost but the pattern is inconsistent with client.py's module-level from urllib.parse import quote. Moving them to the module-level import block would be consistent and matches the existing file conventions (12 other function-local imports in the file are for optional/heavy deps like gzip, json, shutil — all three in this function are always available and lightweight).

  • [NIT-2] src/keboola_agent_cli/client.py:3106 — The function is defined after KeboolaClient (line 150) and _CloudDownloader (line 3191) but is called from methods in both classes (lines 2298, 2397, 3359). This is valid Python (method bodies are resolved at call time, not parse time), and the file already uses this pattern elsewhere, but placing the guard before KeboolaClient or at least with a # defined below comment would make it discoverable for a reader scanning top-to-bottom.

Verification log

  • gh pr view 460 --json title,body,files,additions,deletions,state → 2 files, +101/-0, state OPEN, conventional fix(storage):
  • git rev-parse --abbrev-ref HEADfix/storage-download-ssrf-guard ✓ (working tree matches PR branch)
  • grep -E '^\+(from typer|import typer|formatter\.|console\.print)' diff | grep services/ → empty ✓ (no layer violation)
  • grep -E '^\+(from httpx|import httpx|httpx\.Client)' diff | grep commands/ → empty ✓ (no layer violation)
  • Three download fetch sites identified and verified as guarded: _prepare_sliced_download (line 2298), download_file (line 2397), _CloudDownloader.stream_to_file (line 3359) ✓
  • download_sliced_file (line 2235) and download_sliced_file_to_dir (line 2318) both delegate to _prepare_sliced_download + stream_to_file — guard fires on both paths ✓
  • unload_table_to_file (line 2061) calls _wait_for_storage_job and hands the job result to the storage service; the service calls download_file or download_sliced_file — guard fires ✓
  • Server-side file_download route (server/routers/storage.py:514) delegates through registry.storage.download_filestorage_service.pyclient.download_file — guard fires ✓ (no unguarded download bypass in the serve path)
  • python3 -c "ipaddress.ip_address('::ffff:169.254.169.254').is_link_local"True — IPv4-mapped IPv6 form of the metadata endpoint is blocked ✓
  • python3 -c "ipaddress.ip_address('100.64.0.1').is_reserved"False — RFC 6598 CGNAT is NOT blocked (see NB-1)
  • python3 -c "socket.getaddrinfo('::ffff:169.254.169.254', None)" → returns {'::ffff:169.254.169.254'}, is_link_local=True → blocked ✓
  • pytest tests/test_client.py::TestAssertSafeDownloadUrl -v → 10 passed (5 reject cases + 4 BYOC-allow cases + 1 no-host case) ✓
  • make check → 4176 passed, 8 skipped, exit 0 ✓
  • Plugin synchronization map: this PR adds no new CLI command and changes no command surface. Per focus hint and PR description, version bump + changelog are intentionally deferred; no plugin file updates are expected or required for this diff. The OPERATION_REGISTRY, AGENT_CONTEXT, CLAUDE.md, and commands-reference.md checks are N/A (no command added). ✓

Open questions for the author

(none)

…hjhx)

A malicious or compromised Storage API response can return a download URL
pointing at the cloud instance-metadata endpoint (169.254.169.254) or
localhost. The download clients carry no Storage token and don't follow
redirects, but the residual SSRF still writes internal/credential data into
the user's download file.

Add _assert_safe_download_url(): resolve the host and refuse loopback /
link-local / reserved / multicast / unspecified targets before each download
fetch (download_file, the sliced-manifest fetch in _prepare_sliced_download,
and the per-slice stream_to_file). RFC1918 private ranges are ALLOWED so
BYOC / private-tenant deployments keep working; the high-value target
(instance metadata) is link-local, not private.

Code-only (no version bump / changelog entry) to stay conflict-free against
the rapid main release cadence; version + changelog added at next release.

Private advisory GHSA-hjhx-mx7m-8xx2.
@padak padak force-pushed the fix/storage-download-ssrf-guard branch from ff85a33 to 99f623c Compare June 22, 2026 22:42
@padak

padak commented Jun 22, 2026

Copy link
Copy Markdown
Member Author

Author response — review findings addressed (commit 99f623c, still code-only)

Thanks for the thorough pass — and for verifying the full download path (unload-table, the serve file_download route, and IPv4-mapped IPv6 all funnel through the three guarded sites).

  • NB-1 — CGNAT 100.64.0.0/10 slips the guard: Fixed, and I took your deeper point: an enumerate-the-bad-predicates block-list will always have gaps (CGNAT today, something else tomorrow). Rewrote _assert_safe_download_url to an allow-list: a resolved address is permitted only if it is ip.is_global (public) or falls in an explicit BYOC-private set (10/8, 172.16/12, 192.168/16, fc00::/7). Everything else — loopback, link-local (incl. 169.254.169.254), CGNAT, reserved, multicast, unspecified — is refused. Added http://100.64.0.1/... to the reject test cases; verified 100.64.0.1.is_global is False and it now raises, while public + RFC1918 still pass.

  • NIT-1 — function-local ipaddress/socket/urlparse imports: Kept as-is. The function is intentionally self-contained (single call surface, lazy import keeps the cost off the module-import path), consistent with the other function-local imports in client.py. Happy to hoist them at the next cleanup pass if you'd prefer module-level.

  • NIT-2 — function defined after its callers: Kept (valid Python; the file already uses this forward-reference pattern). Low-value churn to move it; noted for a future tidy-up.

Still code-only (no version bump / changelog) per the conflict-immunity rationale. make check green: 4177 passed. Version + changelog + a gotchas entry will be added together at the next release alongside #422's.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant