fix(storage): guard downloads against SSRF to non-public hosts (GHSA-hjhx, code-only)#460
Conversation
padak
left a comment
There was a problem hiding this comment.
Review of #460 — fix(storage): guard downloads against SSRF to non-public hosts (GHSA-hjhx, code-only)
Generated by
kbagent-pr-reviewersubagent. Verdict and findings below
are advisory; the human author retains every veto. CI-coverable issues
(lint, format, tests) are confirmed viamake check, not duplicated here.
Summary
This PR adds _assert_safe_download_url() — a host-resolution + IP-classification guard that fires before each of the three download fetch sites in client.py, protecting against SSRF to cloud instance-metadata endpoints and localhost. The fix is narrowly scoped (two files, 101 lines added, 0 deleted), the logic is correct for the BYOC design intent described in the PR, and make check passes clean (4176 passed, 8 skipped). The conventional-commit prefix and description are accurate. The single finding worth surfacing is a non-blocking test coverage gap: the RFC 6598 Carrier-Grade NAT block (100.64.0.0/10) is not caught by any of the five ip_address predicates and is not tested. This block is unlikely to host cloud instance metadata in practice, but it is technically an unguarded public-IP look-alike that a DNS-rebind attack could exploit. Everything else checked out. Verdict: APPROVE.
Verdict
- Verdict: APPROVE
- Blocking findings: 0
- Non-blocking findings: 1
- Nits: 2
Blocking findings
(none)
Non-blocking findings
[NB-1] src/keboola_agent_cli/client.py:3106 — RFC 6598 Carrier-Grade NAT block (100.64.0.0/10) passes the guard unchecked
_assert_safe_download_url checks is_loopback, is_link_local, is_reserved, is_multicast, and is_unspecified. In Python's ipaddress module, 100.64.0.1.is_reserved returns False (RFC 6598 CGNAT is not classified as "reserved" per IANA — it is assigned, just not globally routable). An attacker who controls a DNS name in the Storage API response could point it at a CGNAT address and bypass the guard. 100.64.0.0/10 is not the cloud-metadata crown jewel (169.254.169.254 is link-local and already blocked), but it is a non-public range that cloud providers occasionally use for inter-service communication. The fix is a single or ip.is_global inversion (if not ip.is_global and not ip.is_private) — but ip.is_private has its own quirks in Python 3.12 (covers RFC1918 + loopback + link-local), so the cleaner approach is to add ipaddress.ip_network("100.64.0.0/10") to an explicit allowance / block list, or add a _is_blocked_range helper that explicitly enumerates the non-routable space. As accepted by the author, this does not block the current BYOC use-case (RFC1918 is allowed), so it is non-blocking, but it is a residual gap in the SSRF surface.
Nits
-
[NIT-1]src/keboola_agent_cli/client.py:3121—ipaddress,socket, andurlparseare imported inside the function body on every call. These are stdlib modules with negligible import cost but the pattern is inconsistent withclient.py's module-levelfrom urllib.parse import quote. Moving them to the module-level import block would be consistent and matches the existing file conventions (12 other function-local imports in the file are for optional/heavy deps likegzip,json,shutil— all three in this function are always available and lightweight). -
[NIT-2]src/keboola_agent_cli/client.py:3106— The function is defined afterKeboolaClient(line 150) and_CloudDownloader(line 3191) but is called from methods in both classes (lines 2298, 2397, 3359). This is valid Python (method bodies are resolved at call time, not parse time), and the file already uses this pattern elsewhere, but placing the guard beforeKeboolaClientor at least with a# defined belowcomment would make it discoverable for a reader scanning top-to-bottom.
Verification log
gh pr view 460 --json title,body,files,additions,deletions,state→ 2 files, +101/-0, state OPEN, conventionalfix(storage):✓git rev-parse --abbrev-ref HEAD→fix/storage-download-ssrf-guard✓ (working tree matches PR branch)grep -E '^\+(from typer|import typer|formatter\.|console\.print)' diff | grep services/→ empty ✓ (no layer violation)grep -E '^\+(from httpx|import httpx|httpx\.Client)' diff | grep commands/→ empty ✓ (no layer violation)- Three download fetch sites identified and verified as guarded:
_prepare_sliced_download(line 2298),download_file(line 2397),_CloudDownloader.stream_to_file(line 3359) ✓ download_sliced_file(line 2235) anddownload_sliced_file_to_dir(line 2318) both delegate to_prepare_sliced_download+stream_to_file— guard fires on both paths ✓unload_table_to_file(line 2061) calls_wait_for_storage_joband hands the job result to the storage service; the service callsdownload_fileordownload_sliced_file— guard fires ✓- Server-side
file_downloadroute (server/routers/storage.py:514) delegates throughregistry.storage.download_file→storage_service.py→client.download_file— guard fires ✓ (no unguarded download bypass in the serve path) python3 -c "ipaddress.ip_address('::ffff:169.254.169.254').is_link_local"→True— IPv4-mapped IPv6 form of the metadata endpoint is blocked ✓python3 -c "ipaddress.ip_address('100.64.0.1').is_reserved"→False— RFC 6598 CGNAT is NOT blocked (see NB-1)python3 -c "socket.getaddrinfo('::ffff:169.254.169.254', None)"→ returns{'::ffff:169.254.169.254'},is_link_local=True→ blocked ✓pytest tests/test_client.py::TestAssertSafeDownloadUrl -v→ 10 passed (5 reject cases + 4 BYOC-allow cases + 1 no-host case) ✓make check→ 4176 passed, 8 skipped, exit 0 ✓- Plugin synchronization map: this PR adds no new CLI command and changes no command surface. Per focus hint and PR description, version bump + changelog are intentionally deferred; no plugin file updates are expected or required for this diff. The
OPERATION_REGISTRY,AGENT_CONTEXT,CLAUDE.md, andcommands-reference.mdchecks are N/A (no command added). ✓
Open questions for the author
(none)
…hjhx) A malicious or compromised Storage API response can return a download URL pointing at the cloud instance-metadata endpoint (169.254.169.254) or localhost. The download clients carry no Storage token and don't follow redirects, but the residual SSRF still writes internal/credential data into the user's download file. Add _assert_safe_download_url(): resolve the host and refuse loopback / link-local / reserved / multicast / unspecified targets before each download fetch (download_file, the sliced-manifest fetch in _prepare_sliced_download, and the per-slice stream_to_file). RFC1918 private ranges are ALLOWED so BYOC / private-tenant deployments keep working; the high-value target (instance metadata) is link-local, not private. Code-only (no version bump / changelog entry) to stay conflict-free against the rapid main release cadence; version + changelog added at next release. Private advisory GHSA-hjhx-mx7m-8xx2.
ff85a33 to
99f623c
Compare
Author response — review findings addressed (commit 99f623c, still code-only)Thanks for the thorough pass — and for verifying the full download path (unload-table, the serve
Still code-only (no version bump / changelog) per the conflict-immunity rationale. |
Summary
Fixes M8 from the 2026-06-12 security audit (private advisory GHSA-hjhx-mx7m-8xx2) — an SSRF in the Storage download path.
Every file/table download fetches a URL that comes from a Storage API response (the presigned
url, the sliced-file manifest URL, and per-slice cloud URLs). A malicious or compromised response could point one of these at the cloud instance-metadata endpoint (169.254.169.254) orlocalhost. The download clients already carry no Storage token and don't follow redirects (good — no credential leak, no redirect pivot), but the residual SSRF still writes internal/credential data into the user's download file.Fix
New
_assert_safe_download_url(url)resolves the host and refuses loopback / link-local / reserved / multicast / unspecified targets, called before each of the three download fetch sites:download_file(non-sliced presigned URL)_prepare_sliced_download(the sliced-file manifest fetch — shared by both sliced variants)stream_to_file(the per-slice chokepoint — covers every slice download)BYOC-safe by design
RFC1918 private ranges (
10/8,172.16/12,192.168/16) are deliberately allowed. Keboola runs in customer private clouds (BYOC / private-tenant), where storage is legitimately served from private endpoints — blocking those would break real deployments. The high-value SSRF target (instance metadata) is link-local (169.254.0.0/16), not private, so blocking link-local + loopback + reserved protects the crown jewel without breaking BYOC.Touches only
client.py+test_client.py— no version bump / changelog entry (added at the next release). Same rationale as #422:mainreleases ~daily and a version-bumping PR conflicts onpyproject.toml/changelog.pyagainst every release; code-only keeps this conflict-free.Tests
TestAssertSafeDownloadUrlintest_client.py(10 tests, all offline — IP literals / localhost): rejects metadata endpoint, loopback (v4/v6/name), unspecified, and host-less URLs; allows public + RFC1918-private (BYOC) hosts. Full suite green: 4176 passed, 135 skipped; lint/format/tyclean.Audit progress
Open after this: M7 (plaintext-on-encrypt warning — 3 services), M9 (npm
--ignore-scripts), M10 (version regex), + M1/M5 residuals. All will be authored code-only.