
feat(sts): cache AssumeRoleWithWebIdentity responses across isolates #175

Draft
alukach wants to merge 4 commits into main from fix/sts-credential-cache

Conversation

@alukach

@alukach alukach commented Jun 30, 2026


Draft / stacked on fix/sts-request-timeout (#172). The post_form change
builds on that PR's timeout, and the two are complementary (cache removes STS
from the hot path; timeout bounds the rare cold miss). Retarget to main once
#172 merges.

Problem

Private products federate to AWS STS (AssumeRoleWithWebIdentity) on the cold path. multistore's credential cache lives in per-isolate memory (OIDC_PROVIDER is a OnceLock), and Cloudflare spins up many short-lived isolates — so a large fraction of requests re-run the STS exchange on the request hot path. When that exchange stalls, the worker hangs until the edge kills it, surfacing to the app as an unparseable 503. This is the root cause behind the intermittent "product won't load, self-heals on reload" reports.

Approach

Add an L2 cache for the STS response, shared across isolates within a colo via the Cloudflare Cache API — the same pattern already used for Source API responses in source_api/cache.rs. It sits under multistore's in-isolate L1 cache:

  • L1 (multistore, per-isolate): caches typed BackendCredentials, single-flights within an isolate.
  • L2 (this PR, per-colo): caches the raw STS response body, keyed by RoleArn (L1's own cache key). On a hit, the proxy skips the STS round-trip entirely.

The only seam data.source.coop controls in the mint path is FetchHttpExchange::post_form (the outbound STS call) — get_credentials and the L1 cache live inside multistore. So the L2 cache wraps post_form.

Effect: STS calls drop from roughly once per isolate per credential lifetime to roughly once per colo per credential lifetime. The slow exchange leaves the user hot path almost entirely.
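A minimal sketch of the read-through wrapping described above, using only the standard library. Everything here is illustrative: `L2Store`, `MemStore`, and `post_form_cached` are hypothetical names, the real code is async and talks to the Cloudflare Cache API, and the in-memory store ignores the TTL.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-colo L2 store (the PR uses the Cloudflare Cache API).
trait L2Store {
    fn get(&self, key: &str) -> Option<String>;
    fn put(&mut self, key: &str, body: &str, ttl_secs: u64);
}

/// In-memory store that exists only to make the sketch runnable.
#[derive(Default)]
struct MemStore {
    entries: HashMap<String, String>,
}

impl L2Store for MemStore {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }

    fn put(&mut self, key: &str, body: &str, _ttl_secs: u64) {
        // A real store would honor the TTL; this one ignores it.
        self.entries.insert(key.to_string(), body.to_string());
    }
}

/// Read-through wrapper around the slow call.
/// `key` is None for requests that must bypass the cache, and `ttl_of`
/// returns None for bodies that must not be stored (e.g. error documents).
fn post_form_cached<S, F, T>(
    store: &mut S,
    key: Option<&str>,
    mut call_sts: F,
    ttl_of: T,
) -> Result<String, String>
where
    S: L2Store,
    F: FnMut() -> Result<String, String>,
    T: Fn(&str) -> Option<u64>,
{
    if let Some(k) = key {
        if let Some(hit) = store.get(k) {
            return Ok(hit); // L2 hit: no STS round-trip
        }
    }
    let body = call_sts()?; // L2 miss or bypass: pay for the network call
    if let (Some(k), Some(ttl)) = (key, ttl_of(&body)) {
        store.put(k, &body, ttl);
    }
    Ok(body)
}
```

With this shape, a second call with the same key returns the stored body without invoking `call_sts`, while a `None` key or a `None` TTL leaves the store untouched.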

What's cached (and not)

  • Only AssumeRoleWithWebIdentity forms (role_arn_from_form returns None for other actions / Azure-GCP flows → bypass).
  • TTL = time to the response's <Expiration> minus a 300s lead (≥ multistore's 60s refresh lead, so an L2 entry always expires before L1 would call the derived credential stale).
  • STS error documents are never cached: ttl_secs returns None when there's no parseable <Expiration>.
  • On an L2 hit we still mint the (cheap, local) JWT; only the slow STS network call is skipped. Skipping the mint too would need an L1-level hook in multistore (noted below).
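The gating and TTL rules in the list above can be sketched as follows. The function names echo the PR's helpers, but the signatures and bodies are assumptions made for illustration, not the code in sts_cache.rs. In particular, the timestamp parser is injected as a closure so the sketch stays standard-library only; the real helper parses RFC 3339 with chrono.

```rust
/// Lead subtracted from the credential expiry, per the PR description.
const LEAD_SECS: i64 = 300;

/// Returns the RoleArn only for AssumeRoleWithWebIdentity forms; None means "bypass L2".
/// Assumed signature: the form is a slice of (name, value) pairs.
fn role_arn_from_form<'a>(form: &[(&'a str, &'a str)]) -> Option<&'a str> {
    let is_web_identity = form
        .iter()
        .any(|(k, v)| *k == "Action" && *v == "AssumeRoleWithWebIdentity");
    if !is_web_identity {
        return None;
    }
    form.iter().find(|(k, _)| *k == "RoleArn").map(|(_, v)| *v)
}

/// Pulls the text between <Expiration> and </Expiration>, if present.
fn expiration_text(body: &str) -> Option<&str> {
    let start = body.find("<Expiration>")? + "<Expiration>".len();
    let end = body[start..].find("</Expiration>")? + start;
    Some(body[start..end].trim())
}

/// TTL = expiry - now - lead. Returns None when there is no <Expiration>,
/// when it cannot be parsed, or when the remaining lifetime is within the lead.
/// `parse` is a hypothetical hook mapping the tag text to epoch seconds.
fn ttl_secs(body: &str, now_epoch: i64, parse: impl Fn(&str) -> Option<i64>) -> Option<u64> {
    let expiry = parse(expiration_text(body)?)?;
    let ttl = expiry - now_epoch - LEAD_SECS;
    if ttl > 0 {
        Some(ttl as u64)
    } else {
        None
    }
}
```

Returning `None` from either helper is the "do not cache" signal, so an error document or an unparseable expiry can never be written to L2.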

Security

The cached values are short-lived, role-scoped temporary credentials, stored under a synthetic non-routable cache key (https://sts-creds.cache.internal/…, never a real edge request URL, so not externally addressable), with TTL ≤ credential lifetime, per-colo. If a deployment needs global reach, encryption-at-rest, or true cross-isolate single-flight (a cold colo can still see a small STS herd), the same cache_key/ttl_secs helpers drop into KV (global, encrypted) or a Durable Object (global, single-flight).
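One way a synthetic, non-routable cache key could be derived from the RoleArn is sketched below. The path layout is an assumption: the PR elides everything after the host, so placing the percent-encoded ARN directly after the slash is purely illustrative, and `percent_encode` and `cache_key` here are hypothetical helpers rather than the PR's code.

```rust
/// Percent-encodes every byte except RFC 3986 unreserved characters.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for b in input.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Hypothetical key layout; the real path after the host is not shown in the PR.
fn cache_key(role_arn: &str) -> String {
    format!(
        "https://sts-creds.cache.internal/{}",
        percent_encode(role_arn)
    )
}
```

Encoding the ARN keeps the key a single path segment, so characters such as `:` and `/` in the ARN cannot change the URL's structure.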

Follow-up (not here)

Cleaner long-term: give multistore's CredentialCache::get_or_fetch an optional runtime L2 hook (the crate doc already anticipates "a runtime can layer an additional cache tier inside the closure"). That caches typed creds at L1 and skips the JWT mint on hits too — but it's a cross-repo API change + release, vs. this which ships from data.source.coop today.

Verification

  • cargo test --test sts_cache: 8/8 (role/key/ttl helpers, incl. error-doc and near-expiry → not cached).
  • cargo check --target wasm32-unknown-unknown — clean.
  • cargo clippy --target wasm32-unknown-unknown -- -D warnings — clean.

🤖 Generated with Claude Code

alukach and others added 2 commits June 29, 2026 15:09
…eout

Private products federate to AWS STS on the cold path: the OIDC backend-auth
middleware POSTs AssumeRoleWithWebIdentity over the shared reqwest client, which
was built with `Client::new()` — no timeout. If that exchange stalls, the whole
Worker request hangs until the Cloudflare edge kills it and returns a non-XML
`error code: NNNN` plaintext body. The caller's AWS S3 SDK then fails to parse
it ("char 'e' is not expected.:1:1"), surfacing as an opaque contents-load
failure that self-heals after a few reloads.

It self-heals because the OIDC provider caches credentials across requests in a
warm isolate, so STS only runs on a cold isolate / cache miss — exactly when
it's slow. Public products list anonymously and never hit this path, which is
why only private products are affected.

Bound the STS POST with a 10s per-request timeout (reqwest's wasm backend honors
it via AbortController). On a stall the call now returns OidcProviderError::
HttpError → ProxyError::BackendError → a proper 503 ServiceUnavailable XML error
the client can parse and retry, instead of an unparseable edge timeout body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
multistore's credential cache is per-isolate (OIDC_PROVIDER is a OnceLock),
and Cloudflare spins up many short-lived isolates — so a large fraction of
requests re-run the STS exchange on the request hot path. When that stalls,
the worker hangs until the edge kills it, surfacing as an unparseable 503.

Layer an L2 (cross-isolate, per-colo) Cloudflare Cache API tier UNDER
multistore's in-isolate cache, keyed by RoleArn, wrapping the only seam this
repo controls in the mint path: FetchHttpExchange::post_form. On a hit the
proxy skips the slow STS round-trip entirely.

Pure helpers (role/key/ttl) live in sts_cache.rs (host-tested); the Cache API
I/O lives in lib.rs. TTL derives from the response <Expiration> minus a 300s
lead (>= multistore's 60s refresh lead, so L2 never outlives L1's freshness).
STS error documents are never cached.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@claude

claude Bot commented Jun 30, 2026


Claude finished @alukach's task in 5m 51s


Review

  • Fetch PR diff (origin/main...HEAD) and read changed files
  • Trace post_form L2 cache wiring in src/lib.rs
  • Review src/sts_cache.rs pure helpers (key derivation, TTL, action filter)
  • Review tests/sts_cache.rs coverage
  • Check Cargo.toml/Cargo.lock dependency changes

✅ No blocking issues — safe to merge.

  • Logic checks out: role_arn_from_form correctly gates the cache to AssumeRoleWithWebIdentity only, ttl_secs correctly refuses to cache error documents (no <Expiration>) or near/past-expiry responses, and the 300s lead is safely ≥ multistore's 60s L1 refresh lead. Test coverage in tests/sts_cache.rs matches the implementation's edge cases (error doc, unparseable expiration, near-expiry, expired).
  • src/lib.rs:97-102,125-130 — cache-hit/cache-miss paths are correctly gated by the same Option<String> cache key, and STS error bodies are never written to L2 (ttl_secs returns None for them), so a cache hit is always a real credential response.
  • Note (non-blocking): src/sts_cache.rs:22-33 — the L2 key is RoleArn only, so within a colo the cached STS response is shared across any connections/tenants that happen to configure the same role_arn. This deliberately mirrors multistore's own L1 cache key (per the PR description) rather than introducing a new sharing boundary, and the credential's actual permissions are role-scoped regardless of caller — but it does widen that sharing from "one isolate" to "one colo," which is worth being explicit about if a future reviewer assumes L2 is connection-scoped.
  • Note (non-blocking): Cargo.toml:59-60: chrono is added with default-features = false, features = ["alloc"], but multistore/multistore-sts already depend on chrono elsewhere in the graph (confirmed in Cargo.lock); Cargo unifies features per resolved version, so the final build likely already has std/clock enabled regardless of this crate's declared features. Cosmetic only; it doesn't affect correctness of parse_from_rfc3339.
  • worker::Cache usage (sts_cache_get/sts_cache_put) follows the existing pattern in src/source_api/cache.rs (best-effort, degrade-to-miss on error), and the synthetic non-routable cache key (https://sts-creds.cache.internal/...) prevents external cache-key addressing/poisoning.

@github-actions

github-actions Bot commented Jun 30, 2026


🚀 Latest commit deployed to https://source-data-proxy-pr-175.source-coop.workers.dev

  • Date: 2026-07-01T18:27:23Z
  • Commit: 65a190f

Base automatically changed from fix/sts-request-timeout to main July 1, 2026 17:45
alukach and others added 2 commits July 1, 2026 11:10
Resolve Cargo.toml conflict: keep both [[test]] entries (sts_cache from
this branch, object_path from main) and take main's multistore 0.6.3 bump.
Also shrink the chrono dependency comment to one line (ponytail review).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `.ok()?` branch in `ttl_secs` (STS response has an <Expiration> tag
but it isn't valid RFC3339) was untested. Asserts the credential-safety
invariant: never cache a credential whose real expiry we can't determine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>