feat(sts): cache AssumeRoleWithWebIdentity responses across isolates #175
Draft
alukach wants to merge 4 commits into
…eout
Private products federate to AWS STS on the cold path: the OIDC backend-auth
middleware POSTs AssumeRoleWithWebIdentity over the shared reqwest client, which
was built with `Client::new()` — no timeout. If that exchange stalls, the whole
Worker request hangs until the Cloudflare edge kills it and returns a non-XML
`error code: NNNN` plaintext body. The caller's AWS S3 SDK then fails to parse
it ("char 'e' is not expected.:1:1"), surfacing as an opaque contents-load
failure that self-heals after a few reloads.
It self-heals because the OIDC provider caches credentials across requests in a
warm isolate, so STS only runs on a cold isolate / cache miss — exactly when
it's slow. Public products list anonymously and never hit this path, which is
why only private products are affected.
Bound the STS POST with a 10s per-request timeout (reqwest's wasm backend honors
it via AbortController). On a stall the call now returns OidcProviderError::
HttpError → ProxyError::BackendError → a proper 503 ServiceUnavailable XML error
the client can parse and retry, instead of an unparseable edge timeout body.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
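The error path this commit describes can be sketched roughly as below. It is a std-only illustration, not the proxy's code: only the variant names and the 503 `ServiceUnavailable` outcome come from the commit message, while the enum payloads, the `status`/`to_xml` helpers, and the S3-style `<Error>` body layout are assumptions.

```rust
// Hypothetical stand-ins for the proxy's error types. Only the variant
// names appear in the commit message; the String payloads are assumed.
#[derive(Debug)]
enum OidcProviderError {
    HttpError(String),
}

#[derive(Debug)]
enum ProxyError {
    BackendError(String),
}

// A timed-out STS POST surfaces as HttpError, which converts to BackendError.
impl From<OidcProviderError> for ProxyError {
    fn from(err: OidcProviderError) -> Self {
        match err {
            OidcProviderError::HttpError(msg) => ProxyError::BackendError(msg),
        }
    }
}

// Minimal escaping so the message cannot break the XML body.
fn xml_escape(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

impl ProxyError {
    // HTTP status returned to the S3 client (assumed helper).
    fn status(&self) -> u16 {
        match self {
            ProxyError::BackendError(_) => 503,
        }
    }

    // S3-style error document the client SDK can parse and retry on (assumed helper).
    fn to_xml(&self) -> String {
        match self {
            ProxyError::BackendError(msg) => format!(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>ServiceUnavailable</Code><Message>{}</Message></Error>",
                xml_escape(msg)
            ),
        }
    }
}
```

The point of the mapping is that the client always receives a well-formed XML error it can retry on, rather than the plaintext edge timeout body.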
multistore's credential cache is per-isolate (OIDC_PROVIDER is a OnceLock), and Cloudflare spins up many short-lived isolates, so a large fraction of requests re-run the STS exchange on the request hot path. When that stalls, the worker hangs until the edge kills it, surfacing as an unparseable 503.

Layer an L2 (cross-isolate, per-colo) Cloudflare Cache API tier UNDER multistore's in-isolate cache, keyed by RoleArn, wrapping the only seam this repo controls in the mint path: FetchHttpExchange::post_form. On a hit the proxy skips the slow STS round-trip entirely.

Pure helpers (role/key/ttl) live in sts_cache.rs (host-tested); the Cache API I/O lives in lib.rs. TTL derives from the response <Expiration> minus a 300s lead (>= multistore's 60s refresh lead, so L2 never outlives L1's freshness). STS error documents are never cached.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
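A minimal sketch of the "role" helper, assuming the form reaches the cache layer as decoded key/value pairs. The signature and the `lookup` helper are hypothetical; the behavior it illustrates (return `None`, and therefore bypass the cache, for anything that is not an `AssumeRoleWithWebIdentity` form) is the one this PR states for `role_arn_from_form`.

```rust
// Hypothetical helper: find the first value for `key` in decoded form pairs.
fn lookup<'a>(form: &[(&'a str, &'a str)], key: &str) -> Option<&'a str> {
    form.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
}

// Returns the RoleArn only for AssumeRoleWithWebIdentity forms.
// Any other action, or a missing/empty RoleArn, yields None so the
// caller skips the L2 cache and calls the backend directly.
// The parameter shape is an assumption, not the crate's actual signature.
fn role_arn_from_form<'a>(form: &[(&'a str, &'a str)]) -> Option<&'a str> {
    if lookup(form, "Action")? != "AssumeRoleWithWebIdentity" {
        return None;
    }
    lookup(form, "RoleArn").filter(|arn| !arn.is_empty())
}
```

Returning `Option` keeps the bypass decision in one place: the caller only builds a cache key when a role ARN comes back.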
Claude finished @alukach's task in 5m 51s
Review
✅ No blocking issues — safe to merge.
🚀 Latest commit deployed to https://source-data-proxy-pr-175.source-coop.workers.dev
Resolve Cargo.toml conflict: keep both [[test]] entries (sts_cache from this branch, object_path from main) and take main's multistore 0.6.3 bump. Also shrink the chrono dependency comment to one line (ponytail review).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `.ok()?` branch in `ttl_secs` (STS response has an <Expiration> tag but it isn't valid RFC3339) was untested. Asserts the credential-safety invariant: never cache a credential whose real expiry we can't determine.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
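The invariant can be illustrated with a std-only sketch of a TTL helper. Everything here is an assumption made for illustration: the function signatures, the `expiration_text` and `parse_utc` helpers, and the parser itself, which accepts only the `YYYY-MM-DDTHH:MM:SSZ` form as a stand-in for a real RFC3339 parser. Only the 300s lead and the "no parseable expiry means no caching" rule come from this PR.

```rust
// Lead subtracted from the credential expiry before caching (from the PR text).
const LEAD_SECS: i64 = 300;

// Days since 1970-01-01 for a proleptic Gregorian calendar date.
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (month + 9) % 12; // March = 0
    let doy = (153 * mp + 2) / 5 + day - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

// Parse a fixed-width, all-digit field; None on anything else.
fn field(ts: &str, start: usize, len: usize) -> Option<i64> {
    let s = ts.get(start..start + len)?;
    if !s.bytes().all(|c| c.is_ascii_digit()) {
        return None;
    }
    s.parse().ok()
}

// Simplified stand-in parser: UTC timestamps of the form YYYY-MM-DDTHH:MM:SSZ only.
fn parse_utc(ts: &str) -> Option<i64> {
    let b = ts.as_bytes();
    if b.len() != 20 || b[4] != b'-' || b[7] != b'-' || b[10] != b'T'
        || b[13] != b':' || b[16] != b':' || b[19] != b'Z'
    {
        return None;
    }
    let (y, mo, d) = (field(ts, 0, 4)?, field(ts, 5, 2)?, field(ts, 8, 2)?);
    let (h, mi, s) = (field(ts, 11, 2)?, field(ts, 14, 2)?, field(ts, 17, 2)?);
    if !(1..=12).contains(&mo) || !(1..=31).contains(&d) || h > 23 || mi > 59 || s > 59 {
        return None;
    }
    Some(days_from_civil(y, mo, d) * 86_400 + h * 3_600 + mi * 60 + s)
}

// Text between <Expiration> and </Expiration>, if both tags are present.
fn expiration_text(xml: &str) -> Option<&str> {
    let start = xml.find("<Expiration>")? + "<Expiration>".len();
    let end = start + xml[start..].find("</Expiration>")?;
    Some(xml[start..end].trim())
}

// Cache lifetime in seconds, or None when the response must not be cached:
// no <Expiration>, an unparseable timestamp, or expiry within the lead window.
fn ttl_secs(xml: &str, now_unix: i64) -> Option<u64> {
    let expires = parse_utc(expiration_text(xml)?)?;
    let ttl = expires - now_unix - LEAD_SECS;
    if ttl > 0 { Some(ttl as u64) } else { None }
}
```

Every failure mode collapses to `None`, so the caller has a single "do not cache" branch and cannot store a credential whose expiry is unknown.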
Problem
Private products federate to AWS STS (`AssumeRoleWithWebIdentity`) on the cold path. multistore's credential cache lives in per-isolate memory (`OIDC_PROVIDER` is a `OnceLock`), and Cloudflare spins up many short-lived isolates, so a large fraction of requests re-run the STS exchange on the request hot path. When that exchange stalls, the worker hangs until the edge kills it, surfacing to the app as an unparseable `503`. This is the root cause behind the intermittent "product won't load, self-heals on reload" reports.
Approach
Add an L2 cache for the STS response, shared across isolates within a colo via the Cloudflare Cache API, the same pattern already used for Source API responses in `source_api/cache.rs`. It sits under multistore's in-isolate L1 cache:
- … `BackendCredentials`, single-flights within an isolate.
- … `RoleArn` (L1's own cache key). On a hit, the proxy skips the STS round-trip entirely.
The only seam `data.source.coop` controls in the mint path is `FetchHttpExchange::post_form` (the outbound STS call); `get_credentials` and the L1 cache live inside `multistore`. So the L2 cache wraps `post_form`.
Effect: STS goes from ~once per isolate per credential lifetime → ~once per colo. The slow exchange leaves the user hot path almost entirely.
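The wrapping idea can be sketched with a plain in-memory map standing in for the Cache API. This is a hedged illustration only: `post_form_cached`, its parameters, and the `HashMap` tier are hypothetical, TTL handling is left out, and the real `post_form` is async and talks to the Cloudflare Cache API rather than a local map.

```rust
use std::collections::HashMap;

// Hypothetical wrapper around the outbound STS call.
// `l2` stands in for the cross-isolate cache tier, keyed by role ARN;
// `call_sts` stands in for the real network exchange.
fn post_form_cached<F>(
    l2: &mut HashMap<String, String>,
    role_arn: &str,
    call_sts: F,
) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String>,
{
    // Hit: return the stored response and skip the STS round-trip.
    if let Some(body) = l2.get(role_arn) {
        return Ok(body.clone());
    }
    // Miss: run the exchange. Errors propagate and are never stored.
    let body = call_sts()?;
    l2.insert(role_arn.to_string(), body.clone());
    Ok(body)
}
```

Because the wrapper sits at the outbound call, the L1 cache inside the credential provider needs no changes: it only sees a faster exchange on an L2 hit.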
What's cached (and not)
- … `AssumeRoleWithWebIdentity` forms (`role_arn_from_form` returns `None` for other actions / Azure-GCP flows → bypass).
- … `<Expiration>` minus a 300s lead (≥ multistore's 60s refresh lead, so an L2 entry always expires before L1 would call the derived credential stale).
- … `ttl_secs` returns `None` when there's no parseable `<Expiration>`.
- … `multistore` (noted below).
Security
The cached values are short-lived, role-scoped temporary credentials, stored under a synthetic non-routable cache key (`https://sts-creds.cache.internal/…`, never a real edge request URL, so not externally addressable), with TTL ≤ credential lifetime, per-colo. If a deployment needs global reach, encryption-at-rest, or true cross-isolate single-flight (a cold colo can still see a small STS herd), the same `cache_key`/`ttl_secs` helpers drop into KV (global, encrypted) or a Durable Object (global, single-flight).
Follow-up (not here)
Cleaner long-term: give `multistore`'s `CredentialCache::get_or_fetch` an optional runtime L2 hook (the crate doc already anticipates "a runtime can layer an additional cache tier inside the closure"). That caches typed creds at L1 and skips the JWT mint on hits too, but it's a cross-repo API change + release, whereas this ships from `data.source.coop` today.
Verification
- `cargo test --test sts_cache`: 8/8 (role/key/ttl helpers, incl. error-doc and near-expiry → not cached).
- `cargo check --target wasm32-unknown-unknown`: clean.
- `cargo clippy --target wasm32-unknown-unknown -- -D warnings`: clean.
🤖 Generated with Claude Code