Cache failed fetches and cap the connect timeout to avoid per-request outbound calls#243
Open
HEROGWP wants to merge 1 commit into
Open
Cache failed fetches and cap the connect timeout to avoid per-request outbound calls#243HEROGWP wants to merge 1 commit into
HEROGWP wants to merge 1 commit into
Conversation
Resolving a request's real IP runs `Importer.cloudflare_ips` on every request (via the `CheckTrustedProxies` / `RemoteIpProxies` patches). That method fetches Cloudflare's published ranges over HTTP. Two issues make a blocked or slow upstream catastrophic under load: 1. Failures were never cached. `Rails.cache.fetch` does not write when its block raises, and `cloudflare_ips` returned the fallback without memoizing `@ips`. So when egress to cloudflare.com is unavailable, *every* request issued a fresh outbound HTTP request, exhausting the worker pool. 2. Only `read_timeout` was set. With no `open_timeout`, a blackholed egress lets Net::HTTP block on the TCP connect for its default (~60s) per attempt. Changes: - `fetch_with_cache` now caches the fallback for a short `error_expires_in` ttl on failure (negative caching), so a failing upstream is hit at most once per ttl instead of once per request. - `@ips` is only memoized on a fully successful fetch, so once the short ttl lapses the next call retries the network and a transient outage self-heals without a process restart. - `fetch` now sets `open_timeout` (configurable, default 5s). - the success path passes `race_condition_ttl` to collapse the thundering herd when a cached entry expires under load. - an empty response body is treated as a failed fetch rather than cached as a successful (empty) list. New config: `open_timeout`, `error_expires_in`, `race_condition_ttl`. Legacy bare-array cache entries are tolerated for rolling deploys.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
request.ip/request.remote_ipresolution runsImporter.cloudflare_ipson every request (via theCheckTrustedProxies/RemoteIpProxiespatches), which fetches Cloudflare's IP ranges over HTTP. Two things make a blocked or slowwww.cloudflare.comcatastrophic under load:Rails.cache.fetchdoesn't write when its block raises, andcloudflare_ipsreturned the fallback without memoizing@ips. So when egress is unavailable, every request issues a fresh outbound HTTP request → the worker pool saturates and the whole app stalls.open_timeout. Onlyread_timeoutis set, so a blackholed egress letsNet::HTTPblock on the TCP connect for its default (~60s) per attempt.Combined: under traffic, every request can block a worker for up to ~60s. (We hit exactly this in production behind a restricted egress.)
Fix
fetch_with_cachecaches the fallback for a short, configurableerror_expires_in(default 1m). A failing upstream is now hit at most once per ttl, not once per request.@ipsis memoized only on a fully successful fetch, so once the short ttl lapses the next call retries the network and a transient outage recovers without a restart.open_timeoutadded (configurable, default 5s) so a single attempt can't hang for ~60s.race_condition_ttlon the success path to collapse the thundering herd when the cached entry expires under load.Compatibility
open_timeout,error_expires_in,race_condition_ttl(all defaulted; existing behaviour for successful fetches is unchanged).Tests
Added specs covering: success memoization (no repeat network call), failed-fetch negative caching (no per-request hammering), and recovery after the error ttl lapses.
bundle exec rake(all rack-attack variants) andbundle exec rubocoppass.