Cache failed fetches and cap the connect timeout to avoid per-request outbound calls #243

Open. HEROGWP wants to merge 1 commit into modosc:main from HEROGWP:fix/cache-failed-fetches-and-open-timeout

HEROGWP commented Jun 3, 2026

Problem

request.ip / request.remote_ip resolution runs Importer.cloudflare_ips on every request (via the CheckTrustedProxies / RemoteIpProxies patches), which fetches Cloudflare's IP ranges over HTTP. Two things make a blocked or slow www.cloudflare.com catastrophic under load:

  1. Failures are never cached. Rails.cache.fetch doesn't write when its block raises, and cloudflare_ips returned the fallback without memoizing @ips. So when egress is unavailable, every request issues a fresh outbound HTTP request → the worker pool saturates and the whole app stalls.
  2. No open_timeout. Only read_timeout is set, so a blackholed egress lets Net::HTTP block on the TCP connect for its default (~60s) per attempt.

Combined: under traffic, every request can block a worker for up to ~60s. (We hit exactly this in production behind a restricted egress.)
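The first failure mode can be reproduced without Rails. The sketch below is only an illustration: `TinyCache`, `BlockedUpstream`, and `lookup_without_negative_cache` are hypothetical names, and the cache is a stand-in that mimics the fetch-with-block behaviour described above (nothing is stored when the block raises), not the gem's actual code.

```ruby
# Hypothetical stand-in for a cache store. Like the behaviour described
# above, it stores nothing when the block raises.
class TinyCache
  def initialize
    @store = {}
  end

  def fetch(key)
    return @store[key] if @store.key?(key)

    value = yield # if this raises, the assignment below never runs
    @store[key] = value
  end
end

FALLBACK_IPS = [].freeze # placeholder for a built-in fallback list

# Simulated upstream that always fails and counts how often it is contacted.
class BlockedUpstream
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def fetch_ranges
    @calls += 1
    raise IOError, "egress blocked"
  end
end

# Pre-fix shape: rescue the error, return the fallback, cache nothing.
def lookup_without_negative_cache(cache, upstream)
  cache.fetch("cloudflare_ips") { upstream.fetch_ranges }
rescue IOError
  FALLBACK_IPS
end

cache = TinyCache.new
upstream = BlockedUpstream.new
100.times { lookup_without_negative_cache(cache, upstream) }
puts upstream.calls # prints 100: one upstream attempt per simulated request
```

In a real deployment each of those attempts would also wait on the connect timeout, which is what turns the missing cache entry into worker-pool exhaustion.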

Fix

  • Negative caching: on failure, fetch_with_cache caches the fallback for a short, configurable error_expires_in (default 1m). A failing upstream is now hit at most once per ttl, not once per request.
  • Self-healing: @ips is memoized only on a fully successful fetch, so once the short ttl lapses the next call retries the network and a transient outage recovers without a restart.
  • open_timeout added (configurable, default 5s) so a single attempt can't hang for ~60s.
  • race_condition_ttl on the success path to collapse the thundering herd when the cached entry expires under load.
  • An empty body is now treated as a failed fetch rather than cached as a successful (empty) list.
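As a rough sketch of how negative caching and self-healing fit together, the following pure-Ruby model may help. It is not the code in this PR: `ExpiringCache`, `IpImporter`, the `{ ok:, ips: }` entry shape, the injectable clock, and the example CIDR ranges are all assumptions made for illustration, and `race_condition_ttl` is not modelled.

```ruby
# Hypothetical in-memory cache with per-entry expiry and an injectable clock.
class ExpiringCache
  def initialize(clock)
    @clock = clock
    @store = {}
  end

  def read(key)
    value, expires_at = @store[key]
    return nil if expires_at.nil? || @clock.call >= expires_at

    value
  end

  def write(key, value, expires_in:)
    @store[key] = [value, @clock.call + expires_in]
    value
  end
end

# Hypothetical importer; names and defaults are illustrative only.
class IpImporter
  KEY = "cloudflare_ips"
  FALLBACK = ["203.0.113.0/24"].freeze # documentation range, not real data

  def initialize(cache:, fetcher:, expires_in: 12 * 3600, error_expires_in: 60)
    @cache = cache
    @fetcher = fetcher
    @expires_in = expires_in
    @error_expires_in = error_expires_in
  end

  def ips
    return @ips if @ips # set only after a successful fetch

    entry = @cache.read(KEY) || refresh
    @ips = entry[:ips] if entry[:ok]
    entry[:ips]
  end

  private

  def refresh
    ranges = @fetcher.call
    raise IOError, "empty response" if ranges.empty? # empty body is a failure

    @cache.write(KEY, { ok: true, ips: ranges }, expires_in: @expires_in)
  rescue StandardError
    # Negative caching: remember the fallback briefly so callers stop retrying.
    @cache.write(KEY, { ok: false, ips: FALLBACK }, expires_in: @error_expires_in)
  end
end

now = 0
calls = 0
healthy = false
fetcher = lambda do
  calls += 1
  raise IOError, "egress blocked" unless healthy

  ["198.51.100.0/24"]
end

importer = IpImporter.new(cache: ExpiringCache.new(-> { now }), fetcher: fetcher)
50.times { importer.ips } # outage: only the first call reaches the fetcher
calls_during_outage = calls

healthy = true
now += 61 # the 60s error ttl has lapsed
recovered = importer.ips # retries the network and succeeds
importer.ips # served from @ips, no further fetch
puts "outage calls: #{calls_during_outage}, total calls: #{calls}"
```

Because the fallback entry is flagged as a failure and never copied into `@ips`, recovery depends only on the short ttl expiring, not on a restart.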

Compatibility

  • New config: open_timeout, error_expires_in, race_condition_ttl (all defaulted; existing behaviour for successful fetches is unchanged).
  • Legacy bare-array cache entries are tolerated, so rolling deploys are safe.
  • Errors are now logged once per address family (v4 + v6) instead of short-circuiting after the first; the two existing failure specs were updated accordingly.
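To illustrate what the `open_timeout` setting governs, here is a minimal sketch of a `Net::HTTP` client with both timeouts set. The helper name `build_http`, the 5-second read timeout, and the `/ips-v4` URL path are assumptions for the example; the PR text only states that `open_timeout` defaults to 5s and that the fetch goes to www.cloudflare.com.

```ruby
require "net/http"
require "uri"

# Hypothetical helper; the name and default values are illustrative.
def build_http(url, open_timeout: 5, read_timeout: 5)
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port) # does not connect yet
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # cap on establishing the connection
  http.read_timeout = read_timeout # cap on each blocking read
  http
end

# Assumed URL, used here only to construct the client object.
http = build_http("https://www.cloudflare.com/ips-v4")
puts "open=#{http.open_timeout}s read=#{http.read_timeout}s"
```

Without the `open_timeout` assignment, a connection attempt to an unreachable host would wait for the library default instead of the configured cap.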

Tests

Added specs covering: success memoization (no repeat network call), failed-fetch negative caching (no per-request hammering), and recovery after the error ttl lapses. bundle exec rake (all rack-attack variants) and bundle exec rubocop pass.

Resolving a request's real IP runs `Importer.cloudflare_ips` on every request
(via the `CheckTrustedProxies` / `RemoteIpProxies` patches). That method fetches
Cloudflare's published ranges over HTTP. Two issues make a blocked or slow
upstream catastrophic under load:

1. Failures were never cached. `Rails.cache.fetch` does not write when its block
   raises, and `cloudflare_ips` returned the fallback without memoizing `@ips`.
   So when egress to cloudflare.com is unavailable, *every* request issued a
   fresh outbound HTTP request, exhausting the worker pool.

2. Only `read_timeout` was set. With no `open_timeout`, a blackholed egress lets
   Net::HTTP block on the TCP connect for its default (~60s) per attempt.

Changes:
- `fetch_with_cache` now caches the fallback for a short `error_expires_in` ttl
  on failure (negative caching), so a failing upstream is hit at most once per
  ttl instead of once per request.
- `@ips` is only memoized on a fully successful fetch, so once the short ttl
  lapses the next call retries the network and a transient outage self-heals
  without a process restart.
- `fetch` now sets `open_timeout` (configurable, default 5s).
- the success path passes `race_condition_ttl` to collapse the thundering herd
  when a cached entry expires under load.
- an empty response body is treated as a failed fetch rather than cached as a
  successful (empty) list.

New config: `open_timeout`, `error_expires_in`, `race_condition_ttl`.
Legacy bare-array cache entries are tolerated for rolling deploys.