Skip to content

Set timeouts on the reporting HTTP connection#617

Open
mitchh456 wants to merge 1 commit into
masterfrom
add-http-timeouts-to-reporter
Open

Set timeouts on the reporting HTTP connection#617
mitchh456 wants to merge 1 commit into
masterfrom
add-http-timeouts-to-reporter

Conversation

@mitchh456

Copy link
Copy Markdown
Contributor

Background

A customer running on AWS Fargate (Puma cluster with preload_app!, plus a separate good_job container) saw deploys go from ~2 min to ~11 min and frequently fail — Puma would log boot but never reach Listening, so health checks never passed. Setting monitor: false made the problem disappear entirely.

Root cause

The Railtie starts the agent during Rails init, which under preload_app! runs in the Puma master before fork. There it spawns the AppServerLoad/background/error-reporting threads, which immediately make HTTPS calls to checkin.scoutapp.com / errors.scoutapm.com.

Reporter#http built its Net::HTTP client without any timeouts, so when egress to those hosts was momentarily slow, a reporting thread stayed blocked inside a native call (DNS/TLS/read) for up to Net::HTTP's 60s default. If fork() fired while that thread held a process-global lock (resolver, OpenSSL, malloc arena, or the Logger mutex), the forked worker inherited the lock with no thread alive to release it → deadlock on boot. The race explains the intermittency ("retry sometimes works") and the perfect correlation with monitor on/off. The same missing timeout also stalls graceful shutdown/drain, since at_exitstop_background_worker joins the worker thread.

Change

  • Set open_timeout / read_timeout on the reporting connection.
  • Make them configurable via new connect_timeout / read_timeout options (both default 5s), with coercions and docs.
  • This Reporter is shared by the checkin and errors paths, so both are covered.

This doesn't depend on correct app-server detection, so it helps every deployment regardless of how Puma is launched. It bounds the dangerous window to a few seconds instead of 60s+, and bounds shutdown.

Out of scope (follow-up)

The deeper fix — not spawning background threads / doing network I/O in a forking master pre-fork — is a larger lifecycle change and is complicated by the fact that this app is detected as null (booted via bin/rails server, so $0 isn't puma), meaning a forking?-based gate wouldn't even trigger here. Worth a separate PR with its own discussion. The timeout fix above is the safe, universal mitigation.

Testing

  • New test/unit/reporter_test.rb: default timeouts, configured overrides, string coercion (env vars), and plain-HTTP endpoints.
  • config_test.rb still green.

🤖 Generated with Claude Code

The Reporter built its Net::HTTP client without setting open_timeout or
read_timeout, so a slow or unreachable Scout host fell back to Net::HTTP's
60s default and left the reporting thread stuck inside a native call (DNS,
TLS, socket read).

This is harmful in a preloaded/forking app server: when one of these
reporting threads is spawned in the master during preload and is mid-call
at fork() time, the forked worker can inherit a lock (resolver/OpenSSL/
malloc/logger mutex) held by a thread that does not exist in the child,
deadlocking the worker on boot. It also stalls graceful shutdown/drain,
since at_exit joins the background worker thread.

Add configurable connect_timeout / read_timeout (both default 5s) and
apply them to every reporting connection (checkin and errors share this
Reporter).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Spg5MGVB1XXoxJh6LnmgRu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant