Set timeouts on the reporting HTTP connection#617
Open
mitchh456 wants to merge 1 commit into
Open
Conversation
The Reporter built its Net::HTTP client without setting open_timeout or read_timeout, so a slow or unreachable Scout host fell back to Net::HTTP's 60s default and left the reporting thread stuck inside a native call (DNS, TLS, socket read). This is harmful in a preloaded/forking app server: when one of these reporting threads is spawned in the master during preload and is mid-call at fork() time, the forked worker can inherit a lock (resolver/OpenSSL/ malloc/logger mutex) held by a thread that does not exist in the child, deadlocking the worker on boot. It also stalls graceful shutdown/drain, since at_exit joins the background worker thread. Add configurable connect_timeout / read_timeout (both default 5s) and apply them to every reporting connection (checkin and errors share this Reporter). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Spg5MGVB1XXoxJh6LnmgRu
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
A customer running on AWS Fargate (Puma cluster with
preload_app!, plus a separate good_job container) saw deploys go from ~2 min to ~11 min and frequently fail — Puma would log boot but never reachListening, so health checks never passed. Settingmonitor: falsemade the problem disappear entirely.Root cause
The Railtie starts the agent during Rails init, which under
preload_app!runs in the Puma master before fork. There it spawns the AppServerLoad/background/error-reporting threads, which immediately make HTTPS calls tocheckin.scoutapp.com/errors.scoutapm.com.Reporter#httpbuilt itsNet::HTTPclient without any timeouts, so when egress to those hosts was momentarily slow, a reporting thread stayed blocked inside a native call (DNS/TLS/read) for up to Net::HTTP's 60s default. Iffork()fired while that thread held a process-global lock (resolver, OpenSSL, malloc arena, or theLoggermutex), the forked worker inherited the lock with no thread alive to release it → deadlock on boot. The race explains the intermittency ("retry sometimes works") and the perfect correlation withmonitoron/off. The same missing timeout also stalls graceful shutdown/drain, sinceat_exit→stop_background_workerjoins the worker thread.Change
open_timeout/read_timeouton the reporting connection.connect_timeout/read_timeoutoptions (both default 5s), with coercions and docs.This doesn't depend on correct app-server detection, so it helps every deployment regardless of how Puma is launched. It bounds the dangerous window to a few seconds instead of 60s+, and bounds shutdown.
Out of scope (follow-up)
The deeper fix — not spawning background threads / doing network I/O in a forking master pre-fork — is a larger lifecycle change and is complicated by the fact that this app is detected as
null(booted viabin/rails server, so$0isn'tpuma), meaning aforking?-based gate wouldn't even trigger here. Worth a separate PR with its own discussion. The timeout fix above is the safe, universal mitigation.Testing
test/unit/reporter_test.rb: default timeouts, configured overrides, string coercion (env vars), and plain-HTTP endpoints.config_test.rbstill green.🤖 Generated with Claude Code