Set timeouts on the reporting HTTP connection by mitchh456 · Pull Request #617 · scoutapp/scout_apm_ruby

mitchh456 · 2026-06-18T16:07:00Z

Background

A customer running on AWS Fargate (Puma cluster with preload_app!, plus a separate good_job container) saw deploys go from ~2 min to ~11 min and frequently fail — Puma would log boot but never reach Listening, so health checks never passed. Setting monitor: false made the problem disappear entirely.

Root cause

The Railtie starts the agent during Rails init, which under preload_app! runs in the Puma master before fork. There it spawns the AppServerLoad/background/error-reporting threads, which immediately make HTTPS calls to checkin.scoutapp.com / errors.scoutapm.com.

Reporter#http built its Net::HTTP client without any timeouts, so when egress to those hosts was momentarily slow, a reporting thread stayed blocked inside a native call (DNS/TLS/read) for up to Net::HTTP's 60s default. If fork() fired while that thread held a process-global lock (resolver, OpenSSL, malloc arena, or the Logger mutex), the forked worker inherited the lock with no thread alive to release it → deadlock on boot. The race explains the intermittency ("retry sometimes works") and the perfect correlation with monitor on/off. The same missing timeout also stalls graceful shutdown/drain, since at_exit → stop_background_worker joins the worker thread.

Change

Set open_timeout / read_timeout on the reporting connection.
Make them configurable via new connect_timeout / read_timeout options (both default 5s), with coercions and docs.
This Reporter is shared by the checkin and errors paths, so both are covered.

This doesn't depend on correct app-server detection, so it helps every deployment regardless of how Puma is launched. It bounds the dangerous window to a few seconds instead of 60s+, and bounds shutdown.

Out of scope (follow-up)

The deeper fix — not spawning background threads / doing network I/O in a forking master pre-fork — is a larger lifecycle change and is complicated by the fact that this app is detected as null (booted via bin/rails server, so $0 isn't puma), meaning a forking?-based gate wouldn't even trigger here. Worth a separate PR with its own discussion. The timeout fix above is the safe, universal mitigation.

Testing

New test/unit/reporter_test.rb: default timeouts, configured overrides, string coercion (env vars), and plain-HTTP endpoints.
config_test.rb still green.

🤖 Generated with Claude Code

The Reporter built its Net::HTTP client without setting open_timeout or read_timeout, so a slow or unreachable Scout host fell back to Net::HTTP's 60s default and left the reporting thread stuck inside a native call (DNS, TLS, socket read). This is harmful in a preloaded/forking app server: when one of these reporting threads is spawned in the master during preload and is mid-call at fork() time, the forked worker can inherit a lock (resolver/OpenSSL/ malloc/logger mutex) held by a thread that does not exist in the child, deadlocking the worker on boot. It also stalls graceful shutdown/drain, since at_exit joins the background worker thread. Add configurable connect_timeout / read_timeout (both default 5s) and apply them to every reporting connection (checkin and errors share this Reporter). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Spg5MGVB1XXoxJh6LnmgRu

This was referenced Jun 22, 2026

Agent starts background threads during app boot — unsafe under forking servers (preload), and fails in thread-constrained processes #618

Open

Make the agent fork-safe (hook Process._fork) #619

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Set timeouts on the reporting HTTP connection#617

Set timeouts on the reporting HTTP connection#617
mitchh456 wants to merge 1 commit into
masterfrom
add-http-timeouts-to-reporter

mitchh456 commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mitchh456 commented Jun 18, 2026

Background

Root cause

Change

Out of scope (follow-up)

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant