Agent starts background threads during app boot — unsafe under forking servers (preload), and fails in thread-constrained processes #618

Description

@mitchh456

Summary

The agent creates its long-lived background threads (and makes blocking network calls) during application boot. On a forking app server with preload_app!, those threads are started in the master process before it forks workers. This is unsafe and produces intermittent, hard-to-diagnose boot failures, and in thread/pid/memory-constrained environments it can fail thread allocation outright. This issue documents the observed behaviors and proposes fixes.

Observed behaviors

1. Threads are started in the preloading master, before fork

The Railtie eagerly calls Agent#install → start during Rails initialization (lib/scout_apm.rb:221-228). Under a forking server with preload_app!, this runs in the master during preload. start then unconditionally spawns:

  • the AppServerLoad thread, which makes a blocking HTTP POST (lib/scout_apm/agent.rb:82 → lib/scout_apm/app_server_load.rb:12)
  • the metrics background worker thread (lib/scout_apm/agent.rb:84)
  • the error-service background worker thread (lib/scout_apm/agent.rb:85)
  • the BackgroundRecorder thread when async_recording: true (lib/scout_apm/agent_context.rb:236-243 → lib/scout_apm/background_recorder.rb:21)

Puma detects this and warns:

! WARNING: Detected 4 Thread(s) started in app boot:
! .../scout_apm/agent.rb:176 sleep                 (metrics worker)
! .../scout_apm/agent.rb:210 sleep                 (error-service worker)
! .../timeout-0.6.1/lib/timeout.rb:87 sleep_forever
! .../scout_apm/background_recorder.rb:36           (background recorder)

Forking a process that has live threads is unsafe: only the calling thread survives in the child, and any lock held by another thread at fork() time (resolver, OpenSSL, malloc arena, Logger mutex, etc.) is inherited locked with no owner. The result is intermittent worker boot deadlocks: the worker never finishes booting, the server never reaches a listening state, and deploys behind a health check hang and roll back. The failure is timing-dependent (a retry sometimes succeeds), which is consistent with a fork/thread race. Disabling monitoring (monitor: false) eliminates it entirely.
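For anyone who wants to see the "only the calling thread survives" behavior in isolation, here is a minimal standalone sketch (plain Ruby on a platform with fork, no agent code involved). The method name is made up for illustration. The background thread here only sleeps and holds no locks, so this particular fork is harmless; the danger described above comes from threads that do hold a lock at fork time.

```ruby
# Starts a thread, forks, and reports what the child process can see.
# Hypothetical helper for illustration only; not part of scout_apm.
def threads_seen_by_forked_child
  background = Thread.new { sleep } # stands in for an agent worker thread
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    # In the child, only the thread that called fork is still running.
    writer.puts "#{Thread.list.size} #{background.alive?}"
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  writer.close
  count, alive = reader.read.split
  reader.close
  Process.wait(pid)

  parent_count = Thread.list.size
  background.kill
  background.join

  {
    child_thread_count: count.to_i,
    background_alive_in_child: alive == "true",
    parent_thread_count: parent_count
  }
end
```

Calling `threads_seen_by_forked_child` should show the background thread alive in the parent and dead in the child, which is why any worker thread started in the preloading master does no useful work in the forked workers.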

start_background_worker? already exists and returns !forking? (lib/scout_apm/agent.rb:131-134), but start does not consult it, and the Railtie calls start regardless.
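A rough sketch of what "start consults start_background_worker?" could look like. SketchAgent is a simplified, hypothetical stand-in and not the real ScoutApm::Agent; the forking: constructor argument, the return symbols, and the stop method are assumptions made only to keep the example self-contained.

```ruby
# Hypothetical, simplified stand-in for the agent; not the real ScoutApm::Agent.
class SketchAgent
  attr_reader :threads

  def initialize(forking:)
    @forking = forking
    @threads = []
  end

  def forking?
    @forking
  end

  # Mirrors the existing predicate described above: !forking?
  def start_background_worker?
    !forking?
  end

  # Called at boot. Spawns nothing in a process that is going to fork.
  def start
    return :deferred unless start_background_worker?

    start_background_workers
    :started
  end

  # Called directly from the post-fork hook, where creating threads is safe.
  def start_background_workers
    return unless @threads.empty?

    @threads << Thread.new { sleep } # placeholder for the metrics worker loop
  end

  def stop
    @threads.each(&:kill).each(&:join)
    @threads.clear
  end
end
```

With this shape, the preloading master gets `:deferred` from `start` and owns no threads, while each worker calls `start_background_workers` from its post-fork hook.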

2. ThreadError: can't alloc thread in constrained processes

In a non-forking but thread/pid/memory-constrained process (e.g. a background-job container running as PID 1), agent startup can fail with:

INFO : Failed Sending Application Startup Info - can't alloc thread

can't alloc thread is EAGAIN from pthread_create — the process hit RLIMIT_NPROC, the cgroup pids.max, or ran out of memory for a new thread stack. Because the agent adds ~3 always-on threads on top of the host application's own thread pool, it can push a near-ceiling process over the edge — and starve the host application of the threads it needs to start, so the failure is not isolated to the agent.
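One way to keep a failed thread allocation survivable is to wrap thread creation in a small helper that rescues ThreadError. The sketch below is an assumption about how that could look, not existing agent code: start_thread_safely, its spawner: argument (there only so the failure path can be exercised without exhausting real limits), and the log wording are all hypothetical. The logger can be any object that responds to warn.

```ruby
# Hypothetical helper: start a named background thread, or log and return nil
# when the thread cannot be created (Ruby raises ThreadError in that case).
def start_thread_safely(name, logger, spawner: Thread.method(:new), &work)
  thread = spawner.call(&work)
  thread.name = name
  thread
rescue ThreadError => e
  logger.warn("Could not start #{name} thread (#{e.message}); continuing without it")
  nil
end
```

A caller would treat a nil return as "this worker is unavailable" and carry on booting, rather than letting the error surface from deep inside agent startup.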

3. App-server detection silently degrades to null

Puma is only detected when the process name starts with puma (lib/scout_apm/server_integrations/puma.rb:23: File.basename($0) =~ /\Apuma/). When the server is launched via bin/rails server, $0 is rails, no integration matches, and detection falls through to Null (lib/scout_apm/server_integrations/null.rb), which reports forking? => false and installs no before_worker_boot hook.

Consequences:

  • forking? is wrong, so even a forking?-aware deferral (fix 1 below) would not trigger.
  • The before_worker_boot hook that is supposed to (re)start the worker post-fork is never registered, so a forked worker only starts the agent lazily on first request.

The agent is nonetheless started in the master because PRECONDITION_DETECTED_SERVER passes whenever any app-server or background-job integration is found (lib/scout_apm/agent/preconditions.rb:27-36). The presence of a background-job framework is enough, independent of the null app-server result.
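To make the detection gap concrete, the sketch below reproduces the process-name check quoted above and contrasts it with a broader check. Both method names are made up for illustration. The broader check is only an assumption about one possible direction: defined?(::Puma) shows that Puma is loaded, not that it is the running server or that it is in cluster/preload mode, so a real fix would need an additional signal that is deliberately left out here.

```ruby
# The existing check described above: matches only when the process name
# starts with "puma".
def puma_by_program_name?(program_name)
  !!(File.basename(program_name) =~ /\Apuma/)
end

# Hypothetical broader check (not the agent's current code): also treat the
# process as Puma when the ::Puma constant is loaded.
def puma_present?(program_name = $0)
  puma_by_program_name?(program_name) || !defined?(::Puma).nil?
end
```

With `bin/rails server`, `puma_by_program_name?("bin/rails")` is false even though Puma is serving requests, which is the path that currently falls through to the Null integration.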

4. No HTTP timeouts on reporting (related)

Reporter#http builds its Net::HTTP client with no open_timeout/read_timeout (lib/scout_apm/reporter.rb:121-133), so a slow/unreachable host leaves a reporting thread blocked inside a native call for up to Net::HTTP's 60s default — widening the dangerous window in (1) and stalling graceful shutdown (the at_exit handler joins the worker thread). Addressed in #617.
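For reference, a minimal sketch of a Net::HTTP client with explicit timeouts plus a bounded join on shutdown. This is not the change in #617; the method names, the constant names, and the 5s/10s values are placeholders chosen for illustration.

```ruby
require "net/http"
require "uri"

# Hypothetical values; the right numbers are a judgment call for the maintainers.
OPEN_TIMEOUT_SECONDS = 5
READ_TIMEOUT_SECONDS = 10

# Builds (but does not open) a client with explicit connect and read timeouts.
def build_reporting_client(endpoint)
  uri = URI.parse(endpoint)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = OPEN_TIMEOUT_SECONDS
  http.read_timeout = READ_TIMEOUT_SECONDS
  http
end

# Thread#join(limit) returns nil if the thread is still running after
# `limit` seconds, so shutdown can give up instead of hanging.
def drain_worker(thread, limit_seconds)
  !thread.join(limit_seconds).nil?
end
```

`drain_worker` returns false when the worker did not finish in time, which lets an at_exit handler log the fact and move on.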

5. The error-service thread starts unconditionally

start_error_service_background_worker is called from start (lib/scout_apm/agent.rb:85) and is not gated by errors_enabled (lib/scout_apm/agent.rb:206-215). The thread is created even when the error service is disabled, contributing to (2).
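A small sketch of the gating this implies: decide which workers to start from configuration before creating any thread. The config is modeled as a plain Hash and the worker names are symbols purely for illustration; the real agent's config object and defaults are not shown here and may differ.

```ruby
# Hypothetical sketch: list the workers a process should start, given its
# config. Only the metrics worker is treated as unconditional.
def workers_to_start(config)
  workers = [:metrics]
  workers << :error_service if config[:errors_enabled]
  workers << :background_recorder if config[:async_recording]
  workers
end
```

With errors disabled and synchronous recording, this yields a single worker instead of three, which matters most in the near-ceiling processes described in (2).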

Proposed fixes

  1. Do not create background threads (or make network calls) during app boot in a forking/preloading master. Defer all thread creation to the post-fork hook so threads only ever exist in a process that will not fork again. Honor the existing start_background_worker?/forking? signal in start, and have the Railtie defer when running under a forking server.

  2. Make forking detection reliable. Don't rely solely on $0 to detect Puma — also detect when running under Puma (e.g. defined?(::Puma) + cluster/preload context) so forking? is correct regardless of launcher, and ensure the before_worker_boot hook is installed in that case. At minimum, treat "preloaded app, app server unknown" conservatively (defer thread start).

  3. Reduce always-on thread footprint and make threads lazy.

    • Gate the error-service worker behind errors_enabled (observed behavior 5).
    • Consider not spawning reporting threads until there is data to report.
    • Guard Thread.new call sites so a ThreadError is logged and survivable rather than surfacing as an opaque failure.
  4. Set HTTP timeouts on the reporting connection (#617), and bound the background-worker join on shutdown so drain cannot hang.
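The "no reporting thread until there is data" idea from fix 3 could look roughly like the sketch below. LazyReporter is hypothetical and much simpler than the agent's real workers: it creates its thread on the first enqueue, recreates it if the inherited thread is dead (as it would be in a forked child), and bounds the join on shutdown.

```ruby
# Hypothetical sketch of a reporter whose thread is created on first use.
class LazyReporter
  STOP = Object.new # sentinel that tells the worker loop to exit

  def initialize(&deliver)
    @deliver = deliver
    @queue = Queue.new
    @mutex = Mutex.new
    @thread = nil
  end

  def thread_started?
    @mutex.synchronize { !@thread.nil? && @thread.alive? }
  end

  def enqueue(payload)
    @queue << payload
    ensure_thread
  end

  # Asks the worker to finish and waits up to `limit_seconds`.
  # Returns true if the worker exited (or was never started).
  def shutdown(limit_seconds)
    thread = @mutex.synchronize { @thread }
    return true if thread.nil?

    @queue << STOP
    !thread.join(limit_seconds).nil?
  end

  private

  # Creates the worker only when needed. A thread inherited across fork is
  # dead in the child, so it is replaced here as well.
  def ensure_thread
    @mutex.synchronize do
      return if @thread && @thread.alive?

      @thread = Thread.new do
        until (payload = @queue.pop).equal?(STOP)
          @deliver.call(payload)
        end
      end
    end
  end
end
```

A process that never records anything never pays for the thread, and the preloading master stays thread-free as long as nothing is enqueued before fork.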

Acceptance

  • With a forking app server + preload_app!, no agent threads are started in the master; Puma emits no "Detected N Thread(s) started in app boot" warning naming scout_apm.
  • Worker boot is deterministic (no fork/thread race); repeated deploys succeed.
  • Agent startup degrades gracefully (logs and continues) if a thread cannot be allocated.
  • forking? is correct under Puma regardless of launch command.
