Agent starts background threads during app boot — unsafe under forking servers (preload), and fails in thread-constrained processes #618

## Summary

The agent creates its long-lived background threads (and makes blocking network calls) **during application boot**. On a forking app server with `preload_app!`, those threads are started in the master process *before* it forks workers. This is unsafe and produces intermittent, hard-to-diagnose boot failures, and in thread/pid/memory-constrained environments it can fail thread allocation outright. This issue documents the observed behaviors and proposes fixes.

## Observed behaviors

### 1. Threads are started in the preloading master, before fork

The Railtie eagerly calls `Agent#install` → `start` during Rails initialization (`lib/scout_apm.rb:221-228`). Under a forking server with `preload_app!`, this runs in the **master** during preload. `start` then unconditionally spawns:

- the `AppServerLoad` thread, which makes a blocking HTTP POST (`lib/scout_apm/agent.rb:82` → `lib/scout_apm/app_server_load.rb:12`)
- the metrics background worker thread (`lib/scout_apm/agent.rb:84`)
- the error-service background worker thread (`lib/scout_apm/agent.rb:85`)
- the `BackgroundRecorder` thread when `async_recording: true` (`lib/scout_apm/agent_context.rb:236-243` → `lib/scout_apm/background_recorder.rb:21`)

Puma detects this and warns:

```
! WARNING: Detected 4 Thread(s) started in app boot:
! .../scout_apm/agent.rb:176 sleep                 (metrics worker)
! .../scout_apm/agent.rb:210 sleep                 (error-service worker)
! .../timeout-0.6.1/lib/timeout.rb:87 sleep_forever
! .../scout_apm/background_recorder.rb:36           (background recorder)
```

Forking a process that has live threads is unsafe: only the calling thread survives in the child, and any lock held by another thread at `fork()` time (resolver, OpenSSL, malloc arena, `Logger` mutex, etc.) is inherited locked with no owner. The result is **intermittent worker boot deadlocks** — the worker never finishes booting / the server never reaches a listening state, so deploys behind a health check hang and roll back. It is timing-dependent (a retry sometimes succeeds), which matches a fork/thread race. Disabling monitoring (`monitor: false`) eliminates it entirely.
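The first half of that claim (only the calling thread survives in the child) can be observed with plain Ruby. The following is a minimal sketch, not agent code: `live_threads_in_child` is a hypothetical helper, the sleeping thread stands in for an agent worker, and it assumes a platform where `fork` is available. It does not reproduce the inherited-lock deadlock, which is timing-dependent.

```ruby
# Minimal sketch: a thread that is alive in the parent does not exist in the
# forked child. Requires a platform that supports fork (not Windows or JRuby).

def live_threads_in_child
  # Stand-in for an agent worker that sleeps between reporting cycles.
  background = Thread.new { sleep }

  reader, writer = IO.pipe
  pid = fork do
    reader.close
    # In the child, report how many Ruby threads exist and whether the
    # parent's background thread is still alive here.
    writer.write("#{Thread.list.size},#{background.alive?}")
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  writer.close
  report = reader.read
  reader.close
  Process.wait(pid)

  count, alive = report.split(",")
  { child_threads: count.to_i, background_alive_in_child: alive == "true" }
ensure
  background&.kill
end

result = live_threads_in_child
puts result.inspect
```

Any state the background thread was in the middle of mutating at fork time, including a mutex it held, is copied into the child as-is, with no thread left to finish the work or release the lock.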

`start_background_worker?` already exists and returns `!forking?` (`lib/scout_apm/agent.rb:131-134`), but `start` does not consult it, and the Railtie calls `start` regardless.
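A sketch of what consulting that predicate could look like follows. `SketchAgent` is a hypothetical stand-in written for this issue, not the scout_apm class; the `:deferred`/`:started` return values and the `start_after_fork` method are assumptions made for illustration.

```ruby
# Hypothetical stand-in for the agent; not the scout_apm implementation.
class SketchAgent
  attr_reader :started_threads

  def initialize(forking:)
    @forking = forking
    @started_threads = []
  end

  def forking?
    @forking
  end

  # Mirrors the predicate described above: start workers only if this
  # process is not going to fork.
  def start_background_worker?
    !forking?
  end

  # Called at boot. Creates no threads when the process will fork later.
  def start
    return :deferred unless start_background_worker?

    start_workers
    :started
  end

  # Intended to be called from a post-fork hook inside each worker process.
  def start_after_fork
    start_workers
  end

  def stop
    @started_threads.each(&:kill)
    @started_threads.clear
  end

  private

  def start_workers
    @started_threads << Thread.new { sleep }
  end
end

master = SketchAgent.new(forking: true)
boot_result = master.start          # no threads created in the master
threads_after_boot = master.started_threads.size
master.start_after_fork             # what a worker would do after fork
threads_after_fork = master.started_threads.size
master.stop
```

The design choice here is that `start` itself refuses to spawn threads, so a caller such as the Railtie cannot bypass the check by calling `start` directly.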

### 2. `ThreadError: can't alloc thread` in constrained processes

In a non-forking but thread/pid/memory-constrained process (e.g. a background-job container running as PID 1), agent startup can fail with:

```
INFO : Failed Sending Application Startup Info - can't alloc thread
```

`can't alloc thread` is `EAGAIN` from `pthread_create` — the process hit `RLIMIT_NPROC`, the cgroup `pids.max`, or ran out of memory for a new thread stack. Because the agent adds ~3 always-on threads on top of the host application's own thread pool, it can push a near-ceiling process over the edge — and starve the host application of the threads *it* needs to start, so the failure is not isolated to the agent.
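One way to keep such a failure contained is to rescue `ThreadError` at the point of thread creation. The sketch below is illustrative only: `spawn_guarded` and its `spawner:` parameter are hypothetical names, and the failing spawner simulates exhaustion rather than reproducing a real resource limit.

```ruby
require "logger"
require "stringio"

# Hypothetical helper: returns the new Thread, or nil if it could not be created.
# `spawner` is injectable so the failure path can be exercised without
# actually exhausting process limits.
def spawn_guarded(name, logger:, spawner: Thread.method(:new), &block)
  spawner.call do
    Thread.current.name = name
    block.call
  end
rescue ThreadError => e
  logger.warn("Could not start #{name} thread (#{e.message}); continuing without it")
  nil
end

log_output = StringIO.new
logger = Logger.new(log_output)

# Normal case: the thread is created and runs the block.
worker = spawn_guarded("metrics-worker", logger: logger) { :done }
worker.join

# Simulated exhaustion: a spawner that raises the error seen in the log above.
exhausted = ->(&_block) { raise ThreadError, "can't alloc thread" }
missing = spawn_guarded("error-worker", logger: logger, spawner: exhausted) { :never_runs }
```

Callers then have to handle a `nil` return, which makes the "agent runs without this worker" state explicit instead of accidental.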

### 3. App-server detection silently degrades to `null`

Puma is only detected when the process name starts with `puma` (`lib/scout_apm/server_integrations/puma.rb:23`: `File.basename($0) =~ /\Apuma/`). When the server is launched via `bin/rails server`, `$0` is `rails`, no integration matches, and detection falls through to `Null` (`lib/scout_apm/server_integrations/null.rb`), which reports `forking? => false` and installs no `before_worker_boot` hook.
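The effect of the program-name check can be seen by applying the quoted regex to both launch commands. In the sketch below, `puma_by_program_name?` reproduces only the quoted expression, while `puma_present?` and its `puma_loaded:` parameter are hypothetical and show one possible broader check.

```ruby
# The $0-based pattern quoted above.
PUMA_PROGRAM_NAME = /\Apuma/

def puma_by_program_name?(program_name)
  !!(File.basename(program_name) =~ PUMA_PROGRAM_NAME)
end

# Hypothetical broader check: also accept a loaded Puma constant.
# `puma_loaded` defaults to whether ::Puma is defined in this process.
def puma_present?(program_name, puma_loaded: !!defined?(::Puma))
  puma_by_program_name?(program_name) || puma_loaded
end

launched_as_puma  = puma_by_program_name?("/usr/local/bin/puma")
launched_as_rails = puma_by_program_name?("bin/rails")
rails_with_puma   = puma_present?("bin/rails", puma_loaded: true)
```

A loaded constant alone does not prove that Puma is running in cluster mode with preloading, so a real check would still need to inspect that context before reporting `forking?` as true.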

Consequences:
- `forking?` is wrong, so even a `forking?`-aware deferral (proposed fix 1 below) would not trigger.
- The `before_worker_boot` hook that is supposed to (re)start the worker post-fork is never registered, so a forked worker only starts the agent lazily on first request.

The agent is nonetheless started in the master because `PRECONDITION_DETECTED_SERVER` passes whenever **any** app-server *or* background-job integration is found (`lib/scout_apm/agent/preconditions.rb:27-36`). A detected background-job framework is therefore enough to satisfy it, independent of the `null` app-server result.

### 4. No HTTP timeouts on reporting (related)

`Reporter#http` builds its `Net::HTTP` client with no `open_timeout`/`read_timeout` (`lib/scout_apm/reporter.rb:121-133`), so a slow/unreachable host leaves a reporting thread blocked inside a native call for up to Net::HTTP's 60s default — widening the dangerous window in (1) and stalling graceful shutdown (the `at_exit` handler joins the worker thread). Addressed in #617.
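For reference, setting explicit timeouts on a `Net::HTTP` client looks roughly like this. `build_reporting_client`, the host name, and the timeout values are assumptions chosen for illustration, not the change made in #617.

```ruby
require "net/http"
require "uri"

# Hypothetical builder; the timeout values are illustrative.
def build_reporting_client(uri, open_timeout: 5, read_timeout: 10)
  http = Net::HTTP.new(uri.host, uri.port) # does not connect yet
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # seconds to wait for the connection to open
  http.read_timeout = read_timeout # seconds to wait for each read
  http
end

client = build_reporting_client(URI("https://reporting.example.invalid/"))
```

With both values set, a reporting thread that hits an unreachable host fails within a known bound instead of sitting in a native call for the default duration.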

### 5. The error-service thread starts unconditionally

`start_error_service_background_worker` is called from `start` (`lib/scout_apm/agent.rb:85`) and is **not** gated by `errors_enabled` (`lib/scout_apm/agent.rb:206-215`). The thread is created even when the error service is disabled, contributing to (2).

## Proposed fixes

1. **Do not create background threads (or make network calls) during app boot in a forking/preloading master.** Defer all thread creation to the post-fork hook so threads only ever exist in a process that will not fork again. Honor the existing `start_background_worker?`/`forking?` signal in `start`, and have the Railtie defer when running under a forking server.

2. **Make forking detection reliable.** Don't rely solely on `$0` to detect Puma. Also check whether Puma is actually loaded and running in cluster/preload mode (e.g. `defined?(::Puma)` plus cluster/preload context) so that `forking?` is correct regardless of launcher, and ensure the `before_worker_boot` hook is installed in that case. At minimum, treat "preloaded app, app server unknown" conservatively (defer thread start).

3. **Reduce always-on thread footprint and make threads lazy.**
   - Gate the error-service worker behind `errors_enabled` (observed behavior 5).
   - Consider not spawning reporting threads until there is data to report.
   - Guard `Thread.new` call sites so a `ThreadError` is logged and survivable rather than surfacing as an opaque failure.

4. **Set HTTP timeouts on the reporting connection** (#617), and bound the background-worker `join` on shutdown so drain cannot hang.
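Fixes 3 and 4 can be combined in a worker that creates its thread only when the first item arrives and bounds the `join` on shutdown. The following is a rough sketch under assumptions: `LazyWorker` is a hypothetical class, not part of scout_apm, and the 2-second default timeout is arbitrary.

```ruby
# Hypothetical worker: no thread exists until the first item is enqueued.
class LazyWorker
  def initialize(&handler)
    @handler = handler
    @queue = Queue.new
    @mutex = Mutex.new
    @thread = nil
  end

  def thread_started?
    @mutex.synchronize { !@thread.nil? }
  end

  def enqueue(item)
    # Create the thread on first use; the mutex prevents double creation.
    @mutex.synchronize { @thread ||= Thread.new { run } }
    @queue << item
  end

  # Returns true if the worker drained and exited within the timeout,
  # false if the join timed out. Never blocks longer than `timeout` seconds.
  def shutdown(timeout: 2)
    thread = @mutex.synchronize { @thread }
    return true if thread.nil?

    @queue.close
    !thread.join(timeout).nil?
  end

  private

  def run
    # Queue#pop returns nil once the queue is closed and empty, ending the loop.
    # Items themselves must therefore not be nil or false.
    while (item = @queue.pop)
      @handler.call(item)
    end
  end
end

handled = []
worker = LazyWorker.new { |item| handled << item }
started_before = worker.thread_started?
worker.enqueue(1)
worker.enqueue(2)
drained = worker.shutdown(timeout: 2)
```

A process that never has data to report never pays for the thread, and a stuck handler delays shutdown by at most the timeout rather than indefinitely.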

## Acceptance

- With a forking app server + `preload_app!`, no agent threads are started in the master; Puma emits no "Detected N Thread(s) started in app boot" warning naming `scout_apm`.
- Worker boot is deterministic (no fork/thread race); repeated deploys succeed.
- Agent startup degrades gracefully (logs and continues) if a thread cannot be allocated.
- `forking?` is correct under Puma regardless of launch command.

