Skip to content

Make the agent fork-safe (hook Process._fork)#619

Open
mitchh456 wants to merge 1 commit into
masterfrom
fork-safety
Open

Make the agent fork-safe (hook Process._fork)#619
mitchh456 wants to merge 1 commit into
masterfrom
fork-safety

Conversation

@mitchh456

Copy link
Copy Markdown
Contributor

Fixes the core problem in #618: the agent starts its background threads during app boot and leaves them alive across fork(), which can intermittently deadlock worker boot under forking app servers (a thread holding a lock — resolver/OpenSSL/malloc/logger — at the instant of fork() leaves the child holding it with no owner).

Why a fork hook (not boot-time detection)

A probe (documented on #618) showed we can't detect "about to fork" when the Railtie runs: under bin/rails server, Puma hasn't configured itself yet (Puma.cli_config is nil), so we can neither read preload_app/workers nor install a before_worker_boot hook. And rails server on Rails 8 is Puma cluster+preload, so it's exactly the affected path.

So instead of detecting the server, we make the fork itself safe by hooking Process._fork (invoked by every Kernel#fork/Process.fork, including Puma's worker forks):

  • before the fork: stop the agent's threads so none is alive at fork time
  • after the fork (parent and child): restart them

Restarting on both sides keeps monitoring alive in the surviving parent (apps fork for reasons unrelated to web workers) and gives each child a fresh, working set of threads. Crucially it also covers the :null-detection case (rails server), where no before_worker_boot hook is ever installed — the _fork hook is then the only thing restarting the agent in each worker.

Process._fork is Ruby 3.1+; the hook is a no-op below that (gemspec still declares >= 2.1), where the existing before_worker_boot + first-request-middleware mitigations remain.

Also in this PR

  • Guard the three Thread.new sites (app_server_load.rb, agent.rb metrics + error workers) against ThreadError — a process at its thread/pid ceiling now logs and degrades instead of aborting Agent#start uncaught (the can't alloc thread symptom).
  • Start the error-service worker only when errors_enabled (it previously started unconditionally — one extra always-on thread).
  • Install the at_exit shutdown handler once per process (it's inherited across fork; restarting must not stack handlers).

HTTP reporting timeouts are handled separately in #617.

Validation

Against a minimal Puma + good_job app on this branch (mirroring the reported setup), with debug logging:

  • bundle exec puma (detected :puma) and bin/rails server (detected :null): for every worker fork the log shows [ForkSafety] Stopping agent threads before fork → fork → Restarting agent threads after fork in both the master and each worker child. Workers boot cleanly; no errors.
  • Unit suite: 283 runs, 0 failures (3 pre-existing errors are an unrelated sqlite3 version conflict in active_record_test, identical with/without this change). New tests cover the fork lifecycle, ThreadError resilience, and errors_enabled gating.

Note: Puma's "Detected N Thread(s) started in app boot" warning still appears — by design. We intentionally keep the master's eager start (so non-forking processes like a good_job worker keep working) and make the fork safe; the threads are torn down inside _fork, after Puma's enumeration. Removing the warning would require the lazy-start alternative (see #618), which was the rejected approach.

🤖 Generated with Claude Code

Under forking app servers (Puma cluster / preload_app!, and `rails server`
which boots Puma in cluster mode), the Railtie starts the agent in the master
during app load, spawning its background threads (metrics worker, error-service
worker, async recorder, app-server-load reporter). Threads are not inherited by
forked children, and a thread mid-operation (resolver / OpenSSL / malloc /
logger lock) at fork() time leaves the child holding a lock with no owner -- an
intermittent worker-boot deadlock.

We cannot reliably detect "about to fork" at boot (under `rails server`, Puma
has not configured itself when the Railtie runs, so Puma.cli_config is nil and a
before_worker_boot hook can't be installed). Instead, hook Process._fork
(Ruby 3.1+, no-op below that): stop the agent's threads just before the fork so
none is alive at fork time, and restart them in both parent and child. Restarting
on both sides keeps monitoring alive in the surviving parent and gives each child
a fresh set of threads -- and it covers the null-detection case where no
before_worker_boot hook exists.

Also:
- Guard the three Thread.new sites (app_server_load, background worker, error
  worker) against ThreadError so a process at its thread/pid ceiling degrades
  and logs instead of aborting Agent#start.
- Start the error-service worker only when errors_enabled (was unconditional).
- Install the at_exit shutdown handler once per process (it's inherited across
  fork; restarting must not stack handlers).

Validated against a Puma+good_job repro on the local gem: for both
`bundle exec puma` (detected :puma) and `bin/rails server` (detected :null),
threads are torn down before each worker fork and restarted in every worker,
workers boot cleanly, no regressions in the unit suite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Spg5MGVB1XXoxJh6LnmgRu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant