Make the agent fork-safe (hook Process._fork)#619
Open
mitchh456 wants to merge 1 commit into
Open
Conversation
Under forking app servers (Puma cluster / preload_app!, and `rails server` which boots Puma in cluster mode), the Railtie starts the agent in the master during app load, spawning its background threads (metrics worker, error-service worker, async recorder, app-server-load reporter). Threads are not inherited by forked children, and a thread mid-operation (resolver / OpenSSL / malloc / logger lock) at fork() time leaves the child holding a lock with no owner -- an intermittent worker-boot deadlock. We cannot reliably detect "about to fork" at boot (under `rails server`, Puma has not configured itself when the Railtie runs, so Puma.cli_config is nil and a before_worker_boot hook can't be installed). Instead, hook Process._fork (Ruby 3.1+, no-op below that): stop the agent's threads just before the fork so none is alive at fork time, and restart them in both parent and child. Restarting on both sides keeps monitoring alive in the surviving parent and gives each child a fresh set of threads -- and it covers the null-detection case where no before_worker_boot hook exists. Also: - Guard the three Thread.new sites (app_server_load, background worker, error worker) against ThreadError so a process at its thread/pid ceiling degrades and logs instead of aborting Agent#start. - Start the error-service worker only when errors_enabled (was unconditional). - Install the at_exit shutdown handler once per process (it's inherited across fork; restarting must not stack handlers). Validated against a Puma+good_job repro on the local gem: for both `bundle exec puma` (detected :puma) and `bin/rails server` (detected :null), threads are torn down before each worker fork and restarted in every worker, workers boot cleanly, no regressions in the unit suite. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Spg5MGVB1XXoxJh6LnmgRu
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes the core problem in #618: the agent starts its background threads during app boot and leaves them alive across
fork(), which can intermittently deadlock worker boot under forking app servers (a thread holding a lock — resolver/OpenSSL/malloc/logger — at the instant offork()leaves the child holding it with no owner).Why a fork hook (not boot-time detection)
A probe (documented on #618) showed we can't detect "about to fork" when the Railtie runs: under
bin/rails server, Puma hasn't configured itself yet (Puma.cli_configisnil), so we can neither readpreload_app/workersnor install abefore_worker_boothook. Andrails serveron Rails 8 is Puma cluster+preload, so it's exactly the affected path.So instead of detecting the server, we make the fork itself safe by hooking
Process._fork(invoked by everyKernel#fork/Process.fork, including Puma's worker forks):Restarting on both sides keeps monitoring alive in the surviving parent (apps fork for reasons unrelated to web workers) and gives each child a fresh, working set of threads. Crucially it also covers the
:null-detection case (rails server), where nobefore_worker_boothook is ever installed — the_forkhook is then the only thing restarting the agent in each worker.Process._forkis Ruby 3.1+; the hook is a no-op below that (gemspec still declares>= 2.1), where the existingbefore_worker_boot+ first-request-middleware mitigations remain.Also in this PR
Thread.newsites (app_server_load.rb,agent.rbmetrics + error workers) againstThreadError— a process at its thread/pid ceiling now logs and degrades instead of abortingAgent#startuncaught (thecan't alloc threadsymptom).errors_enabled(it previously started unconditionally — one extra always-on thread).at_exitshutdown handler once per process (it's inherited across fork; restarting must not stack handlers).HTTP reporting timeouts are handled separately in #617.
Validation
Against a minimal Puma + good_job app on this branch (mirroring the reported setup), with debug logging:
bundle exec puma(detected:puma) andbin/rails server(detected:null): for every worker fork the log shows[ForkSafety] Stopping agent threads before fork→ fork →Restarting agent threads after forkin both the master and each worker child. Workers boot cleanly; no errors.283 runs, 0 failures(3 pre-existing errors are an unrelatedsqlite3version conflict inactive_record_test, identical with/without this change). New tests cover the fork lifecycle,ThreadErrorresilience, anderrors_enabledgating.Note: Puma's "Detected N Thread(s) started in app boot" warning still appears — by design. We intentionally keep the master's eager start (so non-forking processes like a good_job worker keep working) and make the fork safe; the threads are torn down inside
_fork, after Puma's enumeration. Removing the warning would require the lazy-start alternative (see #618), which was the rejected approach.🤖 Generated with Claude Code