fix: time-box document bouncer codex/claude calls to prevent infinite hang#38
fix: time-box document bouncer codex/claude calls to prevent infinite hang#38alanshurafa wants to merge 1 commit into
Conversation
… hang The document bouncers dispatched agents with no timeout, so a network-stalled codex read could block a run indefinitely (observed: a ~4h zombie). Route every compose/bounce call through invoke_agent_with_timeout, which the code pipeline already used but the document path never did. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 49a331fe05
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| log " Input: $INPUT_TYPE ($(echo "$INPUT_CONTENT" | wc -w | tr -d '\r\n ') words)" | ||
|
|
||
| invoke_agent "$AGENT_A" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" | ||
| invoke_agent_with_timeout "$AGENT_A" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file" |
There was a problem hiding this comment.
Kill the agent process, not only the wrapper shell
When this new document-bouncer call uses invoke_agent_with_timeout, it runs timeout --foreground ... bash -c '... invoke_codex/claude ...'. I checked timeout --help, which states that with --foreground, “children of COMMAND will not be timed out”; the actual codex/claude process is a child of that bash -c, so a network-stalled agent can keep running after the wrapper returns 124, and the retry path can start another stuck copy. The timeout needs to target/exec the agent process or kill the process group before these bouncer call sites can reliably prevent the hang.
Useful? React with 👍 / 👎.
| log "--------------------------------------------" | ||
|
|
||
| invoke_agent "$current_agent" "$prompt_file" "$output_file" "$stderr_file" | ||
| invoke_agent_with_timeout "$current_agent" "$prompt_file" "$output_file" "$stderr_file" |
There was a problem hiding this comment.
Abort timed-out bounce passes instead of completing
In a bounce pass, invoke_agent_with_timeout deliberately returns success and records a timeout only in LAST_INVOKE_EXIT_CODE, but this new call ignores that state. If a pass times out and leaves no output, the code retries, then breaks out as “Stopping” and the main flow still finalizes state.json as complete using the previous WORKING_FILE; if partial output exists, it can be accepted as the pass result. Check for LAST_INVOKE_EXIT_CODE == 124 here (as the dev-review path does) and abort/finalize aborted rather than marking an incomplete bounce complete.
Useful? React with 👍 / 👎.
What
The document bouncers (
co-evolve-bouncer.sh, legacyagent-bouncer.sh) dispatched Codex/Claude with no timeout. A network-stalled Codex read could block a run indefinitely — observed in practice as a ~4-hour zombie process that produced nothing.The code pipeline (
dev-review.sh) already wrapped every call ininvoke_agent_with_timeout(RNPT-05, defaultPHASE_TIMEOUT=1800). The document path never did. This routes it through the same guard.Changes
lib/co-evolution.sh— the timeout wrapper'scasenow acceptsclaude|opus)(wasopus)only), so the document bouncer'sclaudeagent label routes through the guard. Purely additive; dev-review only ever passesopus/codex.co-evolve-bouncer.sh— the 4 compose/bounce call sites now callinvoke_agent_with_timeout. The rawinvoke_agentdispatcher is kept on purpose: the wrapper's no-timeout-binary degradation branch falls back to it, so removing it would infinite-loop. Addedexport CLAUDE_MODELso a--claude-modeloverride survives the wrapper'sbash -cre-source (which re-applies the lib default).agent-bouncer/agent-bouncer.sh— same bug class. Added a 3-arginvoke_agent_boundedshim (keeps theinvoke_<agent>(prompt, output, stderr)convention thatvalidate_output's retry callback requires) used for the main bounce call, the retry callback, and the name-gen call.Behavior change
A persistently unreachable Codex now caps at ~2× the per-phase timeout (one retry on empty output) instead of hanging forever. Tighten via
PHASE_TIMEOUTif a shorter document-bounce ceiling is wanted.Verification
bash -nclean on all three scriptsreliability17/17 (bouncer auth-abort intact),claude-model-override4/4 (export didn't break wiring),bounce-state6/6 (S1 runs the bouncer through the new path with a codex stub; S5 covers agent-bouncer)🤖 Generated with Claude Code