Skip to content

fix: time-box document bouncer codex/claude calls to prevent infinite hang#38

Open
alanshurafa wants to merge 1 commit into
masterfrom
claude/gifted-fermat-3ff5c3
Open

fix: time-box document bouncer codex/claude calls to prevent infinite hang#38
alanshurafa wants to merge 1 commit into
masterfrom
claude/gifted-fermat-3ff5c3

Conversation

@alanshurafa

Copy link
Copy Markdown
Owner

What

The document bouncers (co-evolve-bouncer.sh, legacy agent-bouncer.sh) dispatched Codex/Claude with no timeout. A network-stalled Codex read could block a run indefinitely — observed in practice as a ~4-hour zombie process that produced nothing.

The code pipeline (dev-review.sh) already wrapped every call in invoke_agent_with_timeout (RNPT-05, default PHASE_TIMEOUT=1800). The document path never did. This routes it through the same guard.

Changes

  • lib/co-evolution.sh — the timeout wrapper's case now accepts claude|opus) (was opus) only), so the document bouncer's claude agent label routes through the guard. Purely additive; dev-review only ever passes opus/codex.
  • co-evolve-bouncer.sh — the 4 compose/bounce call sites now call invoke_agent_with_timeout. The raw invoke_agent dispatcher is kept on purpose: the wrapper's no-timeout-binary degradation branch falls back to it, so removing it would infinite-loop. Added export CLAUDE_MODEL so a --claude-model override survives the wrapper's bash -c re-source (which re-applies the lib default).
  • agent-bouncer/agent-bouncer.sh — same bug class. Added a 3-arg invoke_agent_bounded shim (keeps the invoke_<agent>(prompt, output, stderr) convention that validate_output's retry callback requires) used for the main bounce call, the retry callback, and the name-gen call.

Behavior change

A persistently unreachable Codex now caps at ~2× the per-phase timeout (one retry on empty output) instead of hanging forever. Tighten via PHASE_TIMEOUT if a shorter document-bounce ceiling is wanted.

Verification

  • bash -n clean on all three scripts
  • Full suite: 25/25 (685s) — including reliability 17/17 (bouncer auth-abort intact), claude-model-override 4/4 (export didn't break wiring), bounce-state 6/6 (S1 runs the bouncer through the new path with a codex stub; S5 covers agent-bouncer)

🤖 Generated with Claude Code

… hang

The document bouncers dispatched agents with no timeout, so a network-stalled codex read could block a run indefinitely (observed: a ~4h zombie). Route every compose/bounce call through invoke_agent_with_timeout, which the code pipeline already used but the document path never did.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49a331fe05

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread co-evolve-bouncer.sh
log " Input: $INPUT_TYPE ($(echo "$INPUT_CONTENT" | wc -w | tr -d '\r\n ') words)"

invoke_agent "$AGENT_A" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file"
invoke_agent_with_timeout "$AGENT_A" "$compose_prompt_file" "$compose_output_file" "$compose_stderr_file"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Kill the agent process, not only the wrapper shell

When this new document-bouncer call uses invoke_agent_with_timeout, it runs timeout --foreground ... bash -c '... invoke_codex/claude ...'. I checked timeout --help, which states that with --foreground, “children of COMMAND will not be timed out”; the actual codex/claude process is a child of that bash -c, so a network-stalled agent can keep running after the wrapper returns 124, and the retry path can start another stuck copy. The timeout needs to target/exec the agent process or kill the process group before these bouncer call sites can reliably prevent the hang.

Useful? React with 👍 / 👎.

Comment thread co-evolve-bouncer.sh
log "--------------------------------------------"

invoke_agent "$current_agent" "$prompt_file" "$output_file" "$stderr_file"
invoke_agent_with_timeout "$current_agent" "$prompt_file" "$output_file" "$stderr_file"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Abort timed-out bounce passes instead of completing

In a bounce pass, invoke_agent_with_timeout deliberately returns success and records a timeout only in LAST_INVOKE_EXIT_CODE, but this new call ignores that state. If a pass times out and leaves no output, the code retries, then breaks out as “Stopping” and the main flow still finalizes state.json as complete using the previous WORKING_FILE; if partial output exists, it can be accepted as the pass result. Check for LAST_INVOKE_EXIT_CODE == 124 here (as the dev-review path does) and abort/finalize aborted rather than marking an incomplete bounce complete.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant