Skip to content

Use response files for public agent output#62

Merged
wwind123 merged 2 commits into
mainfrom
codex/strip-gemini-internal-dialogue
May 3, 2026
Merged

Use response files for public agent output#62
wwind123 merged 2 commits into
mainfrom
codex/strip-gemini-internal-dialogue

Conversation

@wwind123
Copy link
Copy Markdown
Owner

@wwind123 wwind123 commented May 3, 2026

Summary

  • Add shared public response-file support under /tmp/coding-review-agent-loop/responses/.
  • Instruct Claude and Gemini to write the final public GitHub response to that file.
  • Prefer the response-file contents over stdout when present, while keeping existing stdout/marker parsing as fallback.
  • Add a Gemini regression test using PR Refresh architecture docs and memory map #61-style internal stdout chatter to verify only the response file is posted.
  • Document the response-file behavior.

Local mimic

I ran a local FakeRunner PR #61 mimic with Gemini stdout containing diagnostics/tool narration and a clean public response file. The posted comment was only:

The round 1 blocking issue has been fully addressed.

-- Google Gemini

<!-- AGENT_STATE: approved -->

No GitHub writes were made by the mimic.

Tests

  • /home/wwind123/tools/coding-review-agent-loop/.venv/bin/python -m pytest tests/test_agent_loop.py -q
  • python3 -m py_compile src/coding_review_agent_loop/agents/base.py src/coding_review_agent_loop/agents/claude.py src/coding_review_agent_loop/agents/gemini.py tests/test_agent_loop.py

-- OpenAI Codex

@wwind123
Copy link
Copy Markdown
Owner Author

wwind123 commented May 3, 2026

Now I have enough context to write a complete review.


PR #62 Review: Use response files for public agent output

Summary of change: Adds a shared mechanism in base.py for generating a UUID-keyed response file path under /tmp/coding-review-agent-loop/responses/. Claude and Gemini are instructed to write their public response there; the orchestrator reads the file and prefers it over stdout. Stdout parsing is kept as fallback. A new Gemini regression test verifies the happy path.

Correctness

public_response_path / read_public_response_file (base.py):

  • The UUID per-invocation guarantees no cross-run collision. mkdir(parents=True, exist_ok=True) is called eagerly at path-generation time (not file-write time), which is fine — it only creates the directory.
  • read_public_response_file catches FileNotFoundError and returns None (correct fallback); other I/O errors propagate loudly, which is the right call.
  • _safe_repo_slug handles / and : but not other potential shell-special characters in repo names (e.g. .. or whitespace). In practice GitHub repo names disallow most of these, so this is not a blocking issue.

Claude (claude.py):

  • Prompt is always appended last; read_public_response_file(response_path) or text correctly falls back to the parsed JSON stdout if the file is absent. Session ID is still sourced from stdout, which is correct — only the visible response text changes.

Gemini (gemini.py):

  • Prompt wrapping order: _with_public_response_marker_instruction(with_public_response_file_instruction(...)) — the file instruction is inside, the marker instruction outside. Both mechanisms are active, and the orchestrator prefers the file. The existing _parse_gemini_output / _strip_gemini_preamble fallback still operates on stdout when no file is written.

Test _maybe_write_public_response_file:

  • Regex r"Write the final public response.*?\n\n([^\n]+/responses/[^\n]+\.md)" correctly parses the injected instruction. The re.S flag lets .*? skip past "that should be posted to GitHub to this file:" to reach the \n\n separator before the path. Verified against with_public_response_file_instruction's f-string template.
  • If no regex match is found despite public_response_outputs being non-empty, outputs are not consumed and could bleed into subsequent calls. This is acceptable for test-helper code in a deterministic test suite, but worth noting.

No blocking issues found.

Same-PR follow-ups

  1. No test_claude_review_loop_prefers_public_response_file_over_stdout test. The PR adds test_gemini_review_loop_prefers_public_response_file_over_stdout but the analogous Claude test is missing. Claude's code path is symmetric (same public_response_path + read_public_response_file or text), so a mirror test using claude_outputs + public_response_outputs would complete the coverage in tests/test_agent_loop.py.

-- Anthropic Claude

@wwind123
Copy link
Copy Markdown
Owner Author

wwind123 commented May 3, 2026

The PR looks solid and correctly addresses the issue of internal agent dialogue leaking into GitHub comments by providing a dedicated response-file mechanism.

Same-PR follow-ups

  1. Missing Claude regression test. While a Gemini test was added (test_gemini_review_loop_prefers_public_response_file_over_stdout), an analogous test for Claude should be added to tests/test_agent_loop.py to ensure symmetric coverage, especially since Claude's output parsing (JSON-based) differs from Gemini's.

-- Google Gemini

@wwind123
Copy link
Copy Markdown
Owner Author

wwind123 commented May 3, 2026

Implemented and pushed the same-PR follow-up to PR #62.

Added test_claude_review_loop_prefers_public_response_file_over_stdout in tests/test_agent_loop.py, mirroring the Gemini regression while using Claude’s JSON output path and asserting the public response file content is what gets posted.

Verified:
python -m pytest tests/test_agent_loop.py -k 'public_response_file_over_stdout'
python -m pytest

Commit pushed:
6f5c67a Add Claude public response regression test

-- OpenAI Codex

@wwind123
Copy link
Copy Markdown
Owner Author

wwind123 commented May 3, 2026

The new test is well-constructed. A few checks on correctness:

  • claude_outputs is a valid JSON-encoded string with a result field containing internal chatter — correctly exercises Claude's --output-format json path via _parse_claude_output.
  • claude_call[-1] is the right index: Claude appends the prompt last (args.append(...)), distinct from Gemini which uses cmd[2] as the --prompt value.
  • The path assertion "/coding-review-agent-loop/responses/OWNER-REPO/claude/" matches the structure produced by _safe_repo_slug("OWNER/REPO") = "OWNER-REPO" plus agent name "claude".
  • read_public_response_file strips the text, and the asserted comment matches the provided public_response_outputs value exactly — no off-by-one whitespace issues.
  • run_pr_loop(...) == 0 verifies the loop correctly parses <!-- AGENT_STATE: approved --> from the file content and terminates cleanly.

The round 2 change directly and completely addresses the only same-PR follow-up from round 1. No new issues introduced.

-- Anthropic Claude

@wwind123
Copy link
Copy Markdown
Owner Author

wwind123 commented May 3, 2026

The PR #62 is well-implemented and effectively addresses the issue of internal agent dialogue leaking into GitHub comments. By introducing a dedicated "public response file" mechanism, it provides a robust way for agents to separate their internal reasoning and diagnostics from the final public output.

Key highlights of the PR:

  • Robust Mechanism: The use of UUID-keyed response files under /tmp/coding-review-agent-loop/responses/ ensures isolation between different runs and agents.
  • Graceful Fallback: The orchestrator maintains backward compatibility by falling back to stdout parsing if the response file is missing or empty.
  • Clear Instructions: Both Claude and Gemini are given explicit instructions in their prompts to use the response file for their final output.
  • Symmetric Test Coverage: Regression tests have been added for both Gemini and Claude to verify that the response file content is preferred over stdout chatter.
  • Belts and Suspenders: Gemini retains its existing marker-based filtering as an additional layer of protection.

I've verified the changes by running the newly added regression tests, and they pass as expected.

Same-PR follow-ups

None. The previously identified missing Claude test has been implemented and verified.

Future follow-ups

None.

-- Google Gemini

@wwind123 wwind123 merged commit e2d76ec into main May 3, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant