Skip to content

Single-model co-evolution: --single-model flag, loud failure UX, codex-access skill#30

Open
alanshurafa wants to merge 4 commits into
masterfrom
claude/single-model-test-strategies-i3uKx
Open

Single-model co-evolution: --single-model flag, loud failure UX, codex-access skill#30
alanshurafa wants to merge 4 commits into
masterfrom
claude/single-model-test-strategies-i3uKx

Conversation

@alanshurafa

Copy link
Copy Markdown
Owner

Summary

Adds opt-in same-model co-evolution to co-evolve-bouncer.sh, surfaces partner-agent failures loudly with an actionable HINT, and centralizes the codex launch pattern into a reusable skill + helper.

  • --single-model [claude|codex] — pins both reviewer and composer to one agent. Defaults to bare two-role mode (no preface) based on empirical A/B testing. Opt-in only; cross-vendor (claude,codex) remains the default invocation.
  • --persona-discipline — opt-in divergence preface. Decoupled from --single-model after the A/B showed the preface suppressed concrete code-grounded critique on dense technical documents (only 6 markers vs 14 for bare mode on the same paper section).
  • Loud failure UX — when a partner agent returns empty output on retry, the bouncer now exits 2, logs CO-EVOLVE INCOMPLETE, and emits a HINT pointing at --single-model <working_agent>. Suppressed under --single-model mode (no escape hatch to suggest).
  • codex-access skill + lib/codex-access.sh helper — centralizes the WSL→cmd.exe bridge pattern previously inlined across 6+ files. Three functions: codex_available, codex_invoke, codex_install_hint. Non-invasive integration; invoke_codex in lib/co-evolution.sh is unchanged.

Empirical findings driving the defaults

Input Bare two-role markers (pass 1) With-preface markers (pass 1) Winner
Open prompt (~30 words) 4 7 preface (more critique, marginal quality)
Dense paper section (1277 words, 04-pilot-data.md) 14 6 bare (concrete code-grounded findings vs abstract objections)

Hypothesis: the preface's "read as if a stranger wrote it" framing pushes the model into pure-text-critique mode and dampens its natural impulse to investigate the codebase via Read tool. Full methodology + data table in notesforhumans.md under "Same-Model A/B (2026-05-24)".

Invariants preserved

  • Default is cross-vendor adversarial review (claude,codex). No flag = no behavior change.
  • Single-model only via explicit flag. No auto-fallback to single-model when codex (or any partner) is unavailable. The bouncer fails loudly and points the user at the right escape hatch instead.

Test plan

  • tests/single-model-simulation.sh — 9/9 scenarios cover flag parsing, banner output, byte-parity for default runs, preface gating, and the new exit-2/HINT/INCOMPLETE machinery (scenarios H + I).
  • tests/codex-access-simulation.sh — 7/7 scenarios exercise all three launch-matrix branches via PATH-shadow stubs (no real codex/cmd.exe invoked).
  • tests/lab-routing-simulation.sh — 4/4, regression check confirms --lab routing byte-parity.
  • Real-model smoke test with --single-model claude --chain — compose → critique surfaces 4 contested + 3 clarify markers → defend resolves all → tighten polishes. Exit 0.
  • Real-world failure smoke test in this remote container (codex CLI unavailable): default invocation produces exit 2, CO-EVOLVE INCOMPLETE banner, WARNING that markers may be unresolved, and a HINT suggesting --single-model claude.
  • lib/codex-access.sh smoke test in codex-less env: codex_available returns 1, codex_invoke writes the install hint with npm install -g @openai/codex, OPENAI_API_KEY setup, WSL guidance, and --single-model claude fallback to stderr.

Files

  • co-evolve-bouncer.sh — new flags, failure tracking, INCOMPLETE banner, conditional HINT
  • templates/co-evolve/single-model-preface.md — new (gated by --persona-discipline)
  • lib/codex-access.sh — new helper with codex_available / codex_invoke / codex_install_hint
  • skills/codex-access/SKILL.md — new skill documenting the launch matrix + install paths
  • tests/single-model-simulation.sh — new, 9 scenarios
  • tests/codex-access-simulation.sh — new, 7 scenarios
  • README.md, CLAUDE.md, notesforhumans.md — docs + empirical methodology

https://claude.ai/code/session_018T89MTHJfAo1ZGgeygEC2B


Generated by Claude Code

claude added 4 commits May 24, 2026 04:13
Pins both reviewer and composer onto one agent (claude or codex) when a
second model isn't available, and prepends a persona-discipline preface
asking the model to deliberately diverge from its own prior turn rather
than nod along to shared-weight bias.

Default (cross-model) runs are byte-identical — flag is opt-in only.
Hermetic test exercises arg-parser + preamble-builder paths.

https://claude.ai/code/session_018T89MTHJfAo1ZGgeygEC2B
A/B test on a dense paper section (2026-05-24) showed the persona-discipline
preface SUPPRESSED concrete code-grounded critique — the bare same-model
baseline raised 14 markers (citing specific files and line numbers) while
the preface variant raised only 6 abstract methodological objections on
the same input.

Hypothesis: the preface's "read as if a stranger wrote it" + "suppress default
voice" framing pushes the model into pure-text-critique mode and dampens
its natural impulse to investigate the codebase via Read tool.

Change: --single-model now defaults to bare two-role mode (no preface).
The preface is preserved as opt-in via the new --persona-discipline flag,
which still suits compose-then-bounce of one's own draft (where shared-author
bias actually applies — that case wasn't isolated in the A/B).

Test scenario F flipped from "preface present by default" to "preface absent
by default"; scenario G added to cover the opt-in wiring. Empirical
methodology and findings logged in notesforhumans.md.

https://claude.ai/code/session_018T89MTHJfAo1ZGgeygEC2B
…ilable

Smoke test on a container without codex installed surfaced a silent-failure
mode: when AGENT_B returned empty output on retry the bouncer broke out of
the bounce loop but still emitted "CO-EVOLVE COMPLETE", exited 0, and handed
the user a half-bounced document with raw [CONTESTED]/[CLARIFY] markers
embedded in the prose.

Fix:
- Track agent-failure state in the bounce loop (BOUNCE_FAILED + FAILED_AGENT)
- Closing banner now reads "CO-EVOLVE INCOMPLETE (bounce aborted)" on failure
- Loud WARNING explains the bounce ended early and the output may contain
  unresolved markers
- Conditional HINT points the user at "--single-model <working_agent>" when
  they were running cross-vendor (the working agent is whichever AGENT_A/B
  was NOT the one that failed). Suppressed under --single-model since the
  user already opted in and there is no useful escape hatch to suggest.
- Exit code 2 (distinct from compose-failure's exit 1, distinct from success's 0)

Invariant preserved: no auto-fallback to single-model. The bouncer still
fails when the partner agent is unavailable — it just fails loudly now and
tells the user how to recover.

Test additions: scenarios H (cross-vendor failure → exit 2 + HINT) and I
(single-model failure → exit 2 without HINT). Stub now supports
STUB_<AGENT>_FAILS_AFTER=N to isolate bounce-failure from compose-failure.

https://claude.ai/code/session_018T89MTHJfAo1ZGgeygEC2B
Centralizes the codex launch matrix that was previously inlined across 6+
files (lib/co-evolution.sh, lab/pel/*/adapter.sh, dev-review/codex/dev-review.sh).

New helper functions in lib/codex-access.sh:
  - codex_available()  — exit 0 iff codex is reachable via WSL bridge or
                          direct PATH lookup
  - codex_invoke()     — drop-in for the existing invoke_codex contract;
                          routes through WSL→cmd.exe bridge when applicable,
                          falls back to native codex on PATH, otherwise
                          writes install hint to stderr + empty output_file
  - codex_install_hint() — multi-line actionable message covering npm install,
                          OPENAI_API_KEY auth, WSL gotcha, and the remote-
                          container fallback (--single-model claude)

Skill at skills/codex-access/SKILL.md documents the launch matrix, the
install paths for native + WSL + cloud cases, and is explicit about what
this skill does NOT do (no auto-install, no faking codex with claude, no
hidden fallback).

Non-invasive integration: lib/co-evolution.sh:invoke_codex is left alone
(it already implements the inline pattern). New scripts should source
lib/codex-access.sh instead of copying the inline form. A future refactor
may collapse invoke_codex onto the helper, gated on a separate change.

Test: tests/codex-access-simulation.sh exercises all three launch-matrix
branches via PATH-shadow stubs (no real codex/cmd.exe/wslpath invoked).
7/7 scenarios pass.

https://claude.ai/code/session_018T89MTHJfAo1ZGgeygEC2B

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02c1220007

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread co-evolve-bouncer.sh
Comment on lines +141 to +142
if [[ $# -ge 2 && -n "${2:-}" && "${2:0:1}" != "-" ]]; then
SINGLE_MODEL_AGENT="$2"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not consume the task as the optional single-model agent

When users follow the documented bare form, e.g. --single-model "Stress test this argument", this branch treats the positional task as the optional agent name and then fails because it is not claude or codex. Since the parser cannot distinguish a bare task from an optional positional agent, the advertised default-agent syntax is unusable unless the caller adds --, pipes input, or uses --single-model=claude; the bare flag should not consume the next non-flag argument.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants