feat: add test-both override for empirical deadlock resolution #1
claytona500 wants to merge 1 commit into peteromallet:main
When the critique loop stagnates (ESCALATE), the only options today are add-note, force-proceed, or abort, all of which punt the decision to the human without evidence. This adds a test-both override that invokes a judge agent to evaluate the current plan against an alternative approach, then renders a verdict (approach_a, approach_b, or synthesis) based on empirical assessment.

Changes:
- New test-both.json schema for structured judge output
- Judge prompt in prompts.py that evaluates both approaches against unresolved flags
- _override_test_both handler in cli.py with full state machine integration
- Mock worker for test-both in workers.py
- Default agent routing (claude) in _core.py
- Updated infer_next_steps to surface test-both for ESCALATE/ABORT
- Documentation in instructions.md
- 15 new tests covering all verdict paths, state transitions, and schema

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
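To make the three-way verdict concrete, here is a minimal sketch of how a judge result could be parsed and validated. It is an illustration only, not the code in this PR: the names `Verdict`, `JudgeResult`, `parse_judge_output`, and the `rationale` field are hypothetical, and the real `test-both.json` schema may be shaped differently. Only the three verdict values come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    """The three verdicts named in the PR description."""
    APPROACH_A = "approach_a"
    APPROACH_B = "approach_b"
    SYNTHESIS = "synthesis"


@dataclass(frozen=True)
class JudgeResult:
    verdict: Verdict
    rationale: str  # hypothetical field, assumed for illustration


def parse_judge_output(payload: dict) -> JudgeResult:
    """Validate a judge payload and reject verdicts outside the enum."""
    raw = payload.get("verdict")
    try:
        verdict = Verdict(raw)
    except ValueError:
        allowed = ", ".join(v.value for v in Verdict)
        raise ValueError(
            f"unknown verdict {raw!r}; expected one of: {allowed}"
        ) from None
    return JudgeResult(verdict=verdict, rationale=str(payload.get("rationale", "")))
```

Rejecting unknown verdicts at parse time keeps a malformed judge response from silently driving a state transition.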
Hey Clayton, nice idea; the problem you're targeting is real. Some thoughts before we go deeper:

v0.3.0 context

We just shipped v0.3.0 on main which changes how the orchestrator handles ESCALATE. The orchestrator now has:

This covers the most common stagnation case: the critic keeps flagging things that are implementation details or phantom concerns. The orchestrator investigates and force-proceeds.

Where does the alternative come from?

The judge prompt asks the judge to both propose approach B and rule between A and B. That's a conflict of interest: it's judging its own proposal. The "empirical" framing implies independent evaluation, but it's really one agent arguing with itself.

Does stagnation actually need a whole new approach?

When the loop stagnates, it's usually because:

Case 3 is the only one where an alternative plan would help. But generating a credible alternative in a single agent call is a lot to ask: it's doing in one step what the full clarify→plan→critique loop does in many.

Simpler alternatives that feed into the existing system

A few things that might solve the same problem with less machinery:

Not saying no

The core insight, "break deadlocks with evidence instead of just punting", is good. I'm just wondering if we can get there by making the existing actors smarter (orchestrator judgment, fresh critic sessions, robustness tuning) rather than adding a new actor. What do you think?

The PR will also need a rebase against v0.3.0.
Summary
When the critique loop hits ESCALATE, the current options are `add-note`, `force-proceed`, or `abort`, all of which punt the decision to the human without evidence. This PR adds a `test-both` override that breaks deadlocks empirically: a judge agent evaluates the current plan against an alternative approach and renders a verdict of `approach_a`, `approach_b`, or `synthesis`.

Motivation

Inspired by adversarial convergence patterns where competing approaches are tested empirically rather than debated endlessly. When the same critique concerns recur across iterations and neither force-proceed nor add-note resolves the impasse, `test-both` gives the orchestrator an evidence-based path forward.

Changes

- `schemas.py`: New `test-both.json` schema with structured approach comparison and verdict enum
- `prompts.py`: Judge prompt that evaluates both approaches against unresolved flags
- `workers.py`: Mock worker, schema filename mapping, session key for `test-both` step
- `_core.py`: Default agent routing (claude) for `test-both`
- `cli.py`: `_override_test_both` handler with full state machine integration; updated `infer_next_steps` and argparse choices
- `instructions.md`: Documentation for the new override option
- `tests/test_test_both.py`: 15 new tests covering all verdict paths, state transitions, schema, and mock
- `tests/test_megaplan.py`, `tests/test_schemas.py`: Updated existing parametrized tests

Usage
Test plan
- `test-both` only available from EVALUATED state with ESCALATE/ABORT recommendation

🤖 Generated with Claude Code
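The availability rule in the test plan can be sketched as a small predicate. This is a simplified illustration, not the actual `infer_next_steps` logic: the function names `can_test_both` and `overrides_for` are hypothetical, and the assumption that the three existing overrides are always offered is made only to keep the example short.

```python
# State and recommendation names are taken from the PR text.
TEST_BOTH_STATE = "EVALUATED"
TEST_BOTH_RECOMMENDATIONS = frozenset({"ESCALATE", "ABORT"})


def can_test_both(state: str, recommendation: str) -> bool:
    """Return True only for EVALUATED plans with an ESCALATE/ABORT recommendation."""
    return state == TEST_BOTH_STATE and recommendation in TEST_BOTH_RECOMMENDATIONS


def overrides_for(state: str, recommendation: str) -> list[str]:
    """List override options, adding test-both only when the rule above holds.

    Assumption: the three existing overrides are offered unconditionally here;
    the real CLI may gate them on state as well.
    """
    options = ["add-note", "force-proceed", "abort"]
    if can_test_both(state, recommendation):
        options.append("test-both")
    return options
```

Keeping the rule in a single predicate means the next-step hints and the override handler can share one check instead of duplicating the condition.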