feat: add test-both override for empirical deadlock resolution #1

Open
claytona500 wants to merge 1 commit into peteromallet:main from
claytona500:feat/test-both-override

@claytona500

Summary

When the critique loop hits ESCALATE, the current options are add-note, force-proceed, or abort — all of which punt the decision to the human without evidence. This PR adds a test-both override that breaks deadlocks empirically:

  • Invokes a judge agent to evaluate the current plan (approach A) against an alternative approach (approach B) that addresses the unresolved flags
  • The judge renders a structured verdict: approach_a, approach_b, or synthesis
  • The verdict determines the next step: approach_a wins → gate, approach_b/synthesis → integrate with the judge's recommendations
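The verdict-to-step mapping above can be sketched roughly as follows. This is an illustration only: the `Verdict` enum and `next_step_for` function are hypothetical names, not taken from the PR's code; only the three verdict values and the `gate`/`integrate` step names come from the description.

```python
from enum import Enum


class Verdict(str, Enum):
    """The three verdict values named in the PR description."""
    APPROACH_A = "approach_a"
    APPROACH_B = "approach_b"
    SYNTHESIS = "synthesis"


def next_step_for(verdict: str) -> str:
    """Map a judge verdict to the next step (hypothetical helper)."""
    parsed = Verdict(verdict)  # raises ValueError for an unknown verdict
    if parsed is Verdict.APPROACH_A:
        # The current plan survived the comparison, so it moves on to gate.
        return "gate"
    # approach_b and synthesis both require folding the judge's
    # recommendations back into the plan.
    return "integrate"
```

Rejecting unknown verdicts up front keeps a malformed judge response from silently falling through to `integrate`.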

Motivation

Inspired by adversarial convergence patterns where competing approaches are tested empirically rather than debated endlessly. When the same critique concerns recur across iterations and neither force-proceed nor add-note resolves the impasse, test-both gives the orchestrator an evidence-based path forward.

Changes

  • schemas.py — New test-both.json schema with structured approach comparison and verdict enum
  • prompts.py — Judge prompt that evaluates both approaches against unresolved flags
  • workers.py — Mock worker, schema filename mapping, session key for test-both step
  • _core.py — Default agent routing (claude) for test-both
  • cli.py — _override_test_both handler with full state machine integration; updated infer_next_steps and argparse choices
  • instructions.md — Documentation for the new override option
  • tests/test_test_both.py — 15 new tests covering all verdict paths, state transitions, schema, and mock
  • tests/test_megaplan.py, tests/test_schemas.py — Updated existing parametrized tests
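For readers who have not opened the diff, a schema with a "structured approach comparison and verdict enum" might look something like the sketch below. The field names `approach_a`, `approach_b`, and `recommendations`, and the `validate_judge_output` helper, are assumptions made for illustration; only the three verdict values are confirmed by the description.

```python
from typing import List

# Hypothetical shape of test-both.json; field names are assumptions.
TEST_BOTH_SCHEMA = {
    "type": "object",
    "required": ["approach_a", "approach_b", "verdict"],
    "properties": {
        "approach_a": {"type": "object"},
        "approach_b": {"type": "object"},
        "verdict": {"enum": ["approach_a", "approach_b", "synthesis"]},
        "recommendations": {"type": "array", "items": {"type": "string"}},
    },
}


def validate_judge_output(payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the payload looks valid.

    This only checks required keys and the verdict enum. A real
    implementation would more likely use a full JSON Schema validator.
    """
    errors = []
    for key in TEST_BOTH_SCHEMA["required"]:
        if key not in payload:
            errors.append(f"missing required field: {key}")
    allowed = TEST_BOTH_SCHEMA["properties"]["verdict"]["enum"]
    if "verdict" in payload and payload["verdict"] not in allowed:
        errors.append(f"invalid verdict: {payload['verdict']!r}")
    return errors
```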

Usage

megaplan override test-both --plan <name> --reason "critique loop stagnated"
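A minimal sketch of how the command line above could be parsed with argparse, assuming the override action is a positional argument restricted by `choices`. The parser layout, and whether `--plan` is required or `--reason` optional, are assumptions rather than the actual cli.py code.

```python
import argparse

# The three existing overrides plus the new one, as listed in the summary.
OVERRIDE_ACTIONS = ["add-note", "force-proceed", "abort", "test-both"]


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser for `megaplan override <action>`."""
    parser = argparse.ArgumentParser(prog="megaplan")
    sub = parser.add_subparsers(dest="command", required=True)
    override = sub.add_parser("override")
    override.add_argument("action", choices=OVERRIDE_ACTIONS)
    override.add_argument("--plan", required=True)
    override.add_argument("--reason", default="")
    return parser


args = build_parser().parse_args(
    ["override", "test-both", "--plan", "demo", "--reason", "critique loop stagnated"]
)
```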

Test plan

  • All 15 new tests pass
  • Full suite of 289 tests passes (274 existing + 15 new)
  • test-both only available from EVALUATED state with ESCALATE/ABORT recommendation
  • All three verdict paths (approach_a, approach_b, synthesis) produce correct state transitions
  • History entry, override metadata, and artifacts written correctly
  • Manual test with real agents on a stagnated plan

🤖 Generated with Claude Code

When the critique loop stagnates (ESCALATE), the only options today are
add-note, force-proceed, or abort — all of which punt the decision to the
human without evidence. This adds a test-both override that invokes a judge
agent to evaluate the current plan against an alternative approach, then
renders a verdict (approach_a, approach_b, or synthesis) based on empirical
assessment.

Changes:
- New test-both.json schema for structured judge output
- Judge prompt in prompts.py that evaluates both approaches against
  unresolved flags
- _override_test_both handler in cli.py with full state machine integration
- Mock worker for test-both in workers.py
- Default agent routing (claude) in _core.py
- Updated infer_next_steps to surface test-both for ESCALATE/ABORT
- Documentation in instructions.md
- 15 new tests covering all verdict paths, state transitions, and schema

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@peteromallet
Owner

Hey Clayton — nice idea, the problem you're targeting is real. Some thoughts before we go deeper:

v0.3.0 context

We just shipped v0.3.0 on main which changes how the orchestrator handles ESCALATE. The orchestrator now has:

  • Judgment-based continue/stop decisions — it reads the flags, checks the actual project code, and determines whether flags are real concerns or implementation noise
  • Authority to force-proceed with evidence — it can cite the numbers, explain per-flag reasoning, and act without asking
  • Rich evaluation signals — unresolved_flags with details, resolved_flags, loop_summary, idea, recurring_critiques

This covers the most common stagnation case: the critic keeps flagging things that are implementation details or phantom concerns. The orchestrator investigates and force-proceeds.

Where does the alternative come from?

The judge prompt asks the judge to both propose approach B and rule between A and B. That's a conflict of interest — it's judging its own proposal. The "empirical" framing implies independent evaluation, but it's really one agent arguing with itself.

Does stagnation actually need a whole new approach?

When the loop stagnates, it's usually because:

  1. Flags are noise → orchestrator judgment handles this now (v0.3.0)
  2. Flags are real but plan-unfixable → force-proceed, executor handles it
  3. The approach is fundamentally wrong → the user needs to redirect

Case 3 is the only one where an alternative plan would help. But generating a credible alternative in a single agent call is a lot to ask — it's doing in one step what the full clarify→plan→critique loop does in many.
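To make the three cases concrete, here is a rough sketch of the triage as a lookup table. The enum, the `ask-user` action label, and both function names are hypothetical and exist only to illustrate the argument; they are not part of megaplan.

```python
from enum import Enum


class Stagnation(Enum):
    NOISE = 1            # flags are implementation noise or phantom concerns
    PLAN_UNFIXABLE = 2   # flags are real but cannot be fixed at plan level
    WRONG_APPROACH = 3   # the approach itself is fundamentally wrong


# Hypothetical action labels; "ask-user" stands in for a user redirect.
ACTION_FOR = {
    Stagnation.NOISE: "force-proceed",
    Stagnation.PLAN_UNFIXABLE: "force-proceed",
    Stagnation.WRONG_APPROACH: "ask-user",
}


def resolve(case: Stagnation) -> str:
    """Return the action the review comment suggests for each case."""
    return ACTION_FOR[case]


def alternative_plan_helps(case: Stagnation) -> bool:
    """Only a fundamentally wrong approach benefits from an alternative plan."""
    return case is Stagnation.WRONG_APPROACH
```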

Simpler alternatives that feed into the existing system

A few things that might solve the same problem with less machinery:

  • --fresh on critique: if the critic is stuck in a rut, a fresh session re-evaluates from scratch. The recurring critique might not recur.
  • Robustness adjustment: orchestrator drops from thorough to standard when it detects churning — fewer nitpick flags.
  • Orchestrator flag triage: already in v0.3.0 — the orchestrator reads each flag, checks the code, classifies as plan-level vs executor-level, and force-proceeds on the noise.
  • User redirect on ESCALATE: if the approach is truly wrong, asking the user is the right call. They have constraints the judge doesn't know about.

Not saying no

The core insight — "break deadlocks with evidence instead of just punting" — is good. I'm just wondering if we can get there by making the existing actors smarter (orchestrator judgment, fresh critic sessions, robustness tuning) rather than adding a new actor. What do you think?

The PR will also need a rebase against v0.3.0 — max_iterations/budget_usd were removed, instructions were rewritten, and step responses are richer now.

