feat: add test-both override for empirical deadlock resolution by claytona500 · Pull Request #1 · peteromallet/megaplan

claytona500 · 2026-03-21T18:42:12Z

Summary

When the critique loop hits ESCALATE, the current options are add-note, force-proceed, or abort — all of which punt the decision to the human without evidence. This PR adds a test-both override that breaks deadlocks empirically:

Invokes a judge agent to evaluate the current plan (approach A) against an alternative approach (approach B) that addresses the unresolved flags
The judge renders a structured verdict: approach_a, approach_b, or synthesis
The verdict determines the next step: if approach_a wins, the plan proceeds to gate; if approach_b or synthesis wins, it proceeds to integrate with the judge's recommendations
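As a rough illustration of the verdict routing described above, here is a minimal sketch. The `Verdict` enum and `next_step_for` helper are hypothetical names chosen for this example and are not taken from the PR's code. Only the three verdict values and the gate/integrate targets come from the description.

```python
from enum import Enum


class Verdict(str, Enum):
    """The three verdict values named in the PR description."""
    APPROACH_A = "approach_a"
    APPROACH_B = "approach_b"
    SYNTHESIS = "synthesis"


def next_step_for(verdict: str) -> str:
    """Map a judge verdict to the next workflow step (hypothetical helper)."""
    parsed = Verdict(verdict)  # raises ValueError on an unknown verdict
    if parsed is Verdict.APPROACH_A:
        # The current plan won, so it moves on to the gate step.
        return "gate"
    # approach_b and synthesis both route to integrate, where the
    # judge's recommendations would be folded into the plan.
    return "integrate"
```

Rejecting unknown verdict strings up front keeps a malformed judge response from silently falling through to either branch.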

Motivation

Inspired by adversarial convergence patterns where competing approaches are tested empirically rather than debated endlessly. When the same critique concerns recur across iterations and neither force-proceed nor add-note resolves the impasse, test-both gives the orchestrator an evidence-based path forward.

Changes

schemas.py — New test-both.json schema with structured approach comparison and verdict enum
prompts.py — Judge prompt that evaluates both approaches against unresolved flags
workers.py — Mock worker, schema filename mapping, session key for test-both step
_core.py — Default agent routing (claude) for test-both
cli.py — _override_test_both handler with full state machine integration; updated infer_next_steps and argparse choices
instructions.md — Documentation for the new override option
tests/test_test_both.py — 15 new tests covering all verdict paths, state transitions, schema, and mock
tests/test_megaplan.py, tests/test_schemas.py — Updated existing parametrized tests
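For readers who want a feel for what a structured judge output with a verdict enum could look like, the sketch below is one possible shape. Every field name other than the three verdict values is an assumption made for illustration. The actual test-both.json schema in the PR may differ, and `check_judge_output` is a hypothetical helper, not a function from schemas.py.

```python
from __future__ import annotations

# Hypothetical schema shape: only the verdict enum values come from the PR text.
TEST_BOTH_SCHEMA = {
    "type": "object",
    "required": ["approach_a", "approach_b", "verdict"],
    "properties": {
        "approach_a": {"type": "object"},
        "approach_b": {"type": "object"},
        "verdict": {"enum": ["approach_a", "approach_b", "synthesis"]},
        "recommendations": {"type": "array", "items": {"type": "string"}},
    },
}


def check_judge_output(payload: dict) -> list[str]:
    """Return a list of problems found in a judge payload (empty if none)."""
    errors = []
    for key in TEST_BOTH_SCHEMA["required"]:
        if key not in payload:
            errors.append(f"missing required field: {key}")
    allowed = TEST_BOTH_SCHEMA["properties"]["verdict"]["enum"]
    if "verdict" in payload and payload["verdict"] not in allowed:
        errors.append(f"invalid verdict: {payload['verdict']!r}")
    return errors
```

This hand-rolled check only covers required keys and the enum. A real implementation would more likely rely on whatever schema validation the project already uses.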

Usage

megaplan override test-both --plan <name> --reason "critique loop stagnated"
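The command above implies an `override` subcommand whose action is restricted by argparse choices. The following is a minimal sketch of how such a parser could be wired, assuming the action is a positional argument and that `--plan` is required. The real cli.py may structure this differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser mirroring the usage line (layout is an assumption)."""
    parser = argparse.ArgumentParser(prog="megaplan")
    subparsers = parser.add_subparsers(dest="command", required=True)
    override = subparsers.add_parser("override")
    # The three pre-existing options named in the PR summary, plus test-both.
    override.add_argument(
        "action",
        choices=["add-note", "force-proceed", "abort", "test-both"],
    )
    override.add_argument("--plan", required=True)
    override.add_argument("--reason", default="")
    return parser


args = build_parser().parse_args(
    ["override", "test-both", "--plan", "demo", "--reason", "critique loop stagnated"]
)
```

With `choices` in place, argparse itself rejects an unknown override name before any handler runs.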

Test plan

All 15 new tests pass
All 289 tests pass (274 existing + 15 new)
test-both only available from EVALUATED state with ESCALATE/ABORT recommendation
All three verdict paths (approach_a, approach_b, synthesis) produce correct state transitions
History entry, override metadata, and artifacts written correctly
Manual test with real agents on a stagnated plan
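The availability rule in the test plan (EVALUATED state with an ESCALATE or ABORT recommendation) can be expressed as a small predicate. This is a sketch only: `can_test_both` is a hypothetical name, and the way the PR's infer_next_steps actually represents state and recommendation is not shown in this description.

```python
def can_test_both(state: str, recommendation: str) -> bool:
    """Return True when the test-both override should be offered.

    Assumes state and recommendation are plain upper-case strings,
    which is a guess about the real state machine's representation.
    """
    return state == "EVALUATED" and recommendation in {"ESCALATE", "ABORT"}
```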

🤖 Generated with Claude Code

When the critique loop stagnates (ESCALATE), the only options today are add-note, force-proceed, or abort, all of which punt the decision to the human without evidence. This adds a test-both override that invokes a judge agent to evaluate the current plan against an alternative approach, then renders a verdict (approach_a, approach_b, or synthesis) based on empirical assessment.

Changes:
- New test-both.json schema for structured judge output
- Judge prompt in prompts.py that evaluates both approaches against unresolved flags
- _override_test_both handler in cli.py with full state machine integration
- Mock worker for test-both in workers.py
- Default agent routing (claude) in _core.py
- Updated infer_next_steps to surface test-both for ESCALATE/ABORT
- Documentation in instructions.md
- 15 new tests covering all verdict paths, state transitions, and schema

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

peteromallet · 2026-03-22T23:19:48Z

Hey Clayton — nice idea, the problem you're targeting is real. Some thoughts before we go deeper:

v0.3.0 context

We just shipped v0.3.0 on main which changes how the orchestrator handles ESCALATE. The orchestrator now has:

Judgment-based continue/stop decisions — it reads the flags, checks the actual project code, and determines whether flags are real concerns or implementation noise
Authority to force-proceed with evidence — it can cite the numbers, explain per-flag reasoning, and act without asking
Rich evaluation signals — unresolved_flags with details, resolved_flags, loop_summary, idea, recurring_critiques

This covers the most common stagnation case: the critic keeps flagging things that are implementation details or phantom concerns. The orchestrator investigates and force-proceeds.

Where does the alternative come from?

The judge prompt asks the judge to both propose approach B and rule between A and B. That's a conflict of interest — it's judging its own proposal. The "empirical" framing implies independent evaluation, but it's really one agent arguing with itself.

Does stagnation actually need a whole new approach?

When the loop stagnates, it's usually because:

1. Flags are noise → orchestrator judgment handles this now (v0.3.0)
2. Flags are real but plan-unfixable → force-proceed, executor handles it
3. The approach is fundamentally wrong → the user needs to redirect

Case 3 is the only one where an alternative plan would help. But generating a credible alternative in a single agent call is a lot to ask — it's doing in one step what the full clarify→plan→critique loop does in many.

Simpler alternatives that feed into the existing system

A few things that might solve the same problem with less machinery:

--fresh on critique: if the critic is stuck in a rut, a fresh session re-evaluates from scratch. The recurring critique might not recur.
Robustness adjustment: orchestrator drops from thorough to standard when it detects churning — fewer nitpick flags.
Orchestrator flag triage: already in v0.3.0 — the orchestrator reads each flag, checks the code, classifies as plan-level vs executor-level, and force-proceeds on the noise.
User redirect on ESCALATE: if the approach is truly wrong, asking the user is the right call. They have constraints the judge doesn't know about.
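To make the robustness-adjustment idea a bit more concrete, here is one way churn detection could work. This is purely a sketch: the input shape (a list of critique labels per iteration), the threshold, and the `adjust_robustness` name are all assumptions, and none of it is taken from megaplan's code. Only the thorough and standard levels come from the comment above.

```python
from collections import Counter


def adjust_robustness(
    current: str,
    critiques_by_iteration: list[list[str]],
    threshold: int = 2,
) -> str:
    """Drop from 'thorough' to 'standard' when any critique keeps recurring.

    A critique counts once per iteration, so repeating it within a single
    iteration does not by itself look like churn.
    """
    counts = Counter(
        critique
        for iteration in critiques_by_iteration
        for critique in set(iteration)
    )
    churning = any(n >= threshold for n in counts.values())
    if current == "thorough" and churning:
        return "standard"
    return current
```

For example, `adjust_robustness("thorough", [["naming"], ["naming", "tests"]])` returns `"standard"` because the same critique appears in two iterations.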

Not saying no

The core insight — "break deadlocks with evidence instead of just punting" — is good. I'm just wondering if we can get there by making the existing actors smarter (orchestrator judgment, fresh critic sessions, robustness tuning) rather than adding a new actor. What do you think?

The PR will also need a rebase against v0.3.0 — max_iterations/budget_usd were removed, instructions were rewritten, and step responses are richer now.

claytona500 wants to merge 1 commit into peteromallet:main from claytona500:feat/test-both-override