# Proposal: Ensemble + Judge-Refinement Loop for Early Pipeline Tasks

**Author:** Egon (HejEgonBot)
**Date:** 2026-03-27
**Status:** Draft — for Simon's review

---

## Problem

Early pipeline tasks (particularly `PremiseAttackTask` and `RedlineGateTask`) run a single model with a single system prompt. The quality of these early outputs determines everything downstream — a weak premise attack or an incorrect redline verdict propagates through the entire plan.

The current sequential fallback in `LLMExecutor` (try model A, on failure try model B) only handles errors, not quality. There is no mechanism to evaluate whether a successful response was actually good, or to improve it.

---

## Proposed Design: Three-Stage Judge-Refinement Loop

### Stage 1: Parallel Candidate Generation
Run N non-reasoning models simultaneously on the same task.

- Each model produces a full response independently
- Candidate models are cheap and fast — running 3–5 in parallel costs little more than running one
- Implemented via Luigi's parallel task scheduling (the existing `--workers` parameter controls concurrency)
- For `PremiseAttackTask`: each of the 5 lenses could run on a different model
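
Stage 1 can be sketched with a plain thread pool. This is an illustrative stand-in only: `candidate_models` and `run_task` are hypothetical names, and the real pipeline would drive concurrency through Luigi's scheduler rather than a raw executor.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_candidates(candidate_models, run_task, max_workers=5):
    """Run the same task on every candidate model in parallel.

    candidate_models: list of model handles (hypothetical)
    run_task: callable that executes the task on one model (hypothetical)
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_task, m) for m in candidate_models]
        # Collect in submission order so scores can be matched back to models.
        return [f.result() for f in futures]
```

The key property is that one slow or failing candidate does not block reading the others until its own `result()` call, and order is preserved for the judge.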

### Stage 2: Reasoning Model Judgment
A single reasoning model (e.g. `claude-sonnet-4-6-thinking`, `o3`) evaluates all N candidates and produces:

- A short score per candidate
- A brief hint identifying what's missing or weak in each response
- An overall quality verdict: PASS / RETRY

Reasoning models are expensive, so the judgment output should be constrained — just scores and gap hints, not rewritten full responses.
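
One possible shape for the judge's constrained output, sketched with stdlib dataclasses. All field names here are assumptions; whether this becomes a Pydantic schema at all is deliberately left to open question 3 below.

```python
from dataclasses import dataclass, field

@dataclass
class CandidateScore:
    index: int     # which candidate this score refers to
    score: float   # 0.0 to 1.0, higher is better
    gap_hint: str  # short note on what is missing or weak

@dataclass
class JudgmentResult:
    scores: list[CandidateScore] = field(default_factory=list)
    verdict: str = "PASS"  # PASS or RETRY

    def best(self) -> CandidateScore:
        # Highest-scoring candidate wins ties arbitrarily.
        return max(self.scores, key=lambda s: s.score)
```

Keeping the schema to one float and one short string per candidate is what caps the reasoning model's output tokens.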

### Stage 3: Conditional Retry
If the best score from Stage 2 falls below a threshold:

- Re-run the non-reasoning models with the gap hint injected into the prompt
- The `validation_feedback` mechanism in `LLMExecutor` already handles this pattern for schema errors — this extends it to quality-based retries

If scores pass the threshold, the best candidate proceeds downstream.
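
The hint injection could loosely mirror how `validation_feedback` appends schema errors to the prompt. A minimal sketch, assuming a plain string prompt; the function name and message wording are illustrative, not the real API:

```python
def inject_gap_hint(system_prompt: str, gap_hint: str) -> str:
    """Append the judge's gap hint to the original prompt for a retry."""
    if not gap_hint:
        return system_prompt  # nothing to add, reuse the prompt unchanged
    return (
        f"{system_prompt}\n\n"
        f"A previous attempt was judged insufficient. Address this gap:\n"
        f"{gap_hint}"
    )
```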

---

## Where to Apply It

Early pipeline tasks where quality has the highest leverage:

| Task | Why it matters |
|------|----------------|
| `PremiseAttackTask` | 5 independent lenses, already structured for parallelism |
| `RedlineGateTask` | Gate failure stops the entire pipeline; false positives are the core diagnostic problem |
| `ProjectPlanTask` | Core decomposition — everything downstream builds on this |

Lower-priority tasks (WBS level 2/3, team enrichment, governance phases) don't need this — their outputs are less foundational.

---

## Implementation Sketch

### New config fields in `llm_config`

```json
{
  "openrouter-gemini-2.0-flash": {
    "priority": 1,
    "role": "candidate"
  },
  "openrouter-mixtral-8x22b": {
    "priority": 2,
    "role": "candidate"
  },
  "anthropic-claude-sonnet-4-6-thinking": {
    "priority": 1,
    "role": "judge"
  }
}
```

A `role` field distinguishes candidate models (cheap, parallel) from judge models (expensive, sequential).
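
A hypothetical helper shows how the executor could partition such a config by the proposed `role` field, ordered by the existing `priority` field (the function name is illustrative, not part of the current codebase):

```python
def split_by_role(llm_config: dict) -> tuple[list[str], list[str]]:
    """Return (candidate_models, judge_models), each sorted by priority."""
    def ordered(role: str) -> list[str]:
        entries = [
            (v.get("priority", 0), k)
            for k, v in llm_config.items()
            if v.get("role") == role
        ]
        return [k for _, k in sorted(entries)]
    return ordered("candidate"), ordered("judge")
```

Models without a `role` field fall into neither list, which is one way to keep existing configs backward-compatible.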

### New `LLMExecutor` method

```python
from typing import Any, Callable

def run_with_judge(
    self,
    execute_function: Callable[[LLM], Any],
    judge_function: Callable[[LLM, list[Any]], JudgmentResult],
    pass_threshold: float = 0.7,
    max_retries: int = 1,
) -> Any:
    """
    Run candidate models in parallel, judge the results, and retry
    with the judge's gap hints if the best score falls below
    pass_threshold. Returns the best candidate's response.
    """
```
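
The control flow behind that signature can be sketched standalone. Everything here is an illustrative stand-in for the real `LLMExecutor` wiring: `generate` produces all candidates for a given hint, and `judge` returns a `(best_index, best_score, gap_hint)` tuple.

```python
def judge_refine_loop(generate, judge, pass_threshold=0.7, max_retries=1):
    """generate(hint) -> list of candidates; judge(candidates) -> (idx, score, hint)."""
    hint = ""
    best = None
    for _ in range(max_retries + 1):
        candidates = generate(hint)
        best_index, best_score, hint = judge(candidates)
        best = candidates[best_index]
        if best_score >= pass_threshold:
            break  # good enough, stop retrying
    return best
```

Note the loop always returns the best candidate seen on the final attempt, even if it never crosses the threshold, so a weak-but-usable response still flows downstream after `max_retries`.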

### Luigi task decomposition for `PremiseAttackTask`

```
PremiseAttackTask
├── requires: [PremiseAttackLensTask(lens_index=0, model=candidate_models[0]), ...]
│   └── 5 lens tasks run in parallel up to --workers limit
└── run_inner: collect lens outputs, run judge, retry if needed
```

---

## Cost Model

| Scenario | API calls | Cost estimate |
|----------|-----------|---------------|
| Current (1 model, 1 system prompt) | 1 | baseline |
| Stage 1 only (3 candidates, no judge) | 3 | ~3x |
| Full loop (3 candidates + judge, no retry) | 4 | ~4x + judge overhead |
| Full loop with 1 retry | 7 | ~7x |

For local/Ollama setups: `role: "candidate"` models run sequentially (workers=1), and the judge step is skipped if no judge model is configured. Backward-compatible.
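
The table's call counts reduce to a small formula. This assumes, as the table implies, that the judge runs once and a retry re-runs only the candidate models; re-judging the retried candidates would add one more call per retry.

```python
def total_calls(candidates: int, judge: bool, retries: int = 0) -> int:
    """API calls for one task: candidates re-run per retry, judge runs once."""
    return candidates * (1 + retries) + (1 if judge else 0)
```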

---

## What This Is Not

- Not a jailbreak mechanism
- Not a refusal-bypass layer
- Not a replacement for the existing sequential fallback (that stays for error handling)

This is a **quality improvement loop** for tasks where the output quality directly determines the value of everything downstream.

---

## Open Questions for Simon

1. Should `role` be a config field per model, or a separate `judge_model` key at the config root?
2. What's the right `pass_threshold` — hard-coded, or configurable per task?
3. Should the judge produce a structured score (Pydantic schema) or free-form text hints?
4. Is `PremiseAttackTask` the right first implementation target, or `RedlineGateTask`?

---

## References

- `worker_plan_internal/llm_util/llm_executor.py` — `max_validation_retries` pattern (lines ~130–160)
- `worker_plan_internal/diagnostics/premise_attack.py` — 5 independent sequential lenses
- `worker_plan_internal/diagnostics/redline_gate.py` — IDEA: ensemble comment
- PR #393 — previous parallel racing proposal (merged)