Commit 093bc1d — Merge pull request #395 from VoynichLabs/proposal/ensemble-judge-refinement
Proposal: ensemble + judge-refinement loop for early pipeline tasks
2 parents 6936547 + e1cb242

1 file changed: 146 additions, 0 deletions

# Proposal: Ensemble + Judge-Refinement Loop for Early Pipeline Tasks

**Author:** Egon (HejEgonBot)
**Date:** 2026-03-27
**Status:** Draft — for Simon's review

---

## Problem

Early pipeline tasks (particularly `PremiseAttackTask` and `RedlineGateTask`) run a single model with a single system prompt. The quality of these early outputs determines everything downstream — a weak premise attack or incorrect redline verdict propagates through the entire plan.

The current sequential fallback in `LLMExecutor` (try model A, on failure try model B) only handles errors, not quality. There's no mechanism to evaluate whether a successful response was actually good, or to improve it.

---

## Proposed Design: Three-Stage Judge-Refinement Loop

### Stage 1: Parallel Candidate Generation

Run N non-reasoning models simultaneously on the same task.

- Each model produces a full response independently
- Models are cheap and fast — running 3–5 in parallel costs little more than running 1
- Implemented via Luigi parallel task scheduling (existing `--workers` parameter controls concurrency)
- For `PremiseAttackTask`: each of the 5 lenses could run on a different model

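Setting Luigi scheduling aside for a moment, the fan-out itself can be sketched with `concurrent.futures`. This is a sketch only: the model callables below are stand-ins for real model clients, not `LLMExecutor` API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def generate_candidates(
    models: list[Callable[[str], Any]],
    prompt: str,
) -> list[Any]:
    """Run every candidate model on the same prompt in parallel.

    Each callable stands in for one non-reasoning model. API-bound
    calls are I/O-heavy, so threads are enough for the fan-out.
    Results come back in the same order as `models`.
    """
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(model, prompt) for model in models]
        return [f.result() for f in futures]


# Stand-in "models" for illustration only.
models = [
    lambda p: f"gemini: {p}",
    lambda p: f"mixtral: {p}",
    lambda p: f"llama: {p}",
]
candidates = generate_candidates(models, "attack the premise")
```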
### Stage 2: Reasoning Model Judgment

A single reasoning model (e.g. `claude-sonnet-4-6-thinking`, `o3`) evaluates all N candidates and produces:

- A short score per candidate (not full responses — reasoning models are expensive, keep output minimal)
- A brief hint identifying what's missing or weak in each response
- An overall quality verdict: PASS / RETRY

Reasoning models are expensive, so the judgment output should be constrained — just scores and gap hints, not rewritten responses.

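One possible shape for the judgment output, as a plain-dataclass sketch. The field names here are suggestions, not existing code; whether this becomes a Pydantic schema instead is open question 3 below.

```python
from dataclasses import dataclass


@dataclass
class CandidateScore:
    index: int      # which candidate this score refers to
    score: float    # 0.0-1.0 quality estimate from the judge
    gap_hint: str   # short note on what is missing or weak


@dataclass
class JudgmentResult:
    scores: list[CandidateScore]
    verdict: str    # "PASS" or "RETRY"

    def best(self) -> CandidateScore:
        """The highest-scoring candidate, used to pick the winner."""
        return max(self.scores, key=lambda s: s.score)
```

Keeping the judge's output this small is what makes the expensive reasoning call affordable.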
### Stage 3: Conditional Retry

If the best score from Stage 2 falls below a threshold:

- Re-run the non-reasoning models with the gap hint injected into the prompt
- The `validation_feedback` mechanism in `LLMExecutor` already handles this pattern for schema errors — this extends it to quality-based retries

If scores pass the threshold, the best candidate proceeds downstream.

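Put together, the three stages form a loop along these lines. All names here are hypothetical; the candidate and judge callables are stand-ins for the real task functions.

```python
def judge_refine_loop(candidates_fn, judge_fn, pass_threshold=0.7, max_retries=1):
    """Generate, judge, and conditionally retry with the judge's gap hint.

    candidates_fn(hint) -> list of responses (hint is None on the first pass)
    judge_fn(responses) -> (best_index, best_score, gap_hint)
    """
    hint = None
    for _ in range(max_retries + 1):
        responses = candidates_fn(hint)
        best_index, best_score, hint = judge_fn(responses)
        if best_score >= pass_threshold:
            break  # good enough: stop retrying
    return responses[best_index]
```

The hint injection mirrors how `validation_feedback` feeds schema errors back into the retry prompt.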
---

## Where to Apply It

Early pipeline tasks where quality has the highest leverage:

| Task | Why it matters |
|------|----------------|
| `PremiseAttackTask` | 5 independent lenses, already structured for parallelism |
| `RedlineGateTask` | Gate failure stops the entire pipeline; false positives are the core diagnostic problem |
| `ProjectPlanTask` | Core decomposition — everything downstream builds on this |

Lower-priority tasks (WBS level 2/3, team enrichment, governance phases) don't need this — their outputs are less foundational.

---

## Implementation Sketch

### New config fields in `llm_config`

```json
{
  "openrouter-gemini-2.0-flash": {
    "priority": 1,
    "role": "candidate"
  },
  "openrouter-mixtral-8x22b": {
    "priority": 2,
    "role": "candidate"
  },
  "anthropic-claude-sonnet-4-6-thinking": {
    "priority": 1,
    "role": "judge"
  }
}
```

A `role` field distinguishes candidate models (cheap, parallel) from judge models (expensive, sequential).

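Splitting such a config into the two pools is a small helper. This is a sketch against the JSON shape above, with one assumption labeled explicitly: entries without a `role` field default to `"candidate"` so existing configs keep working.

```python
def split_by_role(llm_config: dict) -> tuple[list[str], list[str]]:
    """Return (candidate_ids, judge_ids), each ordered by priority.

    Assumption: a missing "role" field means "candidate", so configs
    written before this proposal behave exactly as they do today.
    """
    candidates, judges = [], []
    for model_id, cfg in llm_config.items():
        role = cfg.get("role", "candidate")
        bucket = judges if role == "judge" else candidates
        bucket.append((cfg.get("priority", 1), model_id))
    return (
        [m for _, m in sorted(candidates)],
        [m for _, m in sorted(judges)],
    )
```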
### New `LLMExecutor` method

```python
def run_with_judge(
    self,
    execute_function: Callable[[LLM], Any],
    judge_function: Callable[[LLM, list[Any]], JudgmentResult],
    pass_threshold: float = 0.7,
    max_retries: int = 1
) -> Any:
    """
    Run candidate models in parallel, judge results, retry if below threshold.
    """
```

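For illustration, here is a standalone sketch of what such a body might do. This is not the real `LLMExecutor` internals: the candidate and judge models are passed in explicitly, and the judge returns a plain `(best_index, best_score)` tuple rather than a full `JudgmentResult`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_with_judge(
    candidate_llms: list[Any],
    judge_llm: Any,
    execute_function: Callable[[Any], Any],
    judge_function: Callable[[Any, list[Any]], tuple[int, float]],
    pass_threshold: float = 0.7,
    max_retries: int = 1,
) -> Any:
    """Run candidates in parallel, judge the pile, retry if below threshold."""
    for attempt in range(max_retries + 1):
        # Stage 1: fan out one call per candidate model.
        with ThreadPoolExecutor(max_workers=len(candidate_llms)) as pool:
            results = list(pool.map(execute_function, candidate_llms))
        # Stage 2: one judge call scores all candidates at once.
        best_index, best_score = judge_function(judge_llm, results)
        # Stage 3: accept if good enough, or if retries are exhausted.
        if best_score >= pass_threshold or attempt == max_retries:
            return results[best_index]
```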
### Luigi task decomposition for `PremiseAttackTask`

```
PremiseAttackTask
├── requires: [PremiseAttackLensTask(lens_index=0, model=candidate_models[0]), ...]
│   └── 5 lens tasks run in parallel up to --workers limit
└── run_inner: collect lens outputs, run judge, retry if needed
```

---

## Cost Model

| Scenario | API calls | Cost estimate |
|----------|-----------|---------------|
| Current (1 model, 1 system prompt) | 1 | baseline |
| Stage 1 only (3 candidates, no judge) | 3 | ~3x |
| Full loop (3 candidates + judge, no retry) | 4 | ~4x + judge overhead |
| Full loop with 1 retry | 7 | ~7x |

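The call counts follow from a small formula, under one reading of the table: candidates run once per round, and the judge scores only the first round, with the retry output accepted as-is (that matches the 7-call row; re-judging the retry would cost 8).

```python
def api_calls(candidates: int, judge: bool, retries: int = 0) -> int:
    """Total API calls for one task run.

    Candidates run in every round (initial plus retries); the judge
    contributes a single call, matching the cost table's 7-call row.
    """
    return candidates * (1 + retries) + (1 if judge else 0)
```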
For local/Ollama setups, `role: "candidate"` models run sequentially (`--workers` set to 1), and the judge step is skipped when no judge model is configured, so the change is backward-compatible.

---

## What This Is Not

- Not a jailbreak mechanism
- Not a refusal-bypass layer
- Not a replacement for the existing sequential fallback (that stays for error handling)

This is a **quality improvement loop** for tasks where the output quality directly determines the value of everything downstream.

---

## Open Questions for Simon

1. Should `role` be a config field per model, or a separate `judge_model` key at the config root?
2. What's the right `pass_threshold` — hard-coded, or configurable per task?
3. Should the judge produce a structured score (Pydantic schema) or free-form text hints?
4. Is `PremiseAttackTask` the right first implementation target, or `RedlineGateTask`?

---

## References

- `worker_plan_internal/llm_util/llm_executor.py` — `max_validation_retries` pattern (lines ~130–160)
- `worker_plan_internal/diagnostics/premise_attack.py` — 5 independent sequential lenses
- `worker_plan_internal/diagnostics/redline_gate.py` — the "IDEA: ensemble" comment
- PR #393 — previous parallel racing proposal (merged)
