A test bench to measure the quality of the enrichment system. Distinct from the mechanical validator (a per-run integrity gate) — this measures whether summaries, topic assignments and overviews are good, so configurations, rubrics and executors can be compared and regressions caught.
Scope
- Layered evaluation: item summaries, topic assignment, topic-page overviews.
- A gold set produced by a frontier model + an LLM-as-judge scorer.
- An action loop that feeds eval findings back into the declarative rubrics.
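A minimal sketch of how the first two scope items could fit together, assuming a gold set of reference outputs per layer and a judge that returns a score in [0, 1]. Every name here (`GoldExample`, `evaluate`, `regressions`, `exact_match_judge`, the layer labels, the 0.05 tolerance) is hypothetical and not taken from the design; the exact-match judge is only a stand-in for an LLM-as-judge call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical labels for the three evaluated layers.
LAYERS = ("item_summary", "topic_assignment", "topic_overview")


@dataclass(frozen=True)
class GoldExample:
    layer: str      # one of LAYERS
    source: str     # input the enrichment system saw
    reference: str  # gold output, e.g. produced by a frontier model


# A judge maps (source, reference, candidate) to a score in [0, 1].
Judge = Callable[[str, str, str], float]


def evaluate(gold: List[GoldExample], candidates: List[str], judge: Judge) -> Dict[str, float]:
    """Return the mean judge score per layer, skipping layers with no examples."""
    scores: Dict[str, List[float]] = {layer: [] for layer in LAYERS}
    for example, candidate in zip(gold, candidates):
        scores[example.layer].append(judge(example.source, example.reference, candidate))
    return {layer: mean(vals) for layer, vals in scores.items() if vals}


def regressions(baseline: Dict[str, float], current: Dict[str, float],
                tolerance: float = 0.05) -> List[str]:
    """List layers whose score fell more than `tolerance` below the baseline."""
    return [layer for layer in baseline
            if layer in current and current[layer] < baseline[layer] - tolerance]


def exact_match_judge(source: str, reference: str, candidate: str) -> float:
    """Placeholder judge; a real scorer would call a model with a rubric."""
    return 1.0 if candidate.strip() == reference.strip() else 0.0
```

Running `evaluate` once per configuration and passing the two result dictionaries to `regressions` gives a per-layer comparison. The action loop that feeds findings back into the rubrics is left out of this sketch.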
Status
Designed, not yet planned. To be built now that the WS2 enrichment layers it depends on exist.