Evaluation Stack

HelpmateAI now uses a layered evaluation stack so retrieval changes can be judged from more than one angle.

Current Evaluation Layers

  • custom retrieval benchmark:
    • page-hit rate
    • mean reciprocal rank
    • structure-aware planner/topology metrics
  • custom negative benchmark:
    • abstention rate
  • external hosted baselines:
    • Vectara retrieval snippet match rate
    • OpenAI file-search snippet match rate
  • open-source quality benchmark:
    • ragas answer faithfulness
    • ragas answer relevancy
    • ragas no-reference context precision
  • shared-answer quality benchmark:
    • ragas scoring on top of OpenAI retrieval contexts
    • ragas scoring on top of Vectara retrieval contexts
  • frozen final-eval suite:
    • held-out manifest validation
    • equalized context budgets
    • per-intent answer-quality reporting
    • abstention-aware vendor comparison

Why This Matters

These layers answer different questions:

  • page-hit rate tells us whether retrieval is landing on the right document region (the first three metrics are sketched in code after this list)
  • MRR tells us how high the right evidence is ranked
  • abstention rate tells us whether unsupported questions are rejected honestly
  • Vectara retrieval gives us the strongest current external managed-retrieval baseline
  • OpenAI file search gives us a historical convenience baseline
  • ragas gives us answer-quality signals on top of retrieval, especially for broad or narrative questions

This is important because a system can retrieve the right page while still giving a vague or weak answer. That is exactly why the ragas layer is now the main answer-quality signal.
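
The first three metrics can be computed directly from labeled retrieval results. A minimal sketch, assuming a hypothetical per-question record layout (the actual dataset schema lives in docs/evals):

from dataclasses import dataclass

@dataclass
class RetrievalResult:
    labeled_page: int            # page the answer evidence lives on
    retrieved_pages: list[int]   # pages of the returned chunks, best-ranked first
    abstained: bool = False      # only meaningful for negative (unanswerable) questions

def page_hit_rate(results: list[RetrievalResult]) -> float:
    # fraction of questions where the labeled page appears anywhere in the returned set
    return sum(r.labeled_page in r.retrieved_pages for r in results) / len(results)

def mean_reciprocal_rank(results: list[RetrievalResult]) -> float:
    # 1 / rank of the first hit, 0 when the labeled page is never retrieved
    total = 0.0
    for r in results:
        hit_ranks = [i + 1 for i, page in enumerate(r.retrieved_pages) if page == r.labeled_page]
        total += 1.0 / hit_ranks[0] if hit_ranks else 0.0
    return total / len(results)

def abstention_rate(negatives: list[RetrievalResult]) -> float:
    # fraction of unsupported questions the system refuses to answer
    return sum(r.abstained for r in negatives) / len(negatives)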

Current Summary Table

As of the stabilized 2026-04-19 snapshot, the repo treats the tables below (retrieval-level and answer-quality comparisons) as the current reference view of the benchmark stack.

Methodology And Caveats

The current benchmark is useful, but it should be read precisely.

  • The main four-document suite contains documents and question families used during HelpmateAI development, so it is a tuned workload for HelpmateAI and a less tuned workload for external vendors.
  • Vendor answer-quality rows are generated with the shared Helpmate answer generator on top of vendor retrieval contexts. This makes retrieval context quality easier to compare, but it does not evaluate each vendor's full native answer product.
  • OpenAI File Search is queried with rewrite_query=True and max_num_results=5 (these vendor settings are consolidated in the sketch after this list).
  • Historical Vectara rows before the final-eval scaffold used limit=5.
  • Final-eval Vectara rows use the hybrid_rerank profile: initial limit=20, lexical_interpolation=0.025, and Rerank_Multilingual_v1 with returned limit=5.
  • Vendor snippets are truncated to 400 characters before answer generation and ragas scoring.
  • HelpmateAI uses its own final evidence bundle, currently final_top_k=4, after planning, fusion, reranking, and optional reorder-only evidence selection.
  • ragas uses the configured OpenAI-backed evaluator and no-reference metrics because these datasets are retrieval-labeled rather than gold-answer datasets.
  • Faithfulness can be affected by abstention behavior. A system that refuses unsupported questions may score differently from a system that attempts every answer.
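
For reference, the retrieval settings listed above can be summarized in a single config sketch. The key names below are illustrative only, not the literal fields used by the benchmark scripts:

# Illustrative consolidation of the caveats above; key names are not the
# exact fields used by the benchmark scripts.
VENDOR_RETRIEVAL_PROFILES = {
    "openai_file_search": {
        "rewrite_query": True,
        "max_num_results": 5,
    },
    "vectara_historical": {            # rows before the final-eval scaffold
        "limit": 5,
    },
    "vectara_hybrid_rerank": {         # final-eval rows
        "initial_limit": 20,
        "lexical_interpolation": 0.025,
        "reranker": "Rerank_Multilingual_v1",
        "returned_limit": 5,
    },
    "shared_post_processing": {
        "snippet_truncation_chars": 400,   # applied before answer generation and ragas scoring
    },
    "helpmate": {
        "final_top_k": 4,                  # after planning, fusion, reranking, optional reorder-only selection
    },
}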

Therefore, the benchmark supports a narrow claim: HelpmateAI outperforms the tested vendor retrieval configurations on this project workload. It is not yet a broad claim that HelpmateAI beats every tuned Vectara or OpenAI deployment on arbitrary documents.

The next stronger protocol is:

  • use never-tuned documents and fresh questions
  • equalize context budgets across systems
  • report attempted-only faithfulness separately from all-query faithfulness (see the sketch after this list)
  • report abstention/support rates beside answer-quality metrics
  • break scores down by intent type
  • rerun with a second judge model family
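
The attempted-only versus all-query split is mostly bookkeeping. A minimal sketch, with hypothetical record fields rather than the actual report schema:

from dataclasses import dataclass

@dataclass
class JudgedAnswer:
    attempted: bool        # False when the system abstained
    faithfulness: float    # ragas faithfulness; treated as 0.0 for abstentions in the all-query view

def faithfulness_report(answers: list[JudgedAnswer]) -> dict[str, float]:
    attempted = [a for a in answers if a.attempted]
    return {
        # averaged only over questions the system actually tried to answer
        "attempted_only_faithfulness": (
            sum(a.faithfulness for a in attempted) / len(attempted) if attempted else 0.0
        ),
        # averaged over every query, so heavy abstention drags the score down
        "all_query_faithfulness": sum(a.faithfulness for a in answers) / len(answers),
        # reported beside answer quality so abstention stays visible
        "attempt_rate": len(attempted) / len(answers),
    }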

The scaffold for that protocol is now in:

  • docs/evals/final_eval_protocol.md
  • docs/evals/final_eval_manifest.example.json
  • docs/evals/final_eval_sources_20260428.md
  • docs/evals/final_eval_question_authoring_prompt.md
  • docs/evals/download_final_eval_docs.ps1
  • docs/evals/financebench_protocol.md
  • src/evals/final_eval_suite.py

The final-eval runner is intentionally separate from the current project benchmark scripts. The project benchmark remains the regression suite; the final-eval suite is the blind, auditable comparison path.

Held-out PDFs are stored locally under static/held_out/ and ignored by git. This keeps the repo small while preserving reproducibility through source URLs and hashes.

FinanceBench assets are also local-only and reproducible. Use src/evals/financebench_eval.py to convert the open-source 150-question sample into the same manifest format as the final-eval runner.
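
Following the module invocation pattern used by the other eval scripts in this repo (any required flags are defined by the script itself):

uv run python -m src.evals.financebench_eval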

Retrieval-Level Comparison

| Document | Ours (page-hit / MRR) | Vectara retrieval (snippet match) | OpenAI retrieval (snippet match) |
| --- | --- | --- | --- |
| Health policy | 0.8462 / 0.7051 | 0.7692 | 0.6923 |
| Thesis | 0.9167 / 0.5764 | 0.6667 | 0.6667 |
| pancreas7 | 0.7778 / 0.6111 | 0.5556 | 0.3333 |
| pancreas8 | 1.0000 / 0.8833 | 0.8000 | 0.4000 |

Answer-Quality Comparison

These scores use either our own pipeline answers or a shared answer model on top of vendor retrieval contexts.

| Document | System | ragas faithfulness | ragas answer relevancy | ragas context precision |
| --- | --- | --- | --- | --- |
| Health policy | Ours | 0.8846 | 0.6378 | 0.8825 |
| Health policy | Vectara retrieval + shared answer model | 0.7692 | 0.4504 | 0.8235 |
| Health policy | OpenAI retrieval + shared answer model | 0.6154 | 0.1357 | 0.4970 |
| Thesis | Ours | 1.0000 | 0.6031 | 0.8588 |
| Thesis | Vectara retrieval + shared answer model | 0.8750 | 0.5579 | 0.8035 |
| Thesis | OpenAI retrieval + shared answer model | 0.3702 | 0.2944 | 0.6024 |
| pancreas7 | Ours | 0.9444 | 0.6499 | 1.0000 |
| pancreas7 | Vectara retrieval + shared answer model | 0.6111 | 0.5009 | 0.7350 |
| pancreas7 | OpenAI retrieval + shared answer model | 0.5556 | 0.2514 | 0.6210 |
| pancreas8 | Ours | 0.9250 | 0.5527 | 0.9000 |
| pancreas8 | Vectara retrieval + shared answer model | 0.7000 | 0.3941 | 0.6700 |
| pancreas8 | OpenAI retrieval + shared answer model | 0.4000 | 0.1535 | 0.4422 |

Average current margin:

  • versus Vectara: +0.1997 faithfulness, +0.1350 answer relevancy, +0.1523 context precision
  • versus OpenAI File Search: +0.4532 faithfulness, +0.4021 answer relevancy, +0.3697 context precision
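
These margins are simply the difference of per-document means from the answer-quality table above. A quick check of the faithfulness margin versus Vectara:

# Per-document faithfulness values from the answer-quality table above.
ours = [0.8846, 1.0000, 0.9444, 0.9250]
vectara = [0.7692, 0.8750, 0.6111, 0.7000]

margin = sum(ours) / len(ours) - sum(vectara) / len(vectara)
print(round(margin, 4))  # 0.1997, matching the reported faithfulness margin versus Vectara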

Structure-Aware Metrics

The current local retrieval stack also emits planner/topology metrics:

  • section_hit_rate
  • region_hit_rate
  • plan_accuracy
  • global_fallback_recovery_rate
  • multi_region_recall

These are diagnostic metrics for the local architecture, not vendor-comparison metrics.

Lean Smart-Indexing Upgrade Check

The 2026-04-27 smart-indexing/orchestrator branch added a smaller targeted ragas suite to test the newest failure modes without rerunning the full vendor benchmark every time.

Coverage:

  • thesis local chapter scope
  • thesis broad synthesis
  • policy claims/reimbursement
  • held-out life-policy limits
  • report-generation main contribution
  • pancreas review broad synthesis

Current branch result on the six-case suite:

| Supported rate | Faithfulness | Answer relevancy | Context precision |
| --- | --- | --- | --- |
| 1.0000 | 0.9050 | 0.6034 | 0.7500 |

Regression against main used the five shared cases available to both branches:

| System | Supported rate | Faithfulness | Answer relevancy | Context precision |
| --- | --- | --- | --- | --- |
| main before smart indexing/orchestration | 1.0000 | 0.8750 | 0.4964 | 0.7000 |
| Smart-indexing branch | 1.0000 | 0.8860 | 0.5930 | 0.8000 |
| Delta | +0.0000 | +0.0110 | +0.0966 | +0.1000 |

Lean vendor comparison on the same six cases:

| System | Supported rate | Faithfulness | Answer relevancy | Context precision |
| --- | --- | --- | --- | --- |
| Helpmate smart-indexing branch | 1.0000 | 0.9050 | 0.6034 | 0.7500 |
| OpenAI File Search + shared answer model | 0.6667 | 0.9667 | 0.2676 | 0.6028 |
| Vectara + shared answer model | 0.6667 | 0.7639 | 0.3690 | 0.5556 |

Interpretation:

  • this lean suite is a targeted upgrade/regression check, not a replacement for the stabilized four-document benchmark snapshot
  • on main, the implementation-chapter thesis question drifted into conclusion/methodology/results pages; on the smart-indexing branch it stays inside the implementation chapter
  • OpenAI's high faithfulness in this lean run partly reflects abstaining or giving limited answers on weak retrieval contexts
  • Helpmate led the lean suite on supported rate, answer relevancy, and context precision

Reports:

  • docs/evals/reports/lean_ragas_upgrade_20260427_185007.json
  • docs/evals/reports/lean_ragas_main_baseline_20260427_185723.json
  • docs/evals/reports/lean_ragas_upgrade_regression_compare_20260427_1859.json
  • docs/evals/reports/lean_vendor_ragas_upgrade_comparison_20260427_191421.json
  • docs/evals/reports/lean_ragas_ours_vs_vendor_comparison_20260427_1915.json

New Report-Generation Eval Sets

Two additional journal-paper eval sets are now included:

  • reportgeneration
  • reportgeneration2

Current local retrieval snapshot on those datasets:

| Document | Ours (page-hit / MRR) | Negative abstention rate |
| --- | --- | --- |
| reportgeneration | 0.9000 / 0.8500 | 1.0000 |
| reportgeneration2 | 1.0000 / 0.8333 | 1.0000 |

Interpretation:

  • retrieval on these papers is already healthy
  • the remaining challenge is the broadest paper-summary phrasing, not raw evidence discovery
  • that is why the latest architecture work focused on a dedicated global-summary evidence route instead of another retrieval rewrite

Evidence-Selector Calibration

The evidence selector combines:

  • a retrieval prior from fused retrieval ranking
  • an LLM score over the shortlisted candidates

Those blend weights are treated as hyperparameters rather than fixed intuition.
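
A minimal sketch of the blend, assuming min-max-normalized inputs and the hypothetical names rank_prior and llm_score (the real selector lives in the retrieval pipeline code):

def blended_score(rank_prior: float, llm_score: float,
                  rank_weight: float = 0.25, llm_weight: float = 0.75) -> float:
    # Both inputs are assumed normalized to [0, 1]; the two weights are the
    # hyperparameters swept by evidence_selector_weight_sweep.
    return rank_weight * rank_prior + llm_weight * llm_score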

The repo now includes an offline sweep script:

uv run python -m src.evals.evidence_selector_weight_sweep --step 0.01

That sweep caches the selector model's candidate scores once, then replays the same questions offline across different rank/LLM mixes using the labeled retrieval datasets already in this repo.

Current tuning snapshot:

  • benchmark size: 76 labeled questions across health policy, thesis, pancreas, and report-generation datasets
  • objective: 0.45 * page_hit_rate + 0.35 * fragment_recall + 0.20 * MRR (see the sketch after this list)
  • best-performing plateau: rank_weight from about 0.03 to 0.37
  • chosen default: rank_weight=0.25, llm_weight=0.75
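
The sweep objective from the snapshot above, written as a standalone function (metric names follow the bullets above):

def sweep_objective(page_hit_rate: float, fragment_recall: float, mrr: float) -> float:
    # Weighted blend used to rank candidate rank/LLM mixes: 0.45 / 0.35 / 0.20.
    return 0.45 * page_hit_rate + 0.35 * fragment_recall + 0.20 * mrr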

Why that default was chosen:

  • it sits inside the top-performing plateau instead of at a fragile edge
  • it keeps the LLM score as the dominant signal when the selector is invoked
  • it still preserves a meaningful retrieval prior for tie-breaking and stability
  • it outperforms the older 0.65 / 0.35 hand-set blend on fragment recall while keeping the same page-hit and MRR on the current benchmark

Current production selector policy:

  • reorder-only, not prune mode
  • spread-only trigger policy
  • weak_evidence=false
  • ambiguity=false

Why:

  • the selector's original regression came from pruning away support, not from reordering itself
  • spread-only gives the best current production tradeoff between answer quality and activation frequency
  • always-on remains a useful reference mode, but not the default shipping policy

Support Guardrail Eval

The weak/unsupported retrieval thresholds are now checked separately from final answer support.

Run the support guardrail eval with:

uv run python -m src.evals.support_guardrail_eval

The eval covers:

  • labeled calibration positives and negatives from docs/evals
  • held-out manual questions over static/sample_files/test
  • retrieval status distribution
  • final answer supported/abstained behavior

Current report:

  • docs/evals/reports/support_guardrail_eval_20260427_032609.json

Current result:

  • calibration positive supported rate: 0.9079
  • calibration negative abstention rate: 1.0000
  • calibration false support rate: 0.0000
  • held-out answer supported rate: 1.0000
  • held-out unsupported retrieval rate: 0.0000
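
The rates above follow directly from per-question supported/abstained flags. A sketch using a hypothetical record layout rather than the actual report schema:

from dataclasses import dataclass

@dataclass
class GuardrailCase:
    answerable: bool   # True for calibration positives, False for negatives
    supported: bool    # a final answer was produced and judged grounded
    abstained: bool    # the system refused to answer

def guardrail_rates(cases: list[GuardrailCase]) -> dict[str, float]:
    positives = [c for c in cases if c.answerable]
    negatives = [c for c in cases if not c.answerable]
    return {
        "positive_supported_rate": sum(c.supported for c in positives) / len(positives),
        "negative_abstention_rate": sum(c.abstained for c in negatives) / len(negatives),
        # a negative that still receives a "supported" answer counts as false support
        "false_support_rate": sum(c.supported for c in negatives) / len(negatives),
    }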

Decision from this run:

  • keep weak/unsupported thresholds unchanged
  • use answer-layer support verification as the final safety layer
  • allow partial but grounded answers when retrieved evidence answers only part of a broad question

Baseline Policy

Going forward, Vectara should be treated as the primary external managed-retrieval benchmark.

Reason:

  • it is consistently stronger than OpenAI File Search on the current document families
  • it is the more demanding and useful comparison point for Helpmate now

OpenAI File Search should remain in the repo as a historical/reference benchmark, but it no longer needs to be the default benchmark we optimize against.

Current Command Surfaces

  • run the existing benchmark comparison:
uv run python -m src.evals.compare_benchmarks
  • run a document-specific comparison from Python:
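# Build a comparison report for one document using its labeled positive and negative eval datasets.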
from pathlib import Path
from src.evals.compare_benchmarks import compare

root = Path(".").resolve()
report = compare(
    document_path=root / "static" / "sample_files" / "pancreas8.pdf",
    positive_dataset_path=root / "docs" / "evals" / "pancreas8_retrieval_eval_dataset.json",
    negative_dataset_path=root / "docs" / "evals" / "pancreas8_negative_eval_dataset.json",
)

Reports are saved under docs/evals/reports/.

Current Limitations

  • ragas currently uses OpenAI-backed evaluation models, so it is still not a zero-cost eval layer
  • the current ragas bridge uses no-reference metrics because our datasets are retrieval-labeled, not gold-answer datasets
  • Vectara retrieval is the primary external baseline, while OpenAI remains a historical/reference baseline
  • Vectara factual-consistency is no longer part of the active decision-making benchmark stack because it was too sensitive to answer formatting for our current usage

Next Eval Upgrades

  • add answer-quality eval coverage for the newer report-generation datasets
  • add optional gold-answer fields to selected datasets
  • add stronger academic-synthesis eval questions
  • keep Vectara as the main external benchmark and OpenAI as a reference baseline in routine benchmark loops
  • keep ragas as the main answer-quality meter in routine benchmarking
  • compare against additional vendors when credentials are available:
    • Google Vertex AI Search
    • Cohere