feat: add a representative eval dataset suite and labeling contract for the Python backend #185

@maehwasoo

Description

Problem statement

POST /eval/run and the local eval artifact path already exist, but the current eval dataset is still effectively a smoke check rather than a representative regression guard.

Today, the repo has a reproducible eval runner, yet the dataset itself is too small and too weakly specified to answer questions such as:

  • did a retrieval change improve or regress note selection quality?
  • did a local embedding/provider swap degrade retrieval behavior?
  • did ambiguity handling or missing-context handling regress?
  • did the answer stage regress independently from retrieval?

Without a stronger golden dataset and an explicit labeling contract, the eval path can appear implemented while still being too weak for safe default changes, meaningful comparisons, or strong portfolio evidence.

Proposed solution

Add a representative local eval dataset suite and an explicit labeling/scoring contract for the Python backend.

Keep this issue focused on evaluation quality and reproducibility, not on introducing remote judge models or cloud-only evaluation infrastructure.

Suggested implementation breakdown:

  1. Define the dataset contract clearly.

    • Keep the current JSON-array format as the starting point, or add an additive schema version if needed.
    • Document required fields and their semantics:
      • case_id
      • input
      • context
      • expected_paths
      • expected_terms
    • Add optional fields only when they materially improve evaluation quality, for example:
      • notes
      • expected_failure_code
      • forbidden_paths
      • forbidden_terms
      • retrieval_mode
    • Fail closed on invalid or incomplete dataset entries instead of silently accepting weak labels.
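The field list and fail-closed rule above could be enforced with a small loader along these lines. This is only a sketch: `EvalCase` and `load_case` are hypothetical names, not code that exists in `evals.py` today, and the type choices are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Field names come from the issue text; everything else is illustrative.
REQUIRED_FIELDS = {"case_id", "input", "context", "expected_paths", "expected_terms"}
OPTIONAL_FIELDS = {"notes", "expected_failure_code", "forbidden_paths",
                   "forbidden_terms", "retrieval_mode"}

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input: str
    context: str
    expected_paths: list
    expected_terms: list
    notes: Optional[str] = None
    expected_failure_code: Optional[str] = None
    forbidden_paths: list = field(default_factory=list)
    forbidden_terms: list = field(default_factory=list)
    retrieval_mode: Optional[str] = None

def load_case(raw: dict) -> EvalCase:
    """Fail closed: reject entries with missing required or unknown fields."""
    missing = REQUIRED_FIELDS - raw.keys()
    unknown = raw.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if missing or unknown:
        raise ValueError(
            f"invalid eval case: missing={sorted(missing)} unknown={sorted(unknown)}"
        )
    return EvalCase(**raw)
```

Rejecting unknown fields (not just missing ones) is what makes silently weak or typo'd labels impossible; a mistyped `expectd_terms` fails the whole load instead of scoring as an empty requirement.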
  2. Add a representative golden dataset.

    • Expand beyond the current single smoke case into a balanced suite that covers:
      • straightforward retrieval success
      • multi-note grounded summarization
      • path_prefix-scoped retrieval
      • tag-filtered retrieval
      • missing-context failures
      • ambiguous-note-resolution failures
      • lexical vs semantic behavior where relevant
      • regression guards for future local embedding/provider swaps
    • If local use is expected to be bilingual, include both Korean and English query/note coverage.
    • Keep the suite small enough for routine local runs, but large enough to catch obvious regressions.
    • Consider maintaining two tiers:
      • smoke subset
      • representative subset
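For illustration, a two-case slice of such a dataset might look like the following, keeping the current JSON-array format. Every path, term, and the failure code are invented placeholders, not real vault content or real error codes, and the smoke/representative split shown is just one possible convention.

```python
import json

# A plain retrieval success plus a guarded missing-context failure.
SAMPLE_DATASET = json.loads("""
[
  {
    "case_id": "retrieval-basic-001",
    "input": "What did I decide about backups?",
    "context": "notes/decisions",
    "expected_paths": ["notes/decisions/backups.md"],
    "expected_terms": ["weekly", "encrypted"]
  },
  {
    "case_id": "missing-context-001",
    "input": "Summarize my notes about a topic I never wrote down",
    "context": "notes",
    "expected_paths": [],
    "expected_terms": [],
    "expected_failure_code": "NO_RELEVANT_NOTES"
  }
]
""")

# One possible two-tier convention: a small explicit smoke subset,
# with the full array serving as the representative subset.
SMOKE_IDS = {"retrieval-basic-001"}
smoke_subset = [c for c in SAMPLE_DATASET if c["case_id"] in SMOKE_IDS]
```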
  3. Define scoring rules explicitly.

    • Separate retrieval quality from answer quality.
    • Retrieval scoring should define pass/fail semantics for:
      • expected path presence
      • optional rank expectations when order matters
      • forbidden path exclusion where needed
    • Agent scoring should define pass/fail semantics for:
      • required grounded terms present
      • forbidden terms absent where needed
      • expected failure code returned for guarded cases
      • citations present for successful grounded answers
    • Keep scoring deterministic and inspectable.
    • Do not add LLM-as-judge behavior in this issue.
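A deterministic, inspectable scorer separating the two stages could look roughly like this. Function names, return shapes, and the case-dict field access are assumptions, not the current `evals.py` API; optional rank expectations would extend `score_retrieval` by checking positions within the ordered `retrieved_paths` list.

```python
from typing import Optional

def score_retrieval(case: dict, retrieved_paths: list) -> tuple:
    """Pass/fail on expected-path presence and forbidden-path exclusion."""
    reasons = []
    for p in case.get("expected_paths", []):
        if p not in retrieved_paths:
            reasons.append(f"missing expected path: {p}")
    for p in case.get("forbidden_paths", []):
        if p in retrieved_paths:
            reasons.append(f"forbidden path retrieved: {p}")
    return (not reasons, reasons)

def score_answer(case: dict, answer: str,
                 failure_code: Optional[str], citations: list) -> tuple:
    """Pass/fail on grounded terms, forbidden terms, failure code, citations."""
    reasons = []
    expected_code = case.get("expected_failure_code")
    if expected_code is not None:
        # Guarded case: only the observed failure code matters.
        if failure_code != expected_code:
            reasons.append(f"expected failure code {expected_code}, got {failure_code}")
        return (not reasons, reasons)
    lowered = answer.lower()
    for t in case.get("expected_terms", []):
        if t.lower() not in lowered:
            reasons.append(f"missing grounded term: {t}")
    for t in case.get("forbidden_terms", []):
        if t.lower() in lowered:
            reasons.append(f"forbidden term present: {t}")
    if not citations:
        reasons.append("successful grounded answer has no citations")
    return (not reasons, reasons)
```

Because every verdict comes with a reason string and no model call, two runs on the same inputs always score identically, and a failing case explains itself.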
  4. Improve eval artifacts for failure inspection.

    • Persist enough case-level detail to explain failures quickly:
      • dataset case metadata
      • retrieval result paths and ordering
      • pass/fail reasons
      • expected vs observed failure code
      • underlying agent artifact path when available
    • Keep artifact output predictable under the configured eval artifact directory.
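One possible shape for the per-case record, assuming the configured artifact directory is passed in. All names here are hypothetical; the point is that each case yields a predictable path and a self-explanatory JSON document.

```python
import json
from pathlib import Path

def write_case_artifact(artifact_dir: Path, run_id: str, case: dict,
                        retrieved_paths: list, passed: bool, reasons: list,
                        observed_failure_code: str = None,
                        agent_artifact_path: str = None) -> Path:
    """Write one JSON record per case under <artifact_dir>/<run_id>/."""
    record = {
        "case_id": case["case_id"],
        "input": case.get("input"),
        "retrieved_paths": retrieved_paths,  # list order preserves ranking
        "passed": passed,
        "reasons": reasons,
        "expected_failure_code": case.get("expected_failure_code"),
        "observed_failure_code": observed_failure_code,
        "agent_artifact_path": agent_artifact_path,
    }
    out = artifact_dir / run_id / f"{case['case_id']}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, ensure_ascii=False, indent=2))
    return out
```

`ensure_ascii=False` keeps any Korean query/note text readable in the artifact instead of `\uXXXX` escapes.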
  5. Add dataset validation and regression coverage.

    • Add tests for dataset loading and validation failures.
    • Add tests for representative pass and fail cases.
    • Add regression tests that prove the eval runner can distinguish:
      • retrieval failure
      • answer-stage failure
      • expected guarded failure
    • Ensure POST /eval/run still fails clearly when the dataset is missing or invalid.
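The three-way distinction above could be pinned down with tests along these lines. `classify_failure` is a hypothetical helper standing in for whatever the real runner exposes; the tests only demonstrate that the outcomes stay distinguishable.

```python
def classify_failure(expected_code, observed_code,
                     retrieval_passed, answer_passed):
    """Map stage results to one of: pass, guarded_pass, guard_regression,
    retrieval_failure, answer_failure. Hypothetical helper for illustration."""
    if expected_code is not None:
        return "guarded_pass" if observed_code == expected_code else "guard_regression"
    if not retrieval_passed:
        return "retrieval_failure"
    if not answer_passed:
        return "answer_failure"
    return "pass"

def test_distinguishes_retrieval_from_answer_failure():
    assert classify_failure(None, None, False, True) == "retrieval_failure"
    assert classify_failure(None, None, True, False) == "answer_failure"

def test_expected_guarded_failure_is_not_a_regression():
    assert classify_failure("NO_RELEVANT_NOTES", "NO_RELEVANT_NOTES",
                            False, False) == "guarded_pass"
    assert classify_failure("NO_RELEVANT_NOTES", None,
                            False, False) == "guard_regression"
```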
  6. Document the authoring and maintenance workflow.

    • Explain how to add a new eval case.
    • Explain how to label a case and how to review failures.
    • Clarify when the dataset should change after intended retrieval/agent behavior changes.
    • Document how this eval suite should be used for local model/provider comparisons, especially future open-source local embedding integrations.

Suggested files / areas:

  • apps/api/data/*.json
  • apps/api/src/ailss_api/evals.py
  • apps/api/src/ailss_api/models.py
  • apps/api/tests/test_app.py
  • apps/api/README.md
  • docs/architecture/python-first-local-agent-backend.md
  • docs/03-plan.md

Follow-up relationship:

Suggested done criteria:

  • The repo contains at least one representative golden dataset, not just a smoke case.
  • The dataset contract and scoring rules are documented and enforced.
  • POST /eval/run produces case-level artifacts that explain pass/fail outcomes.
  • The eval path can distinguish retrieval regressions from answer-stage regressions.
  • Local contributors can add new eval cases without guessing the labeling rules.
  • Docs clearly explain how eval supports future model/provider comparisons.

Constraints / context (optional)

  • Keep the stack local-first and single-user.
  • Do not introduce cloud-only evaluation dependencies or paid judge models in this issue.
  • Prefer additive schema changes over breaking the current POST /eval/run contract unless a break is clearly justified.
  • Do not broaden write authority or modify vault content as part of eval runs.
  • Keep the first implementation deterministic and easy to review.
  • Parent umbrella: feat: establish a Python-first local agent backend baseline #175

Metadata

Labels

  • api
  • Python backend
  • docs (Improvements or additions to documentation)
  • enhancement (New feature or request)
