feat: add a representative eval dataset suite and labeling contract for the Python backend #185

@maehwasoo

Description

Problem statement

POST /eval/run and the local eval artifact path already exist, but the current eval dataset is still effectively a smoke check rather than a representative regression guard.

Today, the repo has a reproducible eval runner, yet the dataset itself is too small and too weakly specified to answer questions such as:

  • did a retrieval change improve or regress note selection quality?
  • did a local embedding/provider swap degrade retrieval behavior?
  • did ambiguity handling or missing-context handling regress?
  • did the answer stage regress independently from retrieval?

Without a stronger golden dataset and an explicit labeling contract, the eval path can appear implemented while still being too weak for safe default changes, meaningful comparisons, or strong portfolio evidence.

Proposed solution

Add a representative local eval dataset suite and an explicit labeling/scoring contract for the Python backend.

Keep this issue focused on evaluation quality and reproducibility, not on introducing remote judge models or cloud-only evaluation infrastructure.

Suggested implementation breakdown:

  1. Define the dataset contract clearly.

    • Keep the current JSON-array format as the starting point, or add an additive schema version if needed.
    • Document required fields and their semantics:
      • case_id
      • input
      • context
      • expected_paths
      • expected_terms
    • Add optional fields only when they materially improve evaluation quality, for example:
      • notes
      • expected_failure_code
      • forbidden_paths
      • forbidden_terms
      • retrieval_mode
    • Fail closed on invalid or incomplete dataset entries instead of silently accepting weak labels.
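The field list and fail-closed rule above could be enforced with a small loader along these lines. This is only a sketch: `EvalCase` and `load_case` are hypothetical names, not code that exists in `evals.py` today, and the type choices are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Field names come from the issue text; everything else is illustrative.
REQUIRED_FIELDS = {"case_id", "input", "context", "expected_paths", "expected_terms"}
OPTIONAL_FIELDS = {"notes", "expected_failure_code", "forbidden_paths",
                   "forbidden_terms", "retrieval_mode"}

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input: str
    context: str
    expected_paths: list
    expected_terms: list
    notes: Optional[str] = None
    expected_failure_code: Optional[str] = None
    forbidden_paths: list = field(default_factory=list)
    forbidden_terms: list = field(default_factory=list)
    retrieval_mode: Optional[str] = None

def load_case(raw: dict) -> EvalCase:
    """Fail closed: reject entries with missing required or unknown fields."""
    missing = REQUIRED_FIELDS - raw.keys()
    unknown = raw.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if missing or unknown:
        raise ValueError(
            f"invalid eval case: missing={sorted(missing)} unknown={sorted(unknown)}"
        )
    return EvalCase(**raw)
```

Rejecting unknown fields (not just missing ones) is what makes silently weak or typo'd labels impossible; a mistyped `expectd_terms` fails the whole load instead of scoring as an empty requirement.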
  2. Add a representative golden dataset.

    • Expand beyond the current single smoke case into a balanced suite that covers:
      • straightforward retrieval success
      • multi-note grounded summarization
      • path_prefix-scoped retrieval
      • tag-filtered retrieval
      • missing-context failures
      • ambiguous-note-resolution failures
      • lexical vs semantic behavior where relevant
      • regression guards for future local embedding/provider swaps
    • If local use is expected to be bilingual, include both Korean and English query/note coverage.
    • Keep the suite small enough for routine local runs, but large enough to catch obvious regressions.
    • Consider maintaining two tiers:
      • smoke subset
      • representative subset
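For illustration, a two-case slice of such a dataset might look like the following, keeping the current JSON-array format. Every path, term, and the failure code are invented placeholders, not real vault content or real error codes, and the smoke/representative split shown is just one possible convention.

```python
import json

# A plain retrieval success plus a guarded missing-context failure.
SAMPLE_DATASET = json.loads("""
[
  {
    "case_id": "retrieval-basic-001",
    "input": "What did I decide about backups?",
    "context": "notes/decisions",
    "expected_paths": ["notes/decisions/backups.md"],
    "expected_terms": ["weekly", "encrypted"]
  },
  {
    "case_id": "missing-context-001",
    "input": "Summarize my notes about a topic I never wrote down",
    "context": "notes",
    "expected_paths": [],
    "expected_terms": [],
    "expected_failure_code": "NO_RELEVANT_NOTES"
  }
]
""")

# One possible two-tier convention: a small explicit smoke subset,
# with the full array serving as the representative subset.
SMOKE_IDS = {"retrieval-basic-001"}
smoke_subset = [c for c in SAMPLE_DATASET if c["case_id"] in SMOKE_IDS]
```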
  3. Define scoring rules explicitly.

    • Separate retrieval quality from answer quality.
    • Retrieval scoring should define pass/fail semantics for:
      • expected path presence
      • optional rank expectations when order matters
      • forbidden path exclusion where needed
    • Agent scoring should define pass/fail semantics for:
      • required grounded terms present
      • forbidden terms absent where needed
      • expected failure code returned for guarded cases
      • citations present for successful grounded answers
    • Keep scoring deterministic and inspectable.
    • Do not add LLM-as-judge behavior in this issue.
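A deterministic, inspectable scorer separating the two stages could look roughly like this. Function names, return shapes, and the case-dict field access are assumptions, not the current `evals.py` API; optional rank expectations would extend `score_retrieval` by checking positions within the ordered `retrieved_paths` list.

```python
from typing import Optional

def score_retrieval(case: dict, retrieved_paths: list) -> tuple:
    """Pass/fail on expected-path presence and forbidden-path exclusion."""
    reasons = []
    for p in case.get("expected_paths", []):
        if p not in retrieved_paths:
            reasons.append(f"missing expected path: {p}")
    for p in case.get("forbidden_paths", []):
        if p in retrieved_paths:
            reasons.append(f"forbidden path retrieved: {p}")
    return (not reasons, reasons)

def score_answer(case: dict, answer: str,
                 failure_code: Optional[str], citations: list) -> tuple:
    """Pass/fail on grounded terms, forbidden terms, failure code, citations."""
    reasons = []
    expected_code = case.get("expected_failure_code")
    if expected_code is not None:
        # Guarded case: only the observed failure code matters.
        if failure_code != expected_code:
            reasons.append(f"expected failure code {expected_code}, got {failure_code}")
        return (not reasons, reasons)
    lowered = answer.lower()
    for t in case.get("expected_terms", []):
        if t.lower() not in lowered:
            reasons.append(f"missing grounded term: {t}")
    for t in case.get("forbidden_terms", []):
        if t.lower() in lowered:
            reasons.append(f"forbidden term present: {t}")
    if not citations:
        reasons.append("successful grounded answer has no citations")
    return (not reasons, reasons)
```

Because every verdict comes with a reason string and no model call, two runs on the same inputs always score identically, and a failing case explains itself.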
  4. Improve eval artifacts for failure inspection.

    • Persist enough case-level detail to explain failures quickly:
      • dataset case metadata
      • retrieval result paths and ordering
      • pass/fail reasons
      • expected vs observed failure code
      • underlying agent artifact path when available
    • Keep artifact output predictable under the configured eval artifact directory.
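One possible shape for the per-case record, assuming the configured artifact directory is passed in. All names here are hypothetical; the point is that each case yields a predictable path and a self-explanatory JSON document.

```python
import json
from pathlib import Path

def write_case_artifact(artifact_dir: Path, run_id: str, case: dict,
                        retrieved_paths: list, passed: bool, reasons: list,
                        observed_failure_code: str = None,
                        agent_artifact_path: str = None) -> Path:
    """Write one JSON record per case under <artifact_dir>/<run_id>/."""
    record = {
        "case_id": case["case_id"],
        "input": case.get("input"),
        "retrieved_paths": retrieved_paths,  # list order preserves ranking
        "passed": passed,
        "reasons": reasons,
        "expected_failure_code": case.get("expected_failure_code"),
        "observed_failure_code": observed_failure_code,
        "agent_artifact_path": agent_artifact_path,
    }
    out = artifact_dir / run_id / f"{case['case_id']}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, ensure_ascii=False, indent=2))
    return out
```

`ensure_ascii=False` keeps any Korean query/note text readable in the artifact instead of `\uXXXX` escapes.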
  5. Add dataset validation and regression coverage.

    • Add tests for dataset loading and validation failures.
    • Add tests for representative pass and fail cases.
    • Add regression tests that prove the eval runner can distinguish:
      • retrieval failure
      • answer-stage failure
      • expected guarded failure
    • Ensure POST /eval/run still fails clearly when the dataset is missing or invalid.
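The three-way distinction above could be pinned down with tests along these lines. `classify_failure` is a hypothetical helper standing in for whatever the real runner exposes; the tests only demonstrate that the outcomes stay distinguishable.

```python
def classify_failure(expected_code, observed_code,
                     retrieval_passed, answer_passed):
    """Map stage results to one of: pass, guarded_pass, guard_regression,
    retrieval_failure, answer_failure. Hypothetical helper for illustration."""
    if expected_code is not None:
        return "guarded_pass" if observed_code == expected_code else "guard_regression"
    if not retrieval_passed:
        return "retrieval_failure"
    if not answer_passed:
        return "answer_failure"
    return "pass"

def test_distinguishes_retrieval_from_answer_failure():
    assert classify_failure(None, None, False, True) == "retrieval_failure"
    assert classify_failure(None, None, True, False) == "answer_failure"

def test_expected_guarded_failure_is_not_a_regression():
    assert classify_failure("NO_RELEVANT_NOTES", "NO_RELEVANT_NOTES",
                            False, False) == "guarded_pass"
    assert classify_failure("NO_RELEVANT_NOTES", None,
                            False, False) == "guard_regression"
```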
  6. Document the authoring and maintenance workflow.

    • Explain how to add a new eval case.
    • Explain how to label a case and how to review failures.
    • Clarify when the dataset should change after intended retrieval/agent behavior changes.
    • Document how this eval suite should be used for local model/provider comparisons, especially future open-source local embedding integrations.

Suggested files / areas:

  • apps/api/data/*.json
  • apps/api/src/ailss_api/evals.py
  • apps/api/src/ailss_api/models.py
  • apps/api/tests/test_app.py
  • apps/api/README.md
  • docs/architecture/python-first-local-agent-backend.md
  • docs/03-plan.md

Follow-up relationship:

Suggested done criteria:

  • The repo contains at least one representative golden dataset, not just a smoke case.
  • The dataset contract and scoring rules are documented and enforced.
  • POST /eval/run produces case-level artifacts that explain pass/fail outcomes.
  • The eval path can distinguish retrieval regressions from answer-stage regressions.
  • Local contributors can add new eval cases without guessing the labeling rules.
  • Docs clearly explain how eval supports future model/provider comparisons.

Constraints / context (optional)

  • Keep the stack local-first and single-user.
  • Do not introduce cloud-only evaluation dependencies or paid judge models in this issue.
  • Prefer additive schema changes over breaking the current POST /eval/run contract unless a break is clearly justified.
  • Do not broaden write authority or modify vault content as part of eval runs.
  • Keep the first implementation deterministic and easy to review.
  • Parent umbrella: feat: establish a Python-first local agent backend baseline #175

Metadata

Labels

  • api
  • Python backend
  • docs (Improvements or additions to documentation)
  • enhancement (New feature or request)
