Problem statement
POST /eval/run and the local eval artifact path already exist, but the current eval dataset is still effectively a smoke check rather than a representative regression guard.
Today, the repo has a reproducible eval runner, yet the dataset itself is too small and too weakly specified to answer questions such as:
did a retrieval change improve or regress note selection quality?
did a local embedding/provider swap degrade retrieval behavior?
did ambiguity handling or missing-context handling regress?
did the answer stage regress independently from retrieval?
Without a stronger golden dataset and an explicit labeling contract, the eval path can appear implemented while still being too weak for safe default changes, meaningful comparisons, or strong portfolio evidence.
Proposed solution
Add a representative local eval dataset suite and an explicit labeling/scoring contract for the Python backend.
Keep this issue focused on evaluation quality and reproducibility, not on introducing remote judge models or cloud-only evaluation infrastructure.
Suggested implementation breakdown:
Define the dataset contract clearly.
Keep the current JSON-array format as the starting point, or add an additive schema version if needed.
Document required fields and their semantics:
case_id
input
context
expected_paths
expected_terms
Add optional fields only when they materially improve evaluation quality, for example:
notes
expected_failure_code
forbidden_paths
forbidden_terms
retrieval_mode
Fail closed on invalid or incomplete dataset entries instead of silently accepting weak labels.
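The fail-closed contract could be sketched roughly as follows. This is a minimal illustration, not the repo's actual models: `EvalCase`, `load_case`, and the exact validation rules are assumptions based on the field list above (for example, a guarded-failure case with legitimately empty `expected_paths` would need an explicit contract decision).

```python
from dataclasses import dataclass, field, fields as dc_fields
from typing import Optional

# Required by the contract above; fail closed if any is missing or empty.
REQUIRED_FIELDS = ("case_id", "input", "context", "expected_paths", "expected_terms")

@dataclass
class EvalCase:
    case_id: str
    input: str
    context: str
    expected_paths: list
    expected_terms: list
    # Optional fields; absent or empty means "not asserted" for this case.
    notes: Optional[str] = None
    expected_failure_code: Optional[str] = None
    forbidden_paths: list = field(default_factory=list)
    forbidden_terms: list = field(default_factory=list)
    retrieval_mode: Optional[str] = None

def load_case(raw: dict) -> EvalCase:
    """Validate one dataset entry, rejecting weak or malformed labels."""
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        raise ValueError(
            f"invalid eval case {raw.get('case_id', '<unknown>')!r}: "
            f"missing or empty required fields {missing}"
        )
    allowed = {f.name for f in dc_fields(EvalCase)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        raise ValueError(f"invalid eval case {raw['case_id']!r}: unknown fields {unknown}")
    return EvalCase(**raw)
```

Rejecting unknown fields as well as missing ones keeps typos in hand-authored labels (e.g. `expected_path` for `expected_paths`) from silently weakening a case.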
Add a representative golden dataset.
Expand beyond the current single smoke case into a balanced suite that covers:
straightforward retrieval success
multi-note grounded summarization
path_prefix-scoped retrieval
tag-filtered retrieval
missing-context failures
ambiguous-note-resolution failures
lexical vs semantic behavior where relevant
regression guards for future local embedding/provider swaps
If local use is expected to be bilingual, include both Korean and English query/note coverage.
Keep the suite small enough for routine local runs, but large enough to catch obvious regressions.
Consider maintaining two tiers:
smoke subset
representative subset
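A two-tier suite in the current JSON-array format might look like the fragment below. Every concrete value here is illustrative: the `tier` field, file paths, query text, and the `MISSING_CONTEXT` failure code are assumptions, not values from the repo, and how a guarded-failure case expresses empty retrieval expectations is a contract decision this issue would need to settle.

```json
[
  {
    "case_id": "smoke-basic-retrieval",
    "tier": "smoke",
    "input": "What does the planning note say about Q3 goals?",
    "context": "vault",
    "expected_paths": ["notes/meetings/2024-q3-planning.md"],
    "expected_terms": ["Q3", "goals"]
  },
  {
    "case_id": "rep-missing-context",
    "tier": "representative",
    "input": "Summarize my notes on quantum chemistry.",
    "context": "vault",
    "expected_paths": [],
    "expected_terms": [],
    "expected_failure_code": "MISSING_CONTEXT"
  }
]
```

A `tier`-style marker (or separate files per tier) would let routine local runs execute only the smoke subset while CI or pre-release checks run the representative subset.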
Define scoring rules explicitly.
Separate retrieval quality from answer quality.
Retrieval scoring should define pass/fail semantics for:
expected path presence
optional rank expectations when order matters
forbidden path exclusion where needed
Agent scoring should define pass/fail semantics for:
required grounded terms present
forbidden terms absent where needed
expected failure code returned for guarded cases
citations present for successful grounded answers
Keep scoring deterministic and inspectable.
Do not add LLM-as-judge behavior in this issue.
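Deterministic, inspectable scoring along these lines could be sketched as below. Function names and the `(passed, reasons)` result shape are assumptions; the point is that retrieval and answer checks stay separate, so a regression can be attributed to one stage.

```python
from typing import Optional

def score_retrieval(case: dict, retrieved_paths: list) -> tuple:
    """Deterministic retrieval scoring: returns (passed, reasons)."""
    reasons = []
    for path in case.get("expected_paths", []):
        if path not in retrieved_paths:
            reasons.append(f"expected path missing: {path}")
    for path in case.get("forbidden_paths", []):
        if path in retrieved_paths:
            reasons.append(f"forbidden path present: {path}")
    return (not reasons, reasons)

def score_answer(case: dict, answer_text: str, failure_code: Optional[str]) -> tuple:
    """Deterministic answer scoring: terms, forbidden terms, guarded failure codes."""
    reasons = []
    expected_code = case.get("expected_failure_code")
    if expected_code:
        # Guarded case: only the failure code is asserted.
        if failure_code != expected_code:
            reasons.append(f"expected failure code {expected_code}, observed {failure_code}")
        return (not reasons, reasons)
    if failure_code:
        reasons.append(f"unexpected failure code: {failure_code}")
    lowered = answer_text.lower()
    for term in case.get("expected_terms", []):
        if term.lower() not in lowered:
            reasons.append(f"required grounded term missing: {term}")
    for term in case.get("forbidden_terms", []):
        if term.lower() in lowered:
            reasons.append(f"forbidden term present: {term}")
    return (not reasons, reasons)
```

Because every failure produces a human-readable reason string rather than a bare boolean, the same output can feed both the pass/fail summary and the case-level artifacts described below.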
Improve eval artifacts for failure inspection.
Persist enough case-level detail to explain failures quickly:
dataset case metadata
retrieval result paths and ordering
pass/fail reasons
expected vs observed failure code
underlying agent artifact path when available
Keep artifact output predictable under the configured eval artifact directory.
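The per-case artifact could be as simple as one JSON file per case under the configured directory. The record shape below is illustrative (field names like `dataset_tier` and `agent_artifact_path` are assumptions), but it covers the details listed above: case metadata, retrieval ordering, reasons, and expected vs observed failure code.

```python
import json
from pathlib import Path

def write_case_artifact(artifact_dir: Path, record: dict) -> Path:
    """Persist one case-level eval record as JSON under the configured artifact dir."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    out_path = artifact_dir / f"{record['case_id']}.json"
    out_path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return out_path

# Illustrative record shape covering the case-level details listed above.
example_record = {
    "case_id": "rep-missing-context",
    "dataset_tier": "representative",
    "retrieved_paths": [],
    "passed": True,
    "fail_reasons": [],
    "expected_failure_code": "MISSING_CONTEXT",
    "observed_failure_code": "MISSING_CONTEXT",
    "agent_artifact_path": None,
}
```

Naming files by `case_id` keeps the output path predictable run-to-run, so failures can be diffed between runs with ordinary tools.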
Add dataset validation and regression coverage.
Add tests for dataset loading and validation failures.
Add tests for representative pass and fail cases.
Add regression tests that prove the eval runner can distinguish:
retrieval failure
answer-stage failure
expected guarded failure
Ensure POST /eval/run still fails clearly when the dataset is missing or invalid.
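The three-way distinction above might be covered by pytest-style tests like the following sketch. `classify_outcome` is a hypothetical stand-in for the runner's per-case outcome classification; the real runner would derive these booleans from its scoring stage.

```python
# Hypothetical stand-in for the eval runner's per-case outcome classification.
def classify_outcome(retrieval_ok, answer_ok, expected_code=None, observed_code=None):
    if expected_code is not None:
        # Guarded case: reaching the expected failure code is the passing outcome.
        return "pass" if observed_code == expected_code else "guard_regression"
    if not retrieval_ok:
        return "retrieval_failure"
    if not answer_ok:
        return "answer_failure"
    return "pass"

def test_distinguishes_retrieval_failure():
    assert classify_outcome(retrieval_ok=False, answer_ok=True) == "retrieval_failure"

def test_distinguishes_answer_stage_failure():
    assert classify_outcome(retrieval_ok=True, answer_ok=False) == "answer_failure"

def test_expected_guarded_failure_is_a_pass():
    assert classify_outcome(True, True, expected_code="MISSING_CONTEXT",
                            observed_code="MISSING_CONTEXT") == "pass"
```

Separating the classification from the scoring functions makes each regression test a one-line assertion and keeps the failure taxonomy itself under test.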
Document the authoring and maintenance workflow.
Explain how to add a new eval case.
Explain how to label a case and how to review failures.
Clarify when the dataset should change after intended retrieval/agent behavior changes.
Document how this eval suite should be used for local model/provider comparisons, especially future open-source local embedding integrations.
Suggested files / areas:
apps/api/data/*.json
apps/api/src/ailss_api/evals.py
apps/api/src/ailss_api/models.py
apps/api/tests/test_app.py
apps/api/README.md
docs/architecture/python-first-local-agent-backend.md
docs/03-plan.md
Follow-up relationship:
Suggested done criteria:
POST /eval/run produces case-level artifacts that explain pass/fail outcomes.
Constraints / context (optional)
Keep the existing POST /eval/run contract unless a break is clearly justified.