feat: evals #89

Open
renaudcepre wants to merge 21 commits into main from feat/evals-native

@renaudcepre
Owner

No description provided.

    evaluators=[
        not_empty,
        ShortCircuit([
            contains_expected_facts(min_score=0.5),
            llm_judge(rubric="..."),  # skipped if above fails
        ]),
    ]

The first evaluator returning Verdict=False stops the group. Evaluators outside the group run regardless.
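
The short-circuit rule above can be sketched in plain Python. This is a minimal illustration and not ProTest's implementation: `run_group`, `run_all`, and the three evaluator functions are hypothetical stand-ins that return a bare bool rather than ProTest's score types.

```python
from typing import Callable

Evaluator = Callable[[str], bool]  # hypothetical: output text -> verdict

calls: list[str] = []  # records which expensive evaluators actually ran


def not_empty(output: str) -> bool:
    return bool(output.strip())


def mentions_york(output: str) -> bool:
    return "york" in output.lower()


def expensive_judge(output: str) -> bool:
    # Stand-in for an LLM call; only records that it was invoked.
    calls.append("expensive_judge")
    return True


def run_group(group: list[Evaluator], output: str) -> dict[str, bool]:
    """Run evaluators in order and stop after the first False verdict."""
    results: dict[str, bool] = {}
    for evaluator in group:
        verdict = evaluator(output)
        results[evaluator.__name__] = verdict
        if not verdict:
            break  # remaining evaluators in this group are skipped
    return results


def run_all(output: str) -> dict[str, bool]:
    # The evaluator outside the group always runs.
    results = {"not_empty": not_empty(output)}
    results.update(run_group([mentions_york, expensive_judge], output))
    return results
```

In this sketch a skipped evaluator simply has no entry in the result, so the cheap check acts as a gate in front of the costly one.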
- EvalPayload, EvalScoreEntry on TestResult for eval case results
- EVAL_SUITE_END event emitted by core runner
- is_eval flag on TestRegistration/TestItem
- KindFilterPlugin for protest run vs protest eval
- get_type_hints_compat: PEP 563 + TYPE_CHECKING support in all DI sites
- Async fixture teardown on same event loop (no more loop mismatch)
- Fixture resolution time excluded from test duration
- Log records captured on TestResult for --show-logs
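
For context on the `get_type_hints_compat` item above: string annotations (what PEP 563 produces) that name a type imported only under `TYPE_CHECKING` make `typing.get_type_hints` raise `NameError` at runtime. The sketch below shows one possible fallback, assuming it is acceptable to keep unresolved annotations as raw strings. `resolve_hints` and `price_fixture` are hypothetical names, and the real helper may work differently.

```python
import typing
from typing import TYPE_CHECKING, Any, Callable

if TYPE_CHECKING:
    from decimal import Decimal  # visible to type checkers only, not at runtime


def resolve_hints(func: Callable[..., Any]) -> dict[str, Any]:
    """Resolve annotations, tolerating names that exist only under TYPE_CHECKING."""
    try:
        return typing.get_type_hints(func)
    except NameError:
        # Fall back to per-annotation resolution; keep the raw string
        # whenever a name cannot be resolved at runtime.
        hints: dict[str, Any] = {}
        globalns = getattr(func, "__globals__", {})
        for name, annotation in getattr(func, "__annotations__", {}).items():
            if isinstance(annotation, str):
                try:
                    hints[name] = eval(annotation, globalns)
                except NameError:
                    hints[name] = annotation
            else:
                hints[name] = annotation
        return hints


# Explicit string annotations mimic what PEP 563 does to every annotation.
def price_fixture(amount: "Decimal", label: "str") -> "str":
    return f"{label}: {amount}"
```

A dependency-injection site could then match on the resolved types where available and fall back to comparing names for the unresolved strings.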
An eval is a test that returns a scored value. It uses ForEach/From
for parametrization, so there is no separate EvalSuite/EvalCase framework.

- @session.eval(evaluators=[...]) decorator
- @evaluator decorator with partial-application binding
- EvalSession(model=) for eval-focused sessions
- EvalContext passed to evaluators
- Scoring v2: evaluators return bool or dataclass
  - Annotated[bool, Verdict] → pass/fail
  - Annotated[float, Metric] → stats aggregation
  - Annotated[str, Reason] → displayed on failure
- EvalCase dataclass for typed ForEach data
- Built-in evaluators: contains_keywords, not_empty, max_length, etc.
- EvalHistoryPlugin listens to EVAL_SUITE_END
- EvalResultsWriter for per-case .md files
- Evaluator exception → error (not fail)
- on_eval_suite_end: Rich table for scores, plain text for ASCII
- Scores inline in -v, --show-output for inputs/output/expected
- --show-logs flag for captured log records
- Fixture setup time always displayed
- protest history --runs: per-suite breakdown with model
- protest.console.print(): progress output bypassing capture
- Lifecycle messages bypass capture (no re-display on fail)
- Output truncated at 20 lines with pointer to full output
- Case id in lifecycle messages (chatbot[lookup] not chatbot)
- 1063 tests (56 eval-specific)
- Yorkshire chatbot example with @session.eval + ForEach
- History module: JSONL storage, git info, env info
- docs/evals.md: full guide (scoring, evaluators, CLI, history)
- docs/core-concepts/console.md: console.print guide
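
The "Scoring v2" item in the list above describes score dataclasses whose fields are tagged with `Annotated` markers. The following is a rough sketch of how such fields could be sorted by marker. The `Verdict`, `Metric`, and `Reason` classes here are local stand-ins defined for the example, and `FactScore` and `split_score` are hypothetical names, not ProTest's API.

```python
from dataclasses import dataclass, fields
from typing import Annotated, Any, get_type_hints


class Verdict:  # stand-in marker: pass/fail
    pass


class Metric:  # stand-in marker: numeric value for stats aggregation
    pass


class Reason:  # stand-in marker: text displayed on failure
    pass


@dataclass
class FactScore:
    passed: Annotated[bool, Verdict]
    coverage: Annotated[float, Metric]
    why: Annotated[str, Reason]


def split_score(score: Any) -> dict[str, dict[str, Any]]:
    """Group a score's fields by Annotated marker; a bare bool is one verdict."""
    buckets: dict[str, dict[str, Any]] = {"verdicts": {}, "metrics": {}, "reasons": {}}
    if isinstance(score, bool):
        buckets["verdicts"]["passed"] = score
        return buckets
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    hints = get_type_hints(type(score), include_extras=True)
    for field in fields(score):
        metadata = getattr(hints[field.name], "__metadata__", ())
        value = getattr(score, field.name)
        if Verdict in metadata:
            buckets["verdicts"][field.name] = value
        elif Metric in metadata:
            buckets["metrics"][field.name] = value
        elif Reason in metadata:
            buckets["reasons"][field.name] = value
    return buckets
```

Keeping the role in the type annotation means a single evaluator can return a verdict, a metric, and a reason together without a separate registration step.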

@renaudcepre changed the title from "Feat/evals native" to "feat: evals" on Mar 30, 2026

ProTest owns the interface, user plugs in their LLM library.

- Judge protocol: `async judge(prompt, output_type) -> JudgeResponse[T]`
- JudgeResponse wraps output with optional tokens/cost tracking
- EvalContext.judge() unwraps for evaluators, accumulates usage stats
- JudgeInfo auto-derived from instance for history
- EvalPayload carries judge_call_count, tokens, cost per case
- EvalSession(judge=MyJudge()) wires through to evaluators
- suite.eval(judge=) for standalone usage
- 19 new tests (protocol, ctx.judge, e2e, structured output, tokens)
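
The Judge protocol described above could look roughly like the sketch below. Only the `judge(prompt, output_type)` signature, the `JudgeResponse` name, and the `name`/`provider` attributes come from the commit messages. The field names on `JudgeResponse`, the `KeywordJudge` toy implementation, and the `UsageTracker` class (a stand-in for the unwrapping done by `EvalContext.judge()`) are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Generic, Optional, Protocol, Type, TypeVar

T = TypeVar("T")


@dataclass
class JudgeResponse(Generic[T]):
    output: T
    input_tokens: Optional[int] = None   # assumed field name
    output_tokens: Optional[int] = None  # assumed field name
    cost: Optional[float] = None         # assumed field name


class Judge(Protocol):
    name: str
    provider: str

    async def judge(self, prompt: str, output_type: Type[T]) -> JudgeResponse[T]: ...


class KeywordJudge:
    """Toy judge with no LLM behind it: passes when the prompt mentions 'tea'."""

    name = "keyword-judge"
    provider = "local"

    async def judge(self, prompt: str, output_type: Type[T]) -> JudgeResponse[T]:
        return JudgeResponse(
            output=output_type("tea" in prompt),
            input_tokens=len(prompt.split()),
            output_tokens=1,
        )


class UsageTracker:
    """Unwraps judge responses for the caller and accumulates usage stats."""

    def __init__(self, judge: Judge) -> None:
        self._judge = judge
        self.calls = 0
        self.input_tokens = 0

    async def judge(self, prompt: str, output_type: Type[T]) -> T:
        response = await self._judge.judge(prompt, output_type)
        self.calls += 1
        self.input_tokens += response.input_tokens or 0
        return response.output
```

Because `Judge` is a structural protocol, `KeywordJudge` does not inherit from it. Any object with matching attributes and an async `judge` method would satisfy the contract, which is how a user could plug in their own LLM library.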
Task: 45.2k in / 27.1k out, $0.0142
Judge: 5 calls, 800 in / 400 out, $0.0030
…flag

Fixture crashes (errored >= total_cases) were counted in pass_rates,
score_values, and flaky — polluting stats with noise. Now:
- EvalCaseResult.is_error propagated from TestResult.is_fixture_error
- History serializes errored count per suite + is_error per case
- _aggregate_suites skips error-only runs from stats entirely
- _track_cases skips error cases from score_values and flaky
- Error runs still visible in `protest history --runs`
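
To illustrate the stats change listed above, here is a small sketch of a pass-rate calculation that leaves errored cases out of the denominator. `CaseResult` and `pass_rate` are hypothetical names for this example. Only the `is_error` flag mirrors a field mentioned in the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:  # hypothetical shape for one eval case
    case_id: str
    passed: bool
    is_error: bool = False  # e.g. the fixture crashed before the case ran


def pass_rate(cases: list[CaseResult]) -> Optional[float]:
    """Pass rate over cases that actually ran; None for an error-only run."""
    scored = [case for case in cases if not case.is_error]
    if not scored:
        return None  # nothing to aggregate, so the run is skipped in stats
    return sum(case.passed for case in scored) / len(scored)
```

Returning `None` rather than `0.0` for an error-only run keeps a crashed fixture from looking like a genuine regression in the score history.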

Also: docs/evals.md updated for TaskResult section and Judge protocol fix.
- Remove defensive getattr in session.py where types are known
- Type plugin setup(session: ProTestSession) instead of Any
- Add name/provider to Judge Protocol — explicit contract
- Delete ModelInfo.from_agent and JudgeInfo.from_instance — user wires
- Fix lint: PLR2004 magic values, PLR0912 noqa, ambiguous unicode
Replace fragile repr() fallback with explicit error on unknown types.
Add evaluator_identity() as user-controlled escape hatch for custom
evaluators. Introspect dataclass/partial/callable as fallback only.

- Remove hasattr(obj, "model_dump") duck-typing (Pydantic leak)
- Remove default=str silent fallback in json.dumps
- Skip _prefixed dataclass fields (runtime internals, not config)
- Add functools.partial support (qualname + bound kwargs)
- Add ShortCircuit.evaluator_identity()
- 33 tests covering all paths including fail-hard
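
The identity rules listed above can be sketched as a single function: a user-provided `evaluator_identity()` wins, dataclass, `functools.partial`, and plain-callable introspection serve as fallbacks, and anything else raises. `identity_of`, `identity_json`, `max_length`, and `FactCheck` are hypothetical names, and the returned dictionary layout is an assumption rather than ProTest's actual format.

```python
import functools
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


def identity_of(obj: Any) -> dict[str, Any]:
    """Describe an evaluator as JSON-friendly data, failing hard on unknown types."""
    custom = getattr(obj, "evaluator_identity", None)
    if callable(custom):
        return custom()  # user-controlled escape hatch takes priority
    if is_dataclass(obj) and not isinstance(obj, type):
        config = {
            f.name: getattr(obj, f.name)
            for f in fields(obj)
            if not f.name.startswith("_")  # runtime internals, not config
        }
        return {"name": type(obj).__qualname__, "config": config}
    if isinstance(obj, functools.partial):
        return {"name": obj.func.__qualname__, "config": dict(obj.keywords)}
    if callable(obj) and hasattr(obj, "__qualname__"):
        return {"name": obj.__qualname__, "config": {}}
    raise TypeError(f"cannot derive an identity for {type(obj).__name__}")


def identity_json(obj: Any) -> str:
    # No default=str: an unserializable config value raises instead of
    # being silently stringified.
    return json.dumps(identity_of(obj), sort_keys=True)


def max_length(output: str, limit: int = 100) -> bool:
    return len(output) <= limit


@dataclass
class FactCheck:
    min_score: float
    _seen: int = 0  # internal counter, excluded from the identity
```

Failing loudly here trades convenience for stable identities: a `repr()` fallback can embed memory addresses that change on every run.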
Type-safe suite kind across the codebase. StrEnum keeps JSON
compat (SuiteKind.EVAL == "eval") so no migration needed.
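
A quick sketch of why a string-valued enum needs no data migration: members compare equal to their plain string values and serialize as ordinary JSON strings. The `TEST` member is a guess for illustration, since only `SuiteKind.EVAL` appears in the commit message. The fallback class is there only so the sketch also runs on interpreters older than Python 3.11, where `enum.StrEnum` was introduced.

```python
import json
from enum import Enum

try:
    from enum import StrEnum  # available from Python 3.11
except ImportError:
    class StrEnum(str, Enum):  # minimal equivalent for older interpreters
        def __str__(self) -> str:
            return str(self.value)


class SuiteKind(StrEnum):
    TEST = "test"  # hypothetical member
    EVAL = "eval"


# Written by newer code: serializes exactly like the old plain string.
encoded = json.dumps({"kind": SuiteKind.EVAL})

# Read from an old record: the raw string converts back to the enum member.
decoded = SuiteKind(json.loads(encoded)["kind"])
```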
…ores

28 lazy imports in protest/, none resolving a real circular dependency.
Moved all to top-level except justified cases (optional deps like rich,
conditional wiring, and one true circular import in evals/__init__.py).

Removed blanket PLC0415 per-file-ignores from pyproject.toml — remaining
suppressions use inline noqa with justification.
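
As an example of a justified lazy import, the sketch below keeps the `rich` import inside the function because it is an optional dependency, with an inline `noqa` for Ruff's `PLC0415` rule, and falls back to plain text when `rich` is not installed. `render_scores` is a hypothetical helper written for this illustration and is not taken from the ProTest code base.

```python
import io


def render_scores(rows: list[tuple[str, float]]) -> str:
    """Render evaluator scores as a Rich table, or as plain text without Rich."""
    try:
        # Optional dependency: imported lazily so the module loads without it.
        from rich.console import Console  # noqa: PLC0415
        from rich.table import Table  # noqa: PLC0415
    except ImportError:
        return "\n".join(f"{name}: {score:.2f}" for name, score in rows)

    table = Table(title="Scores")
    table.add_column("Evaluator")
    table.add_column("Score", justify="right")
    for name, score in rows:
        table.add_row(name, f"{score:.2f}")

    # Render into a buffer so the caller decides where the text goes.
    buffer = io.StringIO()
    Console(file=buffer, width=60).print(table)
    return buffer.getvalue()
```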
- Type built-in evaluators as EvalContext[Any, str] (text evaluators)
- not_empty typed EvalContext[Any, Any] (works on any output)
- Fix mypy running outside venv (uv run mypy in justfile)
- Add mypy config in pyproject.toml with rich stubs override
- Fix no-any-return, arg-type, unused type-ignore across codebase
- Remove stale type: ignore[import-not-found] on rich imports
- Remove is_async_evaluator(), _is_evaluator, _is_async_evaluator
  (written but never read — dead code with hasattr duck-typing)
- Add yorkshire example evaluators showing EvalContext generics:
  [Any, str] for text, [str, float] for numeric, [str, bytes] for binary
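
The generic parameters mentioned above can be read as `EvalContext[InputType, OutputType]`. The sketch below uses a local stand-in class to show how text, numeric, and binary evaluators might be typed. The field names `inputs`, `output`, and `expected` are assumptions, and the evaluator bodies are illustrative rather than copies of the built-ins or the yorkshire example.

```python
from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar

InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")


@dataclass
class EvalContext(Generic[InputT, OutputT]):  # local stand-in, not ProTest's class
    inputs: InputT
    output: OutputT
    expected: Optional[OutputT] = None


def max_length(ctx: EvalContext[Any, str], limit: int = 100) -> bool:
    # Text evaluator: any input type, str output.
    return len(ctx.output) <= limit


def within_tolerance(ctx: EvalContext[str, float], tolerance: float = 0.1) -> bool:
    # Numeric evaluator: str input, float output compared to the expected value.
    return ctx.expected is not None and abs(ctx.output - ctx.expected) <= tolerance


def is_png(ctx: EvalContext[str, bytes]) -> bool:
    # Binary evaluator: checks the leading bytes of the PNG file signature.
    return ctx.output.startswith(b"\x89PNG")
```

With annotations like these, a type checker such as mypy can flag a text evaluator attached to a task that returns bytes before the eval ever runs.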
- Removed unnecessary `# type: ignore[import-not-found]` markers on imports.
- Added `--group dev` flag to dependency sync in CI workflow.
- Updated `uv.lock` to include new packages: `librt` and `mypy`.