```python
evaluators=[
    not_empty,
    ShortCircuit([
        contains_expected_facts(min_score=0.5),
        llm_judge(rubric="..."),  # skipped if the above fails
    ]),
]
```
The first Verdict=False stops the group; evaluators outside the ShortCircuit run regardless.
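A minimal sketch of the short-circuit rule described above. This is an assumption-laden toy, not the real ProTest API: the sketch simplifies evaluators to plain callables returning bool, while the actual library passes an EvalContext and supports richer return types.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Pass/fail outcome of one evaluator (hypothetical shape)."""
    passed: bool
    name: str


class ShortCircuit:
    """Run evaluators in order; stop the group at the first failing verdict."""

    def __init__(self, evaluators):
        self.evaluators = evaluators

    def run(self, output):
        verdicts = []
        for ev in self.evaluators:
            v = Verdict(passed=bool(ev(output)), name=ev.__name__)
            verdicts.append(v)
            if not v.passed:
                break  # later evaluators in this group are skipped
        return verdicts


def not_empty(output: str) -> bool:
    return bool(output.strip())


def contains_greeting(output: str) -> bool:
    return "hello" in output.lower()
```

With this sketch, `ShortCircuit([not_empty, contains_greeting]).run("")` yields a single failing verdict and never invokes `contains_greeting`.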
- EvalPayload, EvalScoreEntry on TestResult for eval case results
- EVAL_SUITE_END event emitted by core runner
- is_eval flag on TestRegistration/TestItem
- KindFilterPlugin for `protest run` vs `protest eval`
- get_type_hints_compat: PEP 563 + TYPE_CHECKING support in all DI sites
- Async fixture teardown on same event loop (no more loop mismatch)
- Fixture resolution time excluded from test duration
- Log records captured on TestResult for `--show-logs`
An eval is a test that returns a scored value. Evals use ForEach/From for parametrization; there is no separate EvalSuite/EvalCase framework.

- `@session.eval(evaluators=[...])` decorator
- `@evaluator` decorator with partial-application binding
- EvalSession(model=) for eval-focused sessions
- EvalContext passed to evaluators
- Scoring v2: evaluators return bool or a dataclass
  - Annotated[bool, Verdict] → pass/fail
  - Annotated[float, Metric] → stats aggregation
  - Annotated[str, Reason] → displayed on failure
- EvalCase dataclass for typed ForEach data
- Built-in evaluators: contains_keywords, not_empty, max_length, etc.
- EvalHistoryPlugin listens to EVAL_SUITE_END
- EvalResultsWriter for per-case .md files
- Evaluator exception → error (not fail)
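The Annotated-based scoring markers above could be routed with `typing.get_type_hints(include_extras=True)`. The sketch below shows the mechanism; the `Score` dataclass, `classify_fields` helper, and the bare marker classes are hypothetical stand-ins, not ProTest's real Verdict/Metric/Reason types.

```python
from dataclasses import dataclass, fields
from typing import Annotated, get_args, get_origin, get_type_hints


# Marker classes: hypothetical stand-ins for ProTest's scoring markers.
class Verdict: ...
class Metric: ...
class Reason: ...


@dataclass
class Score:
    passed: Annotated[bool, Verdict]    # → pass/fail
    overlap: Annotated[float, Metric]   # → stats aggregation
    why: Annotated[str, Reason]         # → displayed on failure


def classify_fields(cls):
    """Map each dataclass field to its scoring role via Annotated metadata."""
    roles = {}
    hints = get_type_hints(cls, include_extras=True)
    for f in fields(cls):
        hint = hints[f.name]
        if get_origin(hint) is Annotated:
            for meta in get_args(hint)[1:]:
                roles[f.name] = meta.__name__
    return roles
```

A runner could then use this mapping to decide which field drives pass/fail and which feeds metric aggregation.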
- on_eval_suite_end: Rich table for scores, plain text for ASCII
- Scores inline in `-v`; `--show-output` for inputs/output/expected
- `--show-logs` flag for captured log records
- Fixture setup time always displayed
- `protest history --runs`: per-suite breakdown with model
- protest.console.print(): progress output bypassing capture
- Lifecycle messages bypass capture (no re-display on fail)
- Output truncated at 20 lines with a pointer to the full output
- Case id in lifecycle messages (chatbot[lookup], not chatbot)
- 1063 tests (56 eval-specific)
- Yorkshire chatbot example with @session.eval + ForEach
- History module: JSONL storage, git info, env info
- docs/evals.md: full guide (scoring, evaluators, CLI, history)
- docs/core-concepts/console.md: console.print guide
ProTest owns the judge interface; the user plugs in their LLM library.

- Judge protocol: `async judge(prompt, output_type) -> JudgeResponse[T]`
- JudgeResponse wraps output with optional tokens/cost tracking
- EvalContext.judge() unwraps for evaluators and accumulates usage stats
- JudgeInfo auto-derived from the instance for history
- EvalPayload carries judge_call_count, tokens, and cost per case
- EvalSession(judge=MyJudge()) wires through to evaluators
- suite.eval(judge=) for standalone usage
- 19 new tests (protocol, ctx.judge, e2e, structured output, tokens)
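The "ProTest owns the interface" split above can be sketched as a `typing.Protocol` plus a generic response wrapper. The field names on `JudgeResponse` and the `EchoJudge` toy implementation are assumptions inferred from the bullets, not the library's real signatures.

```python
import asyncio
from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")


@dataclass
class JudgeResponse(Generic[T]):
    """Wraps a judged output with optional usage accounting (hypothetical shape)."""
    output: T
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0


class Judge(Protocol):
    """The interface ProTest would own; users implement it over their LLM library."""

    async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]: ...


class EchoJudge:
    """Toy judge for tests: returns the prompt length coerced to output_type."""

    async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]:
        return JudgeResponse(output=output_type(len(prompt)), input_tokens=len(prompt))
```

Because `Judge` is a Protocol, any object with a matching `judge` coroutine satisfies it structurally, with no base class required.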
```
Task:  45.2k in / 27.1k out, $0.0142
Judge: 5 calls, 800 in / 400 out, $0.0030
```
…flag

Fixture crashes (errored >= total_cases) were counted in pass_rates, score_values, and flaky, polluting stats with noise. Now:

- EvalCaseResult.is_error propagated from TestResult.is_fixture_error
- History serializes an errored count per suite plus is_error per case
- _aggregate_suites skips error-only runs from stats entirely
- _track_cases skips error cases from score_values and flaky
- Error runs still visible in `protest history --runs`

Also: docs/evals.md updated for the TaskResult section and the Judge protocol fix.
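The aggregation rule described above can be reduced to a few lines: drop error cases before computing pass rate and score stats, and skip error-only runs entirely. `CaseResult` and `aggregate` here are hypothetical names for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    score: float
    passed: bool
    is_error: bool = False  # e.g. a fixture crash, not a real eval outcome


def aggregate(cases):
    """Aggregate eval cases, excluding fixture-crash noise from the stats."""
    scored = [c for c in cases if not c.is_error]
    if not scored:
        return None  # error-only run: excluded from stats entirely
    return {
        "pass_rate": sum(c.passed for c in scored) / len(scored),
        "mean_score": mean(c.score for c in scored),
        "errored": sum(c.is_error for c in cases),  # still reported, just not averaged
    }
```

Without the filter, a crashed fixture would register as score 0.0 and drag the mean down; with it, the run's history entry keeps the error count visible without skewing pass rates.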
- Remove defensive getattr in session.py where types are known
- Type plugin setup(session: ProTestSession) instead of Any
- Add name/provider to the Judge Protocol for an explicit contract
- Delete ModelInfo.from_agent and JudgeInfo.from_instance; the user wires these
- Fix lint: PLR2004 magic values, PLR0912 noqa, ambiguous unicode
Replace the fragile repr() fallback with an explicit error on unknown types. Add evaluator_identity() as a user-controlled escape hatch for custom evaluators. Introspect dataclass/partial/callable as a fallback only.

- Remove hasattr(obj, "model_dump") duck-typing (a Pydantic leak)
- Remove the default=str silent fallback in json.dumps
- Skip _-prefixed dataclass fields (runtime internals, not config)
- Add functools.partial support (qualname + bound kwargs)
- Add ShortCircuit.evaluator_identity()
- 33 tests covering all paths, including fail-hard
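The lookup order described above (explicit hook first, then introspection, then a hard error) might look like the sketch below. The function name and returned dict shape are assumptions; only the policy itself comes from the description.

```python
import functools


def evaluator_identity_of(obj):
    """Derive a stable serializable identity for an evaluator.

    Order: user-provided evaluator_identity() hook, then functools.partial
    introspection, then a plain callable's qualname. Unknown objects fail
    hard instead of silently falling back to repr().
    """
    hook = getattr(obj, "evaluator_identity", None)
    if callable(hook):
        return hook()
    if isinstance(obj, functools.partial):
        return {"name": obj.func.__qualname__, "kwargs": dict(obj.keywords)}
    if callable(obj):
        return {"name": obj.__qualname__}
    raise TypeError(f"cannot derive evaluator identity for {type(obj).__name__}")


def max_length(output, limit=100):
    return len(output) <= limit
```

A partial like `functools.partial(max_length, limit=80)` then serializes with its bound kwargs, so two differently-configured instances of the same evaluator get distinct identities in history.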
Type-safe suite kind across the codebase. StrEnum keeps JSON compat (SuiteKind.EVAL == "eval"), so no migration is needed.
…ores

28 lazy imports in protest/, none resolving a real circular dependency. All were moved to top level except justified cases (optional deps like rich, conditional wiring, and one true circular import in evals/__init__.py). Removed the blanket PLC0415 per-file-ignores from pyproject.toml; remaining suppressions use inline noqa with justification.
- Type built-in evaluators as EvalContext[Any, str] (text evaluators)
- not_empty typed EvalContext[Any, Any] (works on any output)
- Fix mypy running outside the venv (uv run mypy in the justfile)
- Add mypy config in pyproject.toml with a rich stubs override
- Fix no-any-return, arg-type, and unused type-ignore across the codebase
- Remove stale type: ignore[import-not-found] on rich imports
- Remove is_async_evaluator(), _is_evaluator, _is_async_evaluator (written but never read: dead code with hasattr duck-typing)
- Add yorkshire example evaluators showing EvalContext generics: [Any, str] for text, [str, float] for numeric, [str, bytes] for binary
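The `EvalContext[input, output]` generics mentioned above could be modeled as a two-parameter generic dataclass. This is a sketch of the typing pattern only; ProTest's real EvalContext certainly carries more than these two fields.

```python
from dataclasses import dataclass
from typing import Any, Generic, TypeVar

InT = TypeVar("InT")
OutT = TypeVar("OutT")


@dataclass
class EvalContext(Generic[InT, OutT]):
    """Hypothetical minimal context: the case input and the produced output."""
    input: InT
    output: OutT


def max_length(ctx: EvalContext[Any, str], limit: int = 100) -> bool:
    """Text evaluator: only makes sense for str outputs."""
    return len(ctx.output) <= limit


def not_empty(ctx: EvalContext[Any, Any]) -> bool:
    """Works on any output type, hence [Any, Any]."""
    return bool(ctx.output)
```

mypy can then reject passing an `EvalContext[Any, bytes]` to `max_length` while still accepting it for `not_empty`.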
- Removed unnecessary `# type: ignore[import-not-found]` markers on imports. - Added `--group dev` flag to dependency sync in CI workflow. - Updated `uv.lock` to include new packages: `librt` and `mypy`.