```python
evaluators=[
    not_empty,
    ShortCircuit([
        contains_expected_facts(min_score=0.5),
        llm_judge(rubric="..."),  # skipped if the above fails
    ]),
]
```
The first Verdict=False stops the group; evaluators outside the ShortCircuit run regardless.
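A minimal sketch of the short-circuit rule described above. This is an assumption-laden toy, not the real ProTest API: the sketch simplifies evaluators to plain callables returning bool, while the actual library passes an EvalContext and supports richer return types.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Pass/fail outcome of one evaluator (hypothetical shape)."""
    passed: bool
    name: str


class ShortCircuit:
    """Run evaluators in order; stop the group at the first failing verdict."""

    def __init__(self, evaluators):
        self.evaluators = evaluators

    def run(self, output):
        verdicts = []
        for ev in self.evaluators:
            v = Verdict(passed=bool(ev(output)), name=ev.__name__)
            verdicts.append(v)
            if not v.passed:
                break  # later evaluators in this group are skipped
        return verdicts


def not_empty(output: str) -> bool:
    return bool(output.strip())


def contains_greeting(output: str) -> bool:
    return "hello" in output.lower()
```

With this sketch, `ShortCircuit([not_empty, contains_greeting]).run("")` yields a single failing verdict and never invokes `contains_greeting`.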
- EvalPayload, EvalScoreEntry on TestResult for eval case results
- EVAL_SUITE_END event emitted by core runner
- is_eval flag on TestRegistration/TestItem
- KindFilterPlugin for `protest run` vs `protest eval`
- get_type_hints_compat: PEP 563 + TYPE_CHECKING support in all DI sites
- Async fixture teardown on same event loop (no more loop mismatch)
- Fixture resolution time excluded from test duration
- Log records captured on TestResult for `--show-logs`
An eval is a test that returns a scored value. Evals use ForEach/From for parametrization; there is no separate EvalSuite/EvalCase framework.

- `@session.eval(evaluators=[...])` decorator
- `@evaluator` decorator with partial-application binding
- EvalSession(model=) for eval-focused sessions
- EvalContext passed to evaluators
- Scoring v2: evaluators return bool or a dataclass
  - Annotated[bool, Verdict] → pass/fail
  - Annotated[float, Metric] → stats aggregation
  - Annotated[str, Reason] → displayed on failure
- EvalCase dataclass for typed ForEach data
- Built-in evaluators: contains_keywords, not_empty, max_length, etc.
- EvalHistoryPlugin listens to EVAL_SUITE_END
- EvalResultsWriter for per-case .md files
- Evaluator exception → error (not fail)
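The Annotated-based scoring markers above could be routed with `typing.get_type_hints(include_extras=True)`. The sketch below shows the mechanism; the `Score` dataclass, `classify_fields` helper, and the bare marker classes are hypothetical stand-ins, not ProTest's real Verdict/Metric/Reason types.

```python
from dataclasses import dataclass, fields
from typing import Annotated, get_args, get_origin, get_type_hints


# Marker classes: hypothetical stand-ins for ProTest's scoring markers.
class Verdict: ...
class Metric: ...
class Reason: ...


@dataclass
class Score:
    passed: Annotated[bool, Verdict]    # → pass/fail
    overlap: Annotated[float, Metric]   # → stats aggregation
    why: Annotated[str, Reason]         # → displayed on failure


def classify_fields(cls):
    """Map each dataclass field to its scoring role via Annotated metadata."""
    roles = {}
    hints = get_type_hints(cls, include_extras=True)
    for f in fields(cls):
        hint = hints[f.name]
        if get_origin(hint) is Annotated:
            for meta in get_args(hint)[1:]:
                roles[f.name] = meta.__name__
    return roles
```

A runner could then use this mapping to decide which field drives pass/fail and which feeds metric aggregation.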
- on_eval_suite_end: Rich table for scores, plain text for ASCII
- Scores inline in `-v`; `--show-output` for inputs/output/expected
- `--show-logs` flag for captured log records
- Fixture setup time always displayed
- `protest history --runs`: per-suite breakdown with model
- protest.console.print(): progress output bypassing capture
- Lifecycle messages bypass capture (no re-display on fail)
- Output truncated at 20 lines with a pointer to the full output
- Case id in lifecycle messages (chatbot[lookup], not chatbot)
- 1063 tests (56 eval-specific)
- Yorkshire chatbot example with @session.eval + ForEach
- History module: JSONL storage, git info, env info
- docs/evals.md: full guide (scoring, evaluators, CLI, history)
- docs/core-concepts/console.md: console.print guide
ProTest owns the judge interface; the user plugs in their LLM library.

- Judge protocol: `async judge(prompt, output_type) -> JudgeResponse[T]`
- JudgeResponse wraps output with optional tokens/cost tracking
- EvalContext.judge() unwraps for evaluators and accumulates usage stats
- JudgeInfo auto-derived from the instance for history
- EvalPayload carries judge_call_count, tokens, and cost per case
- EvalSession(judge=MyJudge()) wires through to evaluators
- suite.eval(judge=) for standalone usage
- 19 new tests (protocol, ctx.judge, e2e, structured output, tokens)
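The "ProTest owns the interface" split above can be sketched as a `typing.Protocol` plus a generic response wrapper. The field names on `JudgeResponse` and the `EchoJudge` toy implementation are assumptions inferred from the bullets, not the library's real signatures.

```python
import asyncio
from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")


@dataclass
class JudgeResponse(Generic[T]):
    """Wraps a judged output with optional usage accounting (hypothetical shape)."""
    output: T
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0


class Judge(Protocol):
    """The interface ProTest would own; users implement it over their LLM library."""

    async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]: ...


class EchoJudge:
    """Toy judge for tests: returns the prompt length coerced to output_type."""

    async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]:
        return JudgeResponse(output=output_type(len(prompt)), input_tokens=len(prompt))
```

Because `Judge` is a Protocol, any object with a matching `judge` coroutine satisfies it structurally, with no base class required.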
```
Task:  45.2k in / 27.1k out, $0.0142
Judge: 5 calls, 800 in / 400 out, $0.0030
```
…flag

Fixture crashes (errored >= total_cases) were counted in pass_rates, score_values, and flaky, polluting stats with noise. Now:

- EvalCaseResult.is_error propagated from TestResult.is_fixture_error
- History serializes an errored count per suite plus is_error per case
- _aggregate_suites skips error-only runs from stats entirely
- _track_cases skips error cases from score_values and flaky
- Error runs still visible in `protest history --runs`

Also: docs/evals.md updated for the TaskResult section and the Judge protocol fix.
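The aggregation rule described above can be reduced to a few lines: drop error cases before computing pass rate and score stats, and skip error-only runs entirely. `CaseResult` and `aggregate` here are hypothetical names for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    score: float
    passed: bool
    is_error: bool = False  # e.g. a fixture crash, not a real eval outcome


def aggregate(cases):
    """Aggregate eval cases, excluding fixture-crash noise from the stats."""
    scored = [c for c in cases if not c.is_error]
    if not scored:
        return None  # error-only run: excluded from stats entirely
    return {
        "pass_rate": sum(c.passed for c in scored) / len(scored),
        "mean_score": mean(c.score for c in scored),
        "errored": sum(c.is_error for c in cases),  # still reported, just not averaged
    }
```

Without the filter, a crashed fixture would register as score 0.0 and drag the mean down; with it, the run's history entry keeps the error count visible without skewing pass rates.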
- Remove defensive getattr in session.py where types are known
- Type plugin setup(session: ProTestSession) instead of Any
- Add name/provider to the Judge Protocol for an explicit contract
- Delete ModelInfo.from_agent and JudgeInfo.from_instance; the user wires these
- Fix lint: PLR2004 magic values, PLR0912 noqa, ambiguous unicode
Replace the fragile repr() fallback with an explicit error on unknown types. Add evaluator_identity() as a user-controlled escape hatch for custom evaluators. Introspect dataclass/partial/callable as a fallback only.

- Remove hasattr(obj, "model_dump") duck-typing (a Pydantic leak)
- Remove the default=str silent fallback in json.dumps
- Skip _-prefixed dataclass fields (runtime internals, not config)
- Add functools.partial support (qualname + bound kwargs)
- Add ShortCircuit.evaluator_identity()
- 33 tests covering all paths, including fail-hard
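The lookup order described above (explicit hook first, then introspection, then a hard error) might look like the sketch below. The function name and returned dict shape are assumptions; only the policy itself comes from the description.

```python
import functools


def evaluator_identity_of(obj):
    """Derive a stable serializable identity for an evaluator.

    Order: user-provided evaluator_identity() hook, then functools.partial
    introspection, then a plain callable's qualname. Unknown objects fail
    hard instead of silently falling back to repr().
    """
    hook = getattr(obj, "evaluator_identity", None)
    if callable(hook):
        return hook()
    if isinstance(obj, functools.partial):
        return {"name": obj.func.__qualname__, "kwargs": dict(obj.keywords)}
    if callable(obj):
        return {"name": obj.__qualname__}
    raise TypeError(f"cannot derive evaluator identity for {type(obj).__name__}")


def max_length(output, limit=100):
    return len(output) <= limit
```

A partial like `functools.partial(max_length, limit=80)` then serializes with its bound kwargs, so two differently-configured instances of the same evaluator get distinct identities in history.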
Type-safe suite kind across the codebase. StrEnum keeps JSON compat (SuiteKind.EVAL == "eval"), so no migration is needed.
…ores

28 lazy imports in protest/, none resolving a real circular dependency. All were moved to top level except justified cases (optional deps like rich, conditional wiring, and one true circular import in evals/__init__.py). Removed the blanket PLC0415 per-file-ignores from pyproject.toml; remaining suppressions use inline noqa with justification.
- Type built-in evaluators as EvalContext[Any, str] (text evaluators)
- not_empty typed EvalContext[Any, Any] (works on any output)
- Fix mypy running outside the venv (uv run mypy in the justfile)
- Add mypy config in pyproject.toml with a rich stubs override
- Fix no-any-return, arg-type, and unused type-ignore across the codebase
- Remove stale type: ignore[import-not-found] on rich imports
- Remove is_async_evaluator(), _is_evaluator, _is_async_evaluator (written but never read: dead code with hasattr duck-typing)
- Add yorkshire example evaluators showing EvalContext generics: [Any, str] for text, [str, float] for numeric, [str, bytes] for binary
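The `EvalContext[input, output]` generics mentioned above could be modeled as a two-parameter generic dataclass. This is a sketch of the typing pattern only; ProTest's real EvalContext certainly carries more than these two fields.

```python
from dataclasses import dataclass
from typing import Any, Generic, TypeVar

InT = TypeVar("InT")
OutT = TypeVar("OutT")


@dataclass
class EvalContext(Generic[InT, OutT]):
    """Hypothetical minimal context: the case input and the produced output."""
    input: InT
    output: OutT


def max_length(ctx: EvalContext[Any, str], limit: int = 100) -> bool:
    """Text evaluator: only makes sense for str outputs."""
    return len(ctx.output) <= limit


def not_empty(ctx: EvalContext[Any, Any]) -> bool:
    """Works on any output type, hence [Any, Any]."""
    return bool(ctx.output)
```

mypy can then reject passing an `EvalContext[Any, bytes]` to `max_length` while still accepting it for `not_empty`.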
- Removed unnecessary `# type: ignore[import-not-found]` markers on imports. - Added `--group dev` flag to dependency sync in CI workflow. - Updated `uv.lock` to include new packages: `librt` and `mypy`.