
Commit 4e4d91d
chore: entity exports, pyproject config
1 parent 82f736b

6 files changed: 382 additions & 16 deletions

docs/evals.md

Lines changed: 25 additions & 15 deletions
@@ -1,6 +1,6 @@
 # Evals
 
-Evaluate LLM outputs with scored metrics, thresholds, and historical tracking.
+Evaluate LLM outputs with scored metrics and historical tracking.
 
 ## What is an Eval?
 
@@ -41,7 +41,7 @@ protest eval evals.session:session
 1. Your function receives case data via `ForEach`/`From` (same as parameterized tests)
 2. It returns the output (string, object, anything)
 3. ProTest passes the output to evaluators → scores
-4. Scores determine pass/fail via thresholds
+4. Bool verdicts determine pass/fail
 5. Aggregated stats appear in the terminal
 
 The rest of the pipeline — fixtures, DI, parallelism, reporters — works identically to tests.
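The five steps in the documentation above can be sketched end to end. This is a toy illustration, not protest's API: `run_eval`, the dict-shaped evaluator results, and the case/function signatures are all invented here.

```python
# Toy sketch of the documented eval flow: cases feed a function, evaluators
# score the output, bool verdicts decide pass/fail, scores are aggregated.
from statistics import mean


def run_eval(cases, func, evaluators):
    scores, failed = [], []
    for case in cases:
        output = func(case)                                # steps 1-2: case in, output out
        results = [ev(output, case) for ev in evaluators]  # step 3: score the output
        if not all(r["ok"] for r in results):              # step 4: verdicts -> pass/fail
            failed.append(case)
        scores.extend(r["score"] for r in results)
    return {"mean_score": mean(scores), "failed": failed}  # step 5: aggregate stats


# Example: one case, one evaluator that checks the output is non-empty.
not_empty = lambda out, case: {"ok": bool(out), "score": float(bool(out))}
summary = run_eval([{"prompt": "hi"}], lambda c: c["prompt"].upper(), [not_empty])
```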
@@ -87,15 +87,23 @@ An evaluator is a function decorated with `@evaluator` that receives an `EvalCon
 
 ### Return Types
 
-Evaluators return `bool` (simple verdict) or a `dataclass` (structured result). The framework reads fields by type:
+Evaluators return `bool` (simple verdict) or a `dataclass` (structured result). In dataclasses, annotate fields to tell the framework what each one is:
 
-| Field Type | Role |
+```python
+from typing import Annotated
+from protest.evals import Metric, Verdict, Reason
+```
+
+| Annotation | Role |
 |------------|------|
-| `bool` | Verdict — pass/fail (`all(bool_fields)`) |
-| `float` | Metric — aggregated in stats (mean/p50/p95) |
-| `str` | Reason — displayed on failure, stored in history |
+| `Annotated[bool, Verdict]` | Verdict — pass/fail (`all(verdicts)`) |
+| `Annotated[float, Metric]` | Metric — aggregated in stats (mean/p50/p95) |
+| `Annotated[int, Metric]` | Metric — converted to float |
+| `Annotated[str, Reason]` | Reason — displayed on failure, stored in history |
+
+Unannotated fields are ignored by the runner — free metadata.
 
-Returning `float`, `dict`, or any other type raises `TypeError`.
+Returning `float`, `dict`, or any other non-dataclass/non-bool type raises `TypeError`.
 
 ### Simple Evaluator
 
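The annotation table in the docs hunk above can be implemented with stdlib typing introspection. A hedged sketch: `Metric`, `Verdict`, and `Reason` are hypothetical stand-in marker classes defined locally, and `classify` only approximates what protest's runner might do.

```python
from dataclasses import dataclass, fields
from typing import Annotated, get_args, get_origin, get_type_hints


# Hypothetical stand-ins for protest.evals.Metric / Verdict / Reason.
class Metric: ...
class Verdict: ...
class Reason: ...


@dataclass
class Scores:
    recall: Annotated[float, Metric]
    ok: Annotated[bool, Verdict]
    note: Annotated[str, Reason] = ""
    judge: str = "demo"  # unannotated -> ignored, free metadata


def classify(result):
    """Split a dataclass result into (verdict, metrics, reasons) by annotation."""
    hints = get_type_hints(type(result), include_extras=True)
    verdicts, metrics, reasons = [], {}, []
    for f in fields(result):
        hint = hints[f.name]
        if get_origin(hint) is not Annotated:
            continue  # unannotated fields are skipped
        _, *marks = get_args(hint)  # (base type, *annotation metadata)
        value = getattr(result, f.name)
        if Verdict in marks:
            verdicts.append(bool(value))
        elif Metric in marks:
            metrics[f.name] = float(value)  # int metrics become float
        elif Reason in marks:
            reasons.append(value)
    return all(verdicts), metrics, reasons
```

The pass/fail rule from the table (`all(verdicts)`) falls out of the last line: every `Verdict`-annotated field must be truthy.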
@@ -109,12 +117,14 @@ def not_empty(ctx: EvalContext) -> bool:
 
 ```python
 from dataclasses import dataclass
+from typing import Annotated
+from protest.evals import Metric, Verdict, Reason
 
 @dataclass
 class KeywordScores:
-    keyword_recall: float       # metric → stats
-    all_present: bool           # verdict → pass/fail
-    detail: str = ""            # reason → shown on failure
+    keyword_recall: Annotated[float, Metric]
+    all_present: Annotated[bool, Verdict]
+    detail: Annotated[str, Reason] = ""
 
 @evaluator
 def keyword_check(ctx: EvalContext, keywords: list[str], min_recall: float = 0.5) -> KeywordScores:
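The hunk above cuts off at the `keyword_check` signature. A standalone body consistent with it might compute recall as below. This is a hypothetical sketch: it takes the output text directly and returns a plain dict, since `EvalContext`, `@evaluator`, and the `KeywordScores` return type belong to protest.

```python
# Hypothetical keyword-recall evaluator body: fraction of keywords found in
# the output, with the pass threshold as an ordinary function parameter.
def keyword_check(output: str, keywords: list[str], min_recall: float = 0.5):
    text = output.lower()
    hits = [kw for kw in keywords if kw.lower() in text]
    missing = [kw for kw in keywords if kw.lower() not in text]
    recall = len(hits) / len(keywords) if keywords else 1.0
    return {
        "keyword_recall": recall,                       # metric -> stats
        "all_present": recall >= min_recall,            # verdict -> pass/fail
        "detail": f"missing: {missing}" if missing else "",  # reason
    }
```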
@@ -134,9 +144,9 @@ The threshold (`min_recall`) is a parameter of the evaluator, not a framework co
 ```python
 @dataclass
 class JudgeResult:
-    accuracy: float
-    accurate_enough: bool
-    reason: str = ""
+    accuracy: Annotated[float, Metric]
+    accurate_enough: Annotated[bool, Verdict]
+    reason: Annotated[str, Reason] = ""
 
 @evaluator
 async def llm_judge(ctx: EvalContext, rubric: str = "", min_score: float = 0.7) -> JudgeResult:
@@ -334,7 +344,7 @@ protest history --evals --compare
 Each case in history carries two hashes:
 
 - **`case_hash`** — hash of inputs + expected output. Changes when the test data changes.
-- **`eval_hash`** — hash of evaluators + thresholds. Changes when the scoring criteria change.
+- **`eval_hash`** — hash of evaluators. Changes when the scoring criteria change.
 
 `protest history --compare` uses these hashes to detect modified cases vs regressions. If a case's `eval_hash` changed between runs, it's reported as "scoring modified" rather than a real regression.
 
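The two-hash scheme described above can be sketched with `hashlib`. The function names and hashing details here are assumptions for illustration, not protest's implementation.

```python
# Sketch: case_hash covers the test data, eval_hash covers the scoring code.
# Comparing hashes across runs tells "data changed" apart from "scoring changed".
import hashlib
import inspect
import json


def case_hash(inputs: dict, expected) -> str:
    """Stable digest of inputs + expected output (changes with the test data)."""
    payload = json.dumps({"inputs": inputs, "expected": expected}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


def eval_hash(evaluators) -> str:
    """Digest of evaluator source code (changes with the scoring criteria)."""
    source = "".join(inspect.getsource(e) for e in evaluators)
    return hashlib.sha256(source.encode()).hexdigest()[:12]
```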

protest/entities/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -10,6 +10,8 @@
     format_fixture_scope,
 )
 from protest.entities.events import (
+    EvalPayload,
+    EvalScoreEntry,
     FixtureInfo,
     HandlerInfo,
     RunResult,
@@ -31,6 +33,8 @@
 from protest.entities.xfail import Xfail, normalize_xfail
 
 __all__ = [
+    "EvalPayload",
+    "EvalScoreEntry",
     "Fixture",
     "FixtureCallable",
     "FixtureInfo",

protest/entities/core.py

Lines changed: 2 additions & 0 deletions
@@ -49,6 +49,7 @@ class TestRegistration:
     xfail: Xfail | None = None
     timeout: float | None = None
     retry: Retry | None = None
+    is_eval: bool = False
 
 
 @dataclass(frozen=True, slots=True)
@@ -111,6 +112,7 @@ class TestItem:
     xfail: Xfail | None = None
     timeout: float | None = None
     retry: Retry | None = None
+    is_eval: bool = False
 
     @property
     def test_name(self) -> str:

protest/entities/suite_path.py

Lines changed: 5 additions & 0 deletions
@@ -58,6 +58,11 @@ def lower(self) -> str:
         """Return lowercase string representation for case-insensitive comparison."""
         return self._path.lower()
 
+    @property
+    def root_name(self) -> str:
+        """Return the top-level suite name: 'A::B::C' -> 'A'."""
+        return self.parts[0] if self.parts else ""
+
     def __str__(self) -> str:
         return self._path
 
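A minimal standalone model of the `root_name` addition above, assuming (as the docstring's `'A::B::C' -> 'A'` example suggests) that `parts` splits the path on `::`. This is a sketch, not protest's `SuitePath` class.

```python
# Toy SuitePath with just enough behavior to show the new root_name property.
class SuitePath:
    def __init__(self, path: str) -> None:
        self._path = path

    @property
    def parts(self) -> tuple[str, ...]:
        # Assumed: suite paths are '::'-separated; empty path has no parts.
        return tuple(self._path.split("::")) if self._path else ()

    @property
    def root_name(self) -> str:
        """Return the top-level suite name: 'A::B::C' -> 'A'."""
        return self.parts[0] if self.parts else ""
```

The `if self.parts else ""` guard makes the empty path return an empty string instead of raising `IndexError`.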

pyproject.toml

Lines changed: 20 additions & 0 deletions
@@ -49,6 +49,9 @@ rich = [
 web = [
     "websockets>=12.0",
 ]
+evals = [
+    "pydantic-evals>=0.1",
+]
 
 
 [tool.ruff]
@@ -100,6 +103,23 @@ ignore = [
     "PLC0415", # lazy import for optional rich dependency
     "PLR0913", # many args is deliberate API design
 ]
+"protest/core/execution/test_executor.py" = [
+    "PLR0915", # _run_test is inherently complex (retry loop + eval capture)
+]
+"protest/history/**" = [
+    "PLC0415", # lazy imports
+    "S603", # subprocess git calls are safe
+    "PLR0913", # load_history has many filter params by design
+]
+"protest/cli/history.py" = [
+    "T201", # print for CLI output
+    "PLC0415", # lazy imports
+]
+"protest/evals/**" = [
+    "T201", # print for eval reporting
+    "PLC0415", # lazy imports for optional pydantic-evals dependency
+    "PLR0913", # adapter functions have many params by design
+]
 "protest/reporting/ascii.py" = [
     "T201", # print is the purpose of this module
 ]
