EvalSet

gitpavleenbali edited this page Feb 17, 2026 · 2 revisions

An EvalSet is a collection of test cases grouped together for organized evaluation.

Import

from pyai.evaluation import EvalSet

Constructor

EvalSet(
    name: str,                       # Evaluation set name
    test_cases: list[TestCase],      # List of test cases
    description: str | None = None,  # Optional description
    version: str = "1.0",            # Version identifier
    metadata: dict | None = None     # Optional additional metadata
)
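
To make the shape of these arguments concrete, here is a minimal stand-in built with dataclasses. It is only a sketch: `CaseSketch` and `EvalSetSketch` are hypothetical names, not pyai classes, and the real `EvalSet` may store or validate its fields differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CaseSketch:
    """Hypothetical stand-in for TestCase (only fields used on this page)."""
    input: str
    expected_output: Optional[str] = None
    tags: list[str] = field(default_factory=list)


@dataclass
class EvalSetSketch:
    """Hypothetical stand-in mirroring the constructor arguments above."""
    name: str
    test_cases: list[CaseSketch]
    description: Optional[str] = None
    version: str = "1.0"
    # default_factory gives every instance its own dict
    metadata: dict = field(default_factory=dict)


demo = EvalSetSketch(
    name="Math Evaluation",
    test_cases=[CaseSketch(input="2+2", expected_output="4")],
)
print(demo.version, demo.description, demo.metadata)  # 1.0 None {}
```

Only `name` and `test_cases` are required; the other arguments fall back to their defaults.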

Creating EvalSets

Basic Creation

from pyai.evaluation import EvalSet, TestCase

eval_set = EvalSet(
    name="Math Evaluation",
    test_cases=[
        TestCase(input="2+2", expected_output="4"),
        TestCase(input="5*5", expected_output="25"),
        TestCase(input="10/2", expected_output="5"),
    ],
    description="Basic arithmetic tests"
)

From YAML File

# eval_set.yaml
name: Customer Support Evaluation
description: Tests for customer support agent
version: "2.0"

test_cases:
  - input: "I need help with my order"
    criteria: [helpfulness, tone]
    tags: [orders]
    
  - input: "How do I return an item?"
    expected_output_contains: "return policy"
    criteria: [accuracy, clarity]
    tags: [returns]
    
  - input: "I'm very angry about this!"
    criteria: [empathy, de-escalation]
    tags: [complaints]

Load the file:

eval_set = EvalSet.from_yaml("eval_set.yaml")

From JSON

eval_set = EvalSet.from_json("eval_set.json")
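
This page does not show the JSON layout. The sketch below assumes it mirrors the keys of the YAML example above (`name`, `description`, `version`, `test_cases`), which is an assumption rather than documented behavior. It writes such a file with the standard `json` module and reads it back, without calling pyai.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: the same keys as the YAML example, expressed as JSON.
payload = {
    "name": "Customer Support Evaluation",
    "description": "Tests for customer support agent",
    "version": "2.0",
    "test_cases": [
        {
            "input": "I need help with my order",
            "criteria": ["helpfulness", "tone"],
            "tags": ["orders"],
        },
        {
            "input": "How do I return an item?",
            "expected_output_contains": "return policy",
            "criteria": ["accuracy", "clarity"],
            "tags": ["returns"],
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_set.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    # Read the file back to confirm it round-trips unchanged.
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded["name"], len(loaded["test_cases"]))
```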

Methods

add_test_case()

Add a single test case:

eval_set.add_test_case(
    TestCase(input="New test", expected_output="Expected")
)

filter_by_tags()

Filter test cases by tags:

# Get only math-related tests
math_tests = eval_set.filter_by_tags(["math"])

# Exclude certain tags
no_advanced = eval_set.filter_by_tags(exclude=["advanced"])
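
The include and exclude behavior can be sketched on plain dictionaries. `filter_cases` is a hypothetical helper, and it assumes a case matches when it carries at least one of the requested tags; the page does not say whether `filter_by_tags()` requires any or all of them.

```python
def filter_cases(cases, include=None, exclude=None):
    """Keep cases with at least one `include` tag (if given) and no `exclude` tag."""
    include = set(include or [])
    exclude = set(exclude or [])
    kept = []
    for case in cases:
        tags = set(case.get("tags", []))
        if include and not (tags & include):
            continue  # none of the requested tags present
        if tags & exclude:
            continue  # carries an excluded tag
        kept.append(case)
    return kept


cases = [
    {"input": "2+2", "tags": ["math"]},
    {"input": "Integrate x^2", "tags": ["math", "advanced"]},
    {"input": "Capital of France?", "tags": ["geography"]},
]

math_cases = filter_cases(cases, include=["math"])
basic_cases = filter_cases(cases, exclude=["advanced"])
print([c["input"] for c in math_cases])   # ['2+2', 'Integrate x^2']
print([c["input"] for c in basic_cases])  # ['2+2', 'Capital of France?']
```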

split()

Split into training/validation sets:

train_set, val_set = eval_set.split(ratio=0.8)
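
A ratio-based split can be sketched as shuffle, then cut. `split_cases` and its `seed` parameter are hypothetical; whether `split()` shuffles before cutting, or accepts a seed, is not stated on this page.

```python
import random


def split_cases(cases, ratio=0.8, seed=0):
    """Shuffle a copy of `cases`, then cut it at `ratio` of its length."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


cases = [f"case-{i}" for i in range(10)]
train, val = split_cases(cases, ratio=0.8)
print(len(train), len(val))  # 8 2
```

Fixing the seed keeps the two subsets stable across runs, which helps when results need to be compared over time.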

sample()

Random sampling:

# Get 10 random test cases
sample = eval_set.sample(n=10)
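
Sampling without replacement can be sketched with the standard `random` module. `sample_cases` is a hypothetical helper; capping `n` at the number of available cases is a choice made for this sketch, and `sample()` itself may behave differently when `n` exceeds the set size.

```python
import random


def sample_cases(cases, n, seed=None):
    """Return up to `n` distinct cases chosen at random."""
    rng = random.Random(seed)
    pool = list(cases)
    # random.sample raises ValueError if n > len(pool), so cap it first.
    return rng.sample(pool, min(n, len(pool)))


cases = [f"case-{i}" for i in range(5)]
picked = sample_cases(cases, n=3, seed=42)
capped = sample_cases(cases, n=10, seed=42)
print(len(picked), len(capped))  # 3 5
```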

Serialization

Save to File

eval_set.to_yaml("output.yaml")
eval_set.to_json("output.json")

Export for Sharing

# Export with all metadata
eval_set.export("benchmark_v1.zip")

Properties

| Property | Type | Description |
|----------|------|-------------|
| name | str | Set name |
| test_cases | list | All test cases |
| size | int | Number of test cases |
| tags | set | All unique tags |
| version | str | Version string |
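
`size` and `tags` read as values derived from `test_cases`. The sketch below computes the equivalent values from plain dictionaries; it assumes `tags` is the union of every test case's tags, which matches the description "All unique tags" but is not confirmed by the source.

```python
cases = [
    {"input": "I need help with my order", "tags": ["orders"]},
    {"input": "How do I return an item?", "tags": ["returns"]},
    {"input": "I'm very angry about this!", "tags": ["complaints", "orders"]},
]

size = len(cases)
# Union of all per-case tag lists; duplicates collapse in the set.
tags = set().union(*(case["tags"] for case in cases))

print(size, sorted(tags))  # 3 ['complaints', 'orders', 'returns']
```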

Built-in Benchmarks

from pyai.evaluation.benchmarks import (
    MMLU,
    HellaSwag,
    TruthfulQA,
    HumanEval
)

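# Load one subset of a standard benchmark
mmlu = MMLU.load(subset="computer_science")
# `evaluator` and `my_agent` are not defined on this page; create them before this call
results = evaluator.evaluate(mmlu, agent=my_agent)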
