feat(notebook): add agent evaluation framework with outcome & trajectory metrics by abdelhadi703 · Pull Request #254 · mistralai/cookbook

abdelhadi703 · 2026-03-03T11:16:45Z

Summary

This notebook introduces a comprehensive agent evaluation framework measuring both outcome and process quality:

Outcome-Level Metrics

Task completion: Did the agent finish the task?
Answer correctness: Is the output factually correct?
Format compliance: Does it match the required format?

Trajectory-Level Metrics

Tool selection: Were the right tools chosen?
Reasoning quality: Was the logic sound and efficient?
Error recovery: Were errors handled gracefully?
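The six metrics above could be represented roughly as follows. This is a minimal sketch, not the notebook's code: it uses stdlib dataclasses as a stand-in for the Pydantic models the PR describes, and the 1-5 `Score` scale, the class names, and the `average` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from enum import IntEnum


class Score(IntEnum):
    """Hypothetical 1-5 scale; the notebook's actual Score values may differ."""
    POOR = 1
    WEAK = 2
    FAIR = 3
    GOOD = 4
    EXCELLENT = 5


@dataclass(frozen=True)
class OutcomeScores:
    """Outcome-level metrics: what the agent produced."""
    task_completion: Score
    answer_correctness: Score
    format_compliance: Score


@dataclass(frozen=True)
class TrajectoryScores:
    """Trajectory-level metrics: how the agent got there."""
    tool_selection: Score
    reasoning_quality: Score
    error_recovery: Score


def average(scores) -> float:
    """Arithmetic mean of every Score field on a dataclass instance."""
    values = [int(getattr(scores, f.name)) for f in fields(scores)]
    return sum(values) / len(values)


# Illustrative values only, not results from the notebook.
outcome = OutcomeScores(Score.EXCELLENT, Score.GOOD, Score.FAIR)
trajectory = TrajectoryScores(Score.GOOD, Score.GOOD, Score.WEAK)
print(average(outcome), average(trajectory))
```

Keeping the two groups in separate types makes it harder to average them together by accident, which matters if the two dimensions are meant to be reported side by side.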

Implementation

6 Pydantic models with Score(Enum) for structured evaluation
3 realistic test scenarios (medical extraction, multi-step reasoning, code generation)
Simulated agent traces via Mistral Large (self-contained, no external deps)
LLM-as-Judge with temperature=0 for reproducible scoring
Visualization: bar charts + stacked comparison of outcome vs trajectory
Failure analysis: automatic detection of low-scoring metrics
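The judge and failure-analysis steps listed above might look something like the sketch below. It assumes the judge model returns a JSON object with one integer score per metric on a 1-5 scale; the metric key names, the `parse_judge_reply` and `low_scoring` helpers, and the threshold of 3 are hypothetical and not taken from the notebook. The call to the model itself is left out so the example runs offline.

```python
import json

# Hypothetical key names, mirroring the six metrics in the PR description.
METRICS = (
    "task_completion", "answer_correctness", "format_compliance",
    "tool_selection", "reasoning_quality", "error_recovery",
)


def parse_judge_reply(raw: str) -> dict[str, int]:
    """Validate a judge reply: a JSON object with an int in 1..5 per metric."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for name in METRICS:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid or missing score for {name!r}: {value!r}")
        scores[name] = value
    return scores


def low_scoring(scores: dict[str, int], threshold: int = 3) -> list[str]:
    """Return the metrics scoring strictly below threshold, worst first."""
    flagged = [name for name, value in scores.items() if value < threshold]
    return sorted(flagged, key=lambda name: scores[name])


# A made-up judge reply, standing in for a real model response.
reply = json.dumps({
    "task_completion": 5, "answer_correctness": 4, "format_compliance": 2,
    "tool_selection": 4, "reasoning_quality": 3, "error_recovery": 1,
})
print(low_scoring(parse_judge_reply(reply)))
```

Validating the reply before analysis means a malformed judge response fails loudly instead of silently skewing the report.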

Key insight

An agent can produce the right answer via a wrong process or follow perfect reasoning but fail on formatting. Both dimensions are essential for production evaluation.
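To make the point concrete, here is a small sketch that labels a run by which dimension fell short. The `diagnose` function, its labels, and the 3.5 passing bar are illustrative assumptions on a 1-5 scale, not part of the notebook.

```python
def diagnose(outcome_avg: float, trajectory_avg: float, passing: float = 3.5) -> str:
    """Label a run from its two average scores (hypothetical 1-5 scale)."""
    outcome_ok = outcome_avg >= passing
    trajectory_ok = trajectory_avg >= passing
    if outcome_ok and trajectory_ok:
        return "pass"
    if outcome_ok:
        return "right answer, flawed process"
    if trajectory_ok:
        return "sound process, failed outcome"
    return "fail"


# Made-up averages for two runs that a single blended score would hide.
print(diagnose(outcome_avg=4.7, trajectory_avg=2.3))
print(diagnose(outcome_avg=2.7, trajectory_avg=4.7))
```

Both runs have similar overall means, yet they fail for opposite reasons, which is why the two dimensions are reported separately.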

Stack

mistralai — Structured outputs, agent simulation, LLM judge
pydantic — Score models with Enum types
pandas + numpy — Report generation
matplotlib — Visualization

…ory metrics

Comprehensive framework for evaluating LLM agents on two dimensions: outcome-level (task completion, correctness, format) and trajectory-level (tool selection, reasoning quality, error recovery). Uses Mistral structured outputs and LLM-as-Judge pattern.

abdelhadi703 · 2026-03-30T01:15:42Z

Hi @mistralai/team,

Following up on this agent evaluation framework. It provides metrics for evaluating agent performance (outcome & trajectory based).

Happy to iterate based on feedback. Thanks!

abdelhadi703 wants to merge 1 commit into mistralai:main from abdelhadi703:feat/agent-evaluation-framework
