
feat(notebook): add agent evaluation framework with outcome & trajectory metrics#254

Open
abdelhadi703 wants to merge 1 commit into mistralai:main from abdelhadi703:feat/agent-evaluation-framework

Conversation

@abdelhadi703

Summary

This notebook introduces a comprehensive agent evaluation framework that measures both outcome quality and process quality:

Outcome-Level Metrics

  • Task completion: Did the agent finish the task?
  • Answer correctness: Is the output factually correct?
  • Format compliance: Does it match the required format?

Trajectory-Level Metrics

  • Tool selection: Were the right tools chosen?
  • Reasoning quality: Was the logic sound and efficient?
  • Error recovery: Were errors handled gracefully?

Implementation

  • 6 Pydantic models with Score(Enum) for structured evaluation
  • 3 realistic test scenarios (medical extraction, multi-step reasoning, code generation)
  • Simulated agent traces via Mistral Large (self-contained, no external deps)
  • LLM-as-Judge with temperature=0 for reproducible scoring
  • Visualization: bar charts + stacked comparison of outcome vs trajectory
  • Failure analysis: automatic detection of low-scoring metrics
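The six Pydantic models themselves aren't shown in this description; a minimal sketch of two of them, assuming a four-level `Score(str, Enum)` rubric and these field names (both are illustrative, not the notebook's exact schema):

```python
from enum import Enum
from pydantic import BaseModel


class Score(str, Enum):
    """Discrete rubric levels the LLM judge must pick from (names assumed)."""
    POOR = "poor"
    FAIR = "fair"
    GOOD = "good"
    EXCELLENT = "excellent"


class OutcomeEvaluation(BaseModel):
    """Outcome-level judgment of the final answer."""
    task_completion: Score
    answer_correctness: Score
    format_compliance: Score
    rationale: str


class TrajectoryEvaluation(BaseModel):
    """Trajectory-level judgment of the process that produced the answer."""
    tool_selection: Score
    reasoning_quality: Score
    error_recovery: Score
    rationale: str
```

Using string-valued enums keeps the judge's structured output constrained to the rubric while remaining trivially serializable for the pandas report stage.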

Key insight

An agent can produce the right answer via a flawed process, or follow perfect reasoning but fail on formatting. Both dimensions are essential for production evaluation.
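The automatic failure analysis mentioned above can be sketched as a threshold check over both dimensions; the numeric mapping, threshold, and helper name here are assumptions, not the notebook's implementation:

```python
# Map rubric levels to numbers so metrics can be compared to a threshold.
SCORE_VALUES = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}


def flag_failures(scores: dict[str, str], threshold: int = 2) -> list[str]:
    """Return the metric names whose score is at or below the threshold."""
    return [metric for metric, s in scores.items() if SCORE_VALUES[s] <= threshold]


# An agent with a correct outcome but a weak trajectory:
scores = {
    "task_completion": "excellent",
    "answer_correctness": "excellent",
    "format_compliance": "good",
    "tool_selection": "poor",
    "reasoning_quality": "fair",
    "error_recovery": "good",
}
flag_failures(scores)  # ['tool_selection', 'reasoning_quality']
```

This is exactly the "right answer, wrong process" case: every outcome metric passes while two trajectory metrics are flagged.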

Stack

  • mistralai — Structured outputs, agent simulation, LLM judge
  • pydantic — Score models with Enum types
  • pandas + numpy — Report generation
  • matplotlib — Visualization
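A hedged sketch of how the pandas report stage might aggregate per-scenario judge scores into the outcome-vs-trajectory comparison that the stacked chart visualizes; the column names, scenario names, and 1–4 scale are invented for illustration:

```python
import pandas as pd

# Per-scenario judge scores on a 1-4 scale (values are illustrative).
rows = [
    {"scenario": "medical_extraction", "dimension": "outcome", "metric": "task_completion", "score": 4},
    {"scenario": "medical_extraction", "dimension": "trajectory", "metric": "tool_selection", "score": 2},
    {"scenario": "code_generation", "dimension": "outcome", "metric": "task_completion", "score": 3},
    {"scenario": "code_generation", "dimension": "trajectory", "metric": "tool_selection", "score": 4},
]
df = pd.DataFrame(rows)

# Mean score per dimension: the summary the bar/stacked charts plot.
report = df.groupby("dimension")["score"].mean()
print(report)
```

Keeping one row per (scenario, metric) judgment makes it easy to pivot the same frame by scenario, dimension, or metric without reshaping the raw judge outputs.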

feat(notebook): add agent evaluation framework with outcome & trajectory metrics

Comprehensive framework for evaluating LLM agents on two dimensions:
outcome-level (task completion, correctness, format) and trajectory-level
(tool selection, reasoning quality, error recovery). Uses Mistral structured
outputs and LLM-as-Judge pattern.
@abdelhadi703
Author

Hi @mistralai/team,

Following up on this agent evaluation framework: it provides metrics for evaluating agent performance along both the outcome and trajectory dimensions.

Happy to iterate based on feedback. Thanks!
