Releases: parameterlab/MASEval

v0.2.0

05 Dec 16:29
c25fcdc

[0.2.0] - 2025-12-05

Added

Exceptions and Error Classification

  • Added AgentError, EnvironmentError, UserError exception hierarchy in maseval.core.exceptions for classifying execution failures by responsibility (PR: #13)
  • Added TaskExecutionStatus.AGENT_ERROR, ENVIRONMENT_ERROR, USER_ERROR, UNKNOWN_EXECUTION_ERROR for fine-grained error classification enabling fair scoring (PR: #13)
  • Added validation helpers: validate_argument_type(), validate_required_arguments(), validate_no_extra_arguments(), validate_arguments_from_schema() for tool implementers (PR: #13)
  • Added ToolSimulatorError and UserSimulatorError exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
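
To illustrate how a responsibility-based hierarchy can drive status classification, here is a minimal, self-contained sketch. It does not import maseval: the classes below are local stand-ins that only reuse the names from the notes above, and classify() and check_required() are hypothetical helpers whose signatures are assumptions, not the library's API.

```python
from enum import Enum


class TaskExecutionStatus(Enum):
    # Member names come from the release notes; the values are placeholders.
    AGENT_ERROR = "agent_error"
    ENVIRONMENT_ERROR = "environment_error"
    USER_ERROR = "user_error"
    UNKNOWN_EXECUTION_ERROR = "unknown_execution_error"


class AgentError(Exception):
    """Failure attributed to the agent, e.g. a malformed tool call."""


class EnvironmentError(Exception):  # local stand-in; shadows the builtin alias in this sketch only
    """Failure attributed to a tool or the infrastructure."""


class UserError(Exception):
    """Failure attributed to the user simulator."""


class ToolSimulatorError(EnvironmentError):
    """Simulator-specific subclass that inherits the environment classification."""


class UserSimulatorError(UserError):
    """Simulator-specific subclass that inherits the user classification."""


def classify(exc: BaseException) -> TaskExecutionStatus:
    """Map an exception to a status by its place in the hierarchy (hypothetical helper)."""
    if isinstance(exc, AgentError):
        return TaskExecutionStatus.AGENT_ERROR
    if isinstance(exc, EnvironmentError):
        return TaskExecutionStatus.ENVIRONMENT_ERROR
    if isinstance(exc, UserError):
        return TaskExecutionStatus.USER_ERROR
    return TaskExecutionStatus.UNKNOWN_EXECUTION_ERROR


def check_required(arguments: dict, required: list) -> None:
    """Hypothetical validation helper: a missing argument counts as the agent's fault."""
    missing = [name for name in required if name not in arguments]
    if missing:
        raise AgentError(f"missing required arguments: {missing}")
```

Because the simulator errors subclass the responsibility classes, a single isinstance() chain is enough to classify them, and anything outside the hierarchy falls through to UNKNOWN_EXECUTION_ERROR.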

Documentation

  • Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)

Benchmarks

  • MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)

Benchmark

  • Added execution_loop() method to Benchmark base class enabling iterative agent-user interaction (PR: #13)
  • Added max_invocations constructor parameter to Benchmark (default: 1 for backwards compatibility) (PR: #13)
  • Added abstract get_model_adapter(model_id, **kwargs) method to Benchmark base class as a universal model factory for use across all benchmarks (PR: #13)

User

  • Added max_turns and stop_token parameters to User base class for multi-turn support with early stopping. The same applies to UserLLMSimulator. (PR: #13)
  • Added is_done(), _check_stop_token(), and increment_turn() methods to User base class (PR: #13)
  • Added get_initial_query() method to User base class for LLM-generated initial messages (PR: #13)
  • Added initial_query parameter in User base class to trigger the agentic system. (PR: #13)
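
The multi-turn behaviour described above can be pictured with a small stand-in. This is a sketch under assumptions, not the library's implementation: SketchUser, its scripted replies, respond(), and run_dialogue() are hypothetical, and only the names max_turns, stop_token, initial_query, is_done(), _check_stop_token(), increment_turn(), and get_initial_query() come from the notes.

```python
class SketchUser:
    """Scripted stand-in for a user that stops on a turn limit or a stop token."""

    def __init__(self, initial_query, replies, max_turns=1, stop_token=None):
        self.initial_query = initial_query
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns = 0
        self._stopped = False
        self._replies = list(replies)  # hypothetical: canned replies instead of an LLM

    def get_initial_query(self):
        return self.initial_query

    def _check_stop_token(self, message):
        return self.stop_token is not None and self.stop_token in message

    def increment_turn(self):
        self.turns += 1

    def is_done(self):
        return self._stopped or self.turns >= self.max_turns

    def respond(self, agent_answer):
        # Hypothetical method: produce the next user message and detect early stopping.
        message = self._replies.pop(0) if self._replies else ""
        if self._check_stop_token(message):
            self._stopped = True
        return message


def run_dialogue(user, agent):
    """Alternate between agent and user until the user reports it is done."""
    transcript = []
    query = user.get_initial_query()
    while not user.is_done():
        answer = agent(query)
        transcript.append((query, answer))
        user.increment_turn()
        query = user.respond(answer)
    return transcript


def echo_agent(query):
    return f"ack: {query}"


user = SketchUser("Book a flight", ["To Paris", "Thanks </stop>"], max_turns=5, stop_token="</stop>")
transcript = run_dialogue(user, echo_agent)
```

In this run the stop token ends the dialogue after two agent invocations, well before max_turns is reached.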

Environment

  • Added Environment.get_tool(name) method for single-tool lookup (PR: #13)

Interface

  • LlamaIndex integration: LlamaIndexAgentAdapter and LlamaIndexUser for evaluating LlamaIndex workflow-based agents (PR: #7)
  • The logs property of SmolAgentAdapter and LanggraphAgentAdapter is now properly populated. (PR: #3)

Examples

  • Added a new example: The 5_a_day_benchmark (PR: #10)

Changed

Exception Handling

  • Benchmark now classifies execution errors into AGENT_ERROR (agent's fault), ENVIRONMENT_ERROR (tool/infra failure), USER_ERROR (user simulator failure), or UNKNOWN_EXECUTION_ERROR (unclassified) instead of generic TASK_EXECUTION_FAILED (PR: #13)
  • ToolLLMSimulator now raises ToolSimulatorError (classified as ENVIRONMENT_ERROR) on failure (PR: #13)
  • UserLLMSimulator now raises UserSimulatorError (classified as USER_ERROR) on failure (PR: #13)

Environment

  • Environment.create_tools() now returns Dict[str, Any] instead of list (PR: #13)
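
A name-keyed dictionary makes single-tool lookup straightforward. The sketch below is a stand-in, not the maseval Environment class: SketchEnvironment and its two toy tools are hypothetical, and returning None for an unknown name is an assumption (the real get_tool() may behave differently).

```python
from typing import Any, Dict, Optional


class SketchEnvironment:
    """Stand-in environment whose tools are stored by name."""

    def __init__(self) -> None:
        self.tools: Dict[str, Any] = self.create_tools()

    def create_tools(self) -> Dict[str, Any]:
        # Returning a dict (rather than a list) lets callers address tools by name.
        return {
            "add": lambda a, b: a + b,
            "echo": lambda text: text,
        }

    def get_tool(self, name: str) -> Optional[Any]:
        # Assumption: unknown names yield None instead of raising.
        return self.tools.get(name)


env = SketchEnvironment()
add = env.get_tool("add")
```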

Benchmark

  • Benchmark.run_agents() signature changed: added query: str parameter (PR: #13)
  • Benchmark.run() now uses execution_loop() internally to handle agent-user interaction cycles (PR: #13)
  • Benchmark class now has a fail_on_setup_error flag that raises errors encountered during task setup (PR: #10)
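
One way to picture a fail_on_setup_error style flag is a runner that either re-raises setup exceptions or records them and moves on. This is a sketch only: run_tasks(), the setup callable, and the "SETUP_FAILED" and "OK" labels are hypothetical, and the behaviour when the flag is off is an assumption rather than something the notes state.

```python
def run_tasks(tasks, setup, fail_on_setup_error=False):
    """Set up each task; re-raise setup errors only when the flag is on."""
    results = []
    for task in tasks:
        try:
            environment = setup(task)
        except Exception as exc:
            if fail_on_setup_error:
                raise  # surface the problem immediately, e.g. while debugging
            # Assumption: with the flag off, the failure is recorded and the run continues.
            results.append({"task": task, "status": "SETUP_FAILED", "error": str(exc)})
            continue
        results.append({"task": task, "status": "OK", "environment": environment})
    return results


def flaky_setup(task):
    # Toy setup function that fails for one specific task.
    if task == "broken":
        raise RuntimeError("could not load task data")
    return {"name": task}


results = run_tasks(["a", "broken", "b"], flaky_setup)
```

Failing fast is convenient during development, while the lenient mode keeps a long benchmark run alive when a single task is misconfigured.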

Callback

  • FileResultLogger now accepts pathlib.Path for the output_dir argument and has an overwrite argument to prevent overwriting existing log files.
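
An overwrite guard of this kind can be sketched with pathlib alone. The function below is not FileResultLogger: write_result() and its parameters other than output_dir and overwrite are hypothetical, and raising FileExistsError when a file is already present is an assumption about how such a guard might behave.

```python
import json
import tempfile
from pathlib import Path


def write_result(output_dir, name, payload, overwrite=False):
    """Write payload as JSON, refusing to clobber an existing file unless asked to."""
    output_dir = Path(output_dir)  # accepts both str and pathlib.Path
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / name
    if target.exists() and not overwrite:
        # Assumption: an existing log is treated as an error rather than silently replaced.
        raise FileExistsError(f"{target} already exists; pass overwrite=True to replace it")
    target.write_text(json.dumps(payload))
    return target


with tempfile.TemporaryDirectory() as tmp:
    first = write_result(Path(tmp), "run.json", {"score": 1})
    try:
        write_result(tmp, "run.json", {"score": 2})
        blocked = False
    except FileExistsError:
        blocked = True
    write_result(tmp, "run.json", {"score": 3}, overwrite=True)
    final_payload = json.loads(first.read_text())
```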

Evaluator

  • The Evaluator class now has a filter_traces base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).

Simulator

  • The LLMSimulator now raises an exception when JSON cannot be decoded instead of returning the error message as text to the agent (PR: #13).
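
The difference between returning an error string and raising can be shown with the standard json module. This is a minimal sketch, not the LLMSimulator code: SimulatorDecodeError and parse_simulator_output() are hypothetical names introduced for illustration.

```python
import json
from typing import Any


class SimulatorDecodeError(Exception):
    """Hypothetical error raised when simulator output is not valid JSON."""


def parse_simulator_output(raw: str) -> Any:
    """Decode simulator output, raising instead of handing error text to the agent."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Chaining with "from exc" keeps the original decode position in the traceback.
        raise SimulatorDecodeError(f"simulator returned invalid JSON: {exc.msg}") from exc


parsed = parse_simulator_output('{"temperature": 21}')
```

Raising keeps a simulator failure out of the agent's context, so the agent is not scored on a response it never should have seen.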

Other

  • Documentation formatting improved. Added dark mode and links to GitHub (PR: #11).
  • Improved Quick Start Guide in docs/getting-started/quickstart.md. (PR: #10)
  • maseval.interface.agents structure changed. Tools requiring framework imports (beyond just typing) now live in <framework>_optional.py and are imported dynamically from <framework>.py. (PR: #12)
  • Various formatting improvements in the documentation (PR: #12)
  • Added documentation for View Source Code pattern in CONTRIBUTING.md and _optional.py pattern in interface README (PR: #12)
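
The general idea behind deferring framework imports can be sketched with importlib. This is an illustration of the pattern only, not the code in maseval.interface.agents: load_optional() and the module names passed to it are hypothetical.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str) -> ModuleType:
    """Import a module on demand and fail with a readable hint if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this adapter but is not installed"
        ) from exc


# A module that is always available stands in for an installed framework here.
json_module = load_optional("json")
```

With this shape, importing the lightweight module never fails; the framework dependency is only touched when a caller actually asks for the framework-specific pieces.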

Fixed

Interface

  • LlamaIndexAgentAdapter now supports multiple LlamaIndex agent types including ReActAgent (workflow-based), FunctionAgent, and legacy agents by checking for .chat(), .query(), and .run() methods in priority order (PR: #10)
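
Checking for methods in priority order is a form of duck typing and can be sketched without LlamaIndex installed. The classes and pick_entry_point() below are hypothetical stand-ins; the real adapter may also handle details such as asynchronous workflow execution, which this sketch ignores.

```python
def pick_entry_point(agent):
    """Return (name, bound method) for the first supported method the agent exposes."""
    for name in ("chat", "query", "run"):  # priority order from the note above
        method = getattr(agent, name, None)
        if callable(method):
            return name, method
    raise TypeError(f"{type(agent).__name__} has no chat(), query(), or run() method")


class ChatAndRunAgent:
    # Stand-in exposing two entry points; chat() should win.
    def chat(self, message):
        return f"chat:{message}"

    def run(self, message):
        return f"run:{message}"


class RunOnlyAgent:
    # Stand-in for an agent that only exposes run().
    def run(self, message):
        return f"run:{message}"


name, method = pick_entry_point(ChatAndRunAgent())
reply = method("hello")
```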

Other

  • Consistently use "adapter" rather than "wrapper" when naming agent adapters (PR: #3)
  • Fixed an issue where the LiteLLM interface and Mixins were not shown properly in the documentation (PR: #12)

Removed

  • Removed set_message_history, append_message_history and clear_message_history for AgentAdapter and subclasses. (PR: #3)

v0.1.2

18 Nov 18:03
982fca7

Full Changelog: v0.1.1...v0.1.2

Initial Release

18 Nov 15:47
a6294ff

This is the initial code release. Library under active development. API might change anytime.

v0.1.0-alpha

17 Nov 17:25
2728779

fixed email