Releases · parameterlab/MASEval · GitHub

05 Dec 16:29

v0.2.0 Latest


[0.2.0] - 2025-12-05

Added

Exceptions and Error Classification

Added AgentError, EnvironmentError, UserError exception hierarchy in maseval.core.exceptions for classifying execution failures by responsibility (PR: #13)
Added TaskExecutionStatus.AGENT_ERROR, ENVIRONMENT_ERROR, USER_ERROR, UNKNOWN_EXECUTION_ERROR for fine-grained error classification enabling fair scoring (PR: #13)
Added validation helpers: validate_argument_type(), validate_required_arguments(), validate_no_extra_arguments(), validate_arguments_from_schema() for tool implementers (PR: #13)
Added ToolSimulatorError and UserSimulatorError exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
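
The hierarchy described above can be pictured with a minimal, self-contained sketch. The class names mirror the ones listed in this release, but the bodies, the enum values, and the `classify()` helper are assumptions made for illustration, not the code in maseval.core.exceptions.

```python
from enum import Enum


class AgentError(Exception):
    """Failure attributed to the agent, e.g. a malformed tool call."""


# Note: inside this sketch the name shadows Python's built-in
# EnvironmentError (an alias of OSError).
class EnvironmentError(Exception):
    """Failure attributed to a tool or to infrastructure."""


class UserError(Exception):
    """Failure attributed to the user simulator."""


class ToolSimulatorError(EnvironmentError):
    """Simulator-specific subclass; still counts as an environment failure."""


class UserSimulatorError(UserError):
    """Simulator-specific subclass; still counts as a user failure."""


class TaskExecutionStatus(Enum):
    # Member names come from the release notes; the values are hypothetical.
    AGENT_ERROR = "agent_error"
    ENVIRONMENT_ERROR = "environment_error"
    USER_ERROR = "user_error"
    UNKNOWN_EXECUTION_ERROR = "unknown_execution_error"


def classify(exc: BaseException) -> TaskExecutionStatus:
    """Hypothetical helper: map an exception to a status by responsibility."""
    if isinstance(exc, AgentError):
        return TaskExecutionStatus.AGENT_ERROR
    if isinstance(exc, EnvironmentError):
        return TaskExecutionStatus.ENVIRONMENT_ERROR
    if isinstance(exc, UserError):
        return TaskExecutionStatus.USER_ERROR
    return TaskExecutionStatus.UNKNOWN_EXECUTION_ERROR
```

Because the simulator errors subclass the base categories, `isinstance` checks on the base classes are enough to classify them; anything outside the hierarchy falls through to the unknown status.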

Documentation

Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)

Benchmarks

MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)

Benchmark

Added execution_loop() method to Benchmark base class enabling iterative agent-user interaction (PR: #13)
Added max_invocations constructor parameter to Benchmark (default: 1 for backwards compatibility) (PR: #13)
Added abstract get_model_adapter(model_id, **kwargs) method to Benchmark base class as a universal model factory for use throughout the benchmarks (PR: #13)
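
A rough sketch of how an iterative agent-user loop bounded by max_invocations might look. This is a standalone function written for illustration only; the callables `run_agents` and `respond`, the `None` convention for "user is finished", and the return type are assumptions and do not reflect the actual signature of Benchmark.execution_loop().

```python
from typing import Callable, List, Optional


def execution_loop(
    run_agents: Callable[[str], str],
    respond: Callable[[str], Optional[str]],
    initial_query: str,
    max_invocations: int = 1,
) -> List[str]:
    """Alternate between agent and user until the user stops or the cap is hit."""
    answers: List[str] = []
    query = initial_query
    for _ in range(max_invocations):
        answer = run_agents(query)
        answers.append(answer)
        next_query = respond(answer)  # None means the user has nothing more to say
        if next_query is None:
            break
        query = next_query
    return answers


# Toy stand-ins for an agent system and a scripted user.
def echo_agent(query: str) -> str:
    return f"answer to: {query}"


scripted_replies = iter(["follow-up", None])


def scripted_user(answer: str) -> Optional[str]:
    return next(scripted_replies)


result = execution_loop(echo_agent, scripted_user, "hello", max_invocations=3)
print(result)  # ['answer to: hello', 'answer to: follow-up']
```

With the default of 1, the loop body runs exactly once, which matches the single-shot behaviour the default is meant to preserve.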

User

Added max_turns and stop_token parameters to User base class for multi-turn support with early stopping. The same applies to UserLLMSimulator. (PR: #13)
Added is_done(), _check_stop_token(), and increment_turn() methods to User base class (PR: #13)
Added get_initial_query() method to User base class for LLM-generated initial messages (PR: #13)
Added initial_query parameter in User base class to trigger the agentic system. (PR: #13)
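
To illustrate how max_turns and stop_token could interact, here is a minimal sketch of a user object. The method names is_done(), _check_stop_token(), and increment_turn() come from the notes above; the class `SketchUser`, the `record_reply()` helper, the default values, and the substring-based token check are all assumptions.

```python
from typing import Optional


class SketchUser:
    """Hypothetical stand-in for a multi-turn user with early stopping."""

    def __init__(self, max_turns: int = 1, stop_token: Optional[str] = None):
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns = 0
        self.stopped = False

    def _check_stop_token(self, message: str) -> bool:
        # Assumed rule: the user is finished once the token appears in a reply.
        return self.stop_token is not None and self.stop_token in message

    def increment_turn(self) -> None:
        self.turns += 1

    def is_done(self) -> bool:
        return self.stopped or self.turns >= self.max_turns

    def record_reply(self, message: str) -> None:
        """Hypothetical helper: count the turn and note an early stop."""
        self.increment_turn()
        if self._check_stop_token(message):
            self.stopped = True


user = SketchUser(max_turns=3, stop_token="</stop>")
user.record_reply("Can you also book a hotel?")
print(user.is_done())  # False
user.record_reply("Thanks, that is all. </stop>")
print(user.is_done())  # True
```

The turn cap acts as a safety net for conversations in which the stop token never appears.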

Environment

Added Environment.get_tool(name) method for single-tool lookup (PR: #13)

Interface

LlamaIndex integration: LlamaIndexAgentAdapter and LlamaIndexUser for evaluating LlamaIndex workflow-based agents (PR: #7)
The logs property of SmolAgentAdapter and LanggraphAgentAdapter is now properly populated. (PR: #3)

Examples

Added a new example: The 5_a_day_benchmark (PR: #10)

Changed

Exception Handling

Benchmark now classifies execution errors into AGENT_ERROR (agent's fault), ENVIRONMENT_ERROR (tool/infra failure), USER_ERROR (user simulator failure), or UNKNOWN_EXECUTION_ERROR (unclassified) instead of generic TASK_EXECUTION_FAILED (PR: #13)
ToolLLMSimulator now raises ToolSimulatorError (classified as ENVIRONMENT_ERROR) on failure (PR: #13)
UserLLMSimulator now raises UserSimulatorError (classified as USER_ERROR) on failure (PR: #13)
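
One way such a classification could feed into fair scoring and reruns is sketched below. The status names for failures come from this release; the `"SUCCESS"` label, the choice to exclude unknown errors from the score, and the `fair_summary()` function are assumptions for illustration and are not taken from the library.

```python
from typing import Dict, List, Tuple

# Assumption: failures in these categories are not held against the agent.
NOT_AGENT_FAULT = {"ENVIRONMENT_ERROR", "USER_ERROR", "UNKNOWN_EXECUTION_ERROR"}


def fair_summary(statuses: Dict[str, str]) -> Tuple[float, List[str]]:
    """Return (success rate over scorable tasks, task ids worth rerunning)."""
    scored = {task: s for task, s in statuses.items() if s not in NOT_AGENT_FAULT}
    rerun = sorted(task for task, s in statuses.items() if s in NOT_AGENT_FAULT)
    successes = sum(1 for s in scored.values() if s == "SUCCESS")
    rate = successes / len(scored) if scored else 0.0
    return rate, rerun


statuses = {
    "task-1": "SUCCESS",
    "task-2": "AGENT_ERROR",
    "task-3": "ENVIRONMENT_ERROR",
    "task-4": "SUCCESS",
}
rate, rerun = fair_summary(statuses)
print(rerun)  # ['task-3']
```

In this sketch, task-3 is dropped from the denominator and queued for a rerun, so the agent is scored on three tasks rather than four.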

Environment

Environment.create_tools() now returns Dict[str, Any] instead of list (PR: #13)
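
The practical effect of the dict return type is name-based lookup. The sketch below uses a hypothetical `SketchEnvironment` with toy tools; the `tools` attribute and the choice to return `None` for an unknown name are assumptions, since the notes do not state how Environment.get_tool(name) handles a missing tool.

```python
from typing import Any, Dict, Optional


class SketchEnvironment:
    """Hypothetical environment whose create_tools() returns a name-to-tool dict."""

    def __init__(self) -> None:
        self.tools: Dict[str, Any] = self.create_tools()

    def create_tools(self) -> Dict[str, Any]:
        return {
            "add": lambda a, b: a + b,
            "echo": lambda text: text,
        }

    def get_tool(self, name: str) -> Optional[Any]:
        # Assumption: unknown names yield None rather than raising.
        return self.tools.get(name)


env = SketchEnvironment()
print(env.get_tool("add")(2, 3))  # 5

# Code written against the old list return type can still iterate the values.
tool_list = list(env.tools.values())
print(len(tool_list))  # 2
```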

Benchmark

Benchmark.run_agents() signature changed: added query: str parameter (PR: #13)
Benchmark.run() now uses execution_loop() internally to handle agent-user interaction cycles (PR: #13)
Benchmark class now has a fail_on_setup_error flag that raises errors encountered during task setup (PR: #10)

Callback

FileResultLogger now accepts pathlib.Path for the output_dir argument and has an overwrite argument to prevent overwriting existing log files.
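
A guard of this kind could be sketched as follows. The function `write_results()`, its parameters other than output_dir and overwrite, and the decision to raise `FileExistsError` are assumptions; the notes do not say what FileResultLogger does when a file exists and overwrite is off.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Union


def write_results(
    output_dir: Union[str, Path],
    filename: str,
    payload: Dict[str, Any],
    overwrite: bool = False,
) -> Path:
    """Write payload as JSON, refusing to clobber an existing file by default."""
    out = Path(output_dir)  # accepts both str and pathlib.Path
    out.mkdir(parents=True, exist_ok=True)
    target = out / filename
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} exists; pass overwrite=True to replace it")
    target.write_text(json.dumps(payload))
    return target


with tempfile.TemporaryDirectory() as tmp:
    first = write_results(Path(tmp) / "logs", "run.json", {"score": 1})
    try:
        write_results(Path(tmp) / "logs", "run.json", {"score": 2})
    except FileExistsError:
        print("refused to overwrite")
    write_results(Path(tmp) / "logs", "run.json", {"score": 2}, overwrite=True)
    final_payload = json.loads(first.read_text())
```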

Evaluator

The Evaluator class now has a filter_traces base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).

Simulator

The LLMSimulator now raises an exception when JSON cannot be decoded instead of returning the error message as text to the agent (PR: #13).
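
The behavioural difference can be shown with a small sketch built on the standard json module. The exception class `SimulatorDecodeError` and the function `parse_simulator_output()` are hypothetical names, not the library's.

```python
import json
from typing import Any


class SimulatorDecodeError(Exception):
    """Hypothetical error raised when simulator output is not valid JSON."""


def parse_simulator_output(raw: str) -> Any:
    """Decode simulator output, raising instead of passing error text along."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Raising keeps the decoding problem out of the agent's input, so it
        # can be classified as a simulator failure rather than an agent one.
        raise SimulatorDecodeError(f"could not decode simulator output: {exc}") from exc


print(parse_simulator_output('{"status": "ok"}'))  # {'status': 'ok'}

try:
    parse_simulator_output("not json at all")
except SimulatorDecodeError as err:
    print(type(err).__name__)  # SimulatorDecodeError
```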

Other

Documentation formatting improved. Added dark mode and links to GitHub (PR: #11).
Improved Quick Start Guide in docs/getting-started/quickstart.md. (PR: #10)
maseval.interface.agents structure changed. Tools requiring framework imports (beyond just typing) are now in <framework>_optional.py and are imported dynamically from <framework>.py. (PR: #12)
Various formatting improvements in the documentation (PR: #12)
Added documentation for View Source Code pattern in CONTRIBUTING.md and _optional.py pattern in interface README (PR: #12)
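
The _optional.py pattern mentioned above relies on deferring framework imports until they are needed. A generic sketch of dynamic importing with the standard importlib module follows; the helper `load_optional()` and its error message are assumptions and are not copied from the interface package.

```python
import importlib
from typing import Any


def load_optional(module_name: str, attr: str) -> Any:
    """Import module_name only when called and return one attribute from it."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature but could not be "
            "imported; install the matching optional dependency."
        ) from exc
    return getattr(module, attr)


# A standard-library module stands in for a framework-specific one here.
dumps = load_optional("json", "dumps")
print(dumps({"ok": True}))  # {"ok": true}
```

The base module stays importable without the framework installed, and the failure only surfaces when the framework-specific tool is actually requested.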

Fixed

Interface

LlamaIndexAgentAdapter now supports multiple LlamaIndex agent types including ReActAgent (workflow-based), FunctionAgent, and legacy agents by checking for .chat(), .query(), and .run() methods in priority order (PR: #10)
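
Method-based dispatch of this kind can be sketched in plain Python. The stand-in classes below are not LlamaIndex types, the order chat, query, run is assumed from the order given in the note, and real workflow-based agents may need asynchronous handling that this sketch omits.

```python
from typing import Any


def invoke_agent(agent: Any, message: str) -> Any:
    """Call the first available method among chat, query, run (assumed order)."""
    for method_name in ("chat", "query", "run"):
        method = getattr(agent, method_name, None)
        if callable(method):
            return method(message)
    raise TypeError(f"{type(agent).__name__} has no chat(), query(), or run() method")


class ChatStyleAgent:
    """Stand-in exposing both chat() and run(); chat() should win."""

    def chat(self, message: str) -> str:
        return f"chat: {message}"

    def run(self, message: str) -> str:
        return f"run: {message}"


class RunOnlyAgent:
    """Stand-in exposing only run()."""

    def run(self, message: str) -> str:
        return f"run: {message}"


print(invoke_agent(ChatStyleAgent(), "hi"))  # chat: hi
print(invoke_agent(RunOnlyAgent(), "hi"))    # run: hi
```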

Other

Consistently use the name "agent adapter" instead of "wrapper" (PR: #3)
Fixed an issue where the LiteLLM interface and Mixins were not shown properly in the documentation (PR: #12)

Removed

Removed set_message_history, append_message_history and clear_message_history for AgentAdapter and subclasses. (PR: #3)


18 Nov 18:03

v0.1.2

Full Changelog: v0.1.1...v0.1.2


18 Nov 15:47

cemde

Initial Release

This is the initial code release. Library under active development. API might change anytime.


17 Nov 17:25

cemde

v0.1.0-alpha

fixed email
