Add tau 2 bench #16

cemde · 2025-12-22T20:38:12Z

Description

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

I have read the CONTRIBUTING.md guide.
Commits follow "How to write a good git commit message"

Documentation

Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
Updated relevant documentation in docs/ (if applicable)
Tagged the GitHub issue in this PR (if applicable)

Changelog

Added entry to CHANGELOG.md under [Unreleased] section
- Use Added section for new features
- Use Changed section for modifications to existing functionality
- Use Fixed section for bug fixes
- Use Removed section for deprecated/removed features
OR this is a documentation-only change (no changelog needed)

Example:
- Support for multi-agent tracing (PR:#123)

Architecture (if applicable)

Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies
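
The core/interface rule above can be checked mechanically. The following is a minimal sketch, not part of this PR or of maseval itself: the helper name `forbidden_imports` is hypothetical, and it assumes the rule applies to absolute imports of the `maseval.interface` package. Relative imports are not resolved.

```python
import ast

# Package that core code must not depend on (taken from the checklist item above).
FORBIDDEN_PREFIX = "maseval.interface"


def forbidden_imports(source: str, prefix: str = FORBIDDEN_PREFIX) -> list[str]:
    """Return the sorted, de-duplicated absolute imports in `source` under `prefix`.

    Hypothetical helper for illustration; relative imports (level > 0) are skipped.
    """
    candidates = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` -> "a.b.c"
            candidates.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # `from a.b import c` -> "a.b" and "a.b.c", so that
            # `from maseval import interface` is also caught.
            candidates.append(node.module)
            candidates.extend(f"{node.module}.{alias.name}" for alias in node.names)
    hits = {
        name
        for name in candidates
        if name == prefix or name.startswith(prefix + ".")
    }
    return sorted(hits)
```

One way to use it would be to read each `*.py` file under `maseval/core/` and fail a test if the returned list is non-empty for any of them.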

Additional Notes

github-actions · 2025-12-22T20:39:25Z

Coverage report

File	Statements	Missing	Coverage	Coverage (new stmts)	Lines missing
maseval/benchmark/macs
data_loader.py					255, 297, 548-569
macs.py
maseval/benchmark/tau2
data_loader.py					84-85, 100-104, 141, 150, 163, 168, 252, 297, 344, 349, 427, 481
environment.py					85, 132, 140, 182-191, 260-262, 301, 318-322
evaluator.py					221, 234-236, 249-250, 273, 284-285, 315, 323, 328, 335-337, 353, 412, 421, 533, 619, 646
tau2.py					321, 324, 552, 708, 728, 730, 732, 738, 812-815
utils.py					169-170, 174-179, 204-209
maseval/benchmark/tau2/domains
base.py					74, 82, 162-170, 271-275, 287, 328, 333-342
maseval/benchmark/tau2/domains/airline
db.py
models.py
tools.py					61, 69, 77, 91-94, 108, 112, 138, 150-162, 179, 182, 184, 330-339, 399, 440, 444, 475, 480, 495, 560, 662, 666
maseval/benchmark/tau2/domains/retail
models.py
tools.py					65, 86, 104, 124, 217, 243, 311, 367-368, 416, 426, 430, 442, 542, 551, 555, 566, 577-578, 584, 626, 632-665, 746, 752
maseval/benchmark/tau2/domains/telecom
db.py
models.py
tools.py					74, 83, 87, 92, 96, 101, 110, 126, 147, 160-180, 229, 255, 277, 501, 563-616, 634-640
user_models.py					255-257
user_tools.py					46, 52, 82, 84, 86, 90-93, 120-121, 125, 152, 154, 156, 160, 170-177, 232, 270-272, 333, 427-428, 511, 586, 603, 631, 663-664, 680-688, 702-703, 708-713, 732-740, 752-753, 765-766, 778-779
maseval/core
benchmark.py
simulator.py
task.py
user.py					447, 450-455, 465, 539-544
maseval/core/callbacks
result_logger.py
maseval/interface/inference
anthropic.py
google_genai.py					178-188, 193
Project Total

The report is truncated to 25 files out of 31. To see the full report, please visit the workflow summary page.

This report was generated by python-coverage-comment-action

cemde

ok

cemde added 23 commits December 26, 2025 13:06

added files about implementation strategies and plans.

23511d4

added comment to plans

758da11

updated claude plan

9c99065

consolidated plan

efae183

initial attempt

338b2b6

added agentic user with tool use

dd7a4e6

docs: Add TESTING.md with testing plan for Tau2 benchmark

cb8499a

test(tau2): implement comprehensive testing strategy per TESTING.md

6e3bb26

docs: Remove TESTING.md after implementation

05dc4c2

formatting

8f5f6a1

fix: resolve linting, duplication, and type errors in Tau2 and AgenticUser

217a2d7

style: final ruff formatting and lint fixes

cc8fb7c

updated testing

da518ac

fixing type hinting and formatting

0d1cc23

fixed dependency issue

80ee4ca

updated agent file

c5781eb

added better coverage scripts

11567c5

moved agentic user

1a0be47

fixed testing

df4262e

added default tau2 implementation

fa937ce

added gitignore to example

c9c9332

initial default agent file

ad5e67e

cleaned up docstrings for model adapters

91de571

cemde force-pushed the add-tau-2-bench branch from 068ba6b to 91de571 Compare December 26, 2025 12:16

cemde added 5 commits December 26, 2025 13:39

fixed tau2 and agenticuser tests given new ModelAdapter chat features

782c9bf

updated default agent with verbosity level

6087c8f

fixed bugs

eb24ce1

added comparison script

8b5a4ef

added tools for user

17f523f

cemde added 17 commits December 27, 2025 13:40

implemented metric

9aaf1b6

slowly fixing tau2

6cf4dcf

updated user template

0a54a09

fixed typing

bcd9652

improved tau comparison script

6383bb7

enabled multiple reasons for user to terminate conversation

647840d

added temp setting to tau2 default script

83f49e0

improved provenance file

3f9f5d6

merged tau2 examples

4e2a9a8

fixed task.id bad pattern and updated changelog

ba8c986

removed old markdown files

87f35d2

formatting of notebooks

2578224

updated documentation to be independent of MACS

8b08b81

removed unnecessary script

b5a8697

removed comparison script

0aea1bc

improved tests

e9f57a4

simplified tau2 bench example

9fd02b4

cemde commented Dec 30, 2025

fixed docs

1483c30

cemde merged commit 85547ae into main Dec 30, 2025
20 checks passed
