Summary
Add a TerminationReason enum to track how a task's execution loop ended.
Motivation
Currently there's no way to distinguish how an execution loop ended when TaskExecutionStatus is SUCCESS. Did the agent signal completion? Did the user signal satisfaction? Did we hit max_invocations?
This information is valuable for analytics and debugging.
Proposed Design
Current Enum (unchanged)
```python
# maseval/core/benchmark.py
from enum import Enum


class TaskExecutionStatus(Enum):
    """Did execution complete without infrastructure errors?

    SUCCESS means no exceptions were raised during execution.
    The actual pass/fail of the task is in eval_results, not here.
    """

    # Execution completed
    SUCCESS = "success"

    # Execution failed (various causes)
    AGENT_ERROR = "agent_error"
    ENVIRONMENT_ERROR = "environment_error"
    USER_ERROR = "user_error"
    TASK_TIMEOUT = "task_timeout"
    UNKNOWN_EXECUTION_ERROR = "unknown_execution_error"

    # Lifecycle failures
    EVALUATION_FAILED = "evaluation_failed"
    SETUP_FAILED = "setup_failed"
```

New Enum
```python
# maseval/core/execution.py (new file)
from enum import Enum


class TerminationReason(Enum):
    """How the execution loop ended.

    Only meaningful when TaskExecutionStatus is SUCCESS.
    For error statuses, the status itself explains why execution stopped.
    """

    AGENT_STOP = "agent_stop"  # Agent signaled completion
    USER_STOP = "user_stop"    # User/simulator signaled satisfaction
    MAX_STEPS = "max_steps"    # Hit max_invocations limit
    UNKNOWN = "unknown"        # Custom execution_loop didn't set it
```

Three Independent Concepts

| Concept | Question Answered | Where Stored |
|---|---|---|
| `TaskExecutionStatus` | Did infrastructure work? | `report["status"]` |
| `TerminationReason` | Why did loop stop? | `report["termination_reason"]` |
| `eval_results` | Did agent pass the task? | `report["eval"]` |

Example Combinations
| status | termination_reason | eval | Meaning |
|---|---|---|---|
| SUCCESS | AGENT_STOP | passed=True | Agent finished, passed |
| SUCCESS | AGENT_STOP | passed=False | Agent finished, failed |
| SUCCESS | USER_STOP | passed=True | User satisfied, passed |
| SUCCESS | MAX_STEPS | passed=False | Hit limit, failed |
| TASK_TIMEOUT | N/A | N/A | Timed out, no eval |
| AGENT_ERROR | N/A | N/A | Exception, no eval |
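
For concreteness, a report for the first row might look roughly like the sketch below. Only `status`, `termination_reason`, and `eval` come from this proposal; the other keys and the exact value shapes are illustrative placeholders.

```python
# Illustrative report for status=SUCCESS, termination_reason=AGENT_STOP, passed=True.
# Only "status", "termination_reason", and "eval" are part of this proposal;
# the remaining keys and value shapes are placeholder assumptions.
report = {
    "task_id": "example-task-001",       # placeholder identifier
    "status": "success",                 # TaskExecutionStatus.SUCCESS.value
    "termination_reason": "agent_stop",  # TerminationReason.AGENT_STOP.value
    "eval": {"passed": True},            # evaluator output; exact shape is a placeholder
}
```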
Integration
```python
def execution_loop(self, agents, task, environment, user=None):
    # `query` holds the current user query; its initial value comes from the
    # task/user setup, which is omitted in this sketch.
    for invocation in range(self.max_invocations):
        response = self.run_agents(agents, task, environment, query)
        if self._agent_is_done(response):
            self._termination_reason = TerminationReason.AGENT_STOP
            return response
        if user:
            user_response = user(response)
            if self._user_is_satisfied(user_response):
                self._termination_reason = TerminationReason.USER_STOP
                return response
            query = user_response
    self._termination_reason = TerminationReason.MAX_STEPS
    return response
```

Open Questions
1. Should MAX_STEPS be an exception like TASK_TIMEOUT?
Currently these are handled inconsistently:
| Limit Hit | Current Behavior |
|---|---|
| Timeout | Raises TaskTimeoutError → status=TASK_TIMEOUT → eval skipped |
| Max steps | Loop ends normally → status=SUCCESS → eval runs |
Option A: Treat MAX_STEPS like TIMEOUT
- Add `MaxStepsError` exception
- Add `TASK_MAX_STEPS` status
- Skip evaluation (or make configurable)
- Rationale: Both are resource limits / truncations (rough sketch below)
Option B: Keep current behavior
- MAX_STEPS captured in TerminationReason only
- Status stays SUCCESS, evaluation runs
- Rationale: We have results (unlike timeout), let evaluator decide
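
If Option A is chosen, a minimal sketch might look like the following. `MaxStepsError` and `TASK_MAX_STEPS` are the proposed names from the option above, and the handler shown in comments mirrors the described timeout path; none of this is existing code.

```python
# Sketch of Option A; MaxStepsError and TASK_MAX_STEPS are proposed names, not existing code.
class MaxStepsError(Exception):
    """Raised when the execution loop exhausts max_invocations without terminating."""


# execution_loop would raise instead of falling through with status=SUCCESS:
#     raise MaxStepsError(f"no termination after {self.max_invocations} invocations")
#
# and the task runner would catch it alongside TaskTimeoutError:
#     except MaxStepsError:
#         status = TaskExecutionStatus.TASK_MAX_STEPS  # new status value to add
#         # evaluation skipped here (or run anyway, if made configurable)
```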
2. Detection methods location
Should _agent_is_done() and _user_is_satisfied() be:
- Methods on Benchmark (current proposal)
- Methods on Agent/User base classes
- Configurable callbacks
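
For comparison, under the first option they could be small overridable methods on `Benchmark`. The signatures and the `is_final` / `satisfied` conventions below are illustrative assumptions, not an agreed interface.

```python
# Hypothetical default helpers if they live on Benchmark; the attribute names
# ("is_final", "satisfied") are assumed conventions, not part of the proposal.
class Benchmark:
    def _agent_is_done(self, response) -> bool:
        """True if the agent signaled completion."""
        return bool(getattr(response, "is_final", False))

    def _user_is_satisfied(self, user_response) -> bool:
        """True if the user/simulator signaled satisfaction."""
        return bool(getattr(user_response, "satisfied", False))
```

Keeping these overridable would also leave room to swap in the callback option later without changing the loop itself.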
3. Report field for errors
When status is an error (TIMEOUT, AGENT_ERROR, etc.), should report include:
- `termination_reason: null`
- Or omit the field entirely
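
Spelled out with placeholder values, the two candidate shapes would be:

```python
# Illustrative only: the two candidate report shapes for an error status.
report_with_null_field = {"status": "agent_error", "termination_reason": None}
report_without_field = {"status": "agent_error"}
```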
Alternatives Considered
Extend TaskExecutionStatus instead of new enum
- Rejected: Conflates "infrastructure worked" with "how loop ended"
- Would make scoring logic more complex
Return TerminationReason from execution_loop
- Possible alternative to instance variable
- Cleaner but requires signature change
Files to Change
- New: `maseval/core/execution.py` - TerminationReason enum
- Update: `maseval/core/benchmark.py` - tracking logic, detection helpers
- Update: `maseval/__init__.py` - export TerminationReason
- Update: `maseval/benchmark/tau2/evaluator.py` - remove local enum, import from core