Feature: Add TerminationReason to Core Library #18

@cemde

Description

Summary

Add a TerminationReason enum to track how a task's execution loop ended.

Motivation

Currently there's no way to distinguish how an execution loop ended when TaskExecutionStatus is SUCCESS. Did the agent signal completion? Did the user signal satisfaction? Did we hit max_invocations?

This information is valuable for analytics and debugging.

Proposed Design

Current Enum (unchanged)

# maseval/core/benchmark.py

class TaskExecutionStatus(Enum):
    """Did execution complete without infrastructure errors?

    SUCCESS means no exceptions were raised during execution.
    The actual pass/fail of the task is in eval_results, not here.
    """

    # Execution completed
    SUCCESS = "success"

    # Execution failed (various causes)
    AGENT_ERROR = "agent_error"
    ENVIRONMENT_ERROR = "environment_error"
    USER_ERROR = "user_error"
    TASK_TIMEOUT = "task_timeout"
    UNKNOWN_EXECUTION_ERROR = "unknown_execution_error"

    # Lifecycle failures
    EVALUATION_FAILED = "evaluation_failed"
    SETUP_FAILED = "setup_failed"

New Enum

# maseval/core/execution.py (new file)

class TerminationReason(Enum):
    """How the execution loop ended.

    Only meaningful when TaskExecutionStatus is SUCCESS.
    For error statuses, the status itself explains why execution stopped.
    """

    AGENT_STOP = "agent_stop"   # Agent signaled completion
    USER_STOP = "user_stop"     # User/simulator signaled satisfaction
    MAX_STEPS = "max_steps"     # Hit max_invocations limit

    UNKNOWN = "unknown"         # Custom execution_loop didn't set it

Three Independent Concepts

| Concept | Question Answered | Where Stored |
| --- | --- | --- |
| TaskExecutionStatus | Did infrastructure work? | report["status"] |
| TerminationReason | Why did the loop stop? | report["termination_reason"] |
| eval_results | Did the agent pass the task? | report["eval"] |
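To make the separation concrete, here is a minimal sketch of what a per-task report could look like with all three fields populated. The report shape and key names are assumptions based on the table above, and the enum is repeated so the example is self-contained:

```python
from enum import Enum


class TerminationReason(Enum):
    """Mirrors the proposed enum; repeated here so the example is self-contained."""
    AGENT_STOP = "agent_stop"
    USER_STOP = "user_stop"
    MAX_STEPS = "max_steps"
    UNKNOWN = "unknown"


# Hypothetical per-task report illustrating the three independent fields;
# the exact report layout is an assumption, not the library's confirmed schema.
report = {
    "status": "success",                                       # TaskExecutionStatus
    "termination_reason": TerminationReason.AGENT_STOP.value,  # why the loop stopped
    "eval": {"passed": True},                                  # task-level pass/fail
}
```

Each field answers its own question: the run completed cleanly, the agent itself chose to stop, and the task was passed.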

Example Combinations

| status | termination_reason | eval | Meaning |
| --- | --- | --- | --- |
| SUCCESS | AGENT_STOP | passed=True | Agent finished, passed |
| SUCCESS | AGENT_STOP | passed=False | Agent finished, failed |
| SUCCESS | USER_STOP | passed=True | User satisfied, passed |
| SUCCESS | MAX_STEPS | passed=False | Hit limit, failed |
| TASK_TIMEOUT | N/A | N/A | Timed out, no eval |
| AGENT_ERROR | N/A | N/A | Exception, no eval |
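These combinations are what make the field useful for analytics: with a plain dict-based report (field names follow the tables above; the report shape itself is an assumption), tallying why successful runs ended is a one-liner:

```python
from collections import Counter

# Hypothetical list of per-task reports, shaped like the tables above.
reports = [
    {"status": "success", "termination_reason": "agent_stop"},
    {"status": "success", "termination_reason": "max_steps"},
    {"status": "task_timeout", "termination_reason": None},
    {"status": "success", "termination_reason": "agent_stop"},
]

# Only tally runs where execution succeeded; for error statuses the status
# itself already explains why execution stopped.
tally = Counter(
    r["termination_reason"] for r in reports if r["status"] == "success"
)
```

A benchmark with many MAX_STEPS terminations, for example, suggests the invocation budget is too tight rather than that agents are failing outright.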

Integration

def execution_loop(self, agents, task, environment, user=None):
    query = task.query  # initial query; the exact attribute name is illustrative
    for invocation in range(self.max_invocations):
        response = self.run_agents(agents, task, environment, query)

        if self._agent_is_done(response):
            self._termination_reason = TerminationReason.AGENT_STOP
            return response

        if user:
            user_response = user(response)
            if self._user_is_satisfied(user_response):
                self._termination_reason = TerminationReason.USER_STOP
                return response
            query = user_response

    self._termination_reason = TerminationReason.MAX_STEPS
    return response
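The instance-variable approach implies a reset-and-record lifecycle around each task run: reset to UNKNOWN before the loop (so a custom execution_loop override that never sets the field is still reported honestly), then serialize the value into the report afterwards. A minimal sketch, where everything except `_termination_reason` and `execution_loop` is a hypothetical name:

```python
from enum import Enum


class TerminationReason(Enum):
    AGENT_STOP = "agent_stop"
    USER_STOP = "user_stop"
    MAX_STEPS = "max_steps"
    UNKNOWN = "unknown"


class BenchmarkSketch:
    """Sketch of the tracking lifecycle; `_run_task` is an illustrative name."""

    def __init__(self):
        self._termination_reason = TerminationReason.UNKNOWN

    def execution_loop(self):
        # Stand-in for the real loop: pretend the agent signaled completion.
        self._termination_reason = TerminationReason.AGENT_STOP

    def _run_task(self):
        # Reset per task so a stale value from a previous run never leaks.
        self._termination_reason = TerminationReason.UNKNOWN
        self.execution_loop()
        return {
            "status": "success",
            "termination_reason": self._termination_reason.value,
        }


report = BenchmarkSketch()._run_task()
```

The per-task reset is the important part: without it, a custom loop that forgets to set the field would silently inherit the previous task's reason instead of UNKNOWN.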

Open Questions

1. Should MAX_STEPS be an exception like TASK_TIMEOUT?

Currently these are handled inconsistently:

| Limit Hit | Current Behavior |
| --- | --- |
| Timeout | Raises TaskTimeoutError → status=TASK_TIMEOUT → eval skipped |
| Max steps | Loop ends normally → status=SUCCESS → eval runs |

Option A: Treat MAX_STEPS like TIMEOUT

  • Add MaxStepsError exception
  • Add TASK_MAX_STEPS status
  • Skip evaluation (or make configurable)
  • Rationale: Both are resource limits / truncations
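Option A would look roughly like this toy sketch, where MaxStepsError and the loop shape are illustrative rather than the proposed implementation:

```python
class MaxStepsError(Exception):
    """Hypothetical exception for Option A, by analogy with TaskTimeoutError."""


def execution_loop(max_invocations, agent_is_done):
    """Toy loop showing Option A: raise at the limit instead of returning."""
    for invocation in range(max_invocations):
        if agent_is_done(invocation):
            return invocation
    # Both resource limits become exceptions, so the caller can map
    # MaxStepsError to a TASK_MAX_STEPS status and skip evaluation,
    # mirroring how TASK_TIMEOUT is handled today.
    raise MaxStepsError(f"no stop signal after {max_invocations} invocations")
```

The caller's error-handling path would then treat both limits uniformly, at the cost of discarding the partial results that Option B would evaluate.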

Option B: Keep current behavior

  • MAX_STEPS captured in TerminationReason only
  • Status stays SUCCESS, evaluation runs
  • Rationale: We have results (unlike timeout), let evaluator decide

2. Detection methods location

Should _agent_is_done() and _user_is_satisfied() be:

  • Methods on Benchmark (current proposal)
  • Methods on Agent/User base classes
  • Configurable callbacks
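The configurable-callbacks option could be sketched as constructor parameters with sensible defaults; all names and the response shape here are illustrative, not the library's API:

```python
from typing import Callable


class BenchmarkSketch:
    """Sketch of the 'configurable callbacks' option; names are illustrative."""

    def __init__(
        self,
        agent_is_done: Callable[[dict], bool] = lambda r: r.get("done", False),
        user_is_satisfied: Callable[[dict], bool] = lambda r: r.get("satisfied", False),
    ):
        # Callers can swap in benchmark-specific detectors without subclassing.
        self.agent_is_done = agent_is_done
        self.user_is_satisfied = user_is_satisfied


# Default detector vs. a benchmark-specific one keyed on the response text.
default = BenchmarkSketch()
custom = BenchmarkSketch(
    agent_is_done=lambda r: "FINAL ANSWER" in r.get("text", "")
)
```

This keeps detection logic out of the inheritance hierarchy, though it makes the benchmark constructor signature heavier than the methods-on-Benchmark proposal.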

3. Report field for errors

When status is an error (TASK_TIMEOUT, AGENT_ERROR, etc.), should the report include:

  • termination_reason: null
  • Or omit the field entirely

Alternatives Considered

Extend TaskExecutionStatus instead of new enum

  • Rejected: Conflates "infrastructure worked" with "how loop ended"
  • Would make scoring logic more complex

Return TerminationReason from execution_loop

  • Possible alternative to instance variable
  • Cleaner but requires signature change
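The signature-change alternative would return a (response, reason) pair instead of mutating an instance variable; a toy sketch under that assumption:

```python
from enum import Enum


class TerminationReason(Enum):
    AGENT_STOP = "agent_stop"
    MAX_STEPS = "max_steps"


def execution_loop(max_invocations, agent_is_done):
    """Illustrative loop returning (response, reason) instead of setting state."""
    response = None
    for invocation in range(max_invocations):
        response = {"invocation": invocation}  # stand-in for run_agents output
        if agent_is_done(response):
            return response, TerminationReason.AGENT_STOP
    return response, TerminationReason.MAX_STEPS


response, reason = execution_loop(3, agent_is_done=lambda r: r["invocation"] == 1)
```

Returning the reason keeps the loop free of hidden state, but every existing override of execution_loop would need to adopt the new return shape, which is the compatibility cost noted above.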

Files to Change

  1. New: maseval/core/execution.py - TerminationReason enum
  2. Update: maseval/core/benchmark.py - tracking logic, detection helpers
  3. Update: maseval/__init__.py - export TerminationReason
  4. Update: maseval/benchmark/tau2/evaluator.py - remove local enum, import from core
