Summary
Add a TerminationReason enum to track how a task's execution loop ended.
Motivation
Currently there's no way to distinguish how an execution loop ended when TaskExecutionStatus is SUCCESS. Did the agent signal completion? Did the user signal satisfaction? Did we hit max_invocations?
This information is valuable for analytics and debugging.
Proposed Design
Current Enum (unchanged)
```python
# maseval/core/benchmark.py
from enum import Enum


class TaskExecutionStatus(Enum):
    """Did execution complete without infrastructure errors?

    SUCCESS means no exceptions were raised during execution.
    The actual pass/fail of the task is in eval_results, not here.
    """

    # Execution completed
    SUCCESS = "success"

    # Execution failed (various causes)
    AGENT_ERROR = "agent_error"
    ENVIRONMENT_ERROR = "environment_error"
    USER_ERROR = "user_error"
    TASK_TIMEOUT = "task_timeout"
    UNKNOWN_EXECUTION_ERROR = "unknown_execution_error"

    # Lifecycle failures
    EVALUATION_FAILED = "evaluation_failed"
    SETUP_FAILED = "setup_failed"
```

New Enum
```python
# maseval/core/execution.py (new file)
from enum import Enum


class TerminationReason(Enum):
    """How the execution loop ended.

    Only meaningful when TaskExecutionStatus is SUCCESS.
    For error statuses, the status itself explains why execution stopped.
    """

    AGENT_STOP = "agent_stop"  # Agent signaled completion
    USER_STOP = "user_stop"    # User/simulator signaled satisfaction
    MAX_STEPS = "max_steps"    # Hit max_invocations limit
    UNKNOWN = "unknown"        # Custom execution_loop didn't set it
```

Three Independent Concepts

| Concept | Question Answered | Where Stored |
|---|---|---|
| `TaskExecutionStatus` | Did infrastructure work? | `report["status"]` |
| `TerminationReason` | Why did loop stop? | `report["termination_reason"]` |
| `eval_results` | Did agent pass the task? | `report["eval"]` |

Example Combinations
| status | termination_reason | eval | Meaning |
|---|---|---|---|
| SUCCESS | AGENT_STOP | passed=True | Agent finished, passed |
| SUCCESS | AGENT_STOP | passed=False | Agent finished, failed |
| SUCCESS | USER_STOP | passed=True | User satisfied, passed |
| SUCCESS | MAX_STEPS | passed=False | Hit limit, failed |
| TASK_TIMEOUT | N/A | N/A | Timed out, no eval |
| AGENT_ERROR | N/A | N/A | Exception, no eval |
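
For concreteness, a report for the first row might look roughly like the sketch below. Only `status`, `termination_reason`, and `eval` come from this proposal; the other keys and the exact value shapes are illustrative placeholders.

```python
# Illustrative report for status=SUCCESS, termination_reason=AGENT_STOP, passed=True.
# Only "status", "termination_reason", and "eval" are part of this proposal;
# the remaining keys and value shapes are placeholder assumptions.
report = {
    "task_id": "example-task-001",       # placeholder identifier
    "status": "success",                 # TaskExecutionStatus.SUCCESS.value
    "termination_reason": "agent_stop",  # TerminationReason.AGENT_STOP.value
    "eval": {"passed": True},            # evaluator output; exact shape is a placeholder
}
```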
Integration
```python
def execution_loop(self, agents, task, environment, user=None):
    # `query` holds the current user query; its initial value comes from the
    # task/user setup, which is omitted in this sketch.
    for invocation in range(self.max_invocations):
        response = self.run_agents(agents, task, environment, query)
        if self._agent_is_done(response):
            self._termination_reason = TerminationReason.AGENT_STOP
            return response
        if user:
            user_response = user(response)
            if self._user_is_satisfied(user_response):
                self._termination_reason = TerminationReason.USER_STOP
                return response
            query = user_response
    self._termination_reason = TerminationReason.MAX_STEPS
    return response
```

Open Questions
1. Should MAX_STEPS be an exception like TASK_TIMEOUT?
Currently these are handled inconsistently:
| Limit Hit | Current Behavior |
|---|---|
| Timeout | Raises TaskTimeoutError → status=TASK_TIMEOUT → eval skipped |
| Max steps | Loop ends normally → status=SUCCESS → eval runs |
Option A: Treat MAX_STEPS like TIMEOUT
- Add `MaxStepsError` exception
- Add `TASK_MAX_STEPS` status
- Skip evaluation (or make configurable)
- Rationale: Both are resource limits / truncations (rough sketch below)
Option B: Keep current behavior
- MAX_STEPS captured in TerminationReason only
- Status stays SUCCESS, evaluation runs
- Rationale: We have results (unlike timeout), let evaluator decide
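
If Option A is chosen, a minimal sketch might look like the following. `MaxStepsError` and `TASK_MAX_STEPS` are the proposed names from the option above, and the handler shown in comments mirrors the described timeout path; none of this is existing code.

```python
# Sketch of Option A; MaxStepsError and TASK_MAX_STEPS are proposed names, not existing code.
class MaxStepsError(Exception):
    """Raised when the execution loop exhausts max_invocations without terminating."""


# execution_loop would raise instead of falling through with status=SUCCESS:
#     raise MaxStepsError(f"no termination after {self.max_invocations} invocations")
#
# and the task runner would catch it alongside TaskTimeoutError:
#     except MaxStepsError:
#         status = TaskExecutionStatus.TASK_MAX_STEPS  # new status value to add
#         # evaluation skipped here (or run anyway, if made configurable)
```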
2. Detection methods location
Should _agent_is_done() and _user_is_satisfied() be:
- Methods on Benchmark (current proposal)
- Methods on Agent/User base classes
- Configurable callbacks
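
For comparison, under the first option they could be small overridable methods on `Benchmark`. The signatures and the `is_final` / `satisfied` conventions below are illustrative assumptions, not an agreed interface.

```python
# Hypothetical default helpers if they live on Benchmark; the attribute names
# ("is_final", "satisfied") are assumed conventions, not part of the proposal.
class Benchmark:
    def _agent_is_done(self, response) -> bool:
        """True if the agent signaled completion."""
        return bool(getattr(response, "is_final", False))

    def _user_is_satisfied(self, user_response) -> bool:
        """True if the user/simulator signaled satisfaction."""
        return bool(getattr(user_response, "satisfied", False))
```

Keeping these overridable would also leave room to swap in the callback option later without changing the loop itself.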
3. Report field for errors
When status is an error (TIMEOUT, AGENT_ERROR, etc.), should report include:
- `termination_reason: null`
- Or omit the field entirely
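
Spelled out with placeholder values, the two candidate shapes would be:

```python
# Illustrative only: the two candidate report shapes for an error status.
report_with_null_field = {"status": "agent_error", "termination_reason": None}
report_without_field = {"status": "agent_error"}
```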
Alternatives Considered
Extend TaskExecutionStatus instead of new enum
- Rejected: Conflates "infrastructure worked" with "how loop ended"
- Would make scoring logic more complex
Return TerminationReason from execution_loop
- Possible alternative to instance variable
- Cleaner but requires signature change
Files to Change
- New: `maseval/core/execution.py` - TerminationReason enum
- Update: `maseval/core/benchmark.py` - tracking logic, detection helpers
- Update: `maseval/__init__.py` - export TerminationReason
- Update: `maseval/benchmark/tau2/evaluator.py` - remove local enum, import from core