fix(task): surface non-FINISHED sub-agent runs as errors, not empty success#3742

Merged
ak684 merged 1 commit into main from alona/fix-subagent-budget
Jun 18, 2026

Conversation

@ak684 ak684 commented Jun 16, 2026

Contributor

HUMAN:

Bug fix: a sub-agent that ended in a non-FINISHED status (stuck, paused, or run limit) was reported to the parent as an empty success.

  • A human has tested these changes.

AGENT:

Narrowly scoped bug fix for sub-agent (Task tool) result surfacing. uv run pytest tests/tools/task/test_task_manager.py: 53 passed; pre-commit (ruff + pyright) is clean.


Why

A sub-agent whose run ended in any non-FINISHED terminal status (stuck detection, paused, or a run limit) was reported to the parent task as an empty success ("Task completed with no result.", is_error=False). _run_task special-cased only ERROR; every other non-FINISHED status fell through to the success branch, so the parent agent could not distinguish a real result from a silently aborted sub-agent.

Summary

  • _run_task now treats only FINISHED as success; any other terminal status (stuck, paused, run-limit) is surfaced to the parent as an error.
  • The error detail preserves any partial output (_run_stop_detail) so the parent can use or retry it instead of getting nothing.
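A minimal sketch of the success rule in the Summary, under simplifying assumptions: Status, TaskResult, surface_result_before and surface_result are hypothetical names invented for illustration, not the SDK's ConversationExecutionStatus or the actual body of _run_task. The "before" function mirrors the old fall-through described under Why; the other treats only FINISHED as success.

```python
# Hypothetical sketch: names and shapes are assumptions, not the SDK's API.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # Terminal statuses named in this PR (illustrative subset).
    FINISHED = "finished"
    ERROR = "error"
    STUCK = "stuck"
    PAUSED = "paused"


@dataclass
class TaskResult:
    status: str  # "completed" or "error"
    result: Optional[str]
    error: Optional[str]

    @property
    def is_error(self) -> bool:
        return self.status == "error"


def surface_result_before(status: Status, final_text: str, detail: str) -> TaskResult:
    # Old shape: only ERROR was special-cased, so STUCK/PAUSED fell through
    # to the success branch with whatever (possibly empty) text existed.
    if status is Status.ERROR:
        return TaskResult("error", None, detail)
    return TaskResult("completed", final_text, None)


def surface_result(status: Status, final_text: str, detail: str) -> TaskResult:
    # New shape: FINISHED is the only success; every other terminal status
    # is reported to the parent as an error carrying the stop detail.
    if status is Status.FINISHED:
        return TaskResult("completed", final_text, None)
    return TaskResult("error", None, detail)


stuck_detail = "Sub-agent stopped without finishing (status: stuck)."
print(surface_result_before(Status.STUCK, "", stuck_detail))
# TaskResult(status='completed', result='', error=None)
print(surface_result(Status.STUCK, "", stuck_detail))
# TaskResult(status='error', result=None, error='Sub-agent stopped without finishing (status: stuck).')
```

Keying success on a single allowed status, rather than on a list of known failures, means any terminal status added later is treated as an error by default.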

Issue Number

(none)

How to Test

uv run pytest tests/tools/task/test_task_manager.py

Covers: a non-FINISHED (stuck) status is surfaced as an error; _run_stop_detail returns the last error event's detail, or falls back to a status message when no error event exists.
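The stop-detail fallback covered by these tests can be sketched as below. This is an illustration under assumptions: Event and stop_detail are hypothetical names, and the real _run_stop_detail reads the SDK's own event types rather than this simplified record. The fallback wording is copied from the QA output later in this thread.

```python
# Hypothetical sketch of a _run_stop_detail-style helper; Event is an assumed shape.
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Event:
    kind: str  # e.g. "error" or "message" (assumed labels)
    text: str


def stop_detail(events: Sequence[Event], status: str) -> str:
    # Walk backwards so the most recent error event wins.
    for event in reversed(events):
        if event.kind == "error" and event.text:
            return event.text
    # No error event (e.g. stuck detection or a pause): name the status instead.
    return f"Sub-agent stopped without finishing (status: {status})."


history = [
    Event("message", "working on it"),
    Event("error", "Agent reached maximum iterations."),
]
print(stop_detail(history, "error"))  # Agent reached maximum iterations.
print(stop_detail([], "stuck"))  # Sub-agent stopped without finishing (status: stuck).
```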

Type

  • Bug fix

Notes

This PR was narrowed to the status-surfacing fix only; the per-run budget plumbing it originally carried was dropped so the budget can be reworked separately. The change here is independent of the budget feature.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base image
java | amd64, arm64 | eclipse-temurin:17-jdk
python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim
golang | amd64, arm64 | golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:61de89c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-61de89c-python \
  ghcr.io/openhands/agent-server:61de89c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:61de89c-golang-amd64
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-golang-amd64
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-golang-amd64
ghcr.io/openhands/agent-server:61de89c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:61de89c-golang-arm64
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-golang-arm64
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-golang-arm64
ghcr.io/openhands/agent-server:61de89c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:61de89c-java-amd64
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-java-amd64
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-java-amd64
ghcr.io/openhands/agent-server:61de89c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:61de89c-java-arm64
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-java-arm64
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-java-arm64
ghcr.io/openhands/agent-server:61de89c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:61de89c-python-amd64
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-python-amd64
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-python-amd64
ghcr.io/openhands/agent-server:61de89c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:61de89c-python-arm64
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-python-arm64
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-python-arm64
ghcr.io/openhands/agent-server:61de89c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:61de89c-golang
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-golang
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-golang
ghcr.io/openhands/agent-server:61de89c-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:61de89c-java
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-java
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-java
ghcr.io/openhands/agent-server:61de89c-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:61de89c-python
ghcr.io/openhands/agent-server:61de89cc55a99f675aaa1b3da427a9d6ce823968-python
ghcr.io/openhands/agent-server:alona-fix-subagent-budget-python
ghcr.io/openhands/agent-server:61de89c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim
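The tags above follow a regular pattern, which the sketch below regenerates. The rules (7-character short SHA, "/" in the branch name replaced by "-", and "/" and ":" in the base image replaced by "_s_" and "_tag_") are inferred from this list only, not taken from the CI workflow, and tags_for is a hypothetical helper.

```python
# Hypothetical helper; naming rules inferred from the tag list above.
from typing import List, Optional

REGISTRY = "ghcr.io/openhands/agent-server"


def slug_base_image(image: str) -> str:
    # "nikolaik/python-nodejs:python3.13-nodejs22-slim"
    #   -> "nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim"
    return image.replace("/", "_s_").replace(":", "_tag_")


def tags_for(full_sha: str, branch: str, variant: str, base_image: str,
             arch: Optional[str] = None) -> List[str]:
    short_sha = full_sha[:7]
    branch_slug = branch.replace("/", "-")
    names = [
        f"{short_sha}-{variant}",
        f"{full_sha}-{variant}",
        f"{branch_slug}-{variant}",
        f"{short_sha}-{slug_base_image(base_image)}",
    ]
    # Without arch: the multi-arch manifest tags; with arch: per-platform tags.
    suffix = f"-{arch}" if arch else ""
    return [f"{REGISTRY}:{name}{suffix}" for name in names]


for tag in tags_for("61de89cc55a99f675aaa1b3da427a9d6ce823968",
                    "alona/fix-subagent-budget", "python",
                    "nikolaik/python-nodejs:python3.13-nodejs22-slim", "arm64"):
    print(tag)
```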

About Multi-Architecture Support

  • Each variant tag (e.g., 61de89c-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 61de89c-python-amd64) are also available if needed

github-actions Bot commented Jun 16, 2026

Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

github-actions Bot commented Jun 16, 2026

Contributor

Python API breakage checks — ✅ PASSED

Result: PASSED

github-actions Bot commented Jun 16, 2026

Contributor

Coverage

Coverage Report
File | Stmts | Miss | Cover | Missing
openhands-tools/openhands/tools/task
   manager.py | 180 | 121 | 32% | 82–84, 88–90, 100–101, 103–104, 108, 118, 122, 125, 128–133, 135, 141–142, 146, 150–153, 156–160, 182–183, 185–186, 191, 196, 203–205, 210–213, 222, 227, 234, 247–248, 250, 256, 261–262, 264, 274, 279, 285, 297–298, 300–303, 305, 323–324, 328–329, 331–332, 335, 337, 340, 343, 347–348, 350–353, 355–362, 367–371, 373–374, 376, 385, 390, 395–396, 402–403, 407–409, 411, 414, 416–417, 428–429, 433, 440–441, 449, 453, 458, 460–461
TOTAL | 31270 | 15761 | 49% |

@ak684 ak684 marked this pull request as ready for review June 16, 2026 00:34

@all-hands-bot all-hands-bot left a comment

Collaborator

✅ QA Report: PASS

The PR achieves its stated goal: I reproduced the reported base-branch failures, then exercised the same SDK/task flows on the PR branch and observed the corrected behavior.

Does this PR achieve its stated goal?

Yes. The PR fixes the public local Conversation(..., max_budget_per_run=...) plumbing, makes the remote path fail loudly, surfaces a non-FINISHED sub-agent as an error instead of an empty success, and preserves inherited caps across task resume. I also verified the configured LiteLLM proxy returns non-zero cost metrics and that the local budget-stop path produces an ERROR run-limit event when accumulated cost exceeds the ceiling.

Phase | Result
Environment Setup | uv sync --dev completed and repo remained clean after QA runs
CI Status | 🟡 34 successful, 1 skipped, 1 pending (QA Changes by OpenHands/qa-changes) at review time
Functional Verification | ✅ Base failures reproduced; PR behavior verified through SDK APIs and a live LLM completion
Functional Verification

Test 1: Public Conversation budget argument and remote handling

Step 1 — Reproduce / establish baseline (without the fix):
Ran against detached base 9750af138dd67255d24526a7e438650deac83476:

cd /tmp/oh-sdk-qa-base && \
OPENHANDS_SUPPRESS_BANNER=1 \
PYTHONPATH="/tmp/oh-sdk-qa-base/openhands-sdk:/tmp/oh-sdk-qa-base/openhands-tools:/tmp/oh-sdk-qa-base/openhands-workspace:/tmp/oh-sdk-qa-base/openhands-agent-server" \
/home/runner/work/software-agent-sdk/software-agent-sdk/pr-repo/.venv/bin/python /tmp/qa_subagent_budget.py 2>&1 | grep '^QA:'

Output excerpt:

QA: local_factory_budget=ERROR type=TypeError message="Conversation.__new__() got an unexpected keyword argument 'max_budget_per_run'"
QA: remote_factory_budget=ERROR type=TypeError message="Conversation.__new__() got an unexpected keyword argument 'max_budget_per_run'"

This confirms the reported public-factory bug: a user could not pass max_budget_per_run through Conversation(...) at all.

Step 2 — Apply the PR's changes:
Checked out PR branch alona/fix-subagent-budget at 610f54a1325cd98e1c10e9f83a521b4ff3e7d52d.

Step 3 — Re-run with the fix in place:
Ran the same SDK exercise on the PR branch:

cd /home/runner/work/software-agent-sdk/software-agent-sdk/pr-repo && \
OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_subagent_budget.py 2>&1 | grep '^QA:'

Output excerpt:

QA: local_factory_budget=OK type=LocalConversation budget=2.5
QA: remote_factory_budget=ERROR type=NotImplementedError message='max_budget_per_run is not yet enforced for remote conversations (server-side budget tracking is pending).'

This shows the local user entry point now accepts and forwards the budget, while remote conversations fail explicitly instead of silently ignoring unsupported budget enforcement.
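The fail-loudly pattern this test exercises can be sketched as follows. This budget plumbing belonged to the earlier revision (610f54a) and was dropped before merge, as explained under Notes. LocalConv, RemoteConv and make_conversation are hypothetical stand-ins for LocalConversation and the Conversation(...) factory; only the error message is copied from the QA output.

```python
# Hypothetical sketch: these classes are stand-ins, not the SDK's conversations.
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class LocalConv:
    max_budget_per_run: Optional[float] = None


@dataclass
class RemoteConv:
    host: str


def make_conversation(host: Optional[str] = None,
                      max_budget_per_run: Optional[float] = None) -> Union[LocalConv, RemoteConv]:
    if host is None:
        # Local runs can enforce the ceiling, so the value is forwarded.
        return LocalConv(max_budget_per_run=max_budget_per_run)
    if max_budget_per_run is not None:
        # Refuse instead of silently ignoring an option that would not be enforced.
        raise NotImplementedError(
            "max_budget_per_run is not yet enforced for remote conversations "
            "(server-side budget tracking is pending)."
        )
    return RemoteConv(host=host)


print(make_conversation(max_budget_per_run=2.5))  # LocalConv(max_budget_per_run=2.5)
```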

Test 2: Sub-agent non-FINISHED result and resume cap inheritance

Step 1 — Reproduce / establish baseline (without the fix):
Same base run output:

QA: task_stuck_result status=completed result='' error=None
QA: task_resume_caps first='caps iter=42 budget=7.0' second='caps iter=500 budget=None'

This confirms both reported bugs: a STUCK sub-agent was reported as an empty completed task, and resuming a task dropped inherited caps back to defaults.

Step 2 — Apply the PR's changes:
Checked out PR branch alona/fix-subagent-budget at 610f54a1325cd98e1c10e9f83a521b4ff3e7d52d.

Step 3 — Re-run with the fix in place:
Same PR run output:

QA: task_stuck_result status=error result=None error='Sub-agent stopped without finishing (status: stuck).'
QA: task_resume_caps first='caps iter=42 budget=7.0' second='caps iter=42 budget=7.0'

This verifies the sub-agent stop is now surfaced to the parent as an error, and resumed task conversations retain the parent-provided iteration and budget caps.
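One way to get the resume behavior shown above is to record the parent-provided caps when a task starts and reuse them on resume instead of rebuilding defaults. Caps and TaskRegistry are hypothetical, and the 500-iteration default is taken from the base-branch output. Like the budget plumbing, this belongs to the earlier revision; the merged PR was narrowed to the status-surfacing fix (see Notes).

```python
# Hypothetical sketch of cap inheritance across resume; not the SDK's TaskManager.
from dataclasses import dataclass, field
from typing import Dict, Optional

DEFAULT_MAX_ITERATIONS = 500  # default seen in the base-branch QA output


@dataclass(frozen=True)
class Caps:
    max_iterations: int = DEFAULT_MAX_ITERATIONS
    max_budget: Optional[float] = None


@dataclass
class TaskRegistry:
    _caps: Dict[str, Caps] = field(default_factory=dict)

    def start(self, task_id: str, caps: Caps) -> Caps:
        # Remember the caps the parent supplied when the task was created.
        self._caps[task_id] = caps
        return caps

    def resume(self, task_id: str) -> Caps:
        # Reuse stored caps; only an unknown task falls back to defaults.
        # Always returning Caps() here would reproduce the reported bug.
        return self._caps.get(task_id, Caps())


registry = TaskRegistry()
registry.start("task_1", Caps(max_iterations=42, max_budget=7.0))
print(registry.resume("task_1"))  # Caps(max_iterations=42, max_budget=7.0)
```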

Test 3: Cost signal and local budget stop behavior

Step 1 — Reproduce / establish baseline (without the fix):
The base public Conversation(..., max_budget_per_run=...) entry point failed with TypeError in Test 1, so a real user could not configure this through the public factory before the PR.

Step 2 — Apply the PR's changes:
Used the PR branch with the configured LLM_MODEL, LLM_BASE_URL, and LLM_API_KEY environment.

Step 3 — Re-run with the fix in place:
Ran a live, tiny LLM completion through the configured LiteLLM proxy:

OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
import os
from openhands.sdk import LLM
from openhands.sdk.llm.message import Message, TextContent
llm = LLM(model=os.environ['LLM_MODEL'], api_key=os.environ['LLM_API_KEY'], base_url=os.environ['LLM_BASE_URL'], max_output_tokens=5, num_retries=0)
response = llm.completion([Message(role='user', content=[TextContent(text='Reply with exactly: OK')])], _return_metrics=True)
print(response.model_dump(exclude_none=True, exclude={'raw_response'}))
PY

Output excerpt:

'metrics': {'model_name': 'litellm_proxy/openai/gpt-5.5', 'accumulated_cost': 0.00020500000000000002, ...}

Then exercised a deterministic local budget-stop run:

OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
# Custom agent adds $0.01 to conversation metrics and stays RUNNING; budget is $0.001.
# Full command run in QA environment.
PY

Output:

QA: deterministic_budget_status error
QA: deterministic_budget_errors ['Agent reached maximum budget limit ($0.0010); accumulated cost $0.0100.']

This shows the environment provides a non-zero LiteLLM cost signal for priced models, and the local budget ceiling can stop a still-running conversation with the expected ERROR detail.
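The deterministic stop above can be sketched as a per-step cost check. budget_stop_message and run_steps are hypothetical helpers; the message format is reconstructed from the QA output, and the strict "exceeds" comparison follows the wording of the report. The SDK's actual check may differ.

```python
# Hypothetical sketch of a budget ceiling check; not the SDK implementation.
from typing import List, Optional, Tuple


def budget_stop_message(accumulated_cost: float, max_budget: Optional[float]) -> Optional[str]:
    # No ceiling, or still within it: keep running.
    if max_budget is None or accumulated_cost <= max_budget:
        return None
    return (
        f"Agent reached maximum budget limit (${max_budget:.4f}); "
        f"accumulated cost ${accumulated_cost:.4f}."
    )


def run_steps(step_cost: float, max_budget: float, max_steps: int = 10) -> Tuple[str, List[str]]:
    # Simulates an agent that adds step_cost per step and never finishes on its own.
    cost = 0.0
    for _ in range(max_steps):
        cost += step_cost
        message = budget_stop_message(cost, max_budget)
        if message is not None:
            return "error", [message]
    return "running", []


status, errors = run_steps(step_cost=0.01, max_budget=0.001)
print(status, errors)
# error ['Agent reached maximum budget limit ($0.0010); accumulated cost $0.0100.']
```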

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

@ak684 ak684 force-pushed the alona/fix-subagent-budget branch from 610f54a to 61de89c on June 16, 2026 19:43
61de89c fix(task): surface non-FINISHED sub-agent runs as errors, not empty success

A sub-agent whose run ends in any non-FINISHED terminal status (stuck detection,
paused, a run-limit) was reported to the parent task as an empty "completed"
(result="", is_error=False). _run_task now treats only FINISHED as success; every
other status is surfaced as an error, and the detail preserves any partial output
so the parent can use or retry it.
@ak684 ak684 changed the title fix(task): correct sub-agent budget plumbing and non-FINISHED handling fix(task): surface non-FINISHED sub-agent runs as errors, not empty success Jun 16, 2026
@ak684 ak684 marked this pull request as draft June 16, 2026 19:45
@ak684 ak684 marked this pull request as ready for review June 16, 2026 19:52

@all-hands-bot all-hands-bot left a comment

Collaborator

✅ QA Report: PASS

Verified the Task tool now surfaces stopped/non-FINISHED sub-agent runs as errors, while preserving normal FINISHED success behavior and partial output on stopped runs.

Does this PR achieve its stated goal?

Yes. The stated goal was to stop reporting non-FINISHED sub-agent runs to the parent as empty successes. Exercising the public TaskExecutor path showed that, for a STUCK sub-agent, the base branch returned status completed, is_error=False, and the text "Task completed with no result."; the PR returns status error, is_error=True, and "Sub-agent stopped without finishing (status: stuck)." The PR also preserves partial output for ERROR/run-limit-style stops and leaves FINISHED runs returning successful results.

Phase | Result
Environment Setup | uv sync --dev completed successfully before verification
CI Status | ✅ Latest completed checks observed green; qa-changes was still in progress during this review
Functional Verification | ✅ Public TaskExecutor behavior and stop-detail behavior verified with deterministic SDK-style execution
Functional Verification

Test 1: Public TaskExecutor no longer reports STUCK sub-agent as empty success

Step 1 — Reproduce / establish baseline (without the fix):
Ran a deterministic SDK-style script through TaskExecutor(TaskAction(...), conversation=parent) with a demo sub-conversation ending in ConversationExecutionStatus.STUCK on origin/main:

HEAD is now at 0d77efc4 fix(subagent): freshen explicit condensers per spawn (#3743)
Task 'task_executor_demo' completed.
observation_status= completed
observation_is_error= False
observation_text= Task completed with no result.

This confirms the bug exists on the base branch: a stopped sub-agent is surfaced to the parent as a successful empty result.

Step 2 — Apply the PR's changes:
Checked out commit 61de89cc55a99f675aaa1b3da427a9d6ce823968.

Step 3 — Re-run with the fix in place:
Ran the same TaskExecutor scenario:

HEAD is now at 61de89cc fix(task): surface non-FINISHED sub-agent runs as errors, not empty success
Task 'task_executor_demo' stopped: status 'stuck'.
observation_status= error
observation_is_error= True
observation_text= Sub-agent stopped without finishing (status: stuck).

This shows the parent-facing Task observation now marks the stopped sub-agent as an error with an explanatory status, which is the PR's primary goal.

Test 2: Partial output is preserved for stopped ERROR/run-limit-style tasks

Step 1 — Reproduce / establish baseline (without the fix):
Ran a TaskManager scenario on origin/main where the sub-conversation ended in ERROR, had a ConversationErrorEvent, and had prior agent output:

Task 'task_partial' ended with an error.
task_status= error
is_error= True
error= Agent reached maximum iterations.

This shows the old behavior surfaced the error reason but discarded the available partial agent output.

Step 2 — Apply the PR's changes:
Checked out commit 61de89cc55a99f675aaa1b3da427a9d6ce823968.

Step 3 — Re-run with the fix in place:
Ran the same stopped task scenario:

Task 'task_partial' stopped: status 'error'.
task_status= error
is_error= True
error= Agent reached maximum iterations.
Partial result:
partial answer before stop

This confirms the PR keeps the error semantics while adding the partial result detail promised in the PR description.
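Appending partial output to the stop reason, as in the log above, might look like this. last_agent_text and with_partial are hypothetical helpers, and the "Partial result:" layout is copied from the QA log rather than from the source.

```python
# Hypothetical sketch of attaching partial output to a stop reason.
from typing import Optional, Sequence, Tuple


def last_agent_text(messages: Sequence[Tuple[str, str]]) -> Optional[str]:
    # Most recent non-empty assistant message, if any.
    for role, text in reversed(messages):
        if role == "assistant" and text.strip():
            return text.strip()
    return None


def with_partial(detail: str, partial: Optional[str]) -> str:
    # Keep the stop reason first; append partial output only when present.
    if partial:
        return f"{detail}\nPartial result:\n{partial}"
    return detail


msgs = [("user", "do the thing"), ("assistant", "partial answer before stop")]
print(with_partial("Agent reached maximum iterations.", last_agent_text(msgs)))
```

Putting the reason first keeps the error readable on its own, while the appended text gives the parent something to reuse or retry from.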

Test 3: FINISHED sub-agent success path still works

Step 1 — Establish expected behavior:
A normal finished sub-agent should remain a successful completed task.

Step 2 — Apply the PR's changes:
Checked out commit 61de89cc55a99f675aaa1b3da427a9d6ce823968.

Step 3 — Run with the fix in place:
Ran a deterministic TaskManager scenario with ConversationExecutionStatus.FINISHED and final assistant output:

Task 'task_finished' completed.
task_status= completed
is_error= False
result= finished answer
error= None

This shows the success path still returns the final result and is not incorrectly marked as an error.

Issues Found

None.

This review was generated by an AI agent (OpenHands) on behalf of the user.

@ak684 ak684 merged commit e75b9c2 into main Jun 18, 2026
44 of 45 checks passed
@ak684 ak684 deleted the alona/fix-subagent-budget branch June 18, 2026 16:34