fix(task): surface non-FINISHED sub-agent runs as errors, not empty success#3742
Conversation
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
The PR achieves its stated goal: I reproduced the reported base-branch failures, then exercised the same SDK/task flows on the PR branch and observed the corrected behavior.
Does this PR achieve its stated goal?
Yes. The PR fixes the public local Conversation(..., max_budget_per_run=...) plumbing, makes the remote path fail loudly, surfaces a non-FINISHED sub-agent as an error instead of an empty success, and preserves inherited caps across task resume. I also verified the configured LiteLLM proxy returns non-zero cost metrics and that the local budget-stop path produces an ERROR run-limit event when accumulated cost exceeds the ceiling.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv sync --dev completed and repo remained clean after QA runs |
| CI Status | 🟡 34 successful, 1 skipped, 1 pending (QA Changes by OpenHands/qa-changes) at review time |
| Functional Verification | ✅ Base failures reproduced; PR behavior verified through SDK APIs and a live LLM completion |
Functional Verification
Test 1: Public Conversation budget argument and remote handling
Step 1 — Reproduce / establish baseline (without the fix):
Ran against detached base 9750af138dd67255d24526a7e438650deac83476:
cd /tmp/oh-sdk-qa-base && \
OPENHANDS_SUPPRESS_BANNER=1 \
PYTHONPATH="/tmp/oh-sdk-qa-base/openhands-sdk:/tmp/oh-sdk-qa-base/openhands-tools:/tmp/oh-sdk-qa-base/openhands-workspace:/tmp/oh-sdk-qa-base/openhands-agent-server" \
/home/runner/work/software-agent-sdk/software-agent-sdk/pr-repo/.venv/bin/python /tmp/qa_subagent_budget.py 2>&1 | grep '^QA:'Output excerpt:
QA: local_factory_budget=ERROR type=TypeError message="Conversation.__new__() got an unexpected keyword argument 'max_budget_per_run'"
QA: remote_factory_budget=ERROR type=TypeError message="Conversation.__new__() got an unexpected keyword argument 'max_budget_per_run'"
This confirms the reported public-factory bug: a user could not pass max_budget_per_run through Conversation(...) at all.
Step 2 — Apply the PR's changes:
Checked out PR branch alona/fix-subagent-budget at 610f54a1325cd98e1c10e9f83a521b4ff3e7d52d.
Step 3 — Re-run with the fix in place:
Ran the same SDK exercise on the PR branch:
cd /home/runner/work/software-agent-sdk/software-agent-sdk/pr-repo && \
OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_subagent_budget.py 2>&1 | grep '^QA:'Output excerpt:
QA: local_factory_budget=OK type=LocalConversation budget=2.5
QA: remote_factory_budget=ERROR type=NotImplementedError message='max_budget_per_run is not yet enforced for remote conversations (server-side budget tracking is pending).'
This shows the local user entry point now accepts and forwards the budget, while remote conversations fail explicitly instead of silently ignoring unsupported budget enforcement.
Test 2: Sub-agent non-FINISHED result and resume cap inheritance
Step 1 — Reproduce / establish baseline (without the fix):
Same base run output:
QA: task_stuck_result status=completed result='' error=None
QA: task_resume_caps first='caps iter=42 budget=7.0' second='caps iter=500 budget=None'
This confirms both reported bugs: a STUCK sub-agent was reported as an empty completed task, and resuming a task dropped inherited caps back to defaults.
Step 2 — Apply the PR's changes:
Checked out PR branch alona/fix-subagent-budget at 610f54a1325cd98e1c10e9f83a521b4ff3e7d52d.
Step 3 — Re-run with the fix in place:
Same PR run output:
QA: task_stuck_result status=error result=None error='Sub-agent stopped without finishing (status: stuck).'
QA: task_resume_caps first='caps iter=42 budget=7.0' second='caps iter=42 budget=7.0'
This verifies the sub-agent stop is now surfaced to the parent as an error, and resumed task conversations retain the parent-provided iteration and budget caps.
Test 3: Cost signal and local budget stop behavior
Step 1 — Reproduce / establish baseline (without the fix):
The base public Conversation(..., max_budget_per_run=...) entry point failed with TypeError in Test 1, so a real user could not configure this through the public factory before the PR.
Step 2 — Apply the PR's changes:
Used the PR branch with the configured LLM_MODEL, LLM_BASE_URL, and LLM_API_KEY environment.
Step 3 — Re-run with the fix in place:
Ran a live, tiny LLM completion through the configured LiteLLM proxy:
OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
import os
from openhands.sdk import LLM
from openhands.sdk.llm.message import Message, TextContent
llm = LLM(model=os.environ['LLM_MODEL'], api_key=os.environ['LLM_API_KEY'], base_url=os.environ['LLM_BASE_URL'], max_output_tokens=5, num_retries=0)
response = llm.completion([Message(role='user', content=[TextContent(text='Reply with exactly: OK')])], _return_metrics=True)
print(response.model_dump(exclude_none=True, exclude={'raw_response'}))
PYOutput excerpt:
'metrics': {'model_name': 'litellm_proxy/openai/gpt-5.5', 'accumulated_cost': 0.00020500000000000002, ...}
Then exercised a deterministic local budget-stop run:
OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
# Custom agent adds $0.01 to conversation metrics and stays RUNNING; budget is $0.001.
# Full command run in QA environment.
PYOutput:
QA: deterministic_budget_status error
QA: deterministic_budget_errors ['Agent reached maximum budget limit ($0.0010); accumulated cost $0.0100.']
This shows the environment provides a non-zero LiteLLM cost signal for priced models, and the local budget ceiling can stop a still-running conversation with the expected ERROR detail.
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
610f54a to
61de89c
Compare
…uccess A sub-agent whose run ends in any non-FINISHED terminal status (stuck detection, paused, a run-limit) was reported to the parent task as an empty "completed" (result="", is_error=False). _run_task now treats only FINISHED as success; every other status is surfaced as an error, and the detail preserves any partial output so the parent can use or retry it.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified the Task tool now surfaces stopped/non-FINISHED sub-agent runs as errors, while preserving normal FINISHED success behavior and partial output on stopped runs.
Does this PR achieve its stated goal?
Yes. The stated goal was to stop reporting non-FINISHED sub-agent runs as empty successes to the parent. Exercising the public TaskExecutor path showed the base branch returned completed, is_error=False, and Task completed with no result. for a STUCK sub-agent; the PR returns error, is_error=True, and Sub-agent stopped without finishing (status: stuck). The PR also preserves partial output for ERROR/run-limit-style stops and leaves FINISHED runs returning successful results.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv sync --dev completed successfully before verification |
| CI Status | ✅ Latest completed checks observed green; qa-changes was still in progress during this review |
| Functional Verification | ✅ Public TaskExecutor behavior and stop-detail behavior verified with deterministic SDK-style execution |
Functional Verification
Test 1: Public TaskExecutor no longer reports STUCK sub-agent as empty success
Step 1 — Reproduce / establish baseline (without the fix):
Ran a deterministic SDK-style script through TaskExecutor(TaskAction(...), conversation=parent) with a demo sub-conversation ending in ConversationExecutionStatus.STUCK on origin/main:
HEAD is now at 0d77efc4 fix(subagent): freshen explicit condensers per spawn (#3743)
Task 'task_executor_demo' completed.
observation_status= completed
observation_is_error= False
observation_text= Task completed with no result.
This confirms the bug exists on the base branch: a stopped sub-agent is surfaced to the parent as a successful empty result.
Step 2 — Apply the PR's changes:
Checked out commit 61de89cc55a99f675aaa1b3da427a9d6ce823968.
Step 3 — Re-run with the fix in place:
Ran the same TaskExecutor scenario:
HEAD is now at 61de89cc fix(task): surface non-FINISHED sub-agent runs as errors, not empty success
Task 'task_executor_demo' stopped: status 'stuck'.
observation_status= error
observation_is_error= True
observation_text= Sub-agent stopped without finishing (status: stuck).
This shows the parent-facing Task observation now marks the stopped sub-agent as an error with an explanatory status, which is the PR's primary goal.
Test 2: Partial output is preserved for stopped ERROR/run-limit-style tasks
Step 1 — Reproduce / establish baseline (without the fix):
Ran a TaskManager scenario on origin/main where the sub-conversation ended in ERROR, had a ConversationErrorEvent, and had prior agent output:
Task 'task_partial' ended with an error.
task_status= error
is_error= True
error= Agent reached maximum iterations.
This shows the old behavior surfaced the error reason but discarded the available partial agent output.
Step 2 — Apply the PR's changes:
Checked out commit 61de89cc55a99f675aaa1b3da427a9d6ce823968.
Step 3 — Re-run with the fix in place:
Ran the same stopped task scenario:
Task 'task_partial' stopped: status 'error'.
task_status= error
is_error= True
error= Agent reached maximum iterations.
Partial result:
partial answer before stop
This confirms the PR keeps the error semantics while adding the partial result detail promised in the PR description.
Test 3: FINISHED sub-agent success path still works
Step 1 — Establish expected behavior:
A normal finished sub-agent should remain a successful completed task.
Step 2 — Apply the PR's changes:
Checked out commit 61de89cc55a99f675aaa1b3da427a9d6ce823968.
Step 3 — Run with the fix in place:
Ran a deterministic TaskManager scenario with ConversationExecutionStatus.FINISHED and final assistant output:
Task 'task_finished' completed.
task_status= completed
is_error= False
result= finished answer
error= None
This shows the success path still returns the final result and is not incorrectly marked as an error.
Issues Found
None.
This review was generated by an AI agent (OpenHands) on behalf of the user.
HUMAN:
bug fix: a non-FINISHED sub-agent (stuck/paused/run-limit) was reported to the parent as an empty success.
AGENT:
Narrowly scoped bug fix for sub-agent (Task tool) result surfacing.
uv run pytest tests/tools/task/test_task_manager.py= 53 pass; pre-commit (ruff + pyright) clean.Why
A sub-agent whose run ends in any non-
FINISHEDterminal status — stuck detection, paused, or a run-limit — was reported to the parent task as an empty success (Task completed with no result,is_error=False)._run_taskonly special-casedERROR; every other non-FINISHEDstatus fell through to the success branch, so the parent agent couldn't distinguish a real result from a silently-aborted sub-agent.Summary
_run_tasknow treats onlyFINISHEDas success; any other terminal status (stuck, paused, run-limit) is surfaced to the parent as an error._run_stop_detail) so the parent can use or retry it instead of getting nothing.Issue Number
(none)
How to Test
Covers: a non-FINISHED (stuck) status surfaced as an error;
_run_stop_detailreturns the last error event / falls back to a status message.Type
Notes
This PR was narrowed to the status-surfacing fix only; the per-run budget plumbing it originally carried was dropped so the budget can be reworked separately. The change here is independent of the budget feature.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:61de89c-pythonRun
All tags pushed for this build
About Multi-Architecture Support
61de89c-python) is a multi-arch manifest supporting both amd64 and arm6461de89c-python-amd64) are also available if needed