feat(task): persist the sub-agent task index so resume survives a restart #3698
ak684 wants to merge 3 commits into
…tart TaskManager.resume relied on the in-memory _tasks map, which is rebuilt empty in a new process, so a parent restart lost the ability to resume even though each sub-agent conversation's events persist on disk. Persist the task_id->task index to <parent-persistence>/subagents/task_index.json (written on create/resume/evict, atomic replace) and rehydrate it in _ensure_parent, so a fresh TaskManager for the same parent conversation can resume prior tasks. No-op when the parent does not persist; tolerant of a missing/corrupt index.
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
all-hands-bot left a comment
⚠️ QA Report: PASS WITH ISSUES
Functional QA verified the task-index persistence goal end-to-end, but CI is not currently green.
Does this PR achieve its stated goal?
Yes. I reproduced the old behavior on origin/main: after creating a persistent parent task and constructing a fresh TaskManager, the task list rehydrated empty and resume failed with ValueError: Task 'task_00000001' not found. On the PR branch, the same SDK workflow wrote task_index.json, the fresh manager rehydrated task_00000001, and _resume_task(...) loaded the same persisted sub-agent conversation state.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; dependencies installed with uv sync --dev. |
| CI Status | ⚠️ Failing checks (Validate PR description, pre-commit), 20 successes, 11 in progress, 1 skipped. |
| Functional Verification | ✅ Persistent task resume across a fresh manager worked on the PR branch; non-persistent parent path did not create an index. |
Functional Verification
Test 1: Persistent sub-agent task resume across a fresh manager
Step 1 — Reproduce / establish baseline (without the fix):
Checked out origin/main and ran uv run python /tmp/qa_task_resume_index.py.
Key output:
PERSISTENT_CREATED task_id=task_00000001 conversation_id=2cb7554b-f44b-4dfd-99f6-f3f07fb0e15b
PERSISTENT_FIRST_MANAGER_TASKS ['task_00000001']
PERSISTENT_INDEX_EXISTS_AFTER_CREATE False
PERSISTENT_RESTARTED_MANAGER_TASKS []
PERSISTENT_RESUME_ERROR ValueError: Task 'task_00000001' not found. Available tasks:
This confirms the reported bug: the sub-agent conversation persisted, but a fresh manager had no durable task index, so resume could not find the prior task_id.
Step 2 — Apply the PR's changes:
Checked out alona/sdk-task-resume-index at 71d3833e859b7576fb4e0a1330f173491c0b3384.
Step 3 — Re-run with the fix in place:
Ran the same uv run python /tmp/qa_task_resume_index.py command.
Key output:
PERSISTENT_CREATED task_id=task_00000001 conversation_id=aa75c0f0-d1a6-4c5a-85a8-715d41a36ea0
PERSISTENT_INDEX_PATH /tmp/qa-task-resume-aulzmj9z/parent-persist/82ba46de52864ceb8ba292ed16e3d5ca/subagents/task_index.json
PERSISTENT_INDEX_EXISTS_AFTER_CREATE True
PERSISTENT_INDEX_ENTRIES_AFTER_CREATE [{"conversation_id": "aa75c0f0-d1a6-4c5a-85a8-715d41a36ea0", "id": "task_00000001", "status": "running"}]
PERSISTENT_RESTARTED_MANAGER_TASKS ['task_00000001']
PERSISTENT_RESUME_OK same_task=True same_conversation=True loaded_state=True
This shows the PR writes the durable index, a fresh manager reloads the task id, and resume reconstructs the same sub-agent conversation from persisted state.
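The rehydration step can be pictured with a minimal sketch. This is not the PR's code: `load_task_index` is a hypothetical helper, and the on-disk shape (a JSON list of entries carrying `id`, `conversation_id`, and `status`) is assumed from the `PERSISTENT_INDEX_ENTRIES_AFTER_CREATE` output above. It also illustrates the tolerance the PR description asks for: a missing or corrupt index yields an empty map instead of an exception.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_task_index(index_path: Path) -> dict[str, dict]:
    """Return task_id -> entry, or {} if the index is missing or unreadable."""
    try:
        raw = json.loads(index_path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}  # nothing persisted yet: start with an empty map
    except (OSError, ValueError) as e:
        # json.JSONDecodeError is a ValueError, so corrupt JSON lands here.
        logger.warning("Ignoring unreadable task index %s: %s", index_path, e)
        return {}
    if not isinstance(raw, list):
        logger.warning("Ignoring task index with unexpected shape: %s", index_path)
        return {}
    # Skip entries that are not dicts or that lack a string id.
    return {
        entry["id"]: entry
        for entry in raw
        if isinstance(entry, dict) and isinstance(entry.get("id"), str)
    }
```

A fresh manager would call something like this once, before serving `resume`, and seed its in-memory map from the result.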
Test 2: Non-persistent parent remains non-durable
The same PR-branch script also created a task under a parent conversation without persistence_dir.
Key output:
EPHEMERAL_CREATED task_id=task_00000001
EPHEMERAL_MANAGER_DIR /tmp/openhands_tasks_fonb7m5m
EPHEMERAL_INDEX_EXISTS False
This confirms the no-persistence path still avoids writing task_index.json, matching the PR description's no-op behavior for non-durable parents.
Issues Found
- 🟠 Issue: CI is not green at review time: GitHub reports failures for Validate PR description and pre-commit. I did not rerun or investigate those checks because QA was scoped to functional execution, not tests/linters.
This review was created by an AI agent (OpenHands) on behalf of the user.
…hed flag Address review: drop the _persists bool and compute durability in _index_path the same way close() does (parent.state.persistence_dir is not None), so there is no duplicated state to keep in sync.
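A minimal sketch of what "compute durability in `_index_path`" could look like, assuming the index lives under `<parent-persistence>/subagents/task_index.json` as described earlier. `ParentState` and `index_path` are hypothetical stand-ins, not the SDK's actual types; the point is only that the path (or `None`) is derived from `persistence_dir` on every call, so there is no separate flag to drift out of sync.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ParentState:
    """Hypothetical stand-in for the parent conversation state."""

    persistence_dir: Optional[str] = None


def index_path(state: ParentState) -> Optional[Path]:
    """Return the index location, or None when the parent does not persist."""
    if state.persistence_dir is None:
        return None  # non-durable parent: callers skip both reads and writes
    return Path(state.persistence_dir) / "subagents" / "task_index.json"
```

Callers can then treat a `None` result as the no-op path for both loading and saving.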
Major
import contextlib
import os
import tempfile

# Write to a temp file in the same directory, then swap it in with os.replace.
fd, tmp = tempfile.mkstemp(
    dir=index_path.parent, prefix=index_path.name + ".", suffix=".tmp"
)
try:
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(json.dumps(payload))
    os.replace(tmp, index_path)
except OSError as e:
    logger.warning(f"Failed to persist task index: {e}")
    # Best-effort cleanup of the leftover temp file.
    with contextlib.suppress(OSError):
        os.unlink(tmp)
Minor
HUMAN:
Restores resume across process restarts for sub-agent tasks; the index was held only in memory and was lost on restart. Reviewed the change and the cross-process resume test. Confirmed working.
AGENT:
Why
TaskManager resume relied on the in-memory _tasks map, which a new process rebuilds empty (a fresh TaskManager is created per tool construction). So a parent restart lost the ability to resume a sub-agent task, even though each sub-agent conversation's events already persist on disk. The missing piece was a durable task_id → task index.
Summary
- Persist the task_id → task index to <parent-persistence>/subagents/task_index.json, written on create / resume / evict via an atomic temp-file replace.
- Rehydrate it in _ensure_parent, so a fresh TaskManager for the same parent conversation can resume prior tasks across a process restart.
- No-op when the parent does not persist (… parent dir); tolerant of a missing or corrupt index (logged + ignored).
How to Test
Self-verifiable with pytest only (no LLM, no OHE):
Key new test, cross-process resume (test_resume_across_fresh_manager): create a task with a persisting parent, then construct a fresh TaskManager with the same parent conversation id and persistence dir. The fresh manager rehydrates the index from disk and resumes the prior task (the test asserts that the resumed conversation reloads the same on-disk state). Also covered: the index is written only when the parent persists, evicted status is persisted, and a corrupt index is tolerated.
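The shape of the cross-process test can be illustrated with a toy model. This is a sketch under stated assumptions, not the PR's test: `ToyTaskManager` is a hypothetical stand-in that only mirrors a task map to a JSON index, and a second instance pointed at the same directory plays the role of the restarted process.

```python
import json
import tempfile
from pathlib import Path


class ToyTaskManager:
    """Hypothetical stand-in: keeps a task map and mirrors it to a JSON index."""

    def __init__(self, persistence_dir: Path) -> None:
        self._index = persistence_dir / "subagents" / "task_index.json"
        self._tasks: dict[str, dict] = {}
        if self._index.exists():
            # Rehydrate the task map from the durable index.
            for entry in json.loads(self._index.read_text(encoding="utf-8")):
                self._tasks[entry["id"]] = entry

    def create(self, conversation_id: str) -> str:
        task_id = f"task_{len(self._tasks) + 1:08d}"
        self._tasks[task_id] = {
            "id": task_id,
            "conversation_id": conversation_id,
            "status": "running",
        }
        self._save()
        return task_id

    def resume(self, task_id: str) -> dict:
        if task_id not in self._tasks:
            raise ValueError(f"Task '{task_id}' not found.")
        return self._tasks[task_id]

    def _save(self) -> None:
        self._index.parent.mkdir(parents=True, exist_ok=True)
        tmp = self._index.with_name(self._index.name + ".tmp")
        tmp.write_text(json.dumps(list(self._tasks.values())), encoding="utf-8")
        tmp.replace(self._index)  # rename the temp file over the old index


def test_toy_resume_across_fresh_manager() -> None:
    with tempfile.TemporaryDirectory() as d:
        first = ToyTaskManager(Path(d))
        task_id = first.create("conv-1")
        # A new instance simulates a restart: no shared in-memory state.
        restarted = ToyTaskManager(Path(d))
        assert restarted.resume(task_id)["conversation_id"] == "conv-1"
```

Without the `_save` call in `create`, the second instance would start with an empty map and `resume` would raise the same kind of "not found" error the QA report reproduced on origin/main.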
Type
Notes
Additive and back-compat: no behavior change for the non-persisting path; the
index only adds durability to the existing resume affordance. (The separate
"journaling / resume-from-prefix for the workflow tool" idea is intentionally NOT
in scope here — it needs a design discussion due to non-deterministic LLM replay.)
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22-slim
- golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:15088b8-python
Run
All tags pushed for this build
About Multi-Architecture Support
- A variant tag (such as 15088b8-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (such as 15088b8-python-amd64) are also available if needed