feat(task): persist the sub-agent task index so resume survives a restart#3698

Open
ak684 wants to merge 3 commits into
mainfrom
alona/sdk-task-resume-index

@ak684 ak684 commented Jun 14, 2026

Contributor

HUMAN:

Restores resume across process restarts for sub-agent tasks. The task index was held only in memory, so it was lost on restart. Reviewed the change and the cross-process resume test; confirmed working.

  • A human has tested these changes.

AGENT:


Why

TaskManager resume relied on the in-memory _tasks map, which a new process
rebuilds empty (a fresh TaskManager is created per tool construction). So a
parent restart lost the ability to resume a sub-agent task — even though each
sub-agent conversation's events already persist on disk. The missing piece was a
durable task_id → task index.

Summary

  • Persist the task index to <parent-persistence>/subagents/task_index.json,
    written on create / resume / evict via an atomic temp-file replace.
  • Rehydrate it in _ensure_parent, so a fresh TaskManager for the same parent
    conversation can resume prior tasks across a process restart.
  • No-op when the parent does not persist (the index lives only under a durable
    parent dir); tolerant of a missing or corrupt index (logged + ignored).
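
The three behaviours in the summary can be sketched in isolation. This is a minimal illustration, not the code in manager.py: the function names `save_index` and `load_index`, the list-of-dicts payload, and the use of a uniquely named temp file are assumptions made for the example; only the `subagents/task_index.json` location comes from the description above.

```python
from __future__ import annotations

import contextlib
import json
import logging
import os
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)

# Relative location taken from the PR summary.
INDEX_RELATIVE_PATH = Path("subagents") / "task_index.json"


def save_index(parent_dir: Path | None, tasks: list[dict]) -> bool:
    """Hypothetical helper: write the index atomically, or do nothing."""
    if parent_dir is None:  # non-persisting parent: deliberate no-op
        return False
    index_path = parent_dir / INDEX_RELATIVE_PATH
    tmp = None
    try:
        index_path.parent.mkdir(parents=True, exist_ok=True)
        # The temp file lives in the target directory so os.replace stays
        # on one filesystem.
        fd, tmp = tempfile.mkstemp(dir=index_path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(tasks, f)
        os.replace(tmp, index_path)
        return True
    except OSError as exc:
        logger.warning("Failed to persist task index: %s", exc)
        if tmp is not None:
            with contextlib.suppress(OSError):
                os.unlink(tmp)
        return False


def load_index(parent_dir: Path | None) -> list[dict]:
    """Hypothetical helper: a missing or corrupt index yields an empty list."""
    if parent_dir is None:
        return []
    index_path = parent_dir / INDEX_RELATIVE_PATH
    try:
        data = json.loads(index_path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return []  # nothing persisted yet
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        logger.warning("Ignoring unreadable task index: %s", exc)
        return []
    return data if isinstance(data, list) else []
```

A fresh process would call `load_index` with the same parent directory and find whatever the previous process last saved.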

How to Test

Self-verifiable with pytest only (no LLM, no OHE):

uv run pytest tests/tools/task            # 65 passed
uv run pytest tests/tools                 # 912 passed, 9 skipped, 0 failed
uv run ruff check && uv run ruff format   # clean

Key new test — cross-process resume (test_resume_across_fresh_manager): create a
task with a persisting parent, then a fresh TaskManager with the same parent
conversation id + persistence dir rehydrates the index from disk and resumes the
prior task (asserts the resumed conversation reloads the same on-disk state). Also
covered: index written only when the parent persists, evicted status persisted,
and corrupt-index tolerance.
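
The shape of that cross-process test can be sketched without the SDK. `MiniTaskManager` below is a hypothetical stand-in that models only the index behaviour; the real test uses TaskManager with a persisting parent conversation, and its fixtures are not shown in this PR description.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


class MiniTaskManager:
    """Hypothetical stand-in for TaskManager; only the index is modelled."""

    def __init__(self, persistence_dir: Path | None) -> None:
        self._index_path = (
            persistence_dir / "subagents" / "task_index.json"
            if persistence_dir is not None
            else None
        )
        self._tasks: dict[str, dict] = {}
        # Rehydrate from disk, as a freshly started process would.
        if self._index_path is not None and self._index_path.exists():
            for entry in json.loads(self._index_path.read_text(encoding="utf-8")):
                self._tasks[entry["id"]] = entry

    def create(self, conversation_id: str) -> str:
        task_id = f"task_{len(self._tasks) + 1:08d}"
        self._tasks[task_id] = {
            "id": task_id,
            "conversation_id": conversation_id,
            "status": "running",
        }
        self._save()
        return task_id

    def resume(self, task_id: str) -> dict:
        if task_id not in self._tasks:
            raise ValueError(f"Task {task_id!r} not found")
        return self._tasks[task_id]

    def _save(self) -> None:
        if self._index_path is None:  # non-persisting parent: no index
            return
        self._index_path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self._index_path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(list(self._tasks.values()), f)
        os.replace(tmp, self._index_path)


def test_resume_across_fresh_manager_sketch() -> None:
    with tempfile.TemporaryDirectory() as raw:
        parent_dir = Path(raw)
        task_id = MiniTaskManager(parent_dir).create(conversation_id="conv-a")

        # A second instance simulates the restarted process.
        restarted = MiniTaskManager(parent_dir)
        assert restarted.resume(task_id)["conversation_id"] == "conv-a"

        # A non-persisting parent must not leave an index behind.
        MiniTaskManager(None).create(conversation_id="conv-b")
        assert len(list(parent_dir.rglob("task_index.json"))) == 1
```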

Type

  • Feature

Notes

Additive and back-compat: no behavior change for the non-persisting path; the
index only adds durability to the existing resume affordance. (The separate
"journaling / resume-from-prefix for the workflow tool" idea is intentionally NOT
in scope here — it needs a design discussion due to non-deterministic LLM replay.)


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:15088b8-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-15088b8-python \
  ghcr.io/openhands/agent-server:15088b8-python

All tags pushed for this build

ghcr.io/openhands/agent-server:15088b8-golang-amd64
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-golang-amd64
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-golang-amd64
ghcr.io/openhands/agent-server:15088b8-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:15088b8-golang-arm64
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-golang-arm64
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-golang-arm64
ghcr.io/openhands/agent-server:15088b8-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:15088b8-java-amd64
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-java-amd64
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-java-amd64
ghcr.io/openhands/agent-server:15088b8-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:15088b8-java-arm64
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-java-arm64
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-java-arm64
ghcr.io/openhands/agent-server:15088b8-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:15088b8-python-amd64
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-python-amd64
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-python-amd64
ghcr.io/openhands/agent-server:15088b8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:15088b8-python-arm64
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-python-arm64
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-python-arm64
ghcr.io/openhands/agent-server:15088b8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:15088b8-golang
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-golang
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-golang
ghcr.io/openhands/agent-server:15088b8-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:15088b8-java
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-java
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-java
ghcr.io/openhands/agent-server:15088b8-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:15088b8-python
ghcr.io/openhands/agent-server:15088b8dc37cee988bcd11f40d5500642f5fa49a-python
ghcr.io/openhands/agent-server:alona-sdk-task-resume-index-python
ghcr.io/openhands/agent-server:15088b8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 15088b8-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 15088b8-python-amd64) are also available if needed

…tart

TaskManager.resume relied on the in-memory _tasks map, which is rebuilt empty in
a new process, so a parent restart lost the ability to resume even though each
sub-agent conversation's events persist on disk. Persist the task_id->task index
to <parent-persistence>/subagents/task_index.json (written on create/resume/evict,
atomic replace) and rehydrate it in _ensure_parent, so a fresh TaskManager for the
same parent conversation can resume prior tasks. No-op when the parent does not
persist; tolerant of a missing/corrupt index.
@github-actions

github-actions Bot commented Jun 14, 2026

Contributor

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions

github-actions Bot commented Jun 14, 2026

Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

@all-hands-bot all-hands-bot left a comment

Collaborator

⚠️ QA Report: PASS WITH ISSUES

Functional QA verified the task-index persistence goal end-to-end, but CI is not currently green.

Does this PR achieve its stated goal?

Yes. I reproduced the old behavior on origin/main: after creating a persistent parent task and constructing a fresh TaskManager, the task list rehydrated empty and resume failed with ValueError: Task 'task_00000001' not found. On the PR branch, the same SDK workflow wrote task_index.json, the fresh manager rehydrated task_00000001, and _resume_task(...) loaded the same persisted sub-agent conversation state.

| Phase | Result |
|---|---|
| Environment Setup | make build completed successfully; dependencies installed with uv sync --dev. |
| CI Status | ⚠️ Latest check rollup showed 2 failures (Validate PR description, pre-commit), 20 successes, 11 in progress, 1 skipped. |
| Functional Verification | ✅ Persistent task resume across a fresh manager worked on the PR branch; non-persistent parent path did not create an index. |

Functional Verification

Test 1: Persistent sub-agent task resume across a fresh manager

Step 1 — Reproduce / establish baseline (without the fix):
Checked out origin/main and ran uv run python /tmp/qa_task_resume_index.py.

Key output:

PERSISTENT_CREATED task_id=task_00000001 conversation_id=2cb7554b-f44b-4dfd-99f6-f3f07fb0e15b
PERSISTENT_FIRST_MANAGER_TASKS ['task_00000001']
PERSISTENT_INDEX_EXISTS_AFTER_CREATE False
PERSISTENT_RESTARTED_MANAGER_TASKS []
PERSISTENT_RESUME_ERROR ValueError: Task 'task_00000001' not found. Available tasks:

This confirms the reported bug: the sub-agent conversation persisted, but a fresh manager had no durable task index, so resume could not find the prior task_id.

Step 2 — Apply the PR's changes:
Checked out alona/sdk-task-resume-index at 71d3833e859b7576fb4e0a1330f173491c0b3384.

Step 3 — Re-run with the fix in place:
Ran the same uv run python /tmp/qa_task_resume_index.py command.

Key output:

PERSISTENT_CREATED task_id=task_00000001 conversation_id=aa75c0f0-d1a6-4c5a-85a8-715d41a36ea0
PERSISTENT_INDEX_PATH /tmp/qa-task-resume-aulzmj9z/parent-persist/82ba46de52864ceb8ba292ed16e3d5ca/subagents/task_index.json
PERSISTENT_INDEX_EXISTS_AFTER_CREATE True
PERSISTENT_INDEX_ENTRIES_AFTER_CREATE [{"conversation_id": "aa75c0f0-d1a6-4c5a-85a8-715d41a36ea0", "id": "task_00000001", "status": "running"}]
PERSISTENT_RESTARTED_MANAGER_TASKS ['task_00000001']
PERSISTENT_RESUME_OK same_task=True same_conversation=True loaded_state=True

This shows the PR writes the durable index, a fresh manager reloads the task id, and resume reconstructs the same sub-agent conversation from persisted state.

Test 2: Non-persistent parent remains non-durable

The same PR-branch script also created a task under a parent conversation without persistence_dir.

Key output:

EPHEMERAL_CREATED task_id=task_00000001
EPHEMERAL_MANAGER_DIR /tmp/openhands_tasks_fonb7m5m
EPHEMERAL_INDEX_EXISTS False

This confirms the no-persistence path still avoids writing task_index.json, matching the PR description's no-op behavior for non-durable parents.

Issues Found

  • 🟠 Issue: CI is not green at review time: GitHub reports failures for Validate PR description and pre-commit. I did not rerun or investigate those checks because QA was scoped to functional execution, not tests/linters.

This review was created by an AI agent (OpenHands) on behalf of the user.

@github-actions

github-actions Bot commented Jun 14, 2026

Contributor

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-tools/openhands/tools/task | | | | |
| manager.py | 213 | 151 | 29% | 83–85, 89–91, 101–102, 104–105, 109, 119, 123, 126, 129–134, 136, 140, 147–152, 157–167, 173–187, 189, 193–194, 198, 202–205, 208–212, 214, 236–237, 239–240, 245, 250, 257–259, 264–267, 276, 281, 287, 289–290, 303–304, 306, 312–313, 315, 324, 329, 335, 337–338, 349–350, 352–355, 357, 374–375, 379–380, 382–383, 386, 388, 391, 394, 398–399, 401–404, 406–414, 416–417, 419, 425–426, 430–432, 434, 437, 439–440, 451–452, 456, 463–464, 472, 476, 481, 483–484 |
| TOTAL | 31184 | 15707 | 49% | |

Comment thread openhands-tools/openhands/tools/task/manager.py Outdated
…hed flag

Address review: drop the _persists bool and compute durability in _index_path the
same way close() does (parent.state.persistence_dir is not None), so there is no
duplicated state to keep in sync.
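
A rough sketch of that refactor, assuming only what the commit message states (durability is `parent.state.persistence_dir is not None`). `FakeState`, `FakeParent`, and `IndexLocator` are hypothetical names invented for the illustration; the real parent is a conversation object.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class FakeState:
    """Hypothetical stand-in for the parent conversation's state."""

    persistence_dir: str | None = None


@dataclass
class FakeParent:
    """Hypothetical stand-in for the parent conversation."""

    state: FakeState


class IndexLocator:
    """Derives the index path on demand instead of caching a bool flag."""

    def __init__(self, parent: FakeParent) -> None:
        self._parent = parent

    @property
    def index_path(self) -> Path | None:
        persistence_dir = self._parent.state.persistence_dir
        if persistence_dir is None:
            return None  # non-durable parent: callers treat this as a no-op
        return Path(persistence_dir) / "subagents" / "task_index.json"
```

Because the path is recomputed from the parent on each access, there is no second piece of state that could drift from `persistence_dir`.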
@ak684 ak684 requested a review from VascoSch92 June 14, 2026 19:27
@VascoSch92

Member

Major

  • _save_index uses a fixed temp filename, which races under parallel tool execution.

    tmp_path = index_path.parent / f"{index_path.name}.tmp"   # same path for every concurrent caller
    tmp_path.write_text(json.dumps(payload), encoding="utf-8")
    tmp_path.replace(index_path)

    TaskTool.declared_resources() returns DeclaredResources(keys=(), declared=True), which ParallelToolExecutor treats as "no locking" (parallel_executor.py: declared=True, empty keys → no lock). So when tool_concurrency_limit > 1 and the agent emits ≥2 task calls in a single step against a persisting parent, two _create_task/_evict_task/_resume_task calls run _save_index concurrently and both write the same task_index.json.tmp:

    • the loser's replace raises FileNotFoundError (file already moved) → swallowed by except OSError and logged as a warning;
    • the shared tmp can be truncated-on-open by one writer while another is mid-write → a corrupt file can get replaced into the index.

    _load_index tolerates corruption, so there's no crash. The existing test_task_manager_thread_safety.py cases don't catch it because their parent doesn't persist, so _index_path is None and _save_index no-ops.

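    A fix could be a unique temp file in the same directory, followed by an atomic replace:

    import contextlib
    import os
    import tempfile

    # mkstemp creates a uniquely named file, so concurrent writers never share a temp path
    fd, tmp = tempfile.mkstemp(dir=index_path.parent, prefix=index_path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(json.dumps(payload))
        os.replace(tmp, index_path)  # atomic because tmp is in the same directory
    except OSError as e:
        logger.warning(f"Failed to persist task index: {e}")
        # best-effort cleanup of the leftover temp file
        with contextlib.suppress(OSError):
            os.unlink(tmp)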
  • _load_index isn't fully corruption-tolerant: scalar JSON raises an uncaught TypeError.
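
The scalar-JSON case is easy to reproduce in isolation: `json.loads("42")` succeeds and returns an int, and iterating over it then raises TypeError rather than a decode error. The body of _load_index is not shown in this thread, so the following is only a sketch of one way to validate the parsed shape; `parse_index` is a hypothetical name.

```python
import json


def parse_index(text: str) -> list[dict]:
    """Hypothetical shape-checking parser for the task index payload."""
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return []
    # Valid JSON is not necessarily a list: 42, "x", null and {} all parse.
    if not isinstance(data, list):
        return []
    # Keep only entries that look like task records; skip stray values.
    return [
        item
        for item in data
        if isinstance(item, dict) and isinstance(item.get("id"), str)
    ]
```

Checking the type before iterating means a scalar or object payload is handled the same way as undecodable text.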

Minor

  • _save_index rewrites the entire task list on every create/resume/evict → O(N²) writes over a long-lived parent that spawns many sub-agents.
  • _resume_task persists a transient status: RUNNING before the run starts; if the process dies mid-run the on-disk status is stale RUNNING (still resumable, just can't be trusted as "currently running").
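
The quadratic growth in the first minor point follows from simple counting. The sketch below assumes every save serializes all entries and that evicted tasks stay in the index (the PR description says the evicted status is persisted); it counts entries written, not bytes, and `entries_written_full_rewrite` is a name made up for the illustration.

```python
def entries_written_full_rewrite(n_creates: int) -> int:
    """Total entries serialized if each create rewrites the whole index."""
    total = 0
    for index_size in range(1, n_creates + 1):
        total += index_size  # the k-th create writes k entries
    return total


def closed_form(n_creates: int) -> int:
    """Same quantity as the arithmetic series sum 1 + 2 + ... + N."""
    return n_creates * (n_creates + 1) // 2
```

For 1,000 creates this comes to 500,500 serialized entries. Resume and evict calls add further full rewrites on top of that.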


3 participants