
feat(agent-server): add deferred-init / dormant mode#3287

Merged
tofarr merged 17 commits into main from feat/agent-server-deferred-init
Jun 17, 2026

Conversation

@tofarr

@tofarr tofarr commented May 17, 2026

Collaborator

H:

This paves the way for performance upgrades and tightening security in the K8s environment.


AGENT:

This PR was opened by an AI agent (OpenHands) on behalf of @tofarr. All 15 new tests in tests/agent_server/test_init_router.py pass; the rest of tests/agent_server/ is unaffected (one pre-existing failure in test_terminal_service.py::test_terminal_does_not_expose_session_api_key_via_env_command also reproduces on main, unrelated to this change).

Why

Implements the warm-pool agent-server proposal in #2523. Enables Kubernetes warm pods to be matched with users after boot, without pre-attached PVCs — the orchestrator starts a pool of dormant agent-server pods and delivers a per-user runtime config via /init when a conversation begins.

Summary

  • New Config.deferred_init (env OH_DEFERRED_INIT) puts the server in dormant mode at startup; InitService (init_router.py) owns the dormant → initializing → ready state machine with bootstrap auth via the existing secret_key (X-Init-API-Key header)
  • require_initialized FastAPI dependency gates all /api/* routes with 503 until ready; /ready flips to 200 as soon as stateless services are up so the orchestrator knows the pod is available for /init
  • Legacy path (deferred_init=False) is completely unaffected — no InitService is attached, /init routes return 404, and /api/* is live from startup as before
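
As a rough illustration of the gating rules listed above, the sketch below maps a request path to the status the dormant gate alone would produce. It is not the PR's `require_initialized` code: `InitState` and `gate_status` are hypothetical names, and bootstrap auth on `POST /api/init` is deliberately left out.

```python
from enum import Enum


class InitState(str, Enum):
    # Hypothetical enum mirroring the dormant -> initializing -> ready states.
    DORMANT = "dormant"
    INITIALIZING = "initializing"
    READY = "ready"


def gate_status(path: str, deferred_init: bool, state: InitState) -> int:
    """Return the HTTP status the dormant gate alone would yield for a path."""
    if path.startswith("/api/init"):
        # Legacy mode attaches no InitService, so the init routes report 404.
        # Auth on POST /api/init is ignored in this sketch.
        return 200 if deferred_init else 404
    if not deferred_init or not path.startswith("/api/"):
        # Legacy mode, or a route outside the gate such as /ready or /health.
        return 200
    # Gated /api/* route in deferred mode: only live once the server is ready.
    return 200 if state is InitState.READY else 503
```

For example, `gate_status("/api/conversations/count", True, InitState.DORMANT)` yields 503, while the same call with `InitState.READY` yields 200.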

Issue Number

#2523

How to Test

uv run pytest tests/agent_server/test_init_router.py -v

Covers: config defaults + env wiring, InitRequest → Config merging, state machine transitions (dormant → ready, double-init 400, failed init rolls back to dormant), end-to-end over api_lifespan + TestClient (503 gating before /init, 200 after, auth checks — 401 wrong key, 200 right key, GET unauthenticated), and lifespan teardown.
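
The state-machine behavior those tests exercise can be sketched in a few lines. This is a minimal stand-in under stated assumptions, not the `InitService` from `init_router.py`: the class name, the exception type, and the `apply_config` callback are all hypothetical.

```python
import threading
from typing import Callable, Optional


class AlreadyInitializedError(Exception):
    """Init was requested outside the dormant state (would map to HTTP 400)."""


class InitStateMachine:
    """Hypothetical dormant -> initializing -> ready machine with rollback."""

    def __init__(self) -> None:
        self.state = "dormant"
        self.error: Optional[str] = None
        self._lock = threading.Lock()

    def initialize(self, apply_config: Callable[[], None]) -> str:
        # Claim the transition under the lock so concurrent calls cannot both win.
        with self._lock:
            if self.state != "dormant":
                raise AlreadyInitializedError(
                    f"server already in state: {self.state}"
                )
            self.state = "initializing"
        try:
            apply_config()  # e.g. merge config, start per-user services
        except Exception as exc:
            # A failed init rolls back to dormant so the caller can retry.
            with self._lock:
                self.state = "dormant"
                self.error = str(exc)
            raise
        with self._lock:
            self.state = "ready"
            self.error = None
        return self.state
```

Claiming the `initializing` state before doing any work is what makes a second concurrent call fail fast instead of running the setup twice.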

Manual smoke test:

OH_DEFERRED_INIT=true OH_SECRET_KEY=mysecret uv run python -m openhands.agent_server
# /api/* returns 503 in dormant state
curl -X POST http://localhost:8000/api/init \
  -H "X-Init-API-Key: mysecret" -H "Content-Type: application/json" -d '{}'
# /api/* is now live
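
The same call can be made from Python with only the standard library. The sketch below builds the request from the smoke test; `build_init_request` and `send` are hypothetical helper names, and the base URL and key are the placeholder values used above.

```python
import json
import urllib.request


def build_init_request(
    base_url: str, secret_key: str, payload: dict
) -> urllib.request.Request:
    """Build (but do not send) the POST /api/init request from the smoke test."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/init",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-Init-API-Key": secret_key,  # bootstrap credential (OH_SECRET_KEY)
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body; raises HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `send(build_init_request("http://localhost:8000", "mysecret", {}))` against a dormant server would correspond to the `curl` command above.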

Video/Screenshots

No GUI changes. See test output above.

Type

  • Bug fix
  • Feature
  • Refactor
  • Breaking change
  • Docs / chore

Notes

Intentional scope limits (see #2523 discussion for rationale):

  • No /deinit yet — once ready, the server stays ready for the process lifetime; recyclable init is a follow-up requiring a workspace-flush story
  • No workspace-storage integration (rclone / S3 pull-on-init); InitRequest accepts conversations_path and bash_events_dir so an orchestrator can pre-populate before calling /init
  • No Workspace-class SDK integration — the two-phase start() / attach(config) API deserves its own PR once this server-side primitive is merged
  • Session-key timing trade-off: keys delivered via /init populate app.state.config but are not enforced by the auth dependency (documented in test). Production deployments should set OH_SESSION_API_KEYS_0 at pod start; the dormant gate already guarantees no traffic reaches gated routes before /init
  • No new Docker/k8s deployment changes — same images, same entrypoint, toggled by env var

Documentation example: OpenHands/docs#577


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image | Docs / Tags
--- | --- | --- | ---
java | amd64, arm64 | eclipse-temurin:17-jdk | Link
python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link
golang | amd64, arm64 | golang:1.21-bookworm | Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:3e89d92-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-3e89d92-python \
  ghcr.io/openhands/agent-server:3e89d92-python

All tags pushed for this build

ghcr.io/openhands/agent-server:3e89d92-golang-amd64
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-golang-amd64
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-golang-amd64
ghcr.io/openhands/agent-server:3e89d92-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:3e89d92-golang-arm64
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-golang-arm64
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-golang-arm64
ghcr.io/openhands/agent-server:3e89d92-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:3e89d92-java-amd64
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-java-amd64
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-java-amd64
ghcr.io/openhands/agent-server:3e89d92-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:3e89d92-java-arm64
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-java-arm64
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-java-arm64
ghcr.io/openhands/agent-server:3e89d92-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:3e89d92-python-amd64
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-python-amd64
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-python-amd64
ghcr.io/openhands/agent-server:3e89d92-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:3e89d92-python-arm64
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-python-arm64
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-python-arm64
ghcr.io/openhands/agent-server:3e89d92-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:3e89d92-golang
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-golang
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-golang
ghcr.io/openhands/agent-server:3e89d92-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:3e89d92-java
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-java
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-java
ghcr.io/openhands/agent-server:3e89d92-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:3e89d92-python
ghcr.io/openhands/agent-server:3e89d92ef30553bbf79b7f14650a683f9c39ddda-python
ghcr.io/openhands/agent-server:feat-agent-server-deferred-init-python
ghcr.io/openhands/agent-server:3e89d92-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 3e89d92-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 3e89d92-python-amd64) are also available if needed

Implements the warm-pool agent-server proposal in #2523.

When `Config.deferred_init=True` (env `OH_DEFERRED_INIT`) the server
starts in *dormant* mode:

* Stateless services (VSCode, desktop, tool preload) start as usual so
  the warm pod is immediately useful to whoever attaches next.
* The conversation, event, and bash routers (everything under `/api/*`)
  return 503 via a new `require_initialized` dependency.
* `/alive`, `/health`, `/ready`, `/server_info` and a new top-level
  `/init` router are reachable. `/ready` reports ready once the
  stateless services are up so an orchestrator can match the pod with
  a user and send its `/init` payload.
* `POST /init` accepts an `InitRequest` (session API keys, workspace
  paths, webhooks, env vars, etc.), merges it with the dormant config,
  enters the `ConversationService` context, and flips the gate to
  `ready`. A second `/init` call gets 400; a failed init rolls back
  to dormant so the orchestrator can retry.
* Bootstrap auth for `POST /init` is a separate `OH_INIT_API_KEY`
  (`X-Init-API-Key` header), distinct from `session_api_keys` because
  the session key is part of the per-user payload that arrives *inside*
  the init body. `GET /init` (status polling) is unauthenticated.
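
To make the "merges it with the dormant config" step concrete, here is a minimal sketch using frozen dataclasses. The `Config` and `InitRequest` classes below are simplified hypothetical stand-ins (field defaults are invented), and treating `None` as "keep the dormant value" is an assumption rather than the PR's documented rule.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Config:
    # Hypothetical stand-in for the agent-server Config; defaults are invented.
    deferred_init: bool = False
    conversations_path: str = "workspace/conversations"
    bash_events_dir: str = "workspace/bash_events"
    session_api_keys: Tuple[str, ...] = ()


@dataclass(frozen=True)
class InitRequest:
    # Hypothetical per-user payload; None means "keep the dormant value".
    conversations_path: Optional[str] = None
    bash_events_dir: Optional[str] = None
    session_api_keys: Optional[Tuple[str, ...]] = None


def merge(dormant: Config, request: InitRequest) -> Config:
    """Return a new Config with every non-None request field applied."""
    overrides = {k: v for k, v in vars(request).items() if v is not None}
    return replace(dormant, **overrides)
```

Because both classes are frozen, `merge` never mutates the dormant config, which keeps a rollback after a failed init trivial: the original object is still intact.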

The non-deferred path is unchanged — no `InitService` is attached to
`app.state` and the dormant gate is a no-op.

Tests cover: config defaults + env wiring, `InitRequest` → `Config`
merging, state machine (dormant → initializing → ready, second-call
400), env var application, end-to-end over the FastAPI lifespan +
`TestClient` (503 gating before init, 200 after, init key auth), and
the regression that `deferred_init=False` still works exactly as today.

Refs: #2523

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions

github-actions Bot commented May 17, 2026

Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented May 17, 2026

Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented May 17, 2026

Contributor

Coverage

Coverage Report

File | Stmts | Miss | Cover | Missing
--- | --- | --- | --- | ---
openhands-agent-server/openhands/agent_server | | | |
api.py | 274 | 26 | 90% | 108, 110–115, 117, 119, 121, 156, 168, 183, 189, 242, 247, 256–258, 501, 504, 508–510, 512, 519
config.py | 97 | 3 | 96% | 38, 51, 276
init_router.py | 113 | 4 | 96% | 153, 155, 157, 159
TOTAL | 31716 | 13761 | 56% |

tofarr and others added 8 commits May 23, 2026 18:50
Resolved conflict in openhands-agent-server/openhands/agent_server/api.py:
- Kept retention_task cancellation logic added in main
- Kept stop_stateless_services() helper from PR branch

Co-authored-by: openhands <openhands@all-hands.dev>
Mounts the init router under /api instead of at the top level.
The router's own prefix (/init) combines with the new wrapper
APIRouter(prefix="/api") to produce /api/init (GET and POST).

The init router remains exempt from both the session-key auth and the
require_initialized dormant gate — it gets its own unauthenticated
wrapper APIRouter with no dependencies, mirroring the pattern used by
the workspace router.

All comments, docstrings, log messages, field descriptions, and test
client URLs updated from /init to /api/init.

Co-authored-by: openhands <openhands@all-hands.dev>
@tofarr tofarr closed this Jun 14, 2026
@tofarr tofarr deleted the feat/agent-server-deferred-init branch June 14, 2026 14:20
@tofarr tofarr restored the feat/agent-server-deferred-init branch June 14, 2026 14:20
@tofarr tofarr reopened this Jun 14, 2026
openhands-agent and others added 2 commits June 15, 2026 14:39
Remove the separate init_api_key / OH_INIT_API_KEY config field.
The dormant server's existing secret_key already serves as a
per-pod bootstrap credential — the orchestrator holds it for
encryption and it is overwritten when the per-user runtime config
arrives in the /api/init body.

Co-authored-by: openhands <openhands@all-hands.dev>
@tofarr tofarr marked this pull request as ready for review June 17, 2026 17:30

@all-hands-bot all-hands-bot left a comment

Collaborator

⚠️ QA Report: PASS WITH ISSUES

Deferred-init works end-to-end over real HTTP, but one non-functional CI check is currently failing.

Does this PR achieve its stated goal?

Yes. I started the actual agent-server process with OH_DEFERRED_INIT=1 and verified the new dormant lifecycle: readiness/health stayed available, /api/init reported dormant, gated /api/* routes returned 503 before init, bootstrap auth rejected missing/wrong keys, a correct POST /api/init transitioned to ready, and /api/conversations/count worked afterward. I also verified the non-deferred PR path remains live from startup and /api/init returns 404 when dormant mode is disabled.

Phase | Result
--- | ---
Environment Setup | make build completed successfully; no test suite, linter, formatter, or pre-commit run was executed.
CI Status | ⚠️ 33 success, 1 failure (Validate PR description), 1 in progress (qa-changes), 1 skipped at time of QA.
Functional Verification | ✅ Real agent-server processes were started and exercised with curl in baseline, deferred, and legacy modes.

Functional Verification

Test 1: Establish baseline without the PR

Step 1 — Reproduce / establish baseline (without the feature):
Checked out origin/main and ran the server with the new env var anyway:

OH_DEFERRED_INIT=1 OH_SECRET_KEY=pool-key \
  OH_CONVERSATIONS_PATH=/tmp/qa-main-convs \
  OH_BASH_EVENTS_DIR=/tmp/qa-main-bash \
  uv run agent-server --host 127.0.0.1 --port 18101

Then queried the API:

BASE /ready: 200 application/json
{"status":"ready"}
BASE /api/init GET: 404 application/json
{"detail":"Not Found"}
BASE /api/conversations/count: 200 application/json
0

This establishes the old behavior: OH_DEFERRED_INIT did not create a dormant state, /api/init was unavailable, and /api/* was live immediately.

Step 2 — Apply the PR's changes:
Switched back to feat/agent-server-deferred-init at d964bb942e6c664066065b91f0739290d93eb88b.

Step 3 — Re-run with the feature in place:
Started the PR server with equivalent deferred-init configuration:

OH_DEFERRED_INIT=1 OH_SECRET_KEY=pool-key \
  OH_CONVERSATIONS_PATH=/tmp/qa-pr-convs \
  OH_BASH_EVENTS_DIR=/tmp/qa-pr-bash \
  uv run agent-server --host 127.0.0.1 --port 18102

Observed before initialization:

PR /alive before init: 200 application/json
{"status":"ok"}
PR /ready before init: 200 application/json
{"status":"ready"}
PR /api count before init: 503 application/json
{"detail":"Internal Server Error","exception":"503: server is in deferred-init state 'dormant'; call POST /api/init first"}
PR /api/init GET before init: 200 application/json
{"state":"dormant","error":null}

This confirms the new dormant mode is active: readiness stays green for warm-pool availability, status polling works, and gated API traffic is blocked until init.
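
An orchestrator that relies on this status polling might wrap it in a small helper like the one below. This is a sketch under stated assumptions: `wait_until_ready` is a hypothetical name, and `fetch_state` stands for any callable that performs `GET /api/init` and returns the `state` field of the JSON body.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_state: Callable[[], str],
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_state() until it returns "ready" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if fetch_state() == "ready":
            return True
        if time.monotonic() >= deadline:
            return False  # still dormant or initializing when time ran out
        sleep(interval)
```

Injecting `fetch_state` and `sleep` keeps the helper free of network code, so it can be exercised without a running server.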

Test 2: Bootstrap auth and dormant → ready transition

Step 1 — Baseline while dormant:
Against the same dormant PR server, attempted init without valid bootstrap credentials:

PR /api/init POST missing key: 401 application/json
{"detail":"Unauthorized"}
PR /api/init POST wrong key: 401 application/json
{"detail":"Unauthorized"}

This confirms POST /api/init is auth-gated by X-Init-API-Key when OH_SECRET_KEY is configured.
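
The observed 401/200 behavior is consistent with a check along the following lines. This is a hypothetical sketch, not the PR's code: `init_key_status` is an invented name, the constant-time comparison is a common hardening choice rather than something the PR states, and allowing the call when no key is configured is an assumption inferred from the wording above.

```python
import hmac
from typing import Optional


def init_key_status(expected: Optional[str], provided: Optional[str]) -> int:
    """Return 200 if the X-Init-API-Key header is acceptable, else 401."""
    if not expected:
        # Assumption: with no secret key configured there is nothing to check.
        return 200
    if provided is None:
        return 401  # header missing
    # Constant-time comparison avoids leaking how many leading bytes matched.
    matches = hmac.compare_digest(
        provided.encode("utf-8"), expected.encode("utf-8")
    )
    return 200 if matches else 401
```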

Step 2 — Apply valid init payload:
Posted runtime config with the correct bootstrap key:

curl -X POST http://127.0.0.1:18102/api/init \
  -H 'X-Init-API-Key: pool-key' \
  -H 'Content-Type: application/json' \
  --data '{"conversations_path":"/tmp/qa-pr-user-convs","bash_events_dir":"/tmp/qa-pr-user-bash","env":{"DEFERRED_INIT_QA_VAR":"from_init"}}'

Observed:

PR /api/init POST correct key: 200 application/json
{"state":"ready","error":null}
PR /api/init GET after init: 200 application/json
{"state":"ready","error":null}
PR /api count after init: 200 application/json
0
PR /api/init POST second call: 400 application/json
{"detail":"server already in state: ready"}

This confirms the server transitions to ready, previously gated /api/* routes become usable, and repeated init is rejected as described.

Test 3: Public health/server-info and non-deferred compatibility

Deferred mode public endpoints:
Started another PR server in dormant mode and queried the ungated endpoints:

PR deferred /health: 200 application/json
{"status":"ok"}
PR deferred /server_info: 200 application/json
{"uptime":0.0,"idle_time":0.0,"title":"OpenHands Agent Server",...}
PR deferred /api count: 503 application/json
{"detail":"Internal Server Error","exception":"503: server is in deferred-init state 'dormant'; call POST /api/init first"}

This matches the behavior matrix: health/server info stay available while /api/* remains gated.

Non-deferred PR compatibility:
Started the PR server without OH_DEFERRED_INIT:

PR legacy /ready: 200 application/json
{"status":"ready"}
PR legacy /api/init GET: 404 application/json
{"detail":"server is not running with deferred_init=True; the /api/init endpoint is not available"}
PR legacy /api count: 200 application/json
0
PR legacy /server_info: 200 application/json
{"uptime":0.0,"idle_time":0.0,"title":"OpenHands Agent Server",...}

This confirms the default/legacy path remains live from startup and does not require init.

Issues Found

  • 🟡 Minor / merge-readiness: The GitHub check Validate PR description is failing at the time of QA. I did not edit the PR description; if this is due to the human-only PR description fields, a human contributor needs to update those in their own words.

This QA review was generated by an AI agent (OpenHands) on behalf of the user.

Comment thread openhands-agent-server/openhands/agent_server/config.py Outdated
)


init_router = APIRouter(prefix="/init", tags=["Init"])

@enyst enyst Jun 17, 2026

Member

Could we do this differently than idk sitting on the import path? It’s error prone imho and sometimes bites

Collaborator Author

Can you elaborate what you mean here? This follows the same pattern as the other routers. e.g.:

@enyst enyst Jun 17, 2026

Member

Yes, you’re right it matches the current routers. My concern is that this is existing technical debt, more like anti-pattern than a pattern to keep.

The agent points to bash_router.py and sockets.py as potentially more problematic examples: they also instantiate get_default_bash_event_service() / get_default_conversation_service() at module import. Maybe… Is that a bit risky for the behavior we want with this PR, if the service/config is supposed to be computed later by /api/init?

I had an old PR redesigning this: #400. It tried to move the server toward an app-factory + DI pattern where services are resolved from app.state instead of router-module globals. I’m not suggesting to refactor them all right now though 😅 More like, I was wondering how we could do it here to be less error prone
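
For readers following the thread, the app-factory idea can be illustrated with a framework-free sketch. Everything here is hypothetical (`SimpleNamespace` stands in for the app and its `state`, and `ConversationService` is a toy class); it is not taken from #400 or from this PR.

```python
from types import SimpleNamespace


class ConversationService:
    # Toy stand-in for a service that depends on per-user config.
    def __init__(self, conversations_path: str) -> None:
        self.conversations_path = conversations_path


def create_app() -> SimpleNamespace:
    """App factory: no service is constructed at import or startup time."""
    app = SimpleNamespace(state=SimpleNamespace())
    app.state.conversation_service = None
    return app


def apply_init(app: SimpleNamespace, conversations_path: str) -> None:
    """What an init handler might do once the per-user config arrives."""
    app.state.conversation_service = ConversationService(conversations_path)


def get_conversation_service(app: SimpleNamespace) -> ConversationService:
    """Resolve at request time, so the post-init service is the one used."""
    service = app.state.conversation_service
    if service is None:
        raise RuntimeError("conversation service not initialized")
    return service
```

A module-level global created at import would instead capture whatever config existed before init, which is the risk raised in this comment.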

@tofarr tofarr Jun 17, 2026

Collaborator Author

I agree we should do this - but I think it should be a separate PR - We should resurrect #400 or create a new PR to do this.

@tofarr tofarr requested a review from enyst June 17, 2026 18:37

@xingyaoww xingyaoww left a comment

Member

LGTM. Nit: we should create an example under agent-server examples folders, so this can be tested before each release

Adds examples/02_remote_agent_server/16_deferred_init.py to demonstrate
the dormant-mode lifecycle introduced in the deferred-init PR:

- Start server with OH_DEFERRED_INIT=true (dormant state)
- Verify GET /api/init returns state=dormant
- Verify /api/* routes return 503 while dormant
- POST /api/init with X-Init-API-Key to activate the server
- Verify GET /api/init returns state=ready
- Run a conversation normally on the now-ready server

Addresses Xingyao's review comment requesting a testable agent-server
example before each release.

Co-authored-by: openhands <openhands@all-hands.dev>
@tofarr tofarr merged commit cceb86f into main Jun 17, 2026
37 of 38 checks passed
@tofarr tofarr deleted the feat/agent-server-deferred-init branch June 17, 2026 21:23

5 participants