feat(agent-server): add deferred-init / dormant mode #3287
Implements the warm-pool agent-server proposal in #2523. When `Config.deferred_init=True` (env `OH_DEFERRED_INIT`) the server starts in *dormant* mode:

* Stateless services (VSCode, desktop, tool preload) start as usual so the warm pod is immediately useful to whoever attaches next.
* The conversation, event, and bash routers (everything under `/api/*`) return 503 via a new `require_initialized` dependency.
* `/alive`, `/health`, `/ready`, `/server_info` and a new top-level `/init` router are reachable. `/ready` reports ready once the stateless services are up so an orchestrator can match the pod with a user and send its `/init` payload.
* `POST /init` accepts an `InitRequest` (session API keys, workspace paths, webhooks, env vars, etc.), merges it with the dormant config, enters the `ConversationService` context, and flips the gate to `ready`. A second `/init` call gets 400; a failed init rolls back to dormant so the orchestrator can retry.
* Bootstrap auth for `POST /init` is a separate `OH_INIT_API_KEY` (`X-Init-API-Key` header), distinct from `session_api_keys` because the session key is part of the per-user payload that arrives *inside* the init body. `GET /init` (status polling) is unauthenticated.

The non-deferred path is unchanged: no `InitService` is attached to `app.state` and the dormant gate is a no-op.

Tests cover: config defaults + env wiring, `InitRequest` → `Config` merging, state machine (dormant → initializing → ready, second-call 400), env var application, end-to-end over the FastAPI lifespan + `TestClient` (503 gating before init, 200 after, init key auth), and the regression that `deferred_init=False` still works exactly as today.

Refs: #2523

Co-authored-by: openhands <openhands@all-hands.dev>
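The lifecycle described in this commit message can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's `InitService`: `InitGate`, `DormantConfig`, `InitError`, and the default paths are hypothetical names and values, and only the status codes (503 while gated, 400 on a repeated init) and the rollback-on-failure behavior are taken from the text above.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, Optional


class State(str, Enum):
    DORMANT = "dormant"
    INITIALIZING = "initializing"
    READY = "ready"


class InitError(Exception):
    """Carries the HTTP status a router could return (hypothetical type)."""

    def __init__(self, status: int, detail: str) -> None:
        super().__init__(detail)
        self.status = status


@dataclass(frozen=True)
class DormantConfig:
    # Hypothetical subset of the server config, for illustration only.
    conversations_path: str = "workspace/conversations"
    bash_events_dir: str = "workspace/bash_events"


class InitGate:
    def __init__(self, config: DormantConfig) -> None:
        self.config = config
        self.state = State.DORMANT
        self.error: Optional[str] = None

    def require_initialized(self) -> None:
        # Gate for per-user routes: anything other than READY is refused.
        if self.state is not State.READY:
            raise InitError(503, f"server is in deferred-init state '{self.state.value}'")

    def initialize(self, overrides: dict, start: Callable[[DormantConfig], None]) -> None:
        if self.state is not State.DORMANT:
            raise InitError(400, f"server already in state: {self.state.value}")
        self.state = State.INITIALIZING
        try:
            # Merge only the fields the caller actually supplied.
            supplied = {k: v for k, v in overrides.items() if v is not None}
            merged = replace(self.config, **supplied)
            start(merged)  # stand-in for entering the per-user service context
        except Exception as exc:
            # Roll back so the orchestrator can retry with a new payload.
            self.state, self.error = State.DORMANT, str(exc)
            raise
        self.config, self.state, self.error = merged, State.READY, None
```

Keeping the merge inside the `try` block means a malformed payload also rolls the gate back to dormant instead of leaving it stuck in `initializing`.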
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
Resolved conflict in openhands-agent-server/openhands/agent_server/api.py:
- Kept retention_task cancellation logic added in main
- Kept stop_stateless_services() helper from PR branch

Co-authored-by: openhands <openhands@all-hands.dev>
Mounts the init router under /api instead of at the top level. The router's own prefix (/init) combines with the new wrapper APIRouter(prefix="/api") to produce GET /api/init and POST /api/init. The init router remains exempt from both the session-key auth and the require_initialized dormant gate: it gets its own unauthenticated wrapper APIRouter with no dependencies, mirroring the pattern used by the workspace router. All comments, docstrings, log messages, field descriptions, and test client URLs are updated from /init to /api/init.

Co-authored-by: openhands <openhands@all-hands.dev>
Remove the separate init_api_key / OH_INIT_API_KEY config field. The dormant server's existing secret_key already serves as a per-pod bootstrap credential — the orchestrator holds it for encryption and it is overwritten when the per-user runtime config arrives in the /api/init body. Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
⚠️ QA Report: PASS WITH ISSUES
Deferred-init works end-to-end over real HTTP, but one non-functional CI check is currently failing.
Does this PR achieve its stated goal?
Yes. I started the actual agent-server process with OH_DEFERRED_INIT=1 and verified the new dormant lifecycle: readiness/health stayed available, /api/init reported dormant, gated /api/* routes returned 503 before init, bootstrap auth rejected missing/wrong keys, a correct POST /api/init transitioned to ready, and /api/conversations/count worked afterward. I also verified the non-deferred PR path remains live from startup and /api/init returns 404 when dormant mode is disabled.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; no test suite, linter, formatter, or pre-commit run was executed. |
| CI Status | ⚠️ 1 failing (Validate PR description), 1 in progress (qa-changes), 1 skipped at time of QA. |
| Functional Verification | ✅ Real agent-server processes were started and exercised with curl in baseline, deferred, and legacy modes. |
Functional Verification
Test 1: Establish baseline without the PR
Step 1 — Reproduce / establish baseline (without the feature):
Checked out origin/main and ran the server with the new env var anyway:
OH_DEFERRED_INIT=1 OH_SECRET_KEY=pool-key \
OH_CONVERSATIONS_PATH=/tmp/qa-main-convs \
OH_BASH_EVENTS_DIR=/tmp/qa-main-bash \
uv run agent-server --host 127.0.0.1 --port 18101
Then queried the API:
BASE /ready: 200 application/json
{"status":"ready"}
BASE /api/init GET: 404 application/json
{"detail":"Not Found"}
BASE /api/conversations/count: 200 application/json
0
This establishes the old behavior: OH_DEFERRED_INIT did not create a dormant state, /api/init was unavailable, and /api/* was live immediately.
Step 2 — Apply the PR's changes:
Switched back to feat/agent-server-deferred-init at d964bb942e6c664066065b91f0739290d93eb88b.
Step 3 — Re-run with the feature in place:
Started the PR server with equivalent deferred-init configuration:
OH_DEFERRED_INIT=1 OH_SECRET_KEY=pool-key \
OH_CONVERSATIONS_PATH=/tmp/qa-pr-convs \
OH_BASH_EVENTS_DIR=/tmp/qa-pr-bash \
uv run agent-server --host 127.0.0.1 --port 18102
Observed before initialization:
PR /alive before init: 200 application/json
{"status":"ok"}
PR /ready before init: 200 application/json
{"status":"ready"}
PR /api count before init: 503 application/json
{"detail":"Internal Server Error","exception":"503: server is in deferred-init state 'dormant'; call POST /api/init first"}
PR /api/init GET before init: 200 application/json
{"state":"dormant","error":null}
This confirms the new dormant mode is active: readiness stays green for warm-pool availability, status polling works, and gated API traffic is blocked until init.
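An orchestrator would typically poll the status endpoint before sending a payload. The helper below is a sketch of that loop; `wait_for_state` and the injected `fetch_state` callable are hypothetical, and the only detail taken from the output above is the `{"state": ..., "error": ...}` response shape.

```python
import time
from typing import Callable, Mapping


def wait_for_state(
    fetch_state: Callable[[], Mapping[str, object]],
    target: str = "dormant",
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status body reports `target`; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_state()
        if body.get("state") == target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

In practice `fetch_state` would wrap a `GET /api/init` call and return the decoded JSON body; injecting it (and `sleep`) keeps the loop testable without a running server.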
Test 2: Bootstrap auth and dormant → ready transition
Step 1 — Baseline while dormant:
Against the same dormant PR server, attempted init without valid bootstrap credentials:
PR /api/init POST missing key: 401 application/json
{"detail":"Unauthorized"}
PR /api/init POST wrong key: 401 application/json
{"detail":"Unauthorized"}
This confirms POST /api/init is auth-gated by X-Init-API-Key when OH_SECRET_KEY is configured.
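The 401 responses above are consistent with a header check along the following lines. This is a sketch, not the PR's code: `check_bootstrap_key` is a hypothetical helper, the exact-case header lookup is a simplification (HTTP header names are case-insensitive), and skipping the check when no secret is configured is an assumption drawn from the "when OH_SECRET_KEY is configured" wording.

```python
import hmac
from typing import Mapping, Optional


def check_bootstrap_key(headers: Mapping[str, str], secret_key: Optional[str]) -> int:
    """Return the HTTP status an init endpoint might use: 200 or 401."""
    if secret_key is None:
        # Assumption: with no secret configured, the check is skipped.
        return 200
    supplied = headers.get("X-Init-API-Key", "")
    # compare_digest runs in time independent of where the strings differ.
    matches = hmac.compare_digest(supplied.encode(), secret_key.encode())
    return 200 if matches else 401
```

A constant-time comparison is the usual choice for bootstrap credentials, since a plain `==` can reveal how many leading characters matched.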
Step 2 — Apply valid init payload:
Posted runtime config with the correct bootstrap key:
curl -X POST http://127.0.0.1:18102/api/init \
-H 'X-Init-API-Key: pool-key' \
-H 'Content-Type: application/json' \
--data '{"conversations_path":"/tmp/qa-pr-user-convs","bash_events_dir":"/tmp/qa-pr-user-bash","env":{"DEFERRED_INIT_QA_VAR":"from_init"}}'
Observed:
PR /api/init POST correct key: 200 application/json
{"state":"ready","error":null}
PR /api/init GET after init: 200 application/json
{"state":"ready","error":null}
PR /api count after init: 200 application/json
0
PR /api/init POST second call: 400 application/json
{"detail":"server already in state: ready"}
This confirms the server transitions to ready, previously gated /api/* routes become usable, and repeated init is rejected as described.
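For callers that prefer Python over curl, the same request from Step 2 can be assembled with the standard library. `build_init_request` is a hypothetical helper; the URL, header name, and payload fields are copied from the curl invocation above, and the request is built but not sent.

```python
import json
import urllib.request


def build_init_request(base_url: str, bootstrap_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/init request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/init",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-Init-API-Key": bootstrap_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_init_request(
    "http://127.0.0.1:18102",
    "pool-key",
    {
        "conversations_path": "/tmp/qa-pr-user-convs",
        "bash_events_dir": "/tmp/qa-pr-user-bash",
        "env": {"DEFERRED_INIT_QA_VAR": "from_init"},
    },
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request, timeout=10).
```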
Test 3: Public health/server-info and non-deferred compatibility
Deferred mode public endpoints:
Started another PR server in dormant mode and queried the ungated endpoints:
PR deferred /health: 200 application/json
{"status":"ok"}
PR deferred /server_info: 200 application/json
{"uptime":0.0,"idle_time":0.0,"title":"OpenHands Agent Server",...}
PR deferred /api count: 503 application/json
{"detail":"Internal Server Error","exception":"503: server is in deferred-init state 'dormant'; call POST /api/init first"}
This matches the behavior matrix: health/server info stay available while /api/* remains gated.
Non-deferred PR compatibility:
Started the PR server without OH_DEFERRED_INIT:
PR legacy /ready: 200 application/json
{"status":"ready"}
PR legacy /api/init GET: 404 application/json
{"detail":"server is not running with deferred_init=True; the /api/init endpoint is not available"}
PR legacy /api count: 200 application/json
0
PR legacy /server_info: 200 application/json
{"uptime":0.0,"idle_time":0.0,"title":"OpenHands Agent Server",...}
This confirms the default/legacy path remains live from startup and does not require init.
Issues Found
- 🟡 Minor / merge-readiness: The GitHub check Validate PR description is failing at the time of QA. I did not edit the PR description; if this is due to the human-only PR description fields, a human contributor needs to update those in their own words.
This QA review was generated by an AI agent (OpenHands) on behalf of the user.
init_router = APIRouter(prefix="/init", tags=["Init"])
Could we do this differently than idk sitting on the import path? It’s error prone imho and sometimes bites
Can you elaborate what you mean here? This follows the same pattern as the other routers. e.g.:
Yes, you’re right it matches the current routers. My concern is that this is existing technical debt, more like anti-pattern than a pattern to keep.
The agent points to bash_router.py and sockets.py as potentially more problematic examples: they also instantiate get_default_bash_event_service() / get_default_conversation_service() at module import. Maybe… Is that a bit risky for the behavior we want with this PR, if the service/config is supposed to be computed later by /api/init?
I had an old PR redesigning this: #400. It tried to move the server toward an app-factory + DI pattern where services are resolved from app.state instead of router-module globals. I’m not suggesting to refactor them all right now though 😅 More like, I was wondering how we could do it here to be less error prone
I agree we should do this - but I think it should be a separate PR - We should resurrect #400 or create a new PR to do this.
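To make the concern in this thread concrete, the sketch below contrasts a service bound at import time with one resolved from application state on each request. All names (`AppState`, `ConversationService`, `get_conversation_service`, `MODULE_LEVEL_SERVICE`) and paths are hypothetical stand-ins, not the agent-server's actual classes or the design proposed in #400.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationService:
    conversations_path: str


@dataclass
class AppState:
    # Populated by the init step; None while the server is dormant.
    conversation_service: Optional[ConversationService] = None


# Import-time binding: the service is fixed with whatever config existed
# when the module was first imported, before any init payload arrived.
MODULE_LEVEL_SERVICE = ConversationService("/dormant/default")


def get_conversation_service(state: AppState) -> ConversationService:
    """Resolve the service per request, so late configuration is visible."""
    if state.conversation_service is None:
        raise RuntimeError("server not initialized")
    return state.conversation_service


state = AppState()
state.conversation_service = ConversationService("/tmp/user-convs")  # init step
resolved = get_conversation_service(state)
```

Here `resolved` reflects the path supplied at init, while `MODULE_LEVEL_SERVICE` still holds the value captured at import, which is the kind of mismatch a deferred-init server would need to avoid.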
xingyaoww
left a comment
LGTM. Nit: we should create an example under agent-server examples folders, so this can be tested before each release
Adds examples/02_remote_agent_server/16_deferred_init.py to demonstrate the dormant-mode lifecycle introduced in the deferred-init PR:
- Start server with OH_DEFERRED_INIT=true (dormant state)
- Verify GET /api/init returns state=dormant
- Verify /api/* routes return 503 while dormant
- POST /api/init with X-Init-API-Key to activate the server
- Verify GET /api/init returns state=ready
- Run a conversation normally on the now-ready server

Addresses Xingyao's review comment requesting a testable agent-server example before each release.

Co-authored-by: openhands <openhands@all-hands.dev>
H:
This paves the way for performance upgrades and tightening security in the K8s environment.
AGENT:
This PR was opened by an AI agent (OpenHands) on behalf of @tofarr. All 15 new tests in `tests/agent_server/test_init_router.py` pass; the rest of `tests/agent_server/` is unaffected (one pre-existing failure in `test_terminal_service.py::test_terminal_does_not_expose_session_api_key_via_env_command` also reproduces on `main`, unrelated to this change).
Why
Implements the warm-pool agent-server proposal in #2523. Enables Kubernetes warm pods to be matched with users after boot, without pre-attached PVCs: the orchestrator starts a pool of dormant agent-server pods and delivers a per-user runtime config via `/init` when a conversation begins.
Summary
- `Config.deferred_init` (env `OH_DEFERRED_INIT`) puts the server in dormant mode at startup; `InitService` (`init_router.py`) owns the `dormant → initializing → ready` state machine with bootstrap auth via the existing `secret_key` (`X-Init-API-Key` header)
- `require_initialized` FastAPI dependency gates all `/api/*` routes with 503 until `ready`; `/ready` flips to 200 as soon as stateless services are up so the orchestrator knows the pod is available for `/init`
- Non-deferred path (`deferred_init=False`) is completely unaffected: no `InitService` is attached, `/init` routes return 404, and `/api/*` is live from startup as before
Issue Number
#2523
How to Test
Covers: config defaults + env wiring, `InitRequest → Config` merging, state machine transitions (dormant → ready, double-init 400, failed init rolls back to dormant), end-to-end over `api_lifespan` + `TestClient` (503 gating before `/init`, 200 after, auth checks: 401 wrong key, 200 right key, GET unauthenticated), and lifespan teardown.
Manual smoke test:
Video/Screenshots
No GUI changes. See test output above.
Type
Notes
Intentional scope limits (see #2523 discussion for rationale):
- No `/deinit` yet: once `ready`, the server stays `ready` for the process lifetime; recyclable init is a follow-up requiring a workspace-flush story
- `InitRequest` accepts `conversations_path` and `bash_events_dir` so an orchestrator can pre-populate before calling `/init`
- No `Workspace`-class SDK integration: the two-phase `start()`/`attach(config)` API deserves its own PR once this server-side primitive is merged
- Session API keys sent via `/init` populate `app.state.config` but are not enforced by the auth dependency (documented in test). Production deployments should set `OH_SESSION_API_KEYS_0` at pod start; the dormant gate already guarantees no traffic reaches gated routes before `/init`
Documentation example: OpenHands/docs#577
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
Base images:
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22-slim
- golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:3e89d92-python
Run
All tags pushed for this build
About Multi-Architecture Support
- Each variant tag (e.g., `3e89d92-python`) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., `3e89d92-python-amd64`) are also available if needed