Description
Part of: #408
Story: Research Assistant Node-Pinned Session Routing
As a CIDX cluster administrator using the Research Assistant
I want to have my chat sessions automatically routed to the node where they were created
So that sessions work correctly in a multi-node cluster without requiring shared filesystems for node-local state (workspaces, Claude CLI transcripts, uploaded files)
Part of Epic: #408 (CIDX Server Clusterization)
Feature: Research Assistant Cluster Support (priority-2)
Context and Rationale
The Research Assistant creates node-local state that cannot be shared across cluster nodes:
- Session workspaces at `~/.cidx-server/research/{session_id}/` with symlinks and uploaded files
- Claude CLI transcripts at `~/.claude/projects/` (used by `--resume` for conversation continuity)
- In-memory `_jobs` dict for active job tracking (node-local by nature)
[Conversation: "the conversation in DB, but any operation that needs the file system, we pin to the server where it lives"]
The design decision is to route requests to the owning node rather than move or replicate the data:
[Conversation: "we should use some kind of token in the header or something so that when a new chat message is posted by the user, it goes directly to the server where the session lives"]
Read-only operations (browse sessions, view messages) work from any node via the shared database (PostgreSQL in cluster mode). Only stateful operations (send message, upload file, delete file, delete session) require routing to the owner node.
[Conversation: "we need to store that IN the database, and when starting a new chat, first message, it must go to the server we are hitting in that moment"]
Implementation Status
- `owner_node` column added to `research_sessions` table (SQLite migration + PostgreSQL schema)
- Node identity service (`NodeIdentity`) for consistent node identification
- First-message node assignment in `create_session()` and `get_default_session()`
- Session routing middleware in Research Assistant router (proxy stateful requests to owner node)
- Node health check before proxying (fail fast if owner is down)
- Clear error responses when owner node is unavailable ("Session host unavailable")
- Read-only operations verified working from any node (no routing needed)
- Unit tests for routing logic, node assignment, health check, error paths
- Integration tests for proxy behavior and cross-node scenarios
- E2E manual testing on local server
Completion: 0/10 tasks complete (0%)
Algorithm
Node Identity:
```
node_id = hostname (from socket.gethostname()) OR config.json "node_id" field

get_node_id():
    IF config.json has "node_id" field:
        RETURN config["node_id"]
    RETURN socket.gethostname()
```
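A minimal Python sketch of the node-identity lookup above. The config path is a placeholder assumption; the real server config location and schema may differ:

```python
import json
import socket
from pathlib import Path

# Hypothetical config location -- the real server config path may differ.
DEFAULT_CONFIG = Path("~/.cidx-server/config.json")


def get_node_id(config_path: Path = DEFAULT_CONFIG) -> str:
    """Return the explicit "node_id" from config.json if present,
    otherwise fall back to the machine hostname."""
    path = config_path.expanduser()
    if path.is_file():
        try:
            node_id = json.loads(path.read_text()).get("node_id")
            if node_id:
                return node_id
        except (json.JSONDecodeError, OSError):
            pass  # Unreadable config: fall through to hostname
    return socket.gethostname()
```

The config override matters in containers and dynamic cloud instances, where the hostname is not stable across restarts.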
Session Owner Assignment:
```
create_session():
    session_id = uuid4()
    owner_node = get_node_id()
    INSERT INTO research_sessions (id, name, folder_path, created_at, updated_at, owner_node)
    RETURN session dict with owner_node

get_default_session():
    IF default session exists:
        IF owner_node IS NULL:
            UPDATE owner_node = get_node_id()   # Backfill for pre-cluster sessions
        RETURN session
    ELSE:
        CREATE with owner_node = get_node_id()
        RETURN session
```
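A runnable sketch of owner assignment and backfill against SQLite, with a column set simplified from the pseudocode (the real `research_sessions` schema has more fields, and the service layer wraps this differently):

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def create_session(conn: sqlite3.Connection, name: str, owner_node: str) -> dict:
    """Insert a session row with owner_node pinned to the creating node."""
    now = datetime.now(timezone.utc).isoformat()
    session = {
        "id": str(uuid.uuid4()),
        "name": name,
        "created_at": now,
        "updated_at": now,
        "owner_node": owner_node,
    }
    conn.execute(
        "INSERT INTO research_sessions (id, name, created_at, updated_at, owner_node) "
        "VALUES (:id, :name, :created_at, :updated_at, :owner_node)",
        session,
    )
    conn.commit()
    return session


def backfill_owner(conn: sqlite3.Connection, session_id: str, node_id: str) -> None:
    """Claim a pre-cluster session (owner_node IS NULL) for node_id.
    The WHERE guard makes the claim a no-op if an owner already exists."""
    conn.execute(
        "UPDATE research_sessions SET owner_node = ? "
        "WHERE id = ? AND owner_node IS NULL",
        (node_id, session_id),
    )
    conn.commit()
```

The `owner_node IS NULL` guard in the UPDATE is what makes backfill safe if two nodes race to claim the same pre-cluster session: only the first write wins.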
Request Routing (router-level middleware):
```
classify_operation(endpoint, method):
    STATEFUL = {POST /send, POST /sessions/{id}/upload, DELETE /sessions/{id}/files/{name},
                DELETE /sessions/{id}, GET /sessions/{id}/files,
                GET /sessions/{id}/files/{name}}
    READ_ONLY = {GET /sessions/{id} (messages), GET /poll/{job_id}, GET "" (page render),
                 POST /sessions (create - always local, assigns current node)}
    IF endpoint IN STATEFUL: RETURN "stateful"
    RETURN "read_only"
```
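A testable Python sketch of the classification step. The path templates are simplified placeholders relative to the real `/admin/research/...` routes; matching on literal templates stands in for whatever route-pattern matching the router actually does:

```python
# Simplified route table -- real paths carry the /admin/research prefix.
STATEFUL = {
    ("POST", "/send"),
    ("POST", "/sessions/{id}/upload"),
    ("DELETE", "/sessions/{id}/files/{name}"),
    ("DELETE", "/sessions/{id}"),
    ("GET", "/sessions/{id}/files"),
    ("GET", "/sessions/{id}/files/{name}"),
}


def classify_operation(method: str, endpoint: str) -> str:
    """Stateful operations need the owner node's filesystem; everything
    else is served locally from the shared database."""
    return "stateful" if (method.upper(), endpoint) in STATEFUL else "read_only"
```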
```
route_request(request, session_id):
    IF classify_operation() == "read_only":
        RETURN handle_locally()   # Messages and session list come from shared DB

    # Stateful operation - check owner
    session = get_session(session_id)
    IF session IS NULL:
        RETURN 404 "Session not found"

    owner_node = session.get("owner_node")
    IF owner_node IS NULL OR owner_node == get_node_id():
        RETURN handle_locally()   # We are the owner (or pre-cluster session)

    # Need to proxy to owner node
    IF NOT is_node_healthy(owner_node):
        RETURN 503 "Session host unavailable: {owner_node} is not responding.
                    Read-only access is still available."
    RETURN proxy_to_node(owner_node, request)
```
```
is_node_healthy(node_id):
    node_url = resolve_node_url(node_id)   # From cluster config or discovery
    TRY:
        response = HTTP GET {node_url}/health with timeout=3s
        RETURN response.status_code == 200
    CATCH timeout/connection_error:
        RETURN False
```
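A sketch of the health check with the HTTP probe injected as a callable, so the timeout and error paths are unit-testable without a live node. The stdlib probe here is for self-containment only; the story itself specifies httpx for the real calls:

```python
import urllib.request
from typing import Callable


def _http_get_status(url: str, timeout: float) -> int:
    """Stdlib GET probe (stand-in for an httpx call in the real service)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def is_node_healthy(
    node_url: str,
    timeout: float = 3.0,
    probe: Callable[[str, float], int] = _http_get_status,
) -> bool:
    """Fail fast: timeouts, connection errors, and non-200 responses
    all count as unhealthy."""
    try:
        return probe(f"{node_url}/health", timeout) == 200
    except Exception:
        return False
```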
```
proxy_to_node(node_id, original_request):
    node_url = resolve_node_url(node_id)
    # Forward the request with all headers, body, auth
    response = HTTP {method} {node_url}{original_path}
               with headers=original_headers, body=original_body, timeout=120s
    RETURN response
```
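One way to keep the proxy testable is to split request construction from the actual send. The sketch below builds kwargs suitable for an `httpx.request(**kwargs)` call; the hop-by-hop header set is an assumption based on general HTTP semantics, not taken from the codebase:

```python
# Hop-by-hop headers must not be forwarded verbatim; Host is replaced
# by the owner node's host when the request is re-sent.
HOP_BY_HOP = {"connection", "keep-alive", "transfer-encoding", "upgrade", "host"}


def build_proxy_request(node_url: str, method: str, path: str,
                        headers: dict, query: str = "") -> dict:
    """Assemble the forwarded request (method, URL, headers, timeout).
    Authorization and other end-to-end headers pass through unchanged."""
    forwarded = {k: v for k, v in headers.items() if k.lower() not in HOP_BY_HOP}
    url = f"{node_url}{path}" + (f"?{query}" if query else "")
    return {"method": method, "url": url, "headers": forwarded, "timeout": 120.0}
```

The 120-second timeout matches the design decision below: Claude CLI invocations can take minutes, so the proxy must wait far longer than the 3-second health probe.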
Node URL Resolution:
```
resolve_node_url(node_id):
    # Option 1: Static config in config.json
    cluster_nodes = config.get("cluster_nodes", {})
    IF node_id IN cluster_nodes:
        RETURN cluster_nodes[node_id]["url"]   # e.g., "http://node1:8000"

    # Option 2: Node registered its URL in shared DB at startup
    RETURN lookup_node_url_from_db(node_id)
```
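A sketch of the two-tier resolution, with the shared-DB registry abstracted as an injectable `db_lookup` callable (a hypothetical stand-in for the actual DB query):

```python
from typing import Callable, Optional


def resolve_node_url(
    node_id: str,
    config: dict,
    db_lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Static cluster_nodes config wins; otherwise fall back to the
    shared-DB node registry. Returns None if the node is unknown."""
    nodes = config.get("cluster_nodes", {})
    if node_id in nodes:
        return nodes[node_id]["url"]
    if db_lookup is not None:
        return db_lookup(node_id)
    return None
```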
Acceptance Criteria
Scenario: First message in new session assigns owner node
Given a multi-node CIDX cluster with nodes A and B
When a user creates a new Research Assistant session on Node A
Then the session record in the database has owner_node set to Node A's node_id
And the session workspace is created at ~/.cidx-server/research/{id}/ on Node A only
And subsequent stateful operations for this session are handled by Node A
Scenario: Stateful request proxied to owner node
Given a Research Assistant session created on Node A with owner_node="nodeA"
When the user sends a chat message and the request hits Node B (via load balancer)
Then Node B checks the session's owner_node in the database
And Node B verifies Node A is healthy via /health endpoint
And Node B proxies the request to Node A
And Node A executes Claude CLI locally and stores the response
And the response is returned to the user through Node B
Scenario: Read-only operations work from any node
Given a Research Assistant session with messages stored in the shared database
When the user browses sessions or views message history from any cluster node
Then the request is handled locally without proxying
And all session metadata and messages are read from the shared database
And no filesystem access to the owner node is required
Scenario: Owner node is down - stateful operations fail with clear error
Given a Research Assistant session created on Node A
And Node A is currently unreachable (process down or network failure)
When the user sends a chat message and the request hits Node B
Then Node B detects Node A is unhealthy via /health check (3s timeout)
And Node B returns HTTP 503 with message "Session host unavailable"
And the error message identifies which node is down
And read-only operations (view messages, browse sessions) still work from Node B
Scenario: Pre-cluster sessions backfilled with owner node on first use
Given a research session created before cluster mode was enabled (owner_node IS NULL)
When a stateful operation is performed on this session from any node
Then the current node claims ownership by setting owner_node to its own node_id
And the session workspace is created on the current node if it does not exist
And subsequent operations route to this node
Testing Requirements
Unit Tests
- `owner_node` column migration: verify column added to existing databases, NULL default for old rows
- `NodeIdentity.get_node_id()`: config-based override, hostname fallback
- `create_session()`: verify owner_node is populated in returned session and database
- `get_default_session()`: verify owner_node backfill for NULL rows
- Operation classification: verify stateful vs read-only for each endpoint
- Routing decision logic: local execution when owner matches, proxy trigger when different
- Health check: mock HTTP responses for healthy/unhealthy/timeout scenarios
- Error response format: verify 503 status and "Session host unavailable" message
- Proxy request construction: verify headers, body, auth forwarded correctly
Integration Tests
- Full routing flow: create session on node A identity, verify routing decision returns proxy for node B identity
- Database round-trip: session with owner_node persists and reads back correctly
- Health check integration: real HTTP call to /health endpoint on localhost
- Backfill scenario: NULL owner_node session gets claimed by current node
E2E Manual Testing
- Start local CIDX server on localhost:8000
- Create a research session via Web UI, verify owner_node is set in database
- Send a message, verify Claude CLI executes locally
- Query database to confirm owner_node column value
- Verify read-only operations (GET sessions, GET messages) work independently of owner_node
- Simulate owner mismatch by manually changing owner_node in DB, verify 503 error on send
What Cannot Be Tested Locally
- Actual cross-node proxying (requires two CIDX instances) -- deferred to staging cluster validation
- Load balancer sticky session interaction -- validated at cluster integration time
Key Design Decisions
Node Identification
Use socket.gethostname() as default, with config.json "node_id" override for environments where hostname is not stable (containers, dynamic cloud instances). This matches the pattern used elsewhere in Epic #408 for node identity.
Operation Classification
Operations are classified at the router level, not the service level. The router inspects the endpoint and HTTP method to determine if routing is needed before calling the service. This keeps the service layer unaware of cluster topology.
Stateful operations (require filesystem access on owner node):
- `POST /admin/research/send` -- executes Claude CLI with `--resume`, needs `~/.claude/projects/`
- `POST /admin/research/sessions/{id}/upload` -- writes file to `~/.cidx-server/research/{id}/`
- `DELETE /admin/research/sessions/{id}/files/{name}` -- deletes file from workspace
- `DELETE /admin/research/sessions/{id}` -- deletes workspace folder + Claude CLI project folder
- `GET /admin/research/sessions/{id}/files` -- reads filesystem via `os.listdir()`
- `GET /admin/research/sessions/{id}/files/{name}` -- reads file from workspace for download
Read-only operations (work from any node via shared database):
- `GET /admin/research` -- renders page with session list and messages from DB
- `GET /admin/research/sessions/{id}` -- loads session messages from DB
- `GET /admin/research/poll/{job_id}` -- polls job status from DB (Story #413, PostgreSQL Backend for BackgroundJobs and SyncJobs, moves jobs to PostgreSQL)
- `POST /admin/research/sessions` -- creates new session (assigns current node as owner, no proxy needed)
- `PUT /admin/research/sessions/{id}` -- renames session (database-only operation)
File Listing and Download
list_files() and download_file() currently read the filesystem (os.listdir(), FileResponse). These are classified as stateful and proxied to the owner node. This is simpler than duplicating file metadata in the database and consistent with the "route to owner" approach.
Proxy Implementation
Use httpx (already a dependency via FastAPI/Starlette) for the proxy HTTP calls. The proxy forwards:
- All original headers (including Authorization)
- Request body (form data, file uploads)
- HTTP method and path
- Query parameters
Timeout: 120 seconds for proxy calls (Claude CLI can take minutes to respond).
No Silent Failover
[Anti-fallback principle] If the owner node is down, stateful operations fail with a clear error. There is no attempt to re-create the session workspace on another node or start a new Claude CLI conversation without the existing transcript.
No ONTAP Involvement
Session workspaces remain node-local. There is no shared filesystem for Research Assistant data.
[Conversation: Design explicitly excludes ONTAP for Research Assistant]
Affected Files
Modified
- `src/code_indexer/server/storage/database_manager.py` -- add `owner_node` column to `research_sessions` schema + migration
- `src/code_indexer/server/services/research_assistant_service.py` -- add `owner_node` to session creation, backfill logic, expose owner info in session dicts
- `src/code_indexer/server/routers/research_assistant.py` -- add routing middleware for stateful operations, proxy logic, health check
New
- `src/code_indexer/server/services/node_identity.py` -- node identification service (get_node_id from config or hostname)
- `src/code_indexer/server/services/node_proxy.py` -- HTTP proxy service for forwarding requests to owner nodes, health check
- `tests/unit/server/services/test_node_identity.py` -- unit tests for node identity
- `tests/unit/server/services/test_node_proxy.py` -- unit tests for proxy and health check
- `tests/unit/server/routers/test_research_assistant_routing.py` -- unit tests for routing decisions
Definition of Done
- All acceptance criteria satisfied
- >90% unit test coverage on new routing and proxy logic
- Integration tests passing for routing decisions and database operations
- E2E manual testing completed on local server (single-node validation)
- Code review approved (tdd-engineer + code-reviewer workflow)
- No lint/type errors (`./lint.sh` passes)
- `./fast-automation.sh` passes with zero failures
- Existing Research Assistant tests continue passing (zero regression)
- `owner_node` column migration is idempotent (safe to run on existing databases)
- Error messages are clear and actionable when owner node is unavailable
- Working software deployable in standalone mode (routing logic is no-op when `owner_node` matches or is NULL)