[STORY] Research Assistant Node-Pinned Session Routing #433

@jsbattig

Part of: #408

Story: Research Assistant Node-Pinned Session Routing

As a CIDX cluster administrator using the Research Assistant
I want to have my chat sessions automatically routed to the node where they were created
So that sessions work correctly in a multi-node cluster without requiring shared filesystems for node-local state (workspaces, Claude CLI transcripts, uploaded files)

Part of Epic: #408 (CIDX Server Clusterization)
Feature: Research Assistant Cluster Support (priority-2)


Context and Rationale

The Research Assistant creates node-local state that cannot be shared across cluster nodes:

  • Session workspaces at ~/.cidx-server/research/{session_id}/ with symlinks and uploaded files
  • Claude CLI transcripts at ~/.claude/projects/ (used by --resume for conversation continuity)
  • In-memory _jobs dict for active job tracking (node-local by nature)

[Conversation: "the conversation in DB, but any operation that needs the file system, we pin to the server where it lives"]

The design decision is to route requests to the owning node rather than move or replicate the data:
[Conversation: "we should use some kind of token in the header or something so that when a new chat message is posted by the user, it goes directly to the server where the session lives"]

Read-only operations (browse sessions, view messages) work from any node via the shared database (PostgreSQL in cluster mode). Only stateful operations (send message, upload file, delete file, delete session) require routing to the owner node.

[Conversation: "we need to store that IN the database, and when starting a new chat, first message, it must go to the server we are hitting in that moment"]


Implementation Status

  • owner_node column added to research_sessions table (SQLite migration + PostgreSQL schema)
  • Node identity service (NodeIdentity) for consistent node identification
  • First-message node assignment in create_session() and get_default_session()
  • Session routing middleware in Research Assistant router (proxy stateful requests to owner node)
  • Node health check before proxying (fail fast if owner is down)
  • Clear error responses when owner node is unavailable ("Session host unavailable")
  • Read-only operations verified working from any node (no routing needed)
  • Unit tests for routing logic, node assignment, health check, error paths
  • Integration tests for proxy behavior and cross-node scenarios
  • E2E manual testing on local server

Completion: 0/10 tasks complete (0%)


Algorithm

Node Identity:
  node_id = config.json "node_id" field if set, otherwise hostname (from socket.gethostname())

  get_node_id():
    IF config.json has "node_id" field:
      RETURN config["node_id"]
    RETURN socket.gethostname()

Session Owner Assignment:

  create_session():
    session_id = uuid4()
    owner_node = get_node_id()
    INSERT INTO research_sessions (id, name, folder_path, created_at, updated_at, owner_node)
    RETURN session dict with owner_node

  get_default_session():
    IF default session exists:
      IF owner_node IS NULL:
        UPDATE owner_node = get_node_id()  # Backfill for pre-cluster sessions
      RETURN session
    ELSE:
      CREATE with owner_node = get_node_id()
      RETURN session

Request Routing (router-level middleware):

  classify_operation(endpoint, method):
    STATEFUL = {POST /send, POST /sessions/{id}/upload, DELETE /sessions/{id}/files/{name},
                DELETE /sessions/{id}, GET /sessions/{id}/files,
                GET /sessions/{id}/files/{name}}
    READ_ONLY = {GET /sessions/{id} (messages), GET /poll/{job_id}, GET "" (page render),
                 POST /sessions (create - always local, assigns current node),
                 PUT /sessions/{id} (rename - database-only)}
    IF endpoint IN STATEFUL: RETURN "stateful"
    RETURN "read_only"

  route_request(request, session_id):
    IF classify_operation(request.endpoint, request.method) == "read_only":
      RETURN handle_locally()  # Messages and session list come from shared DB

    # Stateful operation - check owner
    session = get_session(session_id)
    IF session IS NULL:
      RETURN 404 "Session not found"

    owner_node = session.get("owner_node")
    IF owner_node IS NULL OR owner_node == get_node_id():
      RETURN handle_locally()  # We are the owner (or pre-cluster session)

    # Need to proxy to owner node
    IF NOT is_node_healthy(owner_node):
      RETURN 503 "Session host unavailable: {owner_node} is not responding.
                   Read-only access is still available."

    RETURN proxy_to_node(owner_node, request)

  is_node_healthy(node_id):
    node_url = resolve_node_url(node_id)  # From cluster config or discovery
    TRY:
      response = HTTP GET {node_url}/health with timeout=3s
      RETURN response.status_code == 200
    CATCH timeout/connection_error:
      RETURN False

  proxy_to_node(node_id, original_request):
    node_url = resolve_node_url(node_id)
    # Forward the request with all headers, body, auth
    response = HTTP {method} {node_url}{original_path}
      with headers=original_headers, body=original_body, timeout=120s
    RETURN response

Node URL Resolution:
  resolve_node_url(node_id):
    # Option 1: Static config in config.json
    cluster_nodes = config.get("cluster_nodes", {})
    IF node_id IN cluster_nodes:
      RETURN cluster_nodes[node_id]["url"]  # e.g., "http://node1:8000"

    # Option 2: Node registered its URL in shared DB at startup
    RETURN lookup_node_url_from_db(node_id)
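
The routing decision in route_request() can be sketched as a pure function, which keeps it easy to unit test without a running cluster. This is a minimal sketch, not the story's implementation: `RoutingDecision` and `decide_route` are hypothetical names, and the health probe is injected as a callable so tests can stub it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RoutingDecision:
    action: str                   # "local", "proxy", or "error"
    status: int = 200             # HTTP status to return when action == "error"
    detail: str = ""              # error message, if any
    target: Optional[str] = None  # owner node id when action == "proxy"


def decide_route(
    kind: str,
    session: Optional[dict],
    local_node: str,
    is_node_healthy: Callable[[str], bool],
) -> RoutingDecision:
    """Mirror the route_request() pseudocode: decide where a request is handled."""
    if kind == "read_only":
        # Messages and session lists come from the shared database.
        return RoutingDecision("local")
    if session is None:
        return RoutingDecision("error", 404, "Session not found")

    owner = session.get("owner_node")
    if owner is None or owner == local_node:
        # This node is the owner, or the session predates cluster mode.
        return RoutingDecision("local")

    if not is_node_healthy(owner):
        return RoutingDecision(
            "error",
            503,
            f"Session host unavailable: {owner} is not responding. "
            "Read-only access is still available.",
        )
    return RoutingDecision("proxy", target=owner)
```

The router would then act on the returned decision: call the local handler, forward the request to `target`, or return the error status and detail.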

Acceptance Criteria

Scenario: First message in new session assigns owner node
  Given a multi-node CIDX cluster with nodes A and B
  When a user creates a new Research Assistant session on Node A
  Then the session record in the database has owner_node set to Node A's node_id
  And the session workspace is created at ~/.cidx-server/research/{id}/ on Node A only
  And subsequent stateful operations for this session are handled by Node A

Scenario: Stateful request proxied to owner node
  Given a Research Assistant session created on Node A with owner_node="nodeA"
  When the user sends a chat message and the request hits Node B (via load balancer)
  Then Node B checks the session's owner_node in the database
  And Node B verifies Node A is healthy via /health endpoint
  And Node B proxies the request to Node A
  And Node A executes Claude CLI locally and stores the response
  And the response is returned to the user through Node B

Scenario: Read-only operations work from any node
  Given a Research Assistant session with messages stored in the shared database
  When the user browses sessions or views message history from any cluster node
  Then the request is handled locally without proxying
  And all session metadata and messages are read from the shared database
  And no filesystem access to the owner node is required

Scenario: Owner node is down - stateful operations fail with clear error
  Given a Research Assistant session created on Node A
  And Node A is currently unreachable (process down or network failure)
  When the user sends a chat message and the request hits Node B
  Then Node B detects Node A is unhealthy via /health check (3s timeout)
  And Node B returns HTTP 503 with message "Session host unavailable"
  And the error message identifies which node is down
  And read-only operations (view messages, browse sessions) still work from Node B

Scenario: Pre-cluster sessions backfilled with owner node on first use
  Given a research session created before cluster mode was enabled (owner_node IS NULL)
  When a stateful operation is performed on this session from any node
  Then the current node claims ownership by setting owner_node to its own node_id
  And the session workspace is created on the current node if it does not exist
  And subsequent operations route to this node
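
The story does not say what happens if two nodes try to claim the same pre-cluster session at once. One possible approach, sketched here against SQLite with a hypothetical `claim_session` helper, is a conditional UPDATE that only succeeds while owner_node is still NULL, followed by a read of whichever owner was stored.

```python
import sqlite3


def claim_session(conn: sqlite3.Connection, session_id: str, node_id: str) -> str:
    """Claim the session for node_id if it has no owner; return the stored owner."""
    # The WHERE clause makes the claim a no-op when another node got there first.
    conn.execute(
        "UPDATE research_sessions SET owner_node = ? "
        "WHERE id = ? AND owner_node IS NULL",
        (node_id, session_id),
    )
    conn.commit()
    row = conn.execute(
        "SELECT owner_node FROM research_sessions WHERE id = ?",
        (session_id,),
    ).fetchone()
    if row is None:
        raise KeyError(f"Session not found: {session_id}")
    return row[0]
```

If the returned owner differs from the caller's node_id, the caller lost the race and would route the request to the returned owner instead of handling it locally.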

Testing Requirements

Unit Tests

  • owner_node column migration: verify column added to existing databases, NULL default for old rows
  • NodeIdentity.get_node_id(): config-based override, hostname fallback
  • create_session(): verify owner_node is populated in returned session and database
  • get_default_session(): verify owner_node backfill for NULL rows
  • Operation classification: verify stateful vs read-only for each endpoint
  • Routing decision logic: local execution when owner matches, proxy trigger when different
  • Health check: mock HTTP responses for healthy/unhealthy/timeout scenarios
  • Error response format: verify 503 status and "Session host unavailable" message
  • Proxy request construction: verify headers, body, auth forwarded correctly

Integration Tests

  • Full routing flow: create session on node A identity, verify routing decision returns proxy for node B identity
  • Database round-trip: session with owner_node persists and reads back correctly
  • Health check integration: real HTTP call to /health endpoint on localhost
  • Backfill scenario: NULL owner_node session gets claimed by current node

E2E Manual Testing

  • Start local CIDX server on localhost:8000
  • Create a research session via Web UI, verify owner_node is set in database
  • Send a message, verify Claude CLI executes locally
  • Query database to confirm owner_node column value
  • Verify read-only operations (GET sessions, GET messages) work independently of owner_node
  • Simulate owner mismatch by manually changing owner_node in DB, verify 503 error on send

What Cannot Be Tested Locally

  • Actual cross-node proxying (requires two CIDX instances) -- deferred to staging cluster validation
  • Load balancer sticky session interaction -- validated at cluster integration time

Key Design Decisions

Node Identification

Use socket.gethostname() as default, with config.json "node_id" override for environments where hostname is not stable (containers, dynamic cloud instances). This matches the pattern used elsewhere in Epic #408 for node identity.
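
A minimal sketch of that lookup, assuming the config is a JSON object on disk. The `config_path` parameter is an illustration only; the story does not specify where config.json lives or how NodeIdentity loads it.

```python
import json
import socket
from pathlib import Path
from typing import Optional


def get_node_id(config_path: Optional[Path] = None) -> str:
    """Return the configured node_id, falling back to the machine hostname."""
    if config_path is not None and config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        node_id = config.get("node_id")
        # Ignore empty or non-string values rather than storing them as owners.
        if isinstance(node_id, str) and node_id.strip():
            return node_id.strip()
    return socket.gethostname()
```

Whatever value is returned ends up in owner_node, so it has to stay stable across restarts; that is the reason for the explicit override in containers.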

Operation Classification

Operations are classified at the router level, not the service level. The router inspects the endpoint and HTTP method to determine if routing is needed before calling the service. This keeps the service layer unaware of cluster topology.

Stateful operations (require filesystem access on owner node):

  • POST /admin/research/send -- executes Claude CLI with --resume, needs ~/.claude/projects/
  • POST /admin/research/sessions/{id}/upload -- writes file to ~/.cidx-server/research/{id}/
  • DELETE /admin/research/sessions/{id}/files/{name} -- deletes file from workspace
  • DELETE /admin/research/sessions/{id} -- deletes workspace folder + Claude CLI project folder
  • GET /admin/research/sessions/{id}/files -- reads filesystem via os.listdir()
  • GET /admin/research/sessions/{id}/files/{name} -- reads file from workspace for download

Read-only operations (work from any node via shared database):

  • GET /admin/research -- renders page with session list and messages from DB
  • GET /admin/research/sessions/{id} -- loads session messages from DB
  • GET /admin/research/poll/{job_id} -- polls job status from DB (Story #413, PostgreSQL Backend for BackgroundJobs and SyncJobs, moves jobs to PostgreSQL)
  • POST /admin/research/sessions -- creates new session (assigns current node as owner, no proxy needed)
  • PUT /admin/research/sessions/{id} -- renames session (database-only operation)
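
The two lists above can be expressed as a small lookup over method and path. This is a sketch under assumptions: the `/admin/research` prefix is taken from the lists, while the regex table and the function signature are illustrative rather than the router's actual code.

```python
import re

PREFIX = "/admin/research"

# (method, path pattern) pairs that need the owner node's filesystem.
# Patterns are relative to PREFIX and must match the whole remaining path.
_STATEFUL = [
    ("POST", re.compile(r"/send")),
    ("POST", re.compile(r"/sessions/[^/]+/upload")),
    ("DELETE", re.compile(r"/sessions/[^/]+/files/[^/]+")),
    ("DELETE", re.compile(r"/sessions/[^/]+")),
    ("GET", re.compile(r"/sessions/[^/]+/files")),
    ("GET", re.compile(r"/sessions/[^/]+/files/[^/]+")),
]


def classify_operation(method: str, path: str) -> str:
    """Return "stateful" for filesystem-bound endpoints, else "read_only"."""
    rel = path[len(PREFIX):] if path.startswith(PREFIX) else path
    rel = rel.rstrip("/")
    for stateful_method, pattern in _STATEFUL:
        if method.upper() == stateful_method and pattern.fullmatch(rel):
            return "stateful"
    # Everything else is served from the shared database on any node.
    return "read_only"
```

Defaulting to "read_only" follows the pseudocode in the Algorithm section; a stricter variant could reject unknown endpoints instead.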

File Listing and Download

list_files() and download_file() currently read the filesystem (os.listdir(), FileResponse). These are classified as stateful and proxied to the owner node. This is simpler than duplicating file metadata in the database and consistent with the "route to owner" approach.

Proxy Implementation

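Use httpx for the proxy HTTP calls. FastAPI and Starlette pull in httpx only as an optional extra (Starlette needs it for its TestClient), so confirm it is declared as a runtime dependency. The proxy forwards: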

  • All original headers (including Authorization)
  • Request body (form data, file uploads)
  • HTTP method and path
  • Query parameters

Timeout: 120 seconds for proxy calls (Claude CLI can take minutes to respond).
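
As a sketch of how the forwarded request might be assembled, the helper below builds the keyword arguments for an HTTP client call. `build_proxy_request` is a hypothetical name. It also makes one assumption the story does not state: hop-by-hop headers, plus Host and Content-Length, are dropped rather than copied verbatim, since the client recomputes them for the new target.

```python
from typing import Mapping

# Headers commonly treated as hop-by-hop; they describe a single connection.
HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
})
# Recomputed by the HTTP client for the owner node (assumption, see lead-in).
RECOMPUTED = frozenset({"host", "content-length"})

PROXY_TIMEOUT_SECONDS = 120.0


def build_proxy_request(
    node_url: str,
    method: str,
    path: str,
    query: str,
    headers: Mapping[str, str],
    body: bytes,
) -> dict:
    """Build request kwargs that forward method, path, query, headers and body."""
    url = node_url.rstrip("/") + path
    if query:
        url += "?" + query
    skipped = HOP_BY_HOP | RECOMPUTED
    forwarded = {k: v for k, v in headers.items() if k.lower() not in skipped}
    return {
        "method": method,
        "url": url,
        "headers": forwarded,  # includes Authorization and Content-Type
        "content": body,       # raw bytes, so multipart boundaries stay intact
        "timeout": PROXY_TIMEOUT_SECONDS,
    }
```

The returned dict uses parameter names accepted by `httpx.AsyncClient.request`, so it could be passed as `await client.request(**kwargs)`.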

No Silent Failover

[Anti-fallback principle] If the owner node is down, stateful operations fail with a clear error. There is no attempt to re-create the session workspace on another node or start a new Claude CLI conversation without the existing transcript.
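
The fail-fast check that precedes this error could look like the sketch below. It uses the standard library instead of httpx to stay self-contained, takes an already resolved node URL rather than a node_id, and bypasses environment proxy settings for node-to-node calls; all three choices are assumptions for illustration.

```python
import http.client
import urllib.error
import urllib.request

HEALTH_TIMEOUT_SECONDS = 3.0

# An empty ProxyHandler ignores HTTP(S)_PROXY variables for intra-cluster calls.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_node_healthy(node_url: str, timeout: float = HEALTH_TIMEOUT_SECONDS) -> bool:
    """Return True only if GET {node_url}/health answers 200 within the timeout."""
    url = node_url.rstrip("/") + "/health"
    try:
        with _opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Timeouts, refused connections, non-2xx answers and malformed URLs
        # all count as "unhealthy"; the caller then returns 503.
        return False
```

Any outcome other than a 200 is treated as unhealthy, so the caller never falls back to handling the request on the wrong node.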

No ONTAP Involvement

Session workspaces remain node-local. There is no shared filesystem for Research Assistant data.
[Conversation: Design explicitly excludes ONTAP for Research Assistant]


Affected Files

Modified

  • src/code_indexer/server/storage/database_manager.py -- Add owner_node column to research_sessions schema + migration
  • src/code_indexer/server/services/research_assistant_service.py -- Add owner_node to session creation, backfill logic, expose owner info in session dicts
  • src/code_indexer/server/routers/research_assistant.py -- Add routing middleware for stateful operations, proxy logic, health check

New

  • src/code_indexer/server/services/node_identity.py -- Node identification service (get_node_id from config or hostname)
  • src/code_indexer/server/services/node_proxy.py -- HTTP proxy service for forwarding requests to owner nodes, health check
  • tests/unit/server/services/test_node_identity.py -- Unit tests for node identity
  • tests/unit/server/services/test_node_proxy.py -- Unit tests for proxy and health check
  • tests/unit/server/routers/test_research_assistant_routing.py -- Unit tests for routing decisions

Definition of Done

  • All acceptance criteria satisfied
  • >90% unit test coverage on new routing and proxy logic
  • Integration tests passing for routing decisions and database operations
  • E2E manual testing completed on local server (single-node validation)
  • Code review approved (tdd-engineer + code-reviewer workflow)
  • No lint/type errors (./lint.sh passes)
  • ./fast-automation.sh passes with zero failures
  • Existing Research Assistant tests continue passing (zero regression)
  • owner_node column migration is idempotent (safe to run on existing databases)
  • Error messages are clear and actionable when owner node is unavailable
  • Working software deployable in standalone mode (routing logic is no-op when owner_node matches or is NULL)
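
For the idempotent-migration item, one way to make the SQLite side safe to re-run is to inspect the table before altering it. This is a sketch with a hypothetical `ensure_owner_node_column` helper, not the code in database_manager.py; on PostgreSQL, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` achieves the same effect.

```python
import sqlite3


def ensure_owner_node_column(conn: sqlite3.Connection) -> bool:
    """Add research_sessions.owner_node if missing; return True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(research_sessions)")}
    if "owner_node" in columns:
        return False
    # No DEFAULT clause, so existing rows read back as NULL (pre-cluster sessions).
    conn.execute("ALTER TABLE research_sessions ADD COLUMN owner_node TEXT")
    conn.commit()
    return True
```

Running it twice leaves the schema unchanged on the second call, which is the behaviour the unit test for the migration would assert.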
