bug(web): session-detail page connection-pool starvation — RSC sidebar prefetches + sync probes + first-failure-fatal cascade produces "Failed to load session" #1855

@i-trytoohard

Bug

Loading /projects/<projectId>/sessions/<sessionId> intermittently fails with the dashboard error:

Failed to load session. The session request is taking too long, so the page stopped waiting instead of spinning forever.

The failure is not a real server error — every aborted request in the HAR shows wait=0ms, blocked=5000–8140ms and status=0, meaning the browser client-aborted them after they sat queued in the HTTP/1.1 connection pool (Chrome caps at 6 concurrent per origin). The browser never actually opened a TCP connection for them.

The same page sometimes loads fine and sometimes shows the error, on the same machine, same code, same data, because the trigger is purely a race between RSC-rendering wall-clock time and the page's 8s AbortController deadline.

Source: live debugging session with HAR captures
Reported by: @i-trytoohard
Date: 2026-05-15
Analyzed against: 7d324b53 (origin/main HEAD)
AO version: 0.9.0 (PR #1849 / fix-1848 dist swapped — bug independent of this swap)
Environment: macOS, Node 25.9.0, Chromium-based browser
Confidence: High — root cause directly observed in two HAR captures + cross-traced to source.

What two HARs revealed

SUCCESS load (eventually rendered)

72 entries total
  /api/sessions/ao-187 calls : 19  (3 client-aborted, 16 succeeded)
  /api/sessions?fresh=true   : 14  (12 client-aborted, 2 succeeded)
  total wall time           : 212.3s of request time

FAILED load (error rendered)

38 entries total
  /api/sessions/ao-187 calls : 5  (5/5 client-aborted)
  /api/sessions?fresh=true   : 9  (9/9 client-aborted)
  total wall time           : 179.4s of request time

In both loads, all status=0 aborts have blocked ≈ deadline:

  • /api/sessions/ao-187: blocked=8001-8140ms → matches SESSION_FETCH_TIMEOUT_MS=8000
  • /api/sessions?fresh=true: blocked=5000-6002ms → matches PROJECT_SIDEBAR_FETCH_TIMEOUT_MS=5000

So the timeouts are client-side (AbortController), not server unresponsiveness. Browsers cap HTTP/1.1 same-origin connections at 6; once all 6 slots are pinned, queued requests hit their deadlines before a connection frees up, and the page gives up.
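
The signature described above (status=0, wait near zero, blocked close to the deadline) can be checked mechanically against a HAR export. The sketch below assumes the standard HAR 1.2 entry fields (response.status, timings.blocked, timings.wait); findQueuedAborts and its thresholds are illustrative names and values, not part of the codebase.

```typescript
// Minimal subset of the HAR 1.2 entry shape used by this check.
interface HarEntry {
  request: { url: string };
  response: { status: number };
  timings: { blocked?: number; wait?: number };
}

interface QueuedAbort {
  url: string;
  blockedMs: number;
  waitMs: number;
}

// Returns entries the browser aborted (status 0) that spent their time
// queued ("blocked") rather than waiting on the server ("wait").
// HAR uses -1 for timings that do not apply, so negatives are clamped to 0.
function findQueuedAborts(entries: HarEntry[], minBlockedMs = 1000): QueuedAbort[] {
  return entries
    .filter((e) => e.response.status === 0)
    .map((e) => ({
      url: e.request.url,
      blockedMs: Math.max(0, e.timings.blocked ?? 0),
      waitMs: Math.max(0, e.timings.wait ?? 0),
    }))
    .filter((s) => s.blockedMs >= minBlockedMs && s.waitMs < s.blockedMs / 10);
}
```

Feeding it `JSON.parse(harText).log.entries` from a saved capture should list only the requests that died in the queue, which makes it easy to compare blockedMs against the two timeout constants.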

Root cause (multi-layer)

Layer 1 — 5 RSC project prefetches pin 5/6 connection slots

The session-detail page's sidebar renders a <Link> with prefetching enabled (the Next.js default) for every registered project:

  • /projects/integrator?_rsc=10mlo
  • /projects/workos?_rsc=10mlo
  • /projects/agent-orchestrator?_rsc=10mlo
  • /projects/mercury_8c1e44e68c?_rsc=10mlo
  • /projects/eng-hiring_dd1da96287?_rsc=10mlo

These fire concurrently on mount and each takes 1.3–7.3 seconds to complete, because server-side rendering of each project tree calls enrichSessionsMetadata → the agent plugin's getActivityState → findClaudeProcess/findCodexProcess → ps -eo pid,tty,args (the slow path, partially mitigated by #1838 but still ≥200ms each).

Layer 2 — /mux WebSocket pins 1 more slot

The mux upgrade in the same HAR (status=101, time=55-74 seconds) pins 1 connection for its entire lifetime. Combined with Layer 1, 6/6 connection slots are now occupied.

Layer 3 — API requests queue behind them and hit deadlines

/api/sessions/<id> and /api/sessions?fresh=true get queued. After 8000ms / 5000ms they hit their AbortController deadlines and abort.
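
The queueing effect in Layers 1 to 3 can be reproduced with a toy model. The sketch below is a deliberately simplified simulation (all requests enqueued at t=0, served FIFO by the first free slot); simulatePool and the request durations passed to it are hypothetical, and real browser scheduling is more involved.

```typescript
interface SimRequest {
  name: string;
  durationMs: number;  // time the request holds a connection once started
  deadlineMs?: number; // client-side abort deadline, measured from enqueue
}

type Outcome = "completed" | "aborted-in-queue" | "aborted-in-flight";

interface SimResult {
  name: string;
  outcome: Outcome;
  blockedMs: number; // time spent waiting for a free slot
}

// Simplified model: every request is enqueued at t=0 and served in array
// order by whichever slot frees up first.
function simulatePool(requests: SimRequest[], slots = 6): SimResult[] {
  const freeAt: number[] = new Array<number>(slots).fill(0);
  const results: SimResult[] = [];
  for (const req of requests) {
    // Pick the slot that becomes free earliest.
    let idx = 0;
    for (let i = 1; i < slots; i++) {
      if (freeAt[i] < freeAt[idx]) idx = i;
    }
    const startAt = freeAt[idx];
    const deadline = req.deadlineMs ?? Infinity;
    if (startAt >= deadline) {
      // The deadline fires while the request is still queued: it never
      // gets a connection, so blocked time equals the deadline.
      results.push({ name: req.name, outcome: "aborted-in-queue", blockedMs: deadline });
      continue;
    }
    const finishAt = startAt + req.durationMs;
    freeAt[idx] = Math.min(finishAt, deadline);
    results.push({
      name: req.name,
      outcome: finishAt <= deadline ? "completed" : "aborted-in-flight",
      blockedMs: startAt,
    });
  }
  return results;
}
```

With five prefetches that each hold a slot longer than 8000ms plus one long-lived mux connection, a session fetch with an 8000ms deadline comes back as aborted-in-queue with blockedMs equal to the deadline, which is the shape seen in the HARs. Shorten the prefetches and the same fetch completes.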

Layer 4 — fresh=true cascade compounds

useSessionEvents.scheduleRefresh() (packages/web/src/hooks/useSessionEvents.ts) fires /api/sessions?fresh=true on cadence (every 30s and on membership change). The HARs show 9 of these in the failed load — none with backoff or coordination. Each one would have triggered a full re-probe of all sessions server-side.

Layer 5 — First timeout becomes terminal UI state

packages/web/src/app/sessions/[id]/page.tsx:531:

setRouteError(err instanceof Error ? err : new Error("Failed to load session"));

One AbortError → terminal error UI. No retry-with-backoff. Yet the SUCCESS HAR shows successful fetches in later entries (#41, 43, 44, 47, 54, 55, 57, 60, 61, 63, 65, 69, 71) — data would have arrived if the page weren't locked in error state.
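
One way to keep a single timeout from becoming terminal is to count consecutive failures and only surface the error past a threshold. The sketch below is a minimal illustration under that assumption; FailureGate, its default threshold of 3, and the backoff values are hypothetical and do not exist in page.tsx.

```typescript
// Tracks consecutive fetch failures and decides when an error should be
// shown to the user. Any success resets the counter.
class FailureGate {
  private consecutive = 0;
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  recordSuccess(): void {
    this.consecutive = 0;
  }

  // Returns true once the failure streak reaches the threshold, i.e. the
  // point where the caller would invoke setRouteError(...).
  recordFailure(): boolean {
    this.consecutive += 1;
    return this.consecutive >= this.threshold;
  }

  // Exponential backoff for the next retry: base, 2x base, 4x base, capped.
  nextRetryDelayMs(baseMs = 500, capMs = 8000): number {
    const exponent = Math.max(0, this.consecutive - 1);
    return Math.min(capMs, baseMs * 2 ** exponent);
  }
}
```

In the fetch's catch branch, the page would call recordFailure() and schedule a retry after nextRetryDelayMs() unless it returns true; the success branch would call recordSuccess() and clear any pending error.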

File:line references

  • Client timeout constants: packages/web/src/app/sessions/[id]/page.tsx:86-88 (SESSION_FETCH_TIMEOUT_MS=8000, PROJECT_SIDEBAR_FETCH_TIMEOUT_MS=5000, PROJECTS_FETCH_TIMEOUT_MS=5000)
  • First-failure-fatal: packages/web/src/app/sessions/[id]/page.tsx:531 (setRouteError(...))
  • Mount-time session fetch: packages/web/src/app/sessions/[id]/page.tsx:505
  • HTTP refresh that contends with mux: packages/web/src/hooks/useSessionEvents.ts:scheduleRefresh()
  • Server-side enrichment path: packages/web/src/app/api/sessions/[id]/route.ts → enrichSessionsMetadata() → agent plugin probe
  • RSC prefetch in sidebar: packages/web/src/components/ProjectSidebar.tsx (<Link> defaults to prefetch={true})

Reproduction

  1. Have several projects registered (5+ to saturate the 6-slot pool reliably).
  2. Have several running sessions (ao spawn 5-10).
  3. From the dashboard, click any session to navigate to /projects/<projectId>/sessions/<id>.
  4. Observe (with DevTools Network panel open): 5 RSC prefetches fire, sit at varying blocked=X times; session and sidebar fetches queue behind them.
  5. Every few attempts the page shows the "Failed to load session" error; refreshing might succeed depending on probe timing.

Suggested fixes (in priority order)

  1. <Link prefetch={false}> on sidebar project links. Or restrict prefetch to the currently-active project. Single-line change in ProjectSidebar.tsx. Frees 4 connection slots immediately and fixes the user-visible symptom for everyone with >2 registered projects.

  2. Read sessions from disk index, not from probes. Background lifecycle-manager already writes enriched state to each <id>.json. /api/sessions/<id> should just deserialize that file (~5ms) — no synchronous probe. This kills the multiplicative slowdown that makes the RSC fetches slow in the first place. Pure indexing, no caching.

  3. Concurrent-fetch dedup on the client. A Map<key, Promise> for in-flight fetches in client-fetch.ts so 15 simultaneous /api/sessions/ao-187 calls become 1 HTTP request. Not caching — purely concurrent-request coalescing.

  4. First failure not fatal. Require N=3 consecutive failures (or N seconds of no progress) before setRouteError(...). With this, transient queue pressure becomes a hidden retry instead of a hard error.

  5. Don't HTTP-poll when mux is connected and recent. useSessionEvents's scheduleRefresh() should skip when mux.status === "connected" && Date.now() - lastSnapshotAt < STALE_THRESHOLD. Eliminates the redundant ?fresh=true storm.
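
The skip condition in fix 5 reduces to a small pure predicate. The sketch below assumes a mux state object with a status and a lastSnapshotAt timestamp; the MuxState shape, the other status values, and the 30s default for STALE_THRESHOLD_MS are assumptions for illustration, not taken from useSessionEvents.ts.

```typescript
// Hypothetical shape of the mux connection state.
type MuxStatus = "connecting" | "connected" | "disconnected";

interface MuxState {
  status: MuxStatus;
  lastSnapshotAt: number | null; // epoch ms of the last snapshot, if any
}

// Assumed staleness window; the real value would be tuned in the hook.
const STALE_THRESHOLD_MS = 30_000;

// Returns true when an HTTP ?fresh=true refresh is still worth issuing:
// the mux is not connected, has never delivered a snapshot, or its last
// snapshot is older than the staleness window.
function shouldHttpRefresh(
  mux: MuxState,
  now: number = Date.now(),
  staleThresholdMs: number = STALE_THRESHOLD_MS,
): boolean {
  if (mux.status !== "connected" || mux.lastSnapshotAt === null) {
    return true;
  }
  return now - mux.lastSnapshotAt >= staleThresholdMs;
}
```

scheduleRefresh() would return early whenever this predicate is false, so a healthy mux connection suppresses the polling path entirely.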

(1) alone likely fixes the user-visible bug entirely. (2) is the architectural win. (3) and (4) are belt-and-suspenders.
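
The coalescing in fix 3 can be sketched in a few lines. This is a minimal illustration assuming a module-level map keyed by request URL; dedupedFetch and inFlight are hypothetical names, and the actual shape of client-fetch.ts may differ.

```typescript
// In-flight requests keyed by an identifier such as the request URL.
const inFlight = new Map<string, Promise<unknown>>();

// Runs `load` at most once per key while a previous call for that key is
// still pending; concurrent callers share the same promise. The entry is
// removed when the promise settles, so nothing is cached afterwards.
function dedupedFetch<T>(key: string, load: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) {
    return existing as Promise<T>;
  }
  const promise = load().finally(() => {
    inFlight.delete(key);
  });
  inFlight.set(key, promise);
  return promise;
}
```

A caller would wrap the existing request, for example `dedupedFetch(url, () => fetch(url).then((r) => r.json()))`. One design point to settle is cancellation: because callers share a promise, a per-caller AbortController should stop that caller from waiting rather than abort the shared request.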

Impact

  • Any user with >2 registered projects intermittently can't open session-detail pages.
  • Especially visible after AO has been up a while and the process table is dense, slowing every RSC server-render.
  • Workarounds: hard-refresh after seeing the error, open the URL directly (skips some prefetches), or run ao session attach from the terminal.
