Bug
Loading /projects/<projectId>/sessions/<sessionId> intermittently fails with the dashboard error:
Failed to load session. The session request is taking too long, so the page stopped waiting instead of spinning forever.
The failure is not a real server error — every aborted request in the HAR shows wait=0ms, blocked=5000–8140ms and status=0, meaning the browser client-aborted them after they sat queued in the HTTP/1.1 connection pool (Chrome caps at 6 concurrent per origin). The browser never actually opened a TCP connection for them.
The same page sometimes loads fine and sometimes shows the error, on the same machine, same code, same data, because the trigger is purely a race between RSC-rendering wall-clock time and the page's 8s AbortController deadline.
Source: live debugging session with HAR captures
Reported by: @i-trytoohard
Date: 2026-05-15
Analyzed against: 7d324b53 (origin/main HEAD)
AO version: 0.9.0 (PR #1849 / fix-1848 dist swapped — bug independent of this swap)
Environment: macOS, Node 25.9.0, Chromium-based browser
Confidence: High — root cause directly observed in two HAR captures + cross-traced to source.
What two HARs revealed
SUCCESS load (eventually rendered)
72 entries total
/api/sessions/ao-187 calls : 19 (3 client-aborted, 16 succeeded)
/api/sessions?fresh=true : 14 (12 client-aborted, 2 succeeded)
total wall time : 212.3s of request time
FAILED load (error rendered)
38 entries total
/api/sessions/ao-187 calls : 5 (5/5 client-aborted)
/api/sessions?fresh=true : 9 (9/9 client-aborted)
total wall time : 179.4s of request time
In both loads, all status=0 aborts have blocked ≈ deadline:
/api/sessions/ao-187: blocked=8001-8140ms → matches SESSION_FETCH_TIMEOUT_MS=8000
/api/sessions?fresh=true: blocked=5000-6002ms → matches PROJECT_SIDEBAR_FETCH_TIMEOUT_MS=5000
So the timeouts are client-side (AbortController), not server unresponsiveness. Browsers cap HTTP/1.1 same-origin connections at 6; once all 6 slots are pinned, queued requests hit their deadlines before they are ever sent, and the page bails.
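The "blocked ≈ deadline" check above can be automated against a HAR export. The following is a minimal sketch, not tooling from the repo: `isQueuedClientAbort` and `countQueuedAborts` are hypothetical helper names, the `HarEntry` type covers only the HAR 1.2 fields used here, and the sample entries are illustrative values in the ranges reported above rather than rows copied from the captures.

```typescript
// Minimal subset of a HAR 1.2 entry; real entries carry many more fields.
interface HarEntry {
  request: { url: string };
  response: { status: number };
  timings: { blocked?: number; wait?: number };
}

// Hypothetical helper: true when an entry looks like a request the client
// aborted while it was still queued (status 0, no server wait time, and
// blocked time at or beyond the client-side deadline).
function isQueuedClientAbort(entry: HarEntry, deadlineMs: number): boolean {
  const blocked = entry.timings.blocked ?? -1; // HAR uses -1 for "not applicable"
  const wait = entry.timings.wait ?? -1;
  return entry.response.status === 0 && wait <= 0 && blocked >= deadlineMs;
}

// Hypothetical helper: count queued aborts for URLs containing `urlPart`.
function countQueuedAborts(entries: HarEntry[], urlPart: string, deadlineMs: number): number {
  return entries.filter(
    (e) => e.request.url.includes(urlPart) && isQueuedClientAbort(e, deadlineMs),
  ).length;
}

// Illustrative entries only (assumed values, not copied from the HARs).
const sample: HarEntry[] = [
  { request: { url: "/api/sessions/ao-187" }, response: { status: 0 }, timings: { blocked: 8001, wait: 0 } },
  { request: { url: "/api/sessions/ao-187" }, response: { status: 200 }, timings: { blocked: 120, wait: 340 } },
  { request: { url: "/api/sessions?fresh=true" }, response: { status: 0 }, timings: { blocked: 5000, wait: 0 } },
];

const sessionAborts = countQueuedAborts(sample, "/api/sessions/ao-187", 8000);
const sidebarAborts = countQueuedAborts(sample, "/api/sessions?fresh=true", 5000);
```

Running the same two counts over `log.entries` of a real capture should reproduce the aborted/succeeded split listed for each load.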
Root cause (multi-layer)
Layer 1 — 5 RSC project prefetches pin 5/6 connection slots
The session-detail page's sidebar renders <Link prefetch> (Next.js default) for every registered project:
/projects/integrator?_rsc=10mlo
/projects/workos?_rsc=10mlo
/projects/agent-orchestrator?_rsc=10mlo
/projects/mercury_8c1e44e68c?_rsc=10mlo
/projects/eng-hiring_dd1da96287?_rsc=10mlo
These fire concurrently on mount, and each takes 1.3–7.3 seconds to complete because server-side rendering of each project tree calls into enrichSessionsMetadata → agent plugin's getActivityState → findClaudeProcess/findCodexProcess → ps -eo pid,tty,args (the slow path, partially mitigated by #1838 but still ≥200ms each).
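One way to keep these prefetches from pinning the pool is to decide the prefetch prop per link, so that at most the active project is prefetched. The sketch below assumes a hypothetical helper `sidebarPrefetchProp` and an `activeProjectId` value that the sidebar would have to supply; neither name comes from ProjectSidebar.tsx. It relies only on the documented behaviour that `prefetch={false}` turns off viewport prefetching, while leaving the prop undefined keeps the Next.js default.

```typescript
// Hypothetical helper: value to pass as <Link prefetch={...}> for one sidebar link.
// `undefined` keeps the framework default; `false` disables viewport prefetching.
function sidebarPrefetchProp(
  projectId: string,
  activeProjectId: string | null,
): false | undefined {
  return projectId === activeProjectId ? undefined : false;
}

// Project IDs as they appear in the prefetch URLs listed above.
const projectIds = [
  "integrator",
  "workos",
  "agent-orchestrator",
  "mercury_8c1e44e68c",
  "eng-hiring_dd1da96287",
];

// With one active project, only that link keeps the default prefetch.
const stillPrefetching = projectIds.filter(
  (id) => sidebarPrefetchProp(id, "agent-orchestrator") === undefined,
);
```

In the component this would be used roughly as `<Link href={...} prefetch={sidebarPrefetchProp(project.id, activeProjectId)}>`; passing `false` unconditionally is the simpler variant.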
Layer 2 — /mux WebSocket pins 1 more slot
The mux upgrade in the same HAR (status=101, time=55-74 seconds) pins 1 connection for its entire lifetime. Combined with Layer 1, 6/6 connection slots are now occupied.
Layer 3 — API requests queue behind them and hit deadlines
/api/sessions/<id> and /api/sessions?fresh=true get queued. After 8000ms / 5000ms they hit their AbortController deadlines and abort.
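The queueing effect can be illustrated with a small deterministic model. This is a simplified sketch under stated assumptions, not a reproduction of the captures: `simulateQueue` is a hypothetical function, the slot release times are made-up values inside the 1.3–7.3s range with `Infinity` standing in for the long-lived mux connection, requests are served first-in first-out, and aborts of requests that have already started are not modelled.

```typescript
interface QueuedRequest {
  name: string;
  arrivalMs: number;  // when the page issued the fetch
  durationMs: number; // how long the request holds a slot once it starts
  deadlineMs: number; // AbortController timeout, measured from arrival
}

interface Outcome {
  name: string;
  startedAtMs: number | null; // null means aborted while still queued
}

// Hypothetical model: each request takes the earliest-free slot, unless
// that slot only frees up at or after the request's deadline.
function simulateQueue(slotFreeAtMs: number[], requests: QueuedRequest[]): Outcome[] {
  const slots = [...slotFreeAtMs];
  const ordered = [...requests].sort((a, b) => a.arrivalMs - b.arrivalMs);
  return ordered.map((req) => {
    const freeAt = Math.min(...slots);
    const startMs = Math.max(req.arrivalMs, freeAt);
    if (startMs >= req.arrivalMs + req.deadlineMs) {
      return { name: req.name, startedAtMs: null }; // deadline fired in the queue
    }
    slots[slots.indexOf(freeAt)] = startMs + req.durationMs; // slot is busy again
    return { name: req.name, startedAtMs: startMs };
  });
}

// Assumed scenario: five prefetches release their slots late, mux never does.
const requests: QueuedRequest[] = [
  { name: "sidebar", arrivalMs: 0, durationMs: 300, deadlineMs: 5000 },
  ...[1, 2, 3, 4, 5, 6].map((n) => ({
    name: `session-${n}`,
    arrivalMs: 0,
    durationMs: 3000,
    deadlineMs: 8000,
  })),
];
const outcomes = simulateQueue([5200, 5800, 6400, 7000, 7300, Infinity], requests);
const abortedInQueue = outcomes.filter((o) => o.startedAtMs === null).map((o) => o.name);
```

In this scenario the sidebar fetch (5000ms deadline) aborts before any slot frees, five session fetches start late, and the sixth aborts at 8000ms because every freed slot was taken again by a slow request ahead of it.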
Layer 4 — fresh=true cascade compounds
useSessionEvents.scheduleRefresh() (packages/web/src/hooks/useSessionEvents.ts) fires /api/sessions?fresh=true on a cadence (every 30s and on membership change). The failed-load HAR shows 9 of these, none with backoff or coordination. Each one would have triggered a full re-probe of all sessions server-side.
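The refresh storm could be gated on mux freshness, as proposed under the suggested fixes. The sketch below is a pure decision function under assumptions: `shouldHttpRefresh` is a hypothetical name, the set of `MuxStatus` values other than "connected" is guessed, and the stale threshold is passed in because the issue does not fix a value for it.

```typescript
// Assumed status values; only "connected" is named in this issue.
type MuxStatus = "connecting" | "connected" | "disconnected";

interface RefreshGateInput {
  muxStatus: MuxStatus;
  lastSnapshotAt: number | null; // epoch ms of the last mux snapshot, if any
  now: number;                   // epoch ms, injected so the check is testable
  staleThresholdMs: number;      // hypothetical tuning value
}

// Hypothetical gate: returns true when an HTTP ?fresh=true poll is still needed.
function shouldHttpRefresh(input: RefreshGateInput): boolean {
  if (input.muxStatus !== "connected") return true; // no live channel, must poll
  if (input.lastSnapshotAt === null) return true;   // connected but no data yet
  return input.now - input.lastSnapshotAt >= input.staleThresholdMs;
}

const skipWhileFresh = shouldHttpRefresh({
  muxStatus: "connected",
  lastSnapshotAt: 1_000,
  now: 5_000,
  staleThresholdMs: 10_000,
});
```

scheduleRefresh() would call this with `Date.now()` and return early when it yields false, so polling only resumes once the mux snapshot goes stale or the socket drops.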
Layer 5 — First timeout becomes terminal UI state
packages/web/src/app/sessions/[id]/page.tsx:531:
setRouteError(err instanceof Error ? err : new Error("Failed to load session"));
One AbortError → terminal error UI. No retry-with-backoff. Yet the SUCCESS HAR shows successful fetches in later entries (#41, 43, 44, 47, 54, 55, 57, 60, 61, 63, 65, 69, 71) — data would have arrived if the page weren't locked in error state.
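A small counter is enough to keep a single abort from becoming terminal. The following is a sketch under assumptions: `createFailureGate` is a hypothetical helper, and the threshold of 3 mirrors the N=3 proposed in the suggested fixes rather than any value in page.tsx.

```typescript
// Hypothetical helper: tracks consecutive fetch failures and reports when
// the caller should give up and surface a terminal error.
function createFailureGate(maxConsecutiveFailures = 3) {
  let consecutive = 0;
  return {
    // Any successful fetch clears the streak.
    recordSuccess(): void {
      consecutive = 0;
    },
    // Returns true once the streak reaches the threshold.
    recordFailure(): boolean {
      consecutive += 1;
      return consecutive >= maxConsecutiveFailures;
    },
  };
}

const gate = createFailureGate(3);
const results = [
  gate.recordFailure(), // 1st failure in a row
  gate.recordFailure(), // 2nd failure in a row
];
gate.recordSuccess();   // a later fetch succeeded, streak resets
results.push(gate.recordFailure(), gate.recordFailure(), gate.recordFailure());
```

In the catch branch the page would call `setRouteError(...)` only when `recordFailure()` returns true, and otherwise schedule another attempt, so later successful fetches like those in the SUCCESS HAR can still populate the view.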
File:line references
| What | Where |
| --- | --- |
| Client timeout constants | packages/web/src/app/sessions/[id]/page.tsx:86-88 (SESSION_FETCH_TIMEOUT_MS=8000, PROJECT_SIDEBAR_FETCH_TIMEOUT_MS=5000, PROJECTS_FETCH_TIMEOUT_MS=5000) |
| First-failure-fatal | packages/web/src/app/sessions/[id]/page.tsx:531 setRouteError(...) |
| Mount-time session fetch | packages/web/src/app/sessions/[id]/page.tsx:505 |
| HTTP refresh that contends with mux | packages/web/src/hooks/useSessionEvents.ts:scheduleRefresh() |
| Server-side enrichment path | packages/web/src/app/api/sessions/[id]/route.ts → enrichSessionsMetadata() → agent plugin probe |
| RSC prefetch in sidebar | packages/web/src/components/ProjectSidebar.tsx: <Link> defaults to prefetch={true} |
Reproduction
- Have several projects registered (5+ to saturate the 6-slot pool reliably).
- Have several running sessions (ao spawn 5-10).
- From the dashboard, click any session to navigate to /projects/<projectId>/sessions/<id>.
- Observe (with DevTools Network panel open): 5 RSC prefetches fire, sit at varying blocked=X times; session and sidebar fetches queue behind them.
- Every few attempts the page shows the "Failed to load session" error; refreshing might succeed depending on probe timing.
Suggested fixes (in priority order)
1. <Link prefetch={false}> on sidebar project links. Or restrict prefetch to the currently-active project. Single-line change in ProjectSidebar.tsx. Frees 4 connection slots immediately and fixes the user-visible symptom for everyone with >2 registered projects.
2. Read sessions from disk index, not from probes. Background lifecycle-manager already writes enriched state to each <id>.json. /api/sessions/<id> should just deserialize that file (~5ms), with no synchronous probe. This kills the multiplicative slowdown that makes the RSC fetches slow in the first place. Pure indexing, no caching.
3. Concurrent-fetch dedup on the client. A Map<key, Promise> for in-flight fetches in client-fetch.ts so 15 simultaneous /api/sessions/ao-187 calls become 1 HTTP request. Not caching, purely concurrent-request coalescing.
4. First failure not fatal. Require N=3 consecutive failures (or N seconds of no progress) before setRouteError(...). With this, transient queue pressure becomes a hidden retry instead of a hard error.
5. Don't HTTP-poll when mux is connected and recent. useSessionEvents's scheduleRefresh() should skip when mux.status === "connected" && Date.now() - lastSnapshotAt < STALE_THRESHOLD. Eliminates the redundant ?fresh=true storm.
(1) alone likely fixes the user-visible bug entirely. (2) is the architectural win. (3) and (4) are belt-and-suspenders.
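The in-flight dedup from the list above can be sketched in a few lines. This is an illustration under assumptions, not the contents of client-fetch.ts: `dedupe` and the module-level `inFlight` map are hypothetical names, and the fake fetcher stands in for a real `fetch` call.

```typescript
// Hypothetical in-flight registry: one shared promise per request key.
const inFlight = new Map<string, Promise<unknown>>();

// Coalesces concurrent calls for the same key into one underlying request.
// Nothing is cached: the entry is removed as soon as the request settles.
function dedupe<T>(key: string, start: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const promise = start().finally(() => {
    inFlight.delete(key);
  });
  inFlight.set(key, promise);
  return promise;
}

// Stand-in for the real network call, counting how often it actually runs.
let underlyingCalls = 0;
const fakeFetchSession = (): Promise<string> => {
  underlyingCalls += 1;
  return Promise.resolve("session payload");
};

const first = dedupe("/api/sessions/ao-187", fakeFetchSession);
const second = dedupe("/api/sessions/ao-187", fakeFetchSession);
```

Both callers receive the same promise, so a rejection (including an abort) reaches every waiter; callers that need independent timeouts would have to layer them on top of the shared promise.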
Impact
- Any user with >2 registered projects intermittently can't open session-detail pages.
- Especially visible after AO has been up a while and the process table is dense, slowing every RSC server-render.
- Workarounds: hard-refresh after seeing the error. Or open the URL directly (skips some prefetch). Or ao session attach from the terminal.
Related
- #1832 (... terminatedAt is set, regardless of lifecycle.session.state): dashboard renders Exited from wrong field. Adjacent: about what the page shows when load succeeds, not whether load succeeds.
- ao status is slow: CLI-side manifestation of the same slow-probe family root cause. Different surface.
- #1838 (ps timeout conflated with "process not found", bulk-terminates all sessions when ps exceeds 5s): ps timeout = "process gone" (fixed in #1839, "fix(agent-plugins,lifecycle): distinguish indeterminate probe from "not found" + bump ps timeout", and #1849, "fix(cli): reap daemon children on stop+SIGINT, sweep orphans on start"). Made probes slow rather than fatal; this issue is what's left exposed once #1838's safety net is in place.