bug(web): session-detail page connection-pool starvation — RSC sidebar prefetches + sync probes + first-failure-fatal cascade produces "Failed to load session" #1855

## Bug

Loading `/projects/<projectId>/sessions/<sessionId>` intermittently fails with the dashboard error:

> **Failed to load session.** The session request is taking too long, so the page stopped waiting instead of spinning forever.

The failure is not a real server error — every aborted request in the HAR shows `wait=0ms, blocked=5000–8140ms` and `status=0`, meaning the browser **client-aborted** them after they sat queued in the HTTP/1.1 connection pool (Chrome caps at 6 concurrent per origin). The browser never actually opened a TCP connection for them.

The same page sometimes loads fine and sometimes shows the error, on the same machine, same code, same data, because the trigger is purely a race between RSC-rendering wall-clock time and the page's 8s `AbortController` deadline.

**Source:** live debugging session with HAR captures
**Reported by:** @i-trytoohard
**Date:** 2026-05-15
**Analyzed against:** `7d324b53` (origin/main HEAD)
**AO version:** 0.9.0 (PR #1849 / fix-1848 dist swapped — bug independent of this swap)
**Environment:** macOS, Node 25.9.0, Chromium-based browser
**Confidence:** **High** — root cause directly observed in two HAR captures + cross-traced to source.

## What two HARs revealed

### SUCCESS load (eventually rendered)
```
72 entries total
  /api/sessions/ao-187 calls : 19  (3 client-aborted, 16 succeeded)
  /api/sessions?fresh=true   : 14  (12 client-aborted, 2 succeeded)
  summed request time        : 212.3s
```

### FAILED load (error rendered)
```
38 entries total
  /api/sessions/ao-187 calls : 5  (5/5 client-aborted)
  /api/sessions?fresh=true   : 9  (9/9 client-aborted)
  summed request time        : 179.4s
```

In **both** loads, all `status=0` aborts have `blocked ≈ deadline`:
- `/api/sessions/ao-187`: `blocked=8001-8140ms` → matches `SESSION_FETCH_TIMEOUT_MS=8000`
- `/api/sessions?fresh=true`: `blocked=5000-6002ms` → matches `PROJECT_SIDEBAR_FETCH_TIMEOUT_MS=5000`

So the timeouts are **client-side** (`AbortController`), not a sign of an unresponsive server. Browsers cap HTTP/1.1 same-origin connections at 6; once all 6 slots are pinned by long-running requests, everything queued behind them times out and the page bails.
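The same classification can be run mechanically over a HAR export. The following is a minimal sketch, assuming the standard HAR 1.2 fields (`response.status`, `timings.blocked`, `timings.wait`); the helper names and the 1500 ms slack are illustrative assumptions, not part of the codebase.

```typescript
// Minimal subset of the HAR 1.2 entry shape that this check needs.
interface HarEntry {
  request: { url: string };
  response: { status: number };
  timings: { blocked?: number; wait: number };
}

// Hypothetical helper: true when an entry looks like a request the browser
// aborted while it was still queued (status 0, no server wait, and a
// blocked time that lands at or just past the client deadline).
function isQueuedClientAbort(
  entry: HarEntry,
  deadlineMs: number,
  slackMs = 1500, // assumed tolerance for timer and scheduling jitter
): boolean {
  const blocked = entry.timings.blocked ?? -1;
  return (
    entry.response.status === 0 &&
    entry.timings.wait <= 0 &&
    blocked >= deadlineMs &&
    blocked <= deadlineMs + slackMs
  );
}

// Hypothetical helper: count queued aborts for one URL family.
function countQueuedAborts(
  entries: HarEntry[],
  urlPart: string,
  deadlineMs: number,
): number {
  return entries.filter(
    (e) => e.request.url.includes(urlPart) && isQueuedClientAbort(e, deadlineMs),
  ).length;
}
```

Calling `countQueuedAborts(har.log.entries, "/api/sessions/ao-187", 8000)` on a parsed HAR would then give the client-aborted count for the session endpoint.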

## Root cause (multi-layer)

### Layer 1 — 5 RSC project prefetches pin 5/6 connection slots

The session-detail page's sidebar renders a `<Link>` with prefetching enabled (the Next.js default) for every registered project:
- `/projects/integrator?_rsc=10mlo`
- `/projects/workos?_rsc=10mlo`
- `/projects/agent-orchestrator?_rsc=10mlo`
- `/projects/mercury_8c1e44e68c?_rsc=10mlo`
- `/projects/eng-hiring_dd1da96287?_rsc=10mlo`

These fire concurrently on mount and each takes **1.3–7.3 seconds** to complete. The cost is server-side: rendering each project tree calls `enrichSessionsMetadata` → the agent plugin's `getActivityState` → `findClaudeProcess`/`findCodexProcess` → `ps -eo pid,tty,args`. That slow path was partially mitigated by #1838 but still costs ≥200ms per call.

### Layer 2 — `/mux` WebSocket pins 1 more slot

The mux upgrade in the same HAR (`status=101`, time=55-74 seconds) pins **1 connection** for its entire lifetime. Combined with Layer 1, **6/6 connection slots are now occupied**.

### Layer 3 — API requests queue behind them and hit deadlines

`/api/sessions/<id>` and `/api/sessions?fresh=true` get queued. After `8000ms` / `5000ms` they hit their `AbortController` deadlines and abort.
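The queueing described in Layers 1 to 3 can be illustrated with a small deterministic model. This is a sketch under stated assumptions: every request is issued at t=0, the pool is FIFO, and the prefetch durations and 200 ms service times are made-up values (the durations are picked from within the observed 1.3–7.3 s range). It is not a model of Chrome's actual scheduler.

```typescript
type Outcome = "completed" | "aborted-in-queue" | "aborted-in-flight";

interface SimRequest {
  name: string;
  serviceMs: number;   // how long the request holds a connection once started
  deadlineMs?: number; // client-side abort deadline, measured from t=0
}

interface SimResult {
  name: string;
  startedAtMs: number | null; // null when the request never got a slot
  outcome: Outcome;
}

// FIFO model of a fixed-size connection pool; all requests arrive at t=0.
function simulatePool(requests: SimRequest[], slots = 6): SimResult[] {
  const freeAt: number[] = new Array<number>(slots).fill(0);
  return requests.map((req): SimResult => {
    const start = Math.min(...freeAt); // next slot to become free
    const slot = freeAt.indexOf(start);
    const deadline = req.deadlineMs ?? Infinity;
    if (start >= deadline) {
      // Deadline fired while still queued: the slot is never consumed.
      return { name: req.name, startedAtMs: null, outcome: "aborted-in-queue" };
    }
    const finish = start + req.serviceMs;
    if (finish > deadline) {
      freeAt[slot] = deadline; // abort releases the connection at the deadline
      return { name: req.name, startedAtMs: start, outcome: "aborted-in-flight" };
    }
    freeAt[slot] = finish;
    return { name: req.name, startedAtMs: start, outcome: "completed" };
  });
}

// Illustrative scenario: 5 slow prefetches plus the mux socket fill all 6 slots.
const starved = simulatePool([
  { name: "prefetch-1", serviceMs: 5200 },
  { name: "prefetch-2", serviceMs: 5600 },
  { name: "prefetch-3", serviceMs: 6100 },
  { name: "prefetch-4", serviceMs: 6800 },
  { name: "prefetch-5", serviceMs: 7300 },
  { name: "mux", serviceMs: Infinity }, // holds its slot for the whole run
  { name: "sidebar", serviceMs: 200, deadlineMs: 5000 },
  { name: "session", serviceMs: 200, deadlineMs: 8000 },
]);
// sidebar: aborted-in-queue; session: completed (starts at 5200 ms)
console.log(starved.map((r) => `${r.name}: ${r.outcome}`).join("\n"));
```

In this small model only the 5000 ms sidebar fetch is starved. The 8000 ms session fetch is starved only when no slot frees within 8 s, for example when further slow requests are queued ahead of it. Removing the five prefetch entries lets both fetches start at t=0.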

### Layer 4 — `fresh=true` cascade compounds

`useSessionEvents.scheduleRefresh()` (`packages/web/src/hooks/useSessionEvents.ts`) fires `/api/sessions?fresh=true` on cadence (every 30s and on membership change). The HARs show **9 of these in the failed load** — none with backoff or coordination. Each one would have triggered a full re-probe of all sessions server-side.
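The missing coordination could look roughly like the following. This is a sketch, not the hook's actual code: `shouldSkipHttpRefresh` and `nextRefreshDelayMs` are hypothetical helpers, the set of mux states is assumed, and the 30 s base simply mirrors the cadence mentioned above.

```typescript
// Assumed set of mux connection states; the real hook may use different names.
type MuxStatus = "connecting" | "connected" | "disconnected";

interface RefreshGateInput {
  muxStatus: MuxStatus;
  lastSnapshotAt: number | null; // epoch ms of the last mux snapshot, null if none yet
  now: number;                   // epoch ms, passed in to keep the function pure
  staleThresholdMs: number;
}

// Hypothetical gate: skip the HTTP refresh while the mux feed is connected
// and has delivered a snapshot recently enough.
function shouldSkipHttpRefresh(input: RefreshGateInput): boolean {
  if (input.muxStatus !== "connected" || input.lastSnapshotAt === null) {
    return false;
  }
  return input.now - input.lastSnapshotAt < input.staleThresholdMs;
}

// Hypothetical backoff: double the delay after each consecutive failure,
// capped at maxMs, so aborted refreshes do not pile onto a saturated pool.
function nextRefreshDelayMs(
  consecutiveFailures: number,
  baseMs = 30_000,
  maxMs = 300_000,
): number {
  return Math.min(maxMs, baseMs * 2 ** consecutiveFailures);
}
```

Keeping both helpers pure (time is an argument, not read from `Date.now()` inside) makes them easy to unit-test independently of the hook.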

### Layer 5 — First timeout becomes terminal UI state

`packages/web/src/app/sessions/[id]/page.tsx:531`:

```ts
setRouteError(err instanceof Error ? err : new Error("Failed to load session"));
```

One `AbortError` → terminal error UI. No retry-with-backoff. Yet the SUCCESS HAR shows successful fetches in later entries (#41, 43, 44, 47, 54, 55, 57, 60, 61, 63, 65, 69, 71) — data *would* have arrived if the page weren't locked in error state.
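A less brittle policy would tolerate a few consecutive failures before surfacing the error. A minimal sketch, assuming a hypothetical `createFailureGate` helper and a threshold of 3 chosen for illustration; the call site shown in the comment is also hypothetical:

```typescript
interface FailureGate {
  recordSuccess(): void;
  // Returns true once the failure streak reaches the threshold.
  recordFailure(): boolean;
}

// Hypothetical helper: only report an error after `threshold`
// consecutive failures; any success resets the streak.
function createFailureGate(threshold = 3): FailureGate {
  let consecutive = 0;
  return {
    recordSuccess() {
      consecutive = 0;
    },
    recordFailure() {
      consecutive += 1;
      return consecutive >= threshold;
    },
  };
}

// Hypothetical call site inside the page's fetch loop:
//   try { ...fetch session...; gate.recordSuccess(); }
//   catch (err) {
//     if (gate.recordFailure()) setRouteError(/* as today */);
//     // otherwise stay in the loading state and let the next attempt run
//   }
```

With this shape, a single `AbortError` caused by queue pressure keeps the page in its loading state, and later successful fetches can still populate it.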

## File:line references

| What | Where |
|------|------|
| Client timeout constants | `packages/web/src/app/sessions/[id]/page.tsx:86-88` (`SESSION_FETCH_TIMEOUT_MS=8000`, `PROJECT_SIDEBAR_FETCH_TIMEOUT_MS=5000`, `PROJECTS_FETCH_TIMEOUT_MS=5000`) |
| First-failure-fatal | `packages/web/src/app/sessions/[id]/page.tsx:531` `setRouteError(...)` |
| Mount-time session fetch | `packages/web/src/app/sessions/[id]/page.tsx:505` |
| HTTP refresh that contends with mux | `packages/web/src/hooks/useSessionEvents.ts:scheduleRefresh()` |
| Server-side enrichment path | `packages/web/src/app/api/sessions/[id]/route.ts` → `enrichSessionsMetadata()` → agent plugin probe |
| RSC prefetch in sidebar | `packages/web/src/components/ProjectSidebar.tsx` — `<Link>` defaults to `prefetch={true}` |

## Reproduction

1. Have several projects registered (5+ to saturate the 6-slot pool reliably).
2. Have several running sessions (`ao spawn` 5-10).
3. From the dashboard, click any session to navigate to `/projects/<projectId>/sessions/<id>`.
4. Observe (with DevTools Network panel open): 5 RSC prefetches fire, sit at varying `blocked=X` times; session and sidebar fetches queue behind them.
5. Every few attempts the page shows the **"Failed to load session"** error; refreshing might succeed depending on probe timing.

## Suggested fixes (in priority order)

1. **`<Link prefetch={false}>` on sidebar project links.** Or restrict prefetch to the currently-active project. **Single-line change in `ProjectSidebar.tsx`.** Frees up to 5 connection slots immediately (4 if the active project keeps prefetching) and fixes the user-visible symptom for everyone with >2 registered projects.

2. **Read sessions from disk index, not from probes.** Background lifecycle-manager already writes enriched state to each `<id>.json`. `/api/sessions/<id>` should just deserialize that file (~5ms) — no synchronous probe. This kills the multiplicative slowdown that makes the RSC fetches slow in the first place. Pure indexing, no caching.

3. **Concurrent-fetch dedup on the client.** A `Map<key, Promise>` for in-flight fetches in `client-fetch.ts` so 15 simultaneous `/api/sessions/ao-187` calls become 1 HTTP request. Not caching — purely concurrent-request coalescing.

4. **First failure not fatal.** Require N=3 consecutive failures (or N seconds of no progress) before `setRouteError(...)`. With this, transient queue pressure becomes a hidden retry instead of a hard error.

5. **Don't HTTP-poll when mux is connected and recent.** `useSessionEvents`'s `scheduleRefresh()` should skip when `mux.status === "connected" && Date.now() - lastSnapshotAt < STALE_THRESHOLD`. Eliminates the redundant `?fresh=true` storm.

(1) alone likely fixes the user-visible bug entirely. (2) is the architectural win. (3), (4), and (5) are belt-and-suspenders.
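The coalescing in fix (3) can be sketched as follows. The `dedupe` helper and the module-level `inFlight` map are illustrative assumptions about how `client-fetch.ts` might be extended, not its current contents.

```typescript
// In-flight requests keyed by a caller-chosen string (for example the URL).
const inFlight = new Map<string, Promise<unknown>>();

// Hypothetical helper: concurrent callers with the same key share one
// promise. The entry is removed as soon as the request settles, so nothing
// is cached beyond the lifetime of the request itself.
function dedupe<T>(key: string, start: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) {
    return existing as Promise<T>;
  }
  const promise = start().finally(() => {
    inFlight.delete(key);
  });
  inFlight.set(key, promise);
  return promise;
}

// Usage sketch: both calls resolve from a single HTTP request.
//   const a = dedupe(url, () => fetch(url).then((r) => r.json()));
//   const b = dedupe(url, () => fetch(url).then((r) => r.json()));
```

One design point to settle before adopting this: a shared request cannot be tied to any single caller's `AbortSignal`, because one caller aborting would cancel the fetch for everyone else waiting on the same key.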

## Related

- **#1410** — sidebar "Failed to load sessions" — same connection-pool starvation but a different UI surface; same root cause path. The fix here would resolve that issue's sidebar manifestation too.
- **#1832** — dashboard renders Exited from wrong field — adjacent: about *what* the page shows when load succeeds, not *whether* load succeeds.
- **#1850** — `ao status` is slow — CLI-side manifestation of the same slow-probe family root cause. Different surface.
- **#1838** — `ps` timeout = "process gone" (fixed in #1839/#1849). Made probes slow rather than fatal; this issue is what's left exposed once #1838's safety net is in place.
- **#1454** — terminated sessions never revive. Independent but in the same neighborhood — both root-cause-class bugs we've been chasing in tandem.

## Impact

- Any user with >2 registered projects intermittently can't open session-detail pages.
- Especially visible after AO has been up a while and the process table is dense, slowing every RSC server-render.
- Workarounds: hard-refresh after seeing the error. Or open the URL directly (skips some prefetch). Or `ao session attach` from the terminal.
