BUG: SSE keepalive misses prefill stalls because the comment is emitted from the progress callback, not a wall-clock timer #222

@njbrake

Written by Claude based on a conversation between Claude and @njbrake. The prose is Claude's, but the idea is Nate's.

Summary

The SSE keepalive added in 8d57664 (Handle prefill errors after SSE keepalive) fires from inside the server_prefill_progress callback, which runs only when prefill reports chunk progress. When a single prefill chunk hangs internally for longer than the client's idle timeout, no progress callback fires, no : prefill\n\n comment is sent, and the client eventually drops the socket. The server then logs client stream write failed when it finally tries to emit generated tokens.

The fix works perfectly when prefill is slow but progressing (e.g. 295 s spread evenly across many chunks survives a strict 5-min client timeout). It fails when prefill stalls inside one chunk.

Repro

A single ds4-server run against an Anthropic-API client (Node + global fetch / undici, default idle timeouts). One prompt across three tool turns, ctx=600000.

09:29:34 Turn 1 prefill start (16936 tokens, ~5 min cold prefill)
09:34:30 Turn 1 prefill done 295.66 s @ 57.28 t/s ✓ ← chunks every ~24 s → keepalive fires regularly, client stays connected
09:34:56 Turn 1 finish=tool_calls

09:35:05 Turn 2 prefill done (268 tokens, 8.9 s) ✓
09:35:28 Turn 2 finish=tool_calls

09:35:28 Turn 3 prefill start (1046 tokens)
09:35:51 prefill 939/1046 in 22.9 s (avg 41.06 t/s) ← progressing normally
09:51:31 prefill 1046/1046 in 962.4 s (chunk = 0.11 t/s for the last 107 tokens)
↑ ~939 s spent inside a single sub-step; no progress callbacks during this window
09:51:35 gen starts
09:51:39 chat ctx=130133..131179:1046 TOOLS final stream failed
09:51:39 finish=error error="client stream write failed" 970.377 s

Turn 1 demonstrates that the fix is in effect: without it, the same 5-minute prefill would have triggered a client-side body-idle abort. Turn 3 shows the structural gap: when the prefill loop itself stops calling back for ~16 minutes, the keepalive is paused too, because it lives on the same call path, so no : comments go out.

Why current keepalive doesn't cover this

ds4_server.c (current main, around line 9208) emits the comment only when the progress callback runs:

} else if (now - p->last_keepalive >= 5.0) {
    static const char ka[] = ": prefill\n\n";
    if (send_all(p->fd, ka, sizeof(ka) - 1)) {
        p->last_keepalive = now;

The 5-second guard is correct, but the comment is only attempted when the callback fires. If the prefill compute path is the thing that's blocked, the callback never runs, the guard never checks now, and the socket goes silent.
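To make the gap concrete, here is a minimal, self-contained model of a callback-gated keepalive. It is a sketch, not code from ds4_server.c: the function name `longest_silence` and its parameters are hypothetical, and it simplifies the real callback down to the 5-second guard alone.

```c
#include <stddef.h>

/*
 * Simplified model of a callback-gated keepalive (hypothetical, not ds4 code).
 * callback_times: ascending times (seconds) at which the progress callback ran.
 * Returns the longest stretch within [start, end] during which nothing was
 * written to the client.
 */
double longest_silence(const double *callback_times, size_t n,
                       double start, double end, double interval)
{
    double last_keepalive = start; /* plays the role of p->last_keepalive */
    double longest = 0.0;

    for (size_t i = 0; i < n; i++) {
        double now = callback_times[i];
        /* The guard is evaluated only here, when the callback runs. */
        if (now - last_keepalive >= interval) {
            if (now - last_keepalive > longest)
                longest = now - last_keepalive;
            last_keepalive = now; /* ": prefill\n\n" would be sent here */
        }
    }
    /* Silence between the last write and the end of prefill. */
    if (end - last_keepalive > longest)
        longest = end - last_keepalive;
    return longest;
}
```

Feeding it the Turn 3 callback times from the log (22.9 s and 962.4 s) gives a silent window of about 939.5 s, far beyond a 5-minute client idle timeout, even though the guard interval is 5 s. With callbacks every 24 s, as in Turn 1, the longest silence is 24 s.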

Suggested direction

Decouple keepalive emission from prefill progress. A few options, easiest first:

  1. Wall-clock keepalive thread. A small dedicated thread (or pthread + condvar with a 1-2 s timed wait) that owns the fd's write side during prefill and sends : prefill\n\n every 5 s regardless of what the model loop is doing. Hand-off back to the main thread when prefill finishes.
  2. Coalesced keepalive on a pselect/timer fd. If the event loop already has one, add a 5 s tick that checks now - last_keepalive for every connection in prefill and writes the comment.
  3. Watchdog around long sub-ops. If certain internal calls (KV disk fetch, Metal residency wait, big matmul tiles) are expected to be long, sprinkle the progress callback inside them with a bytes_consumed=0, in_progress=true signal so existing keepalive code still fires.

(1) is the most defensive — any future code path that gets stuck in prefill stays covered automatically.
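A rough sketch of option (1) follows, assuming POSIX threads. Every name in it (`prefill_keepalive`, `prefill_keepalive_start`, `prefill_keepalive_stop`, `ka_write_all`) is hypothetical and does not exist in ds4_server.c. It uses plain write() so it can be exercised on a pipe; a real integration would presumably reuse the server's own send_all and its existing handling of closed sockets.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical state for a wall-clock keepalive; not taken from ds4_server.c. */
typedef struct {
    int fd;              /* client socket; only this thread writes during prefill */
    double interval_s;   /* keepalive period in seconds */
    bool done;           /* set by the main thread when prefill finishes */
    bool write_failed;   /* set if a keepalive write failed */
    pthread_mutex_t mu;
    pthread_cond_t cv;
    pthread_t thread;
} prefill_keepalive;

/* Write the whole buffer, retrying on EINTR and short writes. 0 on success. */
static int ka_write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

static void *ka_main(void *arg) {
    prefill_keepalive *ka = arg;
    static const char msg[] = ": prefill\n\n";
    pthread_mutex_lock(&ka->mu);
    while (!ka->done) {
        /* Absolute deadline; CLOCK_REALTIME is the default condvar clock. */
        struct timespec dl;
        clock_gettime(CLOCK_REALTIME, &dl);
        time_t whole = (time_t)ka->interval_s;
        dl.tv_sec += whole;
        dl.tv_nsec += (long)((ka->interval_s - (double)whole) * 1e9);
        if (dl.tv_nsec >= 1000000000L) { dl.tv_sec++; dl.tv_nsec -= 1000000000L; }

        int rc = 0;
        while (!ka->done && rc == 0) /* rc == 0 means a wakeup, possibly spurious */
            rc = pthread_cond_timedwait(&ka->cv, &ka->mu, &dl);
        if (ka->done) break;

        /* Write under the lock so stop() never returns in the middle of a comment. */
        if (ka_write_all(ka->fd, msg, sizeof(msg) - 1) != 0) {
            ka->write_failed = true;
            break;
        }
    }
    pthread_mutex_unlock(&ka->mu);
    return NULL;
}

/* Start the keepalive thread before prefill. Returns 0 on success. */
int prefill_keepalive_start(prefill_keepalive *ka, int fd, double interval_s) {
    ka->fd = fd;
    ka->interval_s = interval_s;
    ka->done = false;
    ka->write_failed = false;
    pthread_mutex_init(&ka->mu, NULL);
    pthread_cond_init(&ka->cv, NULL);
    int rc = pthread_create(&ka->thread, NULL, ka_main, ka);
    if (rc != 0) {
        pthread_cond_destroy(&ka->cv);
        pthread_mutex_destroy(&ka->mu);
    }
    return rc;
}

/* Stop after prefill and hand the fd back. Returns false if a write failed. */
bool prefill_keepalive_stop(prefill_keepalive *ka) {
    pthread_mutex_lock(&ka->mu);
    ka->done = true;
    pthread_cond_signal(&ka->cv);
    pthread_mutex_unlock(&ka->mu);
    pthread_join(ka->thread, NULL);
    pthread_cond_destroy(&ka->cv);
    pthread_mutex_destroy(&ka->mu);
    return !ka->write_failed;
}
```

The intended use would be to call prefill_keepalive_start(&ka, fd, 5.0) just before prefill begins and prefill_keepalive_stop(&ka) before the first generated token is written. Because the main thread writes nothing to the fd in between, keepalive comments cannot interleave with SSE events, and a false return from stop tells the caller the client has already gone away. One caveat of this sketch is that a blocking write on a full socket buffer would also delay stop; a production version might want a send timeout.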
