Current behavior
The test chat in wizard step 6 (Deploy tab) sends a POST /api/jobs/{id}/infer request and waits for the full response before displaying any output. For larger models or long outputs, the user sees a noticeable delay with no feedback until generation finishes.
Proposed change
Add a Server-Sent Events (SSE) streaming endpoint and have the test chat use it instead of the single-response endpoint:
GET /api/jobs/{id}/infer/stream?prompt=...
The server yields tokens as they are generated:
data: {"token": "Hello"}\n\n
data: {"token": ","}\n\n
data: {"token": " world"}\n\n
data: [DONE]\n\n
The Reflex frontend consumes the stream and appends tokens to the chat bubble in real time.
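To make the wire format above concrete, here is a minimal sketch of both sides of the stream using only the Python standard library. The helper names `encode_sse` and `decode_sse` are hypothetical and not part of the project; the sketch assumes each frame carries one JSON object with a `token` key and that `[DONE]` ends the stream, as in the example frames.

```python
import json
from typing import Iterable, Iterator

DONE_SENTINEL = "[DONE]"  # end-of-stream marker from the proposed format


def encode_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then the terminating [DONE] frame."""
    for token in tokens:
        # json.dumps escapes quotes and newlines, so each payload
        # stays on a single "data:" line.
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield f"data: {DONE_SENTINEL}\n\n"


def decode_sse(raw: str) -> Iterator[str]:
    """Yield tokens from a raw SSE body until the [DONE] frame is seen."""
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if not frame.startswith("data:"):
            continue  # skip empty chunks and non-data fields
        payload = frame[len("data:"):].strip()
        if payload == DONE_SENTINEL:
            return
        yield json.loads(payload)["token"]


frames = list(encode_sse(["Hello", ",", " world"]))
body = "".join(frames)
text = "".join(decode_sse(body))
print(text)
```

Encoding the token as JSON rather than sending it raw keeps leading spaces and embedded newlines intact, which matters because tokenizers often emit pieces such as " world".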
Implementation notes
- FastAPI supports SSE via fastapi.responses.StreamingResponse with media_type="text/event-stream".
- The transformers TextIteratorStreamer can be used to stream tokens from the model.
- The Reflex frontend can use a fetch call with ReadableStream and dispatch state updates via rx.set_state (or a polling shim if full SSE support is unavailable in the current Reflex version).
- Keep the existing non-streaming POST /infer endpoint for backward compatibility.
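The server side of these notes could follow a producer-thread pattern: generation blocks, so it runs in a background thread while the response generator drains a queue. The sketch below is a standard-library stand-in, not the actual implementation. `stream_sse` and `fake_generate` are hypothetical names, and `fake_generate` takes the place of a real model call. In the real endpoint, the assumption is that the model's generate call would run in the thread with a TextIteratorStreamer attached, and the resulting generator would be handed to StreamingResponse with media_type="text/event-stream".

```python
import json
import queue
import threading
from typing import Callable, Iterator

_STOP = object()  # queue marker meaning "generation finished"


def stream_sse(generate: Callable[[Callable[[str], None]], None]) -> Iterator[str]:
    """Run blocking `generate` in a thread; yield SSE frames as tokens arrive."""
    tokens: queue.Queue = queue.Queue()

    def worker() -> None:
        try:
            generate(tokens.put)  # `generate` calls the callback once per token
        finally:
            tokens.put(_STOP)  # always unblock the consumer, even on error

    threading.Thread(target=worker, daemon=True).start()

    # Block on the queue until the producer signals completion.
    while (item := tokens.get()) is not _STOP:
        yield f"data: {json.dumps({'token': item})}\n\n"
    yield "data: [DONE]\n\n"


def fake_generate(emit: Callable[[str], None]) -> None:
    """Hypothetical stand-in for the model: emits three fixed tokens."""
    for piece in ["Hello", ",", " world"]:
        emit(piece)


frames = list(stream_sse(fake_generate))
for frame in frames:
    print(frame, end="")
```

Putting the stop marker on the queue in a `finally` block means the client still receives `[DONE]` and the connection closes if generation raises, although a production endpoint would probably also want to report the error to the client.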
Acceptance criteria