Streaming inference in deploy tab — replace polling with SSE #42

@SahilKumar75

Current behavior

The test-chat in wizard step 6 (Deploy tab) sends a POST /api/jobs/{id}/infer request and waits for the full response before displaying any output. With larger models or long outputs, the user sees no feedback at all until generation finishes.

Proposed change

Add a Server-Sent Events (SSE) streaming endpoint alongside the existing single-response endpoint, and switch the test-chat to it:

GET /api/jobs/{id}/infer/stream?prompt=...

The server yields tokens as they are generated:

data: {"token": "Hello"}\n\n
data: {"token": ","}\n\n
data: {"token": " world"}\n\n
data: [DONE]\n\n
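To make the wire format above concrete, here is a minimal sketch of encoding and decoding these frames with only the standard library. The helper names (`encode_token`, `encode_done`, `decode_frame`) are hypothetical and not part of the existing codebase; the sketch assumes Python 3.9+ for `str.removeprefix`.

```python
import json
from typing import Optional

DONE = "[DONE]"  # end-of-stream sentinel from the proposal


def encode_token(token: str) -> str:
    """Serialize one token as an SSE frame: a data line followed by a blank line."""
    return f"data: {json.dumps({'token': token})}\n\n"


def encode_done() -> str:
    """Serialize the end-of-stream sentinel frame."""
    return f"data: {DONE}\n\n"


def decode_frame(frame: str) -> Optional[str]:
    """Return the token carried by a frame, or None for the [DONE] sentinel."""
    payload = frame.strip().removeprefix("data:").strip()
    if payload == DONE:
        return None
    return json.loads(payload)["token"]
```

Wrapping each token in JSON means a token that itself contains a newline is escaped by `json.dumps`, so it cannot be mistaken for the blank line that terminates a frame.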

The Reflex frontend consumes the stream and appends tokens to the chat bubble in real time.
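The consuming side might look roughly like the sketch below. It is written in plain Python rather than against any Reflex API, and `iter_tokens` is a hypothetical helper; how the resulting text is pushed into Reflex state is left open here. The point it illustrates is that network chunks can split a frame at any byte, so the client has to buffer until the terminating blank line arrives.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(chunks: Iterable[str]) -> Iterator[str]:
    """Yield tokens from raw SSE text chunks, stopping at the [DONE] frame."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A frame is complete only once its terminating blank line has arrived.
        while "\n\n" in buffer:
            frame, buffer = buffer.split("\n\n", 1)
            payload = frame.removeprefix("data:").strip()
            if payload == "[DONE]":
                return
            if payload:
                yield json.loads(payload)["token"]


# Simulated network chunks that cut a frame in the middle.
chunks = [
    'data: {"token": "Hel',
    'lo"}\n\ndata: {"token": ","}\n\n',
    'data: {"token": " world"}\n\ndata: [DONE]\n\n',
]

bubble = ""  # stands in for the chat bubble text
for token in iter_tokens(chunks):
    bubble += token
```

Returning as soon as `[DONE]` is seen gives the caller a clear point at which to close the connection.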

Implementation notes

  • FastAPI supports SSE via fastapi.responses.StreamingResponse with media_type="text/event-stream".
  • The transformers TextIteratorStreamer can be used to stream tokens from the model.
  • The Reflex frontend can use a fetch call with ReadableStream and dispatch state updates via rx.set_state (or a polling shim if full SSE support is unavailable in the current Reflex version).
  • Keep the existing non-streaming POST /infer endpoint for backward compatibility.
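The server-side notes above could be wired together along these lines. This is a sketch under stated assumptions, not the project's implementation: `fake_generate` is a hypothetical stand-in for the model call, and `stream_from_worker` only imitates the hand-off that `TextIteratorStreamer` provides (generation runs in a background thread while the request handler iterates over the produced text). With a real model, the streamer would be passed to `model.generate(..., streamer=streamer)` in that thread and iterated directly.

```python
import json
import queue
import threading
from typing import Callable, Iterable, Iterator

_END = object()  # marks the end of the worker's output


def stream_from_worker(produce: Callable[[Callable[[str], None]], None]) -> Iterator[str]:
    """Run produce(emit) in a background thread and yield each emitted piece."""
    q: queue.Queue = queue.Queue()

    def run() -> None:
        try:
            produce(q.put)
        finally:
            q.put(_END)  # always unblock the consumer, even on error

    threading.Thread(target=run, daemon=True).start()
    while (item := q.get()) is not _END:
        yield item


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap tokens in the SSE frames from the proposal, ending with [DONE]."""
    for token in tokens:
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"


def fake_generate(prompt: str) -> Callable[[Callable[[str], None]], None]:
    """Hypothetical stand-in for model generation; ignores the prompt."""
    def produce(emit: Callable[[str], None]) -> None:
        for piece in ["Hello", ",", " world"]:
            emit(piece)
    return produce


def create_app():
    """Build the FastAPI app; imported lazily so the helpers work without FastAPI."""
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    @app.get("/api/jobs/{job_id}/infer/stream")
    def infer_stream(job_id: str, prompt: str):
        tokens = stream_from_worker(fake_generate(prompt))
        return StreamingResponse(sse_frames(tokens), media_type="text/event-stream")

    return app
```

Because `sse_frames` is a plain generator, it can be unit-tested without starting a server, and the existing POST /infer handler is left untouched.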

Acceptance criteria

  • Tokens appear incrementally in the chat bubble as they are generated
  • SSE connection closes cleanly on [DONE]
  • Existing non-streaming endpoint still passes its tests
  • Works with both CPU and CUDA inference backends

Metadata

    Labels

    area:fine-tuning (LoRA, QLoRA, training configuration, and tuning workflows)
    enhancement (New feature or request)
