
perf(inference): chunked batched prefill for long prompts on Metal #188

Open
ohdearquant wants to merge 1 commit into main from pr/prefill-1-chunk


@ohdearquant ohdearquant commented Jun 4, 2026

Owner

chunked batched prefill for long prompts (Metal)

Replaces the per-token prefill loop with a chunked, batched Metal path for long
prompts. Long-context prompts no longer pay a GPU dispatch per token during prefill.

Why

For prompts longer than the prefill window, the old path fell back to a per-token
loop (one GPU dispatch per token), which dominated time-to-first-token on long
contexts. This PR splits the prompt into chunks and batches the work within each
chunk, so dispatch count scales with the number of chunks rather than the number
of tokens.
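
As a rough illustration of the chunking step only, the sketch below splits a prompt of `prompt_len` tokens into contiguous ranges of at most `chunk_size` tokens. The function name `chunk_ranges` and the chunk size of 256 used in the note are hypothetical and are not taken from this PR; the actual window size and Metal dispatch code live in the diff.

```rust
use std::ops::Range;

/// Hypothetical helper (not from this PR): split `prompt_len` tokens into
/// contiguous ranges of at most `chunk_size` tokens each. A batched prefill
/// path could then process one range at a time instead of one token at a time.
fn chunk_ranges(prompt_len: usize, chunk_size: usize) -> Vec<Range<usize>> {
    // A zero chunk size would never make progress, so reject it up front.
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    (0..prompt_len)
        .step_by(chunk_size)
        // The last chunk is clamped so it never runs past the prompt.
        .map(|start| start..(start + chunk_size).min(prompt_len))
        .collect()
}
```

Under these assumptions, `chunk_ranges(1000, 256)` yields four ranges, with the last one (`768..1000`) shorter than the rest, so a 1000-token prompt would need four chunk-level passes rather than 1000 token-level ones.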

Result (Qwen3.5-0.8B, Apple Silicon Metal, 1000-token prompt)

                               Prefill tok/s   TTFT
before (per-token fallback)    99              10.2 s
after (chunked batched)        184             5.5 s

~1.86× prefill, bit-exact (parity max_abs_diff = 0.000000). Decode path unchanged.
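
For readers unfamiliar with the parity metric, a `max_abs_diff` of this kind is commonly computed as the largest element-wise absolute difference between the outputs of the old and new paths. The sketch below is one minimal way to do that; the function signature and the use of `f32` slices are assumptions, not the test harness used in this PR.

```rust
/// Hypothetical parity helper (not from this PR): largest element-wise
/// absolute difference between two output buffers. Returns `None` when the
/// lengths differ, since the comparison is then not meaningful.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    // Note: `f32::max` ignores NaN operands, so a NaN in either buffer would
    // not show up in the result; a stricter check would test for NaN first.
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0_f32, f32::max),
    )
}
```

A result of `Some(0.0)` means every element matched exactly, which is what "bit-exact" parity corresponds to for finite values.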

Notes

  • First of a 3-PR stack (prefill perf). PR2 adds batched attention; PR3 adds the
    parallel GatedDeltaNet scan.
  • No new crates; no library unwrap().

Stack: #188 (this) → #189 (batched attention) → #190 (GDN parallel scan)

Replaces the per-token prefill loop with a chunked batched Metal path for
long prompts, removing per-token GPU dispatch overhead during prefill.
Decode unchanged; prefill argmax parity preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>