perf(inference): chunked batched prefill for long prompts on Metal by ohdearquant · Pull Request #188 · ohdearquant/lattice

ohdearquant · 2026-06-04T14:10:04Z

chunked batched prefill for long prompts (Metal)

Replaces the per-token prefill loop with a chunked, batched Metal path for long
prompts. Long-context prompts no longer pay a GPU dispatch per token during prefill.

Why

For prompts longer than the prefill window, the old path fell back to a per-token
loop (one GPU dispatch per token), which dominated time-to-first-token on long
contexts. This PR splits the prompt into chunks and batches the work within each chunk.
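
The chunking step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `chunk_ranges` and the chunk size of 256 used in the usage note are hypothetical, since the PR text does not state the prefill window size. It returns an `Option` rather than panicking, in keeping with the "no library unwrap()" note.

```rust
use std::ops::Range;

/// Split `prompt_len` token positions into consecutive half-open ranges,
/// each covering at most `chunk_size` tokens.
///
/// Hypothetical helper for illustration only. Returns `None` when
/// `chunk_size` is zero, because no progress could be made.
fn chunk_ranges(prompt_len: usize, chunk_size: usize) -> Option<Vec<Range<usize>>> {
    if chunk_size == 0 {
        return None;
    }
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < prompt_len {
        // The final chunk may be shorter than `chunk_size`.
        let end = (start + chunk_size).min(prompt_len);
        ranges.push(start..end);
        start = end;
    }
    Some(ranges)
}
```

Under these assumptions, `chunk_ranges(1000, 256)` yields four ranges (`0..256`, `256..512`, `512..768`, `768..1000`), so a batched path would process four chunks where the per-token loop iterates 1000 times. How many GPU dispatches each chunk actually needs depends on the kernels, which this sketch does not model.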

Result (Qwen3.5-0.8B, Apple Silicon Metal, 1000-token prompt)

Path	Prefill tok/s	TTFT
before (per-token fallback)	99	10.2 s
after (chunked batched)	184	5.5 s

~1.86× prefill, bit-exact (parity max_abs_diff = 0.000000). Decode path unchanged.
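
For readers who want to reproduce the two figures above, here is a small sketch of how such numbers are typically computed. It assumes the parity check compares two equally sized `f32` output buffers (for example logits from the old and new prefill paths); the PR text does not say which tensors were compared, and both function names are hypothetical.

```rust
/// Largest element-wise absolute difference between two buffers.
/// Returns `None` if the lengths differ. An empty pair yields 0.0.
/// Hypothetical helper; the PR's actual parity harness is not shown.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0_f32, f32::max),
    )
}

/// Throughput ratio of the new path over the old one (both in tok/s).
fn speedup(before_tok_s: f64, after_tok_s: f64) -> f64 {
    after_tok_s / before_tok_s
}
```

With the table's values, `speedup(99.0, 184.0)` is about 1.86, matching the stated prefill gain, and the TTFT ratio 10.2 / 5.5 is about 1.85, which is consistent with it. A `max_abs_diff` of exactly 0.0 means every compared element is identical.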

Notes

First of a 3-PR stack (prefill perf). PR2 adds batched attention; PR3 adds the
parallel GatedDeltaNet scan.
No new crates; no library unwrap().

Stack: #188 (this) → #189 (batched attention) → #190 (GDN parallel scan)

Replaces the per-token prefill loop with a chunked batched Metal path for long prompts, removing per-token GPU dispatch overhead during prefill. Decode unchanged; prefill argmax parity preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ohdearquant mentioned this pull request Jun 4, 2026

perf(inference): batched causal prefill attention + elementwise batching #189

Open

ohdearquant wants to merge 1 commit into main from pr/prefill-1-chunk
