
perf(inference): batched causal prefill attention + elementwise batching #189

Open
ohdearquant wants to merge 1 commit into pr/prefill-1-chunk from pr/prefill-2-attn

@ohdearquant ohdearquant commented Jun 4, 2026

Owner

batched causal prefill attention + elementwise batching

Stacked on #188. Replaces the per-token decode-attention loop in prefill with a
single batched causal-attention pass, and batches the elementwise prefill ops so they
no longer issue a GPU dispatch per token.
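
As a rough illustration of the attention change (a CPU sketch only, not the code in this PR), the two functions below compute single-head causal attention over a prompt of `t` tokens with head dimension `d > 0`. `prefill_per_token` mirrors the old shape, one decode-style call per position; `prefill_batched` builds the masked `t x t` score matrix once and applies it in a single pass. All names, the row-major `[t, d]` layout, and the single-head simplification are assumptions made for the example.

```rust
/// Numerically stable softmax, in place. Entries equal to -inf become 0.
fn softmax_in_place(scores: &mut [f32]) {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for s in scores.iter_mut() {
        *s = (*s - max).exp();
        sum += *s;
    }
    for s in scores.iter_mut() {
        *s /= sum;
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Decode-style attention: one query at `pos` attends to keys 0..=pos.
fn attend_one(q_row: &[f32], k: &[f32], v: &[f32], pos: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let mut scores: Vec<f32> = (0..=pos)
        .map(|j| dot(q_row, &k[j * d..(j + 1) * d]) * scale)
        .collect();
    softmax_in_place(&mut scores);
    let mut out = vec![0.0; d];
    for (j, w) in scores.iter().enumerate() {
        for (o, x) in out.iter_mut().zip(&v[j * d..(j + 1) * d]) {
            *o += w * x;
        }
    }
    out
}

/// Old shape (hypothetical): one attention call per prompt position.
fn prefill_per_token(q: &[f32], k: &[f32], v: &[f32], t: usize, d: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(t * d);
    for pos in 0..t {
        out.extend(attend_one(&q[pos * d..(pos + 1) * d], k, v, pos, d));
    }
    out
}

/// New shape (hypothetical): one pass over the whole prompt.
/// Scores above the diagonal stay at -inf, which is the causal mask.
fn prefill_batched(q: &[f32], k: &[f32], v: &[f32], t: usize, d: usize) -> Vec<f32> {
    if t == 0 {
        return Vec::new();
    }
    let scale = 1.0 / (d as f32).sqrt();
    let mut scores = vec![f32::NEG_INFINITY; t * t];
    for i in 0..t {
        for j in 0..=i {
            scores[i * t + j] = dot(&q[i * d..(i + 1) * d], &k[j * d..(j + 1) * d]) * scale;
        }
    }
    // Row-wise softmax; masked entries contribute weight 0.
    for row in scores.chunks_mut(t) {
        softmax_in_place(row);
    }
    // out = scores (t x t) times v (t x d).
    let mut out = vec![0.0; t * d];
    for i in 0..t {
        for j in 0..t {
            let w = scores[i * t + j];
            for c in 0..d {
                out[i * d + c] += w * v[j * d + c];
            }
        }
    }
    out
}
```

Both functions should agree to within floating-point tolerance on the same inputs. On a GPU the batched form maps to a small fixed number of kernels for the whole prompt, where the per-token form needs at least one launch per position.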

Why

After the chunked-prefill fix, the remaining per-token cost in prefill was the
attention loop and the elementwise ops running once per position. This collapses both
into batched passes over the prompt.
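
To make the dispatch-count argument concrete, here is a minimal sketch with a mock device that only counts launches. `MockDevice`, `add_in_place`, and the choice of a residual add as the representative elementwise op are all hypothetical; the real backend and the set of ops batched in this PR are not shown here. Row width `d` is assumed to be nonzero.

```rust
/// Hypothetical stand-in for a GPU queue: it runs on the CPU and
/// counts how many "kernel launches" were issued.
struct MockDevice {
    dispatches: usize,
}

impl MockDevice {
    fn new() -> Self {
        Self { dispatches: 0 }
    }

    /// One launch: elementwise `dst += src` over whatever slice is passed.
    fn add_in_place(&mut self, dst: &mut [f32], src: &[f32]) {
        assert_eq!(dst.len(), src.len());
        self.dispatches += 1;
        for (d, s) in dst.iter_mut().zip(src) {
            *d += *s;
        }
    }
}

/// Per-token shape: one launch per prompt position (row of width `d`).
fn residual_add_per_token(dev: &mut MockDevice, hidden: &mut [f32], delta: &[f32], d: usize) {
    for (h, x) in hidden.chunks_mut(d).zip(delta.chunks(d)) {
        dev.add_in_place(h, x);
    }
}

/// Batched shape: one launch over the whole `[t, d]` buffer.
fn residual_add_batched(dev: &mut MockDevice, hidden: &mut [f32], delta: &[f32]) {
    dev.add_in_place(hidden, delta);
}
```

For a prompt of `t` tokens the per-token version issues `t` launches and the batched version issues one, with identical results, since an elementwise op does not care how the buffer is partitioned.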

Result

Prefill gets an additional speedup on top of PR1. No figure is quoted here: the
cumulative interleaved A/B is tracked internally, and a fresh same-process A/B will be
attached before merge rather than a number I can't reproduce in-session. The decode
path is unchanged, and prefill argmax parity is preserved across runs.
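
For readers unfamiliar with the term, an argmax parity check compares the greedy token chosen from the final-position logits of the old and new prefill paths, rather than requiring bit-identical floats. A minimal sketch, assuming finite logits and hypothetical helper names (the actual test harness is not part of this description):

```rust
/// Index of the largest logit, first occurrence on ties.
/// Returns None for an empty slice instead of panicking.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in logits.iter().enumerate() {
        if best.map_or(true, |(_, b)| x > b) {
            best = Some((i, x));
        }
    }
    best.map(|(i, _)| i)
}

/// True when both logit vectors have the same length and pick the same token.
fn argmax_parity(reference: &[f32], candidate: &[f32]) -> bool {
    reference.len() == candidate.len() && argmax(reference) == argmax(candidate)
}
```

This tolerates the small floating-point differences that a changed reduction order can introduce, while still catching any change that would alter greedy output.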

Notes

  • Removes an unnecessary unsafe in the attention fallback.
  • No new crates; no library unwrap().
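
The description does not show which `unsafe` was removed, so the following is only a generic sketch of the pattern these two notes point at: a bounds-checked row view that returns `Option` in place of an unchecked read, with the error propagated rather than unwrapped. `row` and `row_sum` are hypothetical names, not functions from this codebase.

```rust
/// Safe view of row `index` in a row-major buffer with rows of `width` floats.
/// Returns None on overflow or out-of-bounds instead of reading past the end.
fn row(buf: &[f32], index: usize, width: usize) -> Option<&[f32]> {
    let start = index.checked_mul(width)?;
    let end = start.checked_add(width)?;
    buf.get(start..end)
}

/// Library-style caller: surfaces the failure as an error, no unwrap().
fn row_sum(buf: &[f32], index: usize, width: usize) -> Result<f32, String> {
    let r = row(buf, index, width).ok_or_else(|| format!("row {index} out of bounds"))?;
    Ok(r.iter().sum())
}
```

Where the bounds check is not in a hot inner loop, this kind of safe slicing usually costs little and removes the need to justify an `unsafe` block.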

Replaces the per-token decode-attention loop in prefill with a single batched
causal attention pass, and batches the elementwise prefill ops to eliminate
per-token GPU dispatches. Removes an unnecessary unsafe in the attn fallback.
Decode unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>