bench(inference): agentic workload (1000-ctx/100-resp) prefill measurement by ohdearquant · Pull Request #187 · ohdearquant/lattice

ohdearquant · 2026-06-03T01:56:49Z

Reproducible scaffolding for #185 (prefill unbatched) and #186 (agentic bench tracking).

What

bench_decode_ab: new BENCH_PROMPT_TOKENS env pads the prompt to a target token count, so the real Metal e2e path can be measured at arbitrary context depth (it was fixed at ~20 tokens).
scripts/bench_compare_1k.py: lattice (via bench_decode_ab) vs ollama (/api/generate) vs MLX (mlx_lm) at 1000-ctx / 100-resp. Reports TTFT, decode, total, prefill & decode tok/s.
docs/bench_results/agentic_1k_compare.json: this session's raw numbers.
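For illustration, the prompt-padding idea could be sketched as below. This is a minimal sketch, not the code in bench_decode_ab (which is Rust and is not shown in this PR description): `pad_prompt`, `target_from_env`, and the token IDs are hypothetical names and values, and the sketch assumes padding is done on token IDs by cycling a filler sequence. The real bench reported 1009 tokens for a 1000-token target, so it presumably pads differently (for example, on text before tokenizing).

```python
import os


def pad_prompt(base_tokens, target, filler_tokens):
    """Append filler tokens until the prompt holds `target` tokens.

    Hypothetical helper: never truncates, so a prompt that is already
    at or above the target is returned unchanged.
    """
    if target <= len(base_tokens):
        return list(base_tokens)
    if not filler_tokens:
        raise ValueError("filler_tokens must not be empty")
    out = list(base_tokens)
    i = 0
    while len(out) < target:
        # Cycle through the filler sequence so padding is deterministic.
        out.append(filler_tokens[i % len(filler_tokens)])
        i += 1
    return out


def target_from_env(default=None):
    """Read the target token count from BENCH_PROMPT_TOKENS, if set."""
    raw = os.environ.get("BENCH_PROMPT_TOKENS")
    return int(raw) if raw else default


# Made-up token IDs, for illustration only.
base = [101, 7592, 2088]
padded = pad_prompt(base, 10, [0, 1])
```

Keeping the original prompt as a prefix and padding deterministically means repeated runs at the same target measure the same input.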

Why

Our benches top out at 256-token context and measure decode only, so they miss the workload that actually matters for agents (long context, short response). This harness surfaces the gap:

Engine       Ctx  TTFT(ms)  Decode(ms)  Total(ms)  Prefill t/s  Decode t/s
lattice     1009    10207        1513      11720           99          66
ollama      1450      480        1194       1674         3022          84
mlx         1000      384         406        789         2608         246

Lattice prefills one token at a time, at 99 tok/s. At ~1000 tokens of context that yields a 10.2 s TTFT, which is 87% of total latency and makes lattice ~7x slower than ollama end-to-end. Full analysis in #185.
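The headline figures can be re-derived from the table rows. The sketch below assumes prefill tok/s is context tokens divided by TTFT and decode tok/s is the 100 response tokens divided by decode time; the PR does not state the formulas, but these reproduce the reported columns to within rounding (the MLX prefill figure comes out at about 2604 rather than 2608, presumably because the table's millisecond values are themselves rounded). The `Row` type and function names are illustrative.

```python
from dataclasses import dataclass

RESP_TOKENS = 100  # response length of the 1000-ctx / 100-resp workload


@dataclass
class Row:
    engine: str
    ctx: int          # prompt tokens actually ingested
    ttft_ms: float    # time to first token
    decode_ms: float  # time spent generating the response
    total_ms: float   # end-to-end latency


# Values copied from the table above.
ROWS = {
    "lattice": Row("lattice", 1009, 10207, 1513, 11720),
    "ollama": Row("ollama", 1450, 480, 1194, 1674),
    "mlx": Row("mlx", 1000, 384, 406, 789),
}


def prefill_tps(r):
    """Assumed definition: prompt tokens per second of TTFT."""
    return r.ctx / (r.ttft_ms / 1000)


def decode_tps(r):
    """Assumed definition: response tokens per second of decode time."""
    return RESP_TOKENS / (r.decode_ms / 1000)


def ttft_share(r):
    """Fraction of end-to-end latency spent before the first token."""
    return r.ttft_ms / r.total_ms


def slowdown(a, b):
    """How many times slower engine `a` is than `b`, end to end."""
    return a.total_ms / b.total_ms


lattice = ROWS["lattice"]
summary = {
    "prefill_tps": round(prefill_tps(lattice)),
    "decode_tps": round(decode_tps(lattice)),
    "ttft_share_pct": round(ttft_share(lattice) * 100),
    "vs_ollama": round(slowdown(lattice, ROWS["ollama"])),
    "vs_mlx": round(slowdown(lattice, ROWS["mlx"])),
}
```

With the table's numbers, `summary` gives 99 tok/s prefill, 66 tok/s decode, an 87% TTFT share, and slowdowns of about 7x against ollama and 15x against MLX.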

Test

cargo build --release -p lattice-inference --bin bench_decode_ab --features "f16,metal-gpu"
uv run python scripts/bench_compare_1k.py

clippy clean on the bench change; pre-commit (check + doc-lint) passed.

Refs #185, #186

🤖 Generated with Claude Code

…ement

Adds BENCH_PROMPT_TOKENS prompt-padding to bench_decode_ab so the Metal e2e path can be measured at arbitrary context depth, plus a lattice-vs-ollama-vs-MLX driver for the 1000-token-context / 100-token-response agentic workload.

Surfaces the prefill gap (#185): lattice ingests the prompt one token at a time (99 tok/s prefill, 10.2s TTFT at 1000 ctx), making end-to-end ~7x slower than ollama and ~15x slower than MLX at this workload. Existing benches top out at 256-token context and miss it (#186).

Refs #185, #186

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ohdearquant wants to merge 1 commit into main from perf/agentic-prefill-bench
