bench(inference): agentic workload (1000-ctx/100-resp) prefill measurement #187

Open
ohdearquant wants to merge 1 commit into main from perf/agentic-prefill-bench
@ohdearquant

Owner

Reproducible scaffolding for #185 (prefill unbatched) and #186 (agentic bench tracking).

What

  • bench_decode_ab: a new BENCH_PROMPT_TOKENS environment variable pads the prompt to a target token count, so the real Metal e2e path can be measured at arbitrary context depth (previously the prompt was fixed at ~20 tokens).
  • scripts/bench_compare_1k.py: compares lattice (via bench_decode_ab), ollama (/api/generate), and MLX (mlx_lm) at 1000-ctx / 100-resp. Reports TTFT, decode time, total time, and prefill and decode tok/s.
  • docs/bench_results/agentic_1k_compare.json: this session's raw numbers.
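The padding idea in the first bullet can be sketched as follows. This is a minimal illustration, not the code in bench_decode_ab (which is Rust): `pad_prompt`, `whitespace_token_count`, the filler word, and the fallback value of 1000 are all hypothetical, and the whitespace tokenizer stands in for the model's real one.

```python
import os


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def pad_prompt(base, target_tokens, count_tokens=whitespace_token_count, filler=" context"):
    """Append filler text until the prompt reaches at least target_tokens tokens.

    A prompt that is already long enough is returned unchanged.
    """
    prompt = base
    while count_tokens(prompt) < target_tokens:
        prompt += filler
    return prompt


# Read the target from the environment; the fallback of 1000 is only for this demo.
target = int(os.environ.get("BENCH_PROMPT_TOKENS", "1000"))
prompt = pad_prompt("Summarize the following notes:", target)
print(whitespace_token_count(prompt), "tokens after padding")
```

With a real subword tokenizer one filler chunk can add more than one token, so the loop may overshoot the target by a few tokens rather than land on it exactly.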

Why

Our existing benches top out at 256-token context and measure decode only, so they miss the workload that actually matters for agents (long context, short response). This harness surfaces the gap:

Engine       Ctx  TTFT(ms)  Decode(ms)  Total(ms)  Prefill t/s  Decode t/s
lattice     1009    10207        1513      11720           99          66
ollama      1450      480        1194       1674         3022          84
mlx         1000      384         406        789         2608         246

Lattice prefills one token at a time, at 99 tok/s, so TTFT at ~1000 tokens of context is 10.2 s. That is 87% of total latency and makes lattice ~7x slower than ollama end-to-end (11720 ms vs 1674 ms). Full analysis in #185.
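As a sanity check, the derived figures can be recomputed from the table's millisecond columns. The sketch below assumes each engine generated exactly 100 response tokens and that TTFT is entirely prefill time; neither assumption is confirmed by the harness itself, and the names `rows`, `derive`, and `slowdown` are illustrative.

```python
RESP_TOKENS = 100  # assumption: every engine produced exactly 100 tokens

# (ctx tokens, TTFT ms, decode ms, total ms), copied from the table above
rows = {
    "lattice": (1009, 10207, 1513, 11720),
    "ollama": (1450, 480, 1194, 1674),
    "mlx": (1000, 384, 406, 789),
}


def derive(ctx, ttft_ms, decode_ms, total_ms):
    """Derive throughput and TTFT share from one table row."""
    return {
        "prefill_tps": ctx / (ttft_ms / 1000),          # treats TTFT as pure prefill
        "decode_tps": RESP_TOKENS / (decode_ms / 1000),
        "ttft_share": ttft_ms / total_ms,
    }


metrics = {name: derive(*row) for name, row in rows.items()}

# End-to-end slowdown of lattice relative to each other engine
lattice_total = rows["lattice"][3]
slowdown = {
    name: lattice_total / row[3] for name, row in rows.items() if name != "lattice"
}

for name, m in metrics.items():
    print(
        f"{name:8s} prefill {m['prefill_tps']:6.0f} tok/s  "
        f"decode {m['decode_tps']:4.0f} tok/s  TTFT share {m['ttft_share']:.0%}"
    )
for name, ratio in slowdown.items():
    print(f"lattice is {ratio:.1f}x slower than {name} end-to-end")
```

Because the table's timings are rounded to whole milliseconds, the recomputed prefill rates for ollama and MLX come out a few tok/s below the reported 3022 and 2608; the lattice figures (99 tok/s, 66 tok/s, 87%) and the ~7x and ~15x ratios match.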

Test

cargo build --release -p lattice-inference --bin bench_decode_ab --features "f16,metal-gpu"
uv run python scripts/bench_compare_1k.py

clippy clean on the bench change; pre-commit (check + doc-lint) passed.

Refs #185, #186

🤖 Generated with Claude Code

…ement

Adds BENCH_PROMPT_TOKENS prompt-padding to bench_decode_ab so the Metal
e2e path can be measured at arbitrary context depth, plus a lattice-vs-
ollama-vs-MLX driver for the 1000-token-context / 100-token-response
agentic workload.

Surfaces the prefill gap (#185): lattice ingests the prompt one token at
a time (99 tok/s prefill, 10.2s TTFT at 1000 ctx), making end-to-end ~7x
slower than ollama and ~15x slower than MLX at this workload. Existing
benches top out at 256-token context and miss it (#186).

Refs #185, #186

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>