What
Promote this session's ad-hoc agentic comparison into a tracked benchmark. Long-context + short-response is the workload that exposes the prefill gap (#185); our existing benches measure decode-only at short context (context_scaling.tsv tops out at 256 tokens) and miss it entirely.
Deliverables
- scripts/bench_compare_1k.py (lattice via bench_decode_ab, ollama via /api/generate, MLX via mlx_lm): reports TTFT / decode / total, plus prefill and decode tok/s
- BENCH_PROMPT_TOKENS prompt-padding knob in bench_decode_ab (added this session), so lattice can be measured at arbitrary context depth
- docs/bench_results/agentic_*.json (MLX rows already exist in agentic_workload.json)
Why separate from #185
#185 is the fix (batched prefill). This issue is the measurement that proves the fix worked: without a tracked agentic bench, a prefill improvement won't show up in CI (the regression gate only watches short-context decode slope).
Current numbers (baseline to beat)

| Engine | Ctx (tokens) | TTFT (ms) | Decode (ms) | Total (ms) | Prefill (tok/s) | Decode (tok/s) |
|---|---|---|---|---|---|---|
| lattice | 1009 | 10207 | 1513 | 11720 | 99 | 66 |
| ollama | 1450 | 480 | 1194 | 1674 | 3022 | 84 |
| mlx | 1000 | 384 | 406 | 789 | 2608 | 246 |
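For anyone reproducing the derived columns, here is a minimal sketch of how the throughput figures in the table above appear to follow from the raw timings. It is not the actual scripts/bench_compare_1k.py. The `Run` class and `ASSUMED_NEW_TOKENS` are hypothetical names, TTFT is treated as prefill time, and the response length of 100 tokens is an assumption inferred from the table (decode ms × decode tok/s ≈ 100 for every row), not something the issue states.

```python
from dataclasses import dataclass

# Assumption: each run generated about 100 tokens. The issue does not state
# this; it is inferred from decode time x decode tok/s in the table.
ASSUMED_NEW_TOKENS = 100


@dataclass(frozen=True)
class Run:
    """One benchmark row (hypothetical container, not the real script's type)."""
    engine: str
    ctx_tokens: int    # prompt length in tokens
    ttft_ms: float     # time to first token, treated here as prefill time
    decode_ms: float   # time spent generating the response
    new_tokens: int = ASSUMED_NEW_TOKENS

    @property
    def total_ms(self) -> float:
        return self.ttft_ms + self.decode_ms

    @property
    def prefill_tps(self) -> float:
        # Prompt tokens processed per second before the first output token.
        return self.ctx_tokens / (self.ttft_ms / 1000.0)

    @property
    def decode_tps(self) -> float:
        # Generated tokens per second once decoding has started.
        return self.new_tokens / (self.decode_ms / 1000.0)

    @property
    def prefill_share(self) -> float:
        # Fraction of wall time spent before the first token.
        return self.ttft_ms / self.total_ms


# Raw timings copied from the table above.
RUNS = [
    Run("lattice", 1009, 10207, 1513),
    Run("ollama", 1450, 480, 1194),
    Run("mlx", 1000, 384, 406),
]

for r in RUNS:
    print(
        f"{r.engine:8s} total={r.total_ms:6.0f} ms  "
        f"prefill={r.prefill_tps:5.0f} tok/s  "
        f"decode={r.decode_tps:4.0f} tok/s  "
        f"prefill share={r.prefill_share:.0%}"
    )
```

The prefill share needs only TTFT and total, so it does not depend on the assumed response length. By that measure lattice spends roughly 87% of its wall time in prefill, against roughly 29% for ollama and 49% for mlx. Short-context decode benches cannot see that imbalance. Recomputed prefill rates may differ from the table by a few tok/s, presumably because the table was derived from unrounded timings.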