mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3) by TrevorS · Pull Request #4 · TrevorS/ds4

TrevorS · 2026-05-23T22:25:54Z

PR4 draft: mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3)

Summary

Two new commits stacked on mtp-beats-plain-kernels-v2 (PR #3). Diagnostics and env-knob refactor for the MTP path — no default-behavior changes, no kernels touched.

fd7b9ff — mtp: instrument speculative-decode accept rate (DS4_MTP_ACCEPT_REPORT)
- Adds per-session counters: spec iters, drafts proposed, drafts accepted.
- Gated by env var DS4_MTP_ACCEPT_REPORT=1; default-off.
- Emits a single-line report on session teardown: mtp accept: iters=N proposed=M accepted=K rate=X.X% per_iter=Y.YY.
- Same shape as existing DS4_MTP_PROBE diagnostic.
d3513a6 — mtp: subsume mtp: make speculation disable skip draft work antirez/ds4#206 — DS4_MTP_SPEC_DISABLE honors no-draft semantics
- Refactors DS4_MTP_SPEC_DISABLE env knob to behave consistently across CLI entry points.
- DS4_MTP_SPEC_DISABLE=1 --mtp-draft 2 now produces output byte-identical to --mtp-draft 1 (the canonical no-draft path).
- Subsumes upstream issue mtp: make speculation disable skip draft work antirez/ds4#206.

What's NOT in this PR (deferred)

mtp: drop redundant end_commands sync before tensor_read in draft eval (2ed0134 from downstream) was evaluated and dropped. The commit message claims +1.1% MTP gen_tps on its author's hardware (likely Metal). On GB10 the elided cudaDeviceSynchronize is absorbed by the immediately-following synchronous cudaMemcpy D2H — the delta is below the noise floor (5-run mean shift of -0.04% to +0.14% across regimes, vs noise of ±0.3%). Decided not to ship a commit whose perf claim doesn't reproduce on the target hardware.
Test-vector fixture refresh (tests/test-vectors/official.vec) — intended as a 3rd commit to fix the long-pending --logprob-vectors short_code_completion failure (pre-existing on upstream/main). Requires running fetch_official_vectors.py against the DeepSeek API. The fetcher needs to be run interactively; left as a follow-up.

Tested against

make clean && make cuda-spark — clean, no new warnings
./ds4_test --all — only pre-existing logprob-vectors short_code_completion failure (also on upstream/main). Tensor-equivalence summary: capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 across all 5 cases.
make cuda-regression — pre-existing build error (also on upstream/main), unchanged
make cpu — clean build
DEFAULT path byte-identical to PR cuda: revive 2 dropped kernels with FMA-contraction fixes (stacked on #2) #3 for both plain (-p "knight" -n 64 --temp 0) and MTP-strict (DS4_MTP_BATCH_VERIFY=1 DS4_MTP_STRICT=1 --mtp-draft 2)
ENV-ON DS4_MTP_ACCEPT_REPORT=1 emits sensible counters (verified: iters=11 proposed=22 accepted=14 rate=63.6%)
ENV-ON DS4_MTP_SPEC_DISABLE=1 --mtp-draft 2 output byte-identical to --mtp-draft 1 (canonical no-draft)
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf (80.76 GiB)
MTP: DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf (3.5 GiB)

Notes

Two env knobs added, both diagnostic per AGENT.md: DS4_MTP_ACCEPT_REPORT (counter readout), and the existing DS4_MTP_SPEC_DISABLE semantics aligned to no-draft canonical.
Zero kernel changes. Zero default-path drift.
Useful for measuring spec-decode quality during MTP tuning work (see also the broader perf investigation in mtp-beats-plain-kernels-v4 if/when it lands).

Out of scope / follow-ups

Test-vector fixture refresh (above)
2ed0134 end_commands sync drop (above)
Batched-N kernel divergences — separate PR (mtp-beats-plain-kernels-v4)
Combined-forward MTP strict mode — needs the batched-N divergence fixes first
Captured-graph spec decode — separate subsystem

Adds counters and a session-free report for MTP spec-decode acceptance, gated by the DS4_MTP_ACCEPT_REPORT env var. Session counters ---------------- ds4_session gains four uint64_t fields: mtp_spec_iters total spec-decode entries with draft_n proposed mtp_spec_drafts_proposed sum of draft_n across iters mtp_spec_drafts_accepted sum of drafts committed via the verifier (the existing mtp_probe_total / mtp_probe_hit pair stays untouched -- they count pre-spec MTP-draft probes, a different signal) Instrumentation sites --------------------- ds4_session_eval_speculative_argmax has one canonical drafting block at the top followed by ~10 distinct accept paths (margin-skip, N=2 exact decode + partial, micro-batch verifier full + prefix-1 + sequential, fallback sequential). Counters are bumped: - iters & proposed: once, right after the batched MTP-draft primitive returns a non-empty draft_n; - accepted: at every `accepted[n_accept++] = drafts[X];` site via the file-local DS4_MTP_RECORD_ACCEPT() macro (10 sites). The combined-forward helper (ds4_session_eval_speculative_argmax_combined, opt-in via DS4_MTP_COMBINED_FORWARD) is intentionally NOT instrumented; its accept accounting belongs to the canonical iters counter only when fall-through to canonical happens, and the opt-in path is not on by default. Report ------ ds4_session_free prints one line on teardown when iters>0 and DS4_MTP_ACCEPT_REPORT is set: ds4: mtp accept: iters=N proposed=P accepted=A rate=R.R% per_iter=X.XX Where rate = A/P, per_iter = A/N. Both are useful: rate compares to spark's bench output (82.8% at n=128); per_iter shows mean drafts committed per spec call (1.0 = perfectly utilizing K=1, 2.0 = perfect K=2, etc.). Initial measurement ------------------- On ds4flash.gguf + MTP-Q4K + n=128 + --mtp-draft 2 (default --mtp-draft 1 short-circuits the spec path entirely at line `e->mtp_draft_tokens <= 1`, so accept-rate measurement requires explicit K>=2): ds4: mtp accept: iters=44 proposed=88 accepted=55 rate=62.5% per_iter=1.25 Mainline's K=2 accept rate is 62.5% on the same prompt where spark's bench reports 82.8%. That confirms a draft-quality gap. Important side observation -------------------------- At --mtp-draft 2, mainline's generation rate is 4.23 t/s -- a 3.8x slowdown vs plain decode (16.1 t/s). The accept-rate gap (62.5% vs 82.8%) doesn't explain that magnitude on its own; the K=2 verifier path itself has substantial overhead that needs separate investigation (scout 3 prior report flagged inherited compression/indexing kernels in the MTP block as one candidate). Logged here, not fixed. Scope ----- ds4.c only. +60/-9 LOC (4 session fields + 11 macro/counter inserts + ~15 lines of report code). No header changes. Default behavior unchanged when DS4_MTP_ACCEPT_REPORT is unset. Tests: ds4_test passes (long-context, tool-call-quality, metal-kernels, server); pre-existing logprob-vectors failure unchanged. NO github push. jj change vzqyyvuu -> lquvymqk.

…tics

TrevorS mentioned this pull request May 24, 2026

cuda: DS4_CUDA_STRICT_BATCHED — bit-equal batched-N infrastructure (stacked on #4) #5

Open

TrevorS added 2 commits May 24, 2026 10:03

mtp: subsume antirez#206 — DS4_MTP_SPEC_DISABLE honors no-draft seman…

ed98f3e

…tics

TrevorS force-pushed the mtp-beats-plain-kernels-v2 branch from af16691 to 65d8182 Compare May 24, 2026 17:13

TrevorS force-pushed the mtp-beats-plain-kernels-v3 branch from d3513a6 to ed98f3e Compare May 24, 2026 17:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3)#4

mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3)#4
TrevorS wants to merge 2 commits into
mtp-beats-plain-kernels-v2from
mtp-beats-plain-kernels-v3

TrevorS commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TrevorS commented May 23, 2026

PR4 draft: mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3)

Summary

What's NOT in this PR (deferred)

Tested against

Notes

Out of scope / follow-ups

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant