mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3)#4
Open
TrevorS wants to merge 2 commits into
Open
mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3)#4TrevorS wants to merge 2 commits into
TrevorS wants to merge 2 commits into
Conversation
Adds counters and a session-free report for MTP spec-decode acceptance,
gated by the DS4_MTP_ACCEPT_REPORT env var.
Session counters
----------------
ds4_session gains four uint64_t fields:
mtp_spec_iters total spec-decode entries with draft_n proposed
mtp_spec_drafts_proposed sum of draft_n across iters
mtp_spec_drafts_accepted sum of drafts committed via the verifier
(the existing mtp_probe_total / mtp_probe_hit pair stays untouched --
they count pre-spec MTP-draft probes, a different signal)
Instrumentation sites
---------------------
ds4_session_eval_speculative_argmax has one canonical drafting block at
the top followed by ~10 distinct accept paths (margin-skip, N=2 exact
decode + partial, micro-batch verifier full + prefix-1 + sequential,
fallback sequential). Counters are bumped:
- iters & proposed: once, right after the batched MTP-draft primitive
returns a non-empty draft_n;
- accepted: at every `accepted[n_accept++] = drafts[X];` site via the
file-local DS4_MTP_RECORD_ACCEPT() macro (10 sites).
The combined-forward helper (ds4_session_eval_speculative_argmax_combined,
opt-in via DS4_MTP_COMBINED_FORWARD) is intentionally NOT instrumented;
its accept accounting belongs to the canonical iters counter only when
fall-through to canonical happens, and the opt-in path is not on by
default.
Report
------
ds4_session_free prints one line on teardown when iters>0 and
DS4_MTP_ACCEPT_REPORT is set:
ds4: mtp accept: iters=N proposed=P accepted=A rate=R.R% per_iter=X.XX
Where rate = A/P, per_iter = A/N. Both are useful: rate compares to
spark's bench output (82.8% at n=128); per_iter shows mean drafts
committed per spec call (1.0 = perfectly utilizing K=1, 2.0 = perfect
K=2, etc.).
Initial measurement
-------------------
On ds4flash.gguf + MTP-Q4K + n=128 + --mtp-draft 2 (default --mtp-draft
1 short-circuits the spec path entirely at line `e->mtp_draft_tokens
<= 1`, so accept-rate measurement requires explicit K>=2):
ds4: mtp accept: iters=44 proposed=88 accepted=55 rate=62.5%
per_iter=1.25
Mainline's K=2 accept rate is 62.5% on the same prompt where spark's
bench reports 82.8%. That confirms a draft-quality gap.
Important side observation
--------------------------
At --mtp-draft 2, mainline's generation rate is 4.23 t/s -- a 3.8x
slowdown vs plain decode (16.1 t/s). The accept-rate gap (62.5% vs
82.8%) doesn't explain that magnitude on its own; the K=2 verifier
path itself has substantial overhead that needs separate investigation
(scout 3 prior report flagged inherited compression/indexing kernels
in the MTP block as one candidate). Logged here, not fixed.
Scope
-----
ds4.c only. +60/-9 LOC (4 session fields + 11 macro/counter inserts +
~15 lines of report code). No header changes. Default behavior
unchanged when DS4_MTP_ACCEPT_REPORT is unset.
Tests: ds4_test passes (long-context, tool-call-quality, metal-kernels,
server); pre-existing logprob-vectors failure unchanged.
NO github push. jj change vzqyyvuu -> lquvymqk.
af16691 to
65d8182
Compare
d3513a6 to
ed98f3e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR4 draft: mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3)
Summary
Two new commits stacked on
mtp-beats-plain-kernels-v2(PR #3). Diagnostics and env-knob refactor for the MTP path — no default-behavior changes, no kernels touched.fd7b9ff— mtp: instrument speculative-decode accept rate (DS4_MTP_ACCEPT_REPORT)DS4_MTP_ACCEPT_REPORT=1; default-off.mtp accept: iters=N proposed=M accepted=K rate=X.X% per_iter=Y.YY.DS4_MTP_PROBEdiagnostic.d3513a6— mtp: subsume mtp: make speculation disable skip draft work antirez/ds4#206 — DS4_MTP_SPEC_DISABLE honors no-draft semanticsDS4_MTP_SPEC_DISABLEenv knob to behave consistently across CLI entry points.DS4_MTP_SPEC_DISABLE=1 --mtp-draft 2now produces output byte-identical to--mtp-draft 1(the canonical no-draft path).What's NOT in this PR (deferred)
mtp: drop redundant end_commands sync before tensor_read in draft eval(2ed0134from downstream) was evaluated and dropped. The commit message claims +1.1% MTP gen_tps on its author's hardware (likely Metal). On GB10 the elidedcudaDeviceSynchronizeis absorbed by the immediately-following synchronouscudaMemcpyD2H — the delta is below the noise floor (5-run mean shift of -0.04% to +0.14% across regimes, vs noise of ±0.3%). Decided not to ship a commit whose perf claim doesn't reproduce on the target hardware.tests/test-vectors/official.vec) — intended as a 3rd commit to fix the long-pending--logprob-vectors short_code_completionfailure (pre-existing onupstream/main). Requires runningfetch_official_vectors.pyagainst the DeepSeek API. The fetcher needs to be run interactively; left as a follow-up.Tested against
make clean && make cuda-spark— clean, no new warnings./ds4_test --all— only pre-existinglogprob-vectors short_code_completionfailure (also onupstream/main). Tensor-equivalence summary:capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0across all 5 cases.make cuda-regression— pre-existing build error (also onupstream/main), unchangedmake cpu— clean build-p "knight" -n 64 --temp 0) and MTP-strict (DS4_MTP_BATCH_VERIFY=1 DS4_MTP_STRICT=1 --mtp-draft 2)DS4_MTP_ACCEPT_REPORT=1emits sensible counters (verified:iters=11 proposed=22 accepted=14 rate=63.6%)DS4_MTP_SPEC_DISABLE=1 --mtp-draft 2output byte-identical to--mtp-draft 1(canonical no-draft)DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf(80.76 GiB)DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf(3.5 GiB)Notes
DS4_MTP_ACCEPT_REPORT(counter readout), and the existingDS4_MTP_SPEC_DISABLEsemantics aligned to no-draft canonical.mtp-beats-plain-kernels-v4if/when it lands).Out of scope / follow-ups
2ed0134end_commands sync drop (above)mtp-beats-plain-kernels-v4)