
mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3) #4

Open
TrevorS wants to merge 2 commits into mtp-beats-plain-kernels-v2 from mtp-beats-plain-kernels-v3

@TrevorS
Owner

@TrevorS TrevorS commented May 23, 2026

PR4 draft: mtp: ACCEPT_REPORT instrumentation + SPEC_DISABLE refactor (stacked on #3)

Summary

Two new commits stacked on mtp-beats-plain-kernels-v2 (PR #3). Diagnostics and env-knob refactor for the MTP path — no default-behavior changes, no kernels touched.

  1. fd7b9ff mtp: instrument speculative-decode accept rate (DS4_MTP_ACCEPT_REPORT)

    • Adds per-session counters: spec iters, drafts proposed, drafts accepted.
    • Gated by env var DS4_MTP_ACCEPT_REPORT=1; default-off.
    • Emits a single-line report on session teardown: mtp accept: iters=N proposed=M accepted=K rate=X.X% per_iter=Y.YY.
    • Same shape as existing DS4_MTP_PROBE diagnostic.
  2. d3513a6 mtp: subsume antirez/ds4#206 (mtp: make speculation disable skip draft work): DS4_MTP_SPEC_DISABLE honors no-draft semantics
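
A minimal sketch of the no-draft semantics named in the second commit, assuming the disable knob collapses the requested draft count to 1 before the spec path is considered. The helper names `env_flag_set` and `effective_draft_tokens` are hypothetical and are not taken from ds4.c. The rule for parsing the variable is also an assumption, since this PR does not show it.

```c
#include <stdlib.h>

/* Hypothetical helper: treats any non-empty value as "set".
 * The real parsing rule in ds4.c is not shown in this PR. */
static int env_flag_set(const char *name) {
    const char *v = getenv(name);
    return v != NULL && v[0] != '\0';
}

/* Hypothetical helper: the draft count the decode loop would act on.
 * With speculation disabled, a request for K drafts behaves like K=1,
 * which is the canonical no-draft path. */
static int effective_draft_tokens(int requested, int spec_disabled) {
    if (spec_disabled || requested <= 1) return 1;
    return requested;
}
```

Under this assumption, `effective_draft_tokens(2, env_flag_set("DS4_MTP_SPEC_DISABLE"))` returns 1 when the variable is set, which would match the byte-identical comparison against `--mtp-draft 1` reported under "Tested against".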

What's NOT in this PR (deferred)

  • mtp: drop redundant end_commands sync before tensor_read in draft eval (2ed0134 from downstream) was evaluated and dropped. The commit message claims +1.1% MTP gen_tps on its author's hardware (likely Metal). On GB10 the elided cudaDeviceSynchronize is absorbed by the immediately-following synchronous cudaMemcpy D2H — the delta is below the noise floor (5-run mean shift of -0.04% to +0.14% across regimes, vs noise of ±0.3%). Decided not to ship a commit whose perf claim doesn't reproduce on the target hardware.
  • Test-vector fixture refresh (tests/test-vectors/official.vec) — intended as a 3rd commit to fix the long-pending --logprob-vectors short_code_completion failure (pre-existing on upstream/main). Requires running fetch_official_vectors.py against the DeepSeek API. The fetcher needs to be run interactively; left as a follow-up.

Tested against

  • make clean && make cuda-spark — clean, no new warnings
  • ./ds4_test --all — only pre-existing logprob-vectors short_code_completion failure (also on upstream/main). Tensor-equivalence summary: capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 across all 5 cases.
  • make cuda-regression — pre-existing build error (also on upstream/main), unchanged
  • make cpu — clean build
  • DEFAULT path byte-identical to PR #3 (cuda: revive 2 dropped kernels with FMA-contraction fixes, stacked on #2) for both plain (-p "knight" -n 64 --temp 0) and MTP-strict (DS4_MTP_BATCH_VERIFY=1 DS4_MTP_STRICT=1 --mtp-draft 2)
  • ENV-ON DS4_MTP_ACCEPT_REPORT=1 emits sensible counters (verified: iters=11 proposed=22 accepted=14 rate=63.6%)
  • ENV-ON DS4_MTP_SPEC_DISABLE=1 --mtp-draft 2 output byte-identical to --mtp-draft 1 (canonical no-draft)
  • Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
  • Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf (80.76 GiB)
  • MTP: DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf (3.5 GiB)

Notes

  • Two env knobs touched, both diagnostic per AGENT.md: DS4_MTP_ACCEPT_REPORT (new, counter readout) and the existing DS4_MTP_SPEC_DISABLE (semantics aligned with the canonical no-draft path).
  • Zero kernel changes. Zero default-path drift.
  • Useful for measuring spec-decode quality during MTP tuning work (see also the broader perf investigation in mtp-beats-plain-kernels-v4 if/when it lands).

Out of scope / follow-ups

  • Test-vector fixture refresh (above)
  • 2ed0134 end_commands sync drop (above)
  • Batched-N kernel divergences — separate PR (mtp-beats-plain-kernels-v4)
  • Combined-forward MTP strict mode — needs the batched-N divergence fixes first
  • Captured-graph spec decode — separate subsystem

TrevorS added 2 commits May 24, 2026 10:03
Adds counters and a report printed from ds4_session_free for MTP
spec-decode acceptance, gated by the DS4_MTP_ACCEPT_REPORT env var.

Session counters
----------------
ds4_session gains four uint64_t fields:
  mtp_spec_iters           total spec-decode entries with draft_n proposed
  mtp_spec_drafts_proposed sum of draft_n across iters
  mtp_spec_drafts_accepted sum of drafts committed via the verifier
  (the existing mtp_probe_total / mtp_probe_hit pair stays untouched --
  they count pre-spec MTP-draft probes, a different signal)

Instrumentation sites
---------------------
ds4_session_eval_speculative_argmax has one canonical drafting block at
the top followed by ~10 distinct accept paths (margin-skip, N=2 exact
decode + partial, micro-batch verifier full + prefix-1 + sequential,
fallback sequential).  Counters are bumped:
  - iters & proposed: once, right after the batched MTP-draft primitive
    returns a non-empty draft_n;
  - accepted: at every `accepted[n_accept++] = drafts[X];` site via the
    file-local DS4_MTP_RECORD_ACCEPT() macro (10 sites).

The combined-forward helper (ds4_session_eval_speculative_argmax_combined,
opt-in via DS4_MTP_COMBINED_FORWARD) is intentionally NOT instrumented.
Its accepts reach the counters only when it falls through to the
canonical path, and the opt-in path is off by default.
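
The counting scheme described above can be sketched as follows. This is a simplified illustration, not the code in ds4.c: the struct `spec_counters`, the macro `RECORD_ACCEPT`, and the function `spec_iter` are hypothetical stand-ins, and the single prefix-match loop replaces the ~10 real accept paths.

```c
#include <stdint.h>

/* Hypothetical stand-in for the counters added to ds4_session. */
typedef struct {
    uint64_t mtp_spec_iters;
    uint64_t mtp_spec_drafts_proposed;
    uint64_t mtp_spec_drafts_accepted;
} spec_counters;

/* Assumed shape of the accept macro: one increment per accept site. */
#define RECORD_ACCEPT(c) ((c)->mtp_spec_drafts_accepted++)

/* One simplified spec iteration: accept the longest prefix of drafts
 * that matches the verifier's tokens. Returns the accepted count. */
static int spec_iter(spec_counters *c, const int *drafts,
                     const int *verified, int draft_n, int *accepted) {
    int n_accept = 0;
    if (draft_n <= 0) return 0;   /* empty draft: not counted as an iter */
    c->mtp_spec_iters++;          /* bumped once per non-empty draft */
    c->mtp_spec_drafts_proposed += (uint64_t)draft_n;
    for (int i = 0; i < draft_n; i++) {
        if (drafts[i] != verified[i]) break;
        accepted[n_accept++] = drafts[i];
        RECORD_ACCEPT(c);         /* bumped at every accept site */
    }
    return n_accept;
}
```

Keeping the increment in a macro next to each `accepted[n_accept++]` assignment means a new accept path only needs one extra line to stay counted.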

Report
------
ds4_session_free prints one line on teardown when iters>0 and
DS4_MTP_ACCEPT_REPORT is set:

  ds4: mtp accept: iters=N proposed=P accepted=A rate=R.R% per_iter=X.XX

Where rate = A/P, per_iter = A/N.  Both are useful: rate compares to
spark's bench output (82.8% at n=128); per_iter shows mean drafts
committed per spec call (1.0 = perfectly utilizing K=1, 2.0 = perfect
K=2, etc.).
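
The two ratios and the report line can be sketched as below. The function name `format_accept_report` is hypothetical, and the guard on `proposed == 0` is an added assumption to avoid a division by zero. The commit itself only states the iters>0 condition.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical formatter for the teardown report line.
 * Returns 0 when nothing should be printed, otherwise the
 * length snprintf reports for the formatted line. */
static int format_accept_report(char *buf, size_t cap, uint64_t iters,
                                uint64_t proposed, uint64_t accepted) {
    if (iters == 0 || proposed == 0) return 0;
    /* rate = A/P as a percentage, per_iter = A/N */
    double rate = 100.0 * (double)accepted / (double)proposed;
    double per_iter = (double)accepted / (double)iters;
    return snprintf(buf, cap,
                    "ds4: mtp accept: iters=%llu proposed=%llu "
                    "accepted=%llu rate=%.1f%% per_iter=%.2f",
                    (unsigned long long)iters,
                    (unsigned long long)proposed,
                    (unsigned long long)accepted,
                    rate, per_iter);
}
```

With iters=44, proposed=88, accepted=55 this gives rate=62.5% and per_iter=1.25, since 55/88 = 0.625 and 55/44 = 1.25.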

Initial measurement
-------------------
On ds4flash.gguf + MTP-Q4K + n=128 + --mtp-draft 2 (default --mtp-draft
1 short-circuits the spec path entirely at line `e->mtp_draft_tokens
<= 1`, so accept-rate measurement requires explicit K>=2):

  ds4: mtp accept: iters=44 proposed=88 accepted=55 rate=62.5%
                   per_iter=1.25

Mainline's K=2 accept rate is 62.5% on the same prompt where spark's
bench reports 82.8%.  That confirms a draft-quality gap.

Important side observation
--------------------------
At --mtp-draft 2, mainline's generation rate is 4.23 t/s -- a 3.8x
slowdown vs plain decode (16.1 t/s).  The accept-rate gap (62.5% vs
82.8%) doesn't explain that magnitude on its own; the K=2 verifier
path itself has substantial overhead that needs separate investigation
(scout 3 prior report flagged inherited compression/indexing kernels
in the MTP block as one candidate).  Logged here, not fixed.

Scope
-----
ds4.c only.  +60/-9 LOC (4 session fields + 11 macro/counter inserts +
~15 lines of report code).  No header changes.  Default behavior
unchanged when DS4_MTP_ACCEPT_REPORT is unset.

Tests: ds4_test passes (long-context, tool-call-quality, metal-kernels,
server); pre-existing logprob-vectors failure unchanged.

NO github push.  jj change vzqyyvuu -> lquvymqk.
@TrevorS TrevorS force-pushed the mtp-beats-plain-kernels-v2 branch from af16691 to 65d8182 Compare May 24, 2026 17:13
@TrevorS TrevorS force-pushed the mtp-beats-plain-kernels-v3 branch from d3513a6 to ed98f3e Compare May 24, 2026 17:13