Add native MTP ngram hybrid proposer #875

Draft
i386 wants to merge 1 commit into main from jd/jianyang-drop-prompt-ngram

@i386 i386 commented Jun 18, 2026

Collaborator

Summary

  • add opt-in native MTP + n-gram hybrid proposal support in skippy-server
  • widen native MTP batched VerifySpan only when the n-gram span agrees with the MTP anchor token
  • add telemetry for hybrid anchor availability, n-gram span availability, agreement/disagreement, proposal length, accepted length, and accepted tail length
  • remove skippy-prompt standalone n-gram sidecar CLI/config/launcher/stub plumbing
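The agreement rule in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Token`, `ngram_span`, and `hybrid_proposal` are hypothetical names, the n-gram lookup is a simple backward scan over the token history, and the assumption that the proposal cap includes the anchor token is mine.

```rust
/// Token id type used only for this sketch (hypothetical).
type Token = u32;

/// Hypothetical n-gram lookup: find the most recent earlier occurrence of the
/// last `ngram_size` tokens of `history` and return up to `max_tokens` tokens
/// that followed that occurrence.
fn ngram_span(history: &[Token], ngram_size: usize, max_tokens: usize) -> Vec<Token> {
    if ngram_size == 0 || history.len() <= ngram_size {
        return Vec::new();
    }
    let suffix_start = history.len() - ngram_size;
    let suffix = &history[suffix_start..];
    // Scan earlier windows from newest to oldest; the suffix itself is excluded.
    for start in (0..suffix_start).rev() {
        if &history[start..start + ngram_size] == suffix {
            let follow = start + ngram_size;
            let end = (follow + max_tokens).min(history.len());
            return history[follow..end].to_vec();
        }
    }
    Vec::new()
}

/// Hypothetical hybrid rule: MTP always supplies the anchor token. The n-gram
/// span only widens the proposal when its first token equals the anchor.
/// Assumption: `max_tokens` counts the anchor as part of the proposal.
fn hybrid_proposal(anchor: Option<Token>, span: &[Token], max_tokens: usize) -> Vec<Token> {
    // No anchor means no proposal: there is no n-gram-only fallback.
    let Some(anchor) = anchor else {
        return Vec::new();
    };
    match span.first() {
        // Agreement: propose the n-gram span, capped at the configured width.
        Some(&first) if first == anchor => {
            span.iter().copied().take(max_tokens.max(1)).collect()
        }
        // Disagreement or no span: fall back to the MTP anchor alone.
        _ => vec![anchor],
    }
}
```

With a history of `[1, 2, 3, 4, 9, 1, 2, 3]` and `ngram_size = 3`, the lookup finds the earlier `[1, 2, 3]` and returns the tokens that followed it. The span is then used only if its first token matches the MTP anchor.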

Impact

  • Existing native MTP behavior remains unchanged unless SKIPPY_NATIVE_MTP_NGRAM_HYBRID=1 is enabled.
  • Hybrid mode has no n-gram-only fallback: MTP still supplies the anchor token, and n-gram only extends the span when it agrees.
  • llama.cpp patch surface is unchanged.
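As an illustration of the opt-in behavior, the sketch below reads the three environment variables named in this PR into a small config struct. The struct, the function names, and the fallback values of 4 are assumptions for the example (the defaults in skippy-server are not stated here), and treating only the exact value `1` as enabled is likewise an assumption.

```rust
/// Hypothetical config holder for the hybrid proposer settings.
#[derive(Debug, Clone, PartialEq, Eq)]
struct HybridConfig {
    enabled: bool,
    ngram_size: usize,
    max_proposal_tokens: usize,
}

/// Build the config from any key/value lookup so the parsing can be tested
/// without touching the process environment.
fn hybrid_config_from(lookup: impl Fn(&str) -> Option<String>) -> HybridConfig {
    // Parse a numeric variable, falling back to `default` when it is unset
    // or not a valid unsigned integer. The defaults are assumptions.
    let parse = |name: &str, default: usize| {
        lookup(name)
            .and_then(|v| v.trim().parse::<usize>().ok())
            .unwrap_or(default)
    };
    HybridConfig {
        // Hybrid mode stays off unless the flag is exactly "1".
        enabled: lookup("SKIPPY_NATIVE_MTP_NGRAM_HYBRID").as_deref() == Some("1"),
        ngram_size: parse("SKIPPY_NATIVE_MTP_NGRAM_SIZE", 4),
        max_proposal_tokens: parse("SKIPPY_NATIVE_MTP_NGRAM_MAX_PROPOSAL_TOKENS", 4),
    }
}

/// Convenience wrapper that reads the real process environment.
fn hybrid_config_from_env() -> HybridConfig {
    hybrid_config_from(|name| std::env::var(name).ok())
}
```

With nothing set, `enabled` is `false`, which matches the statement that existing native MTP behavior is unchanged by default.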

Fresh Lab Benchmark

Dataset: nvidia/SPEED-Bench, qualitative, coding,reasoning, LIMIT=4 per category, OSL=512, 8 requests / 9 turns per condition. The golden target is vanilla llama.cpp with no MTP.

Important lab note: studio54 direct reads from /Volumes/External/.../GLM-4.7-Flash-MTP-Q4_K_M.gguf were hanging during this run. I materialized the same GGUF to /Users/jdumay/tmp/skippy-models/GLM-4.7-Flash-MTP-Q4_K_M.gguf for stage0/vanilla. micstudio stage1 used the existing NFS path successfully.

| Condition | Decode tok/s | vs golden | Avg latency | Completion tokens | Drafted/proposed | Accepted | Accepted tail | Avg proposal | Trim overhead |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vanilla llama.cpp, no MTP | 43.44 | 1.000x | 10.401s | 3517 | 0 | 0 | n/a | n/a | n/a |
| vanilla llama.cpp, MTP n=1 | 59.19 | 1.363x | 9.227s | 4246 | 2264 | 1969 | n/a | n/a | n/a |
| Skippy 2-stage, no MTP | 32.35 | 0.745x | 17.961s | 4525 | 0 | 0 | n/a | n/a | 0.0ms |
| Skippy 2-stage, MTP n=1 batched | 37.89 | 0.872x | 14.349s | 4238 | 2127 | 1499 | 0 | 1.00 | 0.0ms |
| Skippy hybrid ngram_size=4,max=2 | 41.88 | 0.964x | 14.119s | 4608 | 2685 proposed | 1319 | 701 | 1.39 | 0.0ms |
| Skippy hybrid ngram_size=4,max=4 | 44.25 | 1.019x | 13.382s | 4608 | 2952 proposed | 1211 | 1048 | 1.69 | 0.0ms |
| Skippy hybrid ngram_size=4,max=8 | 44.07 | 1.015x | 13.430s | 4608 | 3253 proposed | 1197 | 1126 | 1.92 | 0.0ms |
| Skippy hybrid ngram_size=8,max=4 | 40.00 | 0.921x | 13.657s | 4238 | 2554 proposed | 1111 | 711 | 1.47 | 0.0ms |
| Skippy hybrid ngram_size=16,max=4 | 40.46 | 0.932x | 13.466s | 4238 | 2479 proposed | 1189 | 663 | 1.40 | 0.0ms |

Winning row: SKIPPY_NATIVE_MTP_NGRAM_HYBRID=1 SKIPPY_NATIVE_MTP_NGRAM_SIZE=4 SKIPPY_NATIVE_MTP_NGRAM_MAX_PROPOSAL_TOKENS=4.

Observations:

  • The hybrid row is the first 2-stage Skippy result in this experiment to slightly clear the vanilla no-MTP golden target on decode tok/s: 44.25 vs 43.44 tok/s.
  • The win comes from amortizing stage latency with wider VerifySpan proposals: stage1 saw avg 2.69 tokens per VerifySpan and max 5 for the winning row.
  • Tail acceptance was strong for the winning row: 1048/1209 tail tokens accepted, 86.7% diagnostic tail acceptance.
  • max=8 widened proposals further, but its lower tail acceptance and higher stage1 compute left it slightly behind max=4.
  • Anchor agreement is the limiting signal: 775/1006, 77.0%, so the n-gram extension helps when it agrees but still often falls back to the MTP anchor only.
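The ratios quoted in the observations can be rechecked from the raw counts. The helpers below are a small sketch with hypothetical names. The inputs come from the table and bullets above: 44.25 and 43.44 tok/s, 1048 of 1209 tail tokens, and 775 of 1006 anchor agreements.

```rust
/// Share of `part` in `whole`, expressed as a percentage.
fn percent(part: u32, whole: u32) -> f64 {
    100.0 * f64::from(part) / f64::from(whole)
}

/// Decode throughput relative to the golden (vanilla no-MTP) baseline.
fn vs_golden(decode_tok_s: f64, golden_tok_s: f64) -> f64 {
    decode_tok_s / golden_tok_s
}

/// Values for the winning row (ngram_size=4, max=4), taken from the PR text.
fn winning_row_ratios() -> (f64, f64, f64) {
    let speedup = vs_golden(44.25, 43.44); // reported as 1.019x
    let tail_acceptance = percent(1048, 1209); // reported as 86.7%
    let anchor_agreement = percent(775, 1006); // reported as 77.0%
    (speedup, tail_acceptance, anchor_agreement)
}
```

Rounded to the precision used in the PR, these reproduce the 1.019x, 86.7%, and 77.0% figures.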

Artifacts:

  • vanilla: /Users/jdumay/code/lab-experiments/jianyang/benchmarks/speed-bench/results/pr875-vanilla-internal-fitoff-hfcache-20260618T102050Z
  • Skippy split matrix, size sweep: /Users/jdumay/code/lab-experiments/jianyang/benchmarks/speed-bench/results/pr875-skippy-split-hybrid-static-20260618T102746Z
  • Skippy split matrix, max-token sweep: /Users/jdumay/code/lab-experiments/jianyang/benchmarks/speed-bench/results/pr875-skippy-split-hybrid-max-sweep-20260618T104146Z

Validation

  • cargo check -p skippy-server
  • cargo check -p skippy-prompt
  • cargo test -p skippy-server --lib
  • cargo test -p skippy-prompt
  • cargo fmt --all --check
  • cargo clippy -p skippy-server --all-targets -- -D warnings
  • cargo clippy -p skippy-prompt --all-targets -- -D warnings
  • fresh lab benchmark above, using static-metal standalone skippy-server on studio54 + micstudio

coderabbitai Bot commented Jun 18, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.
