Add Prompt Lookup Decoding (ngram-simple) and Rolling-Hash Speculative Memory (ngram-mod) #1297
Open
mayank2130 wants to merge 3 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Hey @angeloskath, can this PLD/n-gram decoding PR be reviewed? If you're not the right person to reach out to for mlx-lm PRs, could you point me to someone else? Thanks.
Closes #851
Summary
Adds Prompt Lookup Decoding (PLD) and rolling-hash speculative decoding to `mlx_lm` via a generalized `DraftStrategy` abstraction.

Instead of generating speculative drafts with a smaller neural model, the new strategies reuse previously observed token trajectories:

- `ngram-simple` performs exact prompt-history lookup
- `ngram-mod` implements a rolling-hash associative memory ported from `llama.cpp` PR #19164

Both strategies preserve output correctness because speculative tokens are only accepted if verified by the target model under the same sampling configuration.
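The lookup-then-verify idea can be illustrated with a minimal, framework-free sketch. It is not the code from this PR: `lookup_draft`, `verify_greedy`, and `target_next` are hypothetical names, greedy (argmax) decoding is assumed, and the target is called once per draft token for readability, whereas a real implementation would score all draft positions in one forward pass.

```python
from typing import Callable, List, Sequence


def lookup_draft(tokens: Sequence[int], ngram_size: int = 3, num_draft: int = 4) -> List[int]:
    """Find the most recent earlier occurrence of the trailing n-gram and
    return up to num_draft tokens that followed it."""
    n = len(tokens)
    if n <= ngram_size:
        return []
    tail = list(tokens[n - ngram_size:])
    # Scan backward over earlier start positions, skipping the tail itself.
    for start in range(n - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + num_draft])
    return []


def verify_greedy(
    tokens: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted this step: the verified prefix of the draft,
    followed by one token chosen by the target model itself."""
    context = list(tokens)
    emitted: List[int] = []
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # First mismatch: discard the rest of the draft, keep the target's token.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        context.append(tok)
    # Entire draft accepted: the target still contributes one extra token.
    emitted.append(target_next(context))
    return emitted


def toy_target(context: Sequence[int]) -> int:
    """Stand-in for the target model: cycles 1, 2, 3, 4, 5, 1, ..."""
    return (context[-1] % 5) + 1


history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = lookup_draft(history)
step = verify_greedy(history, draft, toy_target)
```

Because every emitted token is either confirmed or produced by the target, the output is the same sequence that plain greedy decoding would give; a bad draft only costs wasted verification work.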
This PR adds:

- `DraftStrategy` interface for pluggable speculative drafters
- `ModelDraftStrategy` for existing neural drafting
- `NgramSimpleStrategy` for prompt lookup decoding
- `NgramModStrategy` + `NgramModTable` for rolling-hash speculative memory

Usage
For `ngram-mod`, reuse a table across related generations to preserve learned n-gram memory.

CLI: multi-turn ngram-simple
CLI: multi-turn ngram-mod
The chat command keeps the conversation history and prompt cache alive across turns, so T2/T3 can reuse the generated structure from T1.
Server
Per-request JSON overrides:

- `draft_type`, `ngram_size`
- `disable_adaptive_gate`

Architecture
Speculative drafting is abstracted behind:
`NgramSimpleStrategy` scans backward for matching n-grams and proposes the following continuation tokens directly from prior history.

`NgramModStrategy` ports llama.cpp's rolling-hash speculative memory.

Architecture mirrors llama.cpp's split between:
The shared table stores:
hash(ngram) -> next_token

allowing speculative reuse across requests handled by the same running server process.
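A rough sketch of such a `hash(ngram) -> next_token` table is shown below. It is an illustration only and does not reproduce `NgramModTable` or the llama.cpp implementation: the class name `HashedNgramTable`, the polynomial hash, the slot count, and the `-1` empty marker are all assumptions, and the hash is recomputed per window rather than updated incrementally.

```python
from typing import List, Sequence


class HashedNgramTable:
    """Fixed-size table mapping hash(ngram) to the token that followed it."""

    EMPTY = -1  # marker for a slot that has never been written

    def __init__(self, ngram_size: int = 4, num_slots: int = 1 << 16) -> None:
        self.ngram_size = ngram_size
        self.num_slots = num_slots
        self.slots: List[int] = [self.EMPTY] * num_slots

    def _index(self, ngram: Sequence[int]) -> int:
        # Simple polynomial hash reduced modulo the table size.
        h = 0
        for tok in ngram:
            h = (h * 1000003 + tok + 1) & 0xFFFFFFFFFFFFFFFF
        return h % self.num_slots

    def update(self, tokens: Sequence[int]) -> None:
        """Record, for every n-gram in tokens, the token that followed it.
        Colliding n-grams silently overwrite each other."""
        n = self.ngram_size
        for i in range(len(tokens) - n):
            self.slots[self._index(tokens[i:i + n])] = tokens[i + n]

    def draft(self, tokens: Sequence[int], num_draft: int = 4) -> List[int]:
        """Chain lookups from the trailing n-gram until a slot is empty."""
        n = self.ngram_size
        window = list(tokens[-n:])
        if len(window) < n:
            return []
        out: List[int] = []
        for _ in range(num_draft):
            nxt = self.slots[self._index(window)]
            if nxt == self.EMPTY:
                break
            out.append(nxt)
            window = window[1:] + [nxt]
        return out


table = HashedNgramTable(ngram_size=2, num_slots=1024)
table.update([10, 11, 12, 13, 14])      # learned during one request
proposal = table.draft([99, 10, 11])    # reused by a later request
```

Hash collisions can make the table return a wrong continuation, which is harmless for correctness as long as the target model verifies every drafted token; the cost is only a rejected draft.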
Implementation behavior intentionally matches llama.cpp:
Adaptive Gate
An optional adaptive gate computes a 3-gram repetition score over the prompt. If repetition falls below:
NGRAM_GATE_THRESHOLD = 0.02

speculation is skipped automatically.
This is particularly important for ngram-mod, whose cold-start behavior can regress below baseline throughput on low-repetition prompts.
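One plausible way to compute such a gate is sketched below. The exact scoring formula is not given in this description, so the definition used here (the fraction of 3-gram positions that repeat an already-seen 3-gram) is an assumption, as are the helper names `repetition_score` and `should_speculate`; only the threshold value comes from the text above.

```python
from typing import Sequence

NGRAM_GATE_THRESHOLD = 0.02  # value quoted in the PR description


def repetition_score(tokens: Sequence[int], n: int = 3) -> float:
    """Assumed definition: 1 - (distinct n-grams / total n-gram positions).
    Returns 0.0 when every n-gram is unique or the prompt is too short."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    distinct = len({tuple(tokens[i:i + n]) for i in range(total)})
    return 1.0 - distinct / total


def should_speculate(tokens: Sequence[int], threshold: float = NGRAM_GATE_THRESHOLD) -> bool:
    """Skip n-gram drafting when the prompt shows too little repetition."""
    return repetition_score(tokens) >= threshold


novel_prompt = list(range(100))     # no repeated 3-grams
repetitive_prompt = [1, 2, 3] * 10  # heavily repeated structure
```

With this definition a prompt of entirely unique tokens scores 0.0 and is gated off, while templated or code-like prompts with recurring spans pass the gate easily.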
Benchmarks
All benchmarks used:
LONG MULTI-TURN EDITING (~280 TOK/TURN) — OVERALL
ngram-mod nd=6 per-turn behavior