From 50ab53e62d1000f61b331ef4150299629fd87868 Mon Sep 17 00:00:00 2001
From: marksverdhei
Date: Sun, 7 Jun 2026 20:34:13 +0200
Subject: [PATCH] docs(readme): inventory DFlash + Gemma4 MTP under HT Fork Changes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Backend & quantization table omitted two HT-specific speculative
decoding features that have shipped to ht:

- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for
  partial-accept feature extraction) — landed via PR #62 (b0daec55b),
  integrates the z-lab DFlash block-diffusion drafter against Gemma4
  31B targets.
- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp) — vendored
  via PR #93 (4c0976566) ahead of upstream PR ggml-org/llama.cpp#23398
  merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked
  with Tracked-upstream=#23398 since it retires when that PR merges
  and flows through a normal master sync.

Found during a §7 documentation freshness sweep — the inventory exists
to be authoritative ("consult it before assuming a behaviour is
upstream stock" per AGENTS.md), so omissions defeat the purpose.

Docs-only, no code touched.
---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 3c75fa35f4f0..51739b01b99f 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,8 @@ Unlike upstream, we accept contributions from AI agents and assistants. We judge
 | LCO-Embedding-Omni GGUF | Conversion script support for the LCO-Embedding-Omni multi-modal embedding family, including audio tensors routed to the base class in the Qwen2.5 Omni mmproj path. | No |
 | MLA LoRA conversion | `convert_lora_to_gguf.py` understands MLA (`kv_b_proj`) so adapters trained on MLA-style attention convert without manual surgery. | No |
 | Scheduler split-input cap | `GGML_SCHED_MAX_SPLIT_INPUTS` raised 30 → 256 (CMake cache var) so wide multi-modal graphs no longer trip the scheduler's per-split input limit. | No |
+| DFlash speculative decoding | Block-diffusion drafter integration (`LLM_ARCH_DFLASH`, `--spec-type dflash`, `llama_set_dflash`, CUDA kernels for partial-accept feature extraction). Designed against the [z-lab DFlash](https://github.com/z-lab/dflash) reference for Gemma4 31B targets. | No |
+| Gemma4 MTP speculative | Vendored upstream PR [#23398](https://github.com/ggml-org/llama.cpp/pull/23398) (`gemma4-assistant` arch + `--spec-type draft-mtp`) ahead of upstream merge so the gemma-4-12b-qat-mtp preset can ship on titan. Retires when #23398 merges upstream and flows through a normal master sync. | [#23398](https://github.com/ggml-org/llama.cpp/pull/23398) |
 
 ### WebUI + desktop shell