From 50ab53e62d1000f61b331ef4150299629fd87868 Mon Sep 17 00:00:00 2001
From: marksverdhei <mark.sverdhei@gmail.com>
Date: Sun, 7 Jun 2026 20:34:13 +0200
Subject: [PATCH] docs(readme): inventory DFlash + Gemma4 MTP under HT Fork
 Changes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Backend & quantization table omitted two HT-specific speculative
decoding features that have shipped to ht:

- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for
  partial-accept feature extraction) — landed via PR #62 (b0daec55b),
  integrates the z-lab DFlash block-diffusion drafter against Gemma4
  31B targets.

- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp) — vendored
  via PR #93 (4c0976566) before upstream PR ggml-org/llama.cpp#23398
  merges, so the gemma-4-12b-qat-mtp preset can ship on titan. Marked
  with Tracked-upstream=#23398 because the vendored copy retires once
  that PR merges and arrives through a normal master sync.

Found during a §7 documentation freshness sweep — the inventory exists
to be authoritative ("consult it before assuming a behaviour is
upstream stock" per AGENTS.md), so omissions defeat the purpose.

Docs-only, no code touched.
---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 3c75fa35f4f0..51739b01b99f 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,8 @@ Unlike upstream, we accept contributions from AI agents and assistants. We judge
 | LCO-Embedding-Omni GGUF | Conversion script support for the LCO-Embedding-Omni multi-modal embedding family, including audio tensors routed to the base class in the Qwen2.5 Omni mmproj path. | No |
 | MLA LoRA conversion | `convert_lora_to_gguf.py` understands MLA (`kv_b_proj`) so adapters trained on MLA-style attention convert without manual surgery. | No |
 | Scheduler split-input cap | `GGML_SCHED_MAX_SPLIT_INPUTS` raised 30 → 256 (CMake cache var) so wide multi-modal graphs no longer trip the scheduler's per-split input limit. | No |
+| DFlash speculative decoding | Block-diffusion drafter integration (`LLM_ARCH_DFLASH`, `--spec-type dflash`, `llama_set_dflash`, CUDA kernels for partial-accept feature extraction). Designed against the [z-lab DFlash](https://github.com/z-lab/dflash) reference for Gemma4 31B targets. | No |
+| Gemma4 MTP speculative decoding | Vendored upstream PR [#23398](https://github.com/ggml-org/llama.cpp/pull/23398) (`gemma4-assistant` arch + `--spec-type draft-mtp`) ahead of the upstream merge so the gemma-4-12b-qat-mtp preset can ship on titan. Retires when #23398 merges upstream and arrives through a normal master sync. | [#23398](https://github.com/ggml-org/llama.cpp/pull/23398) |
 
 ### WebUI + desktop shell