fix(dflash): infer draft dimensions from safetensors#195
Merged
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request on May 19, 2026:
…ary update
Three new run directories plus a yesterday-vs-today comparison snapshot:
2026-05-17T17-40-56_f031f08/ — preserved yesterday's reference matrix
(n_sample=8, n_runs=8, bootstrap CI 95%). DFlash b22: HE 169.40,
GSM 104.32, Math 119.36 tok/s. MTP d3: 65.62 / 61.00 / 61.89. AR
baseline ~34 tok/s across suites. Was untracked in tree because the
bench_matrix orchestrator landed on a stale branch.
2026-05-19T11-43-13_83e19d9/ — first matrix re-run on HEAD (HE only,
MTP_GGUF env unset → mtp_d3 cell empty). DFlash b22: 173.81 tok/s.
Confirms no DFlash kernel regression vs f031f08.
2026-05-19T11-54-32_83e19d9/ — full apples-to-apples re-run on HEAD
with MTP_GGUF set. Result: all 9 cells (3 suites × {AR, DFlash b22,
MTP d3}) within ±5% of f031f08 mean tok/s. DFlash HE +2.6%, MTP HE
−2.0%, etc. No regression.
2026-05-19_mtp-prefix-warm-ghost/summary.md — updated with the
apples-to-apples table above, an agent bucket-label-vs-actual-token
audit (agent_24k prompts are actually ~2.6K), a known-gaps section
documenting what is NOT yet tested (real CLI agentic loops, NIAH > 131K,
concurrent sessions, sustained throughput, PR Luce-Org#195 merge).
Summary
Fixes Qwen3.6 DFlash draft loading by deriving the draft architecture dimensions from the draft .safetensors tensor shapes instead of inheriting the verifier/target model's attention metadata.

The failure reproduced as:

A later mismatch could also appear on projection tensors, for example o_proj.weight shape[1]=4096 expected 3072.

Current understanding
This is not a bug in upstream safetensors. The safetensors archive already records and returns the correct tensor shape, e.g. layers.0.self_attn.q_norm.weight has shape [128]. The bug was in lucebox: before validating the draft tensor header, load_draft_safetensors copied target model attention dimensions into DraftWeights.

That is wrong for Qwen3.6 because the target and draft are different transformers, and these attention dimensions need not match:

- head_dim
- n_head
- n_head_kv
- q_dim
- kv_dim

The old behavior happened to work for earlier drafts where these dimensions matched closely enough, which made the target-metadata inheritance bug latent.
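To make the mismatch concrete, here is a minimal sketch of a shape check that is run once against inherited target dimensions and once against dimensions read from the draft itself. It is not the lucebox validator. The values 128 and 4096 and the o_proj message format come from the text above; the inherited values (head_dim=256, n_head=12), the q_norm message format, the o_proj row count, and the relation q_dim = n_head * head_dim are assumptions chosen only to reproduce the 3072 in the reported message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnDims:
    head_dim: int
    n_head: int

    @property
    def q_dim(self) -> int:
        # Assumed relation: query width = number of heads * per-head width.
        return self.n_head * self.head_dim


def shape_errors(shapes, dims):
    """List every mismatch between header shapes and the expected dims."""
    errors = []
    q_norm = shapes["layers.0.self_attn.q_norm.weight"]
    if q_norm[0] != dims.head_dim:
        # Hypothetical message, modeled on the o_proj message in the PR text.
        errors.append(f"q_norm.weight shape[0]={q_norm[0]} expected {dims.head_dim}")
    o_proj = shapes["layers.0.self_attn.o_proj.weight"]
    if o_proj[1] != dims.q_dim:
        errors.append(f"o_proj.weight shape[1]={o_proj[1]} expected {dims.q_dim}")
    return errors


# Shapes as the draft header might report them. 128 and 4096 come from the PR
# text; the o_proj row count (2048) is a placeholder and is never checked.
draft_header = {
    "layers.0.self_attn.q_norm.weight": [128],
    "layers.0.self_attn.o_proj.weight": [2048, 4096],
}

# Hypothetical target metadata, picked so that q_dim == 3072.
inherited = AttnDims(head_dim=256, n_head=12)
# Dimensions derived from the draft's own shapes instead.
inferred = AttnDims(head_dim=128, n_head=4096 // 128)

print(shape_errors(draft_header, inherited))  # two mismatches
print(shape_errors(draft_header, inferred))   # no mismatches
```

The point of the sketch is only that the header is self-consistent: the check fails because the expectation was taken from the wrong model, not because the stored shapes are wrong.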
What changed
load_draft_safetensors now infers draft dimensions directly from the safetensors header after parsing it:

- hidden_norm.weight / norm.weight -> n_embd
- fc.weight -> captured target-layer count
- layers.0.mlp.gate_proj.weight -> n_ff
- layers.0.self_attn.q_norm.weight / k_norm.weight -> head_dim
- layers.0.self_attn.q_proj.weight -> n_head
- layers.0.self_attn.k_proj.weight / v_proj.weight -> n_head_kv

The loader still cross-checks target-dependent invariants when target metadata is available:

- fc.weight's inferred captured-layer count must match target->n_capture_layers
- mask_token_id is still copied from the target

Why this replaces the earlier config.json approach
The previous version of this PR parsed the draft model's adjacent config.json and used that as the authoritative source for draft dimensions. After reviewing the failure more carefully, that was more complicated than necessary for this bug.

For the #195 failure, the safetensors header itself is the authoritative source. Parsing config.json may still be useful later for non-tensor architecture metadata such as SWA layer types or sliding-window policy, but it is not required to fix the tensor-shape mismatch and should be handled as a separate feature if needed.

Compatibility

- Drafts without an adjacent config.json are supported; the fix no longer depends on that file.
- This PR does not rename the DFLASH27B_TARGET_* constants, even though some of them describe draft defaults. That rename would be mechanical but noisy and is intentionally out of scope.

Test plan
- Confirmed that upstream safetensors preserves the relevant tensor shape with a direct local test: a BF16 tensor named like layers.0.self_attn.q_norm.weight with shape [128] deserializes as [128].
- g++ -std=c++17 -fsyntax-only -I dflash/include -I dflash/src -I dflash/src/common -I dflash/src/draft -I dflash/deps/llama.cpp/ggml/include dflash/src/draft/draft_safetensors_loader.cpp
- CUDACXX=/usr/local/cuda/bin/nvcc cmake -S dflash -B /tmp/lbh-pr195-shape/build -DDFLASH27B_GPU_BACKEND=cuda -DDFLASH27B_FA_ALL_QUANTS=OFF -DCMAKE_CUDA_ARCHITECTURES=86
- cmake --build /tmp/lbh-pr195-shape/build --target smoke_load_draft -j 8
- LD_LIBRARY_PATH=/tmp/lbh-pr195-shape/build/deps/llama.cpp/ggml/src:/tmp/lbh-pr195-shape/build/deps/llama.cpp/ggml/src/ggml-cuda /tmp/lbh-pr195-shape/build/smoke_load_draft /home/erik/Projects/lucebox-hub/dflash/models/draft/model.safetensors

The real local draft smoke test loaded 58 tensors / 3.22 GiB and completed with OK.
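The direct shape check in the test plan and the inference rules listed under "What changed" can be sketched together in a few lines of standard-library Python. This is not the C++ loader. It relies on the documented safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes). The helper names, the toy dimensions, the [out_features, in_features] row convention, and the rule that fc.weight's second dimension is a multiple of n_embd are all assumptions made for illustration.

```python
import json
import struct


def build_safetensors(tensors):
    """Serialize {name: (dtype, shape)} as safetensors bytes with zeroed data."""
    sizes = {"BF16": 2, "F16": 2, "F32": 4}
    header, offset = {}, 0
    for name, (dtype, shape) in tensors.items():
        nbytes = sizes[dtype]
        for dim in shape:
            nbytes *= dim
        header[name] = {"dtype": dtype, "shape": list(shape),
                        "data_offsets": [offset, offset + nbytes]}
        offset += nbytes
    blob = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # Layout: u64 little-endian header length, JSON header, raw tensor bytes.
    return struct.pack("<Q", len(blob)) + blob + bytes(offset)


def read_shapes(data):
    """Return {tensor name: shape} from the header of a safetensors blob."""
    (header_len,) = struct.unpack_from("<Q", data, 0)
    header = json.loads(data[8:8 + header_len].decode("utf-8"))
    return {k: v["shape"] for k, v in header.items() if k != "__metadata__"}


def first_shape(shapes, *names):
    """Shape of the first tensor name that is present in the header."""
    for name in names:
        if name in shapes:
            return shapes[name]
    raise KeyError(f"none of {names} found in safetensors header")


def infer_draft_dims(shapes):
    """Derive draft dimensions from tensor shapes alone (assumed conventions)."""
    attn = "layers.0.self_attn."
    n_embd = first_shape(shapes, "hidden_norm.weight", "norm.weight")[0]
    head_dim = first_shape(shapes, attn + "q_norm.weight", attn + "k_norm.weight")[0]
    q_rows = shapes[attn + "q_proj.weight"][0]
    kv_rows = first_shape(shapes, attn + "k_proj.weight", attn + "v_proj.weight")[0]
    if q_rows % head_dim or kv_rows % head_dim:
        raise ValueError("projection rows are not a multiple of head_dim")
    return {
        "n_embd": n_embd,
        "n_ff": shapes["layers.0.mlp.gate_proj.weight"][0],
        "head_dim": head_dim,
        "n_head": q_rows // head_dim,
        "n_head_kv": kv_rows // head_dim,
        # Assumption: fc.weight is [n_embd, n_capture_layers * n_embd].
        "n_capture_layers": shapes["fc.weight"][1] // n_embd,
    }


if __name__ == "__main__":
    # Toy draft with small, made-up dimensions (not the real Qwen3.6 draft).
    toy = {
        "hidden_norm.weight": ("BF16", [256]),
        "fc.weight": ("BF16", [256, 768]),
        "layers.0.mlp.gate_proj.weight": ("BF16", [512, 256]),
        "layers.0.self_attn.q_norm.weight": ("BF16", [128]),
        "layers.0.self_attn.q_proj.weight": ("BF16", [512, 256]),
        "layers.0.self_attn.k_proj.weight": ("BF16", [256, 256]),
    }
    print(infer_draft_dims(read_shapes(build_safetensors(toy))))
```

Reading only the header keeps the check cheap: no tensor data has to be loaded to learn that q_norm.weight is [128], which is the property the PR depends on.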