fix(dflash): infer draft dimensions from safetensors by easel · Pull Request #195 · Luce-Org/lucebox-hub

easel · 2026-05-14T20:17:17Z

Summary

Fixes Qwen3.6 DFlash draft loading by deriving the draft architecture dimensions from the draft .safetensors tensor shapes instead of inheriting the verifier/target model's attention metadata.

The failure reproduced as:

draft load: safetensors: 'layers.0.self_attn.k_norm.weight' shape[0]=128 expected 256

A later mismatch could also appear on projection tensors, for example o_proj.weight shape[1]=4096 expected 3072.

Current understanding

This is not a bug in upstream safetensors. The safetensors archive already records and returns the correct tensor shape, e.g. layers.0.self_attn.q_norm.weight has shape [128]. The bug was in lucebox: before validating the draft tensor header, load_draft_safetensors copied target model attention dimensions into DraftWeights.

That is wrong for Qwen3.6 because the target and draft are different transformers:

| Dimension | Qwen3.6 target | DFlash draft |
| --- | --- | --- |
| `head_dim` | 256 | 128 |
| `n_head` | 24 | 32 |
| `n_head_kv` | 4 | 8 |
| `q_dim` | 6144 | 4096 |
| `kv_dim` | 1024 | 1024 |
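The two projected widths in the table follow from the per-head values: `q_dim = n_head * head_dim` and `kv_dim = n_head_kv * head_dim`. The short check below reproduces the table rows; the `AttnDims` class is an illustrative name and not something that exists in lucebox.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnDims:
    """Illustrative container for per-model attention dimensions."""
    head_dim: int
    n_head: int
    n_head_kv: int

    @property
    def q_dim(self) -> int:
        # Width of the query projection output.
        return self.n_head * self.head_dim

    @property
    def kv_dim(self) -> int:
        # Width of the key/value projection output.
        return self.n_head_kv * self.head_dim


target = AttnDims(head_dim=256, n_head=24, n_head_kv=4)
draft = AttnDims(head_dim=128, n_head=32, n_head_kv=8)

print(target.q_dim, target.kv_dim)  # 6144 1024
print(draft.q_dim, draft.kv_dim)    # 4096 1024
```

Only `kv_dim` coincides between the two models. A loader that inherits the target's `head_dim` expects 256 entries in `k_norm.weight` while the draft tensor has 128, which is the mismatch reported in the error above.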

The old behavior happened to work for earlier drafts where these dimensions matched closely enough, which made the target-metadata inheritance bug latent.

What changed

load_draft_safetensors now infers draft dimensions directly from the safetensors header after parsing it:

- hidden_norm.weight / norm.weight -> n_embd
- fc.weight -> captured target-layer count
- layers.0.mlp.gate_proj.weight -> n_ff
- layers.0.self_attn.q_norm.weight / k_norm.weight -> head_dim
- layers.0.self_attn.q_proj.weight -> n_head
- layers.0.self_attn.k_proj.weight / v_proj.weight -> n_head_kv
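The mapping above can be sketched in a few lines of Python against the safetensors header (an 8-byte little-endian length followed by a JSON object of tensor entries). This is a minimal sketch and not the C++ loader from this PR. It assumes Hugging Face-style `[out_features, in_features]` projection weights and assumes `fc.weight` is laid out as `[n_embd, n_capture_layers * n_embd]`; the function names and the toy sizes `n_embd=64`, `n_ff=256` and 5 captured layers are made up for illustration.

```python
import json
import struct


def read_header(blob: bytes) -> dict:
    """Parse a safetensors header: 8-byte little-endian length, then JSON."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len])
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def infer_draft_dims(header: dict) -> dict:
    """Derive draft dimensions from tensor shapes alone (hypothetical helper)."""

    def shape(*names: str) -> list:
        # Return the shape of the first tensor name present in the header.
        for name in names:
            if name in header:
                return header[name]["shape"]
        raise KeyError(f"none of {names} present in safetensors header")

    attn = "layers.0.self_attn."
    n_embd = shape("hidden_norm.weight", "norm.weight")[0]
    head_dim = shape(attn + "q_norm.weight", attn + "k_norm.weight")[0]
    q_rows = shape(attn + "q_proj.weight")[0]
    kv_rows = shape(attn + "k_proj.weight", attn + "v_proj.weight")[0]
    n_ff = shape("layers.0.mlp.gate_proj.weight")[0]
    # Assumption: fc.weight is [n_embd, n_capture_layers * n_embd].
    fc_in = shape("fc.weight")[1]

    if q_rows % head_dim or kv_rows % head_dim or fc_in % n_embd:
        raise ValueError("tensor shapes are not mutually consistent")

    return {
        "n_embd": n_embd,
        "n_ff": n_ff,
        "head_dim": head_dim,
        "n_head": q_rows // head_dim,
        "n_head_kv": kv_rows // head_dim,
        "n_capture_layers": fc_in // n_embd,
    }


def toy_blob() -> bytes:
    """Header-only blob with made-up sizes; offsets are placeholders, no payload."""
    shapes = {
        "hidden_norm.weight": [64],
        "fc.weight": [64, 5 * 64],
        "layers.0.mlp.gate_proj.weight": [256, 64],
        "layers.0.self_attn.q_norm.weight": [128],
        "layers.0.self_attn.q_proj.weight": [32 * 128, 64],
        "layers.0.self_attn.k_proj.weight": [8 * 128, 64],
    }
    header = {
        name: {"dtype": "BF16", "shape": s, "data_offsets": [0, 0]}
        for name, s in shapes.items()
    }
    raw = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


dims = infer_draft_dims(read_header(toy_blob()))
print(dims)
```

With the toy header, the inferred `head_dim`, `n_head` and `n_head_kv` come out as 128, 32 and 8, matching the draft column of the table, without consulting any target metadata.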

The loader still cross-checks target-dependent invariants when target metadata is available:

- draft hidden size must match target hidden size
- fc.weight's inferred captured-layer count must match target->n_capture_layers
- mask_token_id is still copied from the target
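The cross-checks listed above might look roughly like the following. This is a sketch under assumptions: the dictionaries, the `cross_check` name, and the placeholder `mask_token_id` value are illustrative and do not mirror the actual `DraftWeights` or target structs.

```python
from typing import Optional


def cross_check(draft: dict, target: Optional[dict]) -> dict:
    """Validate draft dims against the target where they must agree."""
    checked = dict(draft)
    if target is None:
        # No-target callers: keep the inferred dimensions as they are.
        return checked

    if draft["n_embd"] != target["n_embd"]:
        raise ValueError(
            f"draft hidden size {draft['n_embd']} != target {target['n_embd']}"
        )
    if draft["n_capture_layers"] != target["n_capture_layers"]:
        raise ValueError(
            f"draft fc.weight implies {draft['n_capture_layers']} captured "
            f"layers, target has {target['n_capture_layers']}"
        )

    # mask_token_id is not a tensor shape, so it still comes from the target.
    checked["mask_token_id"] = target["mask_token_id"]
    return checked


# Toy values; 99 is a placeholder token id, not a real one.
draft = {"n_embd": 64, "n_capture_layers": 5, "head_dim": 128}
target = {"n_embd": 64, "n_capture_layers": 5, "head_dim": 256,
          "mask_token_id": 99}

merged = cross_check(draft, target)
print(merged["head_dim"], merged["mask_token_id"])  # 128 99
```

Note that `head_dim` is deliberately not compared: the draft keeps its own value of 128 even though the target reports 256.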

Why this replaces the earlier config.json approach

The previous version of this PR parsed the draft model's adjacent config.json and used that as the authoritative source for draft dimensions. On closer review of the failure, that approach was more complicated than this bug requires.

For the #195 failure, the safetensors header itself is the authoritative source. Parsing config.json may still be useful later for non-tensor architecture metadata such as SWA layer types or sliding-window policy, but it is not required to fix the tensor-shape mismatch and should be handled as a separate feature if needed.

Compatibility

- Drafts without config.json are supported; the fix no longer depends on that file.
- No-target callers still work because dimensions are inferred from the archive header itself.
- Existing target/draft compatibility checks are stricter where the draft must agree with the target, especially hidden size and captured-layer count.
- The patch does not rename the historical DFLASH27B_TARGET_* constants, even though some of them describe draft defaults. That rename would be mechanical but noisy and is intentionally out of scope.

Test plan

- Proved upstream safetensors preserves the relevant tensor shape with a direct local test: a BF16 tensor named like layers.0.self_attn.q_norm.weight with shape [128] deserializes as [128].
- `g++ -std=c++17 -fsyntax-only -I dflash/include -I dflash/src -I dflash/src/common -I dflash/src/draft -I dflash/deps/llama.cpp/ggml/include dflash/src/draft/draft_safetensors_loader.cpp`
- `CUDACXX=/usr/local/cuda/bin/nvcc cmake -S dflash -B /tmp/lbh-pr195-shape/build -DDFLASH27B_GPU_BACKEND=cuda -DDFLASH27B_FA_ALL_QUANTS=OFF -DCMAKE_CUDA_ARCHITECTURES=86`
- `cmake --build /tmp/lbh-pr195-shape/build --target smoke_load_draft -j 8`
- `LD_LIBRARY_PATH=/tmp/lbh-pr195-shape/build/deps/llama.cpp/ggml/src:/tmp/lbh-pr195-shape/build/deps/llama.cpp/ggml/src/ggml-cuda /tmp/lbh-pr195-shape/build/smoke_load_draft /home/erik/Projects/lucebox-hub/dflash/models/draft/model.safetensors`

The real local draft smoke test loaded 58 tensors / 3.22 GiB and completed with OK.
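For readers who want to repeat the shape round-trip from the first test-plan item without any dependencies, here is a stdlib-only sketch. It is not the test that was actually run for this PR: it writes and reads the bytes by hand following the safetensors layout (length prefix, JSON header, raw payload) rather than calling the upstream library, and the helper names are hypothetical.

```python
import json
import struct
from math import prod

BF16_BYTES = 2  # bfloat16 is two bytes per element


def write_blob(name: str, shape: list) -> bytes:
    """Serialize one zero-filled BF16 tensor in safetensors layout."""
    payload = bytes(prod(shape) * BF16_BYTES)
    header = {
        name: {
            "dtype": "BF16",
            "shape": list(shape),
            # Offsets are relative to the start of the payload region.
            "data_offsets": [0, len(payload)],
        }
    }
    raw = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw + payload


def read_entry(blob: bytes, name: str):
    """Return (shape, raw bytes) for one tensor, checking the byte count."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    entry = json.loads(blob[8:8 + header_len])[name]
    begin, end = entry["data_offsets"]
    base = 8 + header_len
    data = blob[base + begin:base + end]
    if len(data) != prod(entry["shape"]) * BF16_BYTES:
        raise ValueError("payload size does not match shape and dtype")
    return entry["shape"], data


name = "layers.0.self_attn.q_norm.weight"
shape, data = read_entry(write_blob(name, [128]), name)
print(shape, len(data))  # [128] 256
```

The shape stored in the header comes back unchanged, which is consistent with the conclusion that the mismatch originated in the loader's expectations and not in the archive.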

cubic-dev-ai

No issues found across 5 files


…ary update

Three new run directories plus a yesterday-vs-today comparison snapshot:

- 2026-05-17T17-40-56_f031f08/: preserved yesterday's reference matrix (n_sample=8, n_runs=8, bootstrap CI 95%). DFlash b22: HE 169.40, GSM 104.32, Math 119.36 tok/s. MTP d3: 65.62 / 61.00 / 61.89. AR baseline ~34 tok/s across suites. Was untracked in tree because the bench_matrix orchestrator landed on a stale branch.
- 2026-05-19T11-43-13_83e19d9/: first matrix re-run on HEAD (HE only, MTP_GGUF env unset → mtp_d3 cell empty). DFlash b22: 173.81 tok/s. Confirms no DFlash kernel regression vs f031f08.
- 2026-05-19T11-54-32_83e19d9/: full apples-to-apples re-run on HEAD with MTP_GGUF set. Result: all 9 cells (3 suites × {AR, DFlash b22, MTP d3}) within ±5% of f031f08 mean tok/s. DFlash HE +2.6%, MTP HE −2.0%, etc. No regression.
- 2026-05-19_mtp-prefix-warm-ghost/summary.md: updated with the apples-to-apples table above, an agent bucket-label-vs-actual-token audit (agent_24k prompts are actually ~2.6K), a known-gaps section documenting what is NOT yet tested (real CLI agentic loops, NIAH > 131K, concurrent sessions, sustained throughput, PR Luce-Org#195 merge).

easel marked this pull request as ready for review May 19, 2026 13:11

cubic-dev-ai Bot reviewed May 19, 2026


fix(dflash): infer draft dims from safetensors

b95674d

easel force-pushed the fix/draft-loader-dims-from-config branch from 89ec481 to b95674d May 19, 2026 14:27

easel changed the title ~~fix(dflash): read draft dims from config.json, don't inherit from target~~ fix(dflash): infer draft dimensions from safetensors May 19, 2026

davide221 merged commit 7476720 into Luce-Org:main May 20, 2026
2 checks passed

davide221 mentioned this pull request May 20, 2026

RuntimeError: dflash daemon exited before weights finished loading. Check the daemon's stderr. #233

Closed
