
fix(dflash): infer draft dimensions from safetensors #195

Merged
davide221 merged 1 commit into Luce-Org:main from easel:fix/draft-loader-dims-from-config
May 20, 2026

@easel easel commented May 14, 2026

Collaborator

Summary

Fixes Qwen3.6 DFlash draft loading by deriving the draft architecture dimensions from the draft .safetensors tensor shapes instead of inheriting the verifier/target model's attention metadata.

The failure reproduced as:

draft load: safetensors: 'layers.0.self_attn.k_norm.weight' shape[0]=128 expected 256

A later mismatch could also appear on projection tensors, for example o_proj.weight shape[1]=4096 expected 3072.

Current understanding

This is not a bug in upstream safetensors. The safetensors archive already records and returns the correct tensor shape, e.g. layers.0.self_attn.q_norm.weight has shape [128]. The bug was in lucebox: before validating the draft tensor header, load_draft_safetensors copied target model attention dimensions into DraftWeights.

That is wrong for Qwen3.6 because the target and draft are different transformers:

| Dimension | Qwen3.6 target | DFlash draft |
|-----------|----------------|--------------|
| head_dim  | 256            | 128          |
| n_head    | 24             | 32           |
| n_head_kv | 4              | 8            |
| q_dim     | 6144           | 4096         |
| kv_dim    | 1024           | 1024         |

The old behavior happened to work for earlier drafts whose dimensions matched the target's closely enough to pass validation, which kept the target-metadata inheritance bug latent.
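The projection widths in the table follow from q_dim = n_head × head_dim and kv_dim = n_head_kv × head_dim. The sketch below checks that arithmetic at compile time. `AttnDims`, `q_dim`, and `kv_dim` are hypothetical names for illustration, not the lucebox types. The last assertion is an inference from the numbers rather than something stated in this PR: the `expected 3072` in the o_proj message is consistent with the target's n_head (24) combined with the draft's head_dim (128).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical struct for illustration; not the actual DraftWeights layout.
struct AttnDims {
    int64_t head_dim;
    int64_t n_head;
    int64_t n_head_kv;
};

// Width of the query projection output: one head_dim slice per query head.
constexpr int64_t q_dim(const AttnDims& d) { return d.n_head * d.head_dim; }

// Width of the key/value projection output: one head_dim slice per KV head.
constexpr int64_t kv_dim(const AttnDims& d) { return d.n_head_kv * d.head_dim; }

// Values taken from the table above.
constexpr AttnDims kTarget{256, 24, 4};
constexpr AttnDims kDraft{128, 32, 8};

static_assert(q_dim(kTarget) == 6144, "target q_dim");
static_assert(q_dim(kDraft) == 4096, "draft q_dim");
static_assert(kv_dim(kTarget) == 1024 && kv_dim(kDraft) == 1024, "kv_dim");

// Assumption: mixing the target's head counts with the draft's head_dim
// reproduces the 3072 seen in the o_proj error message.
static_assert(q_dim(AttnDims{kDraft.head_dim, kTarget.n_head, kTarget.n_head_kv}) == 3072,
              "mixed target/draft q_dim");
```

Because kv_dim happens to be 1024 for both models, a check on the K/V projection widths alone would not have caught the inherited metadata.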

What changed

load_draft_safetensors now infers draft dimensions directly from the safetensors header after parsing it:

  • hidden_norm.weight / norm.weight -> n_embd
  • fc.weight -> captured target-layer count
  • layers.0.mlp.gate_proj.weight -> n_ff
  • layers.0.self_attn.q_norm.weight / k_norm.weight -> head_dim
  • layers.0.self_attn.q_proj.weight -> n_head
  • layers.0.self_attn.k_proj.weight / v_proj.weight -> n_head_kv
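The mapping above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `load_draft_safetensors`: `ShapeMap`, `DraftDims`, `dim_of`, and `infer_draft_dims` are hypothetical names, weights are assumed to use the PyTorch Linear layout `[out_features, in_features]`, `fc.weight` is assumed to be `[n_embd, n_capture * n_embd]`, and the head counts are assumed to be the projection's output width divided by head_dim.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Stand-in for the parsed safetensors header: tensor name -> shape.
using ShapeMap = std::map<std::string, std::vector<int64_t>>;

// Hypothetical result type; the real DraftWeights carries more fields.
struct DraftDims {
    int64_t n_embd = 0, n_capture = 0, n_ff = 0;
    int64_t head_dim = 0, n_head = 0, n_head_kv = 0;
};

// Returns shape[axis] of the first listed tensor that exists, or nullopt.
static std::optional<int64_t> dim_of(const ShapeMap& shapes,
                                     const std::vector<std::string>& names,
                                     std::size_t axis) {
    for (const auto& name : names) {
        auto it = shapes.find(name);
        if (it != shapes.end() && axis < it->second.size()) return it->second[axis];
    }
    return std::nullopt;
}

// Infers draft dimensions from tensor shapes alone (layout assumptions above).
std::optional<DraftDims> infer_draft_dims(const ShapeMap& shapes) {
    const std::string p = "layers.0.";
    auto n_embd   = dim_of(shapes, {"hidden_norm.weight", "norm.weight"}, 0);
    auto fc_in    = dim_of(shapes, {"fc.weight"}, 1);
    auto n_ff     = dim_of(shapes, {p + "mlp.gate_proj.weight"}, 0);
    auto head_dim = dim_of(shapes, {p + "self_attn.q_norm.weight",
                                    p + "self_attn.k_norm.weight"}, 0);
    auto q_out    = dim_of(shapes, {p + "self_attn.q_proj.weight"}, 0);
    auto kv_out   = dim_of(shapes, {p + "self_attn.k_proj.weight",
                                    p + "self_attn.v_proj.weight"}, 0);
    if (!n_embd || !fc_in || !n_ff || !head_dim || !q_out || !kv_out) return std::nullopt;
    if (*n_embd <= 0 || *head_dim <= 0) return std::nullopt;
    // Projection widths must be whole multiples of the inferred unit sizes.
    if (*fc_in % *n_embd || *q_out % *head_dim || *kv_out % *head_dim) return std::nullopt;

    DraftDims d;
    d.n_embd    = *n_embd;
    d.n_capture = *fc_in / *n_embd;     // captured target-layer count
    d.n_ff      = *n_ff;
    d.head_dim  = *head_dim;
    d.n_head    = *q_out / *head_dim;   // e.g. 4096 / 128 = 32
    d.n_head_kv = *kv_out / *head_dim;  // e.g. 1024 / 128 = 8
    return d;
}
```

Returning `std::nullopt` on a missing tensor or a non-divisible width keeps the failure at load time, where the tensor name is still available for the error message.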

The loader still cross-checks target-dependent invariants when target metadata is available:

  • draft hidden size must match target hidden size
  • fc.weight's inferred captured-layer count must match target->n_capture_layers
  • mask_token_id is still copied from the target
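A minimal sketch of these cross-checks, assuming the target is passed as an optional pointer. `TargetInfo`, `DraftInfo`, and `cross_check` are hypothetical names, and the error strings are illustrative rather than the loader's actual messages.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical views of the target metadata and the inferred draft dimensions.
struct TargetInfo {
    int64_t n_embd;
    int64_t n_capture_layers;
    int32_t mask_token_id;
};

struct DraftInfo {
    int64_t n_embd;
    int64_t n_capture;
    int32_t mask_token_id = -1;  // -1 means "not set" in this sketch
};

// Returns an empty string on success, otherwise a description of the mismatch.
std::string cross_check(DraftInfo& draft, const TargetInfo* target) {
    // No-target callers: nothing to compare against, keep the inferred values.
    if (target == nullptr) return "";

    if (draft.n_embd != target->n_embd) {
        return "draft n_embd=" + std::to_string(draft.n_embd) +
               " does not match target n_embd=" + std::to_string(target->n_embd);
    }
    if (draft.n_capture != target->n_capture_layers) {
        return "draft captured-layer count=" + std::to_string(draft.n_capture) +
               " does not match target n_capture_layers=" +
               std::to_string(target->n_capture_layers);
    }
    // The mask token is a property of the shared vocabulary, so it is copied.
    draft.mask_token_id = target->mask_token_id;
    return "";
}
```

Only quantities that the draft and target genuinely share are compared here; the attention dimensions are deliberately left out because they may differ between the two models.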

Why this replaces the earlier config.json approach

The previous version of this PR parsed the draft model's adjacent config.json and used it as the authoritative source for draft dimensions. After reviewing the failure more carefully, that approach turned out to be more complicated than this bug requires.

For the #195 failure, the safetensors header itself is the authoritative source. Parsing config.json may still be useful later for non-tensor architecture metadata such as SWA layer types or sliding-window policy, but it is not required to fix the tensor-shape mismatch and should be handled as a separate feature if needed.

Compatibility

  • Drafts without config.json are supported; the fix no longer depends on that file.
  • No-target callers still work because dimensions are inferred from the archive header itself.
  • Existing target/draft compatibility checks are now stricter wherever the draft must agree with the target, notably for hidden size and captured-layer count.
  • The patch does not rename the historical DFLASH27B_TARGET_* constants, even though some of them describe draft defaults. That rename would be mechanical but noisy and is intentionally out of scope.

Test plan

  • Verified with a direct local test that upstream safetensors preserves the relevant tensor shape: a BF16 tensor named like layers.0.self_attn.q_norm.weight with shape [128] deserializes as [128].
  • g++ -std=c++17 -fsyntax-only -I dflash/include -I dflash/src -I dflash/src/common -I dflash/src/draft -I dflash/deps/llama.cpp/ggml/include dflash/src/draft/draft_safetensors_loader.cpp
  • CUDACXX=/usr/local/cuda/bin/nvcc cmake -S dflash -B /tmp/lbh-pr195-shape/build -DDFLASH27B_GPU_BACKEND=cuda -DDFLASH27B_FA_ALL_QUANTS=OFF -DCMAKE_CUDA_ARCHITECTURES=86
  • cmake --build /tmp/lbh-pr195-shape/build --target smoke_load_draft -j 8
  • LD_LIBRARY_PATH=/tmp/lbh-pr195-shape/build/deps/llama.cpp/ggml/src:/tmp/lbh-pr195-shape/build/deps/llama.cpp/ggml/src/ggml-cuda /tmp/lbh-pr195-shape/build/smoke_load_draft /home/erik/Projects/lucebox-hub/dflash/models/draft/model.safetensors

The real local draft smoke test loaded 58 tensors / 3.22 GiB and completed with OK.
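For context on the first test-plan item, the safetensors container stores an 8-byte little-endian header length, then a JSON header with each tensor's dtype, shape, and data offsets, then the raw tensor bytes. The sketch below round-trips such a header in memory. It is not the test that was run for this PR, and `make_safetensors` and `read_header` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Builds a minimal in-memory safetensors image:
// u64 little-endian header length, JSON header, zero-filled tensor data.
std::vector<uint8_t> make_safetensors(const std::string& header_json,
                                      std::size_t data_bytes) {
    std::vector<uint8_t> out;
    const uint64_t n = header_json.size();
    for (int i = 0; i < 8; ++i) {
        out.push_back(static_cast<uint8_t>((n >> (8 * i)) & 0xff));
    }
    out.insert(out.end(), header_json.begin(), header_json.end());
    out.resize(out.size() + data_bytes, 0);
    return out;
}

// Extracts the JSON header text; returns an empty string if the image is malformed.
std::string read_header(const std::vector<uint8_t>& buf) {
    if (buf.size() < 8) return "";
    uint64_t n = 0;
    for (int i = 0; i < 8; ++i) {
        n |= static_cast<uint64_t>(buf[i]) << (8 * i);
    }
    if (n > buf.size() - 8) return "";  // declared length exceeds the buffer
    const auto first = buf.begin() + 8;
    return std::string(first, first + static_cast<std::ptrdiff_t>(n));
}
```

A header entry such as `{"layers.0.self_attn.q_norm.weight":{"dtype":"BF16","shape":[128],"data_offsets":[0,256]}}` describes 128 BF16 values (256 bytes), and the shape is available from the header before any target metadata is consulted.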

@easel easel marked this pull request as ready for review May 19, 2026 13:11

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

No issues found across 5 files


@easel easel force-pushed the fix/draft-loader-dims-from-config branch from 89ec481 to b95674d Compare May 19, 2026 14:27
@easel easel changed the title from "fix(dflash): read draft dims from config.json, don't inherit from target" to "fix(dflash): infer draft dimensions from safetensors" May 19, 2026
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 19, 2026
…ary update

Three new run directories plus a yesterday-vs-today comparison snapshot:

  2026-05-17T17-40-56_f031f08/  — preserved yesterday's reference matrix
    (n_sample=8, n_runs=8, bootstrap CI 95%). DFlash b22: HE 169.40,
    GSM 104.32, Math 119.36 tok/s. MTP d3: 65.62 / 61.00 / 61.89. AR
    baseline ~34 tok/s across suites. Was untracked in tree because the
    bench_matrix orchestrator landed on a stale branch.

  2026-05-19T11-43-13_83e19d9/  — first matrix re-run on HEAD (HE only,
    MTP_GGUF env unset → mtp_d3 cell empty). DFlash b22: 173.81 tok/s.
    Confirms no DFlash kernel regression vs f031f08.

  2026-05-19T11-54-32_83e19d9/  — full apples-to-apples re-run on HEAD
    with MTP_GGUF set. Result: all 9 cells (3 suites × {AR, DFlash b22,
    MTP d3}) within ±5% of f031f08 mean tok/s. DFlash HE +2.6%, MTP HE
    −2.0%, etc. No regression.

  2026-05-19_mtp-prefix-warm-ghost/summary.md  — updated with the
    apples-to-apples table above, an agent bucket-label-vs-actual-token
    audit (agent_24k prompts are actually ~2.6K), a known-gaps section
    documenting what is NOT yet tested (real CLI agentic loops, NIAH
    > 131K, concurrent sessions, sustained throughput, PR Luce-Org#195 merge).
@davide221 davide221 merged commit 7476720 into Luce-Org:main May 20, 2026
2 checks passed