Skip to content

MoE Prefill Streaming & DDTree Batched Verify#349

Open
howard0su wants to merge 6 commits into
Luce-Org:mainfrom
howard0su:doc
Open

MoE Prefill Streaming & DDTree Batched Verify#349
howard0su wants to merge 6 commits into
Luce-Org:mainfrom
howard0su:doc

Conversation

@howard0su

@howard0su howard0su commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Optimizes MoE (Qwen3.5-35B-A3B) prefill and speculative decode performance through streaming expert loading, GPU-side bitmask fast paths, and batched DDTree verify.

  1. MoE Prefill Streaming Engine (5ba80db)
  • MoeHybridStreamEngine with mmap + madvise for cold expert DMA
  • Batched token-per-expert GPU eval replaces per-token sequential dispatch
  • 3× prefill speedup (94.7 tok/s → from ~30 tok/s baseline)
  1. VRAM Bitmask Fast Path (3a697f5)
  • Per-layer expert_vram_mask[4] + all_routed_are_hot() check
  • Skips CPU entirely when all routed experts are GPU-resident (95% of layers)
  • eval_moe_hot_only_batched() — pure GPU, no cold partition, no merge
  1. Cached Hot-Only Batched Graph (fa7f545)
  • CachedHotBatchedGraph avoids per-call graph rebuild for repeated sub-batch sizes
  • Minimal benefit on sm_75 due to dispatch-count bottleneck (2560 dispatches)
  1. Skip MMQ Sub-Batch on sm_80+ (ea47297)
  • mmq_safe_full_batch flag in MoeHybridConfig, set via query_gpu_compute_sm()
  • On Ampere+ (sm_80+): eliminates ≤4-token sub-batch workaround entirely
  • Reduces dispatches from 2560 → 80 per prefill (32× fewer)
  1. Batched MoE Verify for DDTree (ccb1ba6)
  • hybrid_forward_batch() processes all DDTree tokens layer-by-layer (like prefill)
  • Replaces sequential 22-token × 40-layer loop (880 dispatches → 81 dispatches)
  • Both verify and replay paths are now batched

Performance (RTX 2080 Ti, sm_75)

Metric Before After
Prefill (168 tok) ~30 tok/s 94.7 tok/s
Hot-only layer hit rate N/A 95%
DDTree verify dispatches 2640 81

Architecture Notes

  • sm_75 still uses sub-batch workaround (MMQ bug); sm_80+ gets full benefit
  • HIP/AMD keeps workaround active (gfx1151 has same bug)
  • DDTree AL=2.16 on this draft model — AR decode is still faster on sm_75 with constrained VRAM. On sm_80+ with more VRAM headroom, DDTree should break even at AL≈3+.

howard0su added 5 commits June 7, 2026 13:07
Add a streaming prefill path for MoE hybrid mode that keeps the GGUF
file mmap'd for the server's lifetime and uses madvise(WILLNEED) to
prefetch cold expert data before batched FFN evaluation.

Key changes:
- MoeHybridStreamEngine: pinned host buffer + GPU scratch for DMA
  pipeline, with prefetch_cold_experts() and stream_expert_sync() APIs
- eval_moe_cold_experts_streaming: batches all tokens selecting the same
  cold expert into a single ggml graph (O(unique_experts) graphs instead
  of O(tokens × experts))
- MoeHybridStorage: stores persistent mmap pointer + per-layer expert
  file regions; destructor handles munmap cleanup
- Backend integration (qwen35moe + laguna): when stream engine is ready
  and chunk_len >= 16, prefetch cold experts via madvise before batched
  eval, avoiding the per-token fallback path

Performance (Qwen3.6-35B-A3B, RTX 2080 Ti, 813 cold experts):
- Before: ~10 tok/s prefill (per-token fallback, 14s for 137 tokens)
- After:  ~30 tok/s prefill (batched eval with prefetch, 4.7s for 137 tokens)
- Decode: unchanged at 2.0 tok/s
…ast path)

Add a per-layer VRAM bitmask (expert_vram_mask[4], 256-bit) to
MoeHybridLayerStorage that tracks which experts have GPU-resident
weight tensors. Before entering the hybrid hot+cold eval path, check
if all router-selected experts for the current batch are hot using
all_routed_are_hot() — a simple bitmask lookup with early exit.

When all selected experts are in VRAM:
- Call eval_moe_hot_only_batched() — pure GPU path
- No cold graph build, no CPU compute, no hot+cold merge loop
- Remaps global expert IDs to hot-local IDs and runs single GPU graph

Includes MMQ sub-batch workaround (<=4 tokens) for sm_75/gfx1151
compatibility, matching the existing hybrid batched path.

Adds hit-rate telemetry (hot_only_layers/total_ffn_layers) to
qwen35moe prefill analysis output.
Add CachedHotBatchedGraph struct that pre-builds and caches the ggml
graph + allocator for hot-only prefill sub-batches (n_tokens=4). On
first use per layer, the graph is built once; subsequent sub-batch
dispatches within the same layer skip ggml_init, graph construction,
and gallocr planning — just tensor_set + compute + tensor_get.

Benchmark (Qwen3.6-35B-A3B, 2080 Ti, 168-token prompt):
- Cold cache (first request): 75.7 tok/s prefill
- Warm cache (second request): 90.1 tok/s prefill
- Baseline without caching: 94.7 tok/s prefill

The cached graph saves ~50µs per sub-batch dispatch (ggml_init + graph
build), but on sm_75 the MMQ sub-batch workaround (n_tokens≤4) means
2560 kernel dispatches per 168-token prefill, making dispatch overhead
dominant. The real win requires eliminating the sub-batch constraint
(possible on sm_86+).

Also removes anonymous namespace in moe_hybrid_ffn_eval.cpp (functions
were already static/inline with internal linkage).
…fill)

On sm_80+ (Ampere/Ada/Hopper/Blackwell), the MMQ mul_mat_id kernel works
correctly with reduced hot expert stacks. Skip the <=4-token sub-batch
workaround that was needed only for sm_75 (Turing) and gfx1151 (AMD).

This eliminates 2560 kernel dispatches per prefill (32 sub-batches × 80
layers) down to 80 (1 dispatch per layer), giving ~30× less dispatch
overhead on modern GPUs.

Changes:
- Add mmq_safe_full_batch flag to MoeHybridConfig (default false = safe)
- Add query_gpu_compute_sm() utility (CUDA-only, returns 0 for HIP)
- Set flag to true when sm >= 80 in make_moe_hybrid_config()
- Gate both sub-batch loops (hot-only + hybrid) on the flag
- Allow cached graph build for full-batch sizes when flag is set
- Log at init when full-batch mode is active
Replace sequential hybrid_forward_one_token loop (22 tokens × 40 layers =
880 dispatches) with batched hybrid_forward_batch (40 layers × 2 dispatches
= 80 dispatches per verify pass).

The batched path reuses the same layer-by-layer approach as prefill:
  1. build_layer_prefn_step (DeltaNet + router for all tokens, 1 dispatch)
  2. eval_moe_hot_only_batched / eval_moe_hybrid_ffn_batched (1 dispatch)
  3. CPU combine (residual + FFN output)

Both verify and replay are now batched. Feature capture during replay
processes all tokens per layer instead of per-token dispatch.

Combined with mmq_safe_full_batch (sm_80+), this enables DDTree for MoE
with minimal dispatch overhead on modern GPUs.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 issues found across 18 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_stream.h">

<violation number="1" location="server/src/common/moe_hybrid_stream.h:56">
P1: Backend-context mismatch: `stream_expert_sync()` accepts `gpu_backend` but ignores it, while `init()` stores a backend handle in `backend_` that is never used for actual GPU operations. Raw CUDA APIs (`cudaMalloc`/`cudaMemcpy`) bypass the ggml backend, so `gpu_scratch_` is tied to the CUDA context at init time, but `eval_moe_cold_experts_streaming()` may compute on a different `gpu_backend`. This can cause illegal memory access or silent corruption if init and compute use different GPU devices.</violation>
</file>

<file name="server/docs/moe_hybrid.md">

<violation number="1" location="server/docs/moe_hybrid.md:302">
P2: PCIe is incorrectly described as half-duplex for bulk transfers; PCIe links are full-duplex (dual simplex) by design, so the rationale against streaming is technically inaccurate.</violation>
</file>

<file name="server/src/common/gguf_mmap.h">

<violation number="1" location="server/src/common/gguf_mmap.h:205">
P2: Unsigned integer overflow in range validation: `offset + length` on `size_t` can wrap around, causing the clamp to be skipped and an invalid oversized range to be passed to `madvise`/`PrefetchVirtualMemory`. Use `if (length > size_ - offset)` instead.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:1409">
P2: Missing non-empty batch guard in `hybrid_forward_batch`: `n_tokens == 0` causes `(size_t)(n_tokens - 1)` to underflow to `SIZE_MAX`, producing UB in the `act_cur.assign(...)` pointer arithmetic.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/CMakeLists.txt Outdated
Comment thread server/scripts/bench_moe_prefill_streaming.py Outdated
Comment thread server/src/common/gguf_mmap.h Outdated
// tensor view over the scratch memory for each weight matrix.
// This is a BLOCKING operation (synchronous DMA). For pipelined usage,
// use the async variants below.
bool stream_expert_sync(const void * mmap_data, size_t mmap_size,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Backend-context mismatch: stream_expert_sync() accepts gpu_backend but ignores it, while init() stores a backend handle in backend_ that is never used for actual GPU operations. Raw CUDA APIs (cudaMalloc/cudaMemcpy) bypass the ggml backend, so gpu_scratch_ is tied to the CUDA context at init time, but eval_moe_cold_experts_streaming() may compute on a different gpu_backend. This can cause illegal memory access or silent corruption if init and compute use different GPU devices.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_stream.h, line 56:

<comment>Backend-context mismatch: `stream_expert_sync()` accepts `gpu_backend` but ignores it, while `init()` stores a backend handle in `backend_` that is never used for actual GPU operations. Raw CUDA APIs (`cudaMalloc`/`cudaMemcpy`) bypass the ggml backend, so `gpu_scratch_` is tied to the CUDA context at init time, but `eval_moe_cold_experts_streaming()` may compute on a different `gpu_backend`. This can cause illegal memory access or silent corruption if init and compute use different GPU devices.</comment>

<file context>
@@ -0,0 +1,112 @@
+    // tensor view over the scratch memory for each weight matrix.
+    // This is a BLOCKING operation (synchronous DMA). For pipelined usage,
+    // use the async variants below.
+    bool stream_expert_sync(const void * mmap_data, size_t mmap_size,
+                            const LayerExpertRegions & regions,
+                            int expert_id,
</file context>

Comment thread server/src/common/moe_hybrid_stream.cpp
Comment thread server/src/common/moe_hybrid_ffn_eval.cpp
Comment thread server/docs/moe_hybrid.md
**The fundamental problem**: PCIe 4.0 x16 bandwidth (15.75 GB/s) is **4× lower** than DDR5 memory bandwidth (≈50 GB/s). Loading a single Q4_K_M Qwen expert (gate + up + down ≈ 6 MB) over PCIe takes ~380 µs just for the transfer. Meanwhile, the CPU can complete the entire matmul chain in ~100–150 µs using the same data that's already in L3/RAM.

Additionally:
- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: PCIe is incorrectly described as half-duplex for bulk transfers; PCIe links are full-duplex (dual simplex) by design, so the rationale against streaming is technically inaccurate.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/docs/moe_hybrid.md, line 302:

<comment>PCIe is incorrectly described as half-duplex for bulk transfers; PCIe links are full-duplex (dual simplex) by design, so the rationale against streaming is technically inaccurate.</comment>

<file context>
@@ -0,0 +1,396 @@
+**The fundamental problem**: PCIe 4.0 x16 bandwidth (15.75 GB/s) is **4× lower** than DDR5 memory bandwidth (≈50 GB/s). Loading a single Q4_K_M Qwen expert (gate + up + down ≈ 6 MB) over PCIe takes ~380 µs just for the transfer. Meanwhile, the CPU can complete the entire matmul chain in ~100–150 µs using the same data that's already in L3/RAM.
+
+Additionally:
+- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently.
+- **GPU kernel launch overhead**: Even after DMA completes, the GPU needs to dispatch a kernel for a single-token matmul on just 1–2 cold experts — an inefficient use of GPU SMs.
+- **Bubble injection**: Streaming cold experts to GPU introduces pipeline bubbles. The GPU must wait for DMA → compute → return result, during which the SMs sit idle or context-switch. With CPU-local compute, the GPU is 100% occupied on hot experts.
</file context>
Suggested change
- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently.
- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently.
+ **PCIe copy-engine limits overlap**: Most consumer GPUs have only 1–2 DMA copy engines, so upload of expert weights and download of results contend and cannot fully overlap in practice.

Comment thread .github/copilot-instructions.md Outdated

inline void GgufMmap::advise_willneed(size_t offset, size_t length) const {
if (!data_ || offset >= size_) return;
if (offset + length > size_) length = size_ - offset;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Unsigned integer overflow in range validation: offset + length on size_t can wrap around, causing the clamp to be skipped and an invalid oversized range to be passed to madvise/PrefetchVirtualMemory. Use if (length > size_ - offset) instead.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/gguf_mmap.h, line 205:

<comment>Unsigned integer overflow in range validation: `offset + length` on `size_t` can wrap around, causing the clamp to be skipped and an invalid oversized range to be passed to `madvise`/`PrefetchVirtualMemory`. Use `if (length > size_ - offset)` instead.</comment>

<file context>
@@ -196,6 +200,26 @@ inline const void * GgufMmap::data() const { return data_; }
 
+inline void GgufMmap::advise_willneed(size_t offset, size_t length) const {
+    if (!data_ || offset >= size_) return;
+    if (offset + length > size_) length = size_ - offset;
+    if (length == 0) return;
+#if defined(_WIN32)
</file context>
Suggested change
if (offset + length > size_) length = size_ - offset;
if (length > size_ - offset) length = size_ - offset;

if (ffn_hot_alloc) ggml_gallocr_free(ffn_hot_alloc);

// Store last token hidden state in act_cur
act_cur.assign(embed_all.data() + (size_t)(n_tokens - 1) * (size_t)hidden,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Missing non-empty batch guard in hybrid_forward_batch: n_tokens == 0 causes (size_t)(n_tokens - 1) to underflow to SIZE_MAX, producing UB in the act_cur.assign(...) pointer arithmetic.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_backend.cpp, line 1409:

<comment>Missing non-empty batch guard in `hybrid_forward_batch`: `n_tokens == 0` causes `(size_t)(n_tokens - 1)` to underflow to `SIZE_MAX`, producing UB in the `act_cur.assign(...)` pointer arithmetic.</comment>

<file context>
@@ -1153,6 +1222,247 @@ bool Qwen35MoeBackend::hybrid_forward_one_token(int32_t tok, int kv_pos,
+    if (ffn_hot_alloc) ggml_gallocr_free(ffn_hot_alloc);
+
+    // Store last token hidden state in act_cur
+    act_cur.assign(embed_all.data() + (size_t)(n_tokens - 1) * (size_t)hidden,
+                   embed_all.data() + (size_t)n_tokens * (size_t)hidden);
+
</file context>

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 7, 2026
feat(dflash): MoE prefill streaming engine with mmap + batched GPU eval
P1 fixes:
- Guard bench_moe_stream CUDA::cudart link behind GPU backend check
- Replace hardcoded page_size=4096 with sysconf(_SC_PAGESIZE) in madvise
- Add expert_id negativity check in stream_expert_sync
- Fix zip misalignment in bench script by matching on target_tokens key

P2 fix:
- Extract build_shared_expert_subgraph() helper to deduplicate shared expert
  graph construction (was inlined 8 times across eval functions)
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant