MoE Prefill Streaming & DDTree Batched Verify #349
Add a streaming prefill path for MoE hybrid mode that keeps the GGUF file mmap'd for the server's lifetime and uses madvise(WILLNEED) to prefetch cold expert data before batched FFN evaluation.

Key changes:
- MoeHybridStreamEngine: pinned host buffer + GPU scratch for DMA pipeline, with prefetch_cold_experts() and stream_expert_sync() APIs
- eval_moe_cold_experts_streaming: batches all tokens selecting the same cold expert into a single ggml graph (O(unique_experts) graphs instead of O(tokens × experts))
- MoeHybridStorage: stores persistent mmap pointer + per-layer expert file regions; destructor handles munmap cleanup
- Backend integration (qwen35moe + laguna): when stream engine is ready and chunk_len >= 16, prefetch cold experts via madvise before batched eval, avoiding the per-token fallback path

Performance (Qwen3.6-35B-A3B, RTX 2080 Ti, 813 cold experts):
- Before: ~10 tok/s prefill (per-token fallback, 14s for 137 tokens)
- After: ~30 tok/s prefill (batched eval with prefetch, 4.7s for 137 tokens)
- Decode: unchanged at 2.0 tok/s
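The O(unique_experts) batching described above can be pictured as a grouping step that runs before any graph is built. The following is a minimal sketch, not the PR's eval_moe_cold_experts_streaming: the helper name, the `routed` layout (one list of expert IDs per token), and the `is_hot` vector are all assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical helper (not from the PR): collect, for every cold expert,
// the indices of the tokens that routed to it.
//   routed[t]  - expert IDs the router selected for token t (assumed layout)
//   is_hot[e]  - true when expert e already has weights resident in VRAM
std::map<int, std::vector<int>> group_tokens_by_cold_expert(
        const std::vector<std::vector<int>> & routed,
        const std::vector<bool> & is_hot) {
    std::map<int, std::vector<int>> tokens_for_expert;
    for (int t = 0; t < (int) routed.size(); ++t) {
        for (int e : routed[t]) {
            assert(e >= 0 && e < (int) is_hot.size());  // router IDs must be valid
            if (is_hot[e]) continue;                     // hot experts take the GPU path
            tokens_for_expert[e].push_back(t);
        }
    }
    // One map entry per unique cold expert, so one graph per entry
    // rather than one graph per (token, expert) pair.
    return tokens_for_expert;
}
```

Each map entry would then correspond to one batched evaluation over that expert's token list, which is where the reduction from O(tokens × experts) graphs comes from.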
…ast path)

Add a per-layer VRAM bitmask (expert_vram_mask[4], 256-bit) to MoeHybridLayerStorage that tracks which experts have GPU-resident weight tensors. Before entering the hybrid hot+cold eval path, check if all router-selected experts for the current batch are hot using all_routed_are_hot(), a simple bitmask lookup with early exit.

When all selected experts are in VRAM:
- Call eval_moe_hot_only_batched() (pure GPU path)
- No cold graph build, no CPU compute, no hot+cold merge loop
- Remaps global expert IDs to hot-local IDs and runs single GPU graph

Includes MMQ sub-batch workaround (<=4 tokens) for sm_75/gfx1151 compatibility, matching the existing hybrid batched path. Adds hit-rate telemetry (hot_only_layers/total_ffn_layers) to qwen35moe prefill analysis output.
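A 256-bit mask stored as four 64-bit words makes the hot check a shift and an AND per routed expert. The sketch below assumes that layout; the struct name `ExpertVramMask` and its methods are illustrative stand-ins, and only the name `all_routed_are_hot` is taken from the commit message (its real signature is not shown in the source).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-in for the per-layer mask; the real
// MoeHybridLayerStorage is not shown in the source and has other fields.
struct ExpertVramMask {
    uint64_t words[4] = {0, 0, 0, 0};  // 4 x 64 bits = 256 expert slots

    void mark_hot(int expert_id) {
        assert(expert_id >= 0 && expert_id < 256);
        words[expert_id >> 6] |= uint64_t{1} << (expert_id & 63);
    }

    bool is_hot(int expert_id) const {
        if (expert_id < 0 || expert_id >= 256) return false;
        return ((words[expert_id >> 6] >> (expert_id & 63)) & 1u) != 0;
    }
};

// Assumed signature: returns true only if every routed expert is in VRAM.
bool all_routed_are_hot(const ExpertVramMask & mask,
                        const std::vector<int> & routed_ids) {
    for (int id : routed_ids) {
        if (!mask.is_hot(id)) return false;  // early exit on the first cold expert
    }
    return true;
}
```

An empty routing list returns true here; whether the real code treats that case as hot is not stated in the source.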
Add CachedHotBatchedGraph struct that pre-builds and caches the ggml graph + allocator for hot-only prefill sub-batches (n_tokens=4). On first use per layer, the graph is built once; subsequent sub-batch dispatches within the same layer skip ggml_init, graph construction, and gallocr planning, and run only tensor_set + compute + tensor_get.

Benchmark (Qwen3.6-35B-A3B, 2080 Ti, 168-token prompt):
- Cold cache (first request): 75.7 tok/s prefill
- Warm cache (second request): 90.1 tok/s prefill
- Baseline without caching: 94.7 tok/s prefill

The cached graph saves ~50µs per sub-batch dispatch (ggml_init + graph build), but on sm_75 the MMQ sub-batch workaround (n_tokens≤4) means 2560 kernel dispatches per 168-token prefill, making dispatch overhead dominant. The real win requires eliminating the sub-batch constraint (possible on sm_86+).

Also removes anonymous namespace in moe_hybrid_ffn_eval.cpp (functions were already static/inline with internal linkage).
…fill)

On sm_80+ (Ampere/Ada/Hopper/Blackwell), the MMQ mul_mat_id kernel works correctly with reduced hot expert stacks. Skip the <=4-token sub-batch workaround that was needed only for sm_75 (Turing) and gfx1151 (AMD).

This cuts kernel dispatches per prefill from 2560 (32 sub-batches × 80 layers) to 80 (1 dispatch per layer), giving ~30× less dispatch overhead on modern GPUs.

Changes:
- Add mmq_safe_full_batch flag to MoeHybridConfig (default false = safe)
- Add query_gpu_compute_sm() utility (CUDA-only, returns 0 for HIP)
- Set flag to true when sm >= 80 in make_moe_hybrid_config()
- Gate both sub-batch loops (hot-only + hybrid) on the flag
- Allow cached graph build for full-batch sizes when flag is set
- Log at init when full-batch mode is active
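The effect of the flag on dispatch count can be sketched as a small planner that either emits one full-batch range or splits the batch into ranges of at most 4 tokens. This is a hypothetical illustration: `plan_sub_batches` is not a function from the PR, and only the flag name mmq_safe_full_batch and the 4-token limit come from the text.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical planner (illustration only). Returns (offset, count) pairs
// that together cover n_tokens. With mmq_safe_full_batch set, the whole
// batch is a single range; otherwise ranges are capped at 4 tokens.
std::vector<std::pair<int, int>> plan_sub_batches(int n_tokens,
                                                  bool mmq_safe_full_batch) {
    assert(n_tokens >= 0);
    std::vector<std::pair<int, int>> plan;
    if (n_tokens == 0) return plan;
    const int step = mmq_safe_full_batch ? n_tokens : 4;
    for (int off = 0; off < n_tokens; off += step) {
        plan.emplace_back(off, std::min(step, n_tokens - off));
    }
    return plan;
}
```

Assuming a 128-token batch (a figure chosen here to match the 32 sub-batches quoted above), the planner yields 32 ranges without the flag and 1 with it; multiplied by 80 per-layer dispatches that is 2560 versus 80.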
Replace sequential hybrid_forward_one_token loop (22 tokens × 40 layers = 880 dispatches) with batched hybrid_forward_batch (40 layers × 2 dispatches = 80 dispatches per verify pass).

The batched path reuses the same layer-by-layer approach as prefill:
1. build_layer_prefn_step (DeltaNet + router for all tokens, 1 dispatch)
2. eval_moe_hot_only_batched / eval_moe_hybrid_ffn_batched (1 dispatch)
3. CPU combine (residual + FFN output)

Both verify and replay are now batched. Feature capture during replay processes all tokens per layer instead of per-token dispatch. Combined with mmq_safe_full_batch (sm_80+), this enables DDTree for MoE with minimal dispatch overhead on modern GPUs.
10 issues found across 18 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/moe_hybrid_stream.h">
<violation number="1" location="server/src/common/moe_hybrid_stream.h:56">
P1: Backend-context mismatch: `stream_expert_sync()` accepts `gpu_backend` but ignores it, while `init()` stores a backend handle in `backend_` that is never used for actual GPU operations. Raw CUDA APIs (`cudaMalloc`/`cudaMemcpy`) bypass the ggml backend, so `gpu_scratch_` is tied to the CUDA context at init time, but `eval_moe_cold_experts_streaming()` may compute on a different `gpu_backend`. This can cause illegal memory access or silent corruption if init and compute use different GPU devices.</violation>
</file>
<file name="server/docs/moe_hybrid.md">
<violation number="1" location="server/docs/moe_hybrid.md:302">
P2: PCIe is incorrectly described as half-duplex for bulk transfers; PCIe links are full-duplex (dual simplex) by design, so the rationale against streaming is technically inaccurate.</violation>
</file>
<file name="server/src/common/gguf_mmap.h">
<violation number="1" location="server/src/common/gguf_mmap.h:205">
P2: Unsigned integer overflow in range validation: `offset + length` on `size_t` can wrap around, causing the clamp to be skipped and an invalid oversized range to be passed to `madvise`/`PrefetchVirtualMemory`. Use `if (length > size_ - offset)` instead.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.cpp">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:1409">
P2: Missing non-empty batch guard in `hybrid_forward_batch`: `n_tokens == 0` causes `(size_t)(n_tokens - 1)` to underflow to `SIZE_MAX`, producing UB in the `act_cur.assign(...)` pointer arithmetic.</violation>
</file>
P1: Backend-context mismatch: stream_expert_sync() accepts gpu_backend but ignores it, while init() stores a backend handle in backend_ that is never used for actual GPU operations. Raw CUDA APIs (cudaMalloc/cudaMemcpy) bypass the ggml backend, so gpu_scratch_ is tied to the CUDA context at init time, but eval_moe_cold_experts_streaming() may compute on a different gpu_backend. This can cause illegal memory access or silent corruption if init and compute use different GPU devices.
<file context>
@@ -0,0 +1,112 @@
+ // tensor view over the scratch memory for each weight matrix.
+ // This is a BLOCKING operation (synchronous DMA). For pipelined usage,
+ // use the async variants below.
+ bool stream_expert_sync(const void * mmap_data, size_t mmap_size,
+ const LayerExpertRegions & regions,
+ int expert_id,
</file context>
P2: PCIe is incorrectly described as half-duplex for bulk transfers; PCIe links are full-duplex (dual simplex) by design, so the rationale against streaming is technically inaccurate.
<file context>
@@ -0,0 +1,396 @@
+**The fundamental problem**: PCIe 4.0 x16 bandwidth (15.75 GB/s) is **4× lower** than DDR5 memory bandwidth (≈50 GB/s). Loading a single Q4_K_M Qwen expert (gate + up + down ≈ 6 MB) over PCIe takes ~380 µs just for the transfer. Meanwhile, the CPU can complete the entire matmul chain in ~100–150 µs using the same data that's already in L3/RAM.
+
+Additionally:
+- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently.
+- **GPU kernel launch overhead**: Even after DMA completes, the GPU needs to dispatch a kernel for a single-token matmul on just 1–2 cold experts — an inefficient use of GPU SMs.
+- **Bubble injection**: Streaming cold experts to GPU introduces pipeline bubbles. The GPU must wait for DMA → compute → return result, during which the SMs sit idle or context-switch. With CPU-local compute, the GPU is 100% occupied on hot experts.
</file context>
- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently.
+ **PCIe copy-engine limits overlap**: Most consumer GPUs have only 1–2 DMA copy engines, so upload of expert weights and download of results contend and cannot fully overlap in practice.
P2: Unsigned integer overflow in range validation: offset + length on size_t can wrap around, causing the clamp to be skipped and an invalid oversized range to be passed to madvise/PrefetchVirtualMemory. Use if (length > size_ - offset) instead.
<file context>
@@ -196,6 +200,26 @@ inline const void * GgufMmap::data() const { return data_; }
+inline void GgufMmap::advise_willneed(size_t offset, size_t length) const {
+ if (!data_ || offset >= size_) return;
+ if (offset + length > size_) length = size_ - offset;
+ if (length == 0) return;
+#if defined(_WIN32)
</file context>
- if (offset + length > size_) length = size_ - offset;
+ if (length > size_ - offset) length = size_ - offset;
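To see why the subtraction form is safe, it helps to isolate the clamp as a pure function next to the original check. This is a standalone sketch: the function names are hypothetical, and the real GgufMmap::advise_willneed clamps `length` in place instead of returning a value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical standalone version of the clamp. Returns the usable length,
// or 0 when the range starts at or past the end of the mapping.
size_t clamp_range_length(size_t offset, size_t length, size_t size) {
    if (offset >= size) return 0;
    const size_t remaining = size - offset;  // cannot wrap because offset < size
    const size_t clamped = length > remaining ? remaining : length;
    assert(clamped <= remaining);
    return clamped;
}

// The original check, kept for contrast: offset + length can wrap around
// to a small value, so the overrun goes undetected and the clamp is skipped.
bool naive_check_detects_overrun(size_t offset, size_t length, size_t size) {
    return offset + length > size;
}
```

With a 4096-byte mapping, offset 16 and a length near SIZE_MAX, the naive check reports no overrun because the sum wraps, while the subtraction form clamps the length to the 4080 bytes that remain.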
P2: Missing non-empty batch guard in hybrid_forward_batch: n_tokens == 0 causes (size_t)(n_tokens - 1) to underflow to SIZE_MAX, producing UB in the act_cur.assign(...) pointer arithmetic.
<file context>
@@ -1153,6 +1222,247 @@ bool Qwen35MoeBackend::hybrid_forward_one_token(int32_t tok, int kv_pos,
+ if (ffn_hot_alloc) ggml_gallocr_free(ffn_hot_alloc);
+
+ // Store last token hidden state in act_cur
+ act_cur.assign(embed_all.data() + (size_t)(n_tokens - 1) * (size_t)hidden,
+ embed_all.data() + (size_t)n_tokens * (size_t)hidden);
+
</file context>
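One way to address the empty-batch case is to validate the batch shape before any index arithmetic. The helper below is a hypothetical sketch of that guard; its name and signature are not from the PR, and it assumes `embed_all` is a row-major [n_tokens × hidden] float buffer, as the quoted assign call suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): copy the last token's hidden state
// out of a row-major [n_tokens x hidden] buffer into act_cur.
// Returns false, leaving act_cur untouched, for an empty or undersized batch.
bool copy_last_token_state(const std::vector<float> & embed_all,
                           int n_tokens, int hidden,
                           std::vector<float> & act_cur) {
    if (n_tokens <= 0 || hidden <= 0) return false;  // guard before n_tokens - 1
    const size_t need = (size_t) n_tokens * (size_t) hidden;
    if (embed_all.size() < need) return false;       // buffer shorter than claimed
    const float * first = embed_all.data() + (need - (size_t) hidden);
    act_cur.assign(first, first + hidden);
    return true;
}
```

In hybrid_forward_batch itself an early return at the top of the function for n_tokens <= 0 would presumably serve the same purpose; which form fits better depends on code not shown here.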
feat(dflash): MoE prefill streaming engine with mmap + batched GPU eval
P1 fixes:
- Guard bench_moe_stream CUDA::cudart link behind GPU backend check
- Replace hardcoded page_size=4096 with sysconf(_SC_PAGESIZE) in madvise
- Add expert_id negativity check in stream_expert_sync
- Fix zip misalignment in bench script by matching on target_tokens key

P2 fix:
- Extract build_shared_expert_subgraph() helper to deduplicate shared expert graph construction (was inlined 8 times across eval functions)
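The page-size fix matters because madvise requires a page-aligned start address, and the page size is not 4096 on every system. A minimal POSIX-only sketch of the alignment arithmetic follows; `PageRange`, `align_to_page`, and `system_page_size` are hypothetical names, and the 4096 fallback is an assumption of this sketch, not something the PR states.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>  // sysconf, _SC_PAGESIZE (POSIX only)

struct PageRange {
    size_t start;   // page-aligned byte offset
    size_t length;  // bytes from start to the end of the requested range
};

// Round the start of [offset, offset + length) down to a page boundary and
// grow the length by the same amount so the end of the range is unchanged.
// page_size must be a non-zero power of two.
PageRange align_to_page(size_t offset, size_t length, size_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    const size_t start = offset & ~(page_size - 1);
    return { start, length + (offset - start) };
}

// Query the runtime page size instead of hardcoding it.
size_t system_page_size() {
    const long ps = sysconf(_SC_PAGESIZE);
    return ps > 0 ? (size_t) ps : 4096;  // fallback value is an assumption
}
```

A caller would pass `align_to_page(offset, length, system_page_size())` results to madvise relative to the mapping base; the length does not need rounding because the kernel rounds it up to a whole page.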
Optimizes MoE (Qwen3.5-35B-A3B) prefill and speculative decode performance through streaming expert loading, GPU-side bitmask fast paths, and batched DDTree verify.
Performance (RTX 2080 Ti, sm_75)
Architecture Notes