MoE Prefill Streaming & DDTree Batched Verify by howard0su · Pull Request #349 · Luce-Org/lucebox-hub

howard0su · 2026-06-07T07:25:02Z

Optimizes MoE (Qwen3.5-35B-A3B) prefill and speculative decode performance through streaming expert loading, GPU-side bitmask fast paths, and batched DDTree verify.

MoE Prefill Streaming Engine (5ba80db)

MoeHybridStreamEngine with mmap + madvise for cold expert DMA
Batched token-per-expert GPU eval replaces per-token sequential dispatch
3× prefill speedup (~30 tok/s baseline → 94.7 tok/s)

VRAM Bitmask Fast Path (3a697f5)

Per-layer expert_vram_mask[4] + all_routed_are_hot() check
Skips CPU entirely when all routed experts are GPU-resident (95% of layers)
eval_moe_hot_only_batched() — pure GPU, no cold partition, no merge

Cached Hot-Only Batched Graph (fa7f545)

CachedHotBatchedGraph avoids per-call graph rebuild for repeated sub-batch sizes
Minimal benefit on sm_75 due to dispatch-count bottleneck (2560 dispatches)

Skip MMQ Sub-Batch on sm_80+ (ea47297)

mmq_safe_full_batch flag in MoeHybridConfig, set via query_gpu_compute_sm()
On Ampere+ (sm_80+): eliminates ≤4-token sub-batch workaround entirely
Reduces dispatches from 2560 → 80 per prefill (32× fewer)

Batched MoE Verify for DDTree (ccb1ba6)

hybrid_forward_batch() processes all DDTree tokens layer-by-layer (like prefill)
Replaces sequential 22-token × 40-layer loop (880 dispatches → 81 dispatches)
Both verify and replay paths are now batched

Performance (RTX 2080 Ti, sm_75)

| Metric | Before | After |
| --- | --- | --- |
| Prefill (168 tok) | ~30 tok/s | 94.7 tok/s |
| Hot-only layer hit rate | N/A | 95% |
| DDTree verify dispatches | 2640 | 81 |

Architecture Notes

sm_75 still uses sub-batch workaround (MMQ bug); sm_80+ gets full benefit
HIP/AMD keeps workaround active (gfx1151 has same bug)
DDTree AL=2.16 on this draft model — AR decode is still faster on sm_75 with constrained VRAM. On sm_80+ with more VRAM headroom, DDTree should break even at AL≈3+.

Add a streaming prefill path for MoE hybrid mode that keeps the GGUF file mmap'd for the server's lifetime and uses madvise(WILLNEED) to prefetch cold expert data before batched FFN evaluation.

Key changes:
- MoeHybridStreamEngine: pinned host buffer + GPU scratch for DMA pipeline, with prefetch_cold_experts() and stream_expert_sync() APIs
- eval_moe_cold_experts_streaming: batches all tokens selecting the same cold expert into a single ggml graph (O(unique_experts) graphs instead of O(tokens × experts))
- MoeHybridStorage: stores persistent mmap pointer + per-layer expert file regions; destructor handles munmap cleanup
- Backend integration (qwen35moe + laguna): when stream engine is ready and chunk_len >= 16, prefetch cold experts via madvise before batched eval, avoiding the per-token fallback path

Performance (Qwen3.6-35B-A3B, RTX 2080 Ti, 813 cold experts):
- Before: ~10 tok/s prefill (per-token fallback, 14s for 137 tokens)
- After: ~30 tok/s prefill (batched eval with prefetch, 4.7s for 137 tokens)
- Decode: unchanged at 2.0 tok/s
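The O(unique_experts) batching described in this commit can be illustrated with a minimal sketch. It assumes the router output is available as a per-token list of expert IDs and that residency is a simple boolean per expert. The function name `group_tokens_by_cold_expert` and both parameter layouts are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper (not the PR's code): group token indices by the cold
// expert they route to, so each unique cold expert needs one graph only.
// routed[t] holds the expert IDs the router picked for token t.
// is_hot[e] is true when expert e is already GPU-resident.
inline std::map<int, std::vector<int>> group_tokens_by_cold_expert(
        const std::vector<std::vector<int>> & routed,
        const std::vector<bool> & is_hot) {
    std::map<int, std::vector<int>> by_expert;
    for (int t = 0; t < (int) routed.size(); ++t) {
        for (int e : routed[t]) {
            assert(e >= 0 && (size_t) e < is_hot.size());
            if (!is_hot[e]) by_expert[e].push_back(t);  // cold experts only
        }
    }
    return by_expert;  // one entry per unique cold expert
}
```

With three tokens routed to experts {0,2}, {2,3}, {1,2} and experts 0 and 1 hot, the result has two entries (experts 2 and 3), so two graphs would be built instead of one per token-expert pair.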

…ast path)

Add a per-layer VRAM bitmask (expert_vram_mask[4], 256-bit) to MoeHybridLayerStorage that tracks which experts have GPU-resident weight tensors. Before entering the hybrid hot+cold eval path, check if all router-selected experts for the current batch are hot using all_routed_are_hot(), a simple bitmask lookup with early exit.

When all selected experts are in VRAM:
- Call eval_moe_hot_only_batched(), the pure GPU path
- No cold graph build, no CPU compute, no hot+cold merge loop
- Remaps global expert IDs to hot-local IDs and runs a single GPU graph

Includes MMQ sub-batch workaround (<=4 tokens) for sm_75/gfx1151 compatibility, matching the existing hybrid batched path. Adds hit-rate telemetry (hot_only_layers/total_ffn_layers) to qwen35moe prefill analysis output.
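A minimal sketch of the bitmask check, assuming the mask is four 64-bit words and that routed expert IDs arrive as a flat array. The struct `ExpertVramMask` and the function `all_routed_are_hot_sketch` are illustrative names; the PR's actual `all_routed_are_hot()` signature is not shown in this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative 256-bit residency mask: bit e is set when expert e is in VRAM.
struct ExpertVramMask {
    uint64_t words[4] = {0, 0, 0, 0};  // 4 x 64 = 256 bits

    void set_hot(int expert_id) {
        assert(expert_id >= 0 && expert_id < 256);
        words[expert_id >> 6] |= (uint64_t{1} << (expert_id & 63));
    }
    bool is_hot(int expert_id) const {
        return ((words[expert_id >> 6] >> (expert_id & 63)) & 1u) != 0;
    }
};

// Returns true only if every routed expert ID is GPU-resident.
// Exits on the first cold (or out-of-range) ID.
inline bool all_routed_are_hot_sketch(const ExpertVramMask & mask,
                                      const int32_t * ids, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const int32_t e = ids[i];
        if (e < 0 || e >= 256 || !mask.is_hot(e)) return false;
    }
    return true;
}
```

Each lookup is one shift and one AND, so the check stays cheap even when it runs once per layer per batch.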

Add CachedHotBatchedGraph struct that pre-builds and caches the ggml graph + allocator for hot-only prefill sub-batches (n_tokens=4). On first use per layer, the graph is built once; subsequent sub-batch dispatches within the same layer skip ggml_init, graph construction, and gallocr planning and do only tensor_set + compute + tensor_get.

Benchmark (Qwen3.6-35B-A3B, 2080 Ti, 168-token prompt):
- Cold cache (first request): 75.7 tok/s prefill
- Warm cache (second request): 90.1 tok/s prefill
- Baseline without caching: 94.7 tok/s prefill

The cached graph saves ~50µs per sub-batch dispatch (ggml_init + graph build), but on sm_75 the MMQ sub-batch workaround (n_tokens≤4) means 2560 kernel dispatches per 168-token prefill, making dispatch overhead dominant. The real win requires eliminating the sub-batch constraint (possible on sm_86+).

Also removes anonymous namespace in moe_hybrid_ffn_eval.cpp (functions were already static/inline with internal linkage).

…fill)

On sm_80+ (Ampere/Ada/Hopper/Blackwell), the MMQ mul_mat_id kernel works correctly with reduced hot expert stacks. Skip the <=4-token sub-batch workaround that was needed only for sm_75 (Turing) and gfx1151 (AMD). This reduces kernel dispatches per prefill from 2560 (32 sub-batches × 80 layers) to 80 (1 dispatch per layer), giving 32× less dispatch overhead on modern GPUs.

Changes:
- Add mmq_safe_full_batch flag to MoeHybridConfig (default false = safe)
- Add query_gpu_compute_sm() utility (CUDA-only, returns 0 for HIP)
- Set flag to true when sm >= 80 in make_moe_hybrid_config()
- Gate both sub-batch loops (hot-only + hybrid) on the flag
- Allow cached graph build for full-batch sizes when flag is set
- Log at init when full-batch mode is active
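The dispatch arithmetic above can be reproduced with a small sketch. It assumes one dispatch per sub-batch per layer and treats the compute capability as a plain integer, with 0 standing for HIP as the commit describes. The function `prefill_dispatches_sketch` is hypothetical, and the 128-token chunk in the usage note is an assumption chosen so that 4-token sub-batches give the quoted 32 sub-batches.

```cpp
#include <cassert>

// Hypothetical model of the gating described above (not the PR's code).
// sm is the compute capability as an integer (75, 80, 86, ...); 0 means
// "unknown / HIP", which keeps the conservative 4-token sub-batches.
inline int prefill_dispatches_sketch(int n_tokens, int n_layers, int sm) {
    assert(n_tokens > 0 && n_layers > 0);
    const bool mmq_safe_full_batch = sm >= 80;
    const int sub_batch = mmq_safe_full_batch ? n_tokens : 4;
    const int per_layer = (n_tokens + sub_batch - 1) / sub_batch;  // ceil
    return per_layer * n_layers;
}
```

Under these assumptions, `prefill_dispatches_sketch(128, 80, 75)` gives 2560 and `prefill_dispatches_sketch(128, 80, 86)` gives 80, matching the figures in the commit message.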

Replace sequential hybrid_forward_one_token loop (22 tokens × 40 layers = 880 dispatches) with batched hybrid_forward_batch (40 layers × 2 dispatches = 80 dispatches per verify pass).

The batched path reuses the same layer-by-layer approach as prefill:
1. build_layer_prefn_step (DeltaNet + router for all tokens, 1 dispatch)
2. eval_moe_hot_only_batched / eval_moe_hybrid_ffn_batched (1 dispatch)
3. CPU combine (residual + FFN output)

Both verify and replay are now batched. Feature capture during replay processes all tokens per layer instead of per-token dispatch.

Combined with mmq_safe_full_batch (sm_80+), this enables DDTree for MoE with minimal dispatch overhead on modern GPUs.

cubic-dev-ai

10 issues found across 18 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_stream.h">

<violation number="1" location="server/src/common/moe_hybrid_stream.h:56">
P1: Backend-context mismatch: `stream_expert_sync()` accepts `gpu_backend` but ignores it, while `init()` stores a backend handle in `backend_` that is never used for actual GPU operations. Raw CUDA APIs (`cudaMalloc`/`cudaMemcpy`) bypass the ggml backend, so `gpu_scratch_` is tied to the CUDA context at init time, but `eval_moe_cold_experts_streaming()` may compute on a different `gpu_backend`. This can cause illegal memory access or silent corruption if init and compute use different GPU devices.</violation>
</file>

<file name="server/docs/moe_hybrid.md">

<violation number="1" location="server/docs/moe_hybrid.md:302">
P2: PCIe is incorrectly described as half-duplex for bulk transfers; PCIe links are full-duplex (dual simplex) by design, so the rationale against streaming is technically inaccurate.</violation>
</file>

<file name="server/src/common/gguf_mmap.h">

<violation number="1" location="server/src/common/gguf_mmap.h:205">
P2: Unsigned integer overflow in range validation: `offset + length` on `size_t` can wrap around, causing the clamp to be skipped and an invalid oversized range to be passed to `madvise`/`PrefetchVirtualMemory`. Use `if (length > size_ - offset)` instead.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.cpp:1409">
P2: Missing non-empty batch guard in `hybrid_forward_batch`: `n_tokens == 0` causes `(size_t)(n_tokens - 1)` to underflow to `SIZE_MAX`, producing UB in the `act_cur.assign(...)` pointer arithmetic.</violation>
</file>


cubic-dev-ai · 2026-06-07T07:29:01Z

+    // tensor view over the scratch memory for each weight matrix.
+    // This is a BLOCKING operation (synchronous DMA). For pipelined usage,
+    // use the async variants below.
+    bool stream_expert_sync(const void * mmap_data, size_t mmap_size,


P1: Backend-context mismatch: stream_expert_sync() accepts gpu_backend but ignores it, while init() stores a backend handle in backend_ that is never used for actual GPU operations. Raw CUDA APIs (cudaMalloc/cudaMemcpy) bypass the ggml backend, so gpu_scratch_ is tied to the CUDA context at init time, but eval_moe_cold_experts_streaming() may compute on a different gpu_backend. This can cause illegal memory access or silent corruption if init and compute use different GPU devices.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_stream.h, line 56: <comment>Backend-context mismatch: `stream_expert_sync()` accepts `gpu_backend` but ignores it, while `init()` stores a backend handle in `backend_` that is never used for actual GPU operations. Raw CUDA APIs (`cudaMalloc`/`cudaMemcpy`) bypass the ggml backend, so `gpu_scratch_` is tied to the CUDA context at init time, but `eval_moe_cold_experts_streaming()` may compute on a different `gpu_backend`. This can cause illegal memory access or silent corruption if init and compute use different GPU devices.</comment> <file context> @@ -0,0 +1,112 @@ + // tensor view over the scratch memory for each weight matrix. + // This is a BLOCKING operation (synchronous DMA). For pipelined usage, + // use the async variants below. + bool stream_expert_sync(const void * mmap_data, size_t mmap_size, + const LayerExpertRegions & regions, + int expert_id, </file context>

cubic-dev-ai · 2026-06-07T07:29:01Z

+**The fundamental problem**: PCIe 4.0 x16 bandwidth (15.75 GB/s) is **4× lower** than DDR5 memory bandwidth (≈50 GB/s). Loading a single Q4_K_M Qwen expert (gate + up + down ≈ 6 MB) over PCIe takes ~380 µs just for the transfer. Meanwhile, the CPU can complete the entire matmul chain in ~100–150 µs using the same data that's already in L3/RAM.
+
+Additionally:
+- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently.


P2: PCIe is incorrectly described as half-duplex for bulk transfers; PCIe links are full-duplex (dual simplex) by design, so the rationale against streaming is technically inaccurate.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/docs/moe_hybrid.md, line 302: <comment>PCIe is incorrectly described as half-duplex for bulk transfers; PCIe links are full-duplex (dual simplex) by design, so the rationale against streaming is technically inaccurate.</comment> <file context> @@ -0,0 +1,396 @@ +**The fundamental problem**: PCIe 4.0 x16 bandwidth (15.75 GB/s) is **4× lower** than DDR5 memory bandwidth (≈50 GB/s). Loading a single Q4_K_M Qwen expert (gate + up + down ≈ 6 MB) over PCIe takes ~380 µs just for the transfer. Meanwhile, the CPU can complete the entire matmul chain in ~100–150 µs using the same data that's already in L3/RAM. + +Additionally: +- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently. +- **GPU kernel launch overhead**: Even after DMA completes, the GPU needs to dispatch a kernel for a single-token matmul on just 1–2 cold experts — an inefficient use of GPU SMs. +- **Bubble injection**: Streaming cold experts to GPU introduces pipeline bubbles. The GPU must wait for DMA → compute → return result, during which the SMs sit idle or context-switch. With CPU-local compute, the GPU is 100% occupied on hot experts. </file context>

Suggested change

- **PCIe is half-duplex for bulk transfers**: You cannot overlap upload of expert weights with download of results efficiently.

+ **PCIe copy-engine limits overlap**: Most consumer GPUs have only 1–2 DMA copy engines, so upload of expert weights and download of results contend and cannot fully overlap in practice.

cubic-dev-ai · 2026-06-07T07:29:02Z


+inline void GgufMmap::advise_willneed(size_t offset, size_t length) const {
+    if (!data_ || offset >= size_) return;
+    if (offset + length > size_) length = size_ - offset;


P2: Unsigned integer overflow in range validation: offset + length on size_t can wrap around, causing the clamp to be skipped and an invalid oversized range to be passed to madvise/PrefetchVirtualMemory. Use if (length > size_ - offset) instead.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/gguf_mmap.h, line 205: <comment>Unsigned integer overflow in range validation: `offset + length` on `size_t` can wrap around, causing the clamp to be skipped and an invalid oversized range to be passed to `madvise`/`PrefetchVirtualMemory`. Use `if (length > size_ - offset)` instead.</comment> <file context> @@ -196,6 +200,26 @@ inline const void * GgufMmap::data() const { return data_; } +inline void GgufMmap::advise_willneed(size_t offset, size_t length) const { + if (!data_ || offset >= size_) return; + if (offset + length > size_) length = size_ - offset; + if (length == 0) return; +#if defined(_WIN32) </file context>

Suggested change

- if (offset + length > size_) length = size_ - offset;

+ if (length > size_ - offset) length = size_ - offset;
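To make the wraparound concrete, here is a minimal standalone sketch of the overflow-safe clamp. The free function `clamp_range_length` is hypothetical; the PR's version lives inside `GgufMmap::advise_willneed` and operates on member fields.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: clamp [offset, offset + length) to a mapping of
// `size` bytes without ever computing offset + length, which can wrap.
// Returns the usable length, or 0 when offset is out of range.
inline size_t clamp_range_length(size_t offset, size_t length, size_t size) {
    if (offset >= size) return 0;
    const size_t remaining = size - offset;  // cannot wrap: offset < size
    return length > remaining ? remaining : length;
}
```

For example, with offset 8, length SIZE_MAX, and size 100, the naive sum `offset + length` wraps to 7, so the original `> size_` test would not fire and the oversized length would pass through. The subtraction form returns 92.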

cubic-dev-ai · 2026-06-07T07:29:02Z

+    if (ffn_hot_alloc) ggml_gallocr_free(ffn_hot_alloc);
+
+    // Store last token hidden state in act_cur
+    act_cur.assign(embed_all.data() + (size_t)(n_tokens - 1) * (size_t)hidden,


P2: Missing non-empty batch guard in hybrid_forward_batch: n_tokens == 0 causes (size_t)(n_tokens - 1) to underflow to SIZE_MAX, producing UB in the act_cur.assign(...) pointer arithmetic.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_backend.cpp, line 1409: <comment>Missing non-empty batch guard in `hybrid_forward_batch`: `n_tokens == 0` causes `(size_t)(n_tokens - 1)` to underflow to `SIZE_MAX`, producing UB in the `act_cur.assign(...)` pointer arithmetic.</comment> <file context> @@ -1153,6 +1222,247 @@ bool Qwen35MoeBackend::hybrid_forward_one_token(int32_t tok, int kv_pos, + if (ffn_hot_alloc) ggml_gallocr_free(ffn_hot_alloc); + + // Store last token hidden state in act_cur + act_cur.assign(embed_all.data() + (size_t)(n_tokens - 1) * (size_t)hidden, + embed_all.data() + (size_t)n_tokens * (size_t)hidden); + </file context>
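One possible shape for the guard this comment asks for is sketched below. It assumes `embed_all` holds `n_tokens * hidden` floats laid out token-major, as the quoted context suggests. The helper name `copy_last_token_hidden` and its bool return are hypothetical and are not the PR's fix.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical guard: copy the last token's hidden state into act_cur.
// Returns false (leaving act_cur untouched) for an empty or undersized
// batch instead of forming an out-of-range pointer.
inline bool copy_last_token_hidden(const std::vector<float> & embed_all,
                                   int n_tokens, int hidden,
                                   std::vector<float> & act_cur) {
    if (n_tokens <= 0 || hidden <= 0) return false;
    const size_t need = (size_t) n_tokens * (size_t) hidden;
    if (embed_all.size() < need) return false;
    const float * first =
        embed_all.data() + (size_t) (n_tokens - 1) * (size_t) hidden;
    act_cur.assign(first, first + hidden);
    return true;
}
```

Rejecting `n_tokens <= 0` before any cast means `(size_t)(n_tokens - 1)` is evaluated only for non-negative values.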

feat(dflash): MoE prefill streaming engine with mmap + batched GPU eval

P1 fixes:
- Guard bench_moe_stream CUDA::cudart link behind GPU backend check
- Replace hardcoded page_size=4096 with sysconf(_SC_PAGESIZE) in madvise
- Add expert_id negativity check in stream_expert_sync
- Fix zip misalignment in bench script by matching on target_tokens key

P2 fix:
- Extract build_shared_expert_subgraph() helper to deduplicate shared expert graph construction (was inlined 8 times across eval functions)
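The page-size fix matters because madvise requires a page-aligned start address, and page size is not 4096 on every system. A minimal sketch of the alignment step follows. The page size is passed in as a parameter so the logic can be exercised without a system call; in real code it would come from `sysconf(_SC_PAGESIZE)`. The names `PageRange` and `align_to_pages` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: widen a byte range so it starts on a page boundary.
struct PageRange {
    size_t offset;  // page-aligned start
    size_t length;  // original length plus the bytes added in front
};

inline PageRange align_to_pages(size_t offset, size_t length, size_t page_size) {
    // Page sizes are powers of two, so masking rounds down to a boundary.
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    const size_t start = offset & ~(page_size - 1);
    return PageRange{start, length + (offset - start)};
}
```

With a 4096-byte page, a request at offset 5000 for 100 bytes becomes offset 4096 for 1004 bytes. With a 16384-byte page, the same request becomes offset 0 for 5100 bytes, which a hardcoded 4096 would get wrong.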

howard0su added 5 commits June 7, 2026 13:07

cubic-dev-ai Bot reviewed Jun 7, 2026

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 7, 2026

Merge pull request Luce-Org#349 from Luce-Org/doc

74aeb58

feat(dflash): MoE prefill streaming engine with mmap + batched GPU eval

howard0su force-pushed the doc branch from aad4066 to 246c26c Compare June 7, 2026 08:27

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 7, 2026

Merge pull request Luce-Org#349 from Luce-Org/doc

ae36983

howard0su wants to merge 6 commits into Luce-Org:main from howard0su:doc
