[V100/SM70] Rollup: Volta serving stack — W4A16 + FP8 (e5m2/MTP) + reasoning fixes#55
Open
rivetphilbot wants to merge 54 commits into
Open
[V100/SM70] Rollup: Volta serving stack — W4A16 + FP8 (e5m2/MTP) + reasoning fixes#55rivetphilbot wants to merge 54 commits into
rivetphilbot wants to merge 54 commits into
Conversation
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Updated the WeChat group QR code image in the README.
修复了错误的名字
Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
… loading Squashed PR 1CatAI#45: SM70 dense WNA16 TurboMind linear kernel + selector, admit V100 (CC 7.0) in CompressedTensorsWNA16, compressed-tensors MoE decode buffers for V100, keep router gate / split GDN projections unquantized under CT, skip tuple-shard split for non-output-dim CT params, sm70_884_4.cu kernel registry sync. Enables W4A16 compressed-tensors models on Volta.
…2 KV on W4A16 compressed-tensors, V100/SM70) Squashed PR 1CatAI#49.
vLLM eagerly raises "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for any FP8-weight checkpoint, blocking the only FP8 KV format that compiles on V100/SM70 (Triton on sm_70 supports fp8e5, not fp8e4nv/e4m3). fp8_e5m2 is scaleless: its 5 exponent bits give enough range that it does not need the k_scale/v_scale machinery the e4m3 path loads from the checkpoint. So for e5m2 we skip the scale-quant setup and use plain e5m2 KV with default 1.0 scales; the e4m3/scaled path is unchanged. Sibling of 1CatAI#49 (same fix for W4A16 compressed-tensors checkpoints). Validated live on 2x V100-PCIE-32GB serving dense FP8 + MTP Qwen3.5-40B at 177K context.
…kens extract_content_ids() returns [] when the <think> block was never closed (generation truncated by max_tokens mid-reasoning). Decoding an empty id list blanked out the content the reasoning parser already produced, so the client got an empty/cut-off response on truncation. Only override content when there are content ids. Mirrors 1CatAI#47.
|
Up vote |
…ed backend)
FlashAttnV100Backend defines get_supported_head_sizes() -> [64,128,256], but
validate_configuration() checks supports_head_size(), not that method. Since
FlashAttnV100Backend does not override supports_head_size, it inherited
TritonAttentionBackend.supports_head_size (head_size >= 32) and wrongly
validated head sizes the Volta kernel cannot run (e.g. 512), then hard-crashed
in the CUDA dispatch (TORCH_CHECK D <= 256, "D must be even, <=256, multiple
of 8").
Add a supports_head_size override returning {64,128,256} (the sizes the SM70
dense/paged kernels actually dispatch). This:
- fixes the crash for any model with head_dim > 256 on auto-select / FA-V100,
- enables vLLM v1's per-layer backend selection to keep the supported heads
on FLASH_ATTN_V100 and fall through to TRITON_ATTN for the rest.
Concretely this unlocks the fast mixed backend for models with dual head dims
(e.g. Gemma-4: 256-dim sliding layers run on FA-V100, 512-dim global layers on
Triton) instead of being forced onto TRITON_ATTN for every layer — ~2.5x decode
on V100 for such models. No effect on standard head_dim<=256 models (already
all FA-V100).
Co-Authored-By: RivetOS Claude <noreply@anthropic.com>
|
@yangzhuxinyzx is this PR being look into? It would be helpful to see a community develop around your wonderful V100 initiative |
|
I am also interested, seems like some positive changes here |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Integration branch bundling all our V100/SM70 serving patches so one branch builds a
vLLM that runs both compressed-tensors W4A16 and dense FP8 (+MTP) Qwen3.5/Deckard
checkpoints on Volta (2× V100-32GB). Each commit is the clean net diff of an existing focused
PR; this is the all-in-one alternative to merging them piecemeal.
Bundled (one squashed commit each)
KVCacheMethodwhen the checkpoint has no KV scales → fp8_e5m2 KV on W4A16 CTfp8_e5m2KV with FP8 checkpoints → fp8_e5m2 KV on dense FP8max_tokens(else empty/cut-off responses).gitattributesValidation
at 177K context.
CompressedTensorsWNA16scheme admits CC 7.0 andW4A16 (asym and sym) binds
SM70TurboMindLinearKernel— stock 1.1.0 admits neither.qwen3_next.py, resolved.If you'd rather review/merge atomically, the focused PRs #45 / #47 / #49 / #54 are open;
this rollup is the convenience superset.
✅ Verified working configuration (FP8 + MTP, 2× V100)
Validated from a clean install (2026-06-01) on 2× V100-PCIE-32GB: 1Cat's prebuilt 1.1.0
wheels + the three pure-Python FP8 files from this branch overlaid onto the installed
vllm/→ the dense 40B FP8 + MTP checkpoint serves at 177K withfp8_e5m2KV and cleanreasoning separation. No source build for the FP8 path.
Environment
vllm1.1.0,flash_attn_v1001.1.0 (fromreleases/latest)qwen3_5FP8 (+MTP) checkpoint (~21.8 GiB/GPU)Install (FP8 — no build)
The FP8 commits — #54 (
attention.py, required), #47 (serving.py), #52(parser) — are pure-Python, hence the overlay. Only #45 (W4A16) needs a source build
(it modifies
lmdeploy/src/turbomind/kernels/gemm/kernel/sm70_884_4.cu).Serve (run from OUTSIDE the source checkout)
Observed:
Application startup complete; GPU KV cache 63,360 tokens (1.35× concurrency at177K); cold-start init ≈ 270 s (torch.compile + cudagraph), warm ≈ 75–135 s; chat smoke test
finish_reason=stopwith reasoning routed toreasoning_content,contentclean; ~30–32 tok/s single-stream (MTP ~0.8–0.9 draft acceptance ≈ 1.9 tokens/step — without MTP it falls to the base rate ~16–17 tok/s).Common trip-ups
--index-strategy unsafe-best-match(flashinfer is PyPI-only); plainpipis fine.1Cat-vLLMcheckout, or Python imports the source tree instead of the wheel's CUDA extensions.attention.pyoverlay →ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.at load.-MTPcheckpoint carries the draft head but it does nothing without--speculative-config(use"method": "mtp";qwen3_5_mtpis a deprecated alias). Verify it's accepting via/metrics(spec_decode_num_accepted_tokens_totalvs…_num_draft_tokens_total). Note: low tok/s with MTP accepting is not MTP-off — on a no-NVLink V100 pair MTP only pays off when the verify+draft step is captured in one FULL CUDA graph (cudagraph_mode=FULL_AND_PIECEWISE). Piecewise-only graphs (or--enforce-eager) leave the draft's all-reduce un-amortized → MTP nets ~0% and you sit at base rate. FULL capture can silently fall back to piecewise when the DeltaNetcausal_conv1d_updateautotune runs during capture — setcudagraph_num_of_warmups=3to push it into warmup. Field-proven: 17→30 tok/s (1.76×) on a box whose interconnect was better than ours, purely by capturing FULL graphs. So the interconnect is not the ceiling — graph capture is.--served-model-name deckard-fp8mtp <actual-model-name>or existing clients get 404s.Field validation (2026-06-06, pve3, 2× V100-PCIE-32GB)
Follow-up bench session on the same hardware, comparing stock 1.1.0 wheels vs rollup overlay
(
/opt/1cat-vllm/.venv-v110) across Qwen3.6-27B-FP8 (community recipe) and Deckard-40B-FP8-MTP.All throughput numbers are c1 single-stream from
bench_deckard_toks.py(256 tokens,ignore_eos).Do not use
vllm bench servefor community comparisons — it understated Qwen27B at ~27 tok/svs the real ~51 tok/s.
Full write-up also committed on this branch:
V100_FIELD_VALIDATION_20260606.md.Qwen27B-FP8 — rollup is not a throughput tax
attention.py+ serving + parserfp8_e5m2KV @ 177kfp8_e5m2on FP8 weightsattention.py)Verdict: The rollup patches are ~0 tok/s regression on the Qwen27B-FP8 community recipe.
The earlier concern that rollup costs ~⅓ throughput was a bench-tool artifact (
vllm bench serve),not a real serving penalty. For max Qwen27B speed on 2×V100, stock wheels + community flags suffice;
rollup is still required for Deckard prod features (e5m2 @ 177k, reasoning split) but not for Qwen perf.
KV cache modes on SM70 — do not conflate
--kv-cache-dtypefp8(e4m3)fp8e4nv not supportedon SM70 — stock, v110, and rollup all fail identicallyfp8_e5m2fp8_e5m2on dense FP8 checkpoints — without theattention.pyoverlay,stock 1.1.0 hard-fails at load.
--kv-cache-dtype fp8(e4m3) is not a V100 knob today. Community "+20 tok/s with fp8 KV" claimslikely mean FP8 weights + default fp16 KV, not e4m3 KV.
fp8_e5m2on Qwen27B is an optional memory saver with no measured speed penalty; on Deckard 40Bit is required to fit 177k context.
Deckard-40B-FP8-MTP reference numbers (same bench script)
n=1n=2When to use rollup vs stock (practical guide)
fp8_e5m2KV (~32.5 c1) — not a rollup penalty on 27B--kv-cache-dtype fp8(e4m3)Deckard-40B-FP8-MTP @ 177k — MTP
n=2unlocks ~40 tok/s (2026-06-07)Follow-up on the prod Deckard recipe (rollup v110 +
fp8_e5m2@ 177k). The old PRserve example used
num_speculative_tokens: 1; field testing on pve3 showsn=2isthe throughput win, not CUDA fullgraph or batched-token tuning.
n=2, batched 8192 (current prod)n=2, batched 8192, fullgraphn=2, batched 4096n=1Why:
n=1has a higher acceptance rate, butn=2proposes more draft tokens perstep → more accepted tokens/step → higher net throughput. Do not pick
n=1based on accept % alone.Prod flags that matter (beyond the FP8 overlay):
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'← the headline knob--max-num-batched-tokens 8192(4096 also fine; not the differentiator)--kv-cache-dtype fp8_e5m2(required for 177k KV on 40B)NCCL_P2P_DISABLE=1(cross-socket SYS topology;NCCL_P2P_DISABLE=0→ ~4× slowdown)Canonical write-up:
BestDeckardConfig.mdon the homelab share; serve script:/opt/1cat-vllm/serve_fp8mtp_prod.sh→1cat-qwen.serviceon pve3.Open follow-ups (not blocking merge)
n=2concurrency sweep (c4/c8/c16) — only c1 A/B run so far