[V100/SM70] Rollup: Volta serving stack — W4A16 + FP8 (e5m2/MTP) + reasoning fixes#55

Open
rivetphilbot wants to merge 54 commits into 1CatAI:main from rivetphilbot:rivetphilbot-rollup

@rivetphilbot rivetphilbot commented Jun 1, 2026

Integration branch bundling all our V100/SM70 serving patches so one branch builds a
vLLM that runs both compressed-tensors W4A16 and dense FP8 (+MTP) Qwen3.5/Deckard
checkpoints on Volta (2× V100-32GB). Each commit is the clean net diff of an existing focused
PR; this is the all-in-one alternative to merging them piecemeal.

Bundled (one squashed commit each)

Validation

If you'd rather review/merge atomically, the focused PRs #45 / #47 / #49 / #54 are open;
this rollup is the convenience superset.


✅ Verified working configuration (FP8 + MTP, 2× V100)

Validated from a clean install (2026-06-01) on 2× V100-PCIE-32GB: 1Cat's prebuilt 1.1.0
wheels + the three pure-Python FP8 files from this branch overlaid onto the installed
vllm/ → the dense 40B FP8 + MTP checkpoint serves at 177K with fp8_e5m2 KV and clean
reasoning separation. No source build for the FP8 path.

Environment

  • 2× NVIDIA V100-PCIE-32GB (SM70), CUDA 12.8 driver
  • Python 3.12, torch 2.9.1+cu128, vllm 1.1.0, flash_attn_v100 1.1.0 (from releases/latest)
  • Model: a dense qwen3_5 FP8 (+MTP) checkpoint (~21.8 GiB/GPU)

Install (FP8 — no build)

# 1) Python 3.12 env (wheels are cp312); 2) install BOTH wheels together:
base=https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.1.0
curl -fLO $base/flash_attn_v100-1.1.0-cp312-cp312-linux_x86_64.whl
curl -fLO $base/vllm-1.1.0-cp312-cp312-linux_x86_64.whl
pip install --prefer-binary --extra-index-url https://download.pytorch.org/whl/cu128 ./*.whl
# 3) overlay the 3 FP8 files from this branch onto the installed package:
git clone --depth 1 -b rivetphilbot-rollup https://github.com/rivetphilbot/1Cat-vLLM.git rollup-src
SP=$(python -c "import vllm,os;print(os.path.dirname(os.path.dirname(vllm.__file__)))")
for f in vllm/model_executor/layers/attention/attention.py \
         vllm/entrypoints/openai/chat_completion/serving.py \
         vllm/reasoning/qwen3_reasoning_parser.py; do cp "rollup-src/$f" "$SP/$f"; done

The FP8 commits — #54 (attention.py, required), #47 (serving.py), #52
(parser) — are pure-Python, hence the overlay. Only #45 (W4A16) needs a source build
(it modifies lmdeploy/src/turbomind/kernels/gemm/kernel/sm70_884_4.cu).
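
Because the overlay is a plain file copy, a quick way to confirm it took effect is to compare hashes between the checkout and the installed package. The following is a minimal sketch, not part of the PR: the helper names `sha256_of` and `overlay_status` are hypothetical, and the two root directories are whatever you used for `rollup-src` and `$SP` in the install step.

```python
import hashlib
from pathlib import Path

# The three overlay files named in the install step above.
OVERLAY_FILES = [
    "vllm/model_executor/layers/attention/attention.py",
    "vllm/entrypoints/openai/chat_completion/serving.py",
    "vllm/reasoning/qwen3_reasoning_parser.py",
]


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def overlay_status(src_root, site_root, files=OVERLAY_FILES):
    """Map each overlay file to 'ok', 'differs', or 'missing'.

    src_root  -- the rollup checkout (hypothetical: ./rollup-src)
    site_root -- the directory that contains the installed vllm/ package
    """
    status = {}
    for rel in files:
        src, dst = Path(src_root) / rel, Path(site_root) / rel
        if not src.is_file() or not dst.is_file():
            status[rel] = "missing"
        elif sha256_of(src) == sha256_of(dst):
            status[rel] = "ok"
        else:
            status[rel] = "differs"
    return status
```

Calling `overlay_status("rollup-src", SP)` with the same `SP` value computed during install should report `ok` for all three files; anything else suggests the copy went to a different environment.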

Serve (run from OUTSIDE the source checkout)

export NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=0 VLLM_WORKER_MULTIPROC_METHOD=spawn
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_SM70_FP8_DEQUANT_FALLBACK=1 VLLM_SM70_FP8_TURBOMIND=1
python -m vllm.entrypoints.openai.api_server \
  --model /models/deckard-40b-fp8-mtp --served-model-name deckard-fp8mtp \
  --host 0.0.0.0 --port 8003 --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 --max-model-len 177408 \
  --max-num-seqs 4 --max-num-batched-tokens 4096 \
  --kv-cache-dtype fp8_e5m2 --attention-backend FLASH_ATTN_V100 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \
  --skip-mm-profiling --enable-prefix-caching \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 --default-chat-template-kwargs '{"enable_thinking": true}'

Observed: Application startup complete; GPU KV cache 63,360 tokens (1.35× concurrency at
177K); cold-start init ≈ 270 s (torch.compile + cudagraph), warm ≈ 75–135 s; chat smoke test
finish_reason=stop with reasoning routed to reasoning_content, content clean; ~30–32 tok/s single-stream (MTP ~0.8–0.9 draft acceptance ≈ 1.9 tokens/step — without MTP it falls to the base rate ~16–17 tok/s).
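
The smoke-test criteria above (finish_reason=stop, reasoning in reasoning_content, clean content) can be checked mechanically on a parsed chat-completion response. This is a minimal sketch assuming the standard OpenAI-style response layout (`choices[0].message`); the function name `check_reasoning_split` is hypothetical.

```python
def check_reasoning_split(response: dict) -> list:
    """Return a list of problems found in a chat-completion response dict.

    An empty list means the response matches the smoke-test criteria:
    finish_reason == "stop", non-empty reasoning_content, and no <think>
    tags leaking into content.
    """
    problems = []
    choice = response["choices"][0]
    message = choice["message"]

    if choice.get("finish_reason") != "stop":
        problems.append(f"finish_reason is {choice.get('finish_reason')!r}")

    if not message.get("reasoning_content"):
        problems.append("reasoning_content is empty or missing")

    content = message.get("content") or ""
    if "<think>" in content or "</think>" in content:
        problems.append("think tags leaked into content")

    return problems
```

Feed it the JSON body returned by `/v1/chat/completions`; a non-empty result usually points at a missing `--reasoning-parser qwen3` flag or a missing parser overlay.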

Common trip-ups

  • Wheels are cp312 → use Python 3.12 (a newer system Python won't match).
  • Install both wheels together. Plain pip is fine; with uv, add --index-strategy unsafe-best-match (flashinfer is PyPI-only).
  • Run from outside any 1Cat-vLLM checkout, or Python imports the source tree instead of the wheel's CUDA extensions.
  • Missing the attention.py overlay → ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints. at load.
  • MTP is worth ~2× but fails silently. The -MTP checkpoint carries the draft head, but it does nothing without --speculative-config (use "method": "mtp"; qwen3_5_mtp is a deprecated alias). Verify it's accepting via /metrics (spec_decode_num_accepted_tokens_total vs …_num_draft_tokens_total). Note: low tok/s with MTP accepting is not MTP-off. On a no-NVLink V100 pair, MTP only pays off when the verify+draft step is captured in one FULL CUDA graph (cudagraph_mode=FULL_AND_PIECEWISE). Piecewise-only graphs (or --enforce-eager) leave the draft's all-reduce un-amortized, so MTP nets ~0% and you sit at base rate. FULL capture can silently fall back to piecewise when the DeltaNet causal_conv1d_update autotune runs during capture; set cudagraph_num_of_warmups=3 to push it into warmup. Field-proven: 17→30 tok/s (1.76×) on a box whose interconnect was worse than ours, purely by capturing FULL graphs. So the interconnect is not the ceiling; graph capture is.
  • Clients need the served-name alias. When swapping models behind a fixed endpoint, add --served-model-name deckard-fp8mtp <actual-model-name> or existing clients get 404s.
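
The acceptance check mentioned in the MTP bullet above can be scripted against the Prometheus text served at /metrics. This is a sketch under assumptions: it matches counters by the name suffixes quoted above rather than by full metric name (the exact prefix is not spelled out here), and the helper names `sum_counter` and `mtp_acceptance` are hypothetical.

```python
def sum_counter(metrics_text: str, name_suffix: str) -> float:
    """Sum all samples whose metric name ends with name_suffix.

    Expects Prometheus text exposition lines of the form
    `name{labels} value` or `name value` (no trailing timestamp).
    """
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        metric_name = name_part.split("{", 1)[0]
        if metric_name.endswith(name_suffix):
            total += float(value)
    return total


def mtp_acceptance(metrics_text: str):
    """Return accepted/drafted, or None when no draft tokens were counted.

    None is the "MTP is silently off" signal: the draft counter never moved.
    """
    accepted = sum_counter(metrics_text, "spec_decode_num_accepted_tokens_total")
    drafted = sum_counter(metrics_text, "_num_draft_tokens_total")
    if drafted == 0:
        return None
    return accepted / drafted
```

The text can be fetched with `urllib.request.urlopen("http://localhost:8003/metrics").read().decode()` for the serve command above; a `None` result after some traffic means no draft tokens were ever proposed.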

Field validation (2026-06-06, pve3, 2× V100-PCIE-32GB)

Follow-up bench session on the same hardware, comparing stock 1.1.0 wheels vs rollup overlay
(/opt/1cat-vllm/.venv-v110) across Qwen3.6-27B-FP8 (community recipe) and Deckard-40B-FP8-MTP.
All throughput numbers are c1 single-stream from bench_deckard_toks.py (256 tokens, ignore_eos).
Do not use vllm bench serve for community comparisons — it understated Qwen27B at ~27 tok/s
vs the real ~51 tok/s.
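
For readers without bench_deckard_toks.py, the c1 methodology described above (one request at a time, 256 generated tokens, ignore_eos) can be approximated with the standard library. This is a rough sketch, not the author's script: the function names are hypothetical, `ignore_eos` is assumed to be accepted as an extra field by the OpenAI-compatible completions endpoint, and the timing is end-to-end, so it includes prefill.

```python
import json
import time
import urllib.request


def build_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Request body for one single-stream run with a fixed output length."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "ignore_eos": True,  # assumed extra field: keep generating to max_tokens
    }


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput of one run; rejects non-positive durations."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


def run_once(base_url: str, payload: dict) -> float:
    """Send one blocking request and return end-to-end tok/s."""
    request = urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Averaging several `run_once("http://localhost:8003", build_payload("deckard-fp8mtp", "..."))` calls gives a mean c1 figure; discard the first run if it includes warmup.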

Full write-up also committed on this branch: V100_FIELD_VALIDATION_20260606.md.

Qwen27B-FP8 — rollup is not a throughput tax

| Variant | Mean c1 (tok/s) | Notes |
| --- | --- | --- |
| Stock 1.1.0 wheels, community flags, auto fp16 KV | 51.6 | No rollup overlay |
| Rollup v110 overlay, same flags, auto fp16 KV | 51.5 | attention.py + serving + parser |
| Rollup v110, auto fp16 KV @ 177k | 49.6 | Run-to-run variance, not KV mode |
| Rollup v110, fp8_e5m2 KV @ 177k | 49.7 | MTP accept 70% vs 68%, within noise |
| Stock + fp8_e5m2 on FP8 weights | BOOT_FAIL | Needs #54 (attention.py) |

Verdict: The rollup patches are ~0 tok/s regression on the Qwen27B-FP8 community recipe.
The earlier concern that rollup costs ~⅓ throughput was a bench-tool artifact (vllm bench serve),
not a real serving penalty. For max Qwen27B speed on 2×V100, stock wheels + community flags suffice;
rollup is still required for Deckard prod features (e5m2 @ 177k, reasoning split) but not for Qwen perf.

KV cache modes on SM70 — do not conflate

| --kv-cache-dtype | Qwen27B @ 177k | Deckard40B @ 177k | Notes |
| --- | --- | --- | --- |
| (omit: auto fp16) | ✓ boots, ~50 tok/s | BOOT_FAIL (KV OOM) | Community recipe for 27B |
| fp8 (e4m3) | BOOT_FAIL | BOOT_FAIL | fp8e4nv not supported on SM70; stock, v110, and rollup all fail identically |
| fp8_e5m2 | ✓ boots (rollup only) | ✓ boots (rollup only) | Memory lever for 40B @ 177k; neutral on Qwen27B speed |

  • #54 ([Bugfix] Allow fp8_e5m2 KV cache with FP8 checkpoints (V100/SM70)) is load-bearing for fp8_e5m2 on dense FP8 checkpoints: without the attention.py overlay,
    stock 1.1.0 hard-fails at load.
  • --kv-cache-dtype fp8 (e4m3) is not a V100 knob today. Community "+20 tok/s with fp8 KV" claims
    likely mean FP8 weights + default fp16 KV, not e4m3 KV.
  • fp8_e5m2 on Qwen27B is an optional memory saver with no measured speed penalty; on Deckard 40B
    it is required to fit 177k context.

Deckard-40B-FP8-MTP reference numbers (same bench script)

| Config | Mean c1 (tok/s) | Notes |
| --- | --- | --- |
| Rollup + e5m2 @ 177k, MTP n=1 | ~32.3 | old prod-ref |
| Auto fp16 KV @ 99k (no e5m2) | 35.6 | Max ctx without e5m2 on 40B |
| Rollup + e5m2 @ 177k, MTP n=2 | ~40.0 | current prod (2026-06-07) |
| Auto fp16 KV @ 177k | BOOT_FAIL | Needs 8.93 GiB KV, ~5.2 GiB available |
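
The BOOT_FAIL row is simple arithmetic, sketched below. The two GiB figures and the 177408-token limit come from this PR; the rest is an assumption: KV memory is treated as scaling linearly with bytes per element and with context length, which ignores any fixed per-sequence state in the hybrid layers, so treat the result as an estimate only.

```python
FP16_KV_GIB_AT_MAX_LEN = 8.93   # from the BOOT_FAIL message at 177k
AVAILABLE_KV_GIB = 5.2          # approximate free KV memory reported at boot
MAX_MODEL_LEN = 177408          # --max-model-len from the serve command

BYTES_PER_ELEMENT = {"fp16": 2, "fp8_e5m2": 1}


def kv_gib(dtype: str, context_len: int = MAX_MODEL_LEN) -> float:
    """Estimated KV cache size for one sequence of context_len tokens."""
    dtype_scale = BYTES_PER_ELEMENT[dtype] / BYTES_PER_ELEMENT["fp16"]
    length_scale = context_len / MAX_MODEL_LEN
    return FP16_KV_GIB_AT_MAX_LEN * dtype_scale * length_scale


def max_context(dtype: str) -> int:
    """Longest single-sequence context that fits in the available KV memory."""
    return int(MAX_MODEL_LEN * AVAILABLE_KV_GIB / kv_gib(dtype))


for dtype in BYTES_PER_ELEMENT:
    fits = kv_gib(dtype) <= AVAILABLE_KV_GIB
    print(f"{dtype}: {kv_gib(dtype):.2f} GiB at 177k, fits={fits}, "
          f"max ctx ~{max_context(dtype)}")
```

Under these assumptions fp16 KV tops out at about 103k tokens, in the same range as the 99k figure in the table, while fp8_e5m2 halves the requirement and fits 177k with room to spare.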

When to use rollup vs stock (practical guide)

| Goal | Config |
| --- | --- |
| Max tok/s on 2×V100 | Qwen27B-FP8 stock wheels, community flags, auto fp16 KV (~51 c1) |
| Deckard tune + 177k ctx | Rollup overlay + fp8_e5m2 KV (~32.5 c1); not a rollup penalty on 27B |
| Deckard speed without e5m2 | Deckard @ 99k auto fp16 KV (~35.6 c1) |
| Do not use on V100 | --kv-cache-dtype fp8 (e4m3) |

Deckard-40B-FP8-MTP @ 177k — MTP n=2 unlocks ~40 tok/s (2026-06-07)

Follow-up on the prod Deckard recipe (rollup v110 + fp8_e5m2 @ 177k). The old PR
serve example used num_speculative_tokens: 1; field testing on pve3 shows n=2 is
the throughput win, not CUDA fullgraph or batched-token tuning.

| Variant | Mean c1 (tok/s) | MTP accept | Notes |
| --- | --- | --- | --- |
| MTP n=2, batched 8192 (current prod) | ~40.0 | 71.4% (450/630) | ~25% over n=1 |
| MTP n=2, batched 8192, fullgraph | 40.0 | 71.4% | fullgraph adds nothing here |
| MTP n=2, batched 4096 | 40.0 | 71.4% | batched size irrelevant at c1 |
| Prod-ref MTP n=1 | ~32.3 | 82.9% (348/420) | higher accept %, fewer tokens/step |

Why: n=1 has a higher acceptance rate, but n=2 proposes more draft tokens per
step → more accepted tokens/step → higher net throughput. Do not pick n=1 based on accept % alone.
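
The accept counts in the table make this concrete. The sketch below assumes the usual speculative-decoding accounting, which this PR does not spell out: every step drafts exactly n tokens, and each step emits the accepted drafts plus one token from the target model. The function name is hypothetical.

```python
def tokens_per_step(accepted: int, drafted: int, n_spec: int) -> float:
    """Average tokens emitted per decode step.

    steps = drafted / n_spec (each step proposes n_spec drafts), and each
    step yields its accepted drafts plus one token from the target model.
    """
    if n_spec <= 0 or drafted <= 0:
        raise ValueError("n_spec and drafted must be positive")
    steps = drafted / n_spec
    return 1.0 + accepted / steps


n1 = tokens_per_step(accepted=348, drafted=420, n_spec=1)
n2 = tokens_per_step(accepted=450, drafted=630, n_spec=2)

print(f"n=1: {n1:.2f} tokens/step")   # 1.83
print(f"n=2: {n2:.2f} tokens/step")   # 2.43
print(f"ratio: {n2 / n1:.2f}x")       # 1.33x
```

The 1.33× ratio is an upper bound on the gain from n=2; the measured gain (~40.0 vs ~32.3 tok/s, about 1.24×) is lower, plausibly because each n=2 step also does more draft work.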

Prod flags that matter (beyond the FP8 overlay):

  • --speculative-config '{"method":"mtp","num_speculative_tokens":2}' ← the headline knob
  • --max-num-batched-tokens 8192 (4096 also fine; not the differentiator)
  • --kv-cache-dtype fp8_e5m2 (required for 177k KV on 40B)
  • NCCL_P2P_DISABLE=1 (cross-socket SYS topology; NCCL_P2P_DISABLE=0 → ~4× slowdown)
  • Default cudagraph is fine; forcing fullgraph did not help on this box

Canonical write-up: BestDeckardConfig.md on the homelab share; serve script:
/opt/1cat-vllm/serve_fp8mtp_prod.sh; service: 1cat-qwen.service on pve3.

Open follow-ups (not blocking merge)

  • Deckard MTP n=2 concurrency sweep (c4/c8/c16) — only c1 A/B run so far
  • CUDA graph capture mode not systematically A/B'd on our no-NVLink V100 pair
  • All numbers are c1; prod runs c16 — Qwen27B ceiling at higher concurrency untested

yangzhuxinyzx and others added 30 commits March 21, 2026 12:23
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com>
(cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com>
(cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Updated the WeChat group QR code image in the README.
Fixed the incorrect name
Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.
yangzhuxinyzx and others added 19 commits May 13, 2026 19:00
… loading

Squashed PR 1CatAI#45: SM70 dense WNA16 TurboMind linear kernel + selector,
admit V100 (CC 7.0) in CompressedTensorsWNA16, compressed-tensors MoE
decode buffers for V100, keep router gate / split GDN projections
unquantized under CT, skip tuple-shard split for non-output-dim CT params,
sm70_884_4.cu kernel registry sync. Enables W4A16 compressed-tensors
models on Volta.
…2 KV on W4A16 compressed-tensors, V100/SM70)

Squashed PR 1CatAI#49.
vLLM eagerly raises "fp8_e5m2 kv-cache is not supported with fp8
checkpoints" for any FP8-weight checkpoint, blocking the only FP8 KV
format that compiles on V100/SM70 (Triton on sm_70 supports fp8e5, not
fp8e4nv/e4m3).

fp8_e5m2 is scaleless: its 5 exponent bits give enough range that it does
not need the k_scale/v_scale machinery the e4m3 path loads from the
checkpoint. So for e5m2 we skip the scale-quant setup and use plain e5m2
KV with default 1.0 scales; the e4m3/scaled path is unchanged.

Sibling of 1CatAI#49 (same fix for W4A16 compressed-tensors checkpoints).
Validated live on 2x V100-PCIE-32GB serving dense FP8 + MTP Qwen3.5-40B
at 177K context.
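
The range argument in the commit message above can be checked from the bit layouts alone. This sketch assumes the usual FP8 conventions: e5m2 reserves its top exponent code for inf/NaN as IEEE formats do, while e4m3fn reserves only the all-ones pattern for NaN. The helper `fits_unscaled` and the sample magnitude are illustrative and not taken from the patch.

```python
# Largest finite value = (largest usable mantissa) * 2 ** (max exponent - bias)

# e5m2: 5 exponent bits (bias 15, top code 31 reserved), 2 mantissa bits.
E5M2_MAX = (2 - 2 ** -2) * 2.0 ** (30 - 15)

# e4m3fn: 4 exponent bits (bias 7, code 15 usable), 3 mantissa bits,
# but mantissa 0b111 at the top exponent is NaN, so the largest is 0b110.
E4M3FN_MAX = (2 - 2 ** -2) * 2.0 ** (15 - 7)

# fp16 for comparison: 5 exponent bits (bias 15), 10 mantissa bits.
FP16_MAX = (2 - 2 ** -10) * 2.0 ** (30 - 15)


def fits_unscaled(abs_max: float, format_max: float) -> bool:
    """True if values up to abs_max can be stored with a scale of 1.0."""
    return abs_max <= format_max


print(E5M2_MAX, E4M3FN_MAX, FP16_MAX)  # 57344.0 448.0 65504.0

sample_abs_max = 1000.0  # illustrative activation magnitude
print(fits_unscaled(sample_abs_max, E5M2_MAX))    # True
print(fits_unscaled(sample_abs_max, E4M3FN_MAX))  # False
```

e5m2 covers nearly the same range as fp16, which is why a default scale of 1.0 is workable there, whereas e4m3 overflows at 448 and depends on per-tensor scales.
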
…kens

extract_content_ids() returns [] when the <think> block was never closed
(generation truncated by max_tokens mid-reasoning). Decoding an empty id
list blanked out the content the reasoning parser already produced, so the
client got an empty/cut-off response on truncation. Only override content
when there are content ids. Mirrors 1CatAI#47.
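
The guard described in this commit message amounts to a single conditional. The sketch below illustrates the logic only; `resolve_content` and its parameters are hypothetical stand-ins, not the names used in serving.py.

```python
from typing import Callable, List, Optional


def resolve_content(
    parser_content: Optional[str],
    content_ids: List[int],
    decode: Callable[[List[int]], str],
) -> Optional[str]:
    """Pick the content string to return to the client.

    content_ids is empty when the <think> block was never closed
    (truncated mid-reasoning). Decoding an empty list would yield "",
    wiping out whatever the reasoning parser already produced, so the
    parser's content is only overridden when there are ids to decode.
    """
    if content_ids:
        return decode(content_ids)
    return parser_content
```

With a truncated generation, `resolve_content("partial answer", [], tokenizer_decode)` keeps `"partial answer"` instead of returning an empty string.
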

philbert440 commented Jun 1, 2026

Validated with: https://huggingface.co/tcclaviger/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-FP8-MTP

@valentijnvenus

Up vote

claude and others added 5 commits June 6, 2026 02:42
…ed backend)

FlashAttnV100Backend defines get_supported_head_sizes() -> [64,128,256], but
validate_configuration() checks supports_head_size(), not that method. Since
FlashAttnV100Backend does not override supports_head_size, it inherited
TritonAttentionBackend.supports_head_size (head_size >= 32) and wrongly
validated head sizes the Volta kernel cannot run (e.g. 512), then hard-crashed
in the CUDA dispatch (TORCH_CHECK D <= 256, "D must be even, <=256, multiple
of 8").

Add a supports_head_size override returning {64,128,256} (the sizes the SM70
dense/paged kernels actually dispatch). This:
  - fixes the crash for any model with head_dim > 256 on auto-select / FA-V100,
  - enables vLLM v1's per-layer backend selection to keep the supported heads
    on FLASH_ATTN_V100 and fall through to TRITON_ATTN for the rest.

Concretely this unlocks the fast mixed backend for models with dual head dims
(e.g. Gemma-4: 256-dim sliding layers run on FA-V100, 512-dim global layers on
Triton) instead of being forced onto TRITON_ATTN for every layer — ~2.5x decode
on V100 for such models. No effect on standard head_dim<=256 models (already
all FA-V100).

Co-Authored-By: RivetOS Claude <noreply@anthropic.com>
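
The inheritance bug and the fix in the commit message above can be reproduced with stand-in classes. This is a simplified sketch: the class names, the `pick_backend` helper, and the priority order are illustrative assumptions, not the real vLLM selector.

```python
class TritonLikeBackend:
    """Stand-in for the Triton backend: accepts any head size >= 32."""
    name = "TRITON_ATTN"

    @classmethod
    def supports_head_size(cls, head_size: int) -> bool:
        return head_size >= 32


class FlashAttnV100LikeBackend(TritonLikeBackend):
    """Stand-in for the Volta backend after the fix."""
    name = "FLASH_ATTN_V100"

    @classmethod
    def get_supported_head_sizes(cls) -> list:
        return [64, 128, 256]

    @classmethod
    def supports_head_size(cls, head_size: int) -> bool:
        # The override: without it, the inherited ">= 32" check would
        # wrongly accept head sizes such as 512.
        return head_size in cls.get_supported_head_sizes()


def pick_backend(head_size: int,
                 priority=(FlashAttnV100LikeBackend, TritonLikeBackend)) -> str:
    """Return the first backend in priority order that accepts head_size."""
    for backend in priority:
        if backend.supports_head_size(head_size):
            return backend.name
    raise ValueError(f"no backend supports head_size={head_size}")


# Dual-head-dim model: 256-dim layers stay on the Volta kernel,
# 512-dim layers fall through to Triton.
print([pick_backend(h) for h in (256, 512)])
```

Removing the `supports_head_size` override from the subclass makes `pick_backend(512)` return `FLASH_ATTN_V100`, which is the misvalidation that led to the CUDA dispatch crash.
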
@valentijnvenus

@yangzhuxinyzx is this PR being looked into? It would be helpful to see a community develop around your wonderful V100 initiative.

@GabeChurch

I am also interested, seems like some positive changes here

9 participants