[V100/SM70] Rollup: Volta serving stack — W4A16 + FP8 (e5m2/MTP) + reasoning fixes by rivetphilbot · Pull Request #55 · 1CatAI/1Cat-vLLM

rivetphilbot · 2026-06-01T03:44:50Z

Integration branch bundling all our V100/SM70 serving patches so one branch builds a
vLLM that runs both compressed-tensors W4A16 and dense FP8 (+MTP) Qwen3.5/Deckard
checkpoints on Volta (2× V100-32GB). Each commit is the clean net diff of an existing focused
PR; this is the all-in-one alternative to merging them piecemeal.

Bundled (one squashed commit each)

[V100/SM70] Add compressed-tensors dense WNA16 path + DeltaNet weight loading #45 — SM70 dense WNA16 TurboMind kernel + V100 (CC 7.0) admittance + DeltaNet loading → W4A16 on Volta
[Bugfix] Allow fp8_e5m2 KV cache on W4A16 compressed-tensors models (V100/SM70) #49 — skip CT KVCacheMethod when the checkpoint has no KV scales → fp8_e5m2 KV on W4A16 CT
[Bugfix] Allow fp8_e5m2 KV cache with FP8 checkpoints (V100/SM70) #54 — allow fp8_e5m2 KV with FP8 checkpoints → fp8_e5m2 KV on dense FP8
fix: keep partial content when reasoning block is truncated by max_tokens #47 — keep partial content when the reasoning block is truncated by max_tokens (else empty/cut-off responses)
[Bugfix] Default Qwen3 reasoning parser to prompt-has-open-think #52 — default the Qwen3 reasoning parser to prompt-has-open-think
ci: fix CRLF line endings in shell scripts #46 — CRLF→LF for shell scripts + .gitattributes

Validation

FP8 path (e5m2 + #47) runs live on 2× V100-PCIE serving a dense FP8 + MTP Qwen3.5-40B
at 177K context.
W4A16 selection verified: with #45 the CompressedTensorsWNA16 scheme admits CC 7.0 and
W4A16 (asym and sym) binds SM70TurboMindLinearKernel; stock 1.1.0 admits neither.
All Python files are compile-checked; the only conflict during assembly was a one-line
comment in qwen3_next.py, now resolved.

If you'd rather review/merge atomically, the focused PRs #45 / #47 / #49 / #54 are open;
this rollup is the convenience superset.

✅ Verified working configuration (FP8 + MTP, 2× V100)

Validated from a clean install (2026-06-01) on 2× V100-PCIE-32GB: 1Cat's prebuilt 1.1.0
wheels + the three pure-Python FP8 files from this branch overlaid onto the installed
vllm/ → the dense 40B FP8 + MTP checkpoint serves at 177K with fp8_e5m2 KV and clean
reasoning separation. No source build for the FP8 path.

Environment

2× NVIDIA V100-PCIE-32GB (SM70), CUDA 12.8 driver
Python 3.12, torch 2.9.1+cu128, vllm 1.1.0, flash_attn_v100 1.1.0 (from releases/latest)
Model: a dense qwen3_5 FP8 (+MTP) checkpoint (~21.8 GiB/GPU)

Install (FP8 — no build)

# 1) Python 3.12 env (wheels are cp312); 2) install BOTH wheels together:
base=https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.1.0
curl -fLO $base/flash_attn_v100-1.1.0-cp312-cp312-linux_x86_64.whl
curl -fLO $base/vllm-1.1.0-cp312-cp312-linux_x86_64.whl
pip install --prefer-binary --extra-index-url https://download.pytorch.org/whl/cu128 ./*.whl
# 3) overlay the 3 FP8 files from this branch onto the installed package:
git clone --depth 1 -b rivetphilbot-rollup https://github.com/rivetphilbot/1Cat-vLLM.git rollup-src
SP=$(python -c "import vllm,os;print(os.path.dirname(os.path.dirname(vllm.__file__)))")
for f in vllm/model_executor/layers/attention/attention.py \
         vllm/entrypoints/openai/chat_completion/serving.py \
         vllm/reasoning/qwen3_reasoning_parser.py; do cp "rollup-src/$f" "$SP/$f"; done

The FP8 commits — #54 (attention.py, required), #47 (serving.py), #52
(parser) — are pure-Python, hence the overlay. Only #45 (W4A16) needs a source build
(it modifies lmdeploy/src/turbomind/kernels/gemm/kernel/sm70_884_4.cu).
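
One way to confirm the overlay actually landed is to compare file hashes between the checkout and the installed package. The following is a minimal sketch rather than part of this branch: `overlay_status` and `sha256_of` are hypothetical helper names, and the three relative paths are the ones copied in the install step above.

```python
import hashlib
from pathlib import Path

# Relative paths of the three pure-Python FP8 files named in the install step.
OVERLAY_FILES = (
    "vllm/model_executor/layers/attention/attention.py",
    "vllm/entrypoints/openai/chat_completion/serving.py",
    "vllm/reasoning/qwen3_reasoning_parser.py",
)


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def overlay_status(src_root: Path, site_packages: Path, files=OVERLAY_FILES) -> dict:
    """Map each overlay file to 'ok', 'differs', or 'missing'.

    src_root is the rollup checkout (rollup-src above); site_packages is the
    directory that contains the installed vllm/ package ($SP above).
    """
    status = {}
    for rel in files:
        src, dst = src_root / rel, site_packages / rel
        if not src.is_file() or not dst.is_file():
            status[rel] = "missing"
        elif sha256_of(src) == sha256_of(dst):
            status[rel] = "ok"
        else:
            status[rel] = "differs"
    return status
```

Calling `overlay_status(Path("rollup-src"), Path(sp))` with `sp` set to the same directory as `$SP` should report `ok` for all three files; any other value suggests the `cp` step did not take effect.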

Serve (run from OUTSIDE the source checkout)

export NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=0 VLLM_WORKER_MULTIPROC_METHOD=spawn
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_SM70_FP8_DEQUANT_FALLBACK=1 VLLM_SM70_FP8_TURBOMIND=1
python -m vllm.entrypoints.openai.api_server \
  --model /models/deckard-40b-fp8-mtp --served-model-name deckard-fp8mtp \
  --host 0.0.0.0 --port 8003 --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 --max-model-len 177408 \
  --max-num-seqs 4 --max-num-batched-tokens 4096 \
  --kv-cache-dtype fp8_e5m2 --attention-backend FLASH_ATTN_V100 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \
  --skip-mm-profiling --enable-prefix-caching \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 --default-chat-template-kwargs '{"enable_thinking": true}'

Observed: Application startup complete; GPU KV cache 63,360 tokens (1.35× concurrency at
177K); cold-start init ≈ 270 s (torch.compile + cudagraph), warm ≈ 75–135 s; chat smoke test
finish_reason=stop with reasoning routed to reasoning_content, content clean; ~30–32 tok/s single-stream (MTP ~0.8–0.9 draft acceptance ≈ 1.9 tokens/step — without MTP it falls to the base rate ~16–17 tok/s).

Common trip-ups

Wheels are cp312 → use Python 3.12 (a newer system Python won't match).
Install both wheels together. With uv, add --index-strategy unsafe-best-match (flashinfer is PyPI-only); plain pip works as shown above.
Run from outside any 1Cat-vLLM checkout, or Python imports the source tree instead of the wheel's CUDA extensions.
Missing the attention.py overlay → ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints. at load.
MTP is worth ~2× but fails silently. The -MTP checkpoint carries the draft head, but it does nothing without --speculative-config (use "method": "mtp"; qwen3_5_mtp is a deprecated alias). Verify that drafts are being accepted via /metrics (spec_decode_num_accepted_tokens_total vs …_num_draft_tokens_total). Note: low tok/s with MTP accepting is not the same as MTP being off. On a no-NVLink V100 pair, MTP only pays off when the verify+draft step is captured in one FULL CUDA graph (cudagraph_mode=FULL_AND_PIECEWISE). Piecewise-only graphs (or --enforce-eager) leave the draft's all-reduce un-amortized, so MTP nets ~0% and you sit at the base rate. FULL capture can silently fall back to piecewise when the DeltaNet causal_conv1d_update autotune runs during capture; set cudagraph_num_of_warmups=3 to push it into warmup. Field-proven: 17→30 tok/s (1.76×) on a box whose interconnect was worse than ours, purely by capturing FULL graphs. So the interconnect is not the ceiling; graph capture is.
Clients need the served-name alias. When swapping models behind a fixed endpoint, add --served-model-name deckard-fp8mtp <actual-model-name> or existing clients get 404s.
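
The /metrics check mentioned in the MTP item can be scripted. The sketch below parses Prometheus text output and computes the accepted/draft ratio. It assumes the counters are exposed with names ending in `spec_decode_num_accepted_tokens_total` and `spec_decode_num_draft_tokens_total`; any prefix such as `vllm:` is an assumption, which is why the code matches on the suffix only. `sum_counter` and `mtp_acceptance` are hypothetical helper names.

```python
import re

# Matches Prometheus text lines such as:
#   vllm:spec_decode_num_accepted_tokens_total{model_name="m"} 450.0
# Comment lines (# HELP / # TYPE) do not match because names cannot start with '#'.
_LINE = re.compile(r"^(?P<name>[^\s{#]+)(?:\{[^}]*\})?\s+(?P<value>\S+)")


def sum_counter(metrics_text: str, suffix: str) -> float:
    """Sum every sample whose metric name ends with `suffix`."""
    total = 0.0
    for line in metrics_text.splitlines():
        m = _LINE.match(line.strip())
        if m and m.group("name").endswith(suffix):
            total += float(m.group("value"))
    return total


def mtp_acceptance(metrics_text: str):
    """Return the accepted/draft token ratio, or None if no drafts were counted."""
    accepted = sum_counter(metrics_text, "spec_decode_num_accepted_tokens_total")
    drafted = sum_counter(metrics_text, "spec_decode_num_draft_tokens_total")
    return accepted / drafted if drafted else None
```

The text can be fetched with `urllib.request.urlopen("http://localhost:8003/metrics").read().decode()` against the serve command above. A result of `None` after some traffic would suggest that speculative decoding is not active at all.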

Field validation (2026-06-06, pve3, 2× V100-PCIE-32GB)

Follow-up bench session on the same hardware, comparing stock 1.1.0 wheels vs rollup overlay
(/opt/1cat-vllm/.venv-v110) across Qwen3.6-27B-FP8 (community recipe) and Deckard-40B-FP8-MTP.
All throughput numbers are c1 single-stream from bench_deckard_toks.py (256 tokens, ignore_eos).
Do not use vllm bench serve for community comparisons — it understated Qwen27B at ~27 tok/s
vs the real ~51 tok/s.

Full write-up also committed on this branch: V100_FIELD_VALIDATION_20260606.md.
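
bench_deckard_toks.py itself is not reproduced in this PR. As a rough illustration of what a c1 single-stream measurement with 256 tokens and ignore_eos looks like, here is a minimal sketch under stated assumptions: the function names are hypothetical, the endpoint is the OpenAI-compatible /v1/completions route, `ignore_eos` is passed as a vLLM-specific extra request field, and `temperature: 0.0` is my own choice. The elapsed time includes prefill, so short prompts are advisable.

```python
import json
import time
import urllib.request
from statistics import mean


def build_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Request body for one fixed-length completion (ignore_eos forces full length)."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,   # assumption: deterministic decoding for repeatability
        "ignore_eos": True,   # assumption: vLLM-specific extra sampling parameter
    }


def run_once(base_url: str, payload: dict) -> float:
    """Send one request, wait for the full response, return tokens per second."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return body["usage"]["completion_tokens"] / elapsed


def mean_c1(base_url: str, payload: dict, runs: int = 5, send=run_once) -> float:
    """Mean single-stream (c1) tok/s: requests are issued strictly one at a time."""
    return mean(send(base_url, payload) for _ in range(runs))
```

For example, `mean_c1("http://localhost:8003", build_payload("deckard-fp8mtp", "Hello"))` would issue five sequential requests. The `send` parameter exists only so the averaging logic can be exercised without a live server.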

Qwen27B-FP8 — rollup is not a throughput tax

| Variant | Mean c1 (tok/s) | Notes |
|---|---|---|
| Stock 1.1.0 wheels, community flags, auto fp16 KV | 51.6 | No rollup overlay |
| Rollup v110 overlay, same flags, auto fp16 KV | 51.5 | `attention.py` + serving + parser |
| Rollup v110, auto fp16 KV @ 177k | 49.6 | Run-to-run variance, not KV mode |
| Rollup v110, `fp8_e5m2` KV @ 177k | 49.7 | MTP accept 70% vs 68%, within noise |
| Stock + `fp8_e5m2` on FP8 weights | BOOT_FAIL | Needs #54 (`attention.py`) |

Verdict: The rollup patches are ~0 tok/s regression on the Qwen27B-FP8 community recipe.
The earlier concern that rollup costs ~⅓ throughput was a bench-tool artifact (vllm bench serve),
not a real serving penalty. For max Qwen27B speed on 2×V100, stock wheels + community flags suffice;
rollup is still required for Deckard prod features (e5m2 @ 177k, reasoning split) but not for Qwen perf.

KV cache modes on SM70 — do not conflate

| `--kv-cache-dtype` | Qwen27B @ 177k | Deckard40B @ 177k | Notes |
|---|---|---|---|
| (omit; auto fp16) | ✓ boots, ~50 tok/s | BOOT_FAIL (KV OOM) | Community recipe for 27B |
| `fp8` (e4m3) | BOOT_FAIL | BOOT_FAIL | `fp8e4nv not supported` on SM70; stock, v110, and rollup all fail identically |
| `fp8_e5m2` | ✓ boots (rollup only) | ✓ boots (rollup only) | Memory lever for 40B @ 177k; neutral on Qwen27B speed |

#54 is load-bearing for fp8_e5m2 on dense FP8 checkpoints: without the attention.py overlay,
stock 1.1.0 hard-fails at load.
--kv-cache-dtype fp8 (e4m3) is not a V100 knob today. Community "+20 tok/s with fp8 KV" claims
likely mean FP8 weights + default fp16 KV, not e4m3 KV.
fp8_e5m2 on Qwen27B is an optional memory saver with no measured speed penalty; on Deckard 40B
it is required to fit 177k context.

Deckard-40B-FP8-MTP reference numbers (same bench script)

| Config | Mean c1 (tok/s) | Notes |
|---|---|---|
| Rollup + e5m2 @ 177k, MTP `n=1` | ~32.3 | old prod-ref |
| Auto fp16 KV @ 99k (no e5m2) | 35.6 | Max ctx without e5m2 on 40B |
| Rollup + e5m2 @ 177k, MTP `n=2` | ~40.0 | current prod (2026-06-07) |
| Auto fp16 KV @ 177k | BOOT_FAIL | Needs 8.93 GiB KV, ~5.2 GiB available |
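
The BOOT_FAIL row gives enough numbers for a back-of-envelope check of why fp8_e5m2 fits where fp16 does not. The sketch below assumes that KV memory scales linearly with bytes per element and with context length, and that the ~5.2 GiB budget is unchanged between modes. Both are simplifications (the DeltaNet layers keep constant-size state, for instance), and the helper names are hypothetical.

```python
FP16_KV_GIB_AT_177K = 8.93   # from the table: fp16 KV needed at 177k
AVAILABLE_GIB = 5.2          # from the table: approximate KV budget
MAX_MODEL_LEN = 177_408      # --max-model-len used in the serve command


def kv_gib(fp16_gib: float, bytes_per_elem: int) -> float:
    """Scale an fp16 KV footprint (2 bytes/element) to another element width."""
    return fp16_gib * bytes_per_elem / 2


def max_context(budget_gib: float, need_gib: float, ctx: int) -> int:
    """Longest context that fits, assuming KV grows linearly with context."""
    return int(ctx * min(1.0, budget_gib / need_gib))


e5m2_gib = kv_gib(FP16_KV_GIB_AT_177K, bytes_per_elem=1)
print(f"fp8_e5m2 KV at 177k: {e5m2_gib:.3f} GiB, fits: {e5m2_gib <= AVAILABLE_GIB}")
fp16_ctx = max_context(AVAILABLE_GIB, FP16_KV_GIB_AT_177K, MAX_MODEL_LEN)
print(f"fp16 KV max context under the same budget: ~{fp16_ctx} tokens")
```

Under these assumptions the 1-byte cache needs about 4.47 GiB, which is inside the ~5.2 GiB budget, and the fp16 ceiling comes out at roughly 103k tokens, the same ballpark as the 99k row in the table.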

When to use rollup vs stock (practical guide)

| Goal | Config |
|---|---|
| Max tok/s on 2×V100 | Qwen27B-FP8 stock wheels, community flags, auto fp16 KV (~51 c1) |
| Deckard tune + 177k ctx | Rollup overlay + `fp8_e5m2` KV (~32.3 c1 at MTP `n=1`, ~40 at `n=2`); not a rollup penalty on 27B |
| Deckard speed without e5m2 | Deckard @ 99k auto fp16 KV (~35.6 c1) |
| Do not use on V100 | `--kv-cache-dtype fp8` (e4m3) |

Deckard-40B-FP8-MTP @ 177k — MTP `n=2` unlocks ~40 tok/s (2026-06-07)

Follow-up on the prod Deckard recipe (rollup v110 + fp8_e5m2 @ 177k). The old PR
serve example used num_speculative_tokens: 1; field testing on pve3 shows n=2 is
the throughput win, not CUDA fullgraph or batched-token tuning.

| Variant | Mean c1 (tok/s) | MTP accept | Notes |
|---|---|---|---|
| MTP `n=2`, batched 8192 (current prod) | ~40.0 | 71.4% (450/630) | ~25% over n=1 |
| MTP `n=2`, batched 8192, fullgraph | 40.0 | 71.4% | fullgraph adds nothing here |
| MTP `n=2`, batched 4096 | 40.0 | 71.4% | batched size irrelevant at c1 |
| Prod-ref MTP `n=1` | ~32.3 | 82.9% (348/420) | higher accept %, fewer tokens/step |

Why: n=1 has a higher acceptance rate, but n=2 proposes more draft tokens per
step → more accepted tokens/step → higher net throughput. Do not pick n=1 based on accept % alone.
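
The accept counts in the table make this concrete. The sketch below is a rough model, not the engine's own accounting: it assumes every verify step drafts exactly `n_spec` tokens and always emits one token from the target model on top of the accepted drafts. `tokens_per_step` is a hypothetical helper.

```python
def tokens_per_step(accepted: int, drafted: int, n_spec: int) -> float:
    """Average tokens emitted per verify step.

    Assumes each step drafts exactly n_spec tokens and always emits one
    token from the target model in addition to the accepted drafts.
    """
    steps = drafted / n_spec
    return 1.0 + accepted / steps


n1 = tokens_per_step(accepted=348, drafted=420, n_spec=1)
n2 = tokens_per_step(accepted=450, drafted=630, n_spec=2)
print(f"n=1: {n1:.2f} tokens/step, n=2: {n2:.2f} tokens/step, ratio {n2 / n1:.2f}x")
```

Under these assumptions n=1 yields about 1.83 tokens/step and n=2 about 2.43, a ratio near 1.33×. The measured gain (~40.0 vs ~32.3 tok/s, about 1.24×) is smaller, which would be consistent with each n=2 step costing somewhat more than an n=1 step.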

Prod flags that matter (beyond the FP8 overlay):

--speculative-config '{"method":"mtp","num_speculative_tokens":2}' ← the headline knob
--max-num-batched-tokens 8192 (4096 also fine; not the differentiator)
--kv-cache-dtype fp8_e5m2 (required for 177k KV on 40B)
NCCL_P2P_DISABLE=1 (cross-socket SYS topology; NCCL_P2P_DISABLE=0 → ~4× slowdown)
Default cudagraph is fine; forcing fullgraph did not help on this box
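
Because --speculative-config takes a JSON string, quoting mistakes are easy to make when the flags are assembled by hand. The sketch below builds the argument list programmatically from the flags named above. It is an illustration only: `build_serve_argv` is a hypothetical helper and covers just the flags in this list plus model, served name, tensor parallelism, and context length, not the full prod command.

```python
import json
import shlex


def build_serve_argv(model_path: str, served_name: str,
                     num_speculative_tokens: int = 2,
                     max_num_batched_tokens: int = 8192) -> list:
    """Assemble an api_server argv from the flags listed above."""
    spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--served-model-name", served_name,
        "--tensor-parallel-size", "2",
        "--max-model-len", "177408",
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--kv-cache-dtype", "fp8_e5m2",
        "--speculative-config", json.dumps(spec),
    ]


argv = build_serve_argv("/models/deckard-40b-fp8-mtp", "deckard-fp8mtp")
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print("NCCL_P2P_DISABLE=1 " + shlex.join(argv))
```

Passing `argv` to `subprocess.run` with `NCCL_P2P_DISABLE=1` in the environment avoids shell quoting altogether; the printed form is only for copying into a script.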

Canonical write-up: BestDeckardConfig.md on the homelab share; serve script:
/opt/1cat-vllm/serve_fp8mtp_prod.sh → 1cat-qwen.service on pve3.

Open follow-ups (not blocking merge)

Deckard MTP n=2 concurrency sweep (c4/c8/c16) — only c1 A/B run so far
CUDA graph capture mode not systematically A/B'd on our no-NVLink V100 pair
All numbers are c1; prod runs c16 — Qwen27B ceiling at higher concurrency untested

… loading Squashed PR 1CatAI#45: SM70 dense WNA16 TurboMind linear kernel + selector, admit V100 (CC 7.0) in CompressedTensorsWNA16, compressed-tensors MoE decode buffers for V100, keep router gate / split GDN projections unquantized under CT, skip tuple-shard split for non-output-dim CT params, sm70_884_4.cu kernel registry sync. Enables W4A16 compressed-tensors models on Volta.

…2 KV on W4A16 compressed-tensors, V100/SM70) Squashed PR 1CatAI#49.

vLLM eagerly raises "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for any FP8-weight checkpoint, blocking the only FP8 KV format that compiles on V100/SM70 (Triton on sm_70 supports fp8e5, not fp8e4nv/e4m3). fp8_e5m2 is scaleless: its 5 exponent bits give enough range that it does not need the k_scale/v_scale machinery the e4m3 path loads from the checkpoint. So for e5m2 we skip the scale-quant setup and use plain e5m2 KV with default 1.0 scales; the e4m3/scaled path is unchanged. Sibling of 1CatAI#49 (same fix for W4A16 compressed-tensors checkpoints). Validated live on 2x V100-PCIE-32GB serving dense FP8 + MTP Qwen3.5-40B at 177K context.
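
The range argument can be illustrated without a GPU. An e5m2 value has the same sign and 5 exponent bits as IEEE fp16 with only the top 2 mantissa bits kept, so one e5m2 byte is the high byte of an fp16 value. The sketch below uses that fact; the function names are hypothetical, and the encoder simply drops the low fp16 mantissa bits, whereas real kernels may round differently.

```python
import struct


def e5m2_to_float(byte: int) -> float:
    """Decode one e5m2 byte: it is the high byte of an IEEE fp16 value."""
    return struct.unpack("<e", bytes([0, byte & 0xFF]))[0]


def float_to_e5m2(x: float) -> int:
    """Encode by converting to fp16 and keeping only the high byte.

    This drops the low 8 mantissa bits (assumption: truncation, not
    round-to-nearest as a production kernel might use).
    """
    return struct.pack("<e", x)[1]


E5M2_MAX = e5m2_to_float(0x7B)  # largest finite e5m2 value
E4M3_MAX = 448.0                # largest finite e4m3fn value, for comparison
print(E5M2_MAX, E4M3_MAX)
```

The largest finite e5m2 value is 57344, against 448 for e4m3, which is the extra headroom that lets the e5m2 path run with default 1.0 scales. The cost is precision: with 2 mantissa bits, `e5m2_to_float(float_to_e5m2(1000.0))` comes back as 896.0.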

…kens extract_content_ids() returns [] when the <think> block was never closed (generation truncated by max_tokens mid-reasoning). Decoding an empty id list blanked out the content the reasoning parser already produced, so the client got an empty/cut-off response on truncation. Only override content when there are content ids. Mirrors 1CatAI#47.
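
The guard described in this commit message amounts to a single conditional. The sketch below shows the idea in isolation; `resolve_content` and its parameters are hypothetical names for illustration and are not the actual code in serving.py.

```python
from typing import Callable, List, Optional


def resolve_content(parser_content: Optional[str],
                    content_ids: List[int],
                    decode: Callable[[List[int]], str]) -> Optional[str]:
    """Pick the final `content` string for a chat response.

    parser_content is what the reasoning parser already extracted;
    content_ids is the (possibly empty) id list for the non-reasoning part.
    """
    if content_ids:
        # Normal case: the reasoning block closed, so decode the content ids.
        return decode(content_ids)
    # Truncated mid-reasoning: there are no content ids, so keep the parser's
    # output instead of overwriting it with the decoding of an empty list.
    return parser_content
```

With a toy decoder such as `lambda ids: "".join(chr(i) for i in ids)`, an empty id list leaves `parser_content` untouched, while a non-empty list replaces it with the decoded text.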

Squashed PR 1CatAI#52.

Squashed PR 1CatAI#46.

philbert440 · 2026-06-01T14:31:25Z

Validated with: https://huggingface.co/tcclaviger/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-FP8-MTP

valentijnvenus · 2026-06-04T05:13:02Z

Up vote

…ed backend) FlashAttnV100Backend defines get_supported_head_sizes() -> [64,128,256], but validate_configuration() checks supports_head_size(), not that method. Since FlashAttnV100Backend does not override supports_head_size, it inherited TritonAttentionBackend.supports_head_size (head_size >= 32) and wrongly validated head sizes the Volta kernel cannot run (e.g. 512), then hard-crashed in the CUDA dispatch (TORCH_CHECK D <= 256, "D must be even, <=256, multiple of 8").

Add a supports_head_size override returning {64,128,256} (the sizes the SM70 dense/paged kernels actually dispatch). This:
- fixes the crash for any model with head_dim > 256 on auto-select / FA-V100,
- enables vLLM v1's per-layer backend selection to keep the supported heads on FLASH_ATTN_V100 and fall through to TRITON_ATTN for the rest.

Concretely this unlocks the fast mixed backend for models with dual head dims (e.g. Gemma-4: 256-dim sliding layers run on FA-V100, 512-dim global layers on Triton) instead of being forced onto TRITON_ATTN for every layer, which gives ~2.5x decode on V100 for such models. No effect on standard head_dim<=256 models (already all FA-V100).

Co-Authored-By: RivetOS Claude <noreply@anthropic.com>
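
The inheritance problem and the per-layer fall-through described in this commit message can be reproduced with two toy classes. The sketch below uses stand-in class names and a hypothetical `pick_backend` selector; it mirrors the described behaviour and is not the vLLM backend code.

```python
class TritonLikeBackend:
    """Stand-in for the inherited behaviour: accept any head size >= 32."""

    @classmethod
    def supports_head_size(cls, head_size: int) -> bool:
        return head_size >= 32


class FlashAttnV100LikeBackend(TritonLikeBackend):
    """Stand-in for the fixed backend: only sizes the SM70 kernels dispatch."""

    SUPPORTED_HEAD_SIZES = frozenset({64, 128, 256})

    @classmethod
    def supports_head_size(cls, head_size: int) -> bool:
        # Without this override the subclass would inherit the >= 32 check
        # and wrongly accept head sizes such as 512.
        return head_size in cls.SUPPORTED_HEAD_SIZES


def pick_backend(head_size: int, candidates) -> str:
    """Return the name of the first backend that accepts this head size."""
    for backend in candidates:
        if backend.supports_head_size(head_size):
            return backend.__name__
    raise ValueError(f"no backend supports head_size={head_size}")
```

With the candidates ordered as `(FlashAttnV100LikeBackend, TritonLikeBackend)`, a 256-dim layer resolves to the first class and a 512-dim layer falls through to the second, which is the mixed per-layer selection the message describes.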

valentijnvenus · 2026-06-18T20:24:52Z

@yangzhuxinyzx is this PR being looked into? It would be helpful to see a community develop around your wonderful V100 initiative.

GabeChurch · 2026-06-18T20:50:26Z

I am also interested, seems like some positive changes here

yangzhuxinyzx and others added 30 commits March 21, 2026 12:23

[Core] Import 1Cat-vLLM-0.0.2 runtime and build system

4683901

[CI/Build] Vendor lmdeploy source for standalone builds

92c6efb

[Kernel] Add validation, examples, and benchmark assets

5262499

[Doc] Publish 1Cat-vLLM-0.0.2 release snapshot

b3b1abd

[Doc] Update rebuilt wheel download links

6fd0f8d

Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)

[Bugfix] Vendor runtime Python packages for source builds

a8783b0

Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)

[CI/Build][Doc] Add verified SM70 Docker runtime path

1e6c257

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

Add files via upload

f29bd45

Change WeChat group QR code image

d6c28dc

Updated the WeChat group QR code image in the README.

Update README.md

18e5223

Add files via upload

3c7a8a3

Update Dockerfile.sm70-wheel

f5d2e15

Fixed the incorrect name

Add files via upload

feb8402

docs: update wechat group qr code

c1dce83

docs: update WeChat group QR code

82f59c8

Release 1Cat-vLLM 0.0.3

92a785c

Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.

Merge 1CatAI main history for 0.0.3

eea9d81

Update README.md

04bb4b7

Update README.md

7a7549c

Update README.md

6276450

Update README.md

a1bf487

docs: clarify wheel runtime directory

197f1cc

[Kernel] Add V100 FA2 fp8 KV cache audits

58ebaa6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Core] Trim V100 startup memory defaults

3b539f9

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

QRcode-update

437b358

[Core] Prepare 1.0.0 V100 release

a4daad6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Update 1.0.0 wheel install and MTP launch

761ae33

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Simplify public launch commands

0741a30

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Restore validated MTP launch profile

36536e5

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Add MTP throughput note

29b73ec

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

yangzhuxinyzx and others added 19 commits May 13, 2026 19:00

[Bugfix] Restore spec proposer compatibility

0ac0632

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Add TP2 MTP launch profile

05ac1a4

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Core] Archive FP8 MTP investigation state

8b536c1

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

bf37452

[Kernel] Add SM70 FP8 MoE fast path

69749dd

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Credit flash-attention-v100

d18b16c

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Bugfix] Stabilize MTP state handling

acd2a31

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

06f7a38

docs: update WeChat group QR code to Group 3

f1a64a7

[Build] Prepare 1Cat-vLLM 1.0.1 release

42f23f6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Build] Prepare 1Cat-vLLM 1.1.0 beta release

a645fcb

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Refocus README on project overview

530ac4d

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

432f197

P7: skip CT KVCacheMethod when kv_cache_scheme is None (allow fp8_e5m2 KV on W4A16 compressed-tensors, V100/SM70)

3355222

Squashed PR 1CatAI#49.

[Bugfix] Default Qwen3 reasoning parser to prompt-has-open-think

7950cdd

Squashed PR 1CatAI#52.

ci: fix CRLF line endings in shell scripts + .gitattributes

385d932

Squashed PR 1CatAI#46.

claude and others added 5 commits June 6, 2026 02:42

revert: roll back fced94d head-size validation override

f5c8e47

docs: add 2026-06-06 pve3 field validation results

ea45f3a

docs: V100 field validation 2026-06-06 (Qwen27B rollup A/B + KV matrix)

308d4c1

docs: add Deckard MTP n=2 ~40 tok/s finding to field validation

6fbf600

rivetphilbot mentioned this pull request Jun 8, 2026

[V100/SM70] gemma-4 (31B + MTP) support: fully-FA hybrid attention (sliding-window + head_dim-512) #59

Closed

yangzhuxinyzx force-pushed the main branch from 63b05fc to 00323f2 on June 15, 2026 02:02

rivetphilbot wants to merge 54 commits into 1CatAI:main from rivetphilbot:rivetphilbot-rollup

9 participants
