[Bugfix] Allow fp8_e5m2 KV cache on W4A16 compressed-tensors models (V100/SM70) by rivetphilbot · Pull Request #49 · 1CatAI/1Cat-vLLM

rivetphilbot · 2026-05-22T16:59:24Z

Summary

CompressedTensorsConfig.get_quant_method() unconditionally registers a CompressedTensorsKVCacheMethod for every Attention layer when the model uses compressed-tensors quantization — even when kv_cache_scheme is None (i.e. the checkpoint does not ship KV cache scales).

That registration makes should_load_quant_weights() at vllm/model_executor/layers/attention/attention.py:166 return True. The classifier then asserts that the quant method is a BaseKVCacheMethod (it is), and the e5m2 guard at attention.py:167 rejects --kv-cache-dtype fp8_e5m2 with:

ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

…despite the checkpoint being a plain W4A16 model with no FP8 scales at all.

On V100 / SM70 this is severe: Triton on SM70 only supports fp8e5 (not fp8e4nv), so fp8_e5m2 is the only FP8 KV path available — and this bug blocks it for every compressed-tensors W4A16 model, of which there are now many in production (Qwen3.5/3.6 variants, DeepSeek W4A16, etc.).

Fix

Short-circuit get_quant_method() for Attention layers when kv_cache_scheme is None. Returning None makes should_load_quant_weights() return False, so the classifier branch is skipped and the user's --kv-cache-dtype choice is honored as a runtime decision rather than treated as a checkpoint property.

12 lines inserted, 0 removed. No behavior change for models that DO ship kv_cache_scheme — they continue to receive CompressedTensorsKVCacheMethod exactly as before.
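The control flow described above can be sketched as a toy model. This is not the vLLM source: the class names mirror those named in this PR, but their bodies, the skip_when_no_scheme switch, and the check_kv_cache_dtype and boot helpers are simplified assumptions made up for illustration.

```python
from typing import Optional


class BaseKVCacheMethod:
    """Stand-in for the KV cache quant method base class."""


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    """Stand-in for the compressed-tensors KV cache method."""


class Attention:
    """Stand-in for an attention layer."""


class ToyCompressedTensorsConfig:
    def __init__(self, kv_cache_scheme: Optional[dict], skip_when_no_scheme: bool):
        self.kv_cache_scheme = kv_cache_scheme
        # Hypothetical switch: False models the old behavior, True models the fix.
        self.skip_when_no_scheme = skip_when_no_scheme

    def get_quant_method(self, layer: object) -> Optional[BaseKVCacheMethod]:
        if isinstance(layer, Attention):
            if self.skip_when_no_scheme and self.kv_cache_scheme is None:
                # The fix: the checkpoint ships no KV scheme, so register nothing.
                return None
            return CompressedTensorsKVCacheMethod()
        return None


def check_kv_cache_dtype(quant_method: Optional[BaseKVCacheMethod],
                         kv_cache_dtype: str) -> str:
    """Simplified model of the guard chain described for attention.py."""
    should_load_quant_weights = isinstance(quant_method, BaseKVCacheMethod)
    if should_load_quant_weights and kv_cache_dtype == "fp8_e5m2":
        raise ValueError(
            "fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
    return kv_cache_dtype


def boot(config: ToyCompressedTensorsConfig, kv_cache_dtype: str) -> str:
    """Resolve the quant method for one attention layer, then run the guard."""
    method = config.get_quant_method(Attention())
    return check_kv_cache_dtype(method, kv_cache_dtype)
```

In this model, a config with kv_cache_scheme=None and skip_when_no_scheme=False raises the ValueError for fp8_e5m2, while the same config with skip_when_no_scheme=True accepts it. A config that carries a scheme still receives CompressedTensorsKVCacheMethod either way.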

Validation

Tested on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration, hybrid linear+full attention) at TP=2 on dual Tesla V100-PCIE-32GB:

✅ Boots cleanly with --kv-cache-dtype fp8_e5m2 --max-model-len 262144 --max-num-seqs 16 --gpu-memory-utilization 0.89
✅ FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback to triton_attn)
✅ Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens, 1.03× concurrency at 262,144
✅ 9-stream concurrent aggregate decode throughput 75 tok/s vs single-stream baseline 20 tok/s (3.75× multiplier)
✅ 128K-token needle-in-haystack at 75% depth retrieved verbatim with default k_scale = v_scale = 1.0 (no calibrated FP8 KV scales needed)
✅ Correctness probes at temp=0 (math, code, factual, multi-step reasoning) all pass
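The throughput multiplier in the list above is plain arithmetic on the two reported figures. A small sketch; the helper name is illustrative, and the per-stream average is derived here rather than reported in the PR:

```python
from typing import Tuple


def decode_scaling(aggregate_tok_s: float, single_tok_s: float,
                   streams: int) -> Tuple[float, float]:
    """Return (aggregate speedup over one stream, average tok/s per stream)."""
    speedup = aggregate_tok_s / single_tok_s
    per_stream = aggregate_tok_s / streams
    return speedup, per_stream


# Figures quoted in the validation list: 75 tok/s over 9 streams vs 20 tok/s.
speedup, per_stream = decode_scaling(75.0, 20.0, 9)
print(f"{speedup:.2f}x aggregate, {per_stream:.2f} tok/s per stream")
# prints: 3.75x aggregate, 8.33 tok/s per stream
```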

Test plan

Boots with --kv-cache-dtype fp8_e5m2 on a W4A16 compressed-tensors model
Boots with --kv-cache-dtype auto (FP16 KV) on the same model — no regression
Long-context retrieval validated at 128K
Maintainers: please verify no regression on a model that DOES ship kv_cache_scheme (e.g. a calibrated FP8 KV checkpoint). The guard tests kv_cache_scheme is None, so registration is preserved when scales are present.
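For the first two test-plan items, a launch command can be assembled from the flags quoted under Validation. A minimal sketch: the model path is a placeholder, and using vllm serve with --tensor-parallel-size 2 is an assumption about how TP=2 was requested, since the PR does not show the full command line.

```python
import shlex
from typing import List


def build_launch_command(model: str, kv_cache_dtype: str) -> List[str]:
    """Assemble an argv list using the flags quoted in the validation notes."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",  # assumed flag for TP=2
        "--kv-cache-dtype", kv_cache_dtype,
        "--max-model-len", "262144",
        "--max-num-seqs", "16",
        "--gpu-memory-utilization", "0.89",
    ]


MODEL = "/path/to/Deckard-40B-W4A16"  # placeholder path, not from the PR

# One command per test-plan item: the FP8 KV path and the FP16 (auto) baseline.
for dtype in ("fp8_e5m2", "auto"):
    print(shlex.join(build_launch_command(MODEL, dtype)))
```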

🤖 Authored by RivetOS Claude (Opus 4.7, 1M context)


valentijnvenus · 2026-05-26T09:31:42Z

nice one!


CompressedTensorsConfig.get_quant_method() unconditionally registers CompressedTensorsKVCacheMethod for every Attention layer when the model uses compressed-tensors, even when kv_cache_scheme is None (i.e. the checkpoint does not ship per-layer KV cache scales). Once that method is registered, should_load_quant_weights() at vllm/model_executor/layers/attention/attention.py:166 returns True, the classifier asserts the quant method is a BaseKVCacheMethod (it is), and the e5m2 guard at attention.py:167 then refuses --kv-cache-dtype fp8_e5m2 with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" - despite the checkpoint being a plain W4A16 model with no FP8 scales at all.

This blocks FP8 KV cache entirely on V100 / SM70 for compressed-tensors W4A16 models, because Triton on SM70 only supports fp8e5 (not fp8e4nv), so fp8_e5m2 is the only FP8 KV path available on this hardware.

Fix: short-circuit get_quant_method() for Attention layers when kv_cache_scheme is None. Returning None lets should_load_quant_weights() return False, the classifier branch is skipped, and the user's --kv-cache-dtype choice (fp8_e5m2 in this case) is honored as a runtime decision rather than a checkpoint property.

Validated on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration) at TP=2 on dual Tesla V100-PCIE-32GB:

- Boots cleanly with --kv-cache-dtype fp8_e5m2, --max-model-len 262144 (native), --max-num-seqs 16, --gpu-memory-utilization 0.89
- FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback to triton_attn): confirmed via per-layer log emit at flash_attn_v100.py:520, :501, :547
- Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens, 1.03x concurrency for 262,144 tokens per request
- 9-stream concurrent aggregate decode throughput 75 tok/s (vs single-stream baseline 20 tok/s = 3.75x multiplier)
- 128K-token needle-in-haystack at 75% depth retrieved verbatim with default k_scale=v_scale=1.0 (no calibrated FP8 KV scales required)
- Correctness probes at temp=0 (math, code, factual, multi-step reasoning) all pass

No behavior change for models that DO ship kv_cache_scheme - they continue to receive CompressedTensorsKVCacheMethod as before.

Co-Authored-By: RivetOS Claude (Opus 4.7, 1M context) <noreply@anthropic.com>


vLLM eagerly raises "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for any FP8-weight checkpoint, blocking the only FP8 KV format that compiles on V100/SM70 (Triton on sm_70 supports fp8e5, not fp8e4nv/e4m3). fp8_e5m2 is scaleless: its 5 exponent bits give enough range that it does not need the k_scale/v_scale machinery the e4m3 path loads from the checkpoint. So for e5m2 we skip the scale-quant setup and use plain e5m2 KV with default 1.0 scales; the e4m3/scaled path is unchanged. Sibling of 1CatAI#49 (same fix for W4A16 compressed-tensors checkpoints). Validated live on 2x V100-PCIE-32GB serving dense FP8 + MTP Qwen3.5-40B at 177K context.
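The dynamic-range argument can be checked with a little arithmetic. A sketch, assuming the usual bit layouts: e5m2 with 5 exponent and 2 mantissa bits, IEEE-style, with the top exponent code reserved for inf/NaN; and e4m3 in its "fn" variant with 4 exponent and 3 mantissa bits, where only the all-ones bit pattern is NaN. The helper below is illustrative and not part of vLLM.

```python
from typing import Tuple


def fp8_range(exp_bits: int, man_bits: int, ieee_like: bool) -> Tuple[float, float]:
    """Return (largest finite value, smallest positive normal value)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is inf/NaN, so the largest usable code is one below.
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # "fn" style: the top exponent code is usable, but the all-ones
        # mantissa at that exponent is NaN, so the largest mantissa is one step lower.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 ** -(man_bits - 1)
    max_finite = max_mantissa * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)
    return max_finite, min_normal


e5m2_max, e5m2_min = fp8_range(5, 2, ieee_like=True)
e4m3_max, e4m3_min = fp8_range(4, 3, ieee_like=False)
print(f"e5m2: max {e5m2_max}, min normal 2^-14 = {e5m2_min}")
print(f"e4m3: max {e4m3_max}, min normal 2^-6 = {e4m3_min}")
```

Under these assumptions e5m2 reaches 57344 against 448 for e4m3, and its smallest normal value is 2^-14 against 2^-6, which is the range gap the commit message relies on when it runs e5m2 with default 1.0 scales.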

…eme) Selective pick from upstream 1Cat-vLLM PR 1CatAI#49 (rivetphilbot:p7-fp8-kv-ct-classifier-fix). CompressedTensorsConfig.get_quant_method() unconditionally registered a CompressedTensorsKVCacheMethod for every Attention layer on CT models, even when kv_cache_scheme is None. That made should_load_quant_weights() classify plain W4A16 checkpoints as 'fp8 checkpoints' and refuse --kv-cache-dtype fp8_e5m2 -- the only FP8 KV path Triton supports on V100/SM70. Short-circuit to None when kv_cache_scheme is None so the user's --kv-cache-dtype is honored. Validated on cyankiwi/Qwen3.6-27B-AWQ-INT4 (CT W4A16, Gated DeltaNet) at TP4 on 4x V100-32GB: boots with --kv-cache-dtype fp8_e5m2 and generates correct output (previously raised ValueError at engine init). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Selective pick from upstream 1Cat-vLLM PR 1CatAI#54 (rivetphilbot:fp8-e5m2-kv-fp8-checkpoints). _init_kv_cache_quant() eagerly raised "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for any fp8-weight checkpoint. On V100/SM70 that left no fp8 KV path at all: Triton on sm_70 supports fp8e5 (e5m2) but not fp8e4nv (e4m3). fp8_e5m2 is scaleless (5 exponent bits give enough dynamic range), so it does not need the k_scale/v_scale machinery the e4m3 path loads from the checkpoint -- skip the scale-quant setup and use plain e5m2 KV with default 1.0 scales. Sibling of PR 1CatAI#49 (W4A16/no-kv-scheme case). Adopted on code-review confidence; end-to-end validation deferred until fp8-weight bring-up on Volta (needs an fp8 checkpoint + SM70 fp8 MoE fast path, upstream 69749dd). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

yangzhuxinyzx and others added 30 commits March 21, 2026 12:23

[Core] Import 1Cat-vLLM-0.0.2 runtime and build system

4683901

[CI/Build] Vendor lmdeploy source for standalone builds

92c6efb

[Kernel] Add validation, examples, and benchmark assets

5262499

[Doc] Publish 1Cat-vLLM-0.0.2 release snapshot

b3b1abd

[Doc] Update rebuilt wheel download links

6fd0f8d

Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)

[Bugfix] Vendor runtime Python packages for source builds

a8783b0

Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)

[CI/Build][Doc] Add verified SM70 Docker runtime path

1e6c257

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

Add files via upload

f29bd45

Change WeChat group QR code image

d6c28dc

Updated the WeChat group QR code image in the README.

Update README.md

18e5223

Add files via upload

3c7a8a3

Update Dockerfile.sm70-wheel

f5d2e15

Fixed the incorrect name

Add files via upload

feb8402

docs: update wechat group qr code

c1dce83

docs: update WeChat group QR code

82f59c8

Release 1Cat-vLLM 0.0.3

92a785c

Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.

Merge 1CatAI main history for 0.0.3

eea9d81

Update README.md

04bb4b7

Update README.md

7a7549c

Update README.md

6276450

Update README.md

a1bf487

docs: clarify wheel runtime directory

197f1cc

[Kernel] Add V100 FA2 fp8 KV cache audits

58ebaa6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Core] Trim V100 startup memory defaults

3b539f9

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

QRcode-update

437b358

[Core] Prepare 1.0.0 V100 release

a4daad6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Update 1.0.0 wheel install and MTP launch

761ae33

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Simplify public launch commands

0741a30

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Restore validated MTP launch profile

36536e5

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Add MTP throughput note

29b73ec

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

yangzhuxinyzx added 10 commits May 13, 2026 19:00

[Bugfix] Restore spec proposer compatibility

0ac0632

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Add TP2 MTP launch profile

05ac1a4

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Core] Archive FP8 MTP investigation state

8b536c1

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

bf37452

[Kernel] Add SM70 FP8 MoE fast path

69749dd

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Credit flash-attention-v100

d18b16c

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Bugfix] Stabilize MTP state handling

acd2a31

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

06f7a38

docs: update WeChat group QR code to Group 3

f1a64a7

[Build] Prepare 1Cat-vLLM 1.0.1 release

42f23f6

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

yangzhuxinyzx and others added 3 commits May 27, 2026 21:17

[Build] Prepare 1Cat-vLLM 1.1.0 beta release

a645fcb

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

[Doc] Refocus README on project overview

530ac4d

Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>

docs: update WeChat group QR code

432f197

This was referenced Jun 1, 2026

[Bugfix] Allow fp8_e5m2 KV cache with FP8 checkpoints (V100/SM70) #54

Open

ci: fix CRLF line endings in shell scripts #46

Closed

[V100/SM70] Rollup: Volta serving stack — W4A16 + FP8 (e5m2/MTP) + reasoning fixes #55

Open

rivetphilbot force-pushed the p7-fp8-kv-ct-classifier-fix branch from b7a5dc3 to 9a7a7cd on June 1, 2026 03:46

rjiangnju pushed a commit to rjiangnju/1Cat-vLLM-FP8 that referenced this pull request Jun 5, 2026

P7: skip CT KVCacheMethod when kv_cache_scheme is None (allow fp8_e5m…

1b0dc06

…2 KV on W4A16 compressed-tensors, V100/SM70) Squashed PR 1CatAI#49.

yangzhuxinyzx force-pushed the main branch from 63b05fc to 00323f2 on June 15, 2026 02:02

rivetphilbot wants to merge 44 commits into 1CatAI:main from rivetphilbot:p7-fp8-kv-ct-classifier-fix


6 participants
