[Bugfix] Allow fp8_e5m2 KV cache on W4A16 compressed-tensors models (V100/SM70)#49
Open
rivetphilbot wants to merge 44 commits into
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Updated the WeChat group QR code image in the README.
Fixed the incorrect name.
Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.
nice one!
This was referenced Jun 1, 2026
CompressedTensorsConfig.get_quant_method() unconditionally registers CompressedTensorsKVCacheMethod for every Attention layer when the model uses compressed-tensors, even when kv_cache_scheme is None (i.e. the checkpoint does not ship per-layer KV cache scales).

Once that method is registered, should_load_quant_weights() at vllm/model_executor/layers/attention/attention.py:166 returns True, the classifier asserts the quant method is a BaseKVCacheMethod (it is), and the e5m2 guard at attention.py:167 then refuses --kv-cache-dtype fp8_e5m2 with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints", despite the checkpoint being a plain W4A16 model with no FP8 scales at all.

This blocks FP8 KV cache entirely on V100 / SM70 for compressed-tensors W4A16 models, because Triton on SM70 only supports fp8e5 (not fp8e4nv), so fp8_e5m2 is the only FP8 KV path available on this hardware.

Fix: short-circuit get_quant_method() for Attention layers when kv_cache_scheme is None. Returning None lets should_load_quant_weights() return False, the classifier branch is skipped, and the user's --kv-cache-dtype choice (fp8_e5m2 in this case) is honored as a runtime decision rather than a checkpoint property.
Validated on Deckard-40B-W4A16 (Qwen3_5ForConditionalGeneration) at TP=2 on dual Tesla V100-PCIE-32GB:
- Boots cleanly with --kv-cache-dtype fp8_e5m2, --max-model-len 262144 (native), --max-num-seqs 16, --gpu-memory-utilization 0.89
- FLASH_ATTN_V100 fast paths engaged for prefill + decode (no fallback to triton_attn): confirmed via per-layer log emit at flash_attn_v100.py:520, :501, :547
- Available KV cache memory 6.45 GiB, GPU KV cache size 68,992 tokens, 1.03x concurrency for 262,144 tokens per request
- 9-stream concurrent aggregate decode throughput 75 tok/s (vs single-stream baseline 20 tok/s = 3.75x multiplier)
- 128K-token needle-in-haystack at 75% depth retrieved verbatim with default k_scale=v_scale=1.0 (no calibrated FP8 KV scales required)
- Correctness probes at temp=0 (math, code, factual, multi-step reasoning) all pass

No behavior change for models that DO ship kv_cache_scheme: they continue to receive CompressedTensorsKVCacheMethod as before.

Co-Authored-By: RivetOS Claude (Opus 4.7, 1M context) <noreply@anthropic.com>
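For anyone reproducing the validation run, the flags listed above can be assembled into a launch command roughly as follows. This is a minimal sketch, not the author's script: the model path is hypothetical, and it assumes this fork keeps the stock `vllm serve` entrypoint and the `--tensor-parallel-size` flag for TP=2.

```python
import shlex

# Hypothetical local path: the thread names the model but not where it lives.
MODEL = "/models/Deckard-40B-W4A16"


def build_serve_command(model: str) -> list[str]:
    """Assemble argv for the configuration reported in the validation notes."""
    return [
        "vllm", "serve", model,              # assumes the stock vLLM CLI entrypoint
        "--tensor-parallel-size", "2",       # TP=2 across the two V100s
        "--kv-cache-dtype", "fp8_e5m2",      # the setting this PR unblocks
        "--max-model-len", "262144",
        "--max-num-seqs", "16",
        "--gpu-memory-utilization", "0.89",
    ]


command = build_serve_command(MODEL)
# Print a copy-pasteable shell line; pass `command` to subprocess.run to launch.
print(shlex.join(command))
```

Keeping the arguments as a list avoids shell-quoting issues if the command is later passed to `subprocess.run`.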
b7a5dc3 to 9a7a7cd
rjiangnju pushed a commit to rjiangnju/1Cat-vLLM-FP8 that referenced this pull request on Jun 5, 2026
…2 KV on W4A16 compressed-tensors, V100/SM70) Squashed PR 1CatAI#49.
rjiangnju pushed a commit to rjiangnju/1Cat-vLLM-FP8 that referenced this pull request on Jun 5, 2026
vLLM eagerly raises "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for any FP8-weight checkpoint, blocking the only FP8 KV format that compiles on V100/SM70 (Triton on sm_70 supports fp8e5, not fp8e4nv/e4m3).

fp8_e5m2 is scaleless: its 5 exponent bits give enough range that it does not need the k_scale/v_scale machinery the e4m3 path loads from the checkpoint. So for e5m2 we skip the scale-quant setup and use plain e5m2 KV with default 1.0 scales; the e4m3/scaled path is unchanged.

Sibling of 1CatAI#49 (same fix for W4A16 compressed-tensors checkpoints). Validated live on 2x V100-PCIE-32GB serving dense FP8 + MTP Qwen3.5-40B at 177K context.
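To put numbers on the range claim above, the finite limits of the two 8-bit formats can be derived from their bit layouts. The sketch below assumes e5m2 follows IEEE-style conventions (top exponent field reserved for inf/NaN) and that e4m3 is the "fn" variant (top exponent field usable, only the all-ones mantissa there is NaN). The helper name `fp8_limits` is made up for illustration. It shows the gap in dynamic range only; whether that range suffices for unscaled KV values is the commit's claim, not something this snippet proves.

```python
def fp8_limits(exp_bits: int, man_bits: int, reserves_top_exponent: bool) -> dict:
    """Finite limits of a 1-sign / exp_bits / man_bits floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    top_field = 2 ** exp_bits - 1
    if reserves_top_exponent:
        # IEEE-style (e5m2): the all-ones exponent field encodes inf/NaN.
        max_exp = (top_field - 1) - bias
        max_frac = 2 - 2 ** -man_bits
    else:
        # "fn" style (e4m3fn): the top exponent is usable, but the
        # all-ones mantissa at that exponent is reserved for NaN.
        max_exp = top_field - bias
        max_frac = 2 - 2 ** -(man_bits - 1)
    return {
        "max": max_frac * 2 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "min_subnormal": 2.0 ** (1 - bias - man_bits),
    }


e5m2 = fp8_limits(5, 2, reserves_top_exponent=True)
e4m3 = fp8_limits(4, 3, reserves_top_exponent=False)

print("e5m2 max:", e5m2["max"])                    # 57344.0
print("e4m3 max:", e4m3["max"])                    # 448.0
print("max ratio:", e5m2["max"] / e4m3["max"])     # 128.0
```

The trade-off runs the other way for precision: e5m2 has two mantissa bits to e4m3's three.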
humanjesse added a commit to humanjesse/vllm-v100 that referenced this pull request on Jun 8, 2026
…eme)

Selective pick from upstream 1Cat-vLLM PR 1CatAI#49 (rivetphilbot:p7-fp8-kv-ct-classifier-fix).

CompressedTensorsConfig.get_quant_method() unconditionally registered a CompressedTensorsKVCacheMethod for every Attention layer on CT models, even when kv_cache_scheme is None. That made should_load_quant_weights() classify plain W4A16 checkpoints as 'fp8 checkpoints' and refuse --kv-cache-dtype fp8_e5m2, the only FP8 KV path Triton supports on V100/SM70. Short-circuit to None when kv_cache_scheme is None so the user's --kv-cache-dtype is honored.

Validated on cyankiwi/Qwen3.6-27B-AWQ-INT4 (CT W4A16, Gated DeltaNet) at TP4 on 4x V100-32GB: boots with --kv-cache-dtype fp8_e5m2 and generates correct output (previously raised ValueError at engine init).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
humanjesse added a commit to humanjesse/vllm-v100 that referenced this pull request on Jun 8, 2026
Selective pick from upstream 1Cat-vLLM PR 1CatAI#54 (rivetphilbot:fp8-e5m2-kv-fp8-checkpoints).

_init_kv_cache_quant() eagerly raised "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for any fp8-weight checkpoint. On V100/SM70 that left no fp8 KV path at all: Triton on sm_70 supports fp8e5 (e5m2) but not fp8e4nv (e4m3). fp8_e5m2 is scaleless (5 exponent bits give enough dynamic range), so it does not need the k_scale/v_scale machinery the e4m3 path loads from the checkpoint: skip the scale-quant setup and use plain e5m2 KV with default 1.0 scales.

Sibling of PR 1CatAI#49 (W4A16/no-kv-scheme case). Adopted on code-review confidence; end-to-end validation deferred until fp8-weight bring-up on Volta (needs an fp8 checkpoint + SM70 fp8 MoE fast path, upstream 69749dd).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
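The decision this sibling change describes can be modeled in a few lines. This is a hypothetical stand-in, not the body of `_init_kv_cache_quant()`: the `KVQuantPlan` type, the function name, and the boolean flag are invented for illustration, and the non-e5m2 branch is simplified to "load scales if the checkpoint is fp8".

```python
from dataclasses import dataclass


@dataclass
class KVQuantPlan:
    """Hypothetical record of what KV-cache quant setup would do."""
    load_checkpoint_scales: bool
    k_scale: float = 1.0
    v_scale: float = 1.0


def plan_kv_cache_quant(kv_cache_dtype: str, checkpoint_is_fp8: bool) -> KVQuantPlan:
    """Model the post-change behaviour described in the commit message."""
    if kv_cache_dtype == "fp8_e5m2":
        # Scaleless path: skip scale loading entirely and keep the
        # default 1.0 scales instead of raising for fp8 checkpoints.
        return KVQuantPlan(load_checkpoint_scales=False)
    # Any other dtype keeps the existing scaled path (simplified here).
    return KVQuantPlan(load_checkpoint_scales=checkpoint_is_fp8)


e5m2_plan = plan_kv_cache_quant("fp8_e5m2", checkpoint_is_fp8=True)
e4m3_plan = plan_kv_cache_quant("fp8_e4m3", checkpoint_is_fp8=True)
print(e5m2_plan)
print(e4m3_plan)
```

The point of the sketch is that the e5m2 branch returns early, so the scaled path is untouched.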
Summary
`CompressedTensorsConfig.get_quant_method()` unconditionally registers a `CompressedTensorsKVCacheMethod` for every `Attention` layer when the model uses compressed-tensors quantization, even when `kv_cache_scheme is None` (i.e. the checkpoint does not ship KV cache scales).

That registration cascades: `should_load_quant_weights()` at `vllm/model_executor/layers/attention/attention.py:166` returns `True`, the classifier asserts the quant method is a `BaseKVCacheMethod` (it is), and the e5m2 guard at `attention.py:167` then refuses `--kv-cache-dtype fp8_e5m2` with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints", despite the checkpoint being a plain W4A16 model with no FP8 scales at all.

On V100 / SM70 this is severe: Triton on SM70 only supports `fp8e5` (not `fp8e4nv`), so `fp8_e5m2` is the only FP8 KV path available. This bug blocks it for every compressed-tensors W4A16 model, of which there are now many in production (Qwen3.5/3.6 variants, DeepSeek W4A16, etc.).

Fix
Short-circuit `get_quant_method()` for `Attention` layers when `kv_cache_scheme is None`. Returning `None` lets `should_load_quant_weights()` return `False`, the classifier branch is skipped, and the user's `--kv-cache-dtype` choice is honored as a runtime decision rather than treated as a checkpoint property.

12 lines inserted, 0 removed. No behavior change for models that DO ship `kv_cache_scheme`: they continue to receive `CompressedTensorsKVCacheMethod` exactly as before.

Validation
Tested on Deckard-40B-W4A16 (`Qwen3_5ForConditionalGeneration`, hybrid linear+full attention) at TP=2 on dual Tesla V100-PCIE-32GB:
- Boots cleanly with `--kv-cache-dtype fp8_e5m2 --max-model-len 262144 --max-num-seqs 16 --gpu-memory-utilization 0.89`
- `FLASH_ATTN_V100` fast paths engaged for prefill + decode (no fallback to `triton_attn`)
- 128K-token needle-in-haystack at 75% depth retrieved verbatim with default `k_scale = v_scale = 1.0` (no calibrated FP8 KV scales needed)
- Correctness probes at `temp=0` (math, code, factual, multi-step reasoning) all pass

Test plan
- `--kv-cache-dtype fp8_e5m2` on a W4A16 compressed-tensors model
- `--kv-cache-dtype auto` (FP16 KV) on the same model: no regression
- A model that ships `kv_cache_scheme` (e.g. a calibrated FP8 KV checkpoint). The guard is `is None`, so registration is preserved when scales are present.

🤖 Authored by RivetOS Claude (Opus 4.7, 1M context)
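The short-circuit described under Fix can be illustrated with a small self-contained model. Every class and function below is a simplified stand-in that only borrows names from the description; the real vLLM signatures and the actual 12-line diff are not reproduced here, and `check_kv_cache_dtype` is a hypothetical helper that folds the classifier and the e5m2 guard into one function.

```python
from typing import Optional


class Attention:
    """Stand-in for an attention layer."""


class CompressedTensorsKVCacheMethod:
    """Stand-in for the KV-cache quant method that loads checkpoint scales."""

    def __init__(self, config: "CompressedTensorsConfig") -> None:
        self.config = config


class CompressedTensorsConfig:
    def __init__(self, kv_cache_scheme: Optional[dict] = None) -> None:
        self.kv_cache_scheme = kv_cache_scheme

    def get_quant_method(self, layer: object) -> Optional[CompressedTensorsKVCacheMethod]:
        if isinstance(layer, Attention):
            # The fix: no KV scheme in the checkpoint means no KV quant method.
            if self.kv_cache_scheme is None:
                return None
            return CompressedTensorsKVCacheMethod(self)
        return None  # linear and other layer types are omitted from this sketch


def should_load_quant_weights(quant_method: Optional[object]) -> bool:
    """Simplified: a registered method implies checkpoint scales to load."""
    return quant_method is not None


def check_kv_cache_dtype(config: CompressedTensorsConfig, kv_cache_dtype: str) -> str:
    """Hypothetical helper modelling the classifier plus the e5m2 guard."""
    method = config.get_quant_method(Attention())
    if should_load_quant_weights(method) and kv_cache_dtype == "fp8_e5m2":
        raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints")
    return kv_cache_dtype


# Plain W4A16 checkpoint: no kv_cache_scheme, so the runtime choice is honored.
w4a16 = CompressedTensorsConfig(kv_cache_scheme=None)
accepted = check_kv_cache_dtype(w4a16, "fp8_e5m2")

# Checkpoint that ships a KV scheme (contents made up): the guard still fires.
calibrated = CompressedTensorsConfig(kv_cache_scheme={"num_bits": 8, "type": "float"})
try:
    check_kv_cache_dtype(calibrated, "fp8_e5m2")
    guard_fired = False
except ValueError:
    guard_fired = True

print(accepted, guard_fired)
```

The second case mirrors the third test-plan item: because the check is `is None`, a checkpoint with a KV scheme takes the same path it did before the change.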