[Bugfix] Allow fp8_e5m2 KV cache with FP8 checkpoints (V100/SM70)#54
Open
rivetphilbot wants to merge 44 commits into
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit f8e4c58adad5561ab4cd006fdab6c9b1903eec1c)
Signed-off-by: Pan-Shuhan-YMZX <263558224+Pan-Shuhan-YMZX@users.noreply.github.com> (cherry picked from commit 2fc562b8cfae2bb255baf097e0c71b498860c327)
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
Updated the WeChat group QR code image in the README.
Fixed the incorrect name
Add FLASH_ATTN_V100 runtime path, Qwen3.5/Qwen3.6 launch profiles, SM70 AWQ updates, vendored build dependencies, and public regression charts.
Signed-off-by: yangzhuxinyzx <153831768+yangzhuxinyzx@users.noreply.github.com>
vLLM eagerly raises "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for any FP8-weight checkpoint, blocking the only FP8 KV format that compiles on V100/SM70 (Triton on sm_70 supports fp8e5, not fp8e4nv/e4m3). fp8_e5m2 is scaleless: its 5 exponent bits give enough range that it does not need the k_scale/v_scale machinery the e4m3 path loads from the checkpoint. So for e5m2 we skip the scale-quant setup and use plain e5m2 KV with default 1.0 scales; the e4m3/scaled path is unchanged. Sibling of 1CatAI#49 (same fix for W4A16 compressed-tensors checkpoints). Validated live on 2x V100-PCIE-32GB serving dense FP8 + MTP Qwen3.5-40B at 177K context.
This was referenced Jun 1, 2026
humanjesse added a commit to humanjesse/vllm-v100 that referenced this pull request on Jun 8, 2026
Selective pick from upstream 1Cat-vLLM PR 1CatAI#54 (rivetphilbot:fp8-e5m2-kv-fp8-checkpoints). _init_kv_cache_quant() eagerly raised "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for any fp8-weight checkpoint. On V100/SM70 that left no fp8 KV path at all: Triton on sm_70 supports fp8e5 (e5m2) but not fp8e4nv (e4m3). fp8_e5m2 is scaleless (5 exponent bits give enough dynamic range), so it does not need the k_scale/v_scale machinery the e4m3 path loads from the checkpoint -- skip the scale-quant setup and use plain e5m2 KV with default 1.0 scales. Sibling of PR 1CatAI#49 (W4A16/no-kv-scheme case). Adopted on code-review confidence; end-to-end validation deferred until fp8-weight bring-up on Volta (needs an fp8 checkpoint + SM70 fp8 MoE fast path, upstream 69749dd). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
`Attention` eagerly raises `fp8_e5m2 kv-cache is not supported with fp8 checkpoints` for any FP8-weight checkpoint. On V100/SM70 that blocks the only FP8 KV format that compiles: Triton on `sm_70` supports `fp8e5` (e5m2) but not `fp8e4nv` (e4m3), so e4m3 KV is unavailable and this guard removes e5m2 too, leaving no FP8 KV path at all for FP8 checkpoints on Volta.
Why the guard is too broad
`fp8_e5m2` is scaleless: its 5 exponent bits give enough dynamic range that it does not need the `k_scale`/`v_scale` machinery the e4m3 path loads from the checkpoint. For e5m2 we can skip the scale-quant setup and use plain e5m2 KV with the default 1.0 scales.
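To make the dynamic-range argument concrete, the sketch below computes the largest finite value of each 8-bit layout from its bit widths. It assumes the common FP8 encodings (e5m2 reserves the top exponent code for inf/NaN; the e4m3 variant reserves only the all-ones pattern for NaN, as in `torch.float8_e4m3fn`). The helper name `fp8_max_finite` is made up for illustration and is not part of vLLM or Triton.

```python
def fp8_max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a 1-sign / exp_bits / man_bits float.

    ieee_like=True : the all-ones exponent is reserved for inf/NaN (e5m2).
    ieee_like=False: only exponent+mantissa all-ones is NaN (e4m3 "fn" style),
                     so the top exponent is usable with one fewer mantissa code.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2.0 ** -man_bits
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2.0 ** -(man_bits - 1)
    return max_mantissa * 2.0 ** max_exp


e5m2_max = fp8_max_finite(5, 2, ieee_like=True)
e4m3_max = fp8_max_finite(4, 3, ieee_like=False)

print(f"e5m2 max finite: {e5m2_max}")
print(f"e4m3 max finite: {e4m3_max}")
print(f"range ratio:     {e5m2_max / e4m3_max}")
```

The wider exponent range is what lets e5m2 hold unscaled K/V activations. The narrower e4m3 range is why that path depends on per-tensor scales.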
Change
Only the e5m2 branch is special-cased; the e4m3/scaled path is unchanged (moved under `else`). One-function change in `Attention._init_kv_cache_quant` equivalent.
Relation to #49
This is the FP8-checkpoint sibling of #49, which lifted the same restriction for
W4A16 compressed-tensors checkpoints. Same root cause (an over-broad eager
guard), different checkpoint class.
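The shape of the fix shared by both PRs can be sketched as follows. This is a simplified stand-in, not the actual vLLM code: `old_guard`, `new_guard`, and `KVQuantSetup` are hypothetical names, and the real function works on layer and quant-config objects rather than plain strings and booleans.

```python
from dataclasses import dataclass

ERR = "fp8_e5m2 kv-cache is not supported with fp8 checkpoints"


@dataclass
class KVQuantSetup:
    # Hypothetical result type: whether checkpoint scales are loaded,
    # and the scales used when they are not.
    load_checkpoint_scales: bool
    k_scale: float = 1.0
    v_scale: float = 1.0


def old_guard(kv_cache_dtype: str, checkpoint_is_fp8: bool) -> KVQuantSetup:
    """Before: any FP8 checkpoint combined with e5m2 KV is rejected eagerly."""
    if checkpoint_is_fp8 and kv_cache_dtype == "fp8_e5m2":
        raise ValueError(ERR)
    return KVQuantSetup(load_checkpoint_scales=checkpoint_is_fp8)


def new_guard(kv_cache_dtype: str, checkpoint_is_fp8: bool) -> KVQuantSetup:
    """After: e5m2 skips the scale setup; everything else is unchanged."""
    if kv_cache_dtype == "fp8_e5m2":
        # Scaleless path: keep the default 1.0 scales, load nothing.
        return KVQuantSetup(load_checkpoint_scales=False)
    else:
        # e4m3/scaled path, same behaviour as before.
        return KVQuantSetup(load_checkpoint_scales=checkpoint_is_fp8)


setup = new_guard("fp8_e5m2", checkpoint_is_fp8=True)
print(setup)
```

Restricting the special case to the e5m2 branch keeps the change small and leaves the scaled path's behaviour untouched.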
Validation
Running live on 2× V100-PCIE-32GB (`FLASH_ATTN_V100`, TP=2) serving a dense FP8 + MTP Qwen3.5-40B model at 177K context: FP8 weights + `fp8_e5m2` KV coexist and generate coherently. fp8 KV roughly halves the cache footprint, which is what makes the long context fit on 32 GB cards.
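The halving claim follows from element size alone, as this back-of-the-envelope sketch shows. The layer count, KV head count, and head dimension are placeholder values chosen for illustration, not the real configuration of the model above. The formula also assumes a plain attention KV cache, with one K and one V tensor per layer.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Total KV cache size: one K and one V vector per token, per layer."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem


# Placeholder dimensions (assumptions, not taken from the PR).
TOKENS = 177_000   # "177K context", read here as 177,000 tokens
LAYERS = 48
KV_HEADS = 8
HEAD_DIM = 128

fp16_bytes = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

GIB = 1024 ** 3
print(f"fp16 KV: {fp16_bytes / GIB:.1f} GiB")
print(f"fp8  KV: {fp8_bytes / GIB:.1f} GiB")
print(f"ratio:   {fp8_bytes / fp16_bytes}")
```

Whatever the real dimensions are, the ratio depends only on the bytes per element, so an 8-bit cache takes half the space of a 16-bit one for the same context length.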