Skip to content

build(deps): update vllm requirement from >=0.21.0 to >=0.22.0 in /engines/stt-transcribe/vllm-asr#268

Open
dependabot[bot] wants to merge 1 commit into
mainfrom
dependabot/pip/engines/stt-transcribe/vllm-asr/vllm-gte-0.22.0
Open

build(deps): update vllm requirement from >=0.21.0 to >=0.22.0 in /engines/stt-transcribe/vllm-asr#268
dependabot[bot] wants to merge 1 commit into
mainfrom
dependabot/pip/engines/stt-transcribe/vllm-asr/vllm-gte-0.22.0

Conversation

@dependabot

@dependabot dependabot Bot commented on behalf of github Jun 4, 2026

Copy link
Copy Markdown
Contributor

Updates the requirements on vllm to permit the latest version.

Release notes

Sourced from vllm's releases.

v0.22.0

Highlights

This release features 459 commits from 230 contributors (63 new)!

  • DeepSeek V4 maturity: DeepSeek V4 received a major hardening pass this cycle — the model was reorganized into a dedicated vllm/models/deepseek_v4/ package (#43004, #43039, #43073, #43077, #43149), gained NVFP4 fused MoE support (#42209), full + piecewise CUDA graph (#42604), and MTP speculative decoding (#43385). A large set of fused kernels (MegaMoE, mhc, Q-norm, indexer, sparse MLA) and ROCm parity fixes landed alongside accuracy fixes (#42810, #43710).
  • Model Runner V2 advances toward default: MRv2 is now default for Qwen3 dense models. vLLM will fall back to MRv1 for features that aren't yet supported in MRv2 (#39337). sleep-mode weight reload (#42673), update_config (#42783), and shared KV-cache layers (#35045), plus many correctness fixes.
  • Experimental Rust frontend: A new Rust front-end integration landed (#40848), with the implementation moved into the tree (#43283) and a DP Supervisor for data-parallel serving (#40841).
  • Batch invariance, faster: Batch-invariant inference gained Cutlass FP8 support for a 28.9% end-to-end latency improvement (#40408), compile-mode support on SM80 (#42456), and an NVFP4 Cutlass linear path (#39912).
  • Multi-tier KV cache offloading: A new multi-tier KV cache offloading framework (#40020) with a Python filesystem secondary tier (#41735), DSv4 support (#43142), and Mooncake disk offloading (#42689) extends offloading beyond CPU memory.

Model Support

  • New architectures: MiniCPM-V 4.6 (#41254), InternS2 Preview (#42705), OpenVLA (#42654), MolmoWeb hf_overrides docs (#42163); EXAONE-4.5 aligned with Transformers update (#42246).
  • Speculative decoding: custom callable proposer backend (#39487), post-norm EAGLE-3 speculators (#42764), peagle speculators (#41826), hybrid-attention models in extract_hidden_states (#39949), non-MTP speculation for NemotronH (#43130), shared MTP weights in MRv2 (#42538).
  • DeepSeek V4: NVFP4 MoE (#42209), CUDA graph full/piecewise (#42604), MTP (#43385), model package refactor (#43004, #43039, #43073, #43077), sparse MLA + compressor refactor (#43149, #43710), MegaMoE input-prep kernel move (#43632).
  • Qwen3.5/3.6: GDN output-projection flatten (#42311), GatedDeltaNet Marlin TP≥2 fix (#36329), ViT full CUDA graph (#42151), runai-streamer weight loading for Qwen3.5/MTP/Qwen3-VL (#42521, #42716), KDA chunk-prefill exp2 semantics (#43195).
  • Gemma3/Gemma4: mixed-resolution image co-batching crash fix (#42217), MoE routing closure fix (#42250), tool-parser float-corruption fix (#42128), batched vision encoder for image/video (#43169), multi-GPU fix (#42630).
  • Kimi-K2.5: skip vision-tower dtype conversion under quantization (#42869), mm_projector dtype fix (#42081).
  • Cohere: enable Cohere MoE (#43143), pipeline parallelism for Cohere vision (#42819).
  • Tool calling: Apertus tool parser (#41154), Qwen3Coder anyOf/oneOf/$ref resolution re-land (#37831), shared coerce_to_schema_type across MiniMax-M2 / DeepSeek-V3.2 / Seed-OSS parsers (#43006, #43019, #43140).
  • ViT CUDA graph: Qwen2-VL (#41736), Step3-VL encoder (#42224), Qwen3.5 (#42151), FlashInfer metadata for Qwen2.5-VL vision attention (#42787).

Engine Core

  • Model Runner V2: Qwen3-dense-by-default oracle (#39337), sleep-mode reload weights (#42673), update_config (#42783), shared KV-cache layers (#35045), FP32 gumbel sampling (#41775), auto-fallback to MRv1 with connectors (#42955), logprob_token_ids correctness (#43125, #41761), prompt-logprobs size fix (#42778).
  • KV offloading: multi-tier framework (#40020), Python filesystem secondary tier (#41735), DSv4 support (#43142), tier-offload follow-up (#42529), prefer HND layout (#41928), reset_cache() (#41956), per-request tracking (#42507), store-deferral fix (#41945).
  • MoE refactor: ExpertMapManager (#41046), experts moved to experts/ (#42334), RoutedExperts alias for FusedMoE (#40735), EPLB refactoring for FusedMoE (#41055).
  • Mamba: attention module refactor (#41126), Mamba2 SSD kernel warmup (#39822), bf16 SSM cache (#41680), GPU-side state postprocessing fused kernel (#40172), run single-token extends as decodes (#42430).
  • KV events: emit KV cache metadata (#40984).
  • Allocator: manual cumem allocator enable (#33648), stream-aware free callback (#43020).
  • elastic-EP: stage/commit MoE quant method on reconfigure (#40881).

Hardware & Performance

  • NVIDIA Blackwell / SM12x: FlashInfer b12x MoE + FP4 GEMM for SM120/121 (#40082), per-tensor FP8 CUTLASS on SM12.1 (#41215), head_dim=512 for FlashInfer TRTLLM attention (#38822), FlashInfer Blackwell GDN prefill (#40717), GDN prefill kernel for SM100 (#43273).
  • Performance: batch-invariant Cutlass FP8 (+28.9% E2E) (#40408), CutlassFP8 padding pre-processing (+13.5% TTFT) (#42651), padded NVFP4 quant kernel (+2.4–5.7% E2E) (#42774), GPU<->CPU sync elimination 1/n (#41429) and 4/n (#42347), fused RoPE+KVCache+q_concat for MLA (#40392), MLA compute_prefill_context / _v_up_proj optimizations (#42460, #42561), penalties Triton kernel (#40657), do_not_specialize in fused FP8 RoPE (#42849), FULL CUDA graph capture for TRITON_MLA decode (#42885).
  • AMD ROCm: DSV4 functionality + accuracy fixes (#42810, #43679 Tilelang MHC), flash sparse MLA Triton kernels (#41812), gluon paged MQA logits on gfx950/MI355X (#42062), RMSNorm+Quant fusion for gfx950 (#41825), AITER FA backend cleanup (#41942), XGMI backend for MoRI connector (#41753), QuickReduce min-size override (#41675), DSV4 MTP (#43385).
  • CPU / RISC-V: RVV-optimized attention kernels for RISC-V Vector Extension (#40119) with VLEN=256 (#42943), fused GDN for AMX CPU (#42707), MXFP4 W4A16 MoE (#41922), experimental Triton + MRv2 on CPU (#43225), improved CPU thread utilization (#42666), --cpu-distributed-timeout-seconds (#42968).
  • Intel XPU: GPTQ int4 support (#37844), mxfp8 MoE (#41918), FP8 block-scaled quantization (#42952), custom-op collective behavior (#41354), multiple sparse-attention kernels (#37888), MoE topk routing + MXFP4 fallback (#42951), CT W4A4 MXFP4 path (#38896), reduced XPU MoE host overhead (#42915).
  • Kernel ABI: continued migration to libtorch stable ABI — 5/n (#42339), 6/n (#42663), 7/n (#43209).
  • Experimental: breakable CUDA graph (#42304).

Large Scale Serving

  • Disaggregated serving (NIXL): lease-renewal TTL for KV blocks on P (#41383), handshake-failure policy honoring (#40364), GDN support for PD with NIXL (#41869), multi-node TP>8 fix (#39907), side-channel host-selection fix (#41806).
  • Mooncake: disk offloading in MooncakeStoreConnector (#42689), HMA support for DSV4 (#42828), operation metrics (#43392), load-failure propagation (#42788), block-aligned full hits (#43494), finish-after-preemption handling (#43281).
  • Data parallel: DP Supervisor (#40841), publish request counts at engine-step start (#41626), forward X-data-parallel-rank header (#42330).
  • EPLB: change default EPLB communicator (#43110), VLM-wrapper init fix (#39805), remove dead torch.accelerator.synchronize() (#40733).
  • LoRA: one-shot Triton kernel for MoE LoRA (#42290), simultaneous 2D & 3D MoE LoRA adapters (#42242), reduced 2D-weight memory under EP (#42737), MoE LoRA align-kernel grid fix (#40131).

Quantization

  • MXFP4: linear layers + compressed-tensors integration (#41664), CPU W4A16 MoE (#41922), XPU mxfp8 MoE (#41918).
  • NVFP4: DeepSeek V4 fused MoE (#42209), ModelOpt W4A16 NVFP4 fused MoE + mixed-precision dispatch (#42566), batch-invariant NVFP4 Cutlass linear (#39912), FlashInfer TRTLLM NvFP4 monolithic MoE routing fix (#43223), TRTLLM NVFP4 MoE chunking fix (#43599).

... (truncated)

Commits
  • 0b3ba88 Revert "[CPU] Experimentally enable Triton and MRV2 (#43225)"
  • 799c3af [BugFix] Fix hard-coded timeout for multi-API-server startup (#43768)
  • 64e2523 [Bugfix] Pass routed_scaling_factor to FlashInfer TRTLLM BF16 MoE (#43769)
  • a147dd0 [ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc (#43679)
  • 0759293 [Bugfix][Kernel] TRTLLM NVFP4 MoE chunking (#43599)
  • a930f5a Fix RunAI streamer tensor buffer reuse during weight loading (#43464)
  • 40cf020 Fix early CUDA init (#43791)
  • 8c40613 [misc] Bump cutedsl version to 4.5.2 (#43745)
  • 5ebdf47 [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43...
  • a94cd6d [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719)
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Updates the requirements on [vllm](https://github.com/vllm-project/vllm) to permit the latest version.
- [Release notes](https://github.com/vllm-project/vllm/releases)
- [Changelog](https://github.com/vllm-project/vllm/blob/main/RELEASE.md)
- [Commits](vllm-project/vllm@v0.21.0...v0.22.0)

---
updated-dependencies:
- dependency-name: vllm
  dependency-version: 0.22.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot @github

dependabot Bot commented on behalf of github Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Labels

The following labels could not be found: dependencies, engines, python. Please create them before Dependabot can add them to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants