chore(deps): update dependency vllm to v0.19.0 [security] #791
Open
renovate[bot] wants to merge 1 commit into main from
This PR contains the following updates:
vllm: `0.18.0` → `0.19.0`

Warning
Some dependencies could not be looked up. Check the Dependency Dashboard for more information.
GitHub Vulnerability Alerts
CVE-2026-34756
Summary
A Denial of Service vulnerability exists in the vLLM OpenAI-compatible API server. Due to the lack of upper-bound validation on the `n` parameter in the `ChatCompletionRequest` and `CompletionRequest` Pydantic models, an unauthenticated attacker can send a single HTTP request with an astronomically large `n` value. This completely blocks the Python `asyncio` event loop and causes immediate Out-Of-Memory crashes by allocating millions of request object copies on the heap before the request even reaches the scheduling queue.

Details
The root cause of this vulnerability lies in the missing upper bound checks across the request parsing and asynchronous scheduling layers:
- In `vllm/entrypoints/openai/chat_completion/protocol.py`, the `n` parameter is defined simply as an integer without any `pydantic.Field` constraints for an upper bound.
- When the API request is converted to internal `SamplingParams` in `vllm/sampling_params.py`, the `_verify_args` method only checks the lower bound (`self.n < 1`), entirely omitting an upper-bound check.
- When the malicious request reaches the core engine (`vllm/v1/engine/async_llm.py`), the engine attempts to fan out the request `n` times to generate identical independent sequences within a synchronous loop.
- Because Python's `asyncio` runs on a single thread and event loop, this monolithic `for`-loop monopolizes the CPU thread. The server stops responding to all other connections (including liveness probes). Simultaneously, the memory allocator is overwhelmed by cloning millions of request object instances via `copy(request)`, driving the host's Resident Set Size (RSS) up by gigabytes per second until the OS OOM-killer terminates the vLLM process.

Impact
Vulnerability Type: Resource Exhaustion / Denial of Service
Impacted Parties:
- Any deployment exposing the OpenAI-compatible API server (`vllm.entrypoints.openai.api_server`), which happens to be the primary entrypoint for OpenAI-compatible setups.

Because this vulnerability exploits the control plane rather than the data plane, an unauthenticated remote attacker can achieve a high success rate in taking down production inference hosts with a single HTTP request. This effectively circumvents any hardware-level capacity planning and conventional bandwidth stress limitations.
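As a sketch of the missing validation described above (the `MAX_N` constant and `verify_n` helper are illustrative, not vLLM's actual code), an upper-bound check next to the existing lower-bound check would reject an oversized `n` before any fan-out happens:

```python
# Hypothetical sketch of bounds-checking the sampling parameter `n`.
# vLLM's actual `_verify_args` rejects only n < 1; adding an upper bound
# rejects abusive values before the engine clones the request n times.
MAX_N = 128  # illustrative limit, not vLLM's real value

def verify_n(n: int) -> None:
    if n < 1:
        raise ValueError(f"n must be at least 1, got {n}.")
    if n > MAX_N:
        raise ValueError(f"n must be at most {MAX_N}, got {n}.")
```

Because the check runs at parameter-verification time, the oversized request is refused before it reaches the scheduling queue or the `asyncio` event loop.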
CVE-2026-34753
Summary
A Server Side Request Forgery (SSRF) vulnerability in `download_bytes_from_url` allows any actor who can control batch input JSON to make the vLLM batch runner issue arbitrary HTTP/HTTPS requests from the server, without any URL validation or domain restrictions. This can be used to target internal services (e.g. cloud metadata endpoints or internal HTTP APIs) reachable from the vLLM host.
Details
Vulnerable component
The vulnerable logic is in the batch runner entrypoint `vllm/entrypoints/openai/run_batch.py`, function `download_bytes_from_url`.

Key properties:

- It accepts multiple URL schemes (`data`, `http`, `https`).
- For `http`/`https`, it directly calls `session.get(url)` on the provided string.
- The multimodal media path has its own connector (`MediaConnector`), which implements an explicit domain allowlist; `download_bytes_from_url` does not reuse that protection.

URL controllability
The `url` argument is fully controlled by batch input JSON via the `file_url` field of `BatchTranscriptionRequest`/`BatchTranslationRequest`. There is no restriction on the domain, IP, or port of `file_url` in these models.

The batch runner reads each line of the input file (`args.input_file`), parses it as JSON, and constructs a `BatchTranscriptionRequest`/`BatchTranslationRequest`. Whatever `file_url` appears in that JSON line becomes `batch_request_body.file_url`, which is passed directly into `download_bytes_from_url`.

So the data flow is:
1. The attacker supplies an input line whose JSON body contains an arbitrary `body.file_url`.
2. `BatchRequestInput`/`BatchTranscriptionRequest`/`BatchTranslationRequest` parse that JSON and store `file_url` verbatim.
3. `make_transcription_wrapper` calls `download_bytes_from_url(batch_request_body.file_url)`.
4. `download_bytes_from_url`'s HTTP/HTTPS branch issues `aiohttp.ClientSession().get(url)` to that attacker-controlled URL with no further validation.

This is a classic SSRF pattern: a server-side component makes arbitrary HTTP requests to a URL string taken from untrusted input.
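For illustration, a single malicious batch input line could look like the following. The JSON shape is inferred from the model names above and is an assumption, not vLLM's exact schema; the URL is an arbitrary attacker choice targeting a cloud metadata endpoint:

```python
import json

# One line of a hypothetical batch input file: the attacker points
# file_url at an internal metadata service, and the batch runner
# would fetch it verbatim with no domain restriction.
malicious_line = json.dumps({
    "custom_id": "req-1",
    "url": "/v1/audio/transcriptions",
    "body": {"file_url": "http://169.254.169.254/latest/meta-data/"},
})
```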
Comparison with safer code
The project already contains a safer URL-handling path for multimodal media in `vllm/multimodal/media/connector.py`, which demonstrates the intent to mitigate SSRF via domain allowlists and URL normalization.
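A minimal sketch of such an allowlist check (the `ALLOWED_DOMAINS` set and `validate_url` helper are hypothetical names, not vLLM's actual `MediaConnector` API):

```python
from urllib.parse import urlparse

# Illustrative domain allowlist in the spirit of the MediaConnector
# protection: only explicitly approved hosts may be fetched.
ALLOWED_DOMAINS = {"cdn.example.com"}  # assumed example domain

def validate_url(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_DOMAINS:
        raise ValueError(f"domain not allowlisted: {parsed.hostname!r}")
    return url
```

Calling a validator like this before `session.get(url)` would block requests to internal services such as `169.254.169.254`.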
`download_bytes_from_url` does not reuse this allowlist or any equivalent validation, even though it also fetches user-provided URLs.

CVE-2026-34755
Summary
The `VideoMediaIO.load_base64()` method at `vllm/multimodal/media/video.py:51-62` splits `video/jpeg` data URLs by comma to extract individual JPEG frames, but does not enforce a frame-count limit. The `num_frames` parameter (default: 32), which is enforced by the `load_bytes()` code path at lines 47-48, is completely bypassed in the `video/jpeg` base64 path. An attacker can send a single API request containing thousands of comma-separated base64-encoded JPEG frames, causing the server to decode all frames into memory and crash with OOM.
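A minimal sketch of the missing limit (the `split_frames` helper is hypothetical, not vLLM's actual code): capping the comma-split frame list at `num_frames` before any decoding would neutralize the amplification described below.

```python
# Hypothetical fix sketch for the video/jpeg base64 path: enforce the
# same num_frames limit that load_bytes() already respects, before any
# base64 decoding or numpy allocation takes place.
def split_frames(data: str, num_frames: int = 32) -> list[str]:
    frames = data.split(",")
    if len(frames) > num_frames:
        raise ValueError(
            f"video/jpeg payload has {len(frames)} frames; "
            f"limit is {num_frames}")
    return frames
```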
Vulnerable code
The `load_bytes()` path (lines 47-48) properly delegates to a video loader that respects `self.num_frames` (default 32). The `load_base64("video/jpeg", ...)` path bypasses this limit entirely: `data.split(",")` produces an unbounded list, and every frame is decoded into a numpy array.

video/jpeg is part of vLLM's public API
`video/jpeg` is a vLLM-specific MIME type, not IANA-registered. However, it is part of the public API surface:

- `encode_video_url()` at `vllm/multimodal/utils.py:96-108` generates `data:video/jpeg;base64,...` URLs.
- `tests/entrypoints/openai/test_video.py:62` and `tests/entrypoints/test_chat_utils.py:153` both use this format.

Memory amplification
Each JPEG frame decodes to a full numpy array. For 640x480 RGB images, each frame is ~921 KB decoded (640 × 480 × 3 bytes); 5000 frames is ~4.6 GB. `np.stack()` then creates an additional copy. The compressed JPEG payload is small (~100 KB for 5000 frames) but decompresses to gigabytes.

Data flow
`connector.py:91` uses `split(",", 1)`, which splits on only the first comma. All remaining commas stay in `data` and are later split by `video.py:54`.

Comparison with existing protections
| Code path | Frame limit |
|---|---|
| `load_bytes()` (binary video) | `num_frames` (default 32) |
| `load_base64("video/jpeg", ...)` | none; `data.split(",")` is unbounded |

Release Notes
vllm-project/vllm (vllm)
v0.19.0 (Compare Source)
vLLM v0.19.0
Highlights
This release features 448 commits from 197 contributors (54 new)!
`transformers>=5.5.0` is required. We recommend using the pre-built docker image `vllm/vllm-openai:gemma4` for out-of-the-box usage.

Model Support
`--lora-target-modules` to restrict LoRA to specific modules (#34984), `language_model_only` respected (#37375), Mistral3 fix (#36928), Qwen3.5 fix (#36976), out-of-tree ops replacement (#37181).

Engine Core
`--speculative-config` (#37880), Eagle3 drafter `quant_config` propagation (#37280), Eagle3 `norm_before_fc` propagation (#38111).

Hardware & Performance
Large Scale Serving
Quantization
API & Frontend
`/v1/chat/completions/batch` for batched chat completions (#38011). `--lora-target-modules` (#34984), `-sc` shorthand for `--speculative-config` (#38380). `--calculate-kv-scales` (#37201), `score` task (#37537), pooling multi-task support (#37956), `reasoning_content` message field removed (#37480).

Security
`VLLM_MAX_N_SEQUENCES` environment variable to enforce sequence limits (#37952).

Dependencies
V0 Deprecation
`--disable-frontend-multiprocessing` (#37612).

New Contributors
v0.18.1 (Compare Source)
This is a patch release on top of v0.18.0 to address a few issues:
Configuration
📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.