Add MiniMax-M2.5 B200 FP4/NVFP4 serving recipe #320

faradawn wants to merge 1 commit into vllm-project:main
Conversation
Add B200 FP4 serving configurations to MiniMax-M2.5.md based on SemiAnalysisAI/InferenceX#996. Includes base TP=2 config and TP=4 with expert parallelism for higher concurrency, both using FP8 KV cache. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Code Review
This pull request adds a new section to the MiniMax-M2.5 documentation for B200 GPUs, providing command examples for the FP4 quantized model. The review feedback correctly identifies that the vllm serve commands are missing essential tool-calling and reasoning parser flags required for the model's specialized features. Additionally, the non-standard --no-enable-prefix-caching flag should be removed to ensure the commands are valid and prevent startup errors.
```bash
vllm serve nvidia/MiniMax-M2.5-NVFP4 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --max-cudagraph-capture-size 2048 \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --trust-remote-code
```
The `vllm serve` command is missing the tool-calling and reasoning parser flags (`--tool-call-parser`, `--reasoning-parser`, and `--enable-auto-tool-choice`) that the other MiniMax-M2.5 examples in this file include. Without them, the model's tool-calling and reasoning capabilities will not function correctly. Additionally, `--no-enable-prefix-caching` should be removed so the command matches the other configurations in this guide.
Suggested change:

```diff
 vllm serve nvidia/MiniMax-M2.5-NVFP4 \
     --tensor-parallel-size 2 \
     --gpu-memory-utilization 0.90 \
     --kv-cache-dtype fp8 \
     --max-cudagraph-capture-size 2048 \
     --stream-interval 20 \
-    --no-enable-prefix-caching \
+    --tool-call-parser minimax_m2 \
+    --reasoning-parser minimax_m2_append_think \
+    --enable-auto-tool-choice \
     --trust-remote-code
```
```bash
vllm serve nvidia/MiniMax-M2.5-NVFP4 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --max-cudagraph-capture-size 2048 \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --trust-remote-code
```
This example is also missing the required tool-calling and reasoning parsers, and includes the `--no-enable-prefix-caching` flag. Adding the parsers ensures the model behaves consistently with the other configurations provided in this guide.
Suggested change:

```diff
 vllm serve nvidia/MiniMax-M2.5-NVFP4 \
     --tensor-parallel-size 4 \
     --enable-expert-parallel \
     --gpu-memory-utilization 0.90 \
     --kv-cache-dtype fp8 \
     --max-cudagraph-capture-size 2048 \
     --stream-interval 20 \
-    --no-enable-prefix-caching \
+    --tool-call-parser minimax_m2 \
+    --reasoning-parser minimax_m2_append_think \
+    --enable-auto-tool-choice \
     --trust-remote-code
```
Summary
- Add a `### B200 (FP4 / NVFP4)` section to `MiniMax/MiniMax-M2.5.md`
- Base TP=2 config with `--max-cudagraph-capture-size 2048`, `--stream-interval 20`, `--no-enable-prefix-caching`
- TP=4 config with `--enable-expert-parallel` for higher concurrency (EP degree matches TP per benchmark sweep)
- Based on SemiAnalysisAI/InferenceX#996
Test plan
- `nvidia/MiniMax-M2.5-NVFP4` model loads with TP=2 on B200
- TP=4 with `--enable-expert-parallel`
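Once a server from one of the configs above is up, the tool-calling path can be smoke-tested with an OpenAI-compatible chat-completions request. A minimal sketch of the request body (the `get_weather` tool, its schema, and the localhost URL are illustrative assumptions, not part of this PR):

```python
import json

# Illustrative request body for POST /v1/chat/completions on a running
# `vllm serve nvidia/MiniMax-M2.5-NVFP4` instance (assumed at localhost:8000).
payload = {
    "model": "nvidia/MiniMax-M2.5-NVFP4",
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo right now?"}
    ],
    # Hypothetical tool definition used only to exercise the parser flags.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    # With --enable-auto-tool-choice, the server decides whether to call the tool.
    "tool_choice": "auto",
}

body = json.dumps(payload)
# Send with e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$BODY"
# A response whose message contains `tool_calls` indicates the
# minimax_m2 tool-call parser is active.
print(body[:80])
```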