
Add MiniMax-M2.5 B200 FP4/NVFP4 serving recipe#320

Open
faradawn wants to merge 1 commit into vllm-project:main from faradawn:minimaxm2.5-b200-fp4

Conversation

@faradawn (Collaborator) commented Apr 8, 2026

Summary

  • Add ### B200 (FP4 / NVFP4) section to MiniMax/MiniMax-M2.5.md
  • Base config: TP=2, FP8 KV cache, --max-cudagraph-capture-size 2048, --stream-interval 20, --no-enable-prefix-caching
  • Expert parallelism variant: TP=4 + --enable-expert-parallel for higher concurrency (EP degree matches TP per benchmark sweep)

Based on SemiAnalysisAI/InferenceX#996

Test plan

  • Verify nvidia/MiniMax-M2.5-NVFP4 model loads with TP=2 on B200
  • Verify expert-parallel variant with TP=4 + --enable-expert-parallel

Add B200 FP4 serving configurations to MiniMax-M2.5.md based on
SemiAnalysisAI/InferenceX#996. Includes base TP=2 config and TP=4
with expert parallelism for higher concurrency, both using FP8 KV cache.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds a new section to the MiniMax-M2.5 documentation for B200 GPUs, providing command examples for the FP4 quantized model. The review feedback correctly identifies that the vllm serve commands are missing essential tool-calling and reasoning parser flags required for the model's specialized features. Additionally, the non-standard --no-enable-prefix-caching flag should be removed to ensure the commands are valid and prevent startup errors.

Comment thread MiniMax/MiniMax-M2.5.md
Comment on lines +55 to +62
vllm serve nvidia/MiniMax-M2.5-NVFP4 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--max-cudagraph-capture-size 2048 \
--stream-interval 20 \
--no-enable-prefix-caching \
--trust-remote-code


Severity: medium

The vllm serve command is missing the tool-calling and reasoning parser flags (--tool-call-parser, --reasoning-parser, and --enable-auto-tool-choice) which are included in other MiniMax-M2.5 examples in this file. Without these, the model's specialized capabilities will not function correctly. Additionally, --no-enable-prefix-caching is not a standard vLLM flag; prefix caching is disabled by default, so this flag should be removed to avoid potential startup errors.

Suggested change

Before:

vllm serve nvidia/MiniMax-M2.5-NVFP4 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--max-cudagraph-capture-size 2048 \
--stream-interval 20 \
--no-enable-prefix-caching \
--trust-remote-code

After:

vllm serve nvidia/MiniMax-M2.5-NVFP4 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--max-cudagraph-capture-size 2048 \
--stream-interval 20 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-auto-tool-choice \
--trust-remote-code
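Once the server is running with the parser flags above, the tool-calling path can be exercised with a standard OpenAI-compatible chat request. The sketch below only builds the request payload; the endpoint URL, tool schema, prompt, and `get_weather` function name are illustrative assumptions, not part of the recipe.

```python
import json

def build_tool_call_request(model="nvidia/MiniMax-M2.5-NVFP4"):
    # Payload for POST http://localhost:8000/v1/chat/completions
    # (default vLLM port assumed). "tool_choice": "auto" relies on
    # the server being launched with --enable-auto-tool-choice.
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": "What is the weather in Tokyo?"}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Look up current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }

payload = build_tool_call_request()
# Serialize for sending; with the minimax_m2 tool-call parser set, any
# tool calls come back in the standard OpenAI tool_calls format.
body = json.dumps(payload)
```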

Comment thread MiniMax/MiniMax-M2.5.md
Comment on lines +68 to +76
vllm serve nvidia/MiniMax-M2.5-NVFP4 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--max-cudagraph-capture-size 2048 \
--stream-interval 20 \
--no-enable-prefix-caching \
--trust-remote-code


Severity: medium

This example is also missing the required parsers for tool-calling and reasoning, and includes the non-standard --no-enable-prefix-caching flag. Including the parsers ensures the model behaves consistently with the other configurations provided in this guide.

Suggested change

Before:

vllm serve nvidia/MiniMax-M2.5-NVFP4 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--max-cudagraph-capture-size 2048 \
--stream-interval 20 \
--no-enable-prefix-caching \
--trust-remote-code

After:

vllm serve nvidia/MiniMax-M2.5-NVFP4 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--max-cudagraph-capture-size 2048 \
--stream-interval 20 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-auto-tool-choice \
--trust-remote-code
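For intuition on the expert-parallel variant: with --enable-expert-parallel, the MoE expert weights are sharded across ranks rather than replicated, with the EP degree matching TP=4 here. The expert count in this sketch is a placeholder; the actual number of experts in MiniMax-M2.5 is not stated in this PR.

```python
def experts_per_rank(num_experts: int, ep_size: int) -> int:
    # Experts are partitioned evenly across EP ranks, so each rank
    # holds num_experts / ep_size expert weight sets.
    if num_experts % ep_size != 0:
        raise ValueError("expert count must divide evenly across ranks")
    return num_experts // ep_size

# e.g. a hypothetical 64-expert MoE layer sharded over EP=4 ranks
per_rank = experts_per_rank(64, 4)
```

Cutting the per-rank expert footprint this way is what frees memory for more KV cache and hence higher concurrency, which is the rationale given for the TP=4 expert-parallel variant.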

