Add MiniMax-M2.5 B200 FP4/NVFP4 serving recipe #320

faradawn wants to merge 1 commit into vllm-project:main
Conversation
Add B200 FP4 serving configurations to MiniMax-M2.5.md based on SemiAnalysisAI/InferenceX#996. Includes base TP=2 config and TP=4 with expert parallelism for higher concurrency, both using FP8 KV cache. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Code Review
This pull request adds a new section to the MiniMax-M2.5 documentation for B200 GPUs, providing command examples for the FP4 quantized model. The review feedback correctly identifies that the vllm serve commands are missing essential tool-calling and reasoning parser flags required for the model's specialized features. Additionally, the non-standard --no-enable-prefix-caching flag should be removed to ensure the commands are valid and prevent startup errors.
```bash
vllm serve nvidia/MiniMax-M2.5-NVFP4 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --max-cudagraph-capture-size 2048 \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --trust-remote-code
```
The `vllm serve` command is missing the tool-calling and reasoning parser flags (`--tool-call-parser`, `--reasoning-parser`, and `--enable-auto-tool-choice`) that the other MiniMax-M2.5 examples in this file include. Without them, the model's tool-calling and reasoning capabilities will not function correctly. Additionally, `--no-enable-prefix-caching` should be removed so the command matches the other configurations in this guide.
Suggested change:

```diff
 vllm serve nvidia/MiniMax-M2.5-NVFP4 \
     --tensor-parallel-size 2 \
     --gpu-memory-utilization 0.90 \
     --kv-cache-dtype fp8 \
     --max-cudagraph-capture-size 2048 \
     --stream-interval 20 \
-    --no-enable-prefix-caching \
+    --tool-call-parser minimax_m2 \
+    --reasoning-parser minimax_m2_append_think \
+    --enable-auto-tool-choice \
     --trust-remote-code
```
```bash
vllm serve nvidia/MiniMax-M2.5-NVFP4 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --max-cudagraph-capture-size 2048 \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --trust-remote-code
```
This example is also missing the required tool-calling and reasoning parsers, and includes the `--no-enable-prefix-caching` flag. Adding the parsers ensures the model behaves consistently with the other configurations provided in this guide.
Suggested change:

```diff
 vllm serve nvidia/MiniMax-M2.5-NVFP4 \
     --tensor-parallel-size 4 \
     --enable-expert-parallel \
     --gpu-memory-utilization 0.90 \
     --kv-cache-dtype fp8 \
     --max-cudagraph-capture-size 2048 \
     --stream-interval 20 \
-    --no-enable-prefix-caching \
+    --tool-call-parser minimax_m2 \
+    --reasoning-parser minimax_m2_append_think \
+    --enable-auto-tool-choice \
     --trust-remote-code
```
Summary
- Add a `### B200 (FP4 / NVFP4)` section to `MiniMax/MiniMax-M2.5.md`
- Base TP=2 config with `--max-cudagraph-capture-size 2048`, `--stream-interval 20`, `--no-enable-prefix-caching`
- TP=4 config with `--enable-expert-parallel` for higher concurrency (EP degree matches TP per benchmark sweep)
- Based on SemiAnalysisAI/InferenceX#996
Test plan
- `nvidia/MiniMax-M2.5-NVFP4` model loads with TP=2 on B200
- TP=4 with `--enable-expert-parallel`
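Once a server from one of the configs above is up, the tool-calling path can be smoke-tested with an OpenAI-compatible chat-completions request. A minimal sketch of the request body (the `get_weather` tool, its schema, and the localhost URL are illustrative assumptions, not part of this PR):

```python
import json

# Illustrative request body for POST /v1/chat/completions on a running
# `vllm serve nvidia/MiniMax-M2.5-NVFP4` instance (assumed at localhost:8000).
payload = {
    "model": "nvidia/MiniMax-M2.5-NVFP4",
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo right now?"}
    ],
    # Hypothetical tool definition used only to exercise the parser flags.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    # With --enable-auto-tool-choice, the server decides whether to call the tool.
    "tool_choice": "auto",
}

body = json.dumps(payload)
# Send with e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$BODY"
# A response whose message contains `tool_calls` indicates the
# minimax_m2 tool-call parser is active.
print(body[:80])
```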