Your current environment
8x V100 32GB
🐛 Describe the bug
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
vllm serve /home/user/models/QuantTrio--MiniMax-M2.7-AWQ --served-model-name minimax-m2.7 \
--dtype float16 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 8 \
--max-num-seqs 4 \
--max-num-batched-tokens 4096 \
--attention-backend FLASH_ATTN_V100 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--trust-remote-code
There appear to have been some major changes in the Flash Attention codebase, because decode now fails on MiniMax-M2.7 with RuntimeError: Shared memory exceeds 96KB: 146944 bytes. If I recall correctly, this happens when an FA kernel requests more shared memory than is available per SM, which occurs once the head dimension exceeds some size.
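For scale, here is a quick sanity check on the two numbers in the error message. It is plain arithmetic only: the 96KB limit and the 146944-byte request are copied from the error, and the helper name is made up for illustration (it is not part of vLLM or Flash Attention).

```python
# Numbers copied from the error message; the helper name is hypothetical.
SHARED_MEMORY_LIMIT_BYTES = 96 * 1024  # the 96KB limit named in the error
REQUESTED_BYTES = 146944               # what the kernel tried to allocate


def shared_memory_overage(requested: int, limit: int = SHARED_MEMORY_LIMIT_BYTES) -> int:
    """Return how many bytes the request exceeds the limit by (0 if it fits)."""
    return max(0, requested - limit)


overage = shared_memory_overage(REQUESTED_BYTES)
print(f"requested: {REQUESTED_BYTES / 1024:.1f} KiB")
print(f"limit:     {SHARED_MEMORY_LIMIT_BYTES / 1024:.1f} KiB")
print(f"overage:   {overage / 1024:.1f} KiB")
```

So the request is roughly one and a half times the limit, which suggests the tile configuration chosen for this head dimension is well outside the budget rather than marginally over it.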
Before submitting a new issue...