[Bug]: RuntimeError: Shared memory exceeds 96KB: 146944 bytes in MiniMax-M2.7 #40

### Your current environment

8x V100 32GB

### 🐛 Describe the bug

```bash
#!/bin/bash

# Export so the vllm child process actually sees the variable.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

vllm serve /home/user/models/QuantTrio--MiniMax-M2.7-AWQ --served-model-name minimax-m2.7 \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 8 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096 \
  --attention-backend FLASH_ATTN_V100 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --trust-remote-code
```
Something major appears to have changed in the Flash Attention codebase: decode now fails on MiniMax-M2.7 with `RuntimeError: Shared memory exceeds 96KB: 146944 bytes`. If I recall correctly, this happens when an FA kernel requests more shared memory than is available per SM (96 KB on V100, so 146944 bytes is roughly 1.5x over the limit), which occurs once the head dimension exceeds some threshold.
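
For reference, here is a minimal sketch of the arithmetic behind the error message. It assumes the 96 KB limit quoted in the error means 96 × 1024 bytes; `shared_memory_overflow` and `SMEM_LIMIT_BYTES` are hypothetical names used only for illustration and are not part of vLLM or Flash Attention.

```python
# Hypothetical helper for illustration; not part of vLLM or Flash Attention.
KIB = 1024
SMEM_LIMIT_BYTES = 96 * KIB  # limit quoted in the error message (assumed to be KiB)


def shared_memory_overflow(requested_bytes: int, limit_bytes: int = SMEM_LIMIT_BYTES) -> int:
    """Return how many bytes the request exceeds the limit by (0 if it fits)."""
    return max(0, requested_bytes - limit_bytes)


requested = 146944  # value reported in the RuntimeError
overflow = shared_memory_overflow(requested)

print(f"limit:     {SMEM_LIMIT_BYTES} bytes ({SMEM_LIMIT_BYTES / KIB:.1f} KiB)")
print(f"requested: {requested} bytes ({requested / KIB:.1f} KiB)")
print(f"overflow:  {overflow} bytes ({requested / SMEM_LIMIT_BYTES:.2f}x the limit)")
```

The request is 48640 bytes over the limit, so this is unlikely to be fixed by small tuning. The kernel would probably need a smaller tile configuration or a different code path for this head dimension.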

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
