🚀 The feature, motivation and pitch
Environment:
GPU: Tesla V100 SXM2 16GB × 4
1Cat-vLLM: 1.1.0
CUDA: 12.0
Python: 3.12
Model: Qwen3-30B-A3B-AWQ
Problem:
A single request works, but when 5 concurrent requests are sent at once, the service crashes with:
RuntimeError: Worker failed with error 'Shared memory exceeds 96KB: 99840 bytes'
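For scale, here is a quick check of how far over the limit the request is. It assumes the "96KB" in the message means 96 × 1024 bytes; the variable names are only illustrative.

```python
# Assumption: "96KB" in the error message means 96 * 1024 bytes.
LIMIT_BYTES = 96 * 1024
# Size reported in the error message.
REQUESTED_BYTES = 99840

overshoot = REQUESTED_BYTES - LIMIT_BYTES
print(LIMIT_BYTES, overshoot)  # prints: 98304 1536
```

So the failing allocation is about 1.5 KB over the limit rather than far beyond it.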
Launch command:
--tensor-parallel-size 4 --enforce-eager --dtype float16 --max-num-seqs 5
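A minimal reproduction sketch for the concurrent case, in case it helps. The endpoint URL, port, served model name, prompt, and `max_tokens` value are all assumptions (I am assuming an OpenAI-compatible `/v1/completions` endpoint on localhost); adjust them to the actual deployment. The helper names are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions (not confirmed settings): endpoint URL and served model name.
URL = "http://localhost:8000/v1/completions"
MODEL = "Qwen3-30B-A3B-AWQ"


def build_payload(i):
    """Build one completion request body; the prompt is a placeholder."""
    return {"model": MODEL, "prompt": f"Request {i}: say hello.", "max_tokens": 64}


def post_json(payload, url=URL, timeout=120):
    """POST a JSON body and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def fire_concurrently(n=5, send=post_json):
    """Submit n requests at once; return a status code or error text per request."""
    def attempt(i):
        try:
            return send(build_payload(i))
        except Exception as exc:  # a crashed server surfaces here as a connection error
            return repr(exc)

    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(attempt, range(n)))
```

Calling `fire_concurrently(1)` should correspond to the working single-request case, and `fire_concurrently(5)` to the case that crashes the service.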
Question: Is there a fix or workaround for this?
Alternatives
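Hello. With 1Cat-vLLM 1.1.0 on 4× V100 SXM2 16GB, running Qwen3-30B-A3B-AWQ with TP=4, a single request works fine, but multiple concurrent requests fail with: RuntimeError: Worker failed with error 'Shared memory exceeds 96KB: 99840 bytes', and the service crashes and restarts. Is there a solution for this problem?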
Additional context
No response
Before submitting a new issue...