1Cat-vLLM 1.1.0 on 4×V100 SXM2 16GB, TP=4, running Qwen3-30B-A3B-AWQ: single requests work fine, but multiple concurrent requests raise an error #60

@zhaoqingxiaozhao

🚀 The feature, motivation and pitch

GitHub issue version (more detailed):

Environment:

GPU: Tesla V100 SXM2 16GB × 4
1Cat-vLLM: 1.1.0
CUDA: 12.0
Python: 3.12
Model: Qwen3-30B-A3B-AWQ

Problem:
A single request works perfectly, but when 5 requests are sent concurrently, the service crashes with:
RuntimeError: Worker failed with error 'Shared memory exceeds 96KB: 99840 bytes'
Launch command:
--tensor-parallel-size 4 --enforce-eager --dtype float16 --max-num-seqs 5
Question: Is there a fix or workaround for this?
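
For anyone trying to reproduce this, the 5 concurrent requests described above could be sent with something like the following sketch. It is not the script used in the original report. It assumes the server exposes the OpenAI-compatible `/v1/completions` route on `localhost:8000` and that the served model name is `Qwen3-30B-A3B-AWQ`; `BASE_URL`, `MODEL_NAME`, and the helper names are all hypothetical and should be adjusted to the actual deployment.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions (not from the issue): host, port, route, and served model name.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL_NAME = "Qwen3-30B-A3B-AWQ"


def build_payload(prompt, max_tokens=64):
    """Build the JSON body for one completion request."""
    return {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}


def send_one(payload, url=BASE_URL, timeout=120):
    """POST one request and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fire_concurrently(n=5, send=send_one):
    """Submit n requests at once; collect each result or the exception raised."""
    payloads = [build_payload(f"Request {i}: say hello.") for i in range(n)]
    results = []
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(send, p) for p in payloads]
        for fut in futures:
            try:
                results.append(fut.result())
            except Exception as exc:  # a crashed or restarting server surfaces here
                results.append(exc)
    return results
```

Calling `fire_concurrently(5)` against a running server matches the `--max-num-seqs 5` setting in the launch command, while `fire_concurrently(1)` gives the single-request baseline for comparison. Any entry in the returned list that is an exception rather than a JSON response marks a request that failed.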

Alternatives

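Hello. I am running Qwen3-30B-A3B-AWQ with 1Cat-vLLM 1.1.0 on 4×V100 SXM2 16GB with TP=4. Single requests work perfectly, but multiple concurrent requests fail with: RuntimeError: Worker failed with error 'Shared memory exceeds 96KB: 99840 bytes', and the service crashes and restarts. Is there a solution for this?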
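
To put the numbers in the error message in perspective, here is a small arithmetic sketch. It assumes that "96KB" in the message means 96 × 1024 bytes, which matches the documented per-block shared memory maximum for compute capability 7.0 (V100). It only compares the two figures and does not identify which kernel makes the request.

```python
KIB = 1024

limit_bytes = 96 * KIB      # assumed meaning of "96KB" in the error message
requested_bytes = 99840     # size reported in the error message

overshoot_bytes = requested_bytes - limit_bytes
requested_kib = requested_bytes / KIB

print(f"limit:     {limit_bytes} bytes")
print(f"requested: {requested_bytes} bytes ({requested_kib} KiB)")
print(f"overshoot: {overshoot_bytes} bytes")
```

Under that assumption the request is for 97.5 KiB, which is 1536 bytes over the limit.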

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
