Skip to content

[multi_agent] rollout generation stalls for a long time before any /generate fires #979

@GuanxingLu

Description

@GuanxingLu

Description

Running the qwen3-30B-A3B multi-agent example, after resume_memory_occupation returns, the rollout sits at Rollout generation: 0/256 with GPUs at 0% util. The sglang engine only sees periodic GET /health, where no POST /generate shows up for tens of seconds (sometimes indefinitely).

Looks like the event loop is blocked on the client side: examples/multi_agent/rollout_with_multi_agents.py calls load_tokenizer(args.hf_checkpoint, ...) on every sample, and examples/multi_agent/agent_system.py then does deepcopy(args) with the tokenizer attached. Under the GIL, deepcopying a PreTrainedTokenizer once per sample basically freezes everything.

Reproduction

bash examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh

Model Qwen3-30B-A3B-Instruct-2507, rollout_batch_size=32, n_samples_per_prompt=8, 8×L20X, hf checkpoint on a shared FS.

Traceback

No exception. During the stall:

INFO: .../resume_memory_occupation HTTP/1.1 200 OK
Rollout generation:   0%|          | 0/256 [00:00<?, ?it/s]
INFO: .../health HTTP/1.1 200 OK
INFO: .../health HTTP/1.1 200 OK
...

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions