[multi_agent] rollout generation stalls for a long time before any /generate fires

## Description

Running the qwen3-30B-A3B multi-agent example, after `resume_memory_occupation` returns, the rollout sits at `Rollout generation: 0/256` with GPUs at 0% util. The sglang engine only sees periodic `GET /health`, where no `POST /generate` shows up for tens of seconds (sometimes indefinitely).

Looks like the event loop is blocked on the client side: `examples/multi_agent/rollout_with_multi_agents.py` calls `load_tokenizer(args.hf_checkpoint, ...)` on every sample, and `examples/multi_agent/agent_system.py` then does `deepcopy(args)` with the tokenizer attached. Under the GIL, deepcopying a `PreTrainedTokenizer` once per sample basically freezes everything.

## Reproduction

```bash
bash examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh
```

Model `Qwen3-30B-A3B-Instruct-2507`, rollout_batch_size=32, n_samples_per_prompt=8, 8×L20X, hf checkpoint on a shared FS.

## Traceback

No exception. During the stall:

```
INFO: .../resume_memory_occupation HTTP/1.1 200 OK
Rollout generation:   0%|          | 0/256 [00:00<?, ?it/s]
INFO: .../health HTTP/1.1 200 OK
INFO: .../health HTTP/1.1 200 OK
...
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[multi_agent] rollout generation stalls for a long time before any /generate fires #979

Description

Reproduction

Traceback

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[multi_agent] rollout generation stalls for a long time before any /generate fires #979

Description

Description

Reproduction

Traceback

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions