SGLang custom all-reduce v2 CUDA graph capture fails under Miles colocate rollout

## Summary

During testing of the SGLang v0.5.12 bump for Miles, a standalone SGLang server launch for Qwen3-30B-A3B succeeds with the same core rollout server args, but the Miles CI test fails when the server is launched through the Miles/Ray colocate rollout path.

This suggests the failure is not a generic standalone SGLang launch issue. It is triggered by the Miles/Ray colocate rollout context, or by a subtle environment/launch difference introduced by that path.

## Environment

- Miles PR: #1164
- Failing CI test: `tests/e2e/megatron/test_qwen3_30B_A3B/test_r3_baseline.py`
- Failing CI job: https://github.com/radixark/miles/actions/runs/26258685623/job/77287934127?pr=1164
- Image under test: `radixark/miles:sglang-miles-v0.5.12`
- SGLang branch under test: `sglang-miles-v0.5.12`
- SGLang commit observed in container: `e687c74`
- Model: `Qwen/Qwen3-30B-A3B`
- GPUs used for local repro: `CUDA_VISIBLE_DEVICES=1,2,3,4`

## Working standalone SGLang launch

This pure SGLang server command starts successfully and reaches health check:

```bash
export CUDA_VISIBLE_DEVICES=1,2,3,4
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib:/usr/local/lib/python3.12/dist-packages/tvm_ffi/lib:$LD_LIBRARY_PATH

python -m sglang.launch_server \
  --model-path /cluster_public/models/Qwen3-30B-A3B \
  --tokenizer-path /cluster_public/models/Qwen3-30B-A3B \
  --host 127.0.0.1 \
  --port 15300 \
  --nccl-port 15301 \
  --dist-init-addr 127.0.0.1:15306 \
  --tp-size 4 \
  --dp-size 1 \
  --pp-size 1 \
  --ep-size 1 \
  --mem-fraction-static 0.7 \
  --max-running-requests 512 \
  --cuda-graph-max-bs 512 \
  --chunked-prefill-size 8192 \
  --max-prefill-tokens 16384 \
  --attention-backend fa3 \
  --sampling-backend flashinfer \
  --grammar-backend xgrammar \
  --enable-metrics \
  --enable-memory-saver \
  --enable-draft-weights-cpu-backup \
  --enable-return-routed-experts \
  --skip-server-warmup \
  --disable-piecewise-cuda-graph
```

Key successful log lines from the standalone launch:

```text
Custom allreduce v2 initialized successfully
Capture cuda graph bs [1, 2, 4, ..., 512]
Registering 72 cuda graph addresses
Capture cuda graph end
```

## Failing Miles CI repro

The failing Miles test can be reproduced in the same container with:

```bash
export CUDA_VISIBLE_DEVICES=1,2,3,4
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_NVLS_ENABLE=1
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib:/usr/local/lib/python3.12/dist-packages/tvm_ffi/lib:$LD_LIBRARY_PATH
export PYTHONPATH=/root/miles:/root/Megatron-LM
export MILES_TEST_ENABLE_INFINITE_RUN=false

cd /root/miles
ray stop --force || true
python -m tests.e2e.megatron.test_qwen3_30B_A3B.test_r3_baseline
```

The test expands to a Miles training command whose relevant rollout args are:

```bash
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"env_vars":{"PYTHONPATH":"/root/Megatron-LM","CUDA_DEVICE_MAX_CONNECTIONS":"1","NCCL_NVLS_ENABLE":"1","no_proxy":"127.0.0.1,127.0.0.1","MASTER_ADDR":"127.0.0.1","MILES_EXPERIMENTAL_ROLLOUT_REFACTOR":"1"}}' \
  -- python3 /root/miles/train.py \
  --hf-checkpoint /root/models/Qwen3-30B-A3B \
  --ref-load /root/Qwen3-30B-A3B_torch_dist \
  --rollout-num-gpus-per-engine 4 \
  --sglang-mem-fraction-static 0.7 \
  --sglang-max-running-requests 512 \
  --sglang-enable-metrics \
  --tensor-model-parallel-size 2 \
  --context-parallel-size 2 \
  --expert-model-parallel-size 4 \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 4 \
  --colocate \
  --ci-test \
  --moe-token-dispatcher-type alltoall \
  ...
```

The SGLang server launched by Miles uses the same key server-side settings as the standalone launch:

```text
tp_size=4
mem_fraction_static=0.7
max_running_requests=512
cuda_graph_max_bs=512
disable_piecewise_cuda_graph=True
enable_memory_saver=True
enable_return_routed_experts=True
attention_backend='fa3'
sampling_backend='flashinfer'
grammar_backend='xgrammar'
disable_custom_all_reduce=False
```
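One way to check the "same settings" claim mechanically is to normalize the standalone CLI flags and the settings listed above into dictionaries and compare them. The sketch below is illustrative only: the helper names (`parse_cli_flags`, `parse_logged_args`, `mismatches`) are hypothetical and not part of SGLang or Miles, and it assumes every `--some-flag` maps to a `some_flag` field, which may not hold for every argument.

```python
import ast
import shlex


def _coerce(text):
    """Best-effort conversion of '4', '0.7', 'True', "'fa3'" to Python values."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text  # plain strings such as fa3 or a filesystem path


def parse_cli_flags(command):
    """Turn '--tp-size 4 --enable-metrics' into {'tp_size': 4, 'enable_metrics': True}."""
    # Join shell line continuations so shlex does not emit stray newline tokens.
    tokens = shlex.split(command.replace("\\\n", " "))
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            key = tok[2:].replace("-", "_")
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[key] = _coerce(tokens[i + 1]) if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # skip 'python', '-m', module name, etc.
    return flags


def parse_logged_args(dump):
    """Parse 'key=value' lines, one setting per line."""
    settings = {}
    for line in dump.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = _coerce(value.strip())
    return settings


def mismatches(cli, logged):
    """Return {key: (cli_value, logged_value)} for keys present in both but unequal."""
    shared = cli.keys() & logged.keys()
    return {k: (cli[k], logged[k]) for k in sorted(shared) if cli[k] != logged[k]}
```

Only keys present on both sides are compared, so settings that Miles sets implicitly (or that the standalone command leaves at their defaults) still need a manual look.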

## Failure

Miles fails while creating the rollout server. The training step itself is not reached.

Failure path:

```text
train.py
  create_rollout_manager(...)
    RolloutManager.__init__
      start_rollout_servers(...)
        SGLangEngine.init()
          launch_server_process(ServerArgs(...))
            _wait_server_healthy(...)
              Server process terminated unexpectedly
```

Underlying SGLang crash:

```text
Capture cuda graph failed: Runtime check failed at
/sgl-workspace/sglang/python/sglang/jit_kernel/include/sgl_kernel/distributed/custom_all_reduce.cuh:37:
CUDA error: invalid argument
```

Full relevant SGLang stack:

```text
sglang/srt/model_executor/cuda_graph_runner.py:739 in __init__
  self.capture()
sglang/srt/model_executor/cuda_graph_runner.py:926 in capture
  with (...)
sglang/srt/distributed/parallel_state.py:1564 in graph_capture
sglang/srt/distributed/parallel_state.py:519 in graph_capture
sglang/srt/distributed/device_communicators/custom_all_reduce_v2.py:108 in capture
  pairs = self.obj.share_graph_inputs()
sglang/jit_kernel/include/sgl_kernel/distributed/custom_all_reduce.cuh:37
  CUDA error: invalid argument
```

Outer Miles/Ray error:

```text
ray.exceptions.RayTaskError: ray::SGLangEngine.init()
Exception: Server process terminated unexpectedly.
```
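When re-running the experiments it can help to classify each server log automatically. The following is a minimal sketch that only looks for the log strings quoted in this issue; `classify_log` and its return labels are hypothetical names chosen for illustration, and other SGLang versions may word these messages differently.

```python
# Marker strings copied from the logs quoted in this issue.
SUCCESS_MARKERS = (
    "Custom allreduce v2 initialized successfully",
    "Capture cuda graph end",
)
FAILURE_MARKERS = (
    "Capture cuda graph failed",
    "CUDA error: invalid argument",
)


def classify_log(text):
    """Return 'capture_failed', 'capture_ok', or 'unknown' for a server log."""
    # Check failure first: a crashing run can still contain the init-success line.
    if any(marker in text for marker in FAILURE_MARKERS):
        return "capture_failed"
    if all(marker in text for marker in SUCCESS_MARKERS):
        return "capture_ok"
    return "unknown"
```

A log that contains neither set of markers (for example, a server that died before graph capture) is reported as `unknown` rather than guessed at.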

## Current interpretation

- This is reproducible outside GitHub Actions in the `bump-sglang` rcli job.
- It is not a generic pure SGLang Qwen3-30B-A3B launch failure, because the standalone SGLang command succeeds with the same core server args.
- The failure is triggered by the Miles/Ray colocate rollout launch context, or by an environment/launch difference that this path introduces.
- The immediate failing component is SGLang CustomAllReduceV2 graph capture, specifically `share_graph_inputs()` exporting CUDA graph input addresses through CUDA IPC.
- This is likely related to the v0.5.12 change where JIT CustomAllReduceV2 became the default path. v0.5.10 did not use this path by default.

## Next experiments

1. Re-run the Miles test with CustomAllReduceV2 disabled, for example by forcing the old custom all-reduce path or disabling custom all-reduce, to confirm the regression boundary.
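2. Compare process/env differences between standalone launch and Miles Ray actor launch, especially Ray runtime env, inherited CUDA/NCCL env vars, and visible-device mapping. For example, the Miles repro sets `CUDA_DEVICE_MAX_CONNECTIONS=1` and `NCCL_NVLS_ENABLE=1`, which the standalone command does not.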
3. Reduce `--sglang-cuda-graph-max-bs` or disable CUDA graph only as diagnostic experiments, not as the preferred final fix.
4. Isolate whether the trigger is Ray actor launch itself, Miles colocate resource state, or training/rollout colocate initialization before the SGLang server process is created.
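For the environment comparison in experiment 2, a small diff helper may be enough to start with. This is a sketch under assumptions: it reads `/proc/<pid>/environ` (Linux only, and it reflects the environment at process start, not later in-process changes), the prefix list is a guess at what is relevant here, and `read_proc_environ` / `diff_env` are hypothetical helper names.

```python
from pathlib import Path

# Variable-name prefixes assumed to matter for this issue; extend as needed.
PREFIXES = ("CUDA_", "NCCL_", "RAY_", "MASTER_", "LD_LIBRARY_PATH", "PYTHONPATH")


def read_proc_environ(pid):
    """Read the start-time environment of a running process on Linux."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, value = entry.split(b"=", 1)
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def diff_env(standalone, miles, prefixes=PREFIXES):
    """Return {name: (standalone_value, miles_value)} for differing variables."""
    names = {k for k in standalone.keys() | miles.keys() if k.startswith(prefixes)}
    return {
        name: (standalone.get(name), miles.get(name))
        for name in sorted(names)
        if standalone.get(name) != miles.get(name)
    }


if __name__ == "__main__":
    # Values taken from the two repro commands in this issue.
    standalone_env = {"CUDA_VISIBLE_DEVICES": "1,2,3,4"}
    miles_env = {
        "CUDA_VISIBLE_DEVICES": "1,2,3,4",
        "CUDA_DEVICE_MAX_CONNECTIONS": "1",
        "NCCL_NVLS_ENABLE": "1",
    }
    for name, (a, b) in diff_env(standalone_env, miles_env).items():
        print(f"{name}: standalone={a!r} miles={b!r}")
```

In practice the two dictionaries would come from `read_proc_environ` called on the standalone server PID and on the SGLang server process spawned by the Ray actor, so that variables injected by Ray itself are included.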

