SGLang custom all-reduce v2 CUDA graph capture fails under Miles colocate rollout #1176

@yueming-yuan

Summary

During testing of the SGLang v0.5.12 bump for Miles, a standalone SGLang server launch for Qwen3-30B-A3B succeeds with the same core rollout server args, but the Miles CI test fails when that server is launched through the Miles/Ray colocate rollout path.

This suggests the failure is not a generic standalone SGLang launch issue. It is triggered by the Miles/Ray colocate rollout context, or by a subtle environment/launch difference introduced by that path.

Environment

Working standalone SGLang launch

This pure SGLang server command starts successfully and reaches health check:

export CUDA_VISIBLE_DEVICES=1,2,3,4
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib:/usr/local/lib/python3.12/dist-packages/tvm_ffi/lib:$LD_LIBRARY_PATH

python -m sglang.launch_server \
  --model-path /cluster_public/models/Qwen3-30B-A3B \
  --tokenizer-path /cluster_public/models/Qwen3-30B-A3B \
  --host 127.0.0.1 \
  --port 15300 \
  --nccl-port 15301 \
  --dist-init-addr 127.0.0.1:15306 \
  --tp-size 4 \
  --dp-size 1 \
  --pp-size 1 \
  --ep-size 1 \
  --mem-fraction-static 0.7 \
  --max-running-requests 512 \
  --cuda-graph-max-bs 512 \
  --chunked-prefill-size 8192 \
  --max-prefill-tokens 16384 \
  --attention-backend fa3 \
  --sampling-backend flashinfer \
  --grammar-backend xgrammar \
  --enable-metrics \
  --enable-memory-saver \
  --enable-draft-weights-cpu-backup \
  --enable-return-routed-experts \
  --skip-server-warmup \
  --disable-piecewise-cuda-graph

Key successful log lines from the standalone launch:

Custom allreduce v2 initialized successfully
Capture cuda graph bs [1, 2, 4, ..., 512]
Registering 72 cuda graph addresses
Capture cuda graph end

Failing Miles CI repro

The failing Miles test can be reproduced in the same container with:

export CUDA_VISIBLE_DEVICES=1,2,3,4
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_NVLS_ENABLE=1
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib:/usr/local/lib/python3.12/dist-packages/tvm_ffi/lib:$LD_LIBRARY_PATH
export PYTHONPATH=/root/miles:/root/Megatron-LM
export MILES_TEST_ENABLE_INFINITE_RUN=false

cd /root/miles
ray stop --force || true
python -m tests.e2e.megatron.test_qwen3_30B_A3B.test_r3_baseline

The test expands to a Miles training command whose relevant rollout args are:

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"env_vars":{"PYTHONPATH":"/root/Megatron-LM","CUDA_DEVICE_MAX_CONNECTIONS":"1","NCCL_NVLS_ENABLE":"1","no_proxy":"127.0.0.1,127.0.0.1","MASTER_ADDR":"127.0.0.1","MILES_EXPERIMENTAL_ROLLOUT_REFACTOR":"1"}}' \
  -- python3 /root/miles/train.py \
  --hf-checkpoint /root/models/Qwen3-30B-A3B \
  --ref-load /root/Qwen3-30B-A3B_torch_dist \
  --rollout-num-gpus-per-engine 4 \
  --sglang-mem-fraction-static 0.7 \
  --sglang-max-running-requests 512 \
  --sglang-enable-metrics \
  --tensor-model-parallel-size 2 \
  --context-parallel-size 2 \
  --expert-model-parallel-size 4 \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 4 \
  --colocate \
  --ci-test \
  --moe-token-dispatcher-type alltoall \
  ...

The SGLang server launched by Miles reports the same key server-side settings as the standalone launch:

tp_size=4
mem_fraction_static=0.7
max_running_requests=512
cuda_graph_max_bs=512
disable_piecewise_cuda_graph=True
enable_memory_saver=True
enable_return_routed_experts=True
attention_backend='fa3'
sampling_backend='flashinfer'
grammar_backend='xgrammar'
disable_custom_all_reduce=False
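
To check the "same settings" claim mechanically, the two configurations can be compared key by key. The sketch below is only an illustration: the flag string and the settings dict are copied from this report (paths and ports omitted), while `parse_flags` and `compare` are hypothetical helpers, not part of SGLang or Miles. A key that shows up on only one side was simply not listed in this report for the other side, which does not by itself mean the effective value differs.

```python
import shlex

# Flags copied from the standalone command in this report (paths/ports omitted).
STANDALONE_CMD = """
--tp-size 4 --dp-size 1 --pp-size 1 --ep-size 1
--mem-fraction-static 0.7 --max-running-requests 512 --cuda-graph-max-bs 512
--chunked-prefill-size 8192 --max-prefill-tokens 16384
--attention-backend fa3 --sampling-backend flashinfer --grammar-backend xgrammar
--enable-metrics --enable-memory-saver --enable-draft-weights-cpu-backup
--enable-return-routed-experts --skip-server-warmup --disable-piecewise-cuda-graph
"""

# Settings listed in this report for the Miles-launched server (as strings).
MILES_ARGS = {
    "tp_size": "4",
    "mem_fraction_static": "0.7",
    "max_running_requests": "512",
    "cuda_graph_max_bs": "512",
    "disable_piecewise_cuda_graph": "True",
    "enable_memory_saver": "True",
    "enable_return_routed_experts": "True",
    "attention_backend": "fa3",
    "sampling_backend": "flashinfer",
    "grammar_backend": "xgrammar",
    "disable_custom_all_reduce": "False",
}


def parse_flags(cmd):
    """Turn '--some-flag value' / '--bool-flag' tokens into a dict (hypothetical helper)."""
    tokens = shlex.split(cmd)
    flags = {}
    i = 0
    while i < len(tokens):
        name = tokens[i][2:].replace("-", "_")
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            flags[name] = tokens[i + 1]
            i += 2
        else:
            flags[name] = "True"  # bare flag: treat as a boolean switch
            i += 1
    return flags


def compare(standalone, miles):
    """Report mismatched values and keys that appear on only one side."""
    shared = set(standalone) & set(miles)
    return {
        "mismatched": {
            k: (standalone[k], miles[k])
            for k in sorted(shared)
            if standalone[k] != miles[k]
        },
        "only_standalone": sorted(set(standalone) - set(miles)),
        "only_miles": sorted(set(miles) - set(standalone)),
    }


report = compare(parse_flags(STANDALONE_CMD), MILES_ARGS)
for section, value in report.items():
    print(section, value)
```

For the values quoted here, no shared key is mismatched. The keys that appear on only one side (for example `chunked_prefill_size` and `skip_server_warmup` on the standalone side) are the ones worth confirming against the full `ServerArgs` dump from the Miles run.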

Failure

Miles fails while creating the rollout server. The training step itself is not reached.

Failure path:

train.py
  create_rollout_manager(...)
    RolloutManager.__init__
      start_rollout_servers(...)
        SGLangEngine.init()
          launch_server_process(ServerArgs(...))
            _wait_server_healthy(...)
              Server process terminated unexpectedly

Underlying SGLang crash:

Capture cuda graph failed: Runtime check failed at
/sgl-workspace/sglang/python/sglang/jit_kernel/include/sgl_kernel/distributed/custom_all_reduce.cuh:37:
CUDA error: invalid argument

Full relevant SGLang stack:

sglang/srt/model_executor/cuda_graph_runner.py:739 in __init__
  self.capture()
sglang/srt/model_executor/cuda_graph_runner.py:926 in capture
  with (...)
sglang/srt/distributed/parallel_state.py:1564 in graph_capture
sglang/srt/distributed/parallel_state.py:519 in graph_capture
sglang/srt/distributed/device_communicators/custom_all_reduce_v2.py:108 in capture
  pairs = self.obj.share_graph_inputs()
sglang/jit_kernel/include/sgl_kernel/distributed/custom_all_reduce.cuh:37
  CUDA error: invalid argument

Outer Miles/Ray error:

ray.exceptions.RayTaskError: ray::SGLangEngine.init()
Exception: Server process terminated unexpectedly.
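
Because the outer Ray error only says that the server process terminated, the underlying crash has to be pulled out of the SGLang server log. A minimal sketch of that lookup follows, assuming the log is available as text. `extract_crash` is a hypothetical helper, and `SAMPLE_LOG` is assembled from the lines quoted in this report rather than taken from a complete log.

```python
# Sample assembled from lines quoted in this report; a real log contains more.
SAMPLE_LOG = """\
Capture cuda graph failed: Runtime check failed at
/sgl-workspace/sglang/python/sglang/jit_kernel/include/sgl_kernel/distributed/custom_all_reduce.cuh:37:
CUDA error: invalid argument
ray.exceptions.RayTaskError: ray::SGLangEngine.init()
Exception: Server process terminated unexpectedly.
"""


def extract_crash(log_text, marker="Capture cuda graph failed", context=2):
    """Return the first line containing `marker` plus `context` following lines."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if marker in line:
            return lines[i : i + context + 1]
    return []  # marker not present in this log


crash = extract_crash(SAMPLE_LOG)
print("\n".join(crash))
```

Running the same helper over the standalone launch log should return an empty list, which gives a quick pass/fail signal when iterating on the experiments.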

Current interpretation

  • This is reproducible outside GitHub Actions in the bump-sglang rcli job.
  • It is not a generic pure SGLang Qwen3-30B-A3B launch failure, because the standalone SGLang command succeeds with the same core server args.
  • The failure appears to be triggered by the Miles/Ray colocate rollout launch context, or by an environment difference that path introduces.
  • The immediate failing component is SGLang CustomAllReduceV2 graph capture, specifically share_graph_inputs() exporting CUDA graph input addresses through CUDA IPC.
  • This is likely related to the v0.5.12 change where JIT CustomAllReduceV2 became the default path. v0.5.10 did not use this path by default.

Next experiments

  1. Re-run the Miles test with CustomAllReduceV2 disabled, for example by forcing the old custom all-reduce path or disabling custom all-reduce, to confirm the regression boundary.
  2. Compare process/env differences between standalone launch and Miles Ray actor launch, especially Ray runtime env, inherited CUDA/NCCL env vars, and visible-device mapping.
  3. Reduce --sglang-cuda-graph-max-bs or disable CUDA graph only as diagnostic experiments, not as the preferred final fix.
  4. Isolate whether the trigger is Ray actor launch itself, Miles colocate resource state, or training/rollout colocate initialization before the SGLang server process is created.
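
For experiment 2, one possible approach is to snapshot the relevant environment variables inside each process and diff the snapshots. The sketch below makes several assumptions: `snapshot_env` and `diff_env` are hypothetical helpers, the prefix list is a guess at what is relevant, and the example dicts contain only the shell-level exports quoted in this report. The environment actually seen by the Ray actor may differ from these exports (Ray can set `CUDA_VISIBLE_DEVICES` itself for GPU actors), which is why the snapshot should be taken inside the server process in both setups.

```python
import json
import os

# Assumed filter: variable families likely to matter for CUDA/NCCL behavior.
PREFIXES = ("CUDA_", "NCCL_", "RAY_", "SGLANG_", "TORCH_")
EXACT = ("LD_LIBRARY_PATH", "PYTHONPATH", "MASTER_ADDR", "no_proxy")


def snapshot_env(environ=None):
    """Collect the environment variables of interest from a mapping (default: os.environ)."""
    env = os.environ if environ is None else environ
    return {k: v for k, v in env.items() if k.startswith(PREFIXES) or k in EXACT}


def diff_env(a, b):
    """Map each differing key to (value_in_a, value_in_b); None means unset."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Shell-level exports quoted in this report, not the actor's real environment.
standalone = snapshot_env({"CUDA_VISIBLE_DEVICES": "1,2,3,4"})
miles = snapshot_env(
    {
        "CUDA_VISIBLE_DEVICES": "1,2,3,4",
        "CUDA_DEVICE_MAX_CONNECTIONS": "1",
        "NCCL_NVLS_ENABLE": "1",
    }
)
print(json.dumps(diff_env(standalone, miles), indent=2))
```

In practice each process would write `json.dumps(snapshot_env())` to a file at startup, and the two files would be compared offline. Where exactly to place that call inside the Miles launch path is left open here.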

Labels: bug, help wanted
