Summary
While testing the SGLang v0.5.12 bump for Miles, a pure SGLang server launch for Qwen3-30B-A3B succeeds with the same core rollout server args, but the Miles CI test fails when the same server is launched through Miles/Ray colocate rollout.
This suggests the failure is not a generic standalone SGLang launch issue. It is triggered by the Miles/Ray colocate rollout context, or by a subtle environment/launch difference introduced by that path.
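Environment
- Test: tests/e2e/megatron/test_qwen3_30B_A3B/test_r3_baseline.py
- Container image: radixark/miles:sglang-miles-v0.5.12
- Branch/tag and commit: sglang-miles-v0.5.12 @ e687c74
- Model: Qwen/Qwen3-30B-A3B
- GPUs: CUDA_VISIBLE_DEVICES=1,2,3,4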
Working standalone SGLang launch
This pure SGLang server command starts successfully and passes the health check:
export CUDA_VISIBLE_DEVICES=1,2,3,4
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib:/usr/local/lib/python3.12/dist-packages/tvm_ffi/lib:$LD_LIBRARY_PATH
python -m sglang.launch_server \
--model-path /cluster_public/models/Qwen3-30B-A3B \
--tokenizer-path /cluster_public/models/Qwen3-30B-A3B \
--host 127.0.0.1 \
--port 15300 \
--nccl-port 15301 \
--dist-init-addr 127.0.0.1:15306 \
--tp-size 4 \
--dp-size 1 \
--pp-size 1 \
--ep-size 1 \
--mem-fraction-static 0.7 \
--max-running-requests 512 \
--cuda-graph-max-bs 512 \
--chunked-prefill-size 8192 \
--max-prefill-tokens 16384 \
--attention-backend fa3 \
--sampling-backend flashinfer \
--grammar-backend xgrammar \
--enable-metrics \
--enable-memory-saver \
--enable-draft-weights-cpu-backup \
--enable-return-routed-experts \
--skip-server-warmup \
--disable-piecewise-cuda-graph
Key successful log lines from the standalone launch:
Custom allreduce v2 initialized successfully
Capture cuda graph bs [1, 2, 4, ..., 512]
Registering 72 cuda graph addresses
Capture cuda graph end
Failing Miles CI repro
The failing Miles test can be reproduced in the same container with:
export CUDA_VISIBLE_DEVICES=1,2,3,4
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_NVLS_ENABLE=1
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib:/usr/local/lib/python3.12/dist-packages/tvm_ffi/lib:$LD_LIBRARY_PATH
export PYTHONPATH=/root/miles:/root/Megatron-LM
export MILES_TEST_ENABLE_INFINITE_RUN=false
cd /root/miles
ray stop --force || true
python -m tests.e2e.megatron.test_qwen3_30B_A3B.test_r3_baseline
The test expands to a Miles training command whose relevant rollout args are:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"env_vars":{"PYTHONPATH":"/root/Megatron-LM","CUDA_DEVICE_MAX_CONNECTIONS":"1","NCCL_NVLS_ENABLE":"1","no_proxy":"127.0.0.1,127.0.0.1","MASTER_ADDR":"127.0.0.1","MILES_EXPERIMENTAL_ROLLOUT_REFACTOR":"1"}}' \
-- python3 /root/miles/train.py \
--hf-checkpoint /root/models/Qwen3-30B-A3B \
--ref-load /root/Qwen3-30B-A3B_torch_dist \
--rollout-num-gpus-per-engine 4 \
--sglang-mem-fraction-static 0.7 \
--sglang-max-running-requests 512 \
--sglang-enable-metrics \
--tensor-model-parallel-size 2 \
--context-parallel-size 2 \
--expert-model-parallel-size 4 \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 4 \
--colocate \
--ci-test \
--moe-token-dispatcher-type alltoall \
...
The SGLang server launched by Miles uses the same important server-side settings:
tp_size=4
mem_fraction_static=0.7
max_running_requests=512
cuda_graph_max_bs=512
disable_piecewise_cuda_graph=True
enable_memory_saver=True
enable_return_routed_experts=True
attention_backend='fa3'
sampling_backend='flashinfer'
grammar_backend='xgrammar'
disable_custom_all_reduce=False
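To check the "same settings" claim key by key rather than by eye, the standalone flags can be normalized to the same key=value form and diffed against the settings Miles reports. The following is a minimal sketch, not part of Miles or SGLang: the helper names (`flags_to_settings`, `parse_reported`, `diff_settings`) are hypothetical, and the flag-to-key mapping (strip `--`, replace `-` with `_`) is an assumption about how the CLI names correspond to the reported fields. The sample inputs are a subset of the values quoted in this report.

```python
import ast


def _literal(text):
    """Turn '4' into 4 and "'fa3'" into 'fa3'; leave anything else as a string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def flags_to_settings(argv):
    """Map ['--tp-size', '4', '--enable-metrics'] to {'tp_size': 4, 'enable_metrics': True}.

    Assumption: a CLI flag --foo-bar corresponds to the reported key foo_bar.
    """
    settings = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            i += 1  # skip anything that is not a flag (e.g. the program name)
            continue
        key = token[2:].replace("-", "_")
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith("--")
        settings[key] = _literal(argv[i + 1]) if has_value else True
        i += 2 if has_value else 1
    return settings


def parse_reported(lines):
    """Parse 'key=value' lines such as "attention_backend='fa3'"."""
    pairs = (line.split("=", 1) for line in lines if "=" in line)
    return {key.strip(): _literal(value.strip()) for key, value in pairs}


def diff_settings(standalone, reported):
    """Return keys whose values differ; a key absent on one side shows as None."""
    keys = sorted(set(standalone) | set(reported))
    return {
        k: (standalone.get(k), reported.get(k))
        for k in keys
        if standalone.get(k) != reported.get(k)
    }


# Subset of the standalone flags and of the Miles-reported settings above.
standalone = flags_to_settings(
    "--tp-size 4 --mem-fraction-static 0.7 --attention-backend fa3 "
    "--enable-memory-saver --enable-draft-weights-cpu-backup".split()
)
reported = parse_reported([
    "tp_size=4",
    "mem_fraction_static=0.7",
    "attention_backend='fa3'",
    "enable_memory_saver=True",
    "disable_custom_all_reduce=False",
])
for key, (left, right) in diff_settings(standalone, reported).items():
    print(f"{key}: standalone={left!r} miles={right!r}")
```

A key that appears on only one side is not necessarily a real difference, since the list above covers only a subset of the server args; it is a prompt to look up the effective value on the other side.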
Failure
Miles fails while creating the rollout server. The training step itself is not reached.
Failure path:
train.py
create_rollout_manager(...)
RolloutManager.__init__
start_rollout_servers(...)
SGLangEngine.init()
launch_server_process(ServerArgs(...))
_wait_server_healthy(...)
Server process terminated unexpectedly
Underlying SGLang crash:
Capture cuda graph failed: Runtime check failed at
/sgl-workspace/sglang/python/sglang/jit_kernel/include/sgl_kernel/distributed/custom_all_reduce.cuh:37:
CUDA error: invalid argument
Full relevant SGLang stack:
sglang/srt/model_executor/cuda_graph_runner.py:739 in __init__
self.capture()
sglang/srt/model_executor/cuda_graph_runner.py:926 in capture
with (...)
sglang/srt/distributed/parallel_state.py:1564 in graph_capture
sglang/srt/distributed/parallel_state.py:519 in graph_capture
sglang/srt/distributed/device_communicators/custom_all_reduce_v2.py:108 in capture
pairs = self.obj.share_graph_inputs()
sglang/jit_kernel/include/sgl_kernel/distributed/custom_all_reduce.cuh:37
CUDA error: invalid argument
Outer Miles/Ray error:
ray.exceptions.RayTaskError: ray::SGLangEngine.init()
Exception: Server process terminated unexpectedly.
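Because the outer Ray error only says that the server process terminated, each diagnostic run has to be classified from the SGLang log itself. The sketch below does that by searching for the marker strings quoted in this report. The `Triage` class, the `triage_log` function, and the verdict labels are hypothetical names introduced here, and matching on fixed substrings assumes the log wording stays as quoted.

```python
from dataclasses import dataclass

# Marker strings copied from the logs quoted in this report.
CAPTURE_OK = "Capture cuda graph end"
CAPTURE_FAILED = "Capture cuda graph failed"
ALLREDUCE_SOURCE = "custom_all_reduce.cuh"
CUDA_INVALID_ARG = "CUDA error: invalid argument"


@dataclass
class Triage:
    capture_ok: bool
    capture_failed: bool
    allreduce_invalid_arg: bool

    @property
    def verdict(self):
        """Collapse the three flags into one label (labels are illustrative)."""
        if self.allreduce_invalid_arg:
            return "custom-all-reduce-capture-crash"
        if self.capture_failed:
            return "other-capture-failure"
        if self.capture_ok:
            return "capture-succeeded"
        return "capture-not-reached"


def triage_log(text):
    """Classify one server log by the presence of the known marker strings."""
    failed = CAPTURE_FAILED in text
    return Triage(
        capture_ok=CAPTURE_OK in text,
        capture_failed=failed,
        allreduce_invalid_arg=(
            failed and ALLREDUCE_SOURCE in text and CUDA_INVALID_ARG in text
        ),
    )


crash_log = (
    "Capture cuda graph failed: Runtime check failed at\n"
    "/sgl-workspace/sglang/python/sglang/jit_kernel/include/sgl_kernel/"
    "distributed/custom_all_reduce.cuh:37:\n"
    "CUDA error: invalid argument\n"
)
print(triage_log(crash_log).verdict)
```

Keeping "other-capture-failure" separate from the all-reduce crash makes it visible when an experiment replaces this failure with a different one instead of fixing it.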
Current interpretation
- This is reproducible outside GitHub Actions in the bump-sglang rcli job.
- It is not a generic pure SGLang Qwen3-30B-A3B launch failure, because the standalone SGLang command succeeds with the same core server args.
- The failure is triggered by the Miles/Ray colocate rollout launch context.
- The immediate failing component is SGLang CustomAllReduceV2 graph capture, specifically share_graph_inputs() exporting CUDA graph input addresses through CUDA IPC.
- This is likely related to the v0.5.12 change where JIT CustomAllReduceV2 became the default path. v0.5.10 did not use this path by default.
Next experiments
- Re-run the Miles test with CustomAllReduceV2 disabled, for example by forcing the old custom all-reduce path or disabling custom all-reduce, to confirm the regression boundary.
- Compare process/env differences between standalone launch and Miles Ray actor launch, especially Ray runtime env, inherited CUDA/NCCL env vars, and visible-device mapping.
- Reduce --sglang-cuda-graph-max-bs or disable CUDA graph only as diagnostic experiments, not as the preferred final fix.
- Isolate whether the trigger is Ray actor launch itself, Miles colocate resource state, or training/rollout colocate initialization before the SGLang server process is created.
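For the environment comparison listed above, one option on Linux is to read `/proc/<pid>/environ` for the standalone server process and for the server process started by the Miles Ray actor, then diff only the variables likely to matter. The sketch below assumes both PIDs are known and readable; the function names and the prefix list are hypothetical choices, not something Miles or SGLang provides. The sample bytes are illustrative values taken from the commands in this report, not captured from real processes.

```python
# Assumption: these prefixes cover the variables of interest; extend as needed.
WATCHED_PREFIXES = (
    "CUDA_", "NCCL_", "RAY_", "MASTER_", "MILES_", "SGLANG_",
    "LD_LIBRARY_PATH", "PYTHONPATH",
)


def parse_environ(raw):
    """Parse NUL-separated KEY=VALUE records, the layout of /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        if b"=" not in record:
            continue  # skip the empty trailing record
        key, value = record.split(b"=", 1)
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_environ(pid):
    """Read the environment a process was started with (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return parse_environ(handle.read())


def diff_environ(standalone, colocated, prefixes=WATCHED_PREFIXES):
    """List (name, standalone value, colocated value) for watched variables that differ."""
    names = sorted(
        name for name in set(standalone) | set(colocated)
        if name.startswith(prefixes)
    )
    return [
        (name, standalone.get(name), colocated.get(name))
        for name in names
        if standalone.get(name) != colocated.get(name)
    ]


# Illustrative inputs only, built from the exports shown in this report.
standalone_env = parse_environ(b"CUDA_VISIBLE_DEVICES=1,2,3,4\0HOME=/root\0")
colocated_env = parse_environ(
    b"CUDA_VISIBLE_DEVICES=1,2,3,4\0CUDA_DEVICE_MAX_CONNECTIONS=1\0"
    b"NCCL_NVLS_ENABLE=1\0HOME=/root\0"
)
for name, left, right in diff_environ(standalone_env, colocated_env):
    print(f"{name}: standalone={left!r} miles={right!r}")
```

With real processes the call would be `diff_environ(read_environ(standalone_pid), read_environ(miles_server_pid))`, where both PID variables are placeholders. Note that `/proc/<pid>/environ` shows the environment the process was started with, so variables changed later from inside the process may not appear there.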