[WIP] [fix] Simplify GPU ID remapping for SGLang colocate mode by da-niao-dan · Pull Request #1178 · radixark/miles

da-niao-dan · 2026-05-22T14:15:44Z

Summary

Fix SGLang CUDA graph capture failure in colocate mode by using local loop indices directly instead of physical GPU IDs from Ray placement groups. Removes the workaround that disabled custom all-reduce v2.

Symptom & Reproduction

Symptom: SGLang v0.5.12 server fails to start in colocate rollout mode with error: Capture cuda graph failed: Runtime check failed at custom_all_reduce.cuh:37: CUDA error: invalid argument

Reproduction:

export CUDA_VISIBLE_DEVICES=1,2,3,4
python train.py \
  --hf-checkpoint /root/models/Qwen3-30B-A3B \
  --rollout-num-gpus-per-engine 4 \
  --tensor-model-parallel-size 2 \
  --actor-num-gpus-per-node 4 \
  --colocate \
  --num-rollout 1

Linked issue: SGLang custom all-reduce v2 CUDA graph capture fails under Miles colocate rollout #1176

Root Cause

The original code passed physical GPU IDs from Ray placement groups (e.g., 1,2,3,4) to SGLang engines and then converted them to local IDs via _to_local_gpu_id(). The conversion was based on CUDA_VISIBLE_DEVICES in the parent Ray actor's context, which might differ from what the spawned SGLang server process sees, causing device index mismatch.
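The mismatch described above can be illustrated with a small sketch. The helper names below (`parse_visible_devices`, `to_local_gpu_id`, `resolve_physical`) are hypothetical stand-ins, not the actual Miles functions, and the child's device list is an assumed value chosen only to show how the same local index can land on a different physical GPU. It also assumes CUDA_VISIBLE_DEVICES holds integer IDs rather than UUIDs.

```python
def parse_visible_devices(value: str) -> list[int]:
    """Parse a CUDA_VISIBLE_DEVICES string such as "1,2,3,4" into integer IDs."""
    return [int(tok) for tok in value.split(",") if tok.strip()]


def to_local_gpu_id(physical_id: int, visible: list[int]) -> int:
    """Hypothetical stand-in for _to_local_gpu_id(): index of a physical ID in the visible list."""
    return visible.index(physical_id)


def resolve_physical(local_id: int, visible: list[int]) -> int:
    """Physical GPU that a process with this visible list reaches through local_id."""
    return visible[local_id]


# Parent Ray actor context, matching the reproduction command.
parent_visible = parse_visible_devices("1,2,3,4")
# Assumed for illustration only: the spawned server sees a different list.
child_visible = parse_visible_devices("0,1,2,3,4,5,6,7")

local_id = to_local_gpu_id(1, parent_visible)          # converted in the parent
intended = resolve_physical(local_id, parent_visible)  # GPU the parent meant
actual = resolve_physical(local_id, child_visible)     # GPU the child would open

print(f"local={local_id} intended physical={intended} actual physical={actual}")
```

When both processes share one device list, `intended` and `actual` coincide, which is the condition the fix relies on.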

History: Why the Remapping Was Added

Before Jan 5, 2026:

base_gpu_id was computed by get_base_gpu_id(args, rank) returning logical indices (0, 1, 2, ...)
No CUDA_VISIBLE_DEVICES handling

Jan 5, 2026 (commit c77bdcf):

Added _to_local_gpu_id() to handle CUDA_VISIBLE_DEVICES=4,5,6,7 style GPU selection
Purpose: convert physical GPU 4 → local GPU 0, etc.

Around Jan 3, 2026 (commit 9dfd339):

Added reordered_gpu_ids from placement groups
Started passing physical GPU IDs: base_gpu_id = int(reordered_gpu_ids[gpu_index])
Then _to_local_gpu_id() would convert to local ID

Why it worked before v0.5.12:

v0.5.10: CustomAllReduceV2 disabled by default → device mapping lenient
v0.5.12: CustomAllReduceV2 enabled by default → strict device mapping, bug exposed

The Fix

With RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1, the parent Ray actor and the spawned SGLang server see the same CUDA_VISIBLE_DEVICES. The _to_local_gpu_id() conversion is therefore unnecessary; local loop indices can be used directly.

Code flow:

User: CUDA_VISIBLE_DEVICES=4,5,6,7
  ↓
Ray: Creates placement group with 4 GPUs
  ↓
Engine 0 (i=0): base_gpu_id = 0*4 = 0 → Uses local GPUs 0,1,2,3 (physical 4,5,6,7) ✓
Engine 1 (i=1): base_gpu_id = 1*4 = 4 → Uses local GPUs 4,5,6,7 (only valid when at least 8 GPUs are visible)

Changes:

rollout.py: Use base_gpu_id = i * num_gpu_per_engine (local loop index)
sglang_engine.py: Remove _to_local_gpu_id() function, keep get_base_gpu_id() as fallback
Remove the workaround that disabled SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2
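A minimal sketch of the index computation from the change list, assuming the formula base_gpu_id = i * num_gpu_per_engine quoted above. The function name `local_base_gpu_ids` and the bounds check are illustrative additions, not code from rollout.py; the check simply makes explicit that every engine's GPU range has to fit inside the visible device count.

```python
def local_base_gpu_ids(
    num_engines: int, num_gpu_per_engine: int, num_visible_gpus: int
) -> list[int]:
    """Return the local base GPU index for each engine on one node.

    Hypothetical helper: engine i owns local GPUs
    [i * num_gpu_per_engine, (i + 1) * num_gpu_per_engine).
    """
    base_ids = []
    for i in range(num_engines):
        base_gpu_id = i * num_gpu_per_engine
        # Illustrative guard: the engine's last GPU must be a visible device.
        if base_gpu_id + num_gpu_per_engine > num_visible_gpus:
            raise ValueError(
                f"engine {i} needs local GPUs {base_gpu_id}.."
                f"{base_gpu_id + num_gpu_per_engine - 1}, "
                f"but only {num_visible_gpus} are visible"
            )
        base_ids.append(base_gpu_id)
    return base_ids


print(local_base_gpu_ids(num_engines=2, num_gpu_per_engine=4, num_visible_gpus=8))
```

With two engines of four GPUs each, this yields base indices 0 and 4 on an 8-GPU node, and raises if only four devices are visible.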

How Users Specify Which GPUs to Use

GPU selection happens at the Ray level via CUDA_VISIBLE_DEVICES:

# Use specific GPUs (4,5,6,7 instead of 0,1,2,3)
export CUDA_VISIBLE_DEVICES=4,5,6,7
python train.py --colocate ...
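To see which physical GPUs a given selection hands to each engine, the sketch below slices the visible list into per-engine groups. `engine_physical_gpus` is a hypothetical helper written for this illustration; it assumes integer device IDs and that engines are laid out contiguously in local index order.

```python
def engine_physical_gpus(
    visible_devices: str, num_gpu_per_engine: int
) -> list[list[int]]:
    """Group a CUDA_VISIBLE_DEVICES string into per-engine physical GPU lists.

    Hypothetical helper: local index k corresponds to the k-th entry of the
    visible list, and engine i takes a contiguous slice of that list.
    """
    visible = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    num_engines = len(visible) // num_gpu_per_engine
    return [
        visible[i * num_gpu_per_engine : (i + 1) * num_gpu_per_engine]
        for i in range(num_engines)
    ]


print(engine_physical_gpus("4,5,6,7", num_gpu_per_engine=4))
print(engine_physical_gpus("4,5,6,7", num_gpu_per_engine=2))
```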

Status

Status: WIP - awaiting CI test results
Local testing: No access to multi-GPU hardware for reproduction
Relying on: CI tests for verification

Verification

CI tests should verify:

✅ SGLang server starts successfully
✅ "Custom allreduce v2 initialized" appears in logs
✅ "Capture cuda graph end" appears in logs
✅ No "CUDA error: invalid argument"
✅ Training step completes
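The log-based checks in this list could be scripted roughly as follows. This is a sketch, not part of the CI suite: `check_server_log` is a hypothetical helper, and the marker strings are copied from the checklist above on the assumption that they appear verbatim in the server log.

```python
REQUIRED_MARKERS = (
    "Custom allreduce v2 initialized",
    "Capture cuda graph end",
)
FORBIDDEN_MARKERS = ("CUDA error: invalid argument",)


def check_server_log(log_text: str) -> list[str]:
    """Return a list of problems found in the log; an empty list means all checks pass."""
    problems = []
    for marker in REQUIRED_MARKERS:
        if marker not in log_text:
            problems.append(f"missing expected log line: {marker!r}")
    for marker in FORBIDDEN_MARKERS:
        if marker in log_text:
            problems.append(f"found forbidden log line: {marker!r}")
    return problems


good_log = "Custom allreduce v2 initialized\nCapture cuda graph end\n"
bad_log = "Capture cuda graph failed: CUDA error: invalid argument\n"

print(check_server_log(good_log))
print(check_server_log(bad_log))
```

Returning the list of problems rather than a boolean keeps the failure reason visible in CI output.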

In colocate mode, SGLang engines should use local GPU 0 as their base device ID, not physical GPU IDs from Ray placement groups. Each engine uses TP GPUs starting from its local perspective of GPU 0, while the placement_group_bundle_index selects which physical GPU bundle to use. This ensures consistent device mapping between the Ray actor and the spawned SGLang server process, preventing CUDA graph capture failures in SGLang v0.5.12+.

Changes:

rollout.py: Use base_gpu_id=0 for all engines
sglang_engine.py: Remove _to_local_gpu_id() conversion and base_gpu_id_is_local flag
CUDA_VISIBLE_DEVICES handles the mapping to physical GPUs

Fixes radixark#1176

gemini-code-assist

Code Review

This pull request simplifies GPU device mapping for SGLang engines by leveraging Ray's environment isolation, specifically by hardcoding the base GPU ID to 0 and removing the manual local ID conversion logic. Feedback indicates that the fallback mechanism in sglang_engine.py may still return global GPU indices when base_gpu_id is missing, which could lead to out-of-bounds errors now that the conversion utility has been removed.

When base_gpu_id is None, default to 0 instead of calling get_base_gpu_id(). The get_base_gpu_id() function returns a global GPU offset (e.g., 4, 5) which was previously converted by _to_local_gpu_id(). Without that conversion, passing it directly to SGLang would cause incorrect device mapping. Ray's placement group ensures each engine sees its assigned GPU as local GPU 0, so defaulting to 0 is correct and removes the need for get_base_gpu_id().

Use local loop index (i * num_gpu_per_engine) instead of always 0. This correctly handles multiple engines:

Engine 0: base_gpu_id = 0 (uses GPUs 0,1,2,3 for TP=4)
Engine 1: base_gpu_id = 4 (uses GPUs 4,5,6,7 for TP=4)

Also restore get_base_gpu_id() as fallback for other code paths.

Key insight: With RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1, parent and child processes see the same CUDA_VISIBLE_DEVICES, so we don't need _to_local_gpu_id() conversion. We just use local loop indices directly.

da-niao-dan requested review from Zhichenzzz, fzyzcjy, maocheng23, yueming-yuan and yushengsu-thu as code owners May 22, 2026 14:15

gemini-code-assist Bot reviewed May 22, 2026

Comment thread miles/backends/sglang_utils/sglang_engine.py Outdated

da-niao-dan added 2 commits May 22, 2026 22:23

da-niao-dan wants to merge 3 commits into radixark:main from da-niao-dan:fix/sglang-cuda-remapping-issue
