[WIP] [fix] Simplify GPU ID remapping for SGLang colocate mode#1178
Open
da-niao-dan wants to merge 3 commits into
Open
[WIP] [fix] Simplify GPU ID remapping for SGLang colocate mode#1178da-niao-dan wants to merge 3 commits into
da-niao-dan wants to merge 3 commits into
Conversation
In colocate mode, SGLang engines should use local GPU 0 as their base device ID, not physical GPU IDs from Ray placement groups. Each engine uses TP GPUs starting from its local perspective of GPU 0, while the placement_group_bundle_index selects which physical GPU bundle to use. This ensures consistent device mapping between the Ray actor and the spawned SGLang server process, preventing CUDA graph capture failures in SGLang v0.5.12+. Changes: - rollout.py: Use base_gpu_id=0 for all engines - sglang_engine.py: Remove _to_local_gpu_id() conversion and base_gpu_id_is_local flag - CUDA_VISIBLE_DEVICES handles the mapping to physical GPUs Fixes radixark#1176
Contributor
There was a problem hiding this comment.
Code Review
This pull request simplifies GPU device mapping for SGLang engines by leveraging Ray's environment isolation, specifically by hardcoding the base GPU ID to 0 and removing the manual local ID conversion logic. Feedback indicates that the fallback mechanism in sglang_engine.py may still return global GPU indices when base_gpu_id is missing, which could lead to out-of-bounds errors now that the conversion utility has been removed.
When base_gpu_id is None, default to 0 instead of calling get_base_gpu_id(). The get_base_gpu_id() function returns a global GPU offset (e.g., 4, 5) which was previously converted by _to_local_gpu_id(). Without that conversion, passing it directly to SGLang would cause incorrect device mapping. Ray's placement group ensures each engine sees its assigned GPU as local GPU 0, so defaulting to 0 is correct and removes the need for get_base_gpu_id().
Use local loop index (i * num_gpu_per_engine) instead of always 0. This correctly handles multiple engines: - Engine 0: base_gpu_id = 0 (uses GPUs 0,1,2,3 for TP=4) - Engine 1: base_gpu_id = 4 (uses GPUs 4,5,6,7 for TP=4) Also restore get_base_gpu_id() as fallback for other code paths. Key insight: With RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1, parent and child processes see the same CUDA_VISIBLE_DEVICES, so we don't need _to_local_gpu_id() conversion. We just use local loop indices directly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix SGLang CUDA graph capture failure in colocate mode by using local loop indices directly instead of physical GPU IDs from Ray placement groups. Removes the workaround that disabled custom all-reduce v2.
Symptom & Reproduction
Capture cuda graph failed: Runtime check failed at custom_all_reduce.cuh:37: CUDA error: invalid argumentexport CUDA_VISIBLE_DEVICES=1,2,3,4 python train.py \ --hf-checkpoint /root/models/Qwen3-30B-A3B \ --rollout-num-gpus-per-engine 4 \ --tensor-model-parallel-size 2 \ --actor-num-gpus-per-node 4 \ --colocate \ --num-rollout 1Root Cause
The original code passed physical GPU IDs from Ray placement groups (e.g., 1,2,3,4) to SGLang engines and then converted them to local IDs via
_to_local_gpu_id(). The conversion was based onCUDA_VISIBLE_DEVICESin the parent Ray actor's context, which might differ from what the spawned SGLang server process sees, causing device index mismatch.History: Why the Remapping Was Added
Before Jan 5, 2026:
base_gpu_idwas computed byget_base_gpu_id(args, rank)returning logical indices (0, 1, 2, ...)Jan 5, 2026 (commit c77bdcf):
_to_local_gpu_id()to handleCUDA_VISIBLE_DEVICES=4,5,6,7style GPU selectionLater (around Jan 3, 2026, commit 9dfd339):
reordered_gpu_idsfrom placement groupsbase_gpu_id = int(reordered_gpu_ids[gpu_index])_to_local_gpu_id()would convert to local IDWhy it worked before v0.5.12:
The Fix
With
RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1, the parent Ray actor and spawned SGLang server see the sameCUDA_VISIBLE_DEVICES. Therefore, we don't need_to_local_gpu_id()conversion - we can use local loop indices directly.Code flow:
Changes:
rollout.py: Usebase_gpu_id = i * num_gpu_per_engine(local loop index)sglang_engine.py: Remove_to_local_gpu_id()function, keepget_base_gpu_id()as fallbackSGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2How Users Specify Which GPUs to Use
GPU selection happens at the Ray level via
CUDA_VISIBLE_DEVICES:Status
Verification
CI tests should verify: