[WIP] [fix] Simplify GPU ID remapping for SGLang colocate mode#1178

Open
da-niao-dan wants to merge 3 commits into radixark:main from da-niao-dan:fix/sglang-cuda-remapping-issue

@da-niao-dan da-niao-dan commented May 22, 2026

Summary

Fix SGLang CUDA graph capture failure in colocate mode by using local loop indices directly instead of physical GPU IDs from Ray placement groups. Removes the workaround that disabled custom all-reduce v2.

Symptom & Reproduction

  • Symptom: SGLang v0.5.12 server fails to start in colocate rollout mode with error: Capture cuda graph failed: Runtime check failed at custom_all_reduce.cuh:37: CUDA error: invalid argument
  • Reproduction:
    export CUDA_VISIBLE_DEVICES=1,2,3,4
    python train.py \
      --hf-checkpoint /root/models/Qwen3-30B-A3B \
      --rollout-num-gpus-per-engine 4 \
      --tensor-model-parallel-size 2 \
      --actor-num-gpus-per-node 4 \
      --colocate \
      --num-rollout 1
  • Linked issue: SGLang custom all-reduce v2 CUDA graph capture fails under Miles colocate rollout #1176

Root Cause

The original code passed physical GPU IDs from Ray placement groups (e.g., 1,2,3,4) to SGLang engines and then converted them to local IDs via _to_local_gpu_id(). The conversion relied on CUDA_VISIBLE_DEVICES as seen in the parent Ray actor's context, which can differ from what the spawned SGLang server process sees, causing a device index mismatch.
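
To make the mismatch concrete, here is a minimal sketch of a physical-to-local conversion. It is not the removed _to_local_gpu_id() implementation: the name to_local_gpu_id, its visible_devices parameter, and the two example environment values are hypothetical, and the sketch assumes CUDA_VISIBLE_DEVICES holds only integer indices (no GPU UUIDs).

```python
def to_local_gpu_id(physical_id: int, visible_devices: str) -> int:
    """Map a physical GPU ID to its position in a CUDA_VISIBLE_DEVICES string.

    Hypothetical helper for illustration only; assumes integer device indices.
    """
    visible = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    if physical_id not in visible:
        raise ValueError(f"GPU {physical_id} is not visible in {visible_devices!r}")
    return visible.index(physical_id)


# The same physical GPU resolves to a different local index depending on
# whose CUDA_VISIBLE_DEVICES is used for the lookup.
parent_view = "1,2,3,4"  # assumed view of the parent Ray actor
child_view = "3,4"       # assumed view of a spawned process

local_in_parent = to_local_gpu_id(3, parent_view)  # position 2 in the parent's list
local_in_child = to_local_gpu_id(3, child_view)    # position 0 in the child's list
```

If the parent computes the index and the child consumes it, the child would address a different device than intended whenever the two views differ.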

History: Why the Remapping Was Added

Before Jan 5, 2026:

  • base_gpu_id was computed by get_base_gpu_id(args, rank) returning logical indices (0, 1, 2, ...)
  • No CUDA_VISIBLE_DEVICES handling

Jan 5, 2026 (commit c77bdcf):

  • Added _to_local_gpu_id() to handle CUDA_VISIBLE_DEVICES=4,5,6,7 style GPU selection
  • Purpose: convert physical GPU 4 → local GPU 0, etc.

Later (around Jan 3, 2026, commit 9dfd339):

  • Added reordered_gpu_ids from placement groups
  • Started passing physical GPU IDs: base_gpu_id = int(reordered_gpu_ids[gpu_index])
  • Then _to_local_gpu_id() would convert to local ID

Why it worked before v0.5.12:

  • v0.5.10: CustomAllReduceV2 disabled by default → device mapping lenient
  • v0.5.12: CustomAllReduceV2 enabled by default → strict device mapping, bug exposed

The Fix

With RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1, the parent Ray actor and the spawned SGLang server see the same CUDA_VISIBLE_DEVICES. The _to_local_gpu_id() conversion is therefore unnecessary, and local loop indices can be used directly.

Code flow:

User: CUDA_VISIBLE_DEVICES=4,5,6,7
  ↓
Ray: Creates placement group with 4 GPUs
  ↓
Engine 0 (i=0): base_gpu_id = 0*4 = 0 → Uses local GPUs 0,1,2,3 ✓
Engine 1 (i=1): base_gpu_id = 1*4 = 4 → Uses local GPUs 4,5,6,7 ✓

Changes:

  • rollout.py: Use base_gpu_id = i * num_gpu_per_engine (local loop index)
  • sglang_engine.py: Remove _to_local_gpu_id() function, keep get_base_gpu_id() as fallback
  • Remove the workaround that disabled SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2
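
The index arithmetic from the code flow above can be sketched as follows. This is an illustration, not the code in rollout.py: the function name engine_base_gpu_ids, the num_visible_gpus parameter, and the bounds check are assumptions added here. Only the expression i * num_gpu_per_engine comes from the PR.

```python
def engine_base_gpu_ids(
    num_engines: int, num_gpu_per_engine: int, num_visible_gpus: int
) -> list[int]:
    """Return the local base GPU index for each engine.

    Hypothetical helper: the range check is not part of the PR.
    """
    base_ids = []
    for i in range(num_engines):
        base_gpu_id = i * num_gpu_per_engine  # local loop index, as in the PR
        last_gpu_id = base_gpu_id + num_gpu_per_engine - 1
        if last_gpu_id >= num_visible_gpus:
            raise ValueError(
                f"engine {i} needs local GPUs {base_gpu_id}..{last_gpu_id}, "
                f"but only {num_visible_gpus} are visible"
            )
        base_ids.append(base_gpu_id)
    return base_ids


# Two engines with 4 GPUs each, assuming 8 GPUs are visible to the process.
layout = engine_base_gpu_ids(num_engines=2, num_gpu_per_engine=4, num_visible_gpus=8)
```

Under this sketch, a second engine with base index 4 only fits when at least eight devices are visible to the process, so a range check of this kind may be worth adding if engines on one node can outnumber the visible GPU bundles.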

How Users Specify Which GPUs to Use

GPU selection happens at the Ray level via CUDA_VISIBLE_DEVICES:

# Use specific GPUs (4,5,6,7 instead of 0,1,2,3)
export CUDA_VISIBLE_DEVICES=4,5,6,7
python train.py --colocate ...

Status

  • Status: WIP - awaiting CI test results
  • Local testing: No access to multi-GPU hardware for reproduction
  • Relying on: CI tests for verification

Verification

CI tests should verify:

  • ✅ SGLang server starts successfully
  • ✅ "Custom allreduce v2 initialized" appears in logs
  • ✅ "Capture cuda graph end" appears in logs
  • ✅ No "CUDA error: invalid argument"
  • ✅ Training step completes
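
A log check along these lines could automate the first four items. It is a sketch only: the function name check_server_log is hypothetical, and the marker strings are copied from this description, so the exact wording in real SGLang logs should be confirmed before relying on it.

```python
# Marker strings as quoted in this PR description (wording not verified
# against actual SGLang log output).
REQUIRED_MARKERS = (
    "Custom allreduce v2 initialized",
    "Capture cuda graph end",
)
FORBIDDEN_MARKERS = ("CUDA error: invalid argument",)


def check_server_log(log_text: str) -> list[str]:
    """Return a list of problems found in the log; an empty list means it passed."""
    problems = []
    for marker in REQUIRED_MARKERS:
        if marker not in log_text:
            problems.append(f"missing expected log line: {marker}")
    for marker in FORBIDDEN_MARKERS:
        if marker in log_text:
            problems.append(f"unexpected error in log: {marker}")
    return problems


sample_log = "Custom allreduce v2 initialized\nCapture cuda graph end\n"
sample_problems = check_server_log(sample_log)
```

Returning the list of problems rather than a boolean keeps the CI failure message specific about which marker was missing or unexpected.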

In colocate mode, SGLang engines should use local GPU 0 as their base device
ID, not physical GPU IDs from Ray placement groups. Each engine uses TP GPUs
starting from its local perspective of GPU 0, while the placement_group_bundle_index
selects which physical GPU bundle to use. This ensures consistent device mapping
between the Ray actor and the spawned SGLang server process, preventing CUDA
graph capture failures in SGLang v0.5.12+.

Changes:
- rollout.py: Use base_gpu_id=0 for all engines
- sglang_engine.py: Remove _to_local_gpu_id() conversion and base_gpu_id_is_local flag
- CUDA_VISIBLE_DEVICES handles the mapping to physical GPUs

Fixes radixark#1176

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request simplifies GPU device mapping for SGLang engines by leveraging Ray's environment isolation, specifically by hardcoding the base GPU ID to 0 and removing the manual local ID conversion logic. Feedback indicates that the fallback mechanism in sglang_engine.py may still return global GPU indices when base_gpu_id is missing, which could lead to out-of-bounds errors now that the conversion utility has been removed.

Comment thread miles/backends/sglang_utils/sglang_engine.py Outdated
When base_gpu_id is None, default to 0 instead of calling get_base_gpu_id().
The get_base_gpu_id() function returns a global GPU offset (e.g., 4, 5) which
was previously converted by _to_local_gpu_id(). Without that conversion,
passing it directly to SGLang would cause incorrect device mapping.

Ray's placement group ensures each engine sees its assigned GPU as local GPU 0,
so defaulting to 0 is correct and removes the need for get_base_gpu_id().
Use local loop index (i * num_gpu_per_engine) instead of always 0.
This correctly handles multiple engines:
- Engine 0: base_gpu_id = 0 (uses GPUs 0,1,2,3 for TP=4)
- Engine 1: base_gpu_id = 4 (uses GPUs 4,5,6,7 for TP=4)

Also restore get_base_gpu_id() as fallback for other code paths.

Key insight: With RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1,
parent and child processes see the same CUDA_VISIBLE_DEVICES, so
we don't need _to_local_gpu_id() conversion. We just use local
loop indices directly.