
[bug] CUDA Graphs Hang in MoE Token Dispatcher #3517

@ngu3

Description

Benchmarking jobs for GPT-OSS (BF16) using CUDA graphs are hanging in the 26.02 release. This appears to be a regression, as the same configurations (64 and 128 GPUs) were functional in version 25.11.

The hang occurs during the CUDA graph capture/replay phase.
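
For reference, the sketch below shows the standard torch.cuda.CUDAGraph capture/replay split in PyTorch. It is illustrative only, not Megatron's actual code path (our run goes through the transformer_engine graph implementation), but the two phases are the same in spirit; the hang appears between the end of capture and the first replays.

  import torch

  # Illustrative capture/replay phases; model and shapes are placeholders.
  model = torch.nn.Linear(1024, 1024).cuda()
  x = torch.randn(64, 1024, device="cuda")

  # Warm-up on a side stream, required before capture.
  s = torch.cuda.Stream()
  s.wait_stream(torch.cuda.current_stream())
  with torch.cuda.stream(s):
      for _ in range(3):
          y = model(x)
  torch.cuda.current_stream().wait_stream(s)

  # Capture phase: kernels are recorded into the graph, not executed.
  g = torch.cuda.CUDAGraph()
  with torch.cuda.graph(g):
      y = model(x)

  # Replay phase: the recorded work is launched on every training step.
  g.replay()
  torch.cuda.synchronize()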

Environment

  • Release: 26.02 (Regression from 25.11)
  • Model: GPT-OSS BF16
  • Hardware: 64, 128, and 256 GPU GB300 clusters
  • Config:
    • moe_flex_dispatcher_backend = None
    • moe_token_dispatcher_type = alltoall
    • cuda_graph_scope = ["attn", "moe_router", "moe_preprocess"]

Observed Behavior

When CUDA graphs are enabled with the scopes above, the process hangs indefinitely.

  • File Location: megatron/core/transformer/moe/token_dispatcher.py
  • GPU Utilization: 0% (idle)
  • CPU Utilization: ~200% per process, which indicates a busy-wait or host-side deadlock (see the diagnostic sketch after this list)
  • Scaling Impact:
    • 64/128 GPUs: immediate hang in 26.02 (functional in 25.11)
    • 256 GPUs: fails in both 25.11 and 26.02
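
To confirm where each rank is blocked, a stdlib-only stack dump can be wired into the training entrypoint before capture starts. This is a hypothetical diagnostic sketch (the 600 s timeout is an arbitrary choice of ours), not something the workload already does:

  import faulthandler
  import signal

  # Dump Python stacks of all threads on SIGUSR1 so a hung rank can be
  # inspected with `kill -USR1 <pid>` from inside the pod.
  faulthandler.register(signal.SIGUSR1, all_threads=True)

  # Also dump automatically once if we are still stuck after 10 minutes.
  faulthandler.dump_traceback_later(timeout=600, repeat=False)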

Comparison Data (NPI NV Roofline)

| GPU Count | Scopes | 25.11 (vs Roofline Target) | 26.02 (vs Roofline Target) |
| :--- | :--- | :--- | :--- |
| 64 | attn, moe_router, moe_preprocess | PASS | FAIL (hang) |
| 128 | attn, moe_router, moe_preprocess | PASS | FAIL (hang) |
| 256 | attn, moe_router, moe_preprocess | FAIL | FAIL (hang) |

Additional Context

Release 26.04 exhibits the same hang and the same behavior.

Minimal repro

We run the above workload on a GKE cluster; the launch script we use is:

  python scripts/performance/custom_setup_experiment.py \
    --model_family_name gpt_oss \
    --model_recipe_name gpt_oss_120b \
    --config_variant v1 \
    --gpu gb300 \
    --num_gpus 64 \
    --gpus_per_node 4 \
    --compute_dtype bf16 \
    --seq_length 4096 \
    --global_batch_size 512 \
    --micro_batch_size 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1 \
    --expert_model_parallel_size 64 \
    --expert_tensor_parallel_size 1 \
    --cuda_graph_impl transformer_engine \
    --cuda_graph_scope attn,moe_router,moe_preprocess

We use custom_setup_experiment.py for the GKE environment; on a Slurm cluster, setup_experiment.py can be tried instead.
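
As a more distilled reproduction, the core pattern the scoped dispatcher capture exercises can be sketched as below. This is a hypothetical reduction on our part: the function name and buffers are ours rather than Megatron's, and it assumes an already-initialized NCCL process group with one GPU per rank and a PyTorch/NCCL build that supports graph-capturing collectives.

  import torch
  import torch.distributed as dist

  def captured_alltoall(inp: torch.Tensor):
      """Capture a single all_to_all_single into a CUDA graph.

      Hypothetical stand-in for the alltoall token dispatch; assumes
      dist.init_process_group("nccl") has already run on every rank.
      """
      out = torch.empty_like(inp)

      # Warm-up on a side stream, as required before graph capture.
      s = torch.cuda.Stream()
      s.wait_stream(torch.cuda.current_stream())
      with torch.cuda.stream(s):
          for _ in range(3):
              dist.all_to_all_single(out, inp)
      torch.cuda.current_stream().wait_stream(s)

      # Every rank must reach this capture region in the same collective
      # order; any rank that diverges (or syncs with the host inside the
      # region) leaves the others spinning, matching the observed hang.
      g = torch.cuda.CUDAGraph()
      with torch.cuda.graph(g):
          dist.all_to_all_single(out, inp)
      return g, out

If the real dispatcher needs host-visible token counts to size the all-to-all splits inside the captured region, the device-to-host sync that requires could stall capture, but that is speculation we have not confirmed.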

Expected behavior

Training proceeds past CUDA graph capture and runs normally, as it does in 25.11. Instead, it gets stuck right after CUDA graph capture completes.

Affected area

area:model

Regression?

Yes
