Skip to content

Conversation

@YutongJau
Copy link

Description

Fixes a critical race condition in NcclTreeFlowModel::iteratable causing "double free" crashes (SIGABRT) under high-concurrency scenarios, particularly when running fine-grained workloads or low-latency network configurations.

Crash Log

info: reduce-scatter forward pass collective issued for layer: mlp_row, involved dimensions: 1, 0, 0...
***** info: fwd pass comm collective for layer: mlp_row is finished************
double free or corruption (out)
Aborted (core dumped)

The Issue

A TOCTOU (Time-of-Check to Time-of-Use) vulnerability exists because the exit condition is checked after releasing the lock:

cs.ExitSection(); // Lock released
if (all_channel_finished == true && all_packets_freed == true) {
    exit(); // <--- Race condition: multiple threads can enter here
    return false;
}

The Fix

Utilized the existing atomic judge_exit_flag.exchange(true) to ensure exit() is called exactly once, regardless of thread interleaving.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the termination logic in NcclTreeFlowModel::iteratable to eliminate a race that could lead to multiple threads calling NcclTreeFlowModel::exit() and triggering double-free crashes under high concurrency.

Changes:

  • Compute the all_channel_finished && all_packets_freed termination condition while holding the FlowCriticalSection lock and introduce a local should_exit flag.
  • Use the existing atomic judge_exit_flag.exchange(true) within the critical section to ensure that only a single thread observes should_exit == true and proceeds to call exit().
  • Keep the call to exit() outside the critical section while now being guarded by the atomic flag to preserve previous behavior without the race.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant