Fix: spin-wait ack barrier replaces check-and-return in sync_start drain protocol #501

Merged
poursoul merged 1 commit into hw-native-sys:main from yanghaoran29:fix_sync
Apr 13, 2026

Conversation

Contributor

@yanghaoran29 yanghaoran29 commented Apr 10, 2026

Replace the non-blocking ack check (load and return if not all acked) with a spin-wait loop that blocks until all scheduler threads have set their bit in drain_ack_mask. This eliminates the window where a non-elected thread returns to the scheduler loop and resumes tracker writes while the drain worker already has exclusive tracker access.

Remove drain_barrier_mask (the second atomic introduced as an intermediate step) — the single spin-wait on drain_ack_mask is sufficient for the full-stop guarantee. Reset detection uses drain_ack_mask bit-clear (release store on insufficient resources), not drain_worker_elected which remains zero until after the barrier completes.

Also fix drain_ack_mask reset ordering: use memory_order_release instead of relaxed so the clearing store is visible to threads spinning on their own bit.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a four-phase drain protocol in aicpu_executor.cpp by adding a full-stop barrier to ensure all threads have finished tracker writes before the election phase. A critical logic error was identified in the implementation of the new barrier: the spin-wait loop checks drain_worker_elected to detect a reset, but since this variable is only set after the loop, the condition is always true, leading to premature returns and potential livelocks. It is recommended to use drain_ack_mask to monitor for resets instead.

@yanghaoran29
Contributor Author

Data Race Analysis and Fix for the Drain Protocol

1. Problem Symptom

In drain mode, core_trackers_[t].core_states_ is a regular uint64_t (non-atomic) with no lock protection, posing a data race risk.

In normal operation, thread t is the sole writer of its own core_trackers_[t].core_states_, so no race occurs. In drain mode, however, the drain thread iterates over and writes to the entire core_trackers_ array, and nothing guarantees that the other threads have stopped writing to core_trackers_; non-drain threads may still be writing their tracker state at:

tracker.change_core_state(bit_pos);

in src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp line 450, leading to a race condition.

2. Reproduction Steps

In src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp, immediately before line 450, which reads:

tracker.change_core_state(bit_pos);

insert the following:

if ((drain_state_.drain_ack_mask.load(std::memory_order_relaxed)) != 0) { usleep(1000); }
assert((drain_state_.drain_worker_elected.load(std::memory_order_relaxed)) == 0);

Then increase the parameters in the sample examples/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/kernels/orchestration/spmd_sync_start_stress_orch.cpp and its corresponding golden file, and run the test. The assert fails with low probability, confirming that multiple threads concurrently write the same tracker.

Example timeline:

Thread 0: entering drain mode (sync_start_pending=12)
Thread 0: drain acked, ack_mask=0x1 ≠ 0x7 → return to main loop

Thread 1: entering drain mode
Thread 1: drain acked, ack_mask=0x3 ≠ 0x7 → return to main loop

Thread 1: [back to main loop] → enter completion check
Thread 1:     change_core_state(bit_pos)    ← writes core_trackers_[1].core_states_
                                             ← Thread 1's ack bit 0x2 remains in the mask

Thread 2: acks → ack_mask=0x7 == all_acked
Thread 2: elected → drain_worker_dispatch
Thread 2:     core_trackers_[1].get_valid_cluster_offset_states()  ← reads core_states_
Thread 2:     dispatch_block_to_cluster(t=1, ...)
Thread 2:     change_core_state(...)                               ← writes core_states_

                    ↑ Thread 1 and Thread 2 concurrently read/write the same core_states_ (data race)

3. Root Cause

3.1 Insufficient Semantics of the Ack Barrier

The fetch_or operation on drain_ack_mask means:

"I have stopped issuing new dispatches (no longer actively scheduling new tasks)."

It does NOT guarantee:

  • The thread has stopped completion polling (change_core_state)
  • The thread is in a quiescent state (no longer modifying core_trackers_)

3.2 Legacy Implementation: Immediate Return to Main Loop After Ack

In the old handle_drain_mode:

drain_ack_mask.fetch_or(1u << thread_idx, release);

// Return immediately if not all acks are received
if ((drain_ack_mask.load(acquire) & all_acked) != all_acked) return;

When ack_mask is not fully set, the thread returns directly to the top of the main scheduling loop and immediately enters the completion check (check_running_cores_for_completion), which includes writes to core_trackers_[thread_idx].core_states_ (change_core_state).

check_running_cores_for_completion does not check any drain state and has no path to prevent execution after ack.

3.3 all_acked Does Not Mean All Threads Are Quiescent

When Thread 2 observes ack_mask == all_acked, Thread 0 and Thread 1 may have already returned and re-entered the completion check in the main loop.

all_acked only proves that all three threads passed the fetch_or line at some point; it does not prove they are stopped at this moment.

3.4 Race Target

core_trackers_[t].core_states_ is a regular uint64_t (non-atomic) with no lock protection.

3.5 Limitation of the Legacy Handshake (Check-and-Return After Ack)

The legacy scheme only guarantees that each thread passed the ack point at some time; it cannot ensure:

  • The thread is stopped at the current moment (may have returned to the main loop);
  • The thread has stopped tracker writes in the completion path (change_core_state);
  • No concurrent read/write on the same tracker when the elected thread starts traversing the global core_trackers_.

Thus, "check-and-return after ack" only stops new dispatches — it fails to achieve the handshake goal of "full quiescence before drain".


4. Fix: Replace Ack Barrier with Spin-Wait

Change the ack barrier from "set bit → check once → return if not satisfied" to "set bit → spin until all threads have set their bits". Once spinning, threads never return to the main loop to perform tracker writes, which guarantees the elected thread exclusive access to the trackers.

4.1 No New Fields Needed for SyncStartDrainState

struct alignas(64) SyncStartDrainState {
    std::atomic<int32_t>  sync_start_pending{0};    // 0=normal; -1=initializing; >0=active
    std::atomic<int32_t>  drain_worker_elected{0};  // 0=none; >0: elected thread's (thread_idx+1)
    std::atomic<uint32_t> drain_ack_mask{0};        // bit per thread; all-set = all threads reached ack barrier
    PTO2TaskSlotState    *pending_task{nullptr};
    int32_t _pad[10];
};

The semantics of drain_ack_mask are upgraded from "once passed the ack point" to "currently spinning and waiting" — the spin itself acts as a full-stop barrier.

4.2 Revised Control Flow of handle_drain_mode

[Ack Barrier] Set bit and spin-wait:
    fetch_or(drain_ack_mask, 1 << thread_idx, release)
    spin:
        ack = drain_ack_mask.load(acquire)
        if (ack & all_acked) == all_acked → break (all ready, proceed)
        if (ack & (1 << thread_idx)) == 0 → return (elected reset mask, abort current round)
        SPIN_WAIT_HINT()
    // Reaching here: all threads are spinning, no tracker writes executed

[Election]
    CAS(drain_worker_elected, 0 → thread_idx+1)
    CAS failed (non-elected):
        spin: while (sync_start_pending != 0):
            if (drain_worker_elected.load(acquire) == 0) → return (resource insufficient reset, retry)
            SPIN_WAIT_HINT()
        return
    CAS succeeded (elected):
        Check if global resources are sufficient
        Insufficient → drain_ack_mask.store(0, release)       ← notify spinning threads of reset, exit current round
                       drain_worker_elected.store(0, release) ← notify threads past ack barrier of reset
                       return
        Sufficient   → drain_worker_dispatch()
                       fence(release) + clear fields + sync_start_pending.store(0, release)
                       → other threads observe sync_start_pending == 0 and exit spin

4.3 Two-Stage Reset Detection

When resources are insufficient, threads in different phases detect reset via separate signals:

| Thread Phase | Reset Signal | Detection Method |
| --- | --- | --- |
| Spinning at ack barrier | drain_ack_mask.store(0, release) | detect (ack & my_bit) == 0 in spin → return |
| Past ack barrier, spinning after election | drain_worker_elected.store(0, release) | detect drain_worker_elected.load(acquire) == 0 in spin → return |

drain_worker_elected == 0 is the initial value before election, so it cannot be used as a reset criterion in the ack barrier spin. The two signals (drain_ack_mask and drain_worker_elected) serve their respective phases.

4.4 Field Clearing Timing

| Field | Normal Completion (after drain_worker_dispatch) | Reset (insufficient resources) |
| --- | --- | --- |
| drain_ack_mask | cleared (relaxed, covered by fence) | store(0, release) (reset signal for the ack spin) |
| drain_worker_elected | cleared (relaxed, covered by fence) | store(0, release) (reset signal for the post-election spin) |
| sync_start_pending | store(0, release) (final step) | not cleared (drain continues waiting for resources) |

In the normal completion path, atomic_thread_fence(release) ensures visibility of all tracker writes.
sync_start_pending.store(0, release) acts as the final publish point; other threads safely exit after acquiring 0.

4.5 Key Guarantees

| Guarantee | Mechanism |
| --- | --- |
| All threads stop issuing new dispatches | drain_ack_mask fetch_or (unchanged) |
| All threads stop completion checks (change_core_state) | ack barrier replaced with a spin: threads block after setting their bit and perform no tracker writes |
| Elected thread has exclusive tracker access | election and dispatch start only after the ack barrier, while all other threads are spinning |
| All threads exit and retry safely on resource shortage | two-stage reset signals cover the ack spin and the post-election spin |

5. Conclusion

  • Root cause: the legacy ack barrier used check-and-return, allowing threads to return to the main loop and write trackers immediately after acking.
  • Fix: replace the ack barrier with a spin-wait (spin until all bits are set). The spin itself acts as a full-stop barrier, with no need for an extra drain_barrier_mask.
  • Relation to the two-phase scheme: drain_barrier_mask was an intermediate solution, ultimately folded into the spin semantics of the ack barrier. The same guarantees are achieved with a single atomic variable and a simpler design.
  • Two-stage reset detection: the drain_ack_mask clear serves the ack-spin phase; the drain_worker_elected clear serves the post-election spin phase.

@yanghaoran29 yanghaoran29 changed the title Fix: add drain_barrier_mask for full-stop barrier in sync_start drain protocol Fix: spin-wait ack barrier replaces check-and-return in sync_start drain protocol Apr 13, 2026
@poursoul poursoul merged commit d82c4d8 into hw-native-sys:main Apr 13, 2026
13 checks passed