Fix: spin-wait ack barrier replaces check-and-return in sync_start drain protocol #501

Merged
poursoul merged 1 commit into hw-native-sys:main from yanghaoran29:fix_sync
Apr 13, 2026

Conversation

Contributor

@yanghaoran29 yanghaoran29 commented Apr 10, 2026

Replace the non-blocking ack check (load and return if not all acked) with a spin-wait loop that blocks until all scheduler threads have set their bit in drain_ack_mask. This eliminates the window where a non-elected thread returns to the scheduler loop and resumes tracker writes while the drain worker already has exclusive tracker access.

Remove drain_barrier_mask (the second atomic introduced as an intermediate step) — the single spin-wait on drain_ack_mask is sufficient for the full-stop guarantee. Reset detection uses drain_ack_mask bit-clear (release store on insufficient resources), not drain_worker_elected which remains zero until after the barrier completes.

Also fix drain_ack_mask reset ordering: use memory_order_release instead of relaxed so the clearing store is visible to threads spinning on their own bit.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a four-phase drain protocol in aicpu_executor.cpp by adding a full-stop barrier to ensure all threads have finished tracker writes before the election phase. A critical logic error was identified in the implementation of the new barrier: the spin-wait loop checks drain_worker_elected to detect a reset, but since this variable is only set after the loop, the condition is always true, leading to premature returns and potential livelocks. It is recommended to use drain_ack_mask to monitor for resets instead.

@yanghaoran29
Contributor Author

Data Race Analysis and Fix for the Drain Protocol

1. Problem Symptom

In drain mode, core_trackers_[t].core_states_ is a regular uint64_t (non-atomic) with no lock protection, posing a data race risk.

In normal operation, thread t is the sole writer of its own core_trackers_[t].core_states_, so no race occurs. In drain mode, however, the drain thread iterates over and writes to the entire core_trackers_ array, and nothing guarantees that the other threads have stopped writing to core_trackers_; non-drain threads may still be writing their tracker state at:

tracker.change_core_state(bit_pos);

in src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp line 450, leading to a race condition.

2. Reproduction Steps

In src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp, immediately before line 450, which reads:

tracker.change_core_state(bit_pos);

insert the following:

if ((drain_state_.drain_ack_mask.load(std::memory_order_relaxed)) != 0) { usleep(1000); }
assert((drain_state_.drain_worker_elected.load(std::memory_order_relaxed)) == 0);

Then increase the parameters in the sample examples/a5/tensormap_and_ringbuffer/spmd_sync_start_stress/kernels/orchestration/spmd_sync_start_stress_orch.cpp and its corresponding golden file, and run the test. The assert fails with low probability, confirming that multiple threads concurrently write the same tracker.

Example timeline:

Thread 0: entering drain mode (sync_start_pending=12)
Thread 0: drain acked, ack_mask=0x1 ≠ 0x7 → return to main loop

Thread 1: entering drain mode
Thread 1: drain acked, ack_mask=0x3 ≠ 0x7 → return to main loop

Thread 1: [back to main loop] → enter completion check
Thread 1:     change_core_state(bit_pos)    ← writes core_trackers_[1].core_states_
                                             ← Thread 1's ack bit 0x2 remains in the mask

Thread 2: acks → ack_mask=0x7 == all_acked
Thread 2: elected → drain_worker_dispatch
Thread 2:     core_trackers_[1].get_valid_cluster_offset_states()  ← reads core_states_
Thread 2:     dispatch_block_to_cluster(t=1, ...)
Thread 2:     change_core_state(...)                               ← writes core_states_

                    ↑ Thread 1 and Thread 2 concurrently read/write the same core_states_ (data race)

3. Root Cause

3.1 Insufficient Semantics of the Ack Barrier

The fetch_or operation on drain_ack_mask means:

"I have stopped issuing new dispatches (no longer actively scheduling new tasks)."

It does NOT guarantee:

  • The thread has stopped completion polling (change_core_state)
  • The thread is in a quiescent state (no longer modifying core_trackers_)

3.2 Legacy Implementation: Immediate Return to Main Loop After Ack

In the old handle_drain_mode:

drain_ack_mask.fetch_or(1u << thread_idx, release);

// Return immediately if not all acks are received
if ((drain_ack_mask.load(acquire) & all_acked) != all_acked) return;

When ack_mask is not fully set, the thread returns directly to the top of the main scheduling loop and immediately enters the completion check (check_running_cores_for_completion), which includes writes to core_trackers_[thread_idx].core_states_ (change_core_state).

check_running_cores_for_completion does not check any drain state and has no path to prevent execution after ack.

3.3 all_acked Does Not Mean All Threads Are Quiescent

When Thread 2 observes ack_mask == all_acked, Thread 0 and Thread 1 may have already returned and re-entered the completion check in the main loop.

all_acked only proves that all three threads passed the fetch_or line at some point; it does not prove they are stopped at this moment.

3.4 Race Target

core_trackers_[t].core_states_ is a regular uint64_t (non-atomic) with no lock protection.

3.5 Limitation of the Legacy Handshake (Check-and-Return After Ack)

The legacy scheme only guarantees that each thread passed the ack point at some time; it cannot ensure:

  • The thread is stopped at the current moment (may have returned to the main loop);
  • The thread has stopped tracker writes in the completion path (change_core_state);
  • No concurrent read/write on the same tracker when the elected thread starts traversing the global core_trackers_.

Thus, "check-and-return after ack" only stops new dispatches — it fails to achieve the handshake goal of "full quiescence before drain".


4. Fix: Replace Ack Barrier with Spin-Wait

Change the ack barrier from "set bit → check once → return if not satisfied" to "set bit → spin until all threads have set their bits". Once spinning, threads never return to the main loop to perform tracker writes, which guarantees the elected thread exclusive access to the trackers.

4.1 No New Fields Needed for SyncStartDrainState

struct alignas(64) SyncStartDrainState {
    std::atomic<int32_t>  sync_start_pending{0};    // 0=normal; -1=initializing; >0=active
    std::atomic<int32_t>  drain_worker_elected{0};  // 0=none; >0: elected thread's (thread_idx+1)
    std::atomic<uint32_t> drain_ack_mask{0};        // bit per thread; all-set = all threads reached ack barrier
    PTO2TaskSlotState    *pending_task{nullptr};
    int32_t _pad[10];
};

The semantics of drain_ack_mask are upgraded from "once passed the ack point" to "currently spinning and waiting" — the spin itself acts as a full-stop barrier.

4.2 Revised Control Flow of handle_drain_mode

[Ack Barrier] Set bit and spin-wait:
    fetch_or(drain_ack_mask, 1 << thread_idx, release)
    spin:
        ack = drain_ack_mask.load(acquire)
        if (ack & all_acked) == all_acked → break (all ready, proceed)
        if (ack & (1 << thread_idx)) == 0 → return (elected reset mask, abort current round)
        SPIN_WAIT_HINT()
    // Reaching here: all threads are spinning, no tracker writes executed

[Election]
    CAS(drain_worker_elected, 0 → thread_idx+1)
    CAS failed (non-elected):
        spin: while (sync_start_pending != 0):
            if (drain_worker_elected.load(acquire) == 0) → return (resource insufficient reset, retry)
            SPIN_WAIT_HINT()
        return
    CAS succeeded (elected):
        Check if global resources are sufficient
        Insufficient → drain_ack_mask.store(0, release)       ← notify spinning threads of reset, exit current round
                       drain_worker_elected.store(0, release) ← notify threads past ack barrier of reset
                       return
        Sufficient   → drain_worker_dispatch()
                       fence(release) + clear fields + sync_start_pending.store(0, release)
                       → other threads observe sync_start_pending == 0 and exit spin

4.3 Two-Stage Reset Detection

When resources are insufficient, threads in different phases detect reset via separate signals:

| Thread Phase | Reset Signal | Detection Method |
| --- | --- | --- |
| Spinning at ack barrier | drain_ack_mask.store(0, release) | detect (ack & my_bit) == 0 in spin → return |
| Past ack barrier, spinning after election | drain_worker_elected.store(0, release) | detect drain_worker_elected.load(acquire) == 0 in spin → return |

drain_worker_elected == 0 is the initial value before election, so it cannot be used as a reset criterion in the ack barrier spin. The two signals (drain_ack_mask and drain_worker_elected) serve their respective phases.

4.4 Field Clearing Timing

| Field | Normal Completion (after drain_worker_dispatch) | Reset (insufficient resources) |
| --- | --- | --- |
| drain_ack_mask | cleared (relaxed, covered by fence) | store(0, release) (reset signal for the ack spin) |
| drain_worker_elected | cleared (relaxed, covered by fence) | store(0, release) (reset signal for the post-election spin) |
| sync_start_pending | store(0, release) (final step) | not cleared (drain continues waiting for resources) |

In the normal completion path, atomic_thread_fence(release) ensures visibility of all tracker writes.
sync_start_pending.store(0, release) acts as the final publish point; other threads safely exit after acquiring 0.

4.5 Key Guarantees

| Guarantee | Mechanism |
| --- | --- |
| All threads stop issuing new dispatches | drain_ack_mask fetch_or (unchanged) |
| All threads stop completion checks (change_core_state) | ack barrier replaced with a spin: threads block after setting their bit and perform no tracker writes |
| Elected thread has exclusive tracker access | election and dispatch start only after the ack barrier, while all other threads are spinning |
| All threads exit and retry safely on resource shortage | two-stage reset signals cover the ack spin and the post-election spin |

5. Conclusion

  • Root cause: the legacy ack barrier used check-and-return, allowing threads to return to the main loop and write trackers immediately after acking.
  • Fix: replace the ack barrier with a spin-wait (spin until all bits are set). The spin itself acts as a full-stop barrier, with no need for an extra drain_barrier_mask.
  • Relation to the two-phase scheme: drain_barrier_mask was an intermediate solution, ultimately folded into the spin semantics of the ack barrier. The same guarantees are achieved with a single atomic variable and a simpler design.
  • Two-stage reset detection: the drain_ack_mask clear serves the ack-spin phase; the drain_worker_elected clear serves the post-election spin phase.

@yanghaoran29 yanghaoran29 changed the title Fix: add drain_barrier_mask for full-stop barrier in sync_start drain protocol Fix: spin-wait ack barrier replaces check-and-return in sync_start drain protocol Apr 13, 2026
@poursoul poursoul merged commit d82c4d8 into hw-native-sys:main Apr 13, 2026
13 checks passed