Fix: spin-wait ack barrier replaces check-and-return in sync_start drain protocol #501
Conversation
Code Review
This pull request introduces a four-phase drain protocol in aicpu_executor.cpp by adding a full-stop barrier to ensure all threads have finished tracker writes before the election phase. A critical logic error was identified in the implementation of the new barrier: the spin-wait loop checks drain_worker_elected to detect a reset, but since this variable is only set after the loop, the condition is always true, leading to premature returns and potential livelocks. It is recommended to use drain_ack_mask to monitor for resets instead.
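The flaw described above can be illustrated with a minimal sketch. The function and variable names below are hypothetical, modeled on the comment (the real fields live in aicpu_executor.cpp and may differ):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical drain-state fields modeled on the review comment.
std::atomic<uint32_t> drain_ack_mask{0};
std::atomic<uint32_t> drain_worker_elected{0};

// BROKEN: uses drain_worker_elected == 0 as the reset signal. The variable
// is only set AFTER the barrier completes, so the check is true on the very
// first iteration and the thread returns prematurely.
bool broken_ack_barrier(uint32_t my_bit, uint32_t all_acked) {
    drain_ack_mask.fetch_or(my_bit, std::memory_order_acq_rel);
    while ((drain_ack_mask.load(std::memory_order_acquire) & all_acked) != all_acked) {
        if (drain_worker_elected.load(std::memory_order_acquire) == 0)
            return false;  // taken immediately: premature return
    }
    return true;
}

// RECOMMENDED: a reset clears drain_ack_mask with a release store, so a
// spinning thread watches its own ack bit instead.
bool fixed_ack_barrier(uint32_t my_bit, uint32_t all_acked) {
    drain_ack_mask.fetch_or(my_bit, std::memory_order_acq_rel);
    while ((drain_ack_mask.load(std::memory_order_acquire) & all_acked) != all_acked) {
        if ((drain_ack_mask.load(std::memory_order_acquire) & my_bit) == 0)
            return false;  // our bit was cleared: genuine reset, retry later
    }
    return true;
}
```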
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
…ain protocol

Replace the non-blocking ack check (load and return if not all acked) with a spin-wait loop that blocks until all scheduler threads have set their bit in drain_ack_mask. This eliminates the window where a non-elected thread returns to the scheduler loop and resumes tracker writes while the drain worker already has exclusive tracker access.

Remove drain_barrier_mask (the second atomic introduced as an intermediate step) — the single spin-wait on drain_ack_mask is sufficient for the full-stop guarantee. Reset detection uses drain_ack_mask bit-clear (release store on insufficient resources), not drain_worker_elected, which remains zero until after the barrier completes.

Also fix drain_ack_mask reset ordering: use memory_order_release instead of relaxed so the clearing store is visible to threads spinning on their own bit.
Data Race Analysis and Fix for the Drain Protocol

1. Problem Symptom

Under default conditions, thread t exclusively writes its own tracker. In drain mode, however, the tracker write tracker.change_core_state(bit_pos); at src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp line 450 can still be reached by a non-elected thread while the drain worker holds exclusive tracker access, leading to a race condition.

2. Reproduction Steps

Insert the following code before line 450 (tracker.change_core_state(bit_pos);):

```cpp
if ((drain_state_.drain_ack_mask.load(std::memory_order_relaxed)) != 0) {
    usleep(1000);  // widen the race window once a drain has been requested
}
assert((drain_state_.drain_worker_elected.load(std::memory_order_relaxed)) == 0);
```

Then increase the parameters in the sample workload; the assert fires when a thread is about to write its tracker while a drain worker is elected.

3. Root Cause

3.1 Insufficient Semantics of the Ack Barrier

The ack bit only records that a thread has observed the drain request and will stop issuing new dispatches.

It does NOT guarantee that the thread has stopped completion checks (change_core_state): after setting its bit, the thread may still return to the main loop and keep writing its tracker.

3.2 Legacy Implementation: Immediate Return to Main Loop After Ack

In the old implementation, a thread set its ack bit and then checked the mask only once:

```cpp
drain_ack_mask.fetch_or(1u << thread_idx, release);
// Return immediately if not all acks are received
if ((drain_ack_mask.load(acquire) & all_acked) != all_acked) return;
```

When not all acks have arrived yet, the thread returns to the main loop and resumes tracker writes, even though the election phase may begin as soon as the last ack lands.
3.3 Two-Stage Reset Detection
| Thread Phase | Reset Signal | Detection Method |
|---|---|---|
| Spinning at ack barrier | `drain_ack_mask.store(0, release)` | Detect `(ack & my_bit) == 0` in spin → return |
| Past ack barrier, spinning after election | `drain_worker_elected.store(0, release)` | Detect `drain_worker_elected.load(acquire) == 0` in spin → return |
drain_worker_elected == 0 is the initial value before election, so it cannot be used as a reset criterion in the ack barrier spin. The two signals (drain_ack_mask and drain_worker_elected) serve their respective phases.
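The second phase in the table can be sketched as follows. This is a minimal sketch; `post_election_spin` and its parameter names are hypothetical, but the reset check matches the table: `drain_worker_elected == 0` is a valid signal here only because election has already stored a non-zero value before this spin is entered.

```cpp
#include <atomic>
#include <cstdint>

// Outcome of the post-election spin: either the drain completed normally
// (final publish observed) or the worker signalled a reset.
enum class SpinResult { kCompleted, kReset };

SpinResult post_election_spin(std::atomic<uint32_t>& drain_worker_elected,
                              std::atomic<uint32_t>& sync_start_pending) {
    while (sync_start_pending.load(std::memory_order_acquire) != 0) {
        // Reset signal for this phase: drain_worker_elected.store(0, release)
        // on insufficient resources.
        if (drain_worker_elected.load(std::memory_order_acquire) == 0)
            return SpinResult::kReset;
    }
    // sync_start_pending was published as 0: drain completed normally.
    return SpinResult::kCompleted;
}
```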
4.4 Field Clearing Timing
| Field | Normal Completion (after drain_worker_dispatch) | Reset (insufficient resources) |
|---|---|---|
| `drain_ack_mask` | Cleared (relaxed, covered by fence) | `store(0, release)` (reset signal for ack spin) |
| `drain_worker_elected` | Cleared (relaxed, covered by fence) | `store(0, release)` (reset signal for post-election spin) |
| `sync_start_pending` | `store(0, release)` (final step) | Not cleared (drain continues waiting for resources) |
In the normal completion path, atomic_thread_fence(release) ensures visibility of all tracker writes.
sync_start_pending.store(0, release) acts as the final publish point; other threads safely exit after acquiring 0.
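A sketch of this normal-completion ordering, assuming the fields sit in a `DrainState` struct (names follow this write-up, not necessarily the real aicpu_executor.cpp):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical container for the drain-protocol fields.
struct DrainState {
    std::atomic<uint32_t> drain_ack_mask{0};
    std::atomic<uint32_t> drain_worker_elected{0};
    std::atomic<uint32_t> sync_start_pending{0};
};

void complete_drain(DrainState& s) {
    // Make all preceding tracker writes visible before any clearing store.
    std::atomic_thread_fence(std::memory_order_release);
    // Relaxed is sufficient here: the fence above and the release store
    // below order these clears relative to the final publish.
    s.drain_ack_mask.store(0, std::memory_order_relaxed);
    s.drain_worker_elected.store(0, std::memory_order_relaxed);
    // Final publish point: a thread that acquire-loads 0 from
    // sync_start_pending also observes the cleared fields and tracker writes.
    s.sync_start_pending.store(0, std::memory_order_release);
}
```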
4.5 Key Guarantees
| Guarantee | Mechanism |
|---|---|
| All threads stop issuing new dispatches | `drain_ack_mask` `fetch_or` (unchanged) |
| All threads stop completion checks (`change_core_state`) | Ack barrier replaced with spin: threads block in spin after setting bit, no tracker writes |
| Elected thread has exclusive tracker access | Election and dispatch start only after ack barrier (all threads spinning) |
| All threads exit and retry safely on resource shortage | Two-stage reset signals cover ack spin and post-election spin |
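The guarantees in the table can be combined into one sketch of a scheduler thread entering the drain protocol. Field names follow this write-up; the CAS-based election is an assumption for illustration (the real election mechanism is not shown in this document):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical container for the drain-protocol fields.
struct DrainState {
    std::atomic<uint32_t> drain_ack_mask{0};
    std::atomic<uint32_t> drain_worker_elected{0};
    std::atomic<uint32_t> sync_start_pending{0};
};

enum class DrainOutcome { kWorker, kFollower, kReset };

DrainOutcome enter_drain(DrainState& s, uint32_t thread_idx, uint32_t all_acked) {
    const uint32_t my_bit = 1u << thread_idx;
    s.drain_ack_mask.fetch_or(my_bit, std::memory_order_acq_rel);

    // Full-stop ack barrier: no tracker writes can happen past this point
    // until the drain completes or is reset.
    while ((s.drain_ack_mask.load(std::memory_order_acquire) & all_acked) != all_acked) {
        if ((s.drain_ack_mask.load(std::memory_order_acquire) & my_bit) == 0)
            return DrainOutcome::kReset;  // ack mask cleared: retry later
    }

    // Election (assumed CAS): exactly one thread wins, gains exclusive
    // tracker access, and goes on to run drain_worker_dispatch.
    uint32_t expected = 0;
    if (s.drain_worker_elected.compare_exchange_strong(
            expected, thread_idx + 1, std::memory_order_acq_rel))
        return DrainOutcome::kWorker;

    // Post-election spin: wait for the final publish, or for the worker to
    // signal a reset by clearing its election flag.
    while (s.sync_start_pending.load(std::memory_order_acquire) != 0) {
        if (s.drain_worker_elected.load(std::memory_order_acquire) == 0)
            return DrainOutcome::kReset;
    }
    return DrainOutcome::kFollower;
}
```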
5. Conclusion
- Root cause: the legacy ack barrier used "check-and-return", allowing threads to return to the main loop and write trackers immediately after acking.
- Fix: replace the ack barrier with a spin-wait (spin until all bits are set). The spin acts as a full-stop barrier without requiring an extra drain_barrier_mask.
- Relation to the two-phase scheme: drain_barrier_mask was an intermediate solution, finally merged into the spin semantics of the ack barrier. Equivalent guarantees are achieved with a single atomic variable, for a simpler design.
- Two-stage reset detection: drain_ack_mask clear for the ack spin phase, drain_worker_elected clear for the post-election spin phase.