
[Bug] Data race on core_states_ in drain protocol — ack barrier has insufficient semantics #509

@yanghaoran29


Platform

a5sim (Ascend 950 simulation)

Runtime Variant

tensormap_and_ringbuffer

Description

During drain mode, `core_trackers_[t].core_states_` (a plain `uint64_t`, neither atomic nor lock-protected) is subject to two classes of data race: read-write conflicts and write-write conflicts.

The `drain_ack_mask` barrier only guarantees that a thread has stopped issuing new dispatches; it does not guarantee that the thread has stopped completion polling (`change_core_state`). As a result, a thread that acks but returns early (because `all_acked` has not yet been reached) immediately re-enters the main scheduling loop and can call `check_running_cores_for_completion`, which writes `core_states_`. Concurrently, another thread may already have been elected and entered `drain_worker_dispatch`, reading and writing the same `core_states_`.

Race types

| Race type | Concurrent party A | Concurrent party B |
| --- | --- | --- |
| Read-write conflict | Thread 1 writes `core_states_[1]` (`change_core_state`, L451) | Thread 2 (elected) reads `core_states_[1]` (`get_valid_cluster_offset_states`) |
| Write-write conflict | Thread 1 writes `core_states_[1]` (`change_core_state`, L451) | Thread 2 (elected) writes `core_states_[1]` (`change_core_state`, L631) |

Interleaving that triggers the race

```
Thread 0: ack → ack_mask=0x1 ≠ 0x7 → return to main loop

Thread 1: ack → ack_mask=0x3 ≠ 0x7 → return to main loop
Thread 1: [back in main loop] → change_core_state(bit_pos)  ← writes core_trackers_[1].core_states_
                                                              ← Thread 1's ack bit is still set in mask

Thread 2: ack → ack_mask=0x7 == all_acked → elected → drain_worker_dispatch
Thread 2:     get_valid_cluster_offset_states()              ← reads  core_states_[1]
Thread 2:     change_core_state(...)                         ← writes core_states_[1]
```

Thread 1 and Thread 2 thus access the same `core_states_` concurrently: a data race, and undefined behavior.

Steps to Reproduce

Insert the following before `tracker.change_core_state(bit_pos)` at L451 of `aicpu_executor.cpp`:


```cpp
if ((drain_state_.drain_ack_mask.load(std::memory_order_relaxed)) != 0) { usleep(1000); }
assert((drain_state_.drain_worker_elected.load(std::memory_order_relaxed)) == 0);
```


Then increase the parameters of the `examples/a5/tensormap_and_ringbuffer/spmd_sync_start_stress` example (and its corresponding golden file) and run. The assert will fail with low probability. An assert failure confirms that the race-prone interleaving was reached; actual memory corruption is undefined behavior and may not produce a visible wrong result on every run.

Expected Behavior

Runs normally; the assert does not fail.

Actual Behavior

The assert fails.

Git Commit ID

8d5f25b

CANN Version

No response

Driver Version

No response

Host Platform

Linux (aarch64)

Additional Context

No response

Metadata

Labels

bug (Something isn't working)


Status

In Progress
