
[Bug] Data race on core_states_ in drain protocol — ack barrier has insufficient semantics #509

@yanghaoran29


Platform

a5sim (Ascend 950 simulation)

Runtime Variant

tensormap_and_ringbuffer

Description

During drain mode, `core_trackers_[t].core_states_` (a plain `uint64_t`, neither atomic nor lock-protected) is subject to two classes of data race: read-write conflicts and write-write conflicts.

The `drain_ack_mask` barrier only guarantees that a thread has stopped issuing new dispatches; it does not guarantee that the thread has stopped completion polling (`change_core_state`). As a result, a thread that acks but returns early (because `all_acked` has not yet been reached) immediately re-enters the main scheduling loop and can call `check_running_cores_for_completion`, which writes `core_states_`. Concurrently, another thread may already have been elected and entered `drain_worker_dispatch`, reading and writing the same `core_states_`.

Race types

| Race type | Concurrent party A | Concurrent party B |
| --- | --- | --- |
| Read-write conflict | Thread 1 writes `core_states_[1]` (`change_core_state`, L451) | Thread 2 (elected) reads `core_states_[1]` (`get_valid_cluster_offset_states`) |
| Write-write conflict | Thread 1 writes `core_states_[1]` (`change_core_state`, L451) | Thread 2 (elected) writes `core_states_[1]` (`change_core_state`, L631) |

Interleaving that triggers the race

```
Thread 0: ack → ack_mask=0x1 ≠ 0x7 → return to main loop

Thread 1: ack → ack_mask=0x3 ≠ 0x7 → return to main loop
Thread 1: [back in main loop] → change_core_state(bit_pos)  ← writes core_trackers_[1].core_states_
                                                              ← Thread 1's ack bit is still set in mask

Thread 2: ack → ack_mask=0x7 == all_acked → elected → drain_worker_dispatch
Thread 2:     get_valid_cluster_offset_states()              ← reads  core_states_[1]
Thread 2:     change_core_state(...)                         ← writes core_states_[1]
```

Thread 1 and Thread 2 thus access the same `core_states_` concurrently: a data race, and undefined behavior.

Steps to Reproduce

Insert the following before `tracker.change_core_state(bit_pos)` at L451 of `aicpu_executor.cpp`:


```cpp
if ((drain_state_.drain_ack_mask.load(std::memory_order_relaxed)) != 0) { usleep(1000); }
assert((drain_state_.drain_worker_elected.load(std::memory_order_relaxed)) == 0);
```


Then increase the parameters of the `examples/a5/tensormap_and_ringbuffer/spmd_sync_start_stress` example (and its corresponding golden file) and run. The assert will fail with low probability. An assert failure confirms that the race-prone interleaving was reached; actual memory corruption is undefined behavior and may not produce a visible wrong result on every run.

Expected Behavior

Runs normally; the assert does not fail.

Actual Behavior

The assert fails.

Git Commit ID

8d5f25b

CANN Version

No response

Driver Version

No response

Host Platform

Linux (aarch64)

Additional Context

No response

Metadata

Labels

bug (Something isn't working)


Status

In Progress
