[Bug] Rebased partial-manual scope deadlocks on paged_attention while AUTO mode remains healthy

### Platform

a2a3sim (Ascend 910B/C simulation)

### Runtime Variant

tensormap_and_ringbuffer

### Description

Draft PR [#482](https://github.com/hw-native-sys/simpler/pull/482) adds a hybrid manual-scope mode to `tensormap_and_ringbuffer`:

- `PTO2_SCOPE()` stays in default `AUTO` mode.
- `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` enables explicit dependency wiring between tasks in the same scope.
- Same-manual-scope producer/consumer edges are expressed with `pto2_rt_add_dependency(...)`.
- Manual-local tensors skip TensorMap replay/discovery.
- Cross-scope boundary tensors still use `owner_task_id` retention and TensorMap frontier/discovery for correctness.

The feature works on the pre-rebase branch, and the `AUTO` path still works after rebasing to current `main`. The failure is specific to the rebased `partial_manual` path on paged-attention.

Current understanding is that the rebase exposed a scheduler/allocator bug in the deferred manual publish path:

- `AUTO` publishes/discovers tasks incrementally during submit.
- `partial_manual` accumulates unpublished tasks inside the manual scope and batch-publishes them at `scope_end()`.
- On rebased `main`, this bursty publish path can report deadlock even after downstream progress is already visible.
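
To make the difference between the two publish styles concrete, here is a minimal toy model. It is not the runtime's code: the function name `peak_in_flight`, the single-chain dependency shape, and the "retire one task per step" rule are all assumptions made for illustration. It only shows why a deferred batch publish puts more unretired tasks in flight at once than incremental publish does.

```python
def peak_in_flight(num_tasks, deferred):
    """Return the peak count of submitted-but-unretired tasks in a toy model.

    Assumptions (hypothetical, not taken from the runtime):
    - tasks form a simple chain, so any published task can be retired in order
    - the scheduler retires at most one visible task per submit step
    """
    in_flight = 0   # submitted but not yet retired
    published = 0   # tasks visible to the scheduler
    retired = 0
    peak = 0

    for _ in range(num_tasks):
        in_flight += 1                # submit one task
        if not deferred:
            published += 1            # AUTO-style: publish during submit
        peak = max(peak, in_flight)
        if retired < published:       # scheduler drains one visible task
            retired += 1
            in_flight -= 1

    if deferred:
        published = num_tasks         # scope_end()-style batch publish

    while retired < published:        # drain whatever is left
        retired += 1
        in_flight -= 1

    return peak


if __name__ == "__main__":
    print("incremental peak:", peak_in_flight(8, deferred=False))
    print("deferred peak:   ", peak_in_flight(8, deferred=True))
```

In this model the incremental path never holds more than one unretired task, while the deferred path holds all of them until the batch publish. Any progress heuristic tuned for the first pattern may need revisiting for the second.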

Debugging after the rebase has already found three concrete issues:

1. Hidden `alloc_tensors()` tasks with `active_mask == 0` were being published from the manual path.
2. Some non-profiling ready paths could enqueue work without consistently transitioning `PENDING -> READY`.
3. After fixing both of the above locally for diagnosis, the remaining failure is still an allocator/scheduler false-deadlock: a task becomes `READY`, but the allocator aborts based on a narrow `last_alive` heuristic before that progress is fully retired.
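
The first two issues can be expressed as simple invariants. The sketch below is a hypothetical Python model, not the local fix itself: `Task`, `publishable`, and `enqueue_ready` are illustrative names, and only the state codes and the `active_mask == 0` condition come from this report.

```python
from dataclasses import dataclass
from enum import IntEnum


class TaskState(IntEnum):
    # State codes as listed in this report.
    PENDING = 0
    READY = 1
    RUNNING = 2
    COMPLETED = 3
    CONSUMED = 4


@dataclass
class Task:
    task_id: int
    active_mask: int
    state: TaskState = TaskState.PENDING


def publishable(tasks):
    """Issue 1: skip hidden tasks whose active_mask is 0 when batch-publishing."""
    return [t for t in tasks if t.active_mask != 0]


def enqueue_ready(task, ready_queue):
    """Issue 2: always transition PENDING -> READY before enqueueing."""
    if task.state != TaskState.PENDING:
        raise ValueError(f"task {task.task_id} is {task.state.name}, not PENDING")
    task.state = TaskState.READY
    ready_queue.append(task.task_id)


if __name__ == "__main__":
    tasks = [Task(0, active_mask=0), Task(1, active_mask=2), Task(2, active_mask=1)]
    queue = []
    for t in publishable(tasks):
        enqueue_ready(t, queue)
    print(queue)
```

Routing every ready path through one helper like `enqueue_ready` is one way to keep profiling and non-profiling builds from diverging on the state transition.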

This means the problem is not that manual scope is conceptually invalid; it is that the current rebased `partial_manual` integration path is still buggy.

Related: #409

### Steps to Reproduce

```bash
git checkout 316bfb1c97c4141167e55481cb915d3a26c3c71e
python examples/scripts/run_example.py --build --silent \
  -k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels \
  -g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py \
  -p a2a3sim --clone-protocol https -c 8830244b
```

### Expected Behavior

The rebased `partial_manual` paged-attention example should complete successfully, like the rebased `AUTO` runtime path.

More specifically:

- Manual scope should preserve the intended hybrid semantics from PR #482.
- Same-scope explicit dependencies should execute without TensorMap replay overhead.
- Cross-scope boundary tensors should remain correct through owner/TensorMap handling.
- The deferred `scope_end()` publish should not deadlock or mis-detect lack of progress.

### Actual Behavior

The rebased `partial_manual` run deadlocks/fails in the runtime while `AUTO` remains healthy.

Observed pattern:

- tasks inside the manual scope are deferred until `scope_end()`
- batch publish starts and early tasks do make progress
- a downstream task becomes `READY`
- allocator still aborts on deadlock / task-ring-full before that ready progress is retired cleanly
- orchestration later fails because expected outputs are not produced

Representative trace excerpts:

```text
dispatch: thread=0 shape=0 task=0 block=0/1
dispatch: thread=2 shape=0 task=4 block=0/1
ready(local): task=5 shape=1 fanin=2/2
dispatch: thread=2 shape=1 task=5 block=0/1
ready(local): task=6 shape=0 fanin=2/2
ready(local): task=1 shape=1 fanin=2/2
dispatch: thread=1 shape=1 task=1 block=0/1
ready(local): task=2 shape=0 fanin=2/2
```

Deadlock snapshot at abort:

```text
task=1 state=3 fanin=2/2 fanout=1/2 active_mask=2 done=1/1 block_num=1 next_block=1
task=2 state=1 fanin=2/2 fanout=1/2 active_mask=1 done=0/1 block_num=1 next_block=0
task=3 state=0 fanin=2/3 fanout=1/2 active_mask=2 done=0/1 block_num=1 next_block=0
task=4 state=4 fanin=1/1 fanout=2/2 active_mask=1 done=1/1 block_num=1 next_block=1
```

State meanings:

- `0 = PENDING`
- `1 = READY`
- `2 = RUNNING`
- `3 = COMPLETED`
- `4 = CONSUMED`

The critical detail is that `task=2` is already `READY` with `fanin=2/2`, so useful progress exists, but the allocator still concludes deadlock.
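
The snapshot can be checked mechanically. The following is a small sketch that parses the four lines above and asks whether any task is still runnable. The helper names and the "any `READY` or `RUNNING` task counts as progress" rule are assumptions for illustration; the allocator's actual `last_alive` predicate is not shown in this report.

```python
STATE_NAMES = {0: "PENDING", 1: "READY", 2: "RUNNING", 3: "COMPLETED", 4: "CONSUMED"}

# Snapshot lines copied from the deadlock dump in this report.
SNAPSHOT = """\
task=1 state=3 fanin=2/2 fanout=1/2 active_mask=2 done=1/1 block_num=1 next_block=1
task=2 state=1 fanin=2/2 fanout=1/2 active_mask=1 done=0/1 block_num=1 next_block=0
task=3 state=0 fanin=2/3 fanout=1/2 active_mask=2 done=0/1 block_num=1 next_block=0
task=4 state=4 fanin=1/1 fanout=2/2 active_mask=1 done=1/1 block_num=1 next_block=1
"""


def parse_snapshot(text):
    """Turn 'key=value' snapshot lines into a list of small dicts."""
    rows = []
    for line in text.splitlines():
        fields = dict(token.split("=", 1) for token in line.split())
        if "task" not in fields:
            continue  # skip blank or unrelated lines
        satisfied, total = (int(x) for x in fields["fanin"].split("/"))
        rows.append({
            "task": int(fields["task"]),
            "state": STATE_NAMES[int(fields["state"])],
            "fanin_satisfied": satisfied,
            "fanin_total": total,
        })
    return rows


def has_runnable_progress(rows):
    """Hypothetical check: a READY or RUNNING task means progress is still possible."""
    return any(r["state"] in ("READY", "RUNNING") for r in rows)


if __name__ == "__main__":
    rows = parse_snapshot(SNAPSHOT)
    for r in rows:
        print(r["task"], r["state"], f'{r["fanin_satisfied"]}/{r["fanin_total"]}')
    print("runnable progress:", has_runnable_progress(rows))
```

Under this rule the snapshot is not a deadlock, because task 2 is `READY` with all fan-in satisfied and task 3 is only waiting on its remaining input.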

### Git Commit ID

316bfb1c97c4141167e55481cb915d3a26c3c71e

### CANN Version

N/A

### Driver Version

N/A

### Host Platform

Linux (aarch64)

### Additional Context

Draft implementation reference: [#482](https://github.com/hw-native-sys/simpler/pull/482)

Visual timeline of where the rebased failure happens:

```text
AUTO on rebased main
--------------------
submit qk0 -> publish -> sf0 ready -> dispatch -> pv0 ready -> dispatch -> up0 ...
progress is discovered and drained incrementally during submit

PARTIAL_MANUAL on rebased main
------------------------------
manual scope open
submit qk0 sf0 pv0 up0 qk1 sf1 pv1 up1 ...
(all tasks stay unpublished behind the manual-scope barrier)
                    |
                    v
                scope_end()
                    |
                    +-> batch publish
                    +-> qk0 dispatches
                    +-> sf0 completes
                    +-> pv0 becomes READY
                    +-> allocator still aborts before that READY progress is retired

Observed head snapshot near abort
---------------------------------
task 1 = COMPLETED
task 2 = READY      <--- progress exists here
task 3 = PENDING    <--- still waiting on task 2
task 4 = CONSUMED

Result
------
The runtime treats the burst-published manual path as deadlocked even though
scheduler-visible progress has already advanced past the old head.
```

This report is intentionally focused on the remaining rebased failure, not on the earlier design iterations that have already been corrected. The main question now is how to make the allocator/scheduler treat manual-scope burst publish as first-class progress, so rebased `partial_manual` behaves as safely as `AUTO`.
