[Bug] Rebased partial-manual manual scope deadlocks on paged_attention while AUTO mode remains healthy #495

@uv-xiao

Platform

a2a3sim (Ascend 910B/C simulation)

Runtime Variant

tensormap_and_ringbuffer

Description

Draft PR #482 adds a hybrid manual-scope mode to tensormap_and_ringbuffer:

  • PTO2_SCOPE() stays in default AUTO mode.
  • PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables scoped explicit same-scope dependency wiring.
  • Same-manual-scope producer/consumer edges are expressed with pto2_rt_add_dependency(...).
  • Manual-local tensors skip TensorMap replay/discovery.
  • Cross-scope boundary tensors still use owner_task_id retention and TensorMap frontier/discovery for correctness.

The feature works on the pre-rebase branch, and the AUTO path still works after rebasing to current main. The failure is specific to the rebased partial_manual path on paged-attention.

The current understanding is that the rebase exposed a scheduler/allocator bug in the deferred manual publish path:

  • AUTO publishes/discovers tasks incrementally during submit.
  • partial_manual accumulates unpublished tasks inside the manual scope and batch-publishes them at scope_end().
  • On rebased main, this bursty publish path can report deadlock even after downstream progress is already visible.
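To make the difference in publish timing concrete, here is a minimal toy model. It is not the runtime code: `ToyScheduler`, `submit_auto`, and `submit_partial_manual` are hypothetical names, and the model only records how many tasks become scheduler-visible per publish call.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy stand-in for the scheduler; only tracks publish calls."""
    published: list = field(default_factory=list)
    publish_bursts: list = field(default_factory=list)

    def publish(self, tasks):
        # Record the tasks and the size of this publish call.
        self.published.extend(tasks)
        self.publish_bursts.append(len(tasks))


def submit_auto(sched, tasks):
    # AUTO: every task is published as soon as it is submitted.
    for task in tasks:
        sched.publish([task])


def submit_partial_manual(sched, tasks):
    # partial_manual: tasks accumulate unpublished inside the manual scope ...
    deferred = []
    for task in tasks:
        deferred.append(task)
    # ... and are published in one batch when the scope ends.
    sched.publish(deferred)
```

With the same four tasks (`qk0`, `sf0`, `pv0`, `up0`), the AUTO path produces four publish calls of size 1, while the partial_manual path produces a single call of size 4. That single burst is the pattern the allocator has to tolerate.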

Rebase debugging found several concrete issues already:

  1. Hidden alloc_tensors() tasks with active_mask == 0 were being published from the manual path.
  2. Some non-profiling ready paths could enqueue work without consistently transitioning PENDING -> READY.
  3. With both of the above fixed locally for diagnosis, the remaining failure is an allocator/scheduler false deadlock: a task becomes READY, but the allocator aborts on a narrow last_alive heuristic before that progress is retired.
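The first two items can be sketched as follows. This is an illustration of the intended invariants, not the local diagnostic patch: `Task`, `publishable`, and `try_make_ready` are hypothetical names, and treating `active_mask == 0` as "nothing to dispatch" is an assumption.

```python
from collections import deque
from enum import IntEnum


class State(IntEnum):
    PENDING = 0
    READY = 1
    RUNNING = 2
    COMPLETED = 3
    CONSUMED = 4


class Task:
    def __init__(self, task_id, active_mask, fanin_total):
        self.task_id = task_id
        self.active_mask = active_mask
        self.fanin_total = fanin_total
        self.fanin_done = 0
        self.state = State.PENDING


def publishable(tasks):
    # Item 1: hidden tasks with active_mask == 0 are assumed to have
    # nothing to dispatch, so this sketch keeps them out of the batch publish.
    return [t for t in tasks if t.active_mask != 0]


def try_make_ready(task, ready_queue):
    # Item 2: one helper performs the PENDING -> READY transition and the
    # enqueue together, so no path can enqueue a task still marked PENDING.
    if task.state is not State.PENDING or task.fanin_done < task.fanin_total:
        return False
    task.state = State.READY
    ready_queue.append(task)
    return True
```

Routing every ready path (profiling or not) through a single helper like `try_make_ready` is one way to keep the state transition and the enqueue from drifting apart; `ready_queue` here is just a `collections.deque`.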

In other words, manual scope is not conceptually invalid; the bug is in the current rebased partial_manual integration path.

Related: #409

Steps to Reproduce

git checkout 316bfb1c97c4141167e55481cb915d3a26c3c71e
python examples/scripts/run_example.py --build --silent \
  -k tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels \
  -g tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py \
  -p a2a3sim --clone-protocol https -c 8830244b

Expected Behavior

The rebased partial_manual paged-attention example should complete successfully, like the rebased AUTO runtime path.

More specifically:

  • Manual scope should preserve the intended hybrid semantics from PR #482 (Add manual-scope dependency mode to tensormap runtime).
  • Same-scope explicit dependencies should execute without TensorMap replay overhead.
  • Cross-scope boundary tensors should still remain correct through owner/TensorMap handling.
  • The deferred scope_end() publish should not deadlock or mis-detect lack of progress.

Actual Behavior

The rebased partial_manual run deadlocks/fails in the runtime while AUTO remains healthy.

Observed pattern:

  • tasks inside the manual scope are deferred until scope_end()
  • batch publish starts and early tasks do make progress
  • a downstream task becomes READY
  • allocator still aborts on deadlock / task-ring-full before that ready progress is retired cleanly
  • orchestration later fails because expected outputs are not produced

Representative trace excerpts:

dispatch: thread=0 shape=0 task=0 block=0/1
dispatch: thread=2 shape=0 task=4 block=0/1
ready(local): task=5 shape=1 fanin=2/2
dispatch: thread=2 shape=1 task=5 block=0/1
ready(local): task=6 shape=0 fanin=2/2
ready(local): task=1 shape=1 fanin=2/2
dispatch: thread=1 shape=1 task=1 block=0/1
ready(local): task=2 shape=0 fanin=2/2
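A small helper for reading excerpts like this one: it lists tasks that were reported ready but have no matching dispatch line. This is only a log-reading sketch, and `ready_but_not_dispatched` is a hypothetical name. Because the trace above is an excerpt, a missing dispatch line does not prove the task was never dispatched.

```python
import re

TRACE = """\
dispatch: thread=0 shape=0 task=0 block=0/1
dispatch: thread=2 shape=0 task=4 block=0/1
ready(local): task=5 shape=1 fanin=2/2
dispatch: thread=2 shape=1 task=5 block=0/1
ready(local): task=6 shape=0 fanin=2/2
ready(local): task=1 shape=1 fanin=2/2
dispatch: thread=1 shape=1 task=1 block=0/1
ready(local): task=2 shape=0 fanin=2/2
"""


def ready_but_not_dispatched(trace):
    """Return task ids with a ready line but no dispatch line, in ready order."""
    ready, dispatched = [], set()
    for line in trace.splitlines():
        match = re.search(r"task=(\d+)", line)
        if match is None:
            continue
        task = int(match.group(1))
        if line.startswith("ready"):
            ready.append(task)
        elif line.startswith("dispatch"):
            dispatched.add(task)
    return [t for t in ready if t not in dispatched]
```

On the excerpt above, `ready_but_not_dispatched(TRACE)` returns `[6, 2]`: tasks 6 and 2 are reported ready with no dispatch line in the excerpt, which matches task 2 sitting in READY in the deadlock snapshot.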

Deadlock snapshot at abort:

task=1 state=3 fanin=2/2 fanout=1/2 active_mask=2 done=1/1 block_num=1 next_block=1
task=2 state=1 fanin=2/2 fanout=1/2 active_mask=1 done=0/1 block_num=1 next_block=0
task=3 state=0 fanin=2/3 fanout=1/2 active_mask=2 done=0/1 block_num=1 next_block=0
task=4 state=4 fanin=1/1 fanout=2/2 active_mask=1 done=1/1 block_num=1 next_block=1

State meanings:

  • 0 = PENDING
  • 1 = READY
  • 2 = RUNNING
  • 3 = COMPLETED
  • 4 = CONSUMED

The critical detail is that task=2 is already READY with fanin=2/2, so useful progress exists, but the allocator still concludes deadlock.
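The snapshot can be decoded mechanically. The sketch below parses the four snapshot lines using the state table above and applies a simple "is there runnable work?" predicate. `has_runnable_work` is a hypothetical check for illustration, not the allocator's actual deadlock test, which per this report relies on a last_alive heuristic.

```python
STATE_NAMES = {0: "PENDING", 1: "READY", 2: "RUNNING", 3: "COMPLETED", 4: "CONSUMED"}

SNAPSHOT = """\
task=1 state=3 fanin=2/2 fanout=1/2 active_mask=2 done=1/1 block_num=1 next_block=1
task=2 state=1 fanin=2/2 fanout=1/2 active_mask=1 done=0/1 block_num=1 next_block=0
task=3 state=0 fanin=2/3 fanout=1/2 active_mask=2 done=0/1 block_num=1 next_block=0
task=4 state=4 fanin=1/1 fanout=2/2 active_mask=1 done=1/1 block_num=1 next_block=1
"""


def parse_snapshot(text):
    """Parse 'key=value' snapshot lines into dicts with decoded state names."""
    tasks = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = dict(kv.split("=", 1) for kv in line.split())
        fanin_done, fanin_total = map(int, fields["fanin"].split("/"))
        tasks.append({
            "task": int(fields["task"]),
            "state": STATE_NAMES[int(fields["state"])],
            "fanin_done": fanin_done,
            "fanin_total": fanin_total,
        })
    return tasks


def has_runnable_work(tasks):
    # Hypothetical progress predicate: any READY or RUNNING task means the
    # scheduler can still make progress, so declaring deadlock is premature.
    return any(t["state"] in ("READY", "RUNNING") for t in tasks)
```

Applied to the snapshot, the decoded states are COMPLETED, READY, PENDING, CONSUMED for tasks 1 to 4, and `has_runnable_work` returns `True` because task 2 is READY. A deadlock check that consulted a predicate of this kind before aborting would not fire on this snapshot.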

Git Commit ID

316bfb1

CANN Version

N/A

Driver Version

N/A

Host Platform

Linux (aarch64)

Additional Context

Draft implementation reference: #482

Visual timeline of where the rebased failure happens:

AUTO on rebased main
--------------------
submit qk0 -> publish -> sf0 ready -> dispatch -> pv0 ready -> dispatch -> up0 ...
progress is discovered and drained incrementally during submit

PARTIAL_MANUAL on rebased main
------------------------------
manual scope open
submit qk0 sf0 pv0 up0 qk1 sf1 pv1 up1 ...
(all tasks stay unpublished behind the manual-scope barrier)
                    |
                    v
                scope_end()
                    |
                    +-> batch publish
                    +-> qk0 dispatches
                    +-> sf0 completes
                    +-> pv0 becomes READY
                    +-> allocator still aborts before that READY progress is retired

Observed head snapshot near abort
---------------------------------
task 1 = COMPLETED
task 2 = READY      <--- progress exists here
task 3 = PENDING    <--- still waiting on task 2
task 4 = CONSUMED

Result
------
The runtime treats the burst-published manual path as deadlocked even though
scheduler-visible progress has already advanced past the old head.

This report is intentionally focused on the remaining rebased failure, not on the earlier design iterations that have already been corrected. The main question now is how to make the allocator/scheduler treat manual-scope burst publish as first-class progress, so rebased partial_manual behaves as safely as AUTO.

Labels: bug

Status: In Progress
