Cross-scope boundary tensors still use owner_task_id retention and TensorMap frontier/discovery for correctness.
The feature works on the pre-rebase branch, and the AUTO path still works after rebasing to current main. The failure is specific to the rebased partial_manual path on paged-attention.
Current understanding is that the rebase exposed a scheduler/allocator bug in the deferred manual publish path:
AUTO publishes/discovers tasks incrementally during submit.
partial_manual accumulates unpublished tasks inside the manual scope and batch-publishes them at scope_end().
On rebased main, this bursty publish path can report deadlock even after downstream progress is already visible.
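The difference between the two publish paths can be sketched with a toy model. Everything here (`ToyScheduler`, `submit_auto`, `ToyManualScope`) is a hypothetical illustration and not the runtime's real types. It only shows that the same tasks reach the scheduler either as many single-task publishes or as one burst at `scope_end()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model of the two publish paths; not the runtime's real API.
struct ToyScheduler {
    std::vector<int> published;            // task ids visible to the scheduler, in publish order
    std::vector<std::size_t> burst_sizes;  // number of tasks made visible by each publish call

    void publish(const std::vector<int>& batch) {
        assert(!batch.empty());  // a publish call always carries at least one task
        published.insert(published.end(), batch.begin(), batch.end());
        burst_sizes.push_back(batch.size());
    }
};

// AUTO-like path: every submit is published immediately, one task at a time.
inline void submit_auto(ToyScheduler& sched, int task_id) {
    sched.publish({task_id});
}

// partial_manual-like path: submits are buffered and published once at scope end.
struct ToyManualScope {
    ToyScheduler& sched;
    std::vector<int> unpublished;  // tasks held behind the manual-scope barrier

    explicit ToyManualScope(ToyScheduler& s) : sched(s) {}

    void submit(int task_id) { unpublished.push_back(task_id); }

    void scope_end() {
        if (unpublished.empty()) return;
        sched.publish(unpublished);  // one burst containing every buffered task
        unpublished.clear();
    }
};
```

In this model both paths end with the same `published` sequence. Only the shape of `burst_sizes` differs, which is the property the deadlock detector has to tolerate.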
Rebase debugging found several concrete issues already:
Hidden alloc_tensors() tasks with active_mask == 0 were being published from the manual path.
Some non-profiling ready paths could enqueue work without consistently transitioning PENDING -> READY.
After fixing both of the above locally for diagnosis, the remaining failure is still an allocator/scheduler false-deadlock: a task becomes READY, but the allocator aborts based on a narrow last_alive heuristic before that progress is fully retired.
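The first two issues above can be illustrated with a minimal sketch, assuming a per-task `active_mask` and an atomic state field. `ToyTask`, `is_publishable`, `try_mark_ready`, and `enqueue_if_ready` are hypothetical names chosen for illustration, and the local fixes used for diagnosis may look different.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// State numbering follows the report (0 = PENDING ... 4 = CONSUMED).
enum class TaskState : std::uint8_t { PENDING = 0, READY = 1, RUNNING = 2, COMPLETED = 3, CONSUMED = 4 };

// Hypothetical task record, not the runtime's real layout.
struct ToyTask {
    std::uint32_t active_mask = 1;                     // 0 marks a hidden task
    std::atomic<TaskState> state{TaskState::PENDING};
};

// Issue 1: hidden tasks (active_mask == 0) are filtered out before publish.
inline bool is_publishable(const ToyTask& t) { return t.active_mask != 0; }

// Issue 2: a single-winner PENDING -> READY transition shared by all ready paths.
inline bool try_mark_ready(ToyTask& t) {
    TaskState expected = TaskState::PENDING;
    const bool won = t.state.compare_exchange_strong(
        expected, TaskState::READY,
        std::memory_order_acq_rel, std::memory_order_acquire);
    // A strong CAS fails only when the observed state differs from PENDING.
    assert(won || expected != TaskState::PENDING);
    return won;
}

// Work enters the ready queue only after the state transition succeeded.
inline bool enqueue_if_ready(ToyTask& t, std::vector<ToyTask*>& ready_queue) {
    if (!is_publishable(t)) return false;  // never publish hidden tasks
    if (!try_mark_ready(t)) return false;  // another path already moved this task on
    ready_queue.push_back(&t);
    return true;
}
```

Routing every enqueue through one helper means a task cannot sit in the ready queue while its state still reads `PENDING`, whether or not profiling is enabled.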
This means the problem is not that manual scope is conceptually invalid; it is that the current rebased partial_manual integration path is still buggy.
Visual timeline of where the rebased failure happens:
AUTO on rebased main
--------------------
submit qk0 -> publish -> sf0 ready -> dispatch -> pv0 ready -> dispatch -> up0 ...
progress is discovered and drained incrementally during submit

PARTIAL_MANUAL on rebased main
------------------------------
manual scope open
  submit qk0 sf0 pv0 up0 qk1 sf1 pv1 up1 ...
  (all tasks stay unpublished behind the manual-scope barrier)
        |
        v
scope_end()
  |
  +-> batch publish
  +-> qk0 dispatches
  +-> sf0 completes
  +-> pv0 becomes READY
  +-> allocator still aborts before that READY progress is retired

Observed head snapshot near abort
---------------------------------
task 1 = COMPLETED
task 2 = READY      <--- progress exists here
task 3 = PENDING    <--- still waiting on task 2
task 4 = CONSUMED

Result
------
The runtime treats the burst-published manual path as deadlocked even though
scheduler-visible progress has already advanced past the old head.
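One way to read the snapshot above is to contrast a head-only check with a check that scans the live window for runnable work. The sketch below is an assumption-laden illustration: `head_only_stalled` is a guess at the general shape of a narrow `last_alive` test, not the allocator's actual code, and `window_stalled` is one possible progress-aware alternative rather than a proposed fix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// State names as used in the snapshot above.
enum class TaskState { PENDING = 0, READY = 1, RUNNING = 2, COMPLETED = 3, CONSUMED = 4 };

// Hypothetical narrow heuristic: report a stall when the oldest live slot did not advance.
inline bool head_only_stalled(std::size_t last_alive_before, std::size_t last_alive_now) {
    assert(last_alive_now >= last_alive_before);  // the head index only moves forward
    return last_alive_now == last_alive_before;
}

// Hypothetical progress-aware check: report a stall only when some task is still
// waiting and no task in the window is READY or RUNNING.
inline bool window_stalled(const std::vector<TaskState>& window) {
    bool any_waiting = false;
    for (TaskState s : window) {
        if (s == TaskState::READY || s == TaskState::RUNNING) return false;  // work can still run
        if (s == TaskState::PENDING) any_waiting = true;
    }
    return any_waiting;
}
```

Applied to the observed snapshot (COMPLETED, READY, PENDING, CONSUMED), the head-only check reports a stall whenever the head index is unchanged, while the window check does not, because task 2 is `READY`.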
This report is intentionally focused on the remaining rebased failure, not on the earlier design iterations that have already been corrected. The main question now is how to make the allocator/scheduler treat manual-scope burst publish as first-class progress, so rebased partial_manual behaves as safely as AUTO.
Platform
a2a3sim (Ascend 910B/C simulation)
Runtime Variant
tensormap_and_ringbuffer
Description
Draft PR #482 adds a hybrid manual-scope mode to tensormap_and_ringbuffer:
PTO2_SCOPE() stays in default AUTO mode.
PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables scoped explicit same-scope dependency wiring.
pto2_rt_add_dependency(...).
Cross-scope boundary tensors still use owner_task_id retention and TensorMap frontier/discovery for correctness.
The feature works on the pre-rebase branch, and the AUTO path still works after rebasing to current main. The failure is specific to the rebased partial_manual path on paged-attention.
Current understanding is that the rebase exposed a scheduler/allocator bug in the deferred manual publish path:
AUTO publishes/discovers tasks incrementally during submit.
partial_manual accumulates unpublished tasks inside the manual scope and batch-publishes them at scope_end().
On rebased main, this bursty publish path can report deadlock even after downstream progress is already visible.
Rebase debugging found several concrete issues already:
Hidden alloc_tensors() tasks with active_mask == 0 were being published from the manual path.
Some non-profiling ready paths could enqueue work without consistently transitioning PENDING -> READY.
After fixing both of the above locally for diagnosis, the remaining failure is still an allocator/scheduler false-deadlock: a task becomes READY, but the allocator aborts based on a narrow last_alive heuristic before that progress is fully retired.
This means the problem is not that manual scope is conceptually invalid; it is that the current rebased partial_manual integration path is still buggy.
Related: #409
Steps to Reproduce
Expected Behavior
The rebased partial_manual paged-attention example should complete successfully, like the rebased AUTO runtime path.
More specifically:
scope_end() publish should not deadlock or mis-detect lack of progress.
Actual Behavior
The rebased partial_manual run deadlocks/fails in the runtime while AUTO remains healthy.
Observed pattern:
scope_end()
READY
Representative trace excerpts:
Deadlock snapshot at abort:
State meanings:
0 = PENDING
1 = READY
2 = RUNNING
3 = COMPLETED
4 = CONSUMED
The critical detail is that task=2 is already READY with fanin=2/2, so useful progress exists, but the allocator still concludes deadlock.
Git Commit ID
316bfb1
CANN Version
N/A
Driver Version
N/A
Host Platform
Linux (aarch64)
Additional Context
Draft implementation reference: #482