Commit 2a53df4
Refactor: unify TaskArgs, bind DistOrchestrator, add orch.alloc (#543)
Unify TaskArgs:
- Rename the storage template `TaskArgs<...>` -> `TaskArgsTpl<...>` so the unqualified name `TaskArgs` is free for the unified user-facing builder.
- Add `using TaskArgs = TaskArgsTpl<ContinuousTensor, uint64_t, 0, 0, TensorArgType>` — vector-backed + per-tensor TensorArgType tags.
- Add TaskArgsView, make_view(), task_args_blob_size, write_blob, read_blob, view_to_chip_storage for the dispatch / wire / L2 ABI edge.
- nanobind: drop the DynamicTaskArgs / TaggedTaskArgs bindings, expose the unified TaskArgs; extend TensorArgType with OUTPUT_EXISTING / NO_DEP.

Tag-driven submit:
- DistOrchestrator exposes submit_next_level / submit_next_level_group / submit_sub / submit_sub_group, each taking a TaskArgs (with tags).
- Tags drive dependency inference: INPUT/INOUT -> tensormap.lookup producer; OUTPUT/INOUT/OUTPUT_EXISTING -> tensormap.insert; NO_DEP -> skip.
- Drop `inputs=` / `outputs=` from the submit API; downstream consumers reference output tensors by their own data pointers.
- Shrink DistSubmitResult to {slot_id} only. Delete DistInputSpec / DistOutputSpec / DistSubmitOutput from both the C++ and Python surfaces.

Slot storage and dispatch:
- DistTaskSlotState drops `payload` / `args_list<const void*>`; gains worker_type / callable_ptr / callable_id / config (ChipCallConfig) / chip_storage_list<ChipStorageTaskArgs>, built by the Orchestrator at submit.
- DistScheduler::dispatch_ready assembles a per-worker WorkerPayload from the slot fields + chip_storage_list[i] and hands it to IWorker::run.
- WorkerPayload is kept as an internal dispatch carrier (mailbox layout unchanged); it is not exposed to Python.

Worker / Orchestrator separation:
- Delete DistWorker::submit / submit_group / scope_begin / scope_end entirely — those concepts belong on the Orchestrator.
- Add a DistWorker::get_orchestrator() accessor; nanobind exposes the C++ DistOrchestrator directly with submit_* (public) and _scope_begin / _scope_end (invoked only by the Python facade).
- The Python Orchestrator becomes a thin wrapper over the bound C++ DistOrchestrator (no more WorkerPayload construction, no inputs/outputs kwargs).
- Python Worker.run() fetches the orchestrator handle once at init and runs scope_begin -> orch_fn -> scope_end -> drain inside one DAG.

Orchestrator.alloc for runtime-managed intermediates:
- DistOrchestrator::alloc(shape, dtype) -> ContinuousTensor. Mirrors L2's "task slot owns its output buffer" model: alloc creates a synthetic task slot in COMPLETED state that owns an mmap'd buffer; the buffer is munmap'd when the slot reaches CONSUMED (all downstream consumers done + scope ref released). Users tag the returned tensor as OUTPUT / INPUT in TaskArgs to wire deps naturally via the TensorMap — no separate alloc-lifecycle API needed.
- mmap(MAP_SHARED|MAP_ANONYMOUS) so forked child workers see the same virtual address.
- DistTaskSlotState gains alloc_bufs / alloc_sizes (empty for non-alloc slots); on_consumed munmap's them.

Orchestrator consume-lifecycle fixes (required by alloc):
- infer_deps now wires fanout on COMPLETED producers (previously skipped): the consumer doesn't wait on the producer (live_fanins is not bumped) but is added to fanin_producers, so its deferred try_consume keeps the producer alive until the consumer finishes. CONSUMED producers are still skipped (resources already freed).
- release_ref threshold changed from `>= total` to `>= total + 1` to match try_consume — prevents scope_end from prematurely consuming slots whose downstream consumers haven't finished. Total contributors = 1 (self try_consume, or alloc's simulated self-release) + N (consumer deferreds) + 1 (scope_end) = total + 1.
- on_consumed is idempotent (CAS on state); both release paths can now hit the threshold concurrently without double-freeing alloc buffers. Returns bool (true iff this call performed the transition).
- The active_tasks_ fetch_sub lives inside orchestrator.on_consumed (gated on the CAS win).
- A notify_consumed callback wired from DistWorker at init signals drain from both the scheduler-driven and scope_end-driven paths.

Plumbing:
- Extract ChipCallConfig to its own header (chip_call_config.h) to break the circular include between dist_types.h and chip_worker.h.
- Rename the runtime `Arg` base from TaskArgs<...> to TaskArgsTpl<...> in a2a3/aicpu_build_graph, a2a3/tensormap_and_ringbuffer, and a5/tensormap_and_ringbuffer.

Tests:
- C++ tests/ut/cpp/test_dist_orchestrator.cpp + test_dist_scheduler.cpp rewritten against the new TaskArgs-tag API.
- Python ut tests migrated: test_host_worker (TestSubmitResult replaces TestOutputAllocation; new TestOrchAlloc class), test_group_task (synthetic-tensor dep wiring).
- L3 ST tests (test_l3_dependency, test_l3_group) build TaskArgs with tags directly; scene_test._build_chip_task_args returns TaskArgs.
- test_task_interface.py — TestTaskArgs covers the merged surface.
- 105/105 Python ut pass on macOS.
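The `>= total + 1` release arithmetic above can be modeled with a toy counter. Class and method names here (`SlotSim`, `release`) are illustrative only, not the real C++ API; the real implementation uses atomic counters and a CAS on the slot state, which the `state` guard below stands in for:

```python
class SlotSim:
    """Toy model of the fanout-release threshold for one task slot."""

    def __init__(self, n_consumers):
        # fanout_total = N downstream consumers + 1 scope ref.
        self.fanout_total = n_consumers + 1
        self.fanout_released = 0
        self.state = "COMPLETED"
        self.freed = 0  # counts buffer frees; must end at exactly 1

    def release(self):
        """One contribution: self try_consume, a consumer deferred, or scope_end."""
        self.fanout_released += 1
        # Threshold is total + 1: self + N consumers + scope_end.
        if self.fanout_released >= self.fanout_total + 1 and self.state == "COMPLETED":
            self.state = "CONSUMED"  # the CAS makes this transition fire once
            self.freed += 1
            return True
        return False

slot = SlotSim(n_consumers=2)
assert not slot.release()  # self try_consume at completion
assert not slot.release()  # consumer 1 deferred try_consume
assert not slot.release()  # scope_end -- old `>= total` bug would free HERE
assert slot.release()      # consumer 2 deferred -- last contributor frees the slot
assert not slot.release()  # late duplicate is a no-op (idempotent transition)
assert slot.freed == 1
```

With the old `>= total` threshold, the third release (scope_end) would have consumed the slot while consumer 2 was still running, which is exactly the premature-free the fix prevents.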
1 parent 554bf89 commit 2a53df4

34 files changed

Lines changed: 1424 additions & 912 deletions

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -74,6 +74,7 @@ export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
 | [Orchestrator](docs/orchestrator.md) | DAG submission internals: submit flow, TensorMap, Scope, Ring, task state machine |
 | [Scheduler](docs/scheduler.md) | DAG dispatch internals: wiring/ready/completion queues, dispatch loop |
 | [Worker Manager](docs/worker-manager.md) | Worker pool, WorkerThread, THREAD/PROCESS modes, fork + mailbox mechanics |
+| [Roadmap](docs/roadmap.md) | Hierarchical-runtime refactor — what has landed and what is still in flight |
 | [Getting Started](docs/getting-started.md) | Setup, prerequisites, build process, configuration |
 | [Developer Guide](docs/developer-guide.md) | Directory structure, role ownership, conventions |
 | [Testing Guide](docs/testing.md) | CI pipeline, test types, writing new tests |
```

docs/distributed_level_runtime.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -1,5 +1,11 @@
 # Distributed Level Runtime — Level Model and Component Composition
 
+> **Status**: level model + high-level component split are accurate for
+> current code. Low-level details (e.g. `IWorker::run` signature,
+> `WorkerThread` unified mode) describe the target; see the
+> per-component docs for current vs target, or
+> [roadmap.md](roadmap.md) for the full landed-vs-planned breakdown.
+
 This document covers:
 
 - The **L0–L6 level model** (what each level represents)
```

docs/orchestrator.md

Lines changed: 118 additions & 13 deletions
````diff
@@ -1,5 +1,10 @@
 # Orchestrator — DAG Submission Internals
 
+> **Status**: describes the **target** design. Current code matches the
+> user-facing submit API and `alloc` surface; inline "Status:" notes flag
+> the few remaining divergences. See [roadmap.md](roadmap.md) for the
+> full landed-vs-planned breakdown.
+
 The Orchestrator is the **DAG builder**. It runs single-threaded on the user's
 thread (inside `Worker::run` between `scope_begin` and `drain`) and owns the
 three data structures that turn a sequence of `submit_*` calls into a scheduled
@@ -18,26 +23,40 @@ The user's orch fn receives an `Orchestrator*` as its first argument:
 ```cpp
 class Orchestrator {
  public:
-  SubmitResult submit_next_level(Callable cb, TaskArgs args, const CallConfig &config);
-  SubmitResult submit_next_level_group(Callable cb,
-                                       std::vector<TaskArgs> args_list,
-                                       const CallConfig &config);
-  SubmitResult submit_sub(Callable cb, TaskArgs args, const CallConfig &config);
-
- private:
-  friend class Worker;
+  // --- User-facing submit API (tags inside TaskArgs drive deps) ---
+  SubmitResult submit_next_level(uint64_t callable,
+                                 const TaskArgs &args,
+                                 const ChipCallConfig &config);
+  SubmitResult submit_next_level_group(uint64_t callable,
+                                       const std::vector<TaskArgs> &args_list,
+                                       const ChipCallConfig &config);
+  SubmitResult submit_sub(int32_t callable_id, const TaskArgs &args);
+  SubmitResult submit_sub_group(int32_t callable_id,
+                                const std::vector<TaskArgs> &args_list);
+
+  // --- Intermediate-buffer allocation (runtime-owned lifetime) ---
+  ContinuousTensor alloc(const std::vector<uint32_t> &shape, DataType dtype);
+
+  // --- Internal lifecycle (invoked by Worker::run only, bound as _scope_begin
+  // / _scope_end / _drain in the Python facade) ---
   void scope_begin();
   void scope_end();
   void drain();
-  // ... components: Ring, TensorMap, Scope, slot pool
+
+ private:
+  // ... components: Ring, TensorMap, Scope, slot pool, active_tasks_ counter
 };
 
-struct SubmitResult { TaskSlot slot_id; };
+struct SubmitResult { TaskSlot task_slot; };  // field is `task_slot` in current code
 ```
 
-`scope_begin` / `scope_end` / `drain` are not user-visible — they are invoked
-by `Worker::run` around the orch fn. See
-[task-flow.md](task-flow.md) §5 for the Worker::run wrapper.
+**Status**: `submit_sub` takes only `(callable_id, args)` — no `config`; SUB
+has no per-call config. The target design (plan §"Why L2 has no submit") allows
+callable IDs that may later unify with ChipCallable pointers; see PR-E.
+
+`scope_begin` / `scope_end` / `drain` are invoked from Python `Worker.run` via
+the `_scope_begin` / `_scope_end` / `_drain` bindings. They are not part of the
+user-facing orch-fn API.
 
 ---
 
@@ -389,6 +408,92 @@ State transitions are driven by atomic CAS operations:
 - Orch: FREE → PENDING/READY at submit time
 - Scheduler: READY → RUNNING → COMPLETED → CONSUMED during dispatch/completion
 
+### Fanout-release threshold
+
+Both paths that can trigger COMPLETED → CONSUMED (the scheduler's
+`try_consume` and the scope-end `release_ref`) use the same threshold:
+
+```cpp
+if (fanout_released >= fanout_total + 1 && state == COMPLETED) on_consumed(slot);
+```
+
+The `+1` accounts for the slot's own self-release contribution, which normal
+tasks emit from `on_task_complete` (`try_consume(slot)` self-call). Alloc
+slots (§8b) bypass the scheduler and pre-bump `fanout_released` to `1` at
+`alloc()` time to stand in for the self-release. Both paths use `on_consumed`,
+which CASes `state` from `COMPLETED` to `CONSUMED` to remain idempotent
+when both fire concurrently at the threshold.
+
+---
+
+## 8b. `alloc(shape, dtype)` — runtime-owned intermediate buffers
+
+Mirrors L2's "task slot owns its output buffer" model: `alloc` creates a
+synthetic task slot in `COMPLETED` state that owns an mmap'd buffer. The
+buffer is freed when the slot reaches `CONSUMED` — i.e. after all downstream
+consumers have completed and the scope ref has been released.
+
+```cpp
+ContinuousTensor Orchestrator::alloc(const std::vector<uint32_t> &shape, DataType dtype) {
+  // 1. mmap(MAP_SHARED|MAP_ANONYMOUS) a page-aligned region — visible to
+  //    forked child workers at the same virtual address.
+  void *buf = mmap(...);
+  // 2. Claim a task slot.
+  TaskSlot sid = ring_.alloc();
+  TaskSlotState &s = slots_[sid];
+  // 3. Record the buffer for the on_consumed munmap.
+  s.alloc_bufs.push_back(buf);
+  s.alloc_sizes.push_back(mmap_bytes);
+  // 4. Register as this slot's output so downstream `INPUT`-tagged tensors
+  //    with the same data ptr look up this slot as producer.
+  tensormap_.insert(reinterpret_cast<uint64_t>(buf), sid);
+  s.output_keys.push_back(reinterpret_cast<uint64_t>(buf));
+  // 5. No fanin — alloc has no work to wait on.
+  s.fanin_count = 0;
+  // 6. Initial fanout = scope_ref. Consumers that wire on this slot in
+  //    infer_deps bump fanout_total; this slot's CONSUMED transition waits
+  //    for all of them + scope_end.
+  s.fanout_total = (scope_.depth() > 0) ? 1 : 0;
+  if (s.fanout_total > 0) scope_.register_task(sid);
+  // 7. Sim the self-consume so the fanout-release threshold math aligns
+  //    with normal slots (see §8 Fanout-release threshold).
+  s.fanout_released = 1;
+  // 8. Straight to COMPLETED — no dispatch needed.
+  s.state = TaskState::COMPLETED;
+  active_tasks_++;
+  return ContinuousTensor{buf, shape, dtype};
+}
+```
+
+On `on_consumed`, in addition to the usual `tensormap.erase_task_outputs` and
+`ring.release(sid)`, the slot's `alloc_bufs` are `munmap`'d.
+
+### Consumer interaction
+
+`infer_deps` treats `COMPLETED` producers specially: it still wires the
+fanout edge (so the producer waits for the consumer before being consumed and
+freeing its buffer) but does not bump `live_fanins` (the consumer is
+immediately ready because the producer is already done).
+
+```cpp
+if (ps_state == TaskState::CONSUMED) continue;  // already gone
+ps.fanout_consumers.push_back(slot);
+ps.fanout_total++;
+s.fanin_producers.push_back(prod);
+if (ps_state != TaskState::COMPLETED) live_fanins++;  // wait only if not yet done
+```
+
+### Status — placeholder vs target (PR-H)
+
+The current implementation uses **per-alloc `mmap`** (one syscall per
+`alloc()` invocation). This is a placeholder. The target design (PR-H,
+"HeapRing") pre-allocates a single MAP_SHARED region at `Worker::init()`
+before any fork, bump-allocates from it, and reclaims via FIFO
+`last_alive` tracking — mirroring L2's `PTO2TaskAllocator`. Under the
+target design, `OUTPUT`-tagged tensors will be auto-allocated by the
+Orchestrator (no explicit `alloc` call), and `OUTPUT_EXISTING` will
+preserve the current "user-provided buffer" path.
+
 ---
 
 ## 9. Invariants
````
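The `MAP_SHARED | MAP_ANONYMOUS` choice in §8b can be demonstrated with a small POSIX-only Python sketch: a region mapped before `fork` is visible to the child at the same virtual address, so a forked worker can fill a buffer that the parent (and sibling consumers) later read. The runtime does this in C++; `buf` below is just a stand-in for an alloc'd intermediate, and `waitpid` stands in for the inferred dependency on the producer:

```python
import mmap
import os
import struct

# fileno=-1 on Unix gives an anonymous MAP_SHARED mapping -- the same
# combination orch.alloc uses.
buf = mmap.mmap(-1, 4096)

pid = os.fork()
if pid == 0:                              # child "worker" (the producer)
    buf[:8] = struct.pack("<d", 2.5)      # write into the shared page
    os._exit(0)                           # exit without cleanup side effects

os.waitpid(pid, 0)                        # "dependency": wait for the producer
value = struct.unpack("<d", buf[:8])[0]   # parent observes the child's write
buf.close()                               # stands in for the on_consumed munmap
```

A private (`MAP_PRIVATE`) mapping would copy-on-write in the child and the parent would never see the value, which is why the shared flag matters here.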

docs/roadmap.md

Lines changed: 127 additions & 0 deletions
```diff
@@ -0,0 +1,127 @@
+# Hierarchical Runtime — Roadmap
+
+The six per-component docs (`orchestrator.md`, `scheduler.md`,
+`worker-manager.md`, `task-flow.md`, `chip-level-arch.md`,
+`distributed_level_runtime.md`) describe the **target** design of the
+hierarchical runtime. This page tracks what has already landed vs. what is
+still in flight, so readers can tell which bits of the design are running
+today and which are planned.
+
+If you only read one file to understand "what will this look like when
+it's done", read the per-component doc. If you want to know "what do I
+get if I pip install `main` today", read this page.
+
+---
+
+## Landed
+
+### Schedule engine shape
+
+- **Component split** — `Orchestrator` (DAG builder) / `Scheduler` (DAG
+  executor) / `WorkerManager` + `WorkerThread` (execution layer) — lives
+  in `src/common/distributed/`.
+- **Level model** — L0–L6 as described in
+  [distributed_level_runtime.md](distributed_level_runtime.md) §1. L2
+  (single-chip) and L3 (composite over ChipWorker + SubWorker) are
+  implemented; L4+ recursion is not (see below).
+
+### User-facing API
+
+- **Unified `TaskArgs`** — vector-backed builder with per-tensor
+  `TensorArgType` tags (`INPUT` / `OUTPUT` / `INOUT` / `OUTPUT_EXISTING`
+  / `NO_DEP`). Replaces separate `TaggedTaskArgs` / `DynamicTaskArgs`.
+- **Tag-driven `submit_*` on `Orchestrator`** —
+  `submit_next_level` / `submit_next_level_group` / `submit_sub` /
+  `submit_sub_group`. No `inputs=`/`outputs=` kwargs; tags inside the
+  `TaskArgs` drive `tensormap.lookup`/`insert` automatically.
+- **`SubmitResult = {slot_id}`** — downstream consumers reference output
+  tensors by their own data pointers.
+- **`Worker` has no `submit`/`scope`/`drain`** — those concepts belong
+  to `Orchestrator` (accessed via `worker.get_orchestrator()`).
+  `Orchestrator._scope_begin` / `_scope_end` / `_drain` are invoked by
+  the Python `Worker.run` facade only.
+- **`orch.alloc(shape, dtype)`** — runtime-owned intermediate buffer
+  backed by `mmap(MAP_SHARED | MAP_ANONYMOUS)`. Lifetime follows a
+  synthetic task slot so the buffer is freed once all downstream
+  consumers have completed (see
+  [orchestrator.md](orchestrator.md) §8b).
+
+### Dispatch internals
+
+- `Scheduler` dispatches via a single ready queue into `WorkerManager`
+  pools (next-level + sub). The slot stores `chip_storage_list` (one
+  `ChipStorageTaskArgs` per group worker) that dispatch passes through
+  a `WorkerPayload` handed to `IWorker::run`.
+- `DistChipProcess` / `DistSubWorker` are separate classes today;
+  a unified `WorkerThread` with `THREAD | PROCESS` modes is not yet
+  implemented.
+
+---
+
+## In flight / not yet landed
+
+### PR-H: HeapRing + `OUTPUT` auto-alloc
+
+- Replace the current per-call `mmap` in `orch.alloc` with a single
+  pre-allocated `MAP_SHARED` region at `Worker.init()` (default 1 GB),
+  bump-allocated with FIFO reclamation (mirrors L2's
+  `PTO2TaskAllocator`).
+- The `OUTPUT` tag will auto-allocate from the ring;
+  `OUTPUT_EXISTING` keeps the "user-provided buffer" path.
+- Merge the slot ring + heap ring into one allocator
+  (matches L2-consistency audit Strict-2).
+- Fork-safety hygiene at `Worker.init()` (`setenv
+  OMP_NUM_THREADS=1` / `pthread_atfork` on runtime-owned locks).
+
+### PR-C: drop `WorkerPayload`, new `IWorker::run` signature
+
+- `IWorker::run(callable, TaskArgsView, config)` — no `WorkerPayload`
+  wrapper; the mailbox encodes a length-prefixed blob of `callable +
+  config + args` at dispatch.
+- The slot drops `chip_storage_list` and stores the `TaskArgs` itself.
+  The child assembles `ChipStorageTaskArgs` from the view at the L2 ABI
+  edge only.
+- Strict-1 (per-scope rings, 4 depth) lands here.
+
+### PR-D: WorkerThread unification + per-shape ready queues
+
+- Fold `DistChipProcess` / `DistSubWorker` into `WorkerThread` with
+  `Mode = THREAD | PROCESS`.
+- Strict-4: 3 ready queues (AIC / AIV / MIX) instead of a single queue.
+
+### PR-E: uniform `Worker.run` + callable registry unification
+
+- Python `Worker.run` drops the `if level==2` branch.
+- The callable registry moves fully into C++
+  (`unordered_map<uint64_t, nb::object>` owned by `Worker`) so
+  `ChipCallable` and Python `sub` callables share one lookup path.
+  This unblocks L4+ recursion.
+
+### PR-F: C++ `Worker::run(Task)` for L4+ recursion
+
+- C++ `Task { OrchFn orch; TaskArgs task_args; CallConfig config; }`
+  so a higher-level `Worker` can register a lower-level `Worker` as a
+  next-level child and dispatch via `IWorker::run`.
+
+### PR-G: drop the `Dist` prefix
+
+- Final rename sweep: `DistOrchestrator` → `Orchestrator`, files
+  `dist_*.{h,cpp}` → `*.{h,cpp}`.
+
+---
+
+## Behavioural notes on the current implementation
+
+- **`DistOrchestrator::release_ref` threshold is `>= total + 1`** (not
+  `>= total`). This matches `DistScheduler::try_consume` — the
+  `+1` accounts for the slot's own self-release contribution. Alloc
+  slots (synthetic, never dispatched) pre-bump `fanout_released` to
+  `1` in `alloc()` so the threshold math works for them too.
+  `on_consumed` uses a CAS on state to remain idempotent across the two
+  call paths (`release_ref` and `try_consume`).
+- **scene_test has two helper functions** —
+  `_build_chip_task_args` returns `ChipStorageTaskArgs` (POD, for the
+  current L2 path: `ChipWorker.run(callable, POD, config)`) and
+  `_build_l3_task_args` returns a tagged `TaskArgs` (for
+  `orch.submit_next_level`). PR-C will collapse these into one helper
+  when `ChipWorker::run` takes a `TaskArgsView`.
```
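The tag-driven dependency inference that replaces the `inputs=`/`outputs=` kwargs can be modeled as a dictionary keyed by the tensor's data pointer. Everything below (`infer_deps`, the integer "pointers") is an illustrative toy, not the real TensorMap API — it only shows the lookup/insert rules the docs describe:

```python
# Per-tensor tags, mirroring the TensorArgType names in the real API.
INPUT, OUTPUT, INOUT, OUTPUT_EXISTING, NO_DEP = range(5)

def infer_deps(tensormap, slot_id, tagged_ptrs):
    """Return the producer slots this task depends on; update the map.

    INPUT/INOUT -> tensormap lookup (consumer edge);
    OUTPUT/INOUT/OUTPUT_EXISTING -> tensormap insert; NO_DEP -> skipped.
    """
    producers = set()
    for ptr, tag in tagged_ptrs:
        if tag in (INPUT, INOUT) and ptr in tensormap:
            producers.add(tensormap[ptr])   # wire fanin to the producer
        if tag in (OUTPUT, INOUT, OUTPUT_EXISTING):
            tensormap[ptr] = slot_id        # this slot now produces ptr
    return producers

tm = {}
infer_deps(tm, 0, [(0x1000, OUTPUT)])       # task 0 produces buffer 0x1000
deps = infer_deps(tm, 1, [(0x1000, INPUT),  # task 1 consumes it ...
                          (0x2000, OUTPUT),
                          (0x3000, NO_DEP)])
assert deps == {0}                          # ... so it depends on task 0
assert tm[0x2000] == 1                      # and now produces 0x2000
```

This is also why `SubmitResult` can shrink to `{slot_id}`: the data pointer itself is the key downstream consumers use.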

docs/scheduler.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -1,5 +1,11 @@
 # Scheduler — DAG Dispatch Internals
 
+> **Status**: target design. Current code dispatches via
+> `IWorker::run(const WorkerPayload&)` rather than `run(callable, view,
+> config)`; the per-worker-type ready-queue split (Strict-4) is not yet
+> implemented. See [roadmap.md](roadmap.md) for the full
+> landed-vs-planned breakdown.
+
 The Scheduler is the **DAG executor**: a dedicated C++ thread that consumes
 submitted slots, wires fanout edges, dispatches ready tasks to worker threads,
 and handles completion callbacks. It is the bridge between the Orchestrator
```
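The wire/ready/dispatch/complete cycle the Scheduler doc describes can be sketched as a toy single-queue dispatcher. Names here are illustrative, not the real `DistScheduler` API, and the real scheduler runs concurrently rather than in one loop:

```python
from collections import deque

def run_dag(deps, execute):
    """deps: {slot: set of producer slots}; execute(slot) runs the task."""
    live_fanins = {s: len(ps) for s, ps in deps.items()}
    consumers = {s: [] for s in deps}
    for s, ps in deps.items():
        for p in ps:
            consumers[p].append(s)      # wire the fanout edge
    ready = deque(s for s, n in live_fanins.items() if n == 0)
    order = []
    while ready:
        s = ready.popleft()             # dispatch: READY -> RUNNING
        execute(s)
        order.append(s)                 # completion: RUNNING -> COMPLETED
        for c in consumers[s]:          # completion callback releases fanins
            live_fanins[c] -= 1
            if live_fanins[c] == 0:
                ready.append(c)
    return order

# Diamond DAG: 0 feeds 1 and 2; 3 waits on both.
order = run_dag({0: set(), 1: {0}, 2: {0}, 3: {1, 2}}, execute=lambda s: None)
assert order[0] == 0 and order[-1] == 3   # producers precede consumers
```

The Strict-4 target replaces the single `ready` deque with three (AIC / AIV / MIX), but the wiring and release logic stay the same shape.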

docs/task-flow.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -1,5 +1,11 @@
 # Task Flow — Callable / TaskArgs / CallConfig Pass-Through
 
+> **Status**: describes the **target** design. The unified `TaskArgs` +
+> tag-driven submit + Orchestrator-owned drain are landed; the
+> `IWorker::run(callable, view, config)` signature and length-prefixed
+> mailbox blob are not yet (target landing: PR-C). See
+> [roadmap.md](roadmap.md) for the full landed-vs-planned breakdown.
+
 This document specifies **what data flows through the hierarchical runtime and
 what shapes it takes at each stage**. It covers:
```

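The "length-prefixed blob of `callable + config + args`" idea targeted by PR-C can be sketched in a few lines. The field layout below is invented purely for illustration — the actual wire format will be whatever `write_blob` / `read_blob` define in C++ — but it shows why a length prefix lets the mailbox carry one contiguous, self-describing message:

```python
import struct

def pack_blob(callable_id: int, config: bytes, args: bytes) -> bytes:
    # Hypothetical layout: u64 callable id, u32 config length, u32 args
    # length, then the two variable-length payloads, all behind a u32
    # total-length prefix so the reader knows where the message ends.
    body = struct.pack("<QII", callable_id, len(config), len(args)) + config + args
    return struct.pack("<I", len(body)) + body

def unpack_blob(blob: bytes):
    (body_len,) = struct.unpack_from("<I", blob, 0)
    body = blob[4:4 + body_len]
    callable_id, cfg_len, args_len = struct.unpack_from("<QII", body, 0)
    off = struct.calcsize("<QII")
    config = body[off:off + cfg_len]
    args = body[off + cfg_len:off + cfg_len + args_len]
    return callable_id, config, args

blob = pack_blob(42, b"cfg", b"\x01\x02\x03")
assert unpack_blob(blob) == (42, b"cfg", b"\x01\x02\x03")
```

The round-trip property (`unpack(pack(x)) == x`) is the invariant the dispatch side and the worker side must agree on, regardless of the final field order.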
docs/worker-manager.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -1,5 +1,11 @@
 # Worker Manager — Pool, Threading, and Dispatch Modes
 
+> **Status**: describes the **target** design. Current code still has
+> separate `DistChipProcess` / `DistSubWorker` classes (target: merged
+> into `WorkerThread` in PR-D) and passes `const WorkerPayload&` to
+> `IWorker::run` (target: replaced in PR-C). See
+> [roadmap.md](roadmap.md) for the full landed-vs-planned breakdown.
+
 `WorkerManager` and `WorkerThread` together implement the **execution layer**
 of a `Worker` engine. `WorkerManager` owns two pools of `WorkerThread`s (one
 for next-level workers, one for sub workers); each `WorkerThread` owns an
```
