Unify TaskArgs:
- Rename storage template `TaskArgs<...>` -> `TaskArgsTpl<...>` so the
unqualified name `TaskArgs` is free for the unified user-facing builder.
- Add `using TaskArgs = TaskArgsTpl<ContinuousTensor, uint64_t, 0, 0,
TensorArgType>` — vector-backed + per-tensor TensorArgType tags.
- Add TaskArgsView, make_view(), task_args_blob_size, write_blob,
read_blob, view_to_chip_storage for the dispatch / wire / L2 ABI edge.
- nanobind: drop DynamicTaskArgs / TaggedTaskArgs bindings, expose
unified TaskArgs; extend TensorArgType with OUTPUT_EXISTING / NO_DEP.
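To make the unified builder concrete, here is a minimal Python sketch of a tag-carrying TaskArgs. The class body and `add` method are illustrative stand-ins, not the real nanobind binding; only the type names and tag values follow the commit.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class TensorArgType(Enum):
    # Tag values per the commit, including the two new ones.
    INPUT = auto()
    OUTPUT = auto()
    INOUT = auto()
    OUTPUT_EXISTING = auto()
    NO_DEP = auto()

@dataclass
class TaskArgs:
    # Parallel lists: one TensorArgType tag per tensor argument,
    # mirroring the vector-backed, per-tensor-tag storage.
    tensors: list = field(default_factory=list)
    tags: list = field(default_factory=list)

    def add(self, tensor, tag: TensorArgType) -> "TaskArgs":
        self.tensors.append(tensor)
        self.tags.append(tag)
        return self

args = (TaskArgs()
        .add("t_in", TensorArgType.INPUT)
        .add("t_out", TensorArgType.OUTPUT))
```

The tags travel with the args into `submit_*`, so callers never pass dependency lists separately.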
Tag-driven submit:
- DistOrchestrator exposes submit_next_level / submit_next_level_group /
submit_sub / submit_sub_group, each taking a TaskArgs (with tags).
- Tags drive dependency inference: INPUT/INOUT -> tensormap.lookup
producer; OUTPUT/INOUT/OUTPUT_EXISTING -> tensormap.insert; NO_DEP skip.
- Drop `inputs=` / `outputs=` from the submit API; downstream consumers
reference output tensors by their own data pointers.
- Shrink DistSubmitResult to {slot_id} only. Delete DistInputSpec /
DistOutputSpec / DistSubmitOutput from both C++ and Python surfaces.
Slot storage and dispatch:
- DistTaskSlotState drops `payload` / `args_list<const void*>`; gains
worker_type / callable_ptr / callable_id / config (ChipCallConfig) /
chip_storage_list<ChipStorageTaskArgs> built by Orchestrator at submit.
- DistScheduler::dispatch_ready assembles a per-worker WorkerPayload
from slot fields + chip_storage_list[i] and hands it to IWorker::run.
- WorkerPayload kept as an internal dispatch carrier (mailbox layout
unchanged); not exposed to Python.
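The per-worker assembly step can be modeled like this. The field names on `SlotState` follow the commit; `WorkerPayload`'s shape here is a guess at the internal carrier, which the commit deliberately keeps off the Python surface.

```python
from dataclasses import dataclass

@dataclass
class SlotState:
    worker_type: str
    callable_id: int
    config: dict            # stands in for ChipCallConfig
    chip_storage_list: list  # one ChipStorageTaskArgs blob per worker

@dataclass
class WorkerPayload:
    callable_id: int
    config: dict
    chip_storage: bytes

def dispatch_ready(slot: SlotState):
    # One payload per worker: shared slot fields paired with that
    # worker's own chip_storage_list[i] entry.
    return [WorkerPayload(slot.callable_id, slot.config, blob)
            for blob in slot.chip_storage_list]

slot = SlotState("chip", 7, {"streams": 1}, [b"w0", b"w1"])
payloads = dispatch_ready(slot)
```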
Worker / Orchestrator separation:
- Delete DistWorker::submit / submit_group / scope_begin / scope_end
entirely — those concepts belong on Orchestrator.
- Add DistWorker::get_orchestrator() accessor; nanobind exposes the C++
DistOrchestrator directly with submit_* (public) and _scope_begin /
_scope_end (invoked only by the Python facade).
- Python Orchestrator becomes a thin wrapper over the bound C++
DistOrchestrator (no more WorkerPayload construction, no inputs/outputs
kwargs).
- Python Worker.run() fetches the orchestrator handle once at init and
runs scope_begin -> orch_fn -> scope_end -> drain inside one DAG.
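The Worker.run() sequencing can be sketched as follows; `FakeOrch` is a hypothetical stub standing in for the bound C++ DistOrchestrator, and `drain` here represents whatever wait-for-completion step closes out the DAG.

```python
class FakeOrch:
    def __init__(self):
        self.log = []
    def _scope_begin(self): self.log.append("scope_begin")
    def _scope_end(self): self.log.append("scope_end")
    def drain(self): self.log.append("drain")

class Worker:
    def __init__(self, orch):
        # Fetched once at init, mirroring DistWorker::get_orchestrator().
        self._orch = orch

    def run(self, orch_fn):
        # One DAG per run: scope_begin -> user fn -> scope_end -> drain.
        self._orch._scope_begin()
        orch_fn(self._orch)
        self._orch._scope_end()
        self._orch.drain()

orch = FakeOrch()
Worker(orch).run(lambda o: o.log.append("orch_fn"))
```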
Orchestrator.alloc for runtime-managed intermediates:
- DistOrchestrator::alloc(shape, dtype) -> ContinuousTensor. Mirrors
L2's "task slot owns its output buffer" model: alloc creates a
synthetic task slot in COMPLETED state that owns an mmap'd buffer;
the buffer is munmap'd when the slot reaches CONSUMED (all downstream
consumers done + scope ref released). Users tag the returned tensor
as OUTPUT / INPUT in TaskArgs to wire deps naturally via the
TensorMap — no separate alloc-lifecycle API needed.
- mmap(MAP_SHARED|MAP_ANONYMOUS) so forked child workers see the same
virtual address.
- DistTaskSlotState gains alloc_bufs / alloc_sizes (empty for non-alloc
slots). on_consumed munmap's them.
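The fork-visibility property that motivates MAP_SHARED|MAP_ANONYMOUS can be demonstrated directly. In Python, `mmap.mmap(-1, n)` on POSIX creates exactly such a mapping (anonymous, MAP_SHARED by default), so a forked child writes into the same physical pages the parent sees; this snippet is POSIX-only (`os.fork`).

```python
import mmap
import os

# Anonymous shared mapping: fileno=-1, default flags are MAP_SHARED,
# so forked children inherit the same backing pages.
buf = mmap.mmap(-1, 4096)

pid = os.fork()
if pid == 0:
    buf[:5] = b"hello"   # child writes into the shared mapping
    os._exit(0)
os.waitpid(pid, 0)
data = bytes(buf[:5])    # parent observes the child's write
buf.close()
```

With MAP_PRIVATE instead, the child's write would be copy-on-write and invisible to the parent, which is why the commit calls out the flag choice.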
Orchestrator consume-lifecycle fixes (required by alloc):
- infer_deps now wires fanout on COMPLETED producers (previously
skipped): consumer doesn't wait on the producer (live_fanins not
bumped) but is added to fanin_producers so its deferred try_consume
keeps the producer alive until the consumer finishes. CONSUMED
producers are still skipped (resources already freed).
- release_ref threshold changed from `>= total` to `>= total + 1` to
  match try_consume — prevents scope_end from prematurely consuming
  slots whose downstream consumers haven't finished. Contributors sum
  to total + 1: 1 (the slot's own try_consume, or alloc's simulated
  one) + N (consumer deferreds) + 1 (scope_end).
- on_consumed is idempotent (CAS on state); both release paths can now
hit the threshold concurrently without double-freeing alloc buffers.
Returns bool (true iff this call performed the transition).
- active_tasks_ fetch_sub lives inside orchestrator.on_consumed (gated
on CAS win). A notify_consumed callback wired from DistWorker at init
signals drain from both scheduler-driven and scope_end-driven paths.
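A toy model of the threshold and the idempotent transition, under simplifying assumptions: the lock stands in for an atomic CAS on the slot state, and `freed` stands in for munmap'ing alloc buffers. With one downstream consumer, total = 2 and the third release (scope_end's) crosses the `total + 1` threshold.

```python
import threading

class Slot:
    def __init__(self, n_consumers):
        self.total = n_consumers + 1   # self try_consume + consumer deferreds
        self.refs = 0
        self.state = "COMPLETED"
        self._lock = threading.Lock()
        self.freed = 0

    def on_consumed(self):
        # Idempotent transition: only the CAS winner frees resources.
        with self._lock:               # stands in for an atomic CAS
            if self.state == "CONSUMED":
                return False           # lost the race: no double free
            self.state = "CONSUMED"
        self.freed += 1                # e.g. munmap alloc buffers, once
        return True

    def release_ref(self):
        self.refs += 1
        if self.refs >= self.total + 1:   # scope_end contributes the +1
            return self.on_consumed()
        return False

slot = Slot(n_consumers=1)
# self try_consume + 1 consumer deferred + scope_end = total + 1 releases
results = [slot.release_ref() for _ in range(3)]
```

Returning a bool from `on_consumed` lets either release path (scheduler-driven or scope_end-driven) know whether it won the transition and should signal drain.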
Plumbing:
- Extract ChipCallConfig to its own header (chip_call_config.h) to
break the circular include between dist_types.h and chip_worker.h.
- Rename runtime `Arg` base from TaskArgs<...> to TaskArgsTpl<...> in
a2a3/aicpu_build_graph, a2a3/tensormap_and_ringbuffer, and
a5/tensormap_and_ringbuffer.
Tests:
- C++ tests/ut/cpp/test_dist_orchestrator.cpp + test_dist_scheduler.cpp
rewritten against the new TaskArgs-tag API.
- Python ut tests migrated: test_host_worker (TestSubmitResult replaces
TestOutputAllocation; new TestOrchAlloc class), test_group_task
(synthetic-tensor dep wiring).
- L3 ST tests (test_l3_dependency, test_l3_group) build TaskArgs with
tags directly; scene_test._build_chip_task_args returns TaskArgs.
- test_task_interface.py — TestTaskArgs covers the merged surface.
- 105/105 Python ut pass on macOS.