Summary
Add a tensor dump capability that captures intermediate tensor data (inputs before dispatch and outputs after completion) during runtime execution. This enables offline debugging, golden-value validation, and kernel correctness verification without modifying user kernels.
The feature spans three layers:
- Platform layer: common tensor dump interface, AICPU-side dump logic, and host-side collector for gathering dumped data
- Runtime layer: integration into the `host_build_graph` runtime (with future support for `aicpu_build_graph` and `tensormap_and_ringbuffer`)
- User interface: `--dump-tensor` CLI flag in `run_example.py` and a dedicated example (`dump_tensor_example`) demonstrating usage
Motivation / Use Case
When debugging kernel correctness issues or validating new orchestration flows, developers currently have no built-in way to inspect intermediate tensor values at each execution step. They must manually instrument kernel code or add ad-hoc print statements, which is error-prone and non-reproducible.
A first-class tensor dump feature allows:
- Capturing before-dispatch inputs and after-completion outputs per task, saved to disk as binary files
- Comparing dumped tensors against golden computations to pinpoint which kernel or step produces incorrect results
- Debugging without modifying kernel source — the dump is controlled entirely from the runtime/platform layer
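The golden-comparison workflow above could be sketched as below. This is a hypothetical helper, not the actual implementation: the dump file format is assumed to be raw little-endian float32, and the function names are invented for illustration.

```python
import struct
from pathlib import Path

def load_dump(path):
    """Read a dumped tensor file as a flat list of floats.

    Assumes raw little-endian float32 binary data, which is a guess
    at the on-disk format used by the tensor dump feature.
    """
    data = Path(path).read_bytes()
    count = len(data) // 4  # 4 bytes per float32 element
    return list(struct.unpack(f"<{count}f", data))

def max_abs_diff(dumped, golden):
    """Largest element-wise deviation between a dump and its golden values."""
    assert len(dumped) == len(golden), "shape mismatch between dump and golden"
    return max(abs(d - g) for d, g in zip(dumped, golden))
```

With helpers like these, a test harness can load each per-task dump file and flag the first step whose output drifts from the golden computation beyond a tolerance.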
Proposed API / Behavior
- Enable via the `--dump-tensor` flag on `run_example.py`
- Runtime sets `enable_dump` in the kernel args; AICPU reads this flag and writes tensor data to a host-visible region
- Host-side `TensorDumpCollector` gathers and writes binary dump files organized by task ID and tensor index
- Output directory: `outputs/tensor_dump_<timestamp>/`
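The host-side collector behavior could look roughly like the following sketch. The class name and output directory come from this proposal; the method signature, per-file naming scheme, and the `stage` label distinguishing before-dispatch inputs from after-completion outputs are assumptions for illustration only.

```python
import time
from pathlib import Path

class TensorDumpCollector:
    """Hypothetical host-side collector: gathers dumped tensor bytes and
    writes them as binary files organized by task ID and tensor index."""

    def __init__(self, root="outputs"):
        # Timestamped directory per run, matching outputs/tensor_dump_<timestamp>/
        stamp = time.strftime("%Y%m%d_%H%M%S")
        self.out_dir = Path(root) / f"tensor_dump_{stamp}"
        self.out_dir.mkdir(parents=True, exist_ok=True)

    def collect(self, task_id, tensor_index, data, stage="input"):
        """Persist one tensor blob; 'stage' marks before-dispatch inputs
        vs. after-completion outputs (file naming here is an assumption)."""
        path = self.out_dir / f"task_{task_id}_{stage}_{tensor_index}.bin"
        path.write_bytes(data)
        return path
```

In the real flow the runtime would set `enable_dump` in the kernel args, the AICPU would copy tensor data into a host-visible region, and a collector along these lines would drain that region to disk after each task completes.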
Additional Context
Work in progress: currently implemented for the `host_build_graph` runtime on the a2a3 architecture.