
Support: per-task tensor dump across all runtimes (#dump-tensor)#547

Merged
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoZheng109:dump-tensor
Apr 15, 2026

Conversation

@ChaoZheng109 (Collaborator)

Add opt-in runtime observability that captures per-task input/output tensor bytes (before dispatch and after completion) and exports them to outputs/tensor_dump_<timestamp>/ for offline inspection.

Architecture:

  • a2a3 (primary): shared memory via halHostRegister + DumpMemoryManager background thread for concurrent collection during execution
  • a5 (temporary fallback): memcpy-based batch collect-after-sync, pending a5 halHostRegister support
  • Device-side structures (DumpSetupHeader, DumpBuffer, TensorDumpRecord, circular arena) are binary-identical across both platforms
  • dump_data_base flows through KernelArgs (not Runtime), matching the profiling pattern
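To make the "binary-identical device-side structures" idea concrete, here is a minimal Python sketch of reading fixed-size records out of a circular arena. The field names and widths (`task_id`, `tensor_index`, `data_offset`, `data_size`) are illustrative assumptions, not the actual `TensorDumpRecord` layout:

```python
import struct

# Hypothetical record layout -- field names and widths are illustrative
# assumptions, not the actual TensorDumpRecord definition.
RECORD_FMT = "<IIQQ"  # task_id, tensor_index, data_offset, data_size
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def read_records(arena: bytes, head: int, count: int, capacity: int):
    """Walk a circular arena of fixed-size records starting at slot `head`.

    `capacity` is the number of slots; indices wrap around, so the oldest
    records are read first even after the producer has wrapped.
    """
    records = []
    for i in range(count):
        slot = (head + i) % capacity           # wrap around the arena
        offset = slot * RECORD_SIZE
        records.append(struct.unpack_from(RECORD_FMT, arena, offset))
    return records
```

Because the format string fixes endianness and field widths, the same parser works against dumps produced on either platform, which is the point of keeping the structures binary-identical.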

Runtime integration:

  • host_build_graph: orchestration registers tensor metadata via set_tensor_info_to_task() / add_task_with_tensor_info()
  • aicpu_build_graph / tensormap_and_ringbuffer: metadata is derived from PTO2TaskPayload automatically, so no orchestration changes are needed
  • Gated by ChipCallConfig::enable_dump_tensor (--dump-tensor CLI flag); no allocation and no code-path cost when disabled
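The "zero allocation, zero code-path cost when disabled" gating can be sketched as follows. This `DumpContext` class is hypothetical (the real gate lives in C++ behind `ChipCallConfig::enable_dump_tensor`); it only illustrates the pattern of keeping `dump_data_base` null when the flag is off:

```python
class DumpContext:
    """Hypothetical sketch of the opt-in gating pattern described above."""

    def __init__(self, enable_dump_tensor: bool, arena_bytes: int = 1 << 20):
        # Disabled: allocate nothing. dump_data_base stays None, so the
        # kernel-arg packing path never sees a dump buffer.
        self.dump_data_base = bytearray(arena_bytes) if enable_dump_tensor else None
        self._offset = 0

    def maybe_record(self, payload: bytes) -> bool:
        if self.dump_data_base is None:   # disabled: one branch, no work
            return False
        end = self._offset + len(payload)
        self.dump_data_base[self._offset:end] = payload  # copy into arena
        self._offset = end
        return True
```

The single null check on the hot path is the entire disabled-mode cost, which matches how the profiling pattern passes its base pointer through KernelArgs.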

Artifacts:

  • dump_tensor_example scene tests (a2a3 + a5)
  • tools/dump_viewer.py for filtering and exporting to human-readable txt
  • docs/tensor-dump.md design doc
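As a rough illustration of the kind of filtering and text export a viewer like tools/dump_viewer.py performs, here is a hedged sketch; the record schema (`task_id`, `direction`, `shape` keys) is an assumption for illustration, not the tool's actual data model:

```python
def filter_records(records, task_id=None, direction=None):
    """Select dump records by task id and/or 'input'/'output' direction.

    Each record is assumed to be a dict; a None filter matches everything.
    """
    return [r for r in records
            if (task_id is None or r["task_id"] == task_id)
            and (direction is None or r["direction"] == direction)]

def export_txt(records):
    """Render records as human-readable lines, one tensor per line."""
    return "\n".join(
        f"task={r['task_id']} {r['direction']} shape={r['shape']}"
        for r in records)
```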

closes #506


@ChaoZheng109 ChaoZheng109 force-pushed the dump-tensor branch 23 times, most recently from 1e2bb81 to 7934011 Compare April 15, 2026 01:17
@ChaoZheng109 ChaoZheng109 marked this pull request as ready for review April 15, 2026 01:18
@ChaoWao ChaoWao merged commit e3e4bd5 into hw-native-sys:main Apr 15, 2026
15 checks passed
