
Support: per-task tensor dump across all runtimes (#dump-tensor)#547

Merged
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoZheng109:dump-tensor
Apr 15, 2026

Conversation

@ChaoZheng109 (Collaborator)

Add opt-in runtime observability that captures per-task input/output tensor bytes (before dispatch and after completion) and exports them to outputs/tensor_dump_<timestamp>/ for offline inspection.

Architecture:

  • a2a3 (primary): shared memory via halHostRegister + DumpMemoryManager background thread for concurrent collection during execution
  • a5 (temporary fallback): memcpy-based batch collect-after-sync, pending a5 halHostRegister support
  • Device-side structures (DumpSetupHeader, DumpBuffer, TensorDumpRecord, circular arena) are binary-identical across both platforms
  • dump_data_base flows through KernelArgs (not Runtime), matching the profiling pattern
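To make the "binary-identical device-side structures" idea concrete, here is a minimal Python sketch of reading fixed-size records out of a circular arena. The field names and widths (`task_id`, `tensor_index`, `data_offset`, `data_size`) are illustrative assumptions, not the actual `TensorDumpRecord` layout:

```python
import struct

# Hypothetical record layout -- field names and widths are illustrative
# assumptions, not the actual TensorDumpRecord definition.
RECORD_FMT = "<IIQQ"  # task_id, tensor_index, data_offset, data_size
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def read_records(arena: bytes, head: int, count: int, capacity: int):
    """Walk a circular arena of fixed-size records starting at slot `head`.

    `capacity` is the number of slots; indices wrap around, so the oldest
    records are read first even after the producer has wrapped.
    """
    records = []
    for i in range(count):
        slot = (head + i) % capacity           # wrap around the arena
        offset = slot * RECORD_SIZE
        records.append(struct.unpack_from(RECORD_FMT, arena, offset))
    return records
```

Because the format string fixes endianness and field widths, the same parser works against dumps produced on either platform, which is the point of keeping the structures binary-identical.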

Runtime integration:

  • host_build_graph: orchestration registers tensor metadata via set_tensor_info_to_task() / add_task_with_tensor_info()
  • aicpu_build_graph / tensormap_and_ringbuffer: metadata is derived from PTO2TaskPayload automatically, so no orchestration changes are needed
  • Gated by ChipCallConfig::enable_dump_tensor (--dump-tensor CLI flag); no allocation and no code-path cost when disabled
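The "zero allocation, zero code-path cost when disabled" gating can be sketched as follows. This `DumpContext` class is hypothetical (the real gate lives in C++ behind `ChipCallConfig::enable_dump_tensor`); it only illustrates the pattern of keeping `dump_data_base` null when the flag is off:

```python
class DumpContext:
    """Hypothetical sketch of the opt-in gating pattern described above."""

    def __init__(self, enable_dump_tensor: bool, arena_bytes: int = 1 << 20):
        # Disabled: allocate nothing. dump_data_base stays None, so the
        # kernel-arg packing path never sees a dump buffer.
        self.dump_data_base = bytearray(arena_bytes) if enable_dump_tensor else None
        self._offset = 0

    def maybe_record(self, payload: bytes) -> bool:
        if self.dump_data_base is None:   # disabled: one branch, no work
            return False
        end = self._offset + len(payload)
        self.dump_data_base[self._offset:end] = payload  # copy into arena
        self._offset = end
        return True
```

The single null check on the hot path is the entire disabled-mode cost, which matches how the profiling pattern passes its base pointer through KernelArgs.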

Artifacts:

  • dump_tensor_example scene tests (a2a3 + a5)
  • tools/dump_viewer.py for filtering and exporting to human-readable txt
  • docs/tensor-dump.md design doc
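As a rough illustration of the kind of filtering and text export a viewer like tools/dump_viewer.py performs, here is a hedged sketch; the record schema (`task_id`, `direction`, `shape` keys) is an assumption for illustration, not the tool's actual data model:

```python
def filter_records(records, task_id=None, direction=None):
    """Select dump records by task id and/or 'input'/'output' direction.

    Each record is assumed to be a dict; a None filter matches everything.
    """
    return [r for r in records
            if (task_id is None or r["task_id"] == task_id)
            and (direction is None or r["direction"] == direction)]

def export_txt(records):
    """Render records as human-readable lines, one tensor per line."""
    return "\n".join(
        f"task={r['task_id']} {r['direction']} shape={r['shape']}"
        for r in records)
```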

closes #506


@ChaoZheng109 ChaoZheng109 force-pushed the dump-tensor branch 23 times, most recently from 1e2bb81 to 7934011 Compare April 15, 2026 01:17
@ChaoZheng109 ChaoZheng109 marked this pull request as ready for review April 15, 2026 01:18
@ChaoWao ChaoWao merged commit e3e4bd5 into hw-native-sys:main Apr 15, 2026
15 checks passed
