Hardware test #1

Closed
yanghaoran29 wants to merge 14 commits into main from hardware_test

@yanghaoran29
Owner

No description provided.

…iling

Add a standalone simulation unit test framework (tests/aicpu_ut/) that
runs the PTO2 orchestrator and scheduler logic on a standard Linux CPU
without Ascend hardware, with integrated perf profiling support.

- Extend register read/write stubs to support simulation mode (a zero
  register address is treated as a no-op dispatch, which enables
  zero-core perf testing)
- Add per-core register address mapping used by the executor in sim mode

- Add AICPU-side profiling buffer management: per-core dispatch timestamp
  arrays, double-buffer switch, and perf_aicpu_switch_buffer()
- Expose PLATFORM_PROF_BUFFER_SIZE for compile-time sizing
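
The double-buffer switch described above can be illustrated with a minimal sketch. This is not the code from this PR: the struct, its members, and the capacity constant `kProfBufferSize` are hypothetical stand-ins (the PR itself exposes PLATFORM_PROF_BUFFER_SIZE and perf_aicpu_switch_buffer(), whose real signatures are not shown here). The idea is that the writer fills one timestamp array while the other is free to be drained.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical capacity; stands in for a compile-time size such as
// PLATFORM_PROF_BUFFER_SIZE.
constexpr std::size_t kProfBufferSize = 4;

// Two fixed-size timestamp arrays: one is written while the other is drained.
struct DoubleBuffer {
    std::array<std::array<std::uint64_t, kProfBufferSize>, 2> slots{};
    std::size_t count[2] = {0, 0};
    int active = 0;  // index of the buffer currently being written

    // Record one dispatch timestamp; returns false when the active buffer is full.
    bool record(std::uint64_t ts) {
        if (count[active] == kProfBufferSize) return false;
        slots[active][count[active]++] = ts;
        return true;
    }

    // Swap buffers and return the index of the buffer that is now ready to drain.
    int switch_buffer() {
        int full = active;
        active ^= 1;
        count[active] = 0;  // the new active buffer starts empty
        assert(active != full);
        return full;
    }
};
```

A caller would typically invoke `switch_buffer()` when `record()` returns false, hand the returned index to a consumer, and keep recording into the other array.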

- Add PTO2_ORCH_PROFILING and PTO2_SCHED_PROFILING as independent
  sub-switches under the existing PTO2_PROFILING master flag

- Add PTO2_SIM_AICORE_UT branch: when cores_total_num_ == 0 (sim mode),
  skip hardware register polling and run a drain loop instead
- Integrate PTO2_SCHED_PROFILING instrumentation: track per-phase cycle
  counts (get_ready, resolve_deps, dispatch_setup) and accumulate into
  scheduler phase breakdown output
- Add local_dispatch_count / local_overflow_count profiling counters
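
As a rough illustration of the per-phase accounting listed above, the following sketch accumulates elapsed ticks into one counter per phase. The phase names come from this commit message; `SchedPhaseStats`, `ScopedPhase`, and `now_ticks()` are hypothetical names, and the host-side `steady_clock` stands in for whatever cycle counter the real scheduler reads.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Phase names follow the commit message; everything else here is illustrative.
enum Phase { kGetReady = 0, kResolveDeps, kDispatchSetup, kNumPhases };

struct SchedPhaseStats {
    std::uint64_t cycles[kNumPhases] = {0, 0, 0};

    void add(Phase p, std::uint64_t delta) {
        assert(p >= 0 && p < kNumPhases);
        cycles[p] += delta;
    }

    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (int i = 0; i < kNumPhases; ++i) sum += cycles[i];
        return sum;
    }
};

// Stand-in for a hardware cycle counter: steady_clock nanoseconds on the host.
inline std::uint64_t now_ticks() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}

// RAII timer: adds the elapsed ticks to one phase when it leaves scope.
class ScopedPhase {
public:
    ScopedPhase(SchedPhaseStats& stats, Phase phase)
        : stats_(stats), phase_(phase), start_(now_ticks()) {}
    ~ScopedPhase() { stats_.add(phase_, now_ticks() - start_); }
    ScopedPhase(const ScopedPhase&) = delete;
    ScopedPhase& operator=(const ScopedPhase&) = delete;

private:
    SchedPhaseStats& stats_;
    Phase phase_;
    std::uint64_t start_;
};
```

Keeping one `SchedPhaseStats` per scheduler thread avoids contention on the counters; a breakdown table can then be produced by dividing each entry by `total()`.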

- Simulate AICore execution in-process: aicpu_sim_run_pto2() launches
  scheduler threads, accumulates dispatch counts per worker type, and
  provides aicpu_sim_get_actual_sched_cpu() for affinity reporting

- Add print_sched_profiling(rt): print per-thread phase breakdown
  (get_ready / resolve_deps / dispatch_setup / idle) in table form
- Add ut_dispatch_without_fanin_satisfied flag: in sim mode, treats any
  task with fanin_refcount >= 1 as immediately READY (bypasses fanin
  wait), allowing all tasks to be dispatched for scheduler-only perf
  measurement without actual AICore execution completing tasks

- Set init_task_on_submit = true when scheduler is attached so that
  init_task() is called at submit time, pre-populating fanin_refcount

- Add pto2_runtime_create_custom() for tests: takes explicit
  task_window_size and gm_heap_size parameters
- Add get_sim_aicore_mode() accessor

- Fully static link (no .so dependencies): discovers libstdc++.a,
  libm.a, libc.a, libpthread.a, libdl.a, libgcc.a at configure time
- One binary per PERF_CASE_IDX via target_compile_definitions
- PTO2_PROFILING / PTO2_SCHED_PROFILING / PTO2_ORCH_PROFILING toggles
- PTO2_SIM_AICORE_UT option (default ON) for zero-core sim paths

- cpu_affinity.cpp / cpu_affinity.h: bind_to_cpu(), current_cpu() via
  sched_setaffinity / sched_getcpu; ORCH_CPU / SCHED_CPU{0..7} from
  compile-time defines
- test_common.cpp / test_common.h: make_runtime() (calls
  pto2_runtime_create_custom with task_window=16384, heap=4GB),
  sim_run_with_resolve_and_dispatch() (runs scheduler threads and
  idle-loops until MAX_IDLE_ITERATIONS quiet cycles), print_orch_profiling(),
  print_sched_profiling() wrappers
- json_cases.h: PerfTestCase struct for compile-time test case selection
- test_log_stubs.cpp: stub out DEV_DEBUG / DEV_INFO / DEV_ERROR etc. for
  host-side compilation
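
For readers unfamiliar with the affinity calls named above, here is a minimal Linux-only sketch of what bind_to_cpu() and current_cpu() could look like on top of sched_setaffinity and sched_getcpu. The function names match the commit message, but the signatures and return conventions are assumptions, not the actual contents of cpu_affinity.cpp.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for sched_getcpu and the CPU_* macros on glibc
#endif
#include <sched.h>

// Pin the calling thread to a single CPU. Returns true on success.
// Assumed signature; the real helper may differ.
bool bind_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 applies the mask to the calling thread.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Return the CPU the calling thread is currently running on, or -1 on error.
int current_cpu() {
    return sched_getcpu();
}
```

A test in the spirit of test_cpu_affinity.cpp would call `bind_to_cpu(n)` and then compare `current_cpu()` against `n`; note that binding fails if `n` is outside the set of CPUs the process is allowed to use (for example inside a restricted container).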

- test_cpu_affinity.cpp: verify bind_to_cpu() and current_cpu() return
  the expected core
- test_platform_config.cpp: verify PLATFORM_MAX_BLOCKDIM,
  PLATFORM_AIC_CORES_PER_BLOCKDIM, PLATFORM_AIV_CORES_PER_BLOCKDIM,
  PLATFORM_MAX_AICPU_THREADS compile-time values

- test_paged_attention.cpp: single-head paged attention orch+sched perf
- test_batch_paged_attention.cpp: batch paged attention full pipeline
  (orchestrator and scheduler run concurrently on separate threads);
  3 cases: batch=64/ctx=8193, batch=2/varseq, batch=4/varseq
- test_batch_paged_attention_orch_only.cpp: orchestration only, no
  scheduler threads; used to profile build_batch_paged_attention_graph
  in isolation
- test_batch_paged_attention_sched_prof_only.cpp: run orchestration
  first (single-threaded, completes fully), then launch scheduler threads
  separately; PERF_WAIT_AFTER_INIT / SIGSTOP mechanism pauses after orch
  so perf record window covers only the scheduler phase

- CMake configure → parallel build → test execution → pass/fail summary
- Test registry (TEST_TYPE / TEST_INDICES associative arrays) for
  --test / --idx filtering
- --sched-threads N: pass AICPU_UT_NUM_SCHED_THREADS to test binaries
- --no-profiling / --no-sched-profiling / --no-orch-profiling toggles
- Writes sim output to outputs/aicpu_ut_sim_run.log; phase breakdown to
  outputs/aicpu_ut_phase_breakdown.log

- Wrapper around perf record for a single named binary (--bin required)
- test_batch_paged_attention* binaries: SIGSTOP/SIGCONT protocol —
  detect process state T via /proc/<pid>/stat, attach perf record -p,
  send SIGCONT; sampling window covers only the work phase
- Other binaries: full-program perf record -- <bin>
- --build triggers run_tests.sh --build-only before sampling
- Defaults: --no-build and --call-graph dwarf (fp and lbr are also accepted)
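
The stopped-state check in the SIGSTOP/SIGCONT protocol is done by the shell script, but the parsing step is easy to get wrong, so a C++ sketch of it may help. The function names are hypothetical. The key detail is that the second field of /proc/<pid>/stat (the command name) is parenthesised and may itself contain spaces or parentheses, so the state character is located relative to the last ')'.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Extract the state character from one /proc/<pid>/stat line.
// Layout: "<pid> (<comm>) <state> ..."; <comm> may contain spaces or ')',
// so search for the LAST closing parenthesis.
char parse_proc_state(const std::string& stat_line) {
    std::size_t close = stat_line.rfind(')');
    if (close == std::string::npos || close + 2 >= stat_line.size()) return '\0';
    return stat_line[close + 2];  // skip ") " to reach the state field
}

// True when the process is in state 'T' (stopped by a signal such as SIGSTOP).
bool is_stopped(int pid) {
    assert(pid > 0);
    std::ifstream in("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    if (!std::getline(in, line)) return false;  // process gone or unreadable
    return parse_proc_state(line) == 'T';
}
```

A profiler wrapper would poll `is_stopped(pid)` after launching the test binary, attach the sampler once it returns true, and only then send SIGCONT.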

- Document run_tests.sh and perf_sched.sh usage, parameters, available
  tests, environment variables, and execution flow

- Parse aicpu_ut_sim_run.log and print per-task-type dispatch statistics

- Add Part 2 JSON phase data source: try parse_scheduler_from_json_phases
  first (perf JSON version >= 2), fall back to device log parsing
- Extend Phase Breakdown table with get_ready / dispatch_setup columns

- Development notes and simulation architecture overview

Update: remove ut_dispatch_without_fanin_satisfied; add build/profiling opts

Remove ut_dispatch_without_fanin_satisfied from PTO2Scheduler:
- Field bypassed fanin dependency check (fanin_rc>=1 instead of
  fanin_rc==fanin_count) in sim tests; no longer needed as the
  dependency chain is evaluated correctly without the escape hatch
- Simplify release_fanin_and_check_ready() to unconditional
  bool ready = (new_refcount == task->fanin_count)
- Remove initialization in pto_scheduler.cpp and the
  #if PTO2_SIM_AICORE_UT blocks in test_batch_paged_attention.cpp
  and test_batch_paged_attention_sched_prof_only.cpp
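
The simplified readiness check can be sketched as follows. The field names fanin_count and fanin_refcount and the function name come from this commit message; the `Task` struct, the atomic type, and the assumption that the counter starts at zero and counts up by one per completed producer are illustrative only and may not match PTO2Scheduler.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical minimal task record.
struct Task {
    std::int32_t fanin_count = 0;                 // number of producers to wait for
    std::atomic<std::int32_t> fanin_refcount{0};  // producers released so far
};

// Called once per completed producer. Returns true for exactly one caller:
// the one whose increment satisfies the last outstanding dependency.
bool release_fanin_and_check_ready(Task* task) {
    std::int32_t new_refcount =
        task->fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1;
    assert(new_refcount <= task->fanin_count);
    bool ready = (new_refcount == task->fanin_count);
    return ready;
}
```

Because the comparison is an equality on the value returned by the atomic increment, concurrent releasers cannot both observe the task as ready, which is what the removed `>= 1` escape hatch could not guarantee.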

run_tests.sh:
- Default profiling to OFF (silent run); add --profiling flag to
  enable all profiling output; combine --profiling with
  --no-sched-profiling or --no-orch-profiling for selective control
- Suppress SIM_LOG/AICPU_UT_PHASE_LOG writes and summary output
  when profiling is off
- Add --opt-level <N> parameter (default 3); passed to CMake as
  OPT_LEVEL; also settable via OPT_LEVEL env variable

CMakeLists.txt:
- Add OPT_LEVEL cache variable (default 3); compile options now use
  -O${OPT_LEVEL} so optimization level is configurable at build time

HARDWARE_SIMULATION.md:
- Remove outdated ut_dispatch_without_fanin_satisfied section

Made-with: Cursor
@yanghaoran29 yanghaoran29 deleted the hardware_test branch March 17, 2026 09:01