Hardware test by yanghaoran29 · Pull Request #1 · yanghaoran29/simpler

yanghaoran29 · 2026-03-16T13:18:58Z

No description provided.

…iling

Add a standalone simulation unit test framework (tests/aicpu_ut/) that runs the PTO2 orchestrator and scheduler logic on a standard Linux CPU without Ascend hardware, with integrated perf profiling support.

- Extend register read/write stubs to support simulation mode (zero-reg address treated as no-op dispatch, enables zero-core perf testing)
- Add per-core register address mapping used by the executor in sim mode
- Add AICPU-side profiling buffer management: per-core dispatch timestamp arrays, double-buffer switch, and perf_aicpu_switch_buffer()
- Expose PLATFORM_PROF_BUFFER_SIZE for compile-time sizing
- Add PTO2_ORCH_PROFILING and PTO2_SCHED_PROFILING as independent sub-switches under the existing PTO2_PROFILING master flag
- Add PTO2_SIM_AICORE_UT branch: when cores_total_num_ == 0 (sim mode), skip hardware register polling and run a drain loop instead
- Integrate PTO2_SCHED_PROFILING instrumentation: track per-phase cycle counts (get_ready, resolve_deps, dispatch_setup) and accumulate into scheduler phase breakdown output
- Add local_dispatch_count / local_overflow_count profiling counters
- Simulate AICore execution in-process: aicpu_sim_run_pto2() launches scheduler threads, accumulates dispatch counts per worker type, and provides aicpu_sim_get_actual_sched_cpu() for affinity reporting
- Add print_sched_profiling(rt): print per-thread phase breakdown (get_ready / resolve_deps / dispatch_setup / idle) in table form
- Add ut_dispatch_without_fanin_satisfied flag: in sim mode, treats any task with fanin_refcount >= 1 as immediately READY (bypasses fanin wait), allowing all tasks to be dispatched for scheduler-only perf measurement without actual AICore execution completing tasks
- Set init_task_on_submit = true when scheduler is attached so that init_task() is called at submit time, pre-populating fanin_refcount
- Add pto2_runtime_create_custom() for tests: takes explicit task_window_size and gm_heap_size parameters
- Add get_sim_aicore_mode() accessor
- Fully static link (no .so dependencies): discovers libstdc++.a, libm.a, libc.a, libpthread.a, libdl.a, libgcc.a at configure time
- One binary per PERF_CASE_IDX via target_compile_definitions
- PTO2_PROFILING / PTO2_SCHED_PROFILING / PTO2_ORCH_PROFILING toggles
- PTO2_SIM_AICORE_UT option (default ON) for zero-core sim paths
- cpu_affinity.cpp / cpu_affinity.h: bind_to_cpu(), current_cpu() via sched_setaffinity / sched_getcpu; ORCH_CPU / SCHED_CPU{0..7} from compile-time defines
- test_common.cpp / test_common.h: make_runtime() (calls pto2_runtime_create_custom with task_window=16384, heap=4GB), sim_run_with_resolve_and_dispatch() (runs scheduler threads and idle-loops until MAX_IDLE_ITERATIONS quiet cycles), print_orch_profiling(), print_sched_profiling() wrappers
- json_cases.h: PerfTestCase struct for compile-time test case selection
- test_log_stubs.cpp: stub out DEV_DEBUG / DEV_INFO / DEV_ERROR etc. for host-side compilation
- test_cpu_affinity.cpp: verify bind_to_cpu() and current_cpu() return the expected core
- test_platform_config.cpp: verify PLATFORM_MAX_BLOCKDIM, PLATFORM_AIC_CORES_PER_BLOCKDIM, PLATFORM_AIV_CORES_PER_BLOCKDIM, PLATFORM_MAX_AICPU_THREADS compile-time values
- test_paged_attention.cpp: single-head paged attention orch+sched perf
- test_batch_paged_attention.cpp: batch paged attention full pipeline (orchestrator and scheduler run concurrently on separate threads); 3 cases: batch=64/ctx=8193, batch=2/varseq, batch=4/varseq
- test_batch_paged_attention_orch_only.cpp: orchestration only, no scheduler threads; used to profile build_batch_paged_attention_graph in isolation
- test_batch_paged_attention_sched_prof_only.cpp: run orchestration first (single-threaded, completes fully), then launch scheduler threads separately; PERF_WAIT_AFTER_INIT / SIGSTOP mechanism pauses after orch so perf record window covers only the scheduler phase
- CMake configure → parallel build → test execution → pass/fail summary
- Test registry (TEST_TYPE / TEST_INDICES associative arrays) for --test / --idx filtering
- --sched-threads N: pass AICPU_UT_NUM_SCHED_THREADS to test binaries
- --no-profiling / --no-sched-profiling / --no-orch-profiling toggles
- Writes sim output to outputs/aicpu_ut_sim_run.log; phase breakdown to outputs/aicpu_ut_phase_breakdown.log
- Wrapper around perf record for a single named binary (--bin required)
- test_batch_paged_attention* binaries: SIGSTOP/SIGCONT protocol: detect process state T via /proc/<pid>/stat, attach perf record -p, send SIGCONT; sampling window covers only the work phase
- Other binaries: full-program perf record -- <bin>
- --build triggers run_tests.sh --build-only before sampling
- Default: --no-build; --call-graph dwarf (default) / fp / lbr
- Document run_tests.sh and perf_sched.sh usage, parameters, available tests, environment variables, and execution flow
- Parse aicpu_ut_sim_run.log and print per-task-type dispatch statistics
- Add Part 2 JSON phase data source: try parse_scheduler_from_json_phases first (perf JSON version >= 2), fall back to device log parsing
- Extend Phase Breakdown table with get_ready / dispatch_setup columns
- Development notes and simulation architecture overview

Update: remove ut_dispatch_without_fanin_satisfied; add build/profiling opts

Remove ut_dispatch_without_fanin_satisfied from PTO2Scheduler:
- Field bypassed fanin dependency check (fanin_rc>=1 instead of fanin_rc==fanin_count) in sim tests; no longer needed as the dependency chain is evaluated correctly without the escape hatch
- Simplify release_fanin_and_check_ready() to unconditional bool ready = (new_refcount == task->fanin_count)
- Remove initialization in pto_scheduler.cpp and the #if PTO2_SIM_AICORE_UT blocks in test_batch_paged_attention.cpp and test_batch_paged_attention_sched_prof_only.cpp

run_tests.sh:
- Default profiling to OFF (silent run); add --profiling flag to enable all profiling output; add --profiling --no-sched/orch-profiling for selective control
- Suppress SIM_LOG/AICPU_UT_PHASE_LOG writes and summary output when profiling is off
- Add --opt-level <N> parameter (default 3); passed to CMake as OPT_LEVEL; also settable via OPT_LEVEL env variable

CMakeLists.txt:
- Add OPT_LEVEL cache variable (default 3); compile options now use -O${OPT_LEVEL} so optimization level is configurable at build time

HARDWARE_SIMULATION.md:
- Remove outdated ut_dispatch_without_fanin_satisfied section

Made-with: Cursor
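The simplified readiness rule described in the update can be illustrated with a minimal C++ sketch. Only the function name release_fanin_and_check_ready() and the comparison new_refcount == task->fanin_count come from the commit message; SketchTask, its field types, the atomic increment, and releases_until_ready() are assumptions made for illustration and do not reflect the actual PTO2 data layout.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a scheduler task; the real PTO2 task struct is not shown in this PR.
struct SketchTask {
    int32_t fanin_count = 0;                 // number of producers this task waits on
    std::atomic<int32_t> fanin_refcount{0};  // number of producers released so far
};

// Records one producer completing and reports whether the task just became ready.
// Because the increment is atomic, exactly one caller sees new_refcount == fanin_count.
inline bool release_fanin_and_check_ready(SketchTask* task) {
    const int32_t new_refcount =
        task->fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1;
    const bool ready = (new_refcount == task->fanin_count);
    return ready;
}

// Illustrative helper: counts how many releases it takes before the task is ready.
inline int releases_until_ready(int32_t fanin_count) {
    assert(fanin_count >= 1);  // the sketch assumes at least one producer
    SketchTask task;
    task.fanin_count = fanin_count;
    int releases = 0;
    for (;;) {
        ++releases;
        if (release_fanin_and_check_ready(&task)) return releases;
    }
}
```

Under these assumptions, a task with three producers becomes ready only on the third release, whereas the removed escape hatch (fanin_rc >= 1) would have reported it ready after the first.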
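The CPU affinity helpers listed in the commit message could look roughly like the sketch below. The names bind_to_cpu() and current_cpu() and the use of sched_setaffinity / sched_getcpu come from the commit message; the signatures, return types, and the first_allowed_cpu() helper are assumptions for illustration. The code is Linux-specific and relies on glibc exposing these calls through <sched.h>.

```cpp
#include <sched.h>  // sched_setaffinity, sched_getaffinity, sched_getcpu (Linux, glibc)

// Assumed signature: pin the calling thread to a single CPU; returns true on success.
inline bool bind_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Assumed signature: index of the CPU the calling thread is running on, or -1 on error.
inline int current_cpu() {
    return sched_getcpu();
}

// Hypothetical helper (not in the PR): lowest CPU index the calling thread may run on,
// or -1 if the affinity mask cannot be read.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

A check in the spirit of test_cpu_affinity.cpp would bind to an allowed CPU and then confirm that current_cpu() reports the same index. Querying the allowed set first avoids failures inside containers whose cpuset excludes CPU 0.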


yanghaoran29 added 14 commits March 16, 2026 20:41

Remove no early return

58b0cda

Remove

552c551

Unitify the interface

6e9b66e

update profiling

7d8feb0

update profiling

740aecc

recover comment

cf98f08

update

4ca41a8

Update

ff8958d

Update

215d900

update

12637b2

remove

d9b522d

fix

64fb166

Merge branch 'hardware_test' of https://github.com/yanghaoran29/simpler…

8e7e152

… into hardware_test Made-with: Cursor

yanghaoran29 closed this Mar 17, 2026

yanghaoran29 deleted the hardware_test branch March 17, 2026 09:01

yanghaoran29 wants to merge 14 commits into main from hardware_test
