```shell
git clone <repo-url>
cd simpler
```

The pto-isa dependency will be automatically cloned when you first run an example that needs it.
The pto-isa repository provides header files needed for kernel compilation on the a2a3 (hardware) platform.
The test framework automatically handles PTO_ISA_ROOT setup:
- Checks if `PTO_ISA_ROOT` is already set
- If not, clones pto-isa to `examples/scripts/_deps/pto-isa` on first run
- Passes the resolved path to the kernel compiler
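The resolution steps above can be sketched in Python. This is an illustrative approximation of the framework's behavior, not its actual code; the function name `resolve_pto_isa_root` and its signature are hypothetical, while the environment variable, clone destination, and SSH/HTTPS URLs come from this document:

```python
import os
import subprocess
from pathlib import Path

def resolve_pto_isa_root(clone_protocol: str = "ssh") -> Path:
    """Hypothetical sketch of the PTO_ISA_ROOT resolution described above."""
    # 1. Respect an explicitly set PTO_ISA_ROOT.
    env = os.environ.get("PTO_ISA_ROOT")
    if env:
        return Path(env)

    # 2. Otherwise clone into examples/scripts/_deps/pto-isa on first use.
    dest = Path("examples/scripts/_deps/pto-isa")
    if not dest.exists():
        url = (
            "git@github.com:PTO-ISA/pto-isa.git"
            if clone_protocol == "ssh"
            else "https://github.com/PTO-ISA/pto-isa.git"
        )
        dest.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["git", "clone", "--branch", "main", url, str(dest)], check=True
        )

    # 3. The resolved path is then passed on to the kernel compiler.
    return dest
```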
Automatic Setup (Recommended): Just run your example - pto-isa will be cloned automatically on first run:
```shell
python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
    -g examples/a2a3/host_build_graph/vector_example/golden.py \
    -p a2a3sim
```

By default, the auto-clone uses SSH (`git@github.com:...`). In CI or environments without SSH keys, use `--clone-protocol https`:
```shell
python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
    -g examples/a2a3/host_build_graph/vector_example/golden.py \
    -p a2a3sim --clone-protocol https
```

Manual Setup (if auto-setup fails or you prefer manual control):
```shell
mkdir -p examples/scripts/_deps
git clone --branch main git@github.com:PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa

# Or use HTTPS
git clone --branch main https://github.com/PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa

# Set environment variable (optional - auto-detected if in standard location)
export PTO_ISA_ROOT=$(pwd)/examples/scripts/_deps/pto-isa
```

Using a Different Location:
```shell
export PTO_ISA_ROOT=/path/to/your/pto-isa
```

Troubleshooting:
- If git is not available: clone pto-isa manually and set `PTO_ISA_ROOT`
- If the clone fails due to network issues: try again, or clone manually
- If the SSH clone fails (e.g., in CI): use `--clone-protocol https`, or clone manually with HTTPS
Note: For the simulation platform (a2a3sim), PTO ISA headers are optional and only needed if your kernels use PTO ISA intrinsics.
- CMake 3.15+
- CANN toolkit with:
  - `ccec` compiler (AICore Bisheng CCE)
  - Cross-compiler for AICPU (aarch64-target-linux-gnu-gcc/g++)
- Standard C/C++ compiler (gcc/g++) for host
- Python 3 with development headers
```shell
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
```

All workflows assume an activated project-local venv (see .claude/rules/venv-isolation.md for why `--no-build-isolation` is required).
Recommended daily-dev setup:
```shell
python3 -m venv --system-site-packages .venv
source .venv/bin/activate
pip install --no-build-isolation scikit-build-core nanobind cmake pytest numpy ml_dtypes torch
pip install --no-build-isolation -e .
```

Editing Python is instant (editable install). Editing C++ requires re-running `pip install --no-build-isolation -e .` (scikit-build-core's auto-rebuild is disabled because it interacts badly with pip's ephemeral build env — see docs/python-packaging.md).
Other supported paths: pip install . (non-editable), pip install --no-build-isolation ., pip install -e ., and cmake + PYTHONPATH (no pip). Full comparison of all 5 paths — what lands where, which entry points work under each, trade-offs — lives in docs/python-packaging.md.
Verifying an install: the single source of truth is tools/verify_packaging.sh, which exercises all 5 install paths × 4 entry points from a fully clean state. CI runs the same script on macOS + Ubuntu (see the packaging-matrix job in .github/workflows/ci.yml).
The RuntimeCompiler class handles compilation of all three components separately:
```python
from simpler_setup.runtime_compiler import RuntimeCompiler

# For real Ascend hardware (requires CANN toolkit)
compiler = RuntimeCompiler(platform="a2a3")

# For simulation (no Ascend SDK needed)
compiler = RuntimeCompiler(platform="a2a3sim")

# Compile each component to independent binaries
aicore_binary = compiler.compile("aicore", include_dirs, source_dirs)  # → .o file
aicpu_binary = compiler.compile("aicpu", include_dirs, source_dirs)    # → .so file
host_binary = compiler.compile("host", include_dirs, source_dirs)      # → .so file
```

Toolchains used:
- AICore: Bisheng CCE (`ccec` compiler) → `.o` object file (a2a3 only)
- AICPU: aarch64 cross-compiler → `.so` shared object (a2a3 only)
- Host: standard gcc/g++ → `.so` shared library
- HostSim: standard gcc/g++ for all targets (a2a3sim)
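The platform/component-to-toolchain mapping above can be summarized in a small helper. This is an illustrative sketch, not part of the `RuntimeCompiler` API; only the toolchain names and the platform rules are taken from the list above:

```python
def toolchain_for(platform: str, component: str) -> str:
    """Illustrative lookup of the toolchain used per platform and component.

    Summarizes the list above; this helper is not part of the framework.
    """
    # a2a3sim builds every component with the standard host gcc/g++.
    if platform == "a2a3sim":
        return "gcc/g++"
    return {
        "aicore": "Bisheng CCE (ccec)",               # → .o object file
        "aicpu": "aarch64-target-linux-gnu-gcc/g++",  # → .so shared object
        "host": "gcc/g++",                            # → .so shared library
    }[component]
```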
```shell
# Simulation platform (no hardware required)
python examples/scripts/run_example.py \
    -k examples/a2a3/host_build_graph/vector_example/kernels \
    -g examples/a2a3/host_build_graph/vector_example/golden.py \
    -p a2a3sim

# Hardware platform (requires Ascend device)
python examples/scripts/run_example.py \
    -k examples/a2a3/host_build_graph/vector_example/kernels \
    -g examples/a2a3/host_build_graph/vector_example/golden.py \
    -p a2a3
```

Expected output:
```
=== Building Runtime: host_build_graph (platform: a2a3sim) ===
...
=== Comparing Results ===
Comparing f: shape=(16384,), dtype=float32
f: PASS (16384/16384 elements matched)
============================================================
TEST PASSED
============================================================
```
Driving the runtime directly via `ChipWorker`:

```python
from simpler.task_interface import ChipWorker
from simpler_setup.runtime_builder import RuntimeBuilder

# Build or locate pre-built runtime binaries
builder = RuntimeBuilder(platform="a2a3sim")
binaries = builder.get_binaries("tensormap_and_ringbuffer")

# Create worker and initialize with platform binaries
worker = ChipWorker()
worker.init(host_path=str(binaries.host_path),
            aicpu_path=str(binaries.aicpu_path),
            aicore_path=str(binaries.aicore_path))
worker.set_device(device_id=0)

# Execute callable on device
worker.run(chip_callable, orch_args, block_dim=24)

# Cleanup
worker.reset_device()
worker.finalize()
```

In src/{arch}/runtime/host_build_graph/runtime/runtime.h:
```c
#define RUNTIME_MAX_TASKS  131072  // Maximum number of tasks
#define RUNTIME_MAX_ARGS   16      // Maximum arguments per task
#define RUNTIME_MAX_FANOUT 512     // Maximum successors per task
```

Runtime behavior is configured via kernel_config.py in each example:
```python
RUNTIME_CONFIG = {
    "runtime": "host_build_graph",  # Runtime to use
    "aicpu_thread_num": 3,          # Number of AICPU scheduler threads
    "block_dim": 3,                 # Number of AICore blocks (1 block = 1 AIC + 2 AIV)
}
```

Device selection is done via the `--device` CLI flag:
```shell
python examples/scripts/run_example.py -k <kernels> -g <golden.py> -p a2a3 --device 0
```

- Device IDs: 0-15 (typically device 9 is used for examples)
- Handshake cores: Usually 3 (1c2v configuration: 1 core, 2 vector units)
- Kernel compilation: requires the `ASCEND_HOME_PATH` environment variable
- Memory management: MemoryAllocator automatically tracks allocations
- Python requirement: NumPy for efficient array operations
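The `RUNTIME_CONFIG` shown earlier can be sanity-checked before launch. This helper is hypothetical (not part of the framework); the limit constants mirror the `#define`s from runtime.h, and the required keys come from the kernel_config.py example above:

```python
# Limits copied from the #defines in runtime.h.
RUNTIME_MAX_TASKS = 131072
RUNTIME_MAX_ARGS = 16
RUNTIME_MAX_FANOUT = 512

def validate_runtime_config(config: dict) -> None:
    """Hypothetical sanity check for a kernel_config.py RUNTIME_CONFIG dict."""
    for key in ("runtime", "aicpu_thread_num", "block_dim"):
        if key not in config:
            raise ValueError(f"missing required key: {key}")
    if config["aicpu_thread_num"] < 1:
        raise ValueError("aicpu_thread_num must be >= 1")
    if config["block_dim"] < 1:
        raise ValueError("block_dim must be >= 1")

# The example config from above passes validation.
validate_runtime_config({
    "runtime": "host_build_graph",
    "aicpu_thread_num": 3,
    "block_dim": 3,
})
```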
Device logs are written to `~/ascend/log/debug/device-<id>/`
Kernels use these logging macros:

- `DEV_INFO`: informational messages
- `DEV_DEBUG`: debug messages
- `DEV_WARN`: warnings
- `DEV_ERROR`: error messages
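For quick triage, the device logs can be scanned for warning- and error-level output. This is a hypothetical helper, not part of the framework; the log directory layout comes from the path above, and the assumption that the macro names appear verbatim in log lines is ours:

```python
from pathlib import Path

def scan_device_log(device_id: int,
                    base: Path = Path.home() / "ascend" / "log" / "debug",
                    levels=("DEV_ERROR", "DEV_WARN")) -> list[str]:
    """Hypothetical helper: collect DEV_ERROR/DEV_WARN lines from a device's logs.

    Assumes macro names appear verbatim in the log lines, which may not
    match the actual on-disk log format.
    """
    log_dir = base / f"device-{device_id}"
    hits: list[str] = []
    for log_file in sorted(log_dir.glob("*.log")):
        for line in log_file.read_text(errors="replace").splitlines():
            if any(level in line for level in levels):
                hits.append(line)
    return hits
```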