```bash
git clone <repo-url>
cd simpler
```

The pto-isa dependency will be automatically cloned when you first run an example that needs it.
The pto-isa repository provides header files needed for kernel compilation on the a2a3 (hardware) platform.
The test framework automatically handles `PTO_ISA_ROOT` setup:

- Checks if `PTO_ISA_ROOT` is already set
- If not, clones pto-isa to `examples/scripts/_deps/pto-isa` on first run
- Passes the resolved path to the kernel compiler
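The resolution order above can be sketched as a small pure function. This is illustrative only: `resolve_pto_isa_root` is a hypothetical helper name, not the framework's actual API (the real logic lives under `examples/scripts`).

```python
from pathlib import Path

# Hypothetical sketch of the PTO_ISA_ROOT resolution order described above.
DEFAULT_DEP_PATH = Path("examples/scripts/_deps/pto-isa")

def resolve_pto_isa_root(env: dict, repo_root: Path) -> Path:
    explicit = env.get("PTO_ISA_ROOT")
    if explicit:
        return Path(explicit)            # 1. an already-set variable wins
    return repo_root / DEFAULT_DEP_PATH  # 2. fall back to the auto-clone dir
```

The resolved path is then handed to the kernel compiler.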
Automatic Setup (Recommended): just run your example; pto-isa will be cloned automatically on first run:

```bash
python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
    -g examples/a2a3/host_build_graph/vector_example/golden.py \
    -p a2a3sim
```

By default, the auto-clone uses SSH (`git@github.com:...`). In CI or environments without SSH keys, use `--clone-protocol https`:

```bash
python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
    -g examples/a2a3/host_build_graph/vector_example/golden.py \
    -p a2a3sim --clone-protocol https
```

Manual Setup (if auto-setup fails or you prefer manual control):
```bash
mkdir -p examples/scripts/_deps
git clone --branch main git@github.com:PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa
# Or use HTTPS
git clone --branch main https://github.com/PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa
# Set environment variable (optional - auto-detected if in standard location)
export PTO_ISA_ROOT=$(pwd)/examples/scripts/_deps/pto-isa
```

Using a Different Location:

```bash
export PTO_ISA_ROOT=/path/to/your/pto-isa
```

Troubleshooting:
- If git is not available: clone pto-isa manually and set `PTO_ISA_ROOT`
- If the clone fails due to network issues: try again, or clone manually
- If the SSH clone fails (e.g., in CI): use `--clone-protocol https`, or clone manually with HTTPS
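The SSH/HTTPS switch amounts to a small lookup. The sketch below is hypothetical (`clone_url` is not the framework's API); the two URLs match the manual-setup commands above.

```python
# Hypothetical helper mirroring --clone-protocol; not the framework's API.
CLONE_URLS = {
    "ssh": "git@github.com:PTO-ISA/pto-isa.git",
    "https": "https://github.com/PTO-ISA/pto-isa.git",
}

def clone_url(protocol: str) -> str:
    try:
        return CLONE_URLS[protocol]
    except KeyError:
        raise ValueError(f"unknown clone protocol: {protocol!r}") from None
```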
Note: For the simulation platform (a2a3sim), PTO ISA headers are optional and only needed if your kernels use PTO ISA intrinsics.
- CMake 3.15+
- CANN toolkit with:
  - `ccec` compiler (AICore Bisheng CCE)
  - Cross-compiler for AICPU (aarch64-target-linux-gnu-gcc/g++)
- Standard C/C++ compiler (gcc/g++) for host
- Python 3 with development headers
```bash
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
```

There are three ways to get the project running, depending on your role. All three assume an activated project-local venv (see `.claude/rules/venv-isolation.md`).
| Concern | `pip install .` | `pip install -e .` | cmake + PYTHONPATH |
|---|---|---|---|
| Who it's for | Users / CI | Python + C++ developers | C++-only developers |
| `simpler_setup` resolves to | site-packages | source tree (via `.pth`) | source tree (via PYTHONPATH) |
| `simpler` resolves to | site-packages (4 files) | source tree `python/simpler/` (all 8 files) | source tree (all 8) |
| `_task_interface.*.so` lives at | site-packages root | `build/{wheel_tag}/` (finder-dispatched) | `python/_task_interface.*.so` |
| `PROJECT_ROOT` | `<site-packages>/simpler_setup/_assets/` | repo root | repo root |
| `src/` found under | `_assets/src/` | `<repo>/src/` | `<repo>/src/` |
| `build/lib/` found under | `_assets/build/lib/` | `<repo>/build/lib/` | `<repo>/build/lib/` |
| Edit `.py` → effect | reinstall required | immediate | immediate |
| Edit nanobind `.cpp` → rebuild | reinstall required | auto on next import (`editable.rebuild`) | manual `cmake --build build/` |
| Edit runtime `src/` → rebuild | reinstall or manual | manual (`--build` flag or explicit script) | manual |
| `from simpler.kernel_compiler import` | fails (excluded from wheel) | works (transitional copy — use `simpler_setup` instead) | works (transitional copy — use `simpler_setup` instead) |
| `--build` path writable | no (site-packages read-only) | yes | yes |
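One way to tell which column applies to a given environment is to look at where a module actually resolved from. The classifier below is illustrative, not part of the project:

```python
from pathlib import Path

# Illustrative classifier for the table above: a module resolving from
# site-packages implies a regular `pip install .`; a path inside the repo
# implies an editable install or the PYTHONPATH setup.
def install_mode(module_file: str, repo_root: str) -> str:
    p = Path(module_file).resolve()
    if "site-packages" in p.parts:
        return "pip install ."
    if Path(repo_root).resolve() in p.parents:
        return "source tree (editable or PYTHONPATH)"
    return "unknown"

# e.g. install_mode(simpler.__file__, "/path/to/repo")
```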
```bash
pip install --no-build-isolation .
```

`--no-build-isolation` is required: scikit-build-core consumes the venv's already-installed scikit-build-core, nanobind, and cmake directly; isolation would hide them.
What lands in site-packages:

```
site-packages/
├── _task_interface.cpython-*.so   # nanobind extension
├── simpler/                       # stable 4 files only
│   ├── __init__.py
│   ├── env_manager.py
│   ├── task_interface.py
│   └── worker.py
└── simpler_setup/
    ├── *.py                       # test framework + authoritative compilers
    └── _assets/
        ├── src/                   # headers + orchestration sources
        └── build/lib/             # pre-built runtime binaries
```
Limitations:

- Python edits require `pip install .` again
- `from simpler.{kernel_compiler,runtime_compiler,toolchain,elf_parser} import ...` does not work — use `simpler_setup.*` for those
- `--build` (rebuild runtime from source) won't work (site-packages is read-only)
Best for: ci.sh jobs and downstream consumers who only need to run examples.
```bash
pip install --no-build-isolation -e .
```

The build is invoked once during install; pyproject.toml sets `editable.rebuild = true`, so subsequent C++ changes are picked up automatically.
What happens at install time:
- `simpler_setup/` and `simpler/` get `.pth` redirects pointing at the source tree
- `_task_interface.*.so` is built into `build/{wheel_tag}/` and dispatched by scikit-build-core's import finder
- `build_runtimes.py` pre-builds runtime binaries into `<repo>/build/lib/`
- `install()` rules also populate `<site-packages>/simpler_setup/_assets/`, but those are shadowed by the source-tree redirect
Rebuild behavior on import:
Every fresh Python process that imports simpler_setup or _task_interface triggers cmake --build + cmake --install against the top-level CMakeLists before the import returns. This covers:
- nanobind module (`python/bindings/*.cpp`) — real incremental rebuild when source changed
- `build_runtimes` ALL target — re-invokes `build_runtimes.py`, which fans out to per-runtime inner cmakes (each a fast no-op when nothing changed)
Startup cost per fresh process:
- Nothing changed: ~6-15 seconds, depending on how many toolchains are installed (one inner cmake per runtime × platform combination, each a few hundred ms)
- Real C++ change: full incremental rebuild blocks import until done
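To measure that per-process startup cost yourself, you can time an import in a fresh interpreter. This is a generic sketch; substitute `simpler_setup` for the module name in an editable install.

```python
import subprocess
import sys
import time

# Times `import <module>` in a brand-new interpreter, so any on-import
# rebuild hook (like editable.rebuild) is included in the measurement.
def fresh_import_seconds(module: str) -> float:
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start

# e.g. fresh_import_seconds("simpler_setup") in an editable install
```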
What's still manual:
- Runtime `src/{arch}/...` edits for `--build` code paths: pass `--build` to `run_example.py` (or re-run `build_runtimes.py`). `editable.rebuild` will also try, but the inner no-op walk is the same — running `--build` explicitly on the affected example is faster.
- Transitional `from simpler.{kernel_compiler,...} import ...` still works in editable mode (the source tree has the files); migrate to `simpler_setup.*` when convenient.
Best for: daily development. Python edits are instant, C++ rebuilds without thinking about pip install.
Turning off rebuild temporarily (e.g. for faster pytest iteration when nothing C++ changed):
```bash
# One-off: skip rebuild for this invocation
SKBUILD_EDITABLE_REBUILD=0 pytest ...
# Or edit pyproject.toml to set editable.rebuild = false, then re-install
```

This path bypasses pip entirely. Useful if you want compile_commands.json, IDE integration, or are debugging a CMake-only concern.
```bash
# Dependencies (one-time, install into the venv)
pip install --no-build-isolation nanobind cmake scikit-build-core torch pytest

# Build
cmake -B build -S .
cmake --build build --parallel

# Make Python find the project
export PYTHONPATH="$(pwd):$(pwd)/python"

# Now run anything
python examples/scripts/run_example.py -k ... -g ... -p a2a3sim
```

Why `PYTHONPATH="$(pwd):$(pwd)/python"`:
- `$(pwd)` makes `simpler_setup` importable (it lives at the repo root)
- `$(pwd)/python` makes `simpler.*` importable (it lives under `python/simpler/`) and also finds `python/_task_interface.*.so`
In this mode `SKBUILD_MODE=OFF`, so CMakeLists.txt takes the non-install branch: the nanobind module's `LIBRARY_OUTPUT_DIRECTORY` is set to `<repo>/python/`, and no `install()` runs. `_assets/` is not created — `PROJECT_ROOT` falls back to the repo root.
What's still needed from pip:

- `find_package(nanobind CONFIG REQUIRED)` in `python/bindings/CMakeLists.txt` requires nanobind to be discoverable via its Python-installed CMake config. Even without `pip install .`, you need `pip install nanobind` in the active venv.
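A quick way to check that precondition is to ask nanobind where its CMake config lives; `python -m nanobind --cmake_dir` is nanobind's documented way to locate it. The wrapper below is an illustrative helper, not part of this project:

```python
import importlib.util
import subprocess
import sys
from typing import Optional

# Returns nanobind's CMake config directory, or None if nanobind is not
# installed in the active venv (i.e. find_package(nanobind) would fail).
def nanobind_cmake_dir() -> Optional[str]:
    if importlib.util.find_spec("nanobind") is None:
        return None  # pip install nanobind first
    out = subprocess.run(
        [sys.executable, "-m", "nanobind", "--cmake_dir"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```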
Rebuild:
- C++ (nanobind or runtime): manual `cmake --build build/`
- nanobind alone: `cmake --build build --target _task_interface`
- Runtime alone: `cmake --build build --target build_runtimes` (or just `run_example.py --build`)
Limitations:

- `editable.rebuild` and everything else in `[tool.scikit-build]` are not consulted — this path doesn't go through scikit-build-core
- You manage all dependencies manually
- Good for CMake-centric debugging; not the recommended daily loop
Best for: C++-only iteration, IDE integration, tests/ut/cpp/ development.
The RuntimeCompiler class handles compilation of all three components separately:

```python
from simpler_setup.runtime_compiler import RuntimeCompiler

# For real Ascend hardware (requires CANN toolkit)
compiler = RuntimeCompiler(platform="a2a3")

# For simulation (no Ascend SDK needed)
compiler = RuntimeCompiler(platform="a2a3sim")

# Compile each component to independent binaries
aicore_binary = compiler.compile("aicore", include_dirs, source_dirs)  # → .o file
aicpu_binary = compiler.compile("aicpu", include_dirs, source_dirs)    # → .so file
host_binary = compiler.compile("host", include_dirs, source_dirs)      # → .so file
```

Toolchains used:
- AICore: Bisheng CCE (`ccec` compiler) → `.o` object file (a2a3 only)
- AICPU: aarch64 cross-compiler → `.so` shared object (a2a3 only)
- Host: standard gcc/g++ → `.so` shared library
- HostSim: standard gcc/g++ for all targets (a2a3sim)
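The component-to-artifact mapping in that list can be summarized as a small table. This is illustrative only; `artifact_suffix` is not part of the RuntimeCompiler API.

```python
# Illustrative mapping from compile target to artifact suffix, following
# the toolchain list above; not a project API.
ARTIFACT_SUFFIX = {
    "aicore": ".o",   # Bisheng CCE object file (a2a3 only)
    "aicpu": ".so",   # aarch64 cross-compiled shared object (a2a3 only)
    "host": ".so",    # gcc/g++ shared library
}

def artifact_suffix(component: str) -> str:
    return ARTIFACT_SUFFIX[component]
```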
```bash
# Simulation platform (no hardware required)
python examples/scripts/run_example.py \
    -k examples/a2a3/host_build_graph/vector_example/kernels \
    -g examples/a2a3/host_build_graph/vector_example/golden.py \
    -p a2a3sim

# Hardware platform (requires Ascend device)
python examples/scripts/run_example.py \
    -k examples/a2a3/host_build_graph/vector_example/kernels \
    -g examples/a2a3/host_build_graph/vector_example/golden.py \
    -p a2a3
```

Expected output:
```
=== Building Runtime: host_build_graph (platform: a2a3sim) ===
...
=== Comparing Results ===
Comparing f: shape=(16384,), dtype=float32
f: PASS (16384/16384 elements matched)
============================================================
TEST PASSED
============================================================
```
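The per-tensor comparison summarized in that output can be approximated as follows. The PASS line format mirrors the example output, but this function is a hedged sketch, not the framework's actual comparator (whose tolerances may differ):

```python
import math

# Illustrative elementwise comparison producing a PASS/FAIL summary line
# in the same shape as the expected output above.
def compare(name, actual, golden, rel_tol=1e-5, abs_tol=1e-8):
    matched = sum(
        math.isclose(a, g, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, g in zip(actual, golden)
    )
    ok = matched == len(golden)
    status = "PASS" if ok else "FAIL"
    print(f"{name}: {status} ({matched}/{len(golden)} elements matched)")
    return ok
```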
```python
from simpler.task_interface import ChipWorker
from simpler_setup.runtime_builder import RuntimeBuilder

# Build or locate pre-built runtime binaries
builder = RuntimeBuilder(platform="a2a3sim")
binaries = builder.get_binaries("tensormap_and_ringbuffer")

# Create worker and initialize with platform binaries
worker = ChipWorker()
worker.init(host_path=str(binaries.host_path),
            aicpu_path=str(binaries.aicpu_path),
            aicore_path=str(binaries.aicore_path))
worker.set_device(device_id=0)

# Execute callable on device
worker.run(chip_callable, orch_args, block_dim=24)

# Cleanup
worker.reset_device()
worker.finalize()
```

In `src/{arch}/runtime/host_build_graph/runtime/runtime.h`:
```c
#define RUNTIME_MAX_TASKS  131072  // Maximum number of tasks
#define RUNTIME_MAX_ARGS   16      // Maximum arguments per task
#define RUNTIME_MAX_FANOUT 512     // Maximum successors per task
```

Runtime behavior is configured via `kernel_config.py` in each example:
```python
RUNTIME_CONFIG = {
    "runtime": "host_build_graph",  # Runtime to use
    "aicpu_thread_num": 3,          # Number of AICPU scheduler threads
    "block_dim": 3,                 # Number of AICore blocks (1 block = 1 AIC + 2 AIV)
}
```

Device selection is done via a CLI flag:

```bash
python examples/scripts/run_example.py -k <kernels> -g <golden.py> -p a2a3 --device 0
```

- Device IDs: 0-15 (typically device 9 is used for examples)
- Handshake cores: usually 3 (1c2v configuration: 1 core, 2 vector units)
- Kernel compilation: requires the `ASCEND_HOME_PATH` environment variable
- Memory management: MemoryAllocator automatically tracks allocations
- Python requirement: NumPy for efficient array operations
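Since a2a3 kernel compilation depends on `ASCEND_HOME_PATH`, a pre-flight check can fail fast with a clear message. This is a hypothetical helper, not part of the project:

```python
import os
from pathlib import Path

# Hypothetical pre-flight check for the ASCEND_HOME_PATH requirement noted
# above; returns a list of problems instead of raising, for easy reporting.
def check_ascend_env(env=None):
    env = os.environ if env is None else env
    problems = []
    root = env.get("ASCEND_HOME_PATH")
    if not root:
        problems.append("ASCEND_HOME_PATH is not set (source setenv.bash first)")
    elif not Path(root).is_dir():
        problems.append(f"ASCEND_HOME_PATH points to a missing directory: {root}")
    return problems
```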
Device logs are written to `~/ascend/log/debug/device-<id>/`.

Kernel code uses these logging macros:

- `DEV_INFO`: informational messages
- `DEV_DEBUG`: debug messages
- `DEV_WARN`: warnings
- `DEV_ERROR`: error messages