Getting Started

Cloning the Repository

git clone <repo-url>
cd simpler

The pto-isa dependency will be automatically cloned when you first run an example that needs it.

PTO ISA Headers

The pto-isa repository provides header files needed for kernel compilation on the a2a3 (hardware) platform.

The test framework automatically handles PTO_ISA_ROOT setup:

Checks if PTO_ISA_ROOT is already set
If not, clones pto-isa to examples/scripts/_deps/pto-isa on first run
Passes the resolved path to the kernel compiler
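
The three steps above can be sketched in a few lines. This is an illustration only, not the test framework's actual code: `resolve_pto_isa_root` and its `clone` callback are hypothetical names, and the default directory is the one named in this guide.

```python
from pathlib import Path
from typing import Callable, Mapping

# Default clone location named in this guide.
DEFAULT_DEPS_DIR = Path("examples/scripts/_deps/pto-isa")


def resolve_pto_isa_root(
    env: Mapping[str, str],
    clone: Callable[[Path], None],
    deps_dir: Path = DEFAULT_DEPS_DIR,
) -> Path:
    """Return the pto-isa path, cloning only when nothing usable exists.

    `clone` is a hypothetical callback that populates `deps_dir`.
    """
    # Step 1: an explicit PTO_ISA_ROOT always wins.
    explicit = env.get("PTO_ISA_ROOT")
    if explicit:
        return Path(explicit)
    # Step 2: clone on first run; later runs reuse the existing checkout.
    if not deps_dir.exists():
        clone(deps_dir)
    # Step 3: the caller hands this path to the kernel compiler.
    return deps_dir
```

Passing the environment and the clone step in as arguments keeps the sketch testable without network access.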

Automatic Setup (Recommended): Just run your example - pto-isa will be cloned automatically on first run:

python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
                                       -g examples/a2a3/host_build_graph/vector_example/golden.py \
                                       -p a2a3sim

By default, the auto-clone uses SSH (git@github.com:...). In CI or environments without SSH keys, use --clone-protocol https:

python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
                                       -g examples/a2a3/host_build_graph/vector_example/golden.py \
                                       -p a2a3sim --clone-protocol https
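
As a rough sketch of what the protocol choice amounts to, the helper below maps a protocol name to the matching `git clone` command. The two URLs are the ones used in the manual-setup commands in this guide; `clone_command` itself is a hypothetical helper, not part of `run_example.py`.

```python
from typing import List

# URLs copied from the manual-setup commands in this guide.
CLONE_URLS = {
    "ssh": "git@github.com:PTO-ISA/pto-isa.git",
    "https": "https://github.com/PTO-ISA/pto-isa.git",
}


def clone_command(protocol: str, dest: str) -> List[str]:
    """Build the git command line for the chosen protocol (hypothetical helper)."""
    if protocol not in CLONE_URLS:
        raise ValueError(f"unsupported clone protocol: {protocol!r}")
    return ["git", "clone", "--branch", "main", CLONE_URLS[protocol], dest]
```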

Manual Setup (if auto-setup fails or you prefer manual control):

mkdir -p examples/scripts/_deps
git clone --branch main git@github.com:PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa

# Or use HTTPS
git clone --branch main https://github.com/PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa

# Set environment variable (optional - auto-detected if in standard location)
export PTO_ISA_ROOT=$(pwd)/examples/scripts/_deps/pto-isa

Using a Different Location:

export PTO_ISA_ROOT=/path/to/your/pto-isa

Troubleshooting:

If git is not available: Clone pto-isa manually and set PTO_ISA_ROOT
If clone fails due to network: Try again or clone manually
If SSH clone fails (e.g., in CI): Use --clone-protocol https or clone manually with HTTPS

Note: For the simulation platform (a2a3sim), PTO ISA headers are optional and only needed if your kernels use PTO ISA intrinsics.

Prerequisites

CMake 3.15+
CANN toolkit with:
- ccec compiler (AICore Bisheng CCE)
- Cross-compiler for AICPU (aarch64-target-linux-gnu-gcc/g++)
Standard C/C++ compiler (gcc/g++) for host
Python 3 with development headers

Environment Setup

source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
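
A quick pre-flight check can catch a missing environment before a build fails halfway through. The sketch below assumes the listed tools are expected on `PATH` once `setenv.bash` has been sourced; that assumption and the default tool list are illustrative, so adjust both to your installation.

```python
import shutil
from typing import Callable, List, Mapping, Optional, Sequence


def missing_prerequisites(
    env: Mapping[str, str],
    tools: Sequence[str] = ("cmake", "ccec", "gcc", "g++"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if not env.get("ASCEND_HOME_PATH"):
        problems.append("ASCEND_HOME_PATH is not set")
    for tool in tools:
        # `which` returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Call it as `missing_prerequisites(os.environ)` after importing `os`. For the simulation platform, pass a shorter `tools` tuple, since the CANN toolkit is not needed there.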

Install / Develop Workflows

Three ways to get the project running, depending on your role. All three assume an activated project-local venv (see .claude/rules/venv-isolation.md).

At a glance

Concern	`pip install .`	`pip install -e .`	`cmake + PYTHONPATH`
Who it's for	Users / CI	Python + C++ developers	C++-only developers
`simpler_setup` resolves to	site-packages	source tree (via `.pth`)	source tree (via `PYTHONPATH`)
`simpler` resolves to	site-packages (4 files)	source tree `python/simpler/` (all 8 files)	source tree (all 8)
`_task_interface.*.so` lives at	site-packages root	`build/{wheel_tag}/` (finder-dispatched)	`python/_task_interface.*.so`
`PROJECT_ROOT`	`<site-packages>/simpler_setup/_assets/`	repo root	repo root
`src/` found under	`_assets/src/`	`<repo>/src/`	`<repo>/src/`
`build/lib/` found under	`_assets/build/lib/`	`<repo>/build/lib/`	`<repo>/build/lib/`
Edit `.py` → effect	reinstall required	immediate	immediate
Edit nanobind `.cpp` → rebuild	reinstall required	auto on next import (`editable.rebuild`)	manual `cmake --build build/`
Edit runtime `src/` → rebuild	reinstall or manual	manual (`--build` flag or explicit script)	manual
`from simpler.kernel_compiler import`	fails (excluded from wheel)	works (transitional copy — use `simpler_setup` instead)	works (transitional copy — use `simpler_setup` instead)
`--build` path writable	no (site-packages read-only)	yes	yes
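
To see which column of the table applies to the active environment, you can ask Python where a package resolves. This is a small sketch using only the standard library; `install_kind` is a hypothetical helper that classifies a path by whether it contains a `site-packages` component.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(name: str) -> Optional[Path]:
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


def install_kind(origin: Path) -> str:
    """Classify an origin path as installed or source-tree (hypothetical helper)."""
    return "site-packages" if "site-packages" in origin.parts else "source tree"
```

For example, `install_kind(module_origin("simpler_setup"))` should report `source tree` for the editable and `PYTHONPATH` workflows, assuming the package is importable at all.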

1. `pip install .` — user / CI install

pip install --no-build-isolation .

--no-build-isolation is required: the build uses the scikit-build-core, nanobind, and cmake packages already installed in the venv, and build isolation would hide them.

What lands in site-packages:

site-packages/
├── _task_interface.cpython-*.so      # nanobind extension
├── simpler/                          # stable 4 files only
│   ├── __init__.py
│   ├── env_manager.py
│   ├── task_interface.py
│   └── worker.py
└── simpler_setup/
    ├── *.py                          # test framework + authoritative compilers
    └── _assets/
        ├── src/                      # headers + orchestration sources
        └── build/lib/                # pre-built runtime binaries

Limitations:

Python edits require pip install . again
from simpler.{kernel_compiler,runtime_compiler,toolchain,elf_parser} import ... does not work — use simpler_setup.* for those
--build (rebuild runtime from source) won't work (site-packages is read-only)

Best for: ci.sh jobs and downstream consumers who only need to run examples.

2. `pip install -e .` — editable developer install

pip install --no-build-isolation -e .

The build is invoked once during install; pyproject.toml sets editable.rebuild = true, so subsequent C++ changes are picked up automatically.

What happens at install time:

simpler_setup/ and simpler/ get .pth redirects pointing at the source tree
_task_interface.*.so is built into build/{wheel_tag}/ and dispatched by scikit-build-core's import finder
build_runtimes.py pre-builds runtime binaries into <repo>/build/lib/
install() rules also populate <site-packages>/simpler_setup/_assets/, but those are shadowed by the source-tree redirect

Rebuild behavior on import:

Every fresh Python process that imports simpler_setup or _task_interface triggers cmake --build + cmake --install against the top-level CMakeLists before the import returns. This covers:

nanobind module (python/bindings/*.cpp) — real incremental rebuild when source changed
build_runtimes ALL target — re-invokes build_runtimes.py, which fans out to per-runtime inner cmakes (each fast no-op when nothing changed)

Startup cost per fresh process:

Nothing changed: ~6-15 seconds, depending on how many toolchains are installed (one inner cmake per runtime × platform combination, each a few hundred ms)
Real C++ change: full incremental rebuild blocks import until done

What's still manual:

Runtime src/{arch}/... edits for --build code paths: pass --build to run_example.py (or re-run build_runtimes.py). editable.rebuild will also attempt the rebuild on import, but it performs the same inner no-op walk over every runtime first, so running --build explicitly on the affected example is faster.
Transitional from simpler.{kernel_compiler,...} import ... still works in editable mode (source tree has the files); migrate to simpler_setup.* when convenient.

Best for: daily development. Python edits take effect immediately, and C++ changes are rebuilt without re-running pip install.

Turning off rebuild temporarily (e.g. for faster pytest iteration when nothing C++ changed):

# One-off: skip rebuild for this invocation
SKBUILD_EDITABLE_REBUILD=0 pytest ...

# Or edit pyproject.toml to set editable.rebuild = false, then re-install
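
To judge whether disabling the rebuild is worth it on your machine, you can time the import in a fresh process with and without the override. The sketch below is one way to measure it; `time_fresh_import` is a hypothetical helper, and the numbers it reports depend entirely on your setup.

```python
import os
import subprocess
import sys
import time
from typing import Mapping, Optional


def time_fresh_import(module: str, extra_env: Optional[Mapping[str, str]] = None) -> float:
    """Return the seconds a new Python process needs to import `module`."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    # A fresh interpreter is required: the rebuild runs once per process.
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True, env=env)
    return time.perf_counter() - start
```

Comparing `time_fresh_import("simpler_setup")` with `time_fresh_import("simpler_setup", {"SKBUILD_EDITABLE_REBUILD": "0"})` shows how much of the startup cost comes from the rebuild check.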

3. `cmake + PYTHONPATH` — manual C++ workflow

This path bypasses pip entirely. Useful if you want compile_commands.json, IDE integration, or are debugging a CMake-only concern.

# Dependencies (one-time, install into the venv)
pip install --no-build-isolation nanobind cmake scikit-build-core torch pytest

# Build
cmake -B build -S .
cmake --build build --parallel

# Make Python find the project
export PYTHONPATH="$(pwd):$(pwd)/python"

# Now run anything
python examples/scripts/run_example.py -k ... -g ... -p a2a3sim

Why PYTHONPATH="$(pwd):$(pwd)/python":

$(pwd) makes simpler_setup importable (it lives at repo root)
$(pwd)/python makes simpler.* importable (lives under python/simpler/) and also finds python/_task_interface.*.so
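
The same two entries can be built from Python, for instance when a launcher script needs to set `PYTHONPATH` for a child process. This is a minimal sketch; the function names are illustrative.

```python
import os
from pathlib import Path
from typing import List


def pythonpath_entries(repo_root: Path) -> List[str]:
    """Entries that make simpler_setup and simpler.* importable."""
    return [
        str(repo_root),             # simpler_setup lives at the repo root
        str(repo_root / "python"),  # simpler/ and _task_interface.*.so live here
    ]


def pythonpath_value(repo_root: Path) -> str:
    """Join the entries with the platform's path separator."""
    return os.pathsep.join(pythonpath_entries(repo_root))
```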

In this mode SKBUILD_MODE=OFF, so CMakeLists.txt takes the non-install branch: the nanobind module's LIBRARY_OUTPUT_DIRECTORY is set to <repo>/python/, and no install() runs. _assets/ is not created — PROJECT_ROOT falls back to the repo root.

What's still needed from pip:

find_package(nanobind CONFIG REQUIRED) in python/bindings/CMakeLists.txt requires nanobind to be discoverable via its Python-installed CMake config. Even without pip install ., you need pip install nanobind in the active venv.

Rebuild:

C++ (nanobind or runtime): manual cmake --build build/
nanobind alone: cmake --build build --target _task_interface
Runtime alone: cmake --build build --target build_runtimes (or just run_example.py --build)

Limitations:

editable.rebuild and everything else in [tool.scikit-build] are not consulted — this path doesn't go through scikit-build-core
You manage all dependencies manually
Good for CMake-centric debugging; not the recommended daily loop

Best for: C++-only iteration, IDE integration, tests/ut/cpp/ development.

Build Process

The RuntimeCompiler class handles compilation of all three components separately:

from simpler_setup.runtime_compiler import RuntimeCompiler

# For real Ascend hardware (requires CANN toolkit)
compiler = RuntimeCompiler(platform="a2a3")

# For simulation (no Ascend SDK needed)
compiler = RuntimeCompiler(platform="a2a3sim")

# Compile each component to independent binaries
aicore_binary = compiler.compile("aicore", include_dirs, source_dirs)    # → .o file
aicpu_binary = compiler.compile("aicpu", include_dirs, source_dirs)      # → .so file
host_binary = compiler.compile("host", include_dirs, source_dirs)        # → .so file

Toolchains used:

AICore: Bisheng CCE (ccec compiler) → .o object file (a2a3 only)
AICPU: aarch64 cross-compiler → .so shared object (a2a3 only)
Host: Standard gcc/g++ → .so shared library
HostSim: Standard gcc/g++ for all targets (a2a3sim)
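
The toolchain list can be read as a lookup from (platform, component) to compiler. The table below restates it as data; it is a sketch for orientation, not how `RuntimeCompiler` selects toolchains internally, and `toolchain_for` and `needs_cann` are hypothetical helpers.

```python
from typing import Dict, Tuple

# (platform, component) -> toolchain, restating the list above.
TOOLCHAINS: Dict[Tuple[str, str], str] = {
    ("a2a3", "aicore"): "ccec",
    ("a2a3", "aicpu"): "aarch64 cross-compiler",
    ("a2a3", "host"): "gcc/g++",
    ("a2a3sim", "aicore"): "gcc/g++",
    ("a2a3sim", "aicpu"): "gcc/g++",
    ("a2a3sim", "host"): "gcc/g++",
}


def toolchain_for(platform: str, component: str) -> str:
    """Look up the toolchain for one build target."""
    try:
        return TOOLCHAINS[(platform, component)]
    except KeyError:
        raise ValueError(f"unknown target: {platform}/{component}") from None


def needs_cann(platform: str) -> bool:
    """True when any component of the platform needs a non-host compiler."""
    return any(
        tool != "gcc/g++"
        for (name, _component), tool in TOOLCHAINS.items()
        if name == platform
    )
```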

Quick Start

Running an Example

# Simulation platform (no hardware required)
python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/vector_example/kernels \
  -g examples/a2a3/host_build_graph/vector_example/golden.py \
  -p a2a3sim

# Hardware platform (requires Ascend device)
python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/vector_example/kernels \
  -g examples/a2a3/host_build_graph/vector_example/golden.py \
  -p a2a3

Expected output:

=== Building Runtime: host_build_graph (platform: a2a3sim) ===
...
=== Comparing Results ===
Comparing f: shape=(16384,), dtype=float32
  f: PASS (16384/16384 elements matched)

============================================================
TEST PASSED
============================================================
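
In CI it can be handy to check this output programmatically. The sketch below parses only the line shapes shown above (the `PASS` comparison line and the `TEST PASSED` banner); the format of failing runs is not shown in this guide, so treat anything else as unknown.

```python
import re
from typing import Dict, Tuple

# Matches lines such as "  f: PASS (16384/16384 elements matched)".
PASS_LINE = re.compile(
    r"^[ \t]*(\w+): PASS \((\d+)/(\d+) elements matched\)[ \t]*$",
    re.MULTILINE,
)


def matched_counts(output: str) -> Dict[str, Tuple[int, int]]:
    """Map each passing tensor name to (matched, total) element counts."""
    return {
        name: (int(matched), int(total))
        for name, matched, total in PASS_LINE.findall(output)
    }


def run_passed(output: str) -> bool:
    """True when the output contains the TEST PASSED banner line."""
    return any(line.strip() == "TEST PASSED" for line in output.splitlines())
```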

Python API Example

from simpler.task_interface import ChipWorker
from simpler_setup.runtime_builder import RuntimeBuilder

# Build or locate pre-built runtime binaries
builder = RuntimeBuilder(platform="a2a3sim")
binaries = builder.get_binaries("tensormap_and_ringbuffer")

# Create worker and initialize with platform binaries
worker = ChipWorker()
worker.init(host_path=str(binaries.host_path),
            aicpu_path=str(binaries.aicpu_path),
            aicore_path=str(binaries.aicore_path))
worker.set_device(device_id=0)

# Execute callable on device
# (chip_callable and orch_args are prepared by the caller; their construction is not shown here)
worker.run(chip_callable, orch_args, block_dim=24)

# Cleanup
worker.reset_device()
worker.finalize()
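
If `worker.run` raises, the cleanup calls above are skipped. One way to guard against that is a small context manager around the same calls. This is a sketch under the assumption that `reset_device` and `finalize` are safe to call after a failed run; `chip_session` is a hypothetical helper, not part of the `simpler` API.

```python
from contextlib import contextmanager


@contextmanager
def chip_session(worker, binaries, device_id=0):
    """Initialize a worker, yield it, and always release it afterwards.

    `worker` is expected to offer init/set_device/reset_device/finalize,
    as ChipWorker does in the example above.
    """
    worker.init(
        host_path=str(binaries.host_path),
        aicpu_path=str(binaries.aicpu_path),
        aicore_path=str(binaries.aicore_path),
    )
    try:
        worker.set_device(device_id=device_id)
        try:
            yield worker
        finally:
            # Runs only if set_device succeeded.
            worker.reset_device()
    finally:
        # Runs whenever init succeeded, even if set_device failed.
        worker.finalize()
```

Usage would then read `with chip_session(ChipWorker(), binaries) as worker: worker.run(...)`.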

Configuration

Compile-time Configuration (Runtime Limits)

In src/{arch}/runtime/host_build_graph/runtime/runtime.h:

#define RUNTIME_MAX_TASKS 131072   // Maximum number of tasks
#define RUNTIME_MAX_ARGS 16        // Maximum arguments per task
#define RUNTIME_MAX_FANOUT 512     // Maximum successors per task
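
When generating task graphs from Python, it can help to check them against these limits before launching. The sketch below copies the three values from `runtime.h` by hand, so they must be kept in sync manually. `limit_violations` is a hypothetical helper that takes one `(argument count, successor count)` pair per task.

```python
from typing import Iterable, List, Tuple

# Values mirrored by hand from runtime.h.
RUNTIME_MAX_TASKS = 131072
RUNTIME_MAX_ARGS = 16
RUNTIME_MAX_FANOUT = 512


def limit_violations(tasks: Iterable[Tuple[int, int]]) -> List[str]:
    """Return one message per limit that the graph exceeds."""
    problems = []
    count = 0
    for index, (num_args, fanout) in enumerate(tasks):
        count += 1
        if num_args > RUNTIME_MAX_ARGS:
            problems.append(f"task {index}: {num_args} args > {RUNTIME_MAX_ARGS}")
        if fanout > RUNTIME_MAX_FANOUT:
            problems.append(f"task {index}: {fanout} successors > {RUNTIME_MAX_FANOUT}")
    if count > RUNTIME_MAX_TASKS:
        problems.append(f"{count} tasks > {RUNTIME_MAX_TASKS}")
    return problems
```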

Runtime Configuration

Runtime behavior is configured via kernel_config.py in each example:

RUNTIME_CONFIG = {
    "runtime": "host_build_graph",    # Runtime to use
    "aicpu_thread_num": 3,            # Number of AICPU scheduler threads
    "block_dim": 3,                   # Number of AICore blocks (1 block = 1 AIC + 2 AIV)
}
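
Since one block is 1 AIC plus 2 AIV, `block_dim` translates directly into core counts. The sketch below shows that arithmetic plus a basic sanity check of the config keys shown above; both helpers are hypothetical, and the framework may validate differently.

```python
from typing import Dict, List, Mapping


def cores_for(block_dim: int) -> Dict[str, int]:
    """Core counts implied by block_dim (1 block = 1 AIC + 2 AIV)."""
    return {"aic": block_dim, "aiv": 2 * block_dim}


def config_problems(cfg: Mapping[str, object]) -> List[str]:
    """Return messages for missing or ill-typed RUNTIME_CONFIG entries."""
    problems = []
    if not isinstance(cfg.get("runtime"), str):
        problems.append("runtime must be a string")
    for key in ("aicpu_thread_num", "block_dim"):
        value = cfg.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer")
    return problems
```

With the config above, `cores_for(3)` gives 3 AIC and 6 AIV cores.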

Device selection is done via CLI flag:

python examples/scripts/run_example.py -k <kernels> -g <golden.py> -p a2a3 --device 0
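
If you wrap this command in your own script, the device ID can be range-checked at parse time. The sketch below uses the 0-15 range listed under Notes; it is a standalone argparse example, not the argument handling inside `run_example.py`.

```python
import argparse


def device_id(text: str) -> int:
    """argparse type that accepts device IDs 0 through 15."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"device ID must be an integer, got {text!r}")
    if not 0 <= value <= 15:
        raise argparse.ArgumentTypeError(f"device ID must be in 0-15, got {value}")
    return value


parser = argparse.ArgumentParser(description="Hypothetical wrapper script")
parser.add_argument("--device", type=device_id, default=0)
```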

Notes

Device IDs: 0-15 (the examples typically use device 9)
Handshake cores: Usually 3 (1c2v configuration: 1 AIC cube core and 2 AIV vector cores)
Kernel compilation: Requires ASCEND_HOME_PATH environment variable
Memory management: MemoryAllocator automatically tracks allocations
Python requirement: NumPy for efficient array operations

Logging

Device logs are written to ~/ascend/log/debug/device-<id>/

Kernel code logs through the following macros:

DEV_INFO: Informational messages
DEV_DEBUG: Debug messages
DEV_WARN: Warnings
DEV_ERROR: Error messages
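
When debugging, it is often enough to open the most recent files in the device log directory. The sketch below builds the path given above and lists the newest files; the file naming inside that directory is not specified in this guide, so the helper makes no assumption about it.

```python
from pathlib import Path
from typing import List, Optional


def device_log_dir(device_id: int, home: Optional[Path] = None) -> Path:
    """Return ~/ascend/log/debug/device-<id>/ for the given device."""
    base = home if home is not None else Path.home()
    return base / "ascend" / "log" / "debug" / f"device-{device_id}"


def newest_logs(log_dir: Path, limit: int = 5) -> List[Path]:
    """Return up to `limit` files from log_dir, most recently modified first."""
    if not log_dir.is_dir():
        return []
    files = [path for path in log_dir.iterdir() if path.is_file()]
    files.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return files[:limit]
```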
