Getting Started

Cloning the Repository

git clone <repo-url>
cd simpler

The pto-isa dependency will be automatically cloned when you first run an example that needs it.

PTO ISA Headers

The pto-isa repository provides header files needed for kernel compilation on the a2a3 (hardware) platform.

The test framework automatically handles PTO_ISA_ROOT setup:

1. Checks whether PTO_ISA_ROOT is already set
2. If not, clones pto-isa to examples/scripts/_deps/pto-isa on first run
3. Passes the resolved path to the kernel compiler
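
The resolution order above can be sketched in a few lines of Python. This is only an illustration, not the framework's actual code: the function name resolve_pto_isa_root and the clone callable are hypothetical, and the sketch assumes that an existing checkout in the default location is reused rather than cloned again.

```python
import os
from pathlib import Path

# Default clone location, taken from the steps above.
DEFAULT_PTO_ISA_DIR = Path("examples/scripts/_deps/pto-isa")


def resolve_pto_isa_root(env=None, default_dir=DEFAULT_PTO_ISA_DIR, clone=None):
    """Return the PTO ISA root, cloning into default_dir only when needed.

    `clone` is a hypothetical callable(dest: Path) that fetches the repository.
    """
    env = os.environ if env is None else env

    # Step 1: an explicitly set PTO_ISA_ROOT always wins.
    explicit = env.get("PTO_ISA_ROOT")
    if explicit:
        return Path(explicit)

    # Step 2: clone on first run, reuse the checkout afterwards (assumption).
    default_dir = Path(default_dir)
    if not default_dir.is_dir():
        if clone is None:
            raise RuntimeError("PTO_ISA_ROOT is unset and no clone function was given")
        default_dir.parent.mkdir(parents=True, exist_ok=True)
        clone(default_dir)

    # Step 3: this is the path that would be handed to the kernel compiler.
    return default_dir
```

Passing the environment and the clone function as parameters keeps the sketch easy to exercise without touching the network.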

Automatic Setup (Recommended): Run any example; pto-isa is cloned automatically on first run:

python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
                                       -g examples/a2a3/host_build_graph/vector_example/golden.py \
                                       -p a2a3sim

By default, the auto-clone uses SSH (git@github.com:...). In CI or environments without SSH keys, use --clone-protocol https:

python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
                                       -g examples/a2a3/host_build_graph/vector_example/golden.py \
                                       -p a2a3sim --clone-protocol https
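
For scripting, the choice between the two protocols boils down to picking one of two URLs. The helper below is a hypothetical sketch: the URLs and the --branch main option are copied from the manual-setup commands in this guide, while the "ssh" key and the function name clone_command are assumptions made for illustration.

```python
# Repository URLs as they appear in the manual-setup commands of this guide.
PTO_ISA_URLS = {
    "ssh": "git@github.com:PTO-ISA/pto-isa.git",
    "https": "https://github.com/PTO-ISA/pto-isa.git",
}


def clone_command(protocol="ssh", dest="examples/scripts/_deps/pto-isa"):
    """Build the git command line for the requested protocol (hypothetical helper)."""
    try:
        url = PTO_ISA_URLS[protocol]
    except KeyError:
        raise ValueError(f"unknown clone protocol: {protocol!r}") from None
    # Mirrors the manual commands: clone the main branch into dest.
    return ["git", "clone", "--branch", "main", url, dest]
```

The returned list can be passed directly to subprocess.run without shell quoting.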

Manual Setup (if auto-setup fails or you prefer manual control):

mkdir -p examples/scripts/_deps
git clone --branch main git@github.com:PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa

# Or use HTTPS
git clone --branch main https://github.com/PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa

# Set environment variable (optional - auto-detected if in standard location)
export PTO_ISA_ROOT=$(pwd)/examples/scripts/_deps/pto-isa

Using a Different Location:

export PTO_ISA_ROOT=/path/to/your/pto-isa

Troubleshooting:

If git is not available: Clone pto-isa manually and set PTO_ISA_ROOT
If clone fails due to network: Try again or clone manually
If SSH clone fails (e.g., in CI): Use --clone-protocol https or clone manually with HTTPS

Note: For the simulation platform (a2a3sim), PTO ISA headers are optional and only needed if your kernels use PTO ISA intrinsics.

Prerequisites

CMake 3.15+
CANN toolkit with:
- ccec compiler (AICore Bisheng CCE)
- Cross-compiler for AICPU (aarch64-target-linux-gnu-gcc/g++)
Standard C/C++ compiler (gcc/g++) for host
Python 3 with development headers
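
A quick way to see which prerequisites are missing is to probe for them from Python. The sketch below assumes the tools are reachable through PATH under the executable names listed above, which may not hold for every CANN installation; the helper names are hypothetical.

```python
import shutil
import sysconfig
from pathlib import Path

# Executable names are assumptions based on the list above; adjust to your install.
HOST_TOOLS = ["cmake", "gcc", "g++"]
ASCEND_TOOLS = ["ccec", "aarch64-target-linux-gnu-gcc", "aarch64-target-linux-gnu-g++"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def has_python_headers():
    """True when Python.h exists in this interpreter's include directory."""
    return Path(sysconfig.get_paths()["include"], "Python.h").is_file()


def missing_prerequisites(platform="a2a3sim"):
    """List missing prerequisites; only the hardware platform needs the CANN tools."""
    tools = HOST_TOOLS + (ASCEND_TOOLS if platform == "a2a3" else [])
    missing = missing_tools(tools)
    if not has_python_headers():
        missing.append("Python development headers")
    return missing
```

An empty list from missing_prerequisites("a2a3") suggests the toolchain is in place; it does not check the CMake version.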

Environment Setup

source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
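
Because kernel compilation depends on ASCEND_HOME_PATH, it can help to fail early with a clear message when the variable is missing. A minimal sketch, with a hypothetical helper name:

```python
import os
from pathlib import Path


def ascend_home(env=None):
    """Return ASCEND_HOME_PATH as a Path, or raise with a hint if it is unusable."""
    env = os.environ if env is None else env
    value = env.get("ASCEND_HOME_PATH")
    if not value:
        raise RuntimeError(
            "ASCEND_HOME_PATH is not set; source setenv.bash or export it first"
        )
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"ASCEND_HOME_PATH points to a missing directory: {path}")
    return path
```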

Install

All workflows assume an activated project-local venv (see .claude/rules/venv-isolation.md for why --no-build-isolation is required).

Recommended daily-dev setup:

python3 -m venv --system-site-packages .venv
source .venv/bin/activate
pip install --no-build-isolation scikit-build-core nanobind cmake pytest numpy ml_dtypes torch
pip install --no-build-isolation -e .

Editing Python is instant (editable install). Editing C++ requires re-running pip install --no-build-isolation -e . (scikit-build-core's auto-rebuild is disabled because it interacts badly with pip's ephemeral build env — see docs/python-packaging.md).
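
Since every workflow assumes an activated venv, a wrapper script may want to check that before reinstalling. The sketch below relies on the standard sys.prefix and sys.base_prefix comparison; the helper names are hypothetical and not part of the project.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """True when the interpreter runs inside a venv (prefix differs from base_prefix)."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


def editable_install_command(python=None):
    """Command line for the editable install shown above, bound to one interpreter."""
    python = sys.executable if python is None else python
    return [python, "-m", "pip", "install", "--no-build-isolation", "-e", "."]
```

Invoking pip as a module of the chosen interpreter avoids accidentally using a pip from outside the venv.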

Other supported paths: pip install . (non-editable), pip install --no-build-isolation ., pip install -e ., and cmake + PYTHONPATH (no pip). Full comparison of all 5 paths — what lands where, which entry points work under each, trade-offs — lives in docs/python-packaging.md.

Verifying an install: the single source of truth is tools/verify_packaging.sh, which exercises all 5 install paths × 4 entry points from a fully clean state. CI runs the same script on macOS + Ubuntu (see the packaging-matrix job in .github/workflows/ci.yml).

Build Process

The RuntimeCompiler class compiles each of the three components (AICore, AICPU, and host) separately:

from simpler_setup.runtime_compiler import RuntimeCompiler

# For real Ascend hardware (requires CANN toolkit)
compiler = RuntimeCompiler(platform="a2a3")

# For simulation (no Ascend SDK needed)
compiler = RuntimeCompiler(platform="a2a3sim")

# Compile each component to independent binaries
aicore_binary = compiler.compile("aicore", include_dirs, source_dirs)    # → .o file
aicpu_binary = compiler.compile("aicpu", include_dirs, source_dirs)      # → .so file
host_binary = compiler.compile("host", include_dirs, source_dirs)        # → .so file

Toolchains used:

AICore: Bisheng CCE (ccec compiler) → .o object file (a2a3 only)
AICPU: aarch64 cross-compiler → .so shared object (a2a3 only)
Host: Standard gcc/g++ → .so shared library
HostSim: Standard gcc/g++ for all targets (a2a3sim)
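
The toolchain list above amounts to a lookup keyed by platform and component. The following sketch only restates that list as data; the function name toolchain_for and the label strings are illustrative and do not come from RuntimeCompiler.

```python
COMPONENTS = ("aicore", "aicpu", "host")

# Toolchains on real hardware, as listed above.
A2A3_TOOLCHAINS = {
    "aicore": "ccec (Bisheng CCE)",
    "aicpu": "aarch64 cross-compiler",
    "host": "gcc/g++",
}


def toolchain_for(platform, component):
    """Return the toolchain label used for one component on one platform."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    if platform == "a2a3sim":
        # The simulator builds every target with the standard host compiler.
        return "gcc/g++"
    if platform == "a2a3":
        return A2A3_TOOLCHAINS[component]
    raise ValueError(f"unknown platform: {platform!r}")
```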

Quick Start

Running an Example

# Simulation platform (no hardware required)
python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/vector_example/kernels \
  -g examples/a2a3/host_build_graph/vector_example/golden.py \
  -p a2a3sim

# Hardware platform (requires Ascend device)
python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/vector_example/kernels \
  -g examples/a2a3/host_build_graph/vector_example/golden.py \
  -p a2a3

Expected output:

=== Building Runtime: host_build_graph (platform: a2a3sim) ===
...
=== Comparing Results ===
Comparing f: shape=(16384,), dtype=float32
  f: PASS (16384/16384 elements matched)

============================================================
TEST PASSED
============================================================
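
To run an example from another script and check the result, one option is to wrap the command and look for the final banner. This sketch assumes the banner is printed to stdout and that a passing run exits with code 0; neither is confirmed by this guide, and the helper names are hypothetical.

```python
import subprocess
import sys

SCRIPT = "examples/scripts/run_example.py"


def example_command(kernels, golden, platform="a2a3sim", python=None):
    """Build the run_example.py command line using the flags shown above."""
    python = sys.executable if python is None else python
    return [python, SCRIPT, "-k", kernels, "-g", golden, "-p", platform]


def output_passed(text):
    """True when some line consists solely of the TEST PASSED banner text."""
    return any(line.strip() == "TEST PASSED" for line in text.splitlines())


def run_example(kernels, golden, platform="a2a3sim"):
    """Run one example; assumes stdout carries the banner and 0 means success."""
    result = subprocess.run(
        example_command(kernels, golden, platform),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and output_passed(result.stdout)
```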

Python API Example

from simpler.task_interface import ChipWorker
from simpler_setup.runtime_builder import RuntimeBuilder

# Build or locate pre-built runtime binaries
builder = RuntimeBuilder(platform="a2a3sim")
binaries = builder.get_binaries("tensormap_and_ringbuffer")

# Create worker and initialize with platform binaries
worker = ChipWorker()
worker.init(host_path=str(binaries.host_path),
            aicpu_path=str(binaries.aicpu_path),
            aicore_path=str(binaries.aicore_path))
worker.set_device(device_id=0)

# Execute a callable on the device.
# chip_callable and orch_args are placeholders; construct them for your kernels first.
worker.run(chip_callable, orch_args, block_dim=24)

# Cleanup
worker.reset_device()
worker.finalize()
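
If worker.run raises, the cleanup calls at the end of the example are skipped. One way to guard against that is a small context manager around the device session. This is a sketch, not part of the simpler API: device_session is a hypothetical helper, it expects a worker that has already been initialized with init(), and it assumes reset_device() and finalize() are safe to call after a failed run.

```python
from contextlib import contextmanager


@contextmanager
def device_session(worker, device_id=0):
    """Select a device, then always reset and finalize, even if the body raises."""
    worker.set_device(device_id=device_id)
    try:
        yield worker
    finally:
        try:
            worker.reset_device()
        finally:
            # Runs even when reset_device() itself fails.
            worker.finalize()
```

Usage would look like `with device_session(worker) as w: w.run(chip_callable, orch_args, block_dim=24)`.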

Configuration

Compile-time Configuration (Runtime Limits)

In src/{arch}/runtime/host_build_graph/runtime/runtime.h:

#define RUNTIME_MAX_TASKS 131072   // Maximum number of tasks
#define RUNTIME_MAX_ARGS 16        // Maximum arguments per task
#define RUNTIME_MAX_FANOUT 512     // Maximum successors per task
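
When graphs are generated programmatically, it can be useful to check them against these limits on the host before building. The sketch below copies the three values from the defines above; the task representation (dicts with "args" and "successors" lists) is purely hypothetical and unrelated to the runtime's own data structures.

```python
# Values copied from the #define lines above; keep them in sync with runtime.h.
RUNTIME_MAX_TASKS = 131072
RUNTIME_MAX_ARGS = 16
RUNTIME_MAX_FANOUT = 512


def check_graph_limits(tasks):
    """Return a list of limit violations (empty when the graph fits).

    `tasks` is a hypothetical description: a list of dicts, each with an
    "args" list and a "successors" list of task indices.
    """
    problems = []
    if len(tasks) > RUNTIME_MAX_TASKS:
        problems.append(
            f"{len(tasks)} tasks exceed RUNTIME_MAX_TASKS ({RUNTIME_MAX_TASKS})"
        )
    for index, task in enumerate(tasks):
        n_args = len(task["args"])
        n_succ = len(task["successors"])
        if n_args > RUNTIME_MAX_ARGS:
            problems.append(
                f"task {index}: {n_args} args exceed RUNTIME_MAX_ARGS ({RUNTIME_MAX_ARGS})"
            )
        if n_succ > RUNTIME_MAX_FANOUT:
            problems.append(
                f"task {index}: {n_succ} successors exceed RUNTIME_MAX_FANOUT ({RUNTIME_MAX_FANOUT})"
            )
    return problems
```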

Runtime Configuration

Runtime behavior is configured via kernel_config.py in each example:

RUNTIME_CONFIG = {
    "runtime": "host_build_graph",    # Runtime to use
    "aicpu_thread_num": 3,            # Number of AICPU scheduler threads
    "block_dim": 3,                   # Number of AICore blocks (1 block = 1 AIC + 2 AIV)
}
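
A small sanity check on RUNTIME_CONFIG catches typos before a build starts. The sketch below treats all three keys as required and both counts as positive integers, which is an assumption rather than documented behavior; the core-count arithmetic follows the comment above (1 block = 1 AIC + 2 AIV). Both helper names are hypothetical.

```python
REQUIRED_KEYS = ("runtime", "aicpu_thread_num", "block_dim")


def validate_runtime_config(config):
    """Raise if a key is missing or a count is not a positive integer."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"RUNTIME_CONFIG is missing: {', '.join(missing)}")
    for key in ("aicpu_thread_num", "block_dim"):
        value = config[key]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{key} must be a positive integer, got {value!r}")
    return config


def cores_for(block_dim):
    """Core counts implied by block_dim, using 1 block = 1 AIC + 2 AIV."""
    return {"aic": block_dim, "aiv": 2 * block_dim}
```

With the configuration shown above, cores_for(3) gives 3 AIC and 6 AIV cores.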

Device selection is done via CLI flag:

python examples/scripts/run_example.py -k <kernels> -g <golden.py> -p a2a3 --device 0
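
When launching runs from a script, the device ID can be validated before it is appended to the command. The sketch below uses the 0-15 range given in the Notes of this guide; with_device is a hypothetical helper, and the script itself may accept or reject other values.

```python
MIN_DEVICE_ID = 0
MAX_DEVICE_ID = 15  # Range taken from the Notes section of this guide.


def with_device(command, device_id):
    """Return a copy of `command` with a validated --device flag appended."""
    if isinstance(device_id, bool) or not isinstance(device_id, int):
        raise TypeError("device_id must be an integer")
    if not MIN_DEVICE_ID <= device_id <= MAX_DEVICE_ID:
        raise ValueError(
            f"device_id must be in {MIN_DEVICE_ID}-{MAX_DEVICE_ID}, got {device_id}"
        )
    return [*command, "--device", str(device_id)]
```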

Notes

Device IDs: 0-15 (the examples typically use device 9)
Handshake cores: usually 3 (1c2v configuration: 1 AIC core and 2 AIV vector units)
Kernel compilation: Requires ASCEND_HOME_PATH environment variable
Memory management: MemoryAllocator automatically tracks allocations
Python requirement: NumPy for efficient array operations

Logging

Device logs are written to ~/ascend/log/debug/device-<id>/.

Kernels log through the following macros:

DEV_INFO: Informational messages
DEV_DEBUG: Debug messages
DEV_WARN: Warnings
DEV_ERROR: Error messages
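
To find the most recent log for a device, the directory given above can be scanned from Python. The sketch makes no assumption about file names or log format and simply picks the newest file by modification time; both helper names are hypothetical.

```python
from pathlib import Path


def device_log_dir(device_id, home=None):
    """Return ~/ascend/log/debug/device-<id>/ for the given device."""
    home = Path.home() if home is None else Path(home)
    return home / "ascend" / "log" / "debug" / f"device-{device_id}"


def newest_log(device_id, home=None):
    """Return the most recently modified file under the log directory, or None."""
    log_dir = device_log_dir(device_id, home)
    if not log_dir.is_dir():
        return None
    files = [path for path in log_dir.rglob("*") if path.is_file()]
    return max(files, key=lambda path: path.stat().st_mtime, default=None)
```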
