Skip to content

Latest commit

 

History

History
231 lines (164 loc) · 7.65 KB

File metadata and controls

231 lines (164 loc) · 7.65 KB

Getting Started

Cloning the Repository

git clone <repo-url>
cd simpler

The pto-isa dependency will be automatically cloned when you first run an example that needs it.

PTO ISA Headers

The pto-isa repository provides header files needed for kernel compilation on the a2a3 (hardware) platform.

The test framework automatically handles PTO_ISA_ROOT setup:

  1. Checks if PTO_ISA_ROOT is already set
  2. If not, clones pto-isa to examples/scripts/_deps/pto-isa on first run
  3. Passes the resolved path to the kernel compiler

Automatic Setup (Recommended): Just run your example - pto-isa will be cloned automatically on first run:

python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
                                       -g examples/a2a3/host_build_graph/vector_example/golden.py \
                                       -p a2a3sim

By default, the auto-clone uses SSH (git@github.com:...). In CI or environments without SSH keys, use --clone-protocol https:

python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
                                       -g examples/a2a3/host_build_graph/vector_example/golden.py \
                                       -p a2a3sim --clone-protocol https

Manual Setup (if auto-setup fails or you prefer manual control):

mkdir -p examples/scripts/_deps
git clone --branch main git@github.com:PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa

# Or use HTTPS
git clone --branch main https://github.com/PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa

# Set environment variable (optional - auto-detected if in standard location)
export PTO_ISA_ROOT=$(pwd)/examples/scripts/_deps/pto-isa

Using a Different Location:

export PTO_ISA_ROOT=/path/to/your/pto-isa

Troubleshooting:

  • If git is not available: Clone pto-isa manually and set PTO_ISA_ROOT
  • If clone fails due to network: Try again or clone manually
  • If SSH clone fails (e.g., in CI): Use --clone-protocol https or clone manually with HTTPS

Note: For the simulation platform (a2a3sim), PTO ISA headers are optional and only needed if your kernels use PTO ISA intrinsics.

Prerequisites

  • CMake 3.15+
  • CANN toolkit with:
    • ccec compiler (AICore Bisheng CCE)
    • Cross-compiler for AICPU (aarch64-target-linux-gnu-gcc/g++)
  • Standard C/C++ compiler (gcc/g++) for host
  • Python 3 with development headers

Environment Setup

source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest

Install

All workflows assume an activated project-local venv (see .claude/rules/venv-isolation.md for why --no-build-isolation is required).

Recommended daily-dev setup:

python3 -m venv --system-site-packages .venv
source .venv/bin/activate
pip install --no-build-isolation scikit-build-core nanobind cmake pytest numpy ml_dtypes torch
pip install --no-build-isolation -e .

Editing Python is instant (editable install). Editing C++ requires re-running pip install --no-build-isolation -e . (scikit-build-core's auto-rebuild is disabled because it interacts badly with pip's ephemeral build env — see docs/python-packaging.md).

Other supported paths: pip install . (non-editable), pip install --no-build-isolation ., pip install -e ., and cmake + PYTHONPATH (no pip). Full comparison of all 5 paths — what lands where, which entry points work under each, trade-offs — lives in docs/python-packaging.md.

Verifying an install: the single source of truth is tools/verify_packaging.sh, which exercises all 5 install paths × 4 entry points from a fully clean state. CI runs the same script on macOS + Ubuntu (see the packaging-matrix job in .github/workflows/ci.yml).

Build Process

The RuntimeCompiler class handles compilation of all three components separately:

from simpler_setup.runtime_compiler import RuntimeCompiler

# For real Ascend hardware (requires CANN toolkit)
compiler = RuntimeCompiler(platform="a2a3")

# For simulation (no Ascend SDK needed)
compiler = RuntimeCompiler(platform="a2a3sim")

# Compile each component to independent binaries
aicore_binary = compiler.compile("aicore", include_dirs, source_dirs)    # → .o file
aicpu_binary = compiler.compile("aicpu", include_dirs, source_dirs)      # → .so file
host_binary = compiler.compile("host", include_dirs, source_dirs)        # → .so file

Toolchains used:

  • AICore: Bisheng CCE (ccec compiler) → .o object file (a2a3 only)
  • AICPU: aarch64 cross-compiler → .so shared object (a2a3 only)
  • Host: Standard gcc/g++ → .so shared library
  • HostSim: Standard gcc/g++ for all targets (a2a3sim)

Quick Start

Running an Example

# Simulation platform (no hardware required)
python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/vector_example/kernels \
  -g examples/a2a3/host_build_graph/vector_example/golden.py \
  -p a2a3sim

# Hardware platform (requires Ascend device)
python examples/scripts/run_example.py \
  -k examples/a2a3/host_build_graph/vector_example/kernels \
  -g examples/a2a3/host_build_graph/vector_example/golden.py \
  -p a2a3

Expected output:

=== Building Runtime: host_build_graph (platform: a2a3sim) ===
...
=== Comparing Results ===
Comparing f: shape=(16384,), dtype=float32
  f: PASS (16384/16384 elements matched)

============================================================
TEST PASSED
============================================================

Python API Example

from simpler.task_interface import ChipWorker
from simpler_setup.runtime_builder import RuntimeBuilder

# Build or locate pre-built runtime binaries
builder = RuntimeBuilder(platform="a2a3sim")
binaries = builder.get_binaries("tensormap_and_ringbuffer")

# Create worker and initialize with platform binaries
worker = ChipWorker()
worker.init(host_path=str(binaries.host_path),
            aicpu_path=str(binaries.aicpu_path),
            aicore_path=str(binaries.aicore_path))
worker.set_device(device_id=0)

# Execute callable on device
worker.run(chip_callable, orch_args, block_dim=24)

# Cleanup
worker.reset_device()
worker.finalize()

Configuration

Compile-time Configuration (Runtime Limits)

In src/{arch}/runtime/host_build_graph/runtime/runtime.h:

#define RUNTIME_MAX_TASKS 131072   // Maximum number of tasks
#define RUNTIME_MAX_ARGS 16        // Maximum arguments per task
#define RUNTIME_MAX_FANOUT 512     // Maximum successors per task

Runtime Configuration

Runtime behavior is configured via kernel_config.py in each example:

RUNTIME_CONFIG = {
    "runtime": "host_build_graph",    # Runtime to use
    "aicpu_thread_num": 3,            # Number of AICPU scheduler threads
    "block_dim": 3,                   # Number of AICore blocks (1 block = 1 AIC + 2 AIV)
}

Device selection is done via CLI flag:

python examples/scripts/run_example.py -k <kernels> -g <golden.py> -p a2a3 --device 0

Notes

  • Device IDs: 0-15 (typically device 9 used for examples)
  • Handshake cores: Usually 3 (1c2v configuration: 1 core, 2 vector units)
  • Kernel compilation: Requires ASCEND_HOME_PATH environment variable
  • Memory management: MemoryAllocator automatically tracks allocations
  • Python requirement: NumPy for efficient array operations

Logging

Device logs written to ~/ascend/log/debug/device-<id>/

Kernel uses macros:

  • DEV_INFO: Informational messages
  • DEV_DEBUG: Debug messages
  • DEV_WARN: Warnings
  • DEV_ERROR: Error messages