Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. By porting code to PTO instruction sequences, users can migrate between Ascend hardware generations more easily.
- 🚀 2026-04-02: GitHub `main` is now guarded by passing CI for pre-commit, docs build, CPU-SIM smoke tests, and full CPU-SIM ST.
- 🧠 2026-04-02: CPU-SIM BF16 coverage, GitCode-to-GitHub sync fixes, and `TPush`/`TPop`/`TPipe` validation updates landed on `main`.
- 📚 2026-04-02: Documentation quality improved with MkDocs build gating, markdownlint cleanup batches, and a new agent-oriented PTO-ISA workflow skill.
- 🌉 2026-04-02: Costmodel `row=1` col-reduce fixes and A3 `run_st.sh` path corrections were merged, reducing bring-up friction across CPU, simulator, and NPU paths.
- 🎉 2025-12-27: PTO Tile Library becomes publicly available.
The PTO ISA (Instruction Set Architecture) is built on Ascend’s underlying hardware and software abstractions, providing over 90 standard tile-level operations.
Ascend hardware architectures have evolved significantly over generations, leading to major changes in the instruction sets. The PTO instruction set bridges these hardware differences by raising the abstraction level. We ensure that these PTO instructions work correctly across platforms while maintaining backward compatibility. However, this abstraction does not hide performance tuning opportunities. Users can still tune performance by adjusting tile sizes, tile shapes, and instruction order, which gives enough control to shape internal pipeline flows.
Our goal is to offer users a simplified, yet powerful way to optimize performance, enabling them to write high-performance code with PTO instructions.
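As a rough illustration of why tile sizes are a tuning knob rather than a correctness concern, the sketch below runs a blocked matrix multiply in plain Python. It is a conceptual model only and does not use the PTO API: `tiled_matmul` and its `tile_m`, `tile_n`, `tile_k` parameters are hypothetical names chosen for this example.

```python
def tiled_matmul(a, b, tile_m, tile_n, tile_k):
    """Blocked GEMM over plain Python lists; tile sizes are the tuning knobs."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):              # rows of one output tile
        for j0 in range(0, n, tile_n):          # columns of one output tile
            for k0 in range(0, k, tile_k):      # accumulate along K, tile by tile
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[float(i + j) for j in range(6)] for i in range(4)]   # 4 x 6
b = [[float(i * j) for j in range(5)] for i in range(6)]   # 6 x 5

reference = tiled_matmul(a, b, 4, 5, 6)   # one tile covers the whole problem
retuned = tiled_matmul(a, b, 2, 3, 4)     # smaller tiles that do not divide evenly

print(reference == retuned)
```

Both calls produce the same matrix. Only the loop structure changes, which is the part that determines data movement and pipeline overlap on real hardware.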
The PTO ecosystem includes the following related projects:
| Project | Description |
|---|---|
| PTOAS | PTO assembler and compiler backend for PTO text/bytecode workflows. |
| pto-dsl | Pythonic interface and JIT compiler for PTO-ISA. |
| pypto | Community-driven Python frontend implementation for PTO kernels. |
| pto-kernels | Custom kernel collections built on PTO-ISA. |
| tilelang-ascend | Ascend TileLang adapter integration for PTO workflows. |
PTO Tile Lib is not aimed at beginner-level users. The intended audience includes:
- Backend developers implementing frameworks that directly interface with Ascend hardware.
- Cross-platform application developers.
- High-performance operator developers (manual operator implementations).
This repository includes performance-oriented kernels with reference measurements and reproducible setups. For performance testing tools, please refer to the msprof tool.
- Kernel: `kernels/manual/a2a3/gemm_performance/`
Measured on Ascend A3 (24 cores) with fp16 inputs → fp32 output:
| Parameter | TMATMUL (Cube) Ratio | TEXTRACT Ratio | TLOAD Ratio | TSTORE Ratio | Execution time (ms) |
|---|---|---|---|---|---|
| m=1536 k=1536 n=1536 | 54.5% | 42.2% | 72.2% | 7.7% | 0.0388 |
| m=3072 k=3072 n=3072 | 79.0% | 62.0% | 90.9% | 5.8% | 0.2067 |
| m=6144 k=6144 n=6144 | 86.7% | 68.1% | 95.2% | 3.1% | 1.5060 |
| m=7680 k=7680 n=7680 | 80.6% | 63.0% | 98.4% | 2.4% | 3.1680 |
Detailed analysis and tuning notes: High-Performance GEMM Operator Example.
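To put the execution times in the table into perspective, they can be converted to achieved throughput. The sketch below assumes the conventional GEMM operation count of 2·m·k·n floating-point operations, which is an assumption of this example and not a figure stated by the repository. `gemm_tflops` is a hypothetical helper name.

```python
# (m, k, n, execution time in ms), copied from the measurement table.
MEASUREMENTS = [
    (1536, 1536, 1536, 0.0388),
    (3072, 3072, 3072, 0.2067),
    (6144, 6144, 6144, 1.5060),
    (7680, 7680, 7680, 3.1680),
]


def gemm_tflops(m, k, n, time_ms):
    """Achieved throughput in TFLOP/s, counting one multiply and one add per term."""
    flops = 2 * m * k * n
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


for m, k, n, time_ms in MEASUREMENTS:
    print(f"m={m} k={k} n={n}: {gemm_tflops(m, k, n, time_ms):.1f} TFLOP/s")
```

Under that assumption, the m=k=n=6144 case works out to roughly 308 TFLOP/s across the 24 cores, which is also the row with the highest TMATMUL (Cube) ratio.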
- Kernel: `kernels/manual/common/flash_atten/`
Detailed analysis and tuning notes: Flash Attention Kernel Implementation.
- S0: query sequence length (number of rows in Q/O)
- S1: key/value sequence length (number of rows in K/V)
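The roles of S0 and S1 can be made concrete with a small reference implementation of scaled dot-product attention in plain Python. This is not the kernel in `kernels/manual/common/flash_atten/`, and it makes no attempt at the tiling that a Flash Attention kernel performs; the head dimension `D` and the function name `attention` are assumptions introduced for this example.

```python
import math


def attention(q, k, v):
    """Reference attention: q is S0 x D, k and v are S1 x D, the result is S0 x D."""
    d = len(q[0])
    out = []
    for q_row in q:                                   # S0 iterations
        scores = [
            sum(x * y for x, y in zip(q_row, k_row)) / math.sqrt(d)
            for k_row in k                            # S1 scores per query row
        ]
        peak = max(scores)                            # subtract the max for stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(len(v[0]))
        ])
    return out


S0, S1, D = 2, 3, 4
q = [[0.1 * (i + j) for j in range(D)] for i in range(S0)]
k = [[1.0] * D for _ in range(S1)]        # identical keys give uniform weights
v = [[float(i)] * D for i in range(S1)]   # value rows 0, 1, 2

o = attention(q, k, v)
print(len(o), len(o[0]))
```

The output has S0 rows regardless of S1. Because every key row is identical here, each output row is the plain average of the S1 value rows.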
The following features will be released in the future:
| Feature | Description | Scope |
|---|---|---|
| PTO Auto Mode | BiSheng compiler support to automatically allocate tile buffers and insert synchronization. | Compiler / toolchain |
| PTO Tile Fusion | BiSheng compiler support to fuse tile operations automatically. | Compiler / toolchain |
| PTO-AS | Bytecode support for PTO ISA. | Compiler / toolchain |
| Convolution extension | PTO ISA support for convolution kernels. | ISA Extension |
| Collective communication extension | PTO ISA support for collective communication. | ISA Extension |
| System schedule extension | PTO ISA support for SPMD/MPMD programming. | ISA Extension |
PTO instructions support two modes: Auto Mode (currently available only in CPU simulation), where the user does not allocate buffers or manage pipelining, and Manual Mode, where the user must allocate buffer addresses and manage pipelining. We recommend the following steps for optimizing operators:
- Develop the operator based on Auto Mode, generating PTO instruction sequences according to the algorithm logic. See demos/auto_mode/baseline/add for an example.
- Verify functionality and correctness in CPU simulation (see Run CPU Simulator).
- Port the code to Ascend hardware to ensure correctness and collect performance data. See the msprof tool.
- Identify performance bottlenecks (CUBE Bound / MTE Bound / Vector Bound) and begin optimization and tuning. See Performance Optimization.
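The last step can be approximated with a simple heuristic: compare per-pipeline utilization ratios from a profile and flag the busiest one. The sketch below is an illustrative assumption, not a tool shipped with this repository. `classify_bottleneck`, the `margin` threshold, and the sample ratios are all hypothetical; real ratios would come from a profiler such as msprof.

```python
def classify_bottleneck(ratios, margin=0.05):
    """Return the busiest pipeline, or 'balanced' if no pipeline leads by `margin`.

    `ratios` maps a pipeline name (for example "CUBE", "MTE", "Vector") to its
    utilization as a fraction between 0 and 1. At least two entries are required.
    """
    if len(ratios) < 2:
        raise ValueError("need at least two pipelines to compare")
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    (top_name, top_ratio), (_, second_ratio) = ranked[0], ranked[1]
    if top_ratio - second_ratio < margin:
        return "balanced"
    return f"{top_name} bound"


# Hypothetical utilization numbers, for illustration only.
sample = {"CUBE": 0.55, "MTE": 0.72, "Vector": 0.10}
print(classify_bottleneck(sample))
```

A single ratio rarely tells the whole story, so a result like this is best treated as a starting point for the tuning guides rather than a verdict.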
We ensure that each PTO instruction, when implemented within a fixed tile shape, fully leverages the capabilities of the underlying hardware. We encapsulate low-level hardware implementations into the tile abstractions and utilize expert knowledge to create a variety of tile templates. During static compilation, the compiler selects the best assembly implementation for the current shape based on template parameters. By merging different PTO instructions, we achieve optimal performance.
In this repository, we demonstrate how standard tile operations can be mapped to various pipelines through template parameters:
- Static tile Shape (Row, Col): Tile Programming Model
- Dynamic tile Mask (Valid Mask): Tile Programming Model
- Event Record & Wait (Set wait flag): Events and Synchronization, General Conventions
- Specialized Fixed Function (SFU)
- Fixed Pipeline (FIXP)
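The first two items, a static tile shape combined with a dynamic valid mask, can be modelled in a few lines of Python. This is a conceptual sketch rather than the PTO C++ API: the `Tile` class and `tile_add` function are hypothetical names, and the valid region is assumed to be a top-left rectangle.

```python
class Tile:
    """Fixed-capacity rows x cols buffer with a runtime valid region."""

    def __init__(self, rows, cols, valid_rows=None, valid_cols=None, fill=0.0):
        self.rows, self.cols = rows, cols                 # static shape
        self.valid_rows = rows if valid_rows is None else valid_rows
        self.valid_cols = cols if valid_cols is None else valid_cols
        if not (0 <= self.valid_rows <= rows and 0 <= self.valid_cols <= cols):
            raise ValueError("valid region must fit inside the static shape")
        self.data = [[fill] * cols for _ in range(rows)]


def tile_add(dst, a, b):
    """Element-wise add that only touches the region valid in both inputs."""
    shapes = {(t.rows, t.cols) for t in (dst, a, b)}
    if len(shapes) != 1:
        raise ValueError("static tile shapes must match")
    dst.valid_rows = min(a.valid_rows, b.valid_rows)
    dst.valid_cols = min(a.valid_cols, b.valid_cols)
    for r in range(dst.valid_rows):
        for c in range(dst.valid_cols):
            dst.data[r][c] = a.data[r][c] + b.data[r][c]
    return dst


# An edge tile: the buffer is 4 x 8, but only 3 x 5 elements carry real data.
a = Tile(4, 8, valid_rows=3, valid_cols=5, fill=1.0)
b = Tile(4, 8, valid_rows=3, valid_cols=5, fill=2.0)
out = tile_add(Tile(4, 8), a, b)
print(out.data[0][0], out.data[3][0], out.data[0][5])
```

Keeping the shape static while the valid region varies lets one implementation per shape handle both interior tiles and ragged edge tiles.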
PTO ISA defines over 90 standard operations. See the PTO instruction list. This repository implements a growing subset of them, with ongoing efforts to add more.
- Ascend A2 (Ascend 910B)
- Ascend A3 (Ascend 910C)
- Ascend A5 (Ascend 950)
- CPU (x86_64 / AArch64)
For more details, please refer to Released PTO ISA.
For detailed, OS-specific setup (Windows / Linux / macOS), see: docs/getting-started.md.
This repository includes comprehensive API documentation and ISA instruction references built with MkDocs (Material theme) under docs/mkdocs/. The documentation covers:
- Complete PTO ISA instruction reference
- API usage guidelines and examples
- Performance tuning guides
- Architecture and design documentation
For the latest documentation, visit the Documentation Center.
Build locally if you need offline access, are working on documentation changes, or want to view unreleased features.
- Python >= 3.8
- pip (Python package manager)
- Install MkDocs and dependencies:

      python -m pip install -r docs/mkdocs/requirements.txt

- Choose one of the following options:

      python -m mkdocs serve -f docs/mkdocs/mkdocs.yml

  The documentation will be available at http://127.0.0.1:8000. The server watches for file changes and automatically reloads. Press Ctrl+C to stop the server.

      python -m mkdocs build -f docs/mkdocs/mkdocs.yml

  Output will be in docs/mkdocs/site/. Open docs/mkdocs/site/index.html in your browser.

This method is useful for CI/CD pipelines or when integrating documentation builds into your development workflow.
- Create a Python virtual environment (recommended):

      python3 -m venv .venv-mkdocs
      source .venv-mkdocs/bin/activate  # On Windows: .venv-mkdocs\Scripts\Activate.ps1
      python -m pip install -r docs/mkdocs/requirements.txt

- Configure and build with CMake:

      cmake -S docs -B build/docs -DPython3_EXECUTABLE=$PWD/.venv-mkdocs/bin/python
      cmake --build build/docs --target pto_docs

  On Windows (PowerShell):

      cmake -S docs -B build/docs -DPython3_EXECUTABLE="$PWD\.venv-mkdocs\Scripts\python.exe"
      cmake --build build/docs --target pto_docs

The built documentation will be in build/docs/site/.

CPU simulation is cross-platform and does not require Ascend drivers/CANN:

    python3 tests/run_cpu.py --clean --verbose

Build & run the GEMM demo (optional):

    python3 tests/run_cpu.py --demo gemm --verbose

Build & run the Flash Attention demo (optional):

    python3 tests/run_cpu.py --demo flash_attn --verbose

Running ST requires a working Ascend CANN environment and is typically Linux-only.

    python3 tests/script/run_st.py -r [sim|npu] -v [a3|a5] -t [TEST_CASE] -g [GTEST_FILTER_CASE]

Note: the a3 backend covers the A2/A3 family (include/pto/npu/a2a3).

Example:

    python3 tests/script/run_st.py -r npu -v a3 -t tmatmul -g TMATMULTest.case1
    python3 tests/script/run_st.py -r sim -v a5 -t tmatmul -g TMATMULTest.case1

    # Execute the following commands from the project root directory:
    chmod +x ./tests/run_st.sh
    ./tests/run_st.sh a5 npu simple
    ./tests/run_st.sh a3 sim all

    # Execute the following commands from the project root directory:
    chmod +x ./tests/run_cpu_tests.sh
    ./tests/run_cpu_tests.sh
    python3 tests/run_cpu.py --verbose

For example, if you use the CANN community package and install to the default path:

- Default path (installed as root)

      source /usr/local/Ascend/cann/bin/setenv.bash

- Default path (installed as a non-root user)

      source $HOME/Ascend/cann/bin/setenv.bash

If you install to install-path, use:

    source ${install-path}/cann/bin/setenv.bash

- Run Full ST Tests:

      chmod +x build.sh
      ./build.sh --run_all --a3 --sim

- Run Simplified ST Tests:

      chmod +x build.sh
      ./build.sh --run_simple --a5 --npu

- Packaging:

      chmod +x build.sh
      ./build.sh --pkg

- ISA Guide and Instruction Navigation: docs/README.md
- Agent Quick Context (repo map + run commands): docs/agent.md
- ISA Instruction Documentation Index: docs/isa/README.md
- Developer Coding Documentation Index: docs/coding/README.md
- Getting Started Guide (recommended to run on CPU before moving to NPU): docs/getting-started.md
- Security and Disclosure Process: SECURITY.md
- Directory-level Reading (Code Organization):
  - Build and Packaging (CMake): cmake/README.md
  - External Header Files and APIs: include/README.md, include/pto/README.md
  - NPU Implementation (Split by SoC): include/pto/npu/README.md, include/pto/npu/a2a3/README.md, include/pto/npu/a5/README.md
  - Kernel/Custom Operators: kernels/README.md, kernels/custom/README.md
  - Testing and Use Cases: tests/README.md, tests/script/README.md
  - Packaging Scripts: scripts/README.md, scripts/package/README.md

- `include/`: PTO C++ header files (see include/README.md)
- `kernels/`: Custom operators and kernel implementations (see kernels/README.md)
- `docs/`: ISA instructions, API guidelines, and examples (see docs/README.md)
- `tests/`: ST/CPU test scripts and use cases (see tests/README.md)
- `scripts/`: Packaging and release scripts (see scripts/README.md)
- `build.sh`, `tests/run_st.sh`: Build, package, and example run entry points

This project is licensed under the CANN Open Software License Agreement Version 2.0. See the LICENSE file for details.