A scalable NPU architecture for Transformer LLM inference on edge FPGAs
Public alpha — v0.1.0-alpha is published as a prerelease. Core
architecture and ISA are stable; verification, KV260 bring-up, and
documentation polish are in progress. Expect rough edges; feedback and
issues are welcome.
| Entry point | Link |
|---|---|
| Documentation | https://pccx.pages.dev/ |
| Releases | https://github.com/pccxai/pccx/releases |
| v0.1.0-alpha notes | docs/releases/v0.1.0-alpha.md |
| Roadmap (project board) | https://github.com/orgs/pccxai/projects/1 |
| Contributing | CONTRIBUTING.md |
| How to cite | CITATION.cff |
| Discussions | https://github.com/pccxai/pccx/discussions |
| Good first issues | good first issue |
pccx is a hardware-software co-design framework that accelerates autoregressive decoding of Transformer-based LLMs on resource-constrained edge devices. The primary target is the Xilinx Kria KV260 SOM.
Rather than reusing a generic matrix accelerator, pccx is sized around the dominant bottleneck of LLM decoding: memory bandwidth-bound GEMV, not compute-bound GEMM. The architecture separates matrix (GEMM) and vector (GEMV) datapaths, supplies weights through dedicated HP AXI ports, and uses a custom 64-bit VLIW ISA to reduce dispatch-side stalls.
| Core | Configuration | Peak Throughput | Primary Use |
|---|---|---|---|
| GEMM (Matrix) | 32 × 32 systolic array (cascade split @ row 16) | 819 GMAC/s @ 400 MHz | Prefill, Q·Kᵀ, score·V |
| GEMV (Vector) | 4 cores × 32-MAC LUT pipeline + 5-stage reduction tree | Weight-streaming limited (~51.2 GMAC/s @ 400 MHz) | Autoregressive decoding |
| SFU / CVO | CORDIC + LUT hybrid | BF16 / FP32 promoted | Softmax, GELU, RMSNorm, RoPE |
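The peak figures in the table follow from simple arithmetic. The sketch below reproduces them under two assumptions that this README does not state outright: each systolic-array cell is backed by one DSP that delivers two MACs per cycle, and each GEMV MAC consumes one fresh INT4 weight per cycle (no weight reuse during single-token decoding). The function names are illustrative only.

```python
# Peak MAC-rate arithmetic for the core table (a sketch, not measured data).
CORE_CLOCK_HZ = 400e6  # core compute clock, taken from the table


def gemm_peak_gmacs(rows=32, cols=32, macs_per_pe=2, clock_hz=CORE_CLOCK_HZ):
    """Peak GEMM rate; assumes one DSP per cell and 2 MACs per DSP."""
    return rows * cols * macs_per_pe * clock_hz / 1e9


def gemv_peak_gmacs(cores=4, macs_per_core=32, clock_hz=CORE_CLOCK_HZ):
    """Peak GEMV rate if every MAC lane is fed on every cycle."""
    return cores * macs_per_core * clock_hz / 1e9


def weight_stream_gbytes_per_s(gmacs, weight_bits=4):
    """Weight bandwidth needed when each MAC consumes one fresh weight."""
    return gmacs * weight_bits / 8


print(gemm_peak_gmacs())                               # 819.2
print(gemv_peak_gmacs())                               # 51.2
print(weight_stream_gbytes_per_s(gemv_peak_gmacs()))   # 25.6
```

Under these assumptions, keeping all GEMV lanes busy would need about 25.6 GB/s of INT4 weight traffic, which is why the GEMV row is described as weight-streaming limited rather than compute limited.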
Key design decisions:
- W4A8 precision — INT4 weights × INT8 activations via DSP48E2 dual-channel bit packing (1 DSP = 2 MACs)
- Precision promotion — non-linear ops (Softmax, GELU, RMSNorm, RoPE) automatically upcast to BF16/FP32 for numerical stability
- Custom 64-bit VLIW ISA — 5 opcodes: `GEMV`, `GEMM`, `MEMCPY`, `MEMSET`, `CVO`; decoupled decode/dispatch eliminates front-end stalls
- Shared L2 (URAM 1.75 MB) — all three cores share a central SRAM cache; GEMV↔SFU are connected via a direct-connect FIFO, bypassing L2 round-trips
- Dual clock domains — 250 MHz AXI/control plane, 400 MHz core compute (×1.6 frequency gain over v001)
- 3.125× total throughput gain vs. v001 (frequency × dual-MAC DSP packing)
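The "1 DSP = 2 MACs" idea can be illustrated in software. The sketch below shows the general dual-channel packing trick: two signed INT4 weights share one wide multiplier operand, and a single multiply by an INT8 activation yields both products. The 16-bit lane spacing (`SHIFT`) and the helper names are assumptions chosen for readability; the actual pccx bit layout is derived in the project documentation and may differ.

```python
SHIFT = 16  # hypothetical spacing between the two packed lanes


def pack_weights(w_hi, w_lo):
    """Pack two signed INT4 weights into one multiplier operand."""
    assert -8 <= w_hi <= 7 and -8 <= w_lo <= 7
    return (w_hi << SHIFT) + w_lo


def sign_extend(value, bits):
    """Read the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dual_mac(w_hi, w_lo, act):
    """Return (w_hi * act, w_lo * act) using a single multiplication."""
    assert -128 <= act <= 127
    product = pack_weights(w_hi, w_lo) * act  # the one shared multiply
    lo = sign_extend(product, SHIFT)          # low lane holds w_lo * act
    hi = (product - lo) >> SHIFT              # remove low lane, then shift
    return hi, lo


print(dual_mac(-3, 7, -100))  # (300, -700)
```

Subtracting the low lane before shifting corrects the borrow that a negative low product would otherwise leak into the high lane. In hardware, the lane spacing also has to leave guard bits for accumulation, since each INT4 × INT8 product already needs 12 signed bits.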
External AXI (250 MHz) Core Domain (400 MHz)
───────────────────── ──────────────────────────────────────────────────────
S_AXIL_CTRL (HPM) ────────► npu_controller_top
├─ ctrl_npu_decoder (64-bit VLIW → opcode + body)
S_AXI_HP0/HP1 ────────► GEMM_systolic_top (32×16×2, W-Stationary)
S_AXI_HP2/HP3 ────────► GEMV_top (4 cores × 32-MAC LUT, 5-stage tree)
S_AXIS_ACP_FMAP ────────► ┌─────────────────────────────────┐
M_AXIS_ACP_RESULT ◄──────── │ Shared L2 Cache (URAM 1.75 MB)│
│ GEMV ──FIFO──► CVO_top (SFU) │
└─────────────────────────────────┘
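The decoder step in the diagram (64-bit VLIW word → opcode + body) can be sketched as a simple field split. The five mnemonics come from this README; the numeric opcode values, the 4-bit opcode width, and the placement of the opcode in the top bits are all hypothetical, since the real encoding lives in the ISA reference.

```python
from enum import IntEnum


class Opcode(IntEnum):
    # Mnemonics are from the README; the numeric values are hypothetical.
    GEMV = 0
    GEMM = 1
    MEMCPY = 2
    MEMSET = 3
    CVO = 4


OPCODE_BITS = 4                      # hypothetical field width
BODY_BITS = 64 - OPCODE_BITS         # remaining bits carry the body
BODY_MASK = (1 << BODY_BITS) - 1


def encode(opcode, body):
    """Build a 64-bit word with the opcode in the top bits."""
    if not 0 <= body <= BODY_MASK:
        raise ValueError("body does not fit in the body field")
    return (int(opcode) << BODY_BITS) | body


def decode(word):
    """Split a 64-bit word into (Opcode, body); unknown opcodes raise ValueError."""
    if not 0 <= word < (1 << 64):
        raise ValueError("instruction word must fit in 64 bits")
    return Opcode(word >> BODY_BITS), word & BODY_MASK


word = encode(Opcode.GEMV, 0x1234)
print(decode(word))
```

A fixed-width opcode field like this lets the decoder classify an instruction without inspecting the body, which is one way a front end can hand work to separate GEMM, GEMV, memory, and CVO dispatch paths.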
| Level | Technology | Size | Access |
|---|---|---|---|
| L1 (Activation row buffer) | Block RAM | per-core | Systolic / GEMV lanes |
| L2 (Shared cache) | URAM | 1.75 MB (114,688 × 128-bit) | All cores + mem_dispatcher |
| Weight stream | HP AXI port × 4 | DDR4 bandwidth | HP0/1 → GEMM, HP2/3 → GEMV |
| KV Cache | External / off-chip memory model | System-dependent capacity | ACP / coherent access path |
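As a quick consistency check on the L2 row, the word count and word width multiply out to the stated capacity when "MB" is read as binary megabytes (2^20 bytes). The variable names below are illustrative.

```python
L2_WORDS = 114_688      # depth from the table
L2_WORD_BITS = 128      # width from the table

l2_bytes = L2_WORDS * L2_WORD_BITS // 8
print(l2_bytes, l2_bytes / 2**20)  # 1835008 1.75
```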
KV cache bandwidth wall: At 32K context (Gemma 3N E4B), the accumulated KV cache reaches ~1.31 GB. Mitigation: KV quantization (FP16→INT8/INT4), attention sink eviction, and a driver-enforced `KV_MAX_TOKENS` hard cap.

The capacity field above is a system-level memory model, not a guaranteed on-board KV260 figure. Reproducible board measurements will be reported separately with the board configuration, model, context length, precision, and benchmark command.
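To see how quantization and a token cap interact, the sketch below scales the ~1.31 GB figure quoted above. It assumes that figure refers to an FP16 cache, that "32K" means 32,768 tokens, that 1 GB is 10^9 bytes, and that cache size grows linearly with token count; it also ignores quantization metadata such as scales. The function names are hypothetical and do not describe the driver's actual policy.

```python
KV_BYTES_AT_32K_FP16 = 1.31e9   # figure from the README, taken as given
CONTEXT_TOKENS = 32 * 1024      # assumes "32K" means 32,768 tokens

BITS = {"fp16": 16, "int8": 8, "int4": 4}


def kv_bytes(tokens, precision="fp16"):
    """Estimated KV cache size, scaled linearly from the 32K FP16 figure."""
    per_token_fp16 = KV_BYTES_AT_32K_FP16 / CONTEXT_TOKENS
    return tokens * per_token_fp16 * BITS[precision] / 16


def kv_max_tokens(budget_bytes, precision="fp16"):
    """Largest token count whose estimated cache fits in the budget."""
    return int(budget_bytes // kv_bytes(1, precision))


budget = 512 * 2**20  # hypothetical 512 MiB KV budget
for precision in BITS:
    print(precision, kv_bytes(CONTEXT_TOKENS, precision), kv_max_tokens(budget, precision))
```

Under these assumptions, halving the element width doubles the number of tokens that fit in a fixed budget, which is the trade-off a hard cap like `KV_MAX_TOKENS` would need to encode for a given board configuration.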
| Repository | Role |
|---|---|
| pccx | Canonical specification, documentation, and project index. |
| pccx-v002 | v002 IP-core package for LLM, Vision, Voice, and common reusable sources. |
| pccx-v003 | Future v003 IP-core package. |
| pccx-FPGA-NPU-LLM-kv260 | KV260 + LLM application integration; consumes pccx-v002. |
The reusable IP-core line is board- and model-agnostic. Board and model repositories consume the IP-core package; IP-core RTL and compatibility contracts do not name a specific board or model.
For the authoritative version of this table, the boundary rule, and the
submodule pin policy, see docs/reference/repo-topology.md.
pccx/
├── conf.py / index.rst # English Sphinx config & root toctree
├── ko/ # Korean Sphinx subsite (ko-first authoring)
│ ├── conf.py
│ └── docs/ # Korean documentation source
├── docs/ # English documentation source
│ ├── v002/ # Active architecture docs
│ │ ├── Architecture/ # Core design, DSP48E2, KV cache, rationale
│ │ ├── ISA/ # 64-bit VLIW instruction set reference
│ │ ├── Drivers/ # Host API & driver documentation
│ │ └── RTL/ # Embedded RTL source reference
│ └── archive/experimental_v001/
├── assets/images/ # Architecture diagrams (PNG)
├── _static/ # JS/CSS (language switcher, Mermaid theme)
└── codes/
├── v001/hw/rtl/ # v001 RTL (archived, reference only)
└── v002/ # external RTL checkout used by docs builds
Sibling repositories:
- `pccx-v002` — reusable v002 IP-core package.
- `pccx-v003` — future reusable v003 IP-core package.
- pccxai/pccx-FPGA-NPU-LLM-kv260 — KV260 + LLM application integration that consumes `pccx-v002`.
- pccxai/pccx-lab — performance simulator, CLI-first verification lab, and trace profiler (mounted under `/en/lab/` and `/ko/lab/` on the docs site).
pccx is developed across staged releases. v002.0 is the baseline KV260
integration; v002.1 layers sparsity and speculative decoding on the v002
line; v003.x belongs to the future pccx-v003 IP-core package. A
long-term auto-porting compiler begins once the v002 / v003 lines are
stable.
| Release | RTL Repo | Target Model | Scope | Throughput Target | Status |
|---|---|---|---|---|---|
| v002.0 | pccx-v002 + pccxai/pccx-FPGA-NPU-LLM-kv260 | Gemma 3N E4B | A–F baseline integration | measured-only | In progress |
| v002.1 | pccx-v002 + pccxai/pccx-FPGA-NPU-LLM-kv260 | Gemma 3N E4B | G sparsity / H–H+ EAGLE-3 / I SSD / J Tree / K benchmark | evidence-only | Planned |
| v003.0 | pccx-v003 | Gemma 4 E4B | foundation + first architectural novelty | TBD | Planned |
| v003.1 | pccx-v003 | Gemma 4 E4B | second novelty + KV/decoding co-design | TBD | Planned |
| Auto-Porting α | pccxai/pccx | Arbitrary Transformer | config.json → pccx ISA codegen | n/a | Planned (Y2) |
v002.1 compute budget: $70–100 total for EAGLE head training ($40 if a TRC TPU grant lands). The training plan is scoped to v002.1, where the speculative-decoding stack is integrated.
→ Full roadmap (EN) · 한국어 (Korean)
Performance simulator, CLI-first verification lab, and trace profiler for the pccx NPU. Pre-RTL bottleneck detection, UVM co-simulation, and testbench/trace support are combined in a single workflow.
- Repository: https://github.com/pccxai/pccx-lab
- Documentation: https://pccx.pages.dev/en/lab/ (Korean: https://pccx.pages.dev/ko/lab/)
- Status: Work in Progress
The full technical documentation (architecture deep-dives, ISA encoding tables, DSP48E2 bit-packing derivation, driver API, and embedded RTL source) is published at https://pccx.pages.dev/.
Available in English and 한국어 (Korean).
Highlights:
- Architecture Overview — block diagram, design rationale, 3.125× gain breakdown
- DSP48E2 W4A8 Derivation — dual-channel bit packing math
- Custom ISA Reference — 64-bit VLIW encoding, opcode table, dataflow
- RTL Source Reference — embedded SystemVerilog with live syntax highlighting
In-repo developer-facing documentation:
- `docs/reference/` — repository topology, v002 contract narrative, boundary rule, testing protocol, submodule pin policy.
- `docs/onboarding/` — getting started, architecture overview reading order, contribution rules (draft).
- `docs/evidence/` — evidence pack index and risk register.
- `docs/roadmap/` — milestones reflecting actual state.
- `docs/commercial/` — open / commercial / capital track separation (DRAFT, not legal advice).
- `docs/ip/` — patent strategy, trademarks, trade secret policy, contributor licence agreement intent (DRAFT, not legal advice).
pip install -r requirements.txt
sudo apt-get install graphviz # for Graphviz diagrams
# Clone v002 RTL (required for literalinclude directives)
git clone --depth 1 \
https://github.com/pccxai/pccx-FPGA-NPU-LLM-kv260 \
codes/v002
# Build English site
sphinx-build -b html . _build/html/en
# Build Korean site
sphinx-build -b html ko _build/html/ko
# Serve locally
python -m http.server --directory _build/html
# → open http://localhost:8000/en/ or /ko/

| Pain point (v001) | v002 solution |
|---|---|
| Core role ambiguity (Matrix/Vector/CVO blurred) | Strict separation: GEMM / GEMV / SFU |
| Excessive intermediate bus paths | Shared L2 + direct-connect FIFO for GEMV↔SFU |
| L2 ↔ Global Cache responsibility overlap | Single unified L2 (URAM) |
| Single HP port → one systolic array bottleneck | HP0/HP1 for GEMM, HP2/HP3 for GEMV (distributed) |
| 1 DSP = 1 MAC (bit headroom wasted) | Dual-channel packing → 1 DSP = 2 MACs |
| 250 MHz ceiling (AXI clock) | Decoupled 400 MHz core domain |
Licensed under the Apache License 2.0.
PCCX™ is a mark used by the PCCX project. Korean trademark
applications are pending for PCCX in Classes 09 and 42 (application
numbers 40-2026-0091497 and 40-2026-0091498). Registration has
not been granted; do not use PCCX® until this policy is updated.
See TRADEMARKS.md for permitted use, restricted
use, and the public-safe filing docket.
Built by @hkimw · Documentation · Issues