pccx — Parallel Compute Core eXecutor

A scalable NPU architecture for Transformer LLM inference on edge FPGAs

Project status

Public alpha — v0.1.0-alpha is published as a prerelease. Core architecture and ISA are stable; verification, KV260 bring-up, and documentation polish are in progress. Expect rough edges; feedback and issues are welcome.

Entry point	Link
Documentation	https://pccx.pages.dev/
Releases	https://github.com/pccxai/pccx/releases
`v0.1.0-alpha` notes	docs/releases/v0.1.0-alpha.md
Roadmap (project board)	https://github.com/orgs/pccxai/projects/1
Contributing	CONTRIBUTING.md
How to cite	CITATION.cff
Discussions	https://github.com/pccxai/pccx/discussions
Good first issues	Issues labeled `good first issue` (suitable for first-time contributors)

What is pccx?

pccx is a hardware-software co-design framework that accelerates autoregressive decoding of Transformer-based LLMs on resource-constrained edge devices. The primary target is the Xilinx Kria KV260 SOM.

Rather than reusing a generic matrix accelerator, pccx is sized around the dominant bottleneck of LLM decoding, which is memory-bandwidth-bound GEMV rather than compute-bound GEMM. The architecture separates matrix (GEMM) and vector (GEMV) datapaths, supplies weights through dedicated HP AXI ports, and uses a custom 64-bit VLIW ISA to reduce dispatch-side stalls.
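To make the bandwidth argument concrete, here is a minimal sketch that bounds GEMV throughput by weight-streaming bandwidth. It assumes that each MAC consumes one freshly streamed weight (no weight reuse during single-token decoding) and that weights are 4 bits wide, as in the W4A8 scheme described under Architecture. The function name and the example bandwidth values are illustrative placeholders, not measured pccx or KV260 figures.

```python
def gemv_bound_gmacs(peak_gmacs: float, weight_bw_gbytes: float,
                     weight_bits: int = 4) -> float:
    """Upper bound on GEMV throughput in GMAC/s.

    peak_gmacs:       compute peak of the MAC lanes (GMAC/s)
    weight_bw_gbytes: sustained weight-stream bandwidth in GB/s (hypothetical input)
    weight_bits:      bits per weight; each MAC is assumed to need one new weight
    """
    weights_per_byte = 8 / weight_bits
    streaming_limit = weight_bw_gbytes * weights_per_byte  # Gweights/s == GMAC/s
    return min(peak_gmacs, streaming_limit)


if __name__ == "__main__":
    # Sweep a few hypothetical bandwidths against the 51.2 GMAC/s GEMV peak.
    for bw in (5.0, 10.0, 20.0, 40.0):
        print(f"{bw:5.1f} GB/s -> {gemv_bound_gmacs(51.2, bw):5.1f} GMAC/s")
```

Under these assumptions the MAC lanes only reach their compute peak once the weight stream sustains at least half as many GB/s as the peak in GMAC/s; below that point, throughput scales with bandwidth rather than with the number of MAC units.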

Architecture (v002)

Core	Configuration	Peak Throughput	Primary Use
GEMM (Matrix)	32 × 32 systolic array (cascade split @ row 16)	819 GMAC/s @ 400 MHz	Prefill, Q·Kᵀ, score·V
GEMV (Vector)	4 cores × 32-MAC LUT pipeline + 5-stage reduction tree	Weight-streaming limited (~51.2 GMAC/s @ 400 MHz)	Autoregressive decoding
SFU / CVO	CORDIC + LUT hybrid	BF16 / FP32 promoted	Softmax, GELU, RMSNorm, RoPE
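The peak figures in the table can be reproduced with simple arithmetic. The sketch below assumes one reading of the configuration column: the GEMM array has 32 × 32 processing elements, each performing 2 MACs per cycle (the dual-MAC DSP packing listed under the key design decisions), and each GEMV lane performs 1 MAC per cycle. The helper name is hypothetical.

```python
import math

CORE_CLOCK_HZ = 400e6  # core compute domain clock from the table


def peak_gmacs(mac_units: int, macs_per_unit: int,
               clock_hz: float = CORE_CLOCK_HZ) -> float:
    """Peak throughput in GMAC/s: units x MACs per unit per cycle x clock."""
    return mac_units * macs_per_unit * clock_hz / 1e9


# GEMM: 32 x 32 array, assumed 2 MACs per element per cycle.
gemm_peak = peak_gmacs(32 * 32, 2)
# GEMV: 4 cores x 32 lanes, assumed 1 MAC per lane per cycle.
gemv_peak = peak_gmacs(4 * 32, 1)
# A binary reduction tree over 32 lanes needs log2(32) stages.
tree_stages = int(math.log2(32))

if __name__ == "__main__":
    print(f"GEMM peak: {gemm_peak:.1f} GMAC/s")
    print(f"GEMV peak: {gemv_peak:.1f} GMAC/s")
    print(f"Reduction tree stages: {tree_stages}")
```

The GEMV figure is a compute ceiling only; as the table notes, sustained GEMV throughput is limited by weight streaming.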

Key design decisions:

W4A8 precision — INT4 weights × INT8 activations via DSP48E2 dual-channel bit packing (1 DSP = 2 MACs)
Precision promotion — non-linear ops (Softmax, GELU, RMSNorm, RoPE) automatically upcast to BF16/FP32 for numerical stability
Custom 64-bit VLIW ISA — 5 opcodes: GEMV, GEMM, MEMCPY, MEMSET, CVO; decoupled decode/dispatch eliminates front-end stalls
Shared L2 (URAM 1.75 MB) — all three cores share a central SRAM cache; GEMV↔SFU are connected via a direct-connect FIFO, bypassing L2 round-trips
Dual clock domains — 250 MHz AXI/control plane, 400 MHz core compute (×1.6 frequency gain over v001)
3.125× total throughput gain vs. v001 (frequency × dual-MAC DSP packing)
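The "1 DSP = 2 MACs" idea can be illustrated in software. The sketch below shows the general packing trick, not the actual pccx bit layout: two signed INT4 weights are packed into one wide operand, multiplied once by a shared signed INT8 activation, and the two products are recovered from the single result. The 16-bit guard shift `K`, the choice to share the activation, and the function name are assumptions made for illustration; the project's DSP48E2 W4A8 derivation is the authoritative description.

```python
K = 16  # hypothetical guard shift; an INT4 x INT8 product fits in 12 signed bits


def packed_mul(w_hi: int, w_lo: int, a: int) -> tuple[int, int]:
    """Return (w_hi * a, w_lo * a) using a single wide multiplication."""
    assert -8 <= w_hi <= 7 and -8 <= w_lo <= 7, "weights must be signed INT4"
    assert -128 <= a <= 127, "activation must be signed INT8"

    packed = (w_hi << K) + w_lo   # two weights in one multiplier operand
    p = packed * a                # one multiply: (w_hi*a << K) + w_lo*a

    # Recover the low product by sign-extending the low K bits.
    lo = p & ((1 << K) - 1)
    if lo >= 1 << (K - 1):
        lo -= 1 << K
    # Subtracting the low product removes its borrow from the high field.
    hi = (p - lo) >> K
    return hi, lo


if __name__ == "__main__":
    print(packed_mul(-8, 7, -128))
```

The sign correction matters: when the low product is negative it borrows from the high field, so the high product must be extracted after subtracting the sign-extended low product rather than by a plain shift.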

External AXI (250 MHz)          Core Domain (400 MHz)
─────────────────────           ──────────────────────────────────────────────────────
S_AXIL_CTRL (HPM)    ────────►  npu_controller_top
                                  ├─ ctrl_npu_decoder   (64-bit VLIW → opcode + body)
S_AXI_HP0/HP1        ────────►  GEMM_systolic_top      (32×16×2, W-Stationary)
S_AXI_HP2/HP3        ────────►  GEMV_top               (4 cores × 32-MAC LUT, 5-stage tree)
S_AXIS_ACP_FMAP      ────────►  ┌─────────────────────────────────┐
M_AXIS_ACP_RESULT    ◄────────  │  Shared L2 Cache (URAM 1.75 MB)│
                                │  GEMV ──FIFO──► CVO_top (SFU)  │
                                └─────────────────────────────────┘

Memory Hierarchy

Level	Technology	Size	Access
L1 (Activation row buffer)	Block RAM	per-core	Systolic / GEMV lanes
L2 (Shared cache)	URAM	1.75 MB (114,688 × 128-bit)	All cores + mem_dispatcher
Weight stream	HP AXI port × 4	DDR4 bandwidth	HP0/1 → GEMM, HP2/3 → GEMV
KV Cache	External / off-chip memory model	System-dependent capacity	ACP / coherent access path
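As a quick consistency check on the L2 row, the stated depth and word width multiply out to the stated capacity when "MB" is read as binary megabytes (MiB). The address-width line is derived arithmetic, not a documented pccx parameter.

```python
L2_ENTRIES = 114_688   # depth from the table
L2_WORD_BITS = 128     # word width from the table

l2_bytes = L2_ENTRIES * L2_WORD_BITS // 8
l2_mib = l2_bytes / 2**20
# Minimum number of address bits needed to index every entry.
l2_addr_bits = (L2_ENTRIES - 1).bit_length()

if __name__ == "__main__":
    print(f"{l2_bytes} bytes = {l2_mib} MiB, {l2_addr_bits}-bit word address")
```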

KV cache bandwidth wall: At 32K context (Gemma 3N E4B), the accumulated KV cache reaches ~1.31 GB. Mitigation: KV quantization (FP16→INT8/INT4), attention sink eviction, and a driver-enforced KV_MAX_TOKENS hard cap.

The capacity field above is a system-level memory model, not a guaranteed on-board KV260 figure. Reproducible board measurements will be reported separately with the board configuration, model, context length, precision, and benchmark command.
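The sketch below shows how KV cache size scales with context length and element precision, and how a token cap such as KV_MAX_TOKENS could be derived from a memory budget. The model shape used here (layer count, KV heads, head dimension) is a hypothetical placeholder and is not the Gemma 3N E4B configuration, so the numbers it produces are not expected to match the ~1.31 GB figure above. The function names are also illustrative.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bits_per_elem: int) -> int:
    """Bytes needed to hold keys and values for `tokens` positions."""
    elems = 2 * layers * kv_heads * head_dim * tokens  # 2 = one K and one V
    return elems * bits_per_elem // 8


def max_tokens_for_budget(budget_bytes: int, layers: int, kv_heads: int,
                          head_dim: int, bits_per_elem: int) -> int:
    """Largest token count whose KV cache fits in `budget_bytes`."""
    per_token = kv_cache_bytes(1, layers, kv_heads, head_dim, bits_per_elem)
    return budget_bytes // per_token


# Hypothetical shape for illustration only (NOT Gemma 3N E4B).
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        size = kv_cache_bytes(32 * 1024, bits_per_elem=bits, **SHAPE)
        cap = max_tokens_for_budget(2**30, bits_per_elem=bits, **SHAPE)
        print(f"{name}: {size / 2**30:.2f} GiB at 32K tokens, "
              f"cap for a 1 GiB budget = {cap} tokens")
```

Because the size is linear in both token count and bits per element, halving the precision either halves the footprint at a fixed context length or doubles the token cap at a fixed budget.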

Repository topology

Repository	Role
`pccx`	Canonical specification, documentation, and project index.
`pccx-v002`	v002 IP-core package for LLM, Vision, Voice, and common reusable sources.
`pccx-v003`	Future v003 IP-core package.
`pccx-FPGA-NPU-LLM-kv260`	KV260 + LLM application integration; consumes `pccx-v002`.

The reusable IP-core line is board- and model-agnostic. Board and model repositories consume the IP-core package; IP-core RTL and compatibility contracts do not name a specific board or model.

For the authoritative version of this table, the boundary rule, and the submodule pin policy, see docs/reference/repo-topology.md.

Local documentation layout

pccx/
├── conf.py / index.rst          # English Sphinx config & root toctree
├── ko/                          # Korean Sphinx subsite (ko-first authoring)
│   ├── conf.py
│   └── docs/                    # Korean documentation source
├── docs/                        # English documentation source
│   ├── v002/                    # Active architecture docs
│   │   ├── Architecture/        # Core design, DSP48E2, KV cache, rationale
│   │   ├── ISA/                 # 64-bit VLIW instruction set reference
│   │   ├── Drivers/             # Host API & driver documentation
│   │   └── RTL/                 # Embedded RTL source reference
│   └── archive/experimental_v001/
├── assets/images/               # Architecture diagrams (PNG)
├── _static/                     # JS/CSS (language switcher, Mermaid theme)
└── codes/
    ├── v001/hw/rtl/             # v001 RTL (archived, reference only)
    └── v002/                    # external RTL checkout used by docs builds

Sibling repositories:

pccx-v002 — reusable v002 IP-core package.
pccx-v003 — future reusable v003 IP-core package.
pccxai/pccx-FPGA-NPU-LLM-kv260 — KV260 + LLM application integration that consumes pccx-v002.
pccxai/pccx-lab — performance simulator, CLI-first verification lab, and trace profiler (mounted under /en/lab/ and /ko/lab/ on the docs site).

Roadmap — Staged release track

pccx is developed across staged releases. v002.0 is the baseline KV260 integration; v002.1 layers sparsity and speculative decoding on the v002 line; v003.x belongs to the future pccx-v003 IP-core package. A long-term auto-porting compiler begins once the v002 / v003 lines are stable.

Release	RTL Repo	Target Model	Scope	Throughput Target	Status
v002.0	`pccx-v002` + `pccxai/pccx-FPGA-NPU-LLM-kv260`	Gemma 3N E4B	A–F baseline integration	measured-only	In progress
v002.1	`pccx-v002` + `pccxai/pccx-FPGA-NPU-LLM-kv260`	Gemma 3N E4B	G sparsity / H–H+ EAGLE-3 / I SSD / J Tree / K benchmark	evidence-only	Planned
v003.0	`pccx-v003`	Gemma 4 E4B	foundation + first architectural novelty	TBD	Planned
v003.1	`pccx-v003`	Gemma 4 E4B	second novelty + KV/decoding co-design	TBD	Planned
Auto-Porting α	`pccxai/pccx`	Arbitrary Transformer	`config.json` → pccx ISA codegen	n/a	Planned (Y2)

v002.1 compute budget: $70–100 total for EAGLE head training ($40 if a TRC TPU grant lands). The training plan is scoped to v002.1, where the speculative-decoding stack is integrated.

→ Full roadmap (EN) · Korean

Ecosystem

pccx-lab — Simulator & Verification Lab

Performance simulator, CLI-first verification lab, and trace profiler for the pccx NPU. Pre-RTL bottleneck detection, UVM co-simulation, and testbench/trace support are combined in a single workflow.

Repository: https://github.com/pccxai/pccx-lab
Documentation: https://pccx.pages.dev/en/lab/ (Korean: https://pccx.pages.dev/ko/lab/)
Status: Work in Progress

Documentation

The full technical documentation — architecture deep-dives, ISA encoding tables, DSP48E2 bit-packing derivation, driver API, and embedded RTL source — is published at:

https://pccx.pages.dev/

Available in English and 한국어 (Korean).

Highlights:

Architecture Overview — block diagram, design rationale, 3.125× gain breakdown
DSP48E2 W4A8 Derivation — dual-channel bit packing math
Custom ISA Reference — 64-bit VLIW encoding, opcode table, dataflow
RTL Source Reference — embedded SystemVerilog with live syntax highlighting

Documentation map

In-repo developer-facing documentation:

docs/reference/ — repository topology, v002 contract narrative, boundary rule, testing protocol, submodule pin policy.
docs/onboarding/ — getting started, architecture overview reading order, contribution rules (draft).
docs/evidence/ — evidence pack index and risk register.
docs/roadmap/ — milestones reflecting actual state.
docs/commercial/ — open / commercial / capital track separation (DRAFT, not legal advice).
docs/ip/ — patent strategy, trademarks, trade secret policy, contributor license agreement intent (DRAFT, not legal advice).

Building the Docs Locally

pip install -r requirements.txt
sudo apt-get install graphviz   # for Graphviz diagrams

# Clone v002 RTL (required for literalinclude directives)
git clone --depth 1 \
  https://github.com/pccxai/pccx-FPGA-NPU-LLM-kv260 \
  codes/v002

# Build English site
sphinx-build -b html . _build/html/en

# Build Korean site
sphinx-build -b html ko _build/html/ko

# Serve locally
python -m http.server --directory _build/html
# → open http://localhost:8000/en/ or /ko/

v001 → v002 Migration

Pain point (v001)	v002 solution
Core role ambiguity (Matrix/Vector/CVO blurred)	Strict separation: GEMM / GEMV / SFU
Excessive intermediate bus paths	Shared L2 + direct-connect FIFO for GEMV↔SFU
L2 ↔ Global Cache responsibility overlap	Single unified L2 (URAM)
Single HP port → one systolic array bottleneck	HP0/HP1 for GEMM, HP2/HP3 for GEMV (distributed)
1 DSP = 1 MAC (bit headroom wasted)	Dual-channel packing → 1 DSP = 2 MACs
250 MHz ceiling (AXI clock)	Decoupled 400 MHz core domain

License

Licensed under the Apache License 2.0.

Trademark

PCCX™ is a mark used by the PCCX project. Korean trademark applications are pending for PCCX in Classes 09 and 42 (application numbers 40-2026-0091497 and 40-2026-0091498). Registration has not been granted; do not use PCCX® until this policy is updated. See TRADEMARKS.md for permitted use, restricted use, and the public-safe filing docket.

Built by @hkimw · Documentation · Issues
