diff --git a/.gitignore b/.gitignore index 957a88d..29eb5c0 100644 --- a/.gitignore +++ b/.gitignore @@ -29,3 +29,14 @@ incremental_db/ *.sof *.sld *.jdi + +# PYNQ port (`pynq/` subdir) build artefacts +pynq/hw/build/ +pynq/hw/sim/cocotb/sim_build/ +pynq/hw/sim/cocotb/dump.vcd +pynq/hw/sim/cocotb/results.xml +pynq/hw/sim/cocotb/cocotb_env/ +pynq/.Xil/ +pynq/NA/ +pynq/sw/notebooks/.ipynb_checkpoints/ +pynq/.venv/ diff --git a/NOTICE b/NOTICE index c1de8d4..01b2bb7 100644 --- a/NOTICE +++ b/NOTICE @@ -7,3 +7,29 @@ otherwise noted. This repository contains third-party reference material, tool-generated files, and vendor IP with separate ownership and licensing terms. See THIRD_PARTY_NOTICES.md for details. + +---------------------------------------------------------------------- + +PYNQ-Z2 port (`pynq/` subdirectory) +Copyright 2026 Abdullah Al-Nafisah + +The `pynq/` subdirectory adds a port of TALOS-V2 to the Xilinx PYNQ-Z2 +(Zynq-7020 XC7Z020CLG400-1). It is self-contained and does not modify +the original Intel DE1-SoC flow at the repository root. Both flows can +coexist; users with both boards can program either from this single +repository. + +Contents of `pynq/`: + + - Files byte-identical to upstream `rtl/` are redistributed under + Apache-2.0 (copies kept in `pynq/hw/src/core/` and `pynq/hw/ip/` + so the Vivado build is self-contained). + - One file modified from upstream — `pynq/hw/src/core/include/ + microgpt_exact_core_rom_init.svh` — paths adjusted for Vivado's + `INCLUDE_DIRS` mechanism. Modifications fall under Apache-2.0 §4. + - New original work — AXI4-Lite wrapper (`pynq/hw/src/top/`), + Vivado batch build (`pynq/hw/tcl/`), cocotb regression suite + (`pynq/hw/sim/cocotb/`), Python PYNQ driver (`pynq/sw/`), + notebooks, demos, tutorials, and built bitstream artefacts — + licensed under BSD 3-Clause; see `pynq/LICENSE.original` and the + per-file attribution in `pynq/UPSTREAM.md`. diff --git a/pynq/LICENSE.original b/pynq/LICENSE.original new file mode 100644 index 0000000..4dfcb31 --- /dev/null +++ b/pynq/LICENSE.original @@ -0,0 +1,34 @@ +BSD 3-Clause License + +Copyright (c) 2026, Abdullah Al-Nafisah + +Files in this `pynq/` subdirectory that are NOT byte-identical to upstream +TALOS-V2 (see `UPSTREAM.md` for the per-file attribution) are licensed +under the BSD 3-Clause License below. Files that ARE byte-identical to +upstream remain governed by Apache-2.0 (`../LICENSE`). + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its contributors + may be used to endorse or promote products derived from this software + without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. diff --git a/pynq/README.md b/pynq/README.md new file mode 100644 index 0000000..35391a3 --- /dev/null +++ b/pynq/README.md @@ -0,0 +1,141 @@ +# TALOS-V2 — PYNQ-Z2 port + +Self-contained Xilinx **PYNQ-Z2** (Zynq-7020 XC7Z020CLG400-1) port of +the [TALOS-V2](https://github.com/Luthiraa/TALOS-V2) "exact" microGPT +SystemVerilog accelerator. The PL keeps the upstream +`microgpt_exact_core` and its sub-blocks **byte-identical**; only the +host bridge, clocking, I/O, and host-side tooling are rewritten for the +Vivado / PYNQ flow. Per-file attribution lives in +[`UPSTREAM.md`](UPSTREAM.md). + +The original Intel DE1-SoC flow at the [repository root](..) is +unchanged. Both flows coexist in this repository so users with either +board can build and run microGPT from a single clone. + +**Licensing:** files byte-identical to upstream are governed by +Apache-2.0 (`../LICENSE`); original PYNQ-port additions are governed +by BSD 3-Clause (`LICENSE.original`). See `../NOTICE`. + +## Directory layout + +``` +docs/ Design notes + draft upstream license request +demos/ Pre-computed weight heatmaps for the portfolio site +hw/ + build/ Vivado project (gitignored output) + constraints/ pynq_z2.xdc (LD0..LD3 only) + ip/ Q12 weight ROMs (.hex) — origin: TALOS-V2 + src/ + core/ Unmodified TALOS-V2 RTL + .svh includes + top/ microgpt_pynq_top.sv (AXI4-Lite wrapper — new) + sim/cocotb/ cocotb regression suite for the AXI wrapper + tcl/ build.tcl (Vivado batch build — new) +overlays/ microgpt.bit / microgpt.hwh land here +sw/ + drivers/ microgpt.py (pynq.MMIO driver — new) + notebooks/ demo.ipynb, hardware_advantage.ipynb, throughput.ipynb + tests/ +tutorials/ Three-notebook workflow walkthrough +UPSTREAM.md Per-file attribution (TALOS-V2 vs this fork) +LICENSE_STATUS.md Why this repo is not yet open-source-redistributable +``` + +## Tutorials + +Start with [`tutorials/00_overview.ipynb`](tutorials/00_overview.ipynb) +for the workflow loop, then `01_explore_weights.ipynb` to visualise +the Q12 ROMs, then `02_register_map_and_driver.ipynb` for the +AXI4-Lite layout and driver hot path. + +## Quick-start + +### 1. Build the bitstream + +```bash +# From repo root +vivado -mode batch -source hw/tcl/build.tcl +``` + +This creates the Vivado project under `hw/build/`, runs synthesis and +implementation, and copies `microgpt.bit` + `microgpt.hwh` into +`overlays/`. + +### 2. Deploy to the PYNQ-Z2 + +```bash +scp overlays/microgpt.bit overlays/microgpt.hwh \ + xilinx@:/home/xilinx/pynq/overlays/microgpt/ +scp -r sw/drivers sw/notebooks \ + xilinx@:/home/xilinx/jupyter_notebooks/microgpt/ +``` + +### 3. Run on the board + +Open Jupyter (`http://:9090`) and run +`sw/notebooks/demo.ipynb`, or from a Python shell: + +```python +from microgpt import MicroGPT +gpt = MicroGPT() +text, info = gpt.generate(max_tokens=8, temperature=1.0, seed=42) +print(text, info["cycles"]) +``` + +## Register map (AXI4-Lite slave at 0x4000_0000, 4 KB) + +| Offset | RW | Field | +|--------:|:--:|:--------------------------------------------------------------| +| 0x000 | RO | Magic = `0x4D475254` ("MGRT") | +| 0x004 | RO | Version = `0x00020001` | +| 0x008 | WO | bit0 = start pulse, bit1 = clear pulse | +| 0x00C | RO | Status `{pos, out_len, 0, 0, direct_mode, toggle, error, done, busy, ready}` | +| 0x010 | RW | Config `{temp_q8_8[31:16], max_gen[15:8], 0[7:0]}` | +| 0x014 | RW | RNG seed | +| 0x018 | RO | `{top_logit_q12[31:16], argmax_token[15:8], last_token[7:0]}` | +| 0x01C | RO | BOS_TOKEN (`26`) | +| 0x020 | RW | Step config `{0, step_token, step_pos, step_clear, direct_mode}` | +| 0x024 | WO | Step trigger pulse (bit0) | +| 0x028 | RO | heartbeat_reg snapshot (debug; zero-padded to 32b) | +| 0x060.. | RO | `output_mem[0..15]` -- 16 generated tokens | +| 0x0D8 | RO | perf_cycles | +| 0x0DC | RO | tokens_per_sec | +| 0x100.. | RO | 27 sign-extended logits (Q12) | + +PL LEDs LD0..LD3 expose `{heartbeat, busy, done, error}` (heartbeat moved +to LD0 so it stays visible even on boards where LD3/M14 has a physical +fault, as observed on the deployed PYNQ-Z2 unit). + +## Avalon-MM -> AXI4-Lite translation summary + +| DE1-SoC (Avalon-MM) | PYNQ-Z2 (AXI4-Lite) | +|----------------------------------------------------|--------------------------------------------------| +| `jtag_microgpt_bridge` master + `waitrequest`/`readdatavalid` handshakes | Standard AXI4-Lite slave on PS GP0 (`s_axi_*`). | +| 50 MHz `CLOCK_50` host domain + 56.25 MHz core PLL | Single domain `s_axi_aclk = FCLK_CLK0 = 50 MHz`. | +| Toggle-bit triggers (`host_start_toggle_50` etc.) crossed via 2-FF synchronizers | 1-cycle `start_pulse` / `clear_pulse` / `step_pulse` decoded inline. | +| `host_toggle_reg` flips on every JTAG read or write | `host_toggle_reg` flips on every successful AXI read or write. | +| WSTRB / byte enables driven by JTAG bridge (4'b1111) | `s_axi_wstrb` accepted but ignored; aligned 32-bit writes only. | +| Resets: `~SW[1] && pll_locked` | `s_axi_aresetn` from `proc_sys_reset` driven by `FCLK_RESET0_N`. | +| Outputs: 10x LEDR + 6x HEX | 4x PL LEDs (LD0..LD3): `heartbeat`, `busy`, `done`, `error`. | +| Weights via `$readmemh("generated/...hex", ...)` | Weights live in `hw/ip/`; build.tcl adds it to `INCLUDE_DIRS` and `rom_init.svh` references bare filenames. | + +## Notes + +- The unmodified core RTL lives in `hw/src/core/` and includes + `microgpt_exact_core_params.svh`, `microgpt_exact_core_math.svh`, + and `microgpt_exact_core_rom_init.svh`. Parameters (e.g. `EMBED_DIM`, + `VOCAB_SIZE`, `FRAC_BITS`) are unchanged from the DE1 build. +- The build script targets `xc7z010clg400-1`; if your PYNQ-Z2 carries + the larger XC7Z020 die, edit the `part` variable at the top of + `hw/tcl/build.tcl`. +- All RTL uses 4-space indentation, no tabs. + + +## Run Build: +```bash +mkdir -p hw/build +vivado -mode batch -source hw/tcl/build.tcl \ + -log hw/build/vivado_build.log \ + -journal hw/build/vivado_build.jou \ + 2>&1 | tee hw/build/build_console.log +``` + diff --git a/pynq/UPSTREAM.md b/pynq/UPSTREAM.md new file mode 100644 index 0000000..a5fea29 --- /dev/null +++ b/pynq/UPSTREAM.md @@ -0,0 +1,45 @@ +# Upstream attribution + +This repository is a **port** of the SystemVerilog inference core from +[`Luthiraa/TALOS-V2`](https://github.com/Luthiraa/TALOS-V2), an RTL +implementation of a Karpathy-style microGPT for the Intel DE1-SoC +(Cyclone V), to the **Xilinx PYNQ-Z2** (Zynq-7020 XC7Z020CLG400-1). + +All credit for the inference core RTL and the underlying numerical +design (Q12 fixed-point, systolic matvec tile, processing-element +array, RMS-norm + saturating-divider engines, categorical sampler) +belongs to the upstream author(s) of TALOS-V2. + +This fork's contribution is the *host-side bridge to PYNQ*: an +AXI4-Lite slave wrapper, a Vivado batch build, a cocotb regression +suite for the wrapper, and a Python (`pynq.MMIO` + UIO IRQ) driver. + +## Per-subtree origin + +| Subtree in this repo | Origin | Modifications | +| ----------------------------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `hw/src/core/*.sv` (7 files) | `Luthiraa/TALOS-V2/rtl/src/` | **Unmodified, byte-identical** to upstream. | +| `hw/src/core/include/microgpt_exact_core_math.svh` | `Luthiraa/TALOS-V2/rtl/src/include/` | **Unmodified, byte-identical** to upstream. | +| `hw/src/core/include/microgpt_exact_core_params.svh` | `Luthiraa/TALOS-V2/rtl/src/include/` | **Unmodified, byte-identical** to upstream. | +| `hw/src/core/include/microgpt_exact_core_rom_init.svh` | `Luthiraa/TALOS-V2/rtl/src/include/` | Modified: paths updated for Vivado `INCLUDE_DIRS` and bare-filename `$readmemh` references. | +| `hw/ip/*.hex` (9 weight ROMs) | `Luthiraa/TALOS-V2/rtl/generated/` | Unmodified Q12 fixed-point exports of the upstream-trained microGPT weights. | +| `hw/src/top/microgpt_pynq_top.sv` | **New (this fork)** | AXI4-Lite slave wrapper exposing the upstream core via the Zynq PS GP0. | +| `hw/tcl/build.tcl` | **New (this fork)** | Vivado batch build (Zynq + AXI Interconnect + top + constraints). | +| `hw/sim/cocotb/` | **New (this fork)** | cocotb regression suite targeting `microgpt_pynq_top` (caught a production write-path bug pre-bitstream). | +| `sw/drivers/microgpt.py` | **New (this fork)** | Python MMIO driver, IRQ fast path via `/dev/uio`. | +| `sw/notebooks/*.ipynb` | **New (this fork)** | Demo, hardware-advantage, throughput notebooks for the deployed overlay. | +| `overlays/*.bit`, `*.hwh` | **New (this fork)** | Vivado-built artefacts targeting `xc7z010clg400-1` / `xc7z020clg400-1`. | +| `demos/build.py` | **New (this fork)** | Weight-tensor heatmap renderer for the companion portfolio site. | +| `tutorials/` | **New (this fork)** | Workflow walkthrough notebooks. | + +## Conventions adopted + +- All upstream files retain their original headers, naming, parameter + values (`EMBED_DIM`, `VOCAB_SIZE`, `FRAC_BITS`, …), and behaviour. +- The cocotb tests target only the **new** AXI wrapper (`hw/src/top/`); + upstream core behaviour is **not** retested here — that responsibility + remains with the upstream ModelSim testbenches. +- No upstream file in `hw/src/core/` should be edited in this repo. + If the upstream core needs a fix, the fix belongs in upstream and + this repo pulls it in via a fresh copy + a noted update in this + file's modification log. diff --git a/pynq/demos/build.py b/pynq/demos/build.py new file mode 100644 index 0000000..edeb17d --- /dev/null +++ b/pynq/demos/build.py @@ -0,0 +1,120 @@ +""" +Build precomputed demo assets for the website project page. + +Renders each of the 4,192 INT16 (Q12) weight tensors baked into the PL +fabric as a labeled heatmap PNG, plus a metadata JSON consumed by the +website's HeatmapViewer. No inference run — just a faithful visualisation of +what's literally hardcoded into LUTRAM/BRAM at synthesis time. + +Refresh: + + python3 demos/build.py + cp demos/out/* ../AbdullahAlNafisah.github.io/public/demos/pynq-microgpt/ + +Pure numpy + matplotlib. Q12 sign convention: each value is a 16-bit +two's-complement int read as Q12 fixed-point (one sign bit, 3 integer +bits, 12 fractional bits → range [-8, 8)). +""" + +from __future__ import annotations + +import json +from pathlib import Path + +import matplotlib + +matplotlib.use("Agg") +import matplotlib.pyplot as plt +import numpy as np + +REPO_ROOT = Path(__file__).resolve().parent.parent +IP_DIR = REPO_ROOT / "hw" / "ip" +OUT = Path(__file__).parent / "out" +OUT.mkdir(parents=True, exist_ok=True) + +FRAC_BITS = 12 # matches hw/src/core/include/microgpt_exact_core_params.svh +EMBED_DIM = 16 +VOCAB_SIZE = 27 # 26 letters + BOS sentinel +MLP_DIM = 64 +SEQ_LEN = 16 # wpe_q12.hex has 16*16 entries + + +def load_q12_hex(path: Path, shape: tuple[int, int]) -> np.ndarray: + """Read a .hex file (one 16-bit hex word per line) into a float Q12 matrix.""" + raw = np.array( + [int(line.strip(), 16) for line in path.read_text().splitlines() if line.strip()], + dtype=np.uint16, + ) + signed = raw.astype(np.int32) + signed[signed >= 0x8000] -= 0x10000 # two's complement → signed int16 in int32 storage + fp = signed.astype(np.float64) / (1 << FRAC_BITS) + assert fp.size == shape[0] * shape[1], f"{path.name}: expected {shape}, got {fp.size} values" + return fp.reshape(shape) + + +# (filename, label, shape, blurb) +WEIGHTS: list[tuple[str, str, tuple[int, int], str]] = [ + ("wte_q12.hex", "WTE — token embedding", (VOCAB_SIZE, EMBED_DIM), "27 tokens × 16-dim embedding"), + ("wpe_q12.hex", "WPE — positional embedding", (SEQ_LEN, EMBED_DIM), "16 positions × 16-dim embedding"), + ("layer0_attn_wq_q12.hex", "W_Q — attention query", (EMBED_DIM, EMBED_DIM), "16 × 16"), + ("layer0_attn_wk_q12.hex", "W_K — attention key", (EMBED_DIM, EMBED_DIM), "16 × 16"), + ("layer0_attn_wv_q12.hex", "W_V — attention value", (EMBED_DIM, EMBED_DIM), "16 × 16"), + ("layer0_attn_wo_q12.hex", "W_O — attention output", (EMBED_DIM, EMBED_DIM), "16 × 16"), + ("layer0_mlp_fc1_q12.hex", "FC1 — MLP up-projection", (EMBED_DIM, MLP_DIM), "16 × 64"), + ("layer0_mlp_fc2_q12.hex", "FC2 — MLP down-projection", (MLP_DIM, EMBED_DIM), "64 × 16"), + ("lm_head_q12.hex", "LM head — logits projection", (EMBED_DIM, VOCAB_SIZE), "16 × 27"), +] + + +def render_heatmap(arr: np.ndarray, label: str, out_path: Path) -> None: + """Render a symmetric-around-zero heatmap. PNG output is dimensionless.""" + vmax = float(np.max(np.abs(arr))) or 1e-9 + fig, ax = plt.subplots(figsize=(4.0, 4.0 * arr.shape[0] / max(arr.shape[1], 1)), dpi=200) + ax.imshow(arr, cmap="RdBu_r", vmin=-vmax, vmax=vmax, interpolation="nearest", aspect="auto") + ax.set_axis_off() + fig.subplots_adjust(left=0.02, right=0.98, top=0.98, bottom=0.02) + fig.savefig(out_path, bbox_inches="tight", pad_inches=0) + plt.close(fig) + + +def main() -> None: + presets = [] + total_params = 0 + for fname, label, shape, blurb in WEIGHTS: + path = IP_DIR / fname + w = load_q12_hex(path, shape) + total_params += w.size + png_name = fname.replace("_q12.hex", ".png") + png_path = OUT / png_name + render_heatmap(w, label, png_path) + presets.append({ + "name": label, + "image_url": f"/demos/pynq-microgpt/{png_name}", + "subtitle": ( + f"{blurb} · {w.size} INT16 (Q12) values · " + f"|w| ≤ {float(np.max(np.abs(w))):.2f} · σ = {float(np.std(w)):.3f}" + ), + }) + + meta = { + "kind": "heatmap", + "title": ( + f"microgpt fabric weights · char-level GPT · {EMBED_DIM}-dim · 1 block · " + f"{total_params} INT16 (Q12) params total" + ), + "image_label": "weight matrix (red = positive · blue = negative · scaled per matrix)", + "presets": presets, + "caption": ( + "Each preset renders one of the nine weight tensors literally " + "hardcoded into PL fabric (LUTRAM / BRAM / constants) by the " + "synthesizer — no DRAM, no DMA. Values read straight from " + "hw/ip/*.hex and interpreted as 16-bit Q12 fixed-point. " + "Generated by demos/build.py." + ), + } + (OUT / "weights.json").write_text(json.dumps(meta, indent=2)) + print(f"Wrote {len(presets)} presets ({total_params} params) to {OUT}") + + +if __name__ == "__main__": + main() diff --git a/pynq/demos/out/layer0_attn_wk.png b/pynq/demos/out/layer0_attn_wk.png new file mode 100644 index 0000000..0bc5868 Binary files /dev/null and b/pynq/demos/out/layer0_attn_wk.png differ diff --git a/pynq/demos/out/layer0_attn_wo.png b/pynq/demos/out/layer0_attn_wo.png new file mode 100644 index 0000000..7bd0570 Binary files /dev/null and b/pynq/demos/out/layer0_attn_wo.png differ diff --git a/pynq/demos/out/layer0_attn_wq.png b/pynq/demos/out/layer0_attn_wq.png new file mode 100644 index 0000000..48eeccc Binary files /dev/null and b/pynq/demos/out/layer0_attn_wq.png differ diff --git a/pynq/demos/out/layer0_attn_wv.png b/pynq/demos/out/layer0_attn_wv.png new file mode 100644 index 0000000..f2dd9ef Binary files /dev/null and b/pynq/demos/out/layer0_attn_wv.png differ diff --git a/pynq/demos/out/layer0_mlp_fc1.png b/pynq/demos/out/layer0_mlp_fc1.png new file mode 100644 index 0000000..ffd518d Binary files /dev/null and b/pynq/demos/out/layer0_mlp_fc1.png differ diff --git a/pynq/demos/out/layer0_mlp_fc2.png b/pynq/demos/out/layer0_mlp_fc2.png new file mode 100644 index 0000000..4d4128a Binary files /dev/null and b/pynq/demos/out/layer0_mlp_fc2.png differ diff --git a/pynq/demos/out/lm_head.png b/pynq/demos/out/lm_head.png new file mode 100644 index 0000000..4736e0f Binary files /dev/null and b/pynq/demos/out/lm_head.png differ diff --git a/pynq/demos/out/weights.json b/pynq/demos/out/weights.json new file mode 100644 index 0000000..87eccd5 --- /dev/null +++ b/pynq/demos/out/weights.json @@ -0,0 +1,53 @@ +{ + "kind": "heatmap", + "title": "microgpt fabric weights \u00b7 char-level GPT \u00b7 16-dim \u00b7 1 block \u00b7 4192 INT16 (Q12) params total", + "image_label": "weight matrix (red = positive \u00b7 blue = negative \u00b7 scaled per matrix)", + "presets": [ + { + "name": "WTE \u2014 token embedding", + "image_url": "/demos/pynq-microgpt/wte.png", + "subtitle": "27 tokens \u00d7 16-dim embedding \u00b7 432 INT16 (Q12) values \u00b7 |w| \u2264 0.69 \u00b7 \u03c3 = 0.199" + }, + { + "name": "WPE \u2014 positional embedding", + "image_url": "/demos/pynq-microgpt/wpe.png", + "subtitle": "16 positions \u00d7 16-dim embedding \u00b7 256 INT16 (Q12) values \u00b7 |w| \u2264 0.73 \u00b7 \u03c3 = 0.157" + }, + { + "name": "W_Q \u2014 attention query", + "image_url": "/demos/pynq-microgpt/layer0_attn_wq.png", + "subtitle": "16 \u00d7 16 \u00b7 256 INT16 (Q12) values \u00b7 |w| \u2264 0.48 \u00b7 \u03c3 = 0.157" + }, + { + "name": "W_K \u2014 attention key", + "image_url": "/demos/pynq-microgpt/layer0_attn_wk.png", + "subtitle": "16 \u00d7 16 \u00b7 256 INT16 (Q12) values \u00b7 |w| \u2264 0.55 \u00b7 \u03c3 = 0.162" + }, + { + "name": "W_V \u2014 attention value", + "image_url": "/demos/pynq-microgpt/layer0_attn_wv.png", + "subtitle": "16 \u00d7 16 \u00b7 256 INT16 (Q12) values \u00b7 |w| \u2264 0.34 \u00b7 \u03c3 = 0.123" + }, + { + "name": "W_O \u2014 attention output", + "image_url": "/demos/pynq-microgpt/layer0_attn_wo.png", + "subtitle": "16 \u00d7 16 \u00b7 256 INT16 (Q12) values \u00b7 |w| \u2264 0.33 \u00b7 \u03c3 = 0.109" + }, + { + "name": "FC1 \u2014 MLP up-projection", + "image_url": "/demos/pynq-microgpt/layer0_mlp_fc1.png", + "subtitle": "16 \u00d7 64 \u00b7 1024 INT16 (Q12) values \u00b7 |w| \u2264 0.64 \u00b7 \u03c3 = 0.156" + }, + { + "name": "FC2 \u2014 MLP down-projection", + "image_url": "/demos/pynq-microgpt/layer0_mlp_fc2.png", + "subtitle": "64 \u00d7 16 \u00b7 1024 INT16 (Q12) values \u00b7 |w| \u2264 0.45 \u00b7 \u03c3 = 0.125" + }, + { + "name": "LM head \u2014 logits projection", + "image_url": "/demos/pynq-microgpt/lm_head.png", + "subtitle": "16 \u00d7 27 \u00b7 432 INT16 (Q12) values \u00b7 |w| \u2264 1.33 \u00b7 \u03c3 = 0.244" + } + ], + "caption": "Each preset renders one of the nine weight tensors literally hardcoded into PL fabric (LUTRAM / BRAM / constants) by the synthesizer \u2014 no DRAM, no DMA. Values read straight from hw/ip/*.hex and interpreted as 16-bit Q12 fixed-point. Generated by demos/build.py." +} \ No newline at end of file diff --git a/pynq/demos/out/wpe.png b/pynq/demos/out/wpe.png new file mode 100644 index 0000000..6a2760c Binary files /dev/null and b/pynq/demos/out/wpe.png differ diff --git a/pynq/demos/out/wte.png b/pynq/demos/out/wte.png new file mode 100644 index 0000000..e58480a Binary files /dev/null and b/pynq/demos/out/wte.png differ diff --git a/pynq/docs/.gitkeep b/pynq/docs/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/hw/constraints/.gitkeep b/pynq/hw/constraints/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/hw/constraints/pynq_z2.xdc b/pynq/hw/constraints/pynq_z2.xdc new file mode 100644 index 0000000..e444ee0 --- /dev/null +++ b/pynq/hw/constraints/pynq_z2.xdc @@ -0,0 +1,28 @@ +# ----------------------------------------------------------------------------- +# pynq_z2.xdc -- PYNQ-Z2 (Zynq XC7Z020-1CLG400C / -1 family) PL pin constraints. +# +# DE1 deviation: only the four PL LEDs LD0..LD3 are pinned out. CLOCK_50, SW, +# HEX0..HEX5, and LEDR[9:0] from the original DE1 board are removed -- the PS +# supplies the clock through FCLK_CLK0 and the host runs over AXI. +# +# No clock constraints are needed: FCLK_CLK0 is constrained automatically +# by the Zynq PS IP. +# ----------------------------------------------------------------------------- + +# DIAGNOSTIC SWAP: LD3 stayed dark even with the heartbeat counter's R pin +# proven clean. To isolate "physical LD3/M14" vs "design heartbeat path", the +# four functional LED nets are rotated one position over the four physical LEDs: +# LD0 (R14) <- led_heartbeat (must blink at ~0.74 Hz if the path works) +# LD1 (P14) <- led_busy +# LD2 (N16) <- led_done +# LD3 (M14) <- led_error (lights only after the forced-error inference) +# The README register-map and software driver still see the same signals; only +# the visible LED that each one drives changed. +# LD0 +set_property -dict { PACKAGE_PIN R14 IOSTANDARD LVCMOS33 } [get_ports { led_heartbeat }]; +# LD1 +set_property -dict { PACKAGE_PIN P14 IOSTANDARD LVCMOS33 } [get_ports { led_busy }]; +# LD2 +set_property -dict { PACKAGE_PIN N16 IOSTANDARD LVCMOS33 } [get_ports { led_done }]; +# LD3 +set_property -dict { PACKAGE_PIN M14 IOSTANDARD LVCMOS33 } [get_ports { led_error }]; diff --git a/pynq/hw/ip/.gitkeep b/pynq/hw/ip/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/hw/ip/layer0_attn_wk_q12.hex b/pynq/hw/ip/layer0_attn_wk_q12.hex new file mode 100644 index 0000000..99e2096 --- /dev/null +++ b/pynq/hw/ip/layer0_attn_wk_q12.hex @@ -0,0 +1,256 @@ +00af +fd08 +0021 +02d0 +0017 +fe9b +ff7f +0354 +fe61 +fedc +009c +0060 +fcbe +ff0d +00cf +fef4 +febe +04d1 +002f +031b +fea0 +fe6c +0310 +0201 +fe3e +0295 +fdb5 +003d +008a +fe64 +0287 +fb34 +02d4 +00f3 +fbed +0403 +ffb4 +00b8 +ff67 +0568 +ffba +0159 +fff7 +00a8 +01b5 +fdf7 +0203 +febc +0223 +fd44 +fef5 +026c +0120 +0084 +004a +027d +000f +0070 +017b +02f7 +fa84 +fe97 +01a8 +ff62 +fda6 +01c7 +001b +0151 +fe69 +ff27 +003d +001e +01b6 +0181 +ffd1 +fa84 +0460 +0096 +fe99 +fd87 +fd0a +0184 +fef9 +fcbd +00fd +0006 +0247 +ff8a +ffaf +004f +fe03 +f730 +010e +00a4 +fe5f +fd5c +fe17 +025e +0244 +0487 +fcb7 +fcdb +0464 +026f +fc38 +ff3e +01f5 +054b +fe93 +faf2 +01c6 +0167 +032a +01bb +ffd4 +0087 +ffb3 +01a5 +01e7 +006f +fee5 +0005 +03f9 +04b7 +fd7c +ffe4 +00b5 +0035 +0606 +fdd3 +fea0 +fdc5 +0235 +ff7a +fdf7 +02db +0390 +fe45 +00d5 +ff04 +01f2 +0568 +fefe +fe29 +fe03 +03ac +0245 +024d +ff13 +01a8 +01e6 +01b9 +fb80 +0099 +ff18 +fcce +048b +fffe +0021 +0184 +faa3 +01cd +0437 +0041 +fe92 +fece +05b4 +ff4b +fa90 +0029 +fffd +025d +fcbf +fd4b +0098 +020c +ff6a +fd2e +ff95 +fe3a +ff26 +00f1 +0133 +ff5c +0057 +023a +078b +03b1 +fa4b +fdcd +02b6 +fe64 +00d9 +ff48 +ffa6 +00b9 +fffd +fafb +0039 +fbe9 +fe10 +ff4a +01a9 +fc9d +ff42 +025e +00ab +fa9a +0310 +00f0 +ff94 +00fa +023b +0498 +0361 +04f2 +ff13 +02c9 +0074 +02a1 +ff91 +ffe2 +01e4 +0234 +00d3 +fea6 +ffc1 +0044 +007b +f992 +fcd8 +fb44 +ff1c +ffce +003f +fbfa +01ad +054e +fb60 +00ea +ff4e +0113 +ff5a +fdea +fd72 +fbe5 +fd82 +f973 +004d +fc03 +ffeb +fcc9 +0261 +0314 +fbf5 +fde2 diff --git a/pynq/hw/ip/layer0_attn_wo_q12.hex b/pynq/hw/ip/layer0_attn_wo_q12.hex new file mode 100644 index 0000000..ab127bf --- /dev/null +++ b/pynq/hw/ip/layer0_attn_wo_q12.hex @@ -0,0 +1,256 @@ +fe04 +fdf0 +010a +fbea +0095 +fd35 +ffc4 +fdef +ff94 +fe86 +007e +fd34 +ff0f +fdb3 +fc49 +ff0f +fe3c +fb1c +fea4 +01c3 +02d9 +0105 +fd82 +fcec +fc28 +fdfe +ffb1 +fc88 +fea3 +fd03 +fff5 +ffd9 +0054 +00dc +01b6 +0167 +feda +fe55 +00d8 +014b +feac +0018 +fef9 +0088 +0266 +0235 +002d +02e2 +00a8 +0182 +feb8 +fe59 +fd27 +fef0 +0383 +0024 +fe89 +ff4a +fe23 +002c +ff10 +02b8 +ff70 +002c +fdf4 +ff02 +fee3 +ffaa +02cd +01da +ff32 +ff67 +ffef +00ff +011c +fdc0 +ffff +fd4d +016e +ff13 +fdf3 +fe3f +fe13 +0006 +031c +007d +fea1 +ff7c +008f +fee9 +ff55 +fe27 +ff3c +fceb +ff1b +00c1 +ffe1 +fcc5 +fd81 +ff27 +014e +0118 +fde5 +fdfe +ffd2 +ff77 +0018 +ffb9 +fee9 +ffad +ffcc +ff99 +fe61 +ff3c +fff0 +0081 +fe55 +0178 +ff62 +fd06 +fea6 +fddf +fef0 +fe7f +0095 +fec4 +fc99 +ff20 +feb2 +0172 +0029 +01c2 +ffed +042f +ffcc +00ee +fefe +0097 +00c9 +0076 +feee +0023 +010d +fe9d +fef9 +ff87 +039e +ffd4 +02ec +00f1 +0197 +003b +fee3 +0153 +fdca +01dd +02c3 +feb5 +fc38 +0209 +fd33 +008c +00c9 +0009 +ff34 +00cf +012e +fcd2 +fe70 +fe4e +ff8d +fe63 +fe19 +fd04 +fe27 +fe5d +0130 +015c +ff9e +022a +ffd9 +0261 +006b +00fc +015d +00aa +ffc0 +021e +ff71 +ffea +03f6 +fed1 +fff8 +ffb4 +fcc4 +fc7b +fcc0 +0091 +0234 +fc6e +ff33 +00c2 +0219 +fb4c +0082 +ffa0 +fcec +00bc +ffa0 +ffbe +0256 +0138 +02b7 +0084 +fc4f +0133 +0192 +fd89 +ff8c +01e9 +0026 +00cb +01b6 +fdcd +004d +ffd3 +fac2 +fbe3 +0016 +ff3f +ffed +fd15 +0004 +ff1d +008b +fd5d +fde6 +ffb8 +007a +fef6 +fe81 +fe1c +ff44 +029f +00e7 +023d +fe2b +019d +0182 +0132 +00c0 +0163 +ffba +fe3d +006c +ff2e diff --git a/pynq/hw/ip/layer0_attn_wq_q12.hex b/pynq/hw/ip/layer0_attn_wq_q12.hex new file mode 100644 index 0000000..9d200cc --- /dev/null +++ b/pynq/hw/ip/layer0_attn_wq_q12.hex @@ -0,0 +1,256 @@ +fddd +0265 +011c +ff4e +ff7b +019e +008f +00d7 +006d +018d +fe8f +fe82 +0271 +fe38 +0074 +fed1 +ffbb +fc4c +fdc6 +fb2e +0142 +01ab +002f +004e +ff75 +0286 +ffe4 +fe85 +fec8 +0467 +fde3 +011b +fc11 +ffc6 +ff1f +fb86 +02be +00ee +ffb8 +ff05 +0289 +00bc +fe8b +ffa7 +fd63 +ff72 +ff4f +fb27 +fd18 +0448 +003e +fdd6 +fd54 +0039 +0137 +016f +01fb +fea9 +fe5f +fdc6 +027e +fff2 +fdf7 +fe3f +046f +0004 +ff88 +ff04 +fea3 +0249 +0033 +fdfd +00c0 +010e +004b +0082 +fdd1 +001d +02c1 +006b +0232 +fd78 +fe86 +0276 +fbf6 +000c +0001 +ff05 +fdc9 +fc77 +fe82 +ffe3 +fd54 +0223 +03a0 +ffbc +0232 +f9eb +00ea +fa53 +00d8 +fe26 +0083 +fccf +02a4 +00b7 +017b +fdc9 +028e +fefe +f95b +009c +ff6f +0183 +0212 +ffe4 +01b0 +fff1 +01a9 +ff5c +031c +00d4 +fe80 +ff2a +0191 +fe84 +fd89 +00da +fe6b +fde5 +04c8 +fd9f +0122 +040a +008b +0186 +ff11 +fe08 +fd6a +0229 +fc80 +00f2 +01ae +0313 +0048 +ff7c +040a +ff89 +ff3d +fb3e +fc48 +fe83 +022c +ffeb +fece +fdd9 +03d9 +fee5 +fc15 +fc6e +03a7 +fca4 +ff4d +01e3 +fdb1 +fc05 +ff78 +ffe9 +007a +045b +0274 +fc8f +0488 +0001 +fe9e +01c8 +00bf +0204 +fd68 +ff39 +0104 +0119 +051a +0712 +0273 +03eb +fb12 +fc82 +0258 +02af +ff58 +025d +fe56 +006b +feaf +03fd +024c +0296 +01ea +0672 +ffce +0234 +fc78 +fdb2 +ff6c +ffac +02fe +009f +0117 +0231 +fed8 +fc8c +ff4e +fc80 +fd8a +fcd6 +02a9 +ff77 +0089 +fdcf +007a +feb2 +fd23 +fce4 +fe6c +0292 +024c +00eb +ff39 +0629 +03ab +07a9 +010d +fb39 +fc1b +0112 +04df +fec2 +03d1 +ff79 +fd4e +ff0f +028e +039a +ffa6 +02d6 +032d +0665 +008f +0036 +fb1d +00ee +fe14 +00e6 +047f +0287 diff --git a/pynq/hw/ip/layer0_attn_wv_q12.hex b/pynq/hw/ip/layer0_attn_wv_q12.hex new file mode 100644 index 0000000..d1d164b --- /dev/null +++ b/pynq/hw/ip/layer0_attn_wv_q12.hex @@ -0,0 +1,256 @@ +0179 +0084 +000f +fdc5 +0487 +02aa +00a3 +02ad +01b6 +fff4 +fed1 +fff2 +0013 +fef3 +ff4f +fc93 +fea4 +003f +0161 +ff14 +0161 +fed5 +fb73 +fdd0 +017d +0012 +ff5e +0189 +00b7 +fdb8 +fc3e +fd26 +0288 +00df +ff03 +027c +01cb +ff23 +fac1 +fe24 +ff58 +fbab +056a +0089 +01f8 +fe8e +fde8 +0109 +013b +fcaf +0151 +0341 +02a2 +0056 +00ee +001e +fd7b +026a +02c7 +015e +0260 +fee3 +fe19 +0092 +0123 +fcbd +ff26 +04a7 +fc45 +feec +02c6 +0376 +ffe8 +fdb0 +03bf +fec0 +0160 +fee5 +00c4 +ff87 +0168 +ff6f +0092 +0180 +ffbc +fff6 +0207 +ff39 +fefc +02db +ff98 +0092 +0049 +029e +fe68 +00da +ff22 +00fd +005f +fc06 +0449 +fc13 +fc7e +fe32 +fecd +ff2d +fd43 +ff7c +fe4d +0297 +feb3 +ffe8 +01b4 +ff81 +0117 +fd47 +0012 +030d +fe7c +021b +ff18 +00a9 +03d8 +006d +022e +ff74 +fceb +fc2f +ff84 +fed2 +0097 +0029 +ff5c +01e6 +01c8 +040b +fdf8 +0079 +02b0 +0002 +ff1e +fca6 +00d2 +fd90 +0259 +ff54 +fff5 +fed4 +fee4 +00a4 +0078 +010b +0152 +006c +0088 +fece +ff78 +0085 +fddd +fda9 +00cd +00b0 +00b1 +ff98 +ffa5 +0270 +ff91 +0288 +ff4c +0149 +ff18 +0056 +febb +019e +0029 +ffed +009a +01ae +0157 +fe1e +0307 +0237 +ff88 +ff90 +fe13 +023f +0448 +0158 +ff93 +fc17 +feb5 +fd61 +01b0 +001f +ff7b +0093 +0213 +ff96 +fe2c +fe02 +009a +ff3e +00ff +fc32 +fee4 +0013 +ffaf +ff4f +fb99 +0307 +0025 +fe56 +03a5 +02c6 +ffe5 +0163 +018b +00e6 +003e +011e +fd88 +fd9a +fe49 +fe26 +fea8 +01d4 +0304 +001e +02f3 +02e4 +024c +03b3 +fc8c +029c +0306 +0159 +0031 +fbb7 +fe73 +00a7 +fede +ffe7 +fed0 +ff74 +010d +feef +fead +fefa +0056 +fdb6 +0114 +ff68 +01c5 +00fb +ff1d +ff0a diff --git a/pynq/hw/ip/layer0_mlp_fc1_q12.hex b/pynq/hw/ip/layer0_mlp_fc1_q12.hex new file mode 100644 index 0000000..c2f54ff --- /dev/null +++ b/pynq/hw/ip/layer0_mlp_fc1_q12.hex @@ -0,0 +1,1024 @@ +031c +fddc +0146 +02ea +febf +00fc +fffd +ffb8 +0154 +fc96 +007c +0006 +0066 +01cb +0473 +00fb +feef +ff4f +0022 +fe74 +004b +03b3 +fc83 +fd42 +0408 +fe8f +ff79 +014d +ff92 +04fa +00fc +fde1 +0213 +ff69 +fea8 +fe70 +000e +01da +ff06 +011e +0307 +fcb6 +fd70 +012d +0293 +0419 +ffc9 +00a0 +0078 +fcbb +02e7 +0300 +ff42 +ffff +fc9f +fef3 +04a5 +0166 +ff7a +0264 +0014 +01c7 +01e1 +ff70 +0008 +fdf9 +ffc4 +0197 +fd31 +ff74 +fe1f +0130 +0146 +fc01 +feae +fd31 +ff04 +036e +0173 +fd80 +0372 +fe54 +00f3 +ff84 +ff83 +02a6 +ff62 +01dd +0232 +fe80 +0128 +fe70 +003e +fe5f +03f9 +fe7d +009d +feb7 +0043 +037e +00f4 +03e6 +fe15 +00a6 +0470 +019b +fc23 +0052 +fcba +0517 +0248 +fe71 +00fd +fc26 +03db +0298 +0178 +0392 +ffb0 +025f +ffa1 +fd1e +0027 +00ac +0106 +0235 +ff68 +ff97 +0272 +01c6 +feb1 +feed +ff50 +fe4a +0228 +0092 +014b +fc3f +fec0 +fdc7 +fd4e +fed1 +0228 +0299 +01c0 +008a +ffbd +0133 +05ac +025c +fb95 +0241 +fe58 +ff66 +0195 +ff1d +042b +0122 +0251 +fe14 +00f2 +00d1 +002a +0191 +fe68 +01ac +fc7f +ffa2 +02ff +fd0d +ffa7 +ffe3 +01c5 +01eb +0385 +fff2 +0029 +ff39 +0034 +0070 +00ee +0522 +fe80 +0641 +0388 +f947 +fbc4 +fcc3 +0186 +0422 +025b +fcc7 +0407 +fe2d +ff91 +0089 +015a +0064 +fdcd +fc21 +fc59 +fc45 +fd38 +fe60 +ffeb +010b +0183 +fef1 +0361 +0267 +ff12 +febd +fc9f +01ea +0115 +ff80 +01b3 +f5bc +fd59 +fd21 +ff2e +00dc +00cf +fe6e +0369 +fe31 +022e +028e +fccf +0395 +0157 +0011 +00d3 +feb4 +0123 +0130 +fe2e +02af +00a5 +feb3 +00f1 +ffc4 +0351 +02f1 +fee4 +03f3 +fefc +0192 +05cb +01b3 +fd67 +0100 +002c +fd17 +02ab +fd8c +0418 +ff65 +002b +00a6 +ffd9 +0283 +ff13 +00d2 +01e7 +fea6 +fef6 +fc2c +ff81 +04fb +016a +ffd1 +0122 +0147 +ff56 +fe57 +024e +0027 +0265 +fc14 +03c3 +fd29 +ff7d +0035 +faeb +0139 +ffbc +ff9d +ff44 +fe55 +ff48 +0034 +0081 +0162 +fb8d +03d9 +0251 +fddf +ff5d +005a +006d +00e6 +0634 +fca5 +017c +fcec +0176 +0565 +014c +00f3 +fd0d +fecc +0195 +ff94 +fe02 +ffaf +017a +0267 +03b0 +ff7e +02e8 +ff9c +0068 +0210 +ffbe +02d4 +fef7 +ff2d +00db +fd5d +fbb2 +fd7a +00ba +0666 +fe62 +ff0e +027d +fdea +03ba +018b +ffe3 +fdf4 +f9c7 +fdc2 +0461 +fda2 +052a +025c +fd8a +fff8 +01d7 +fe74 +02fc +fe8f +03a7 +fe9e +014f +012f +fab0 +ff2b +01f2 +fd3a +fcd1 +02e0 +00a8 +0324 +025b +fd34 +010c +ff75 +0389 +0203 +fe5c +0101 +fefc +02b6 +03eb +ff54 +fdb4 +fe2e +fdf2 +009d +032c +fca9 +01cc +0305 +ffc0 +fdee +fe60 +02f1 +fb14 +0368 +0297 +fe22 +ffb1 +ff27 +0118 +027a +0109 +f9e7 +02f9 +fe19 +018a +0002 +fe2d +03df +fea5 +00de +057d +fd2b +feaf +ffbd +037e +ffb2 +0246 +fc9c +ff6f +ff6d +00c9 +ff34 +fd84 +02ef +fcff +03b9 +0060 +fb90 +fd88 +0106 +006a +029e +03ba +fcd8 +0206 +001a +0084 +fd14 +0033 +fe30 +fb7b +ff66 +04b8 +fef6 +01d9 +0067 +fe2e +00e5 +029c +fff7 +04a7 +0042 +01e0 +03b0 +feab +ff45 +fca6 +ff06 +028e +ffbc +ff7c +0147 +0109 +049c +fe67 +fbab +0064 +fc19 +00aa +feac +0083 +02b0 +fc76 +ff46 +041b +fbff +fd35 +ffe4 +ff57 +0046 +00aa +fcba +038b +0054 +01a2 +0227 +fe57 +0247 +fdf1 +fcbc +009a +ffc5 +fb49 +fe2e +02cc +0417 +0401 +fcc6 +ff90 +fc5b +0072 +00a9 +ff37 +0165 +fc5b +031a +02fb +ff4c +fc4c +ffe9 +ff89 +022e +0136 +fe57 +00b4 +ff86 +037d +033e +011d +0562 +fd63 +0306 +01d6 +fdb2 +037b +02db +fdb0 +ffe4 +01cb +027c +0190 +fea3 +0054 +036d +01c2 +0235 +fd33 +fecf +0324 +fbc4 +0484 +0188 +0212 +fd71 +ff67 +00c7 +0111 +fda6 +02d5 +0461 +0068 +068f +0065 +feec +ffcb +ff27 +ff2b +037b +feb8 +002d +ffdc +00f0 +ffab +f972 +fff2 +ffcb +01e6 +031d +03b4 +ffe5 +ff0f +fc5c +00c6 +feb0 +fa6b +ffa2 +01ae +ff58 +026c +00ab +0108 +fd51 +fc51 +ff78 +fec0 +fe9c +0825 +fefe +fc44 +fdf5 +fd1e +04ef +027a +ff1b +ffd3 +fd99 +034a +001e +fdee +fee6 +ff62 +0056 +04fd +005a +0046 +00f4 +fd2c +ffe9 +0285 +0002 +ff9b +fb9c +013b +fd8a +fbba +0446 +0078 +0007 +fe38 +00ee +fd6c +fd2d +005b +019b +04c6 +ff60 +fc9b +ff16 +0110 +0263 +009f +0419 +ff13 +0140 +fc12 +fd76 +fea5 +011c +01cd +fec5 +0266 +0200 +0197 +0086 +0114 +0378 +0149 +020d +fdbd +feee +0216 +fcef +fc55 +0119 +01c4 +02cc +01e1 +0086 +01ea +fedd +055e +0156 +ff76 +0248 +ffc5 +0175 +0379 +007d +ff31 +febf +fd85 +02a7 +03a6 +fdf6 +ff0e +fd73 +050f +fe9b +0090 +050b +0054 +01ca +003d +0019 +ff6a +01a2 +fda9 +01b0 +01d0 +037d +0584 +fdd7 +fe00 +01b0 +0085 +01b9 +fb3a +0086 +02de +fb80 +ff42 +fe9b +0113 +00b5 +0486 +fd7d +0289 +ff98 +0303 +03e3 +0260 +fdde +fe07 +00c7 +016d +fc99 +fc65 +0353 +fd3d +00a3 +fe83 +fb47 +0136 +ffc5 +0173 +0007 +03c5 +0148 +ffdc +ff0b +fff1 +fcfb +fd80 +fdf0 +ff79 +04bd +058a +01a8 +00ca +feca +03b6 +fd21 +009d +ff08 +fc22 +fe05 +0567 +016b +ff52 +0259 +fdd3 +ffb6 +02ac +fe90 +00e1 +fd42 +ffc6 +0131 +01da +ffc2 +fb93 +00c5 +045d +fde9 +0244 +01e0 +00b7 +026a +041d +fea6 +008e +fe13 +00a1 +03e3 +feae +ff66 +fa3f +ff74 +0008 +fe61 +fdae +05e4 +ff0b +01bb +0163 +fdc7 +fe7f +fdfc +0427 +fe57 +ffbf +03b6 +ff8a +015e +0425 +fe19 +fe0e +0186 +fdf7 +ffc8 +0115 +fece +fce4 +fa60 +0195 +fb3d +0188 +ff4d +004b +fd37 +0660 +0385 +013f +0498 +fc98 +ff73 +038c +ff05 +03c1 +fffe +00d3 +0066 +fe34 +0360 +fe53 +fe7b +fee9 +fd1a +fd38 +013a +fd46 +01cf +02c0 +fc49 +02c7 +ff6d +02eb +02e4 +02c9 +03ee +fec7 +0136 +0020 +fe81 +f96c +ff4f +feaa +0357 +0181 +fde8 +fb7e +fda1 +007d +01e0 +fc50 +04e3 +01d2 +014a +ff40 +ffe5 +fc42 +fda6 +0150 +fffb +086c +011a +01a9 +ff2a +fe5c +ff05 +fcce +01e2 +fbc4 +ff7a +ff89 +fc67 +ff5b +ff4c +0213 +040e +0202 +fcfb +ff53 +ff8b +0087 +0436 +0020 +fe53 +fb3f +05af +00c6 +ffa6 +0051 +0014 +fbb6 +febc +005d +fa8e +010f +0000 +fe71 +013a +fb6d +0233 +01ed +ffd0 +fc00 +fd68 +fc21 +fba8 +0315 +063c +0215 +00db +00ce +fed8 +ffa3 +011e +0037 +0195 +fed0 +ffa4 +0201 +ffb5 +0104 +01a4 +0073 +035f +02d3 +fd1f +034a +fff3 +001f +024e +00c0 +0051 +ff5e +0067 +ffe2 +fd2b +fea1 +fe44 +fc76 +0455 +0244 +fb72 +fc3c +fd53 +01f4 +ffb1 +fe94 +00d7 +fe11 +04ec +0247 +fbcf +fdd4 +01ba +ff1b +ffc6 +0552 +fe7c +02eb +fedf +ff56 +00b8 +fe28 +032a +fd20 +0165 +03f5 +fdcf +fbed +fff0 +0033 +009a +0240 +fa97 +fe4f +feb9 +0171 +004b +fb51 +025d +fb70 +fce0 +0348 +0088 +ff7f +016c +00fc +003f +0177 +01e2 +010b +020e +fffb +0037 +0425 +085d +fea3 +04c7 +fcf0 +fd72 +fb5d +ff9b +0066 +009a +fec5 +008f +016e +fe0a +0010 +01eb +fe4d +01e4 +0038 +0222 +02b3 +01dd +fb99 +00a4 +0040 +02e2 +01b4 +fbbb diff --git a/pynq/hw/ip/layer0_mlp_fc2_q12.hex b/pynq/hw/ip/layer0_mlp_fc2_q12.hex new file mode 100644 index 0000000..dce7e57 --- /dev/null +++ b/pynq/hw/ip/layer0_mlp_fc2_q12.hex @@ -0,0 +1,1024 @@ +ff17 +fd3e +ff9f +0029 +fd96 +01c2 +ffe3 +ffe0 +ff5b +ff11 +ffbd +ff66 +ff1d +feb8 +ff5f +01a1 +febf +ff46 +01e5 +fe01 +fdeb +fee8 +fc91 +ffa1 +0350 +fed7 +fd41 +00e8 +fd95 +fe14 +fd6a +feb9 +fda7 +fde5 +009a +ff10 +fec6 +013d +0102 +007a +fdf9 +00c1 +0074 +ff7c +fd78 +005d +ff82 +fe9d +fe63 +00de +ffe5 +fe9e +fea4 +ff31 +00cd +fee6 +fd36 +0243 +fe9e +01b8 +0123 +00b8 +fe94 +0143 +0088 +fde4 +0210 +ffcf +0040 +0205 +fe49 +ffa5 +01d6 +fe41 +ff41 +01a9 +00f4 +ff23 +fdd6 +0102 +ff99 +0046 +009d +0157 +0037 +011e +01c6 +ff0f +fd16 +ffcd +01fb +00c8 +03c1 +feb6 +ffda +ff55 +0093 +0069 +0016 +0351 +03e0 +0032 +0459 +0086 +ffca +fdec +ff2c +01da +0125 +0231 +0284 +ff5d +ffbc +010b +03cb +ffd7 +03f7 +0734 +ffcc +fe32 +047e +ff84 +fdd1 +0180 +ff1b +fdaa +03af +fff9 +ff23 +fec3 +01b7 +ff79 +02c2 +00ca +00b6 +ffa0 +feed +03ce +ff3a +00bd +ff58 +fcaf +fe16 +fea5 +ffdd +fd4d +025f +ffdc +ff51 +01b6 +ffb8 +fe3f +00e1 +0191 +00db +0255 +0111 +02cd +012e +02aa +0054 +fefb +016f +0136 +fe0f +fe5e +fec6 +0006 +feca +fec8 +fe49 +01b2 +fe66 +fe84 +fce5 +0447 +feab +01fb +fc01 +00e1 +003d +fec9 +02e0 +045c +fefe +0290 +018d +0141 +003d +ffad +01ee +ff0b +0094 +005b +01d4 +ffa5 +0070 +fecf +fdbd +ff15 +0003 +0268 +fd4c +029f +fe9b +fe8f +01b1 +fb4f +006f +fe88 +ffb5 +007d +ffc2 +0597 +037b +fc69 +0158 +ff62 +01c7 +05c9 +ffd3 +0020 +fe67 +01e2 +fdbc +00f1 +fed9 +fb69 +00db +0073 +fdfa +00c8 +012e +0323 +012b +0368 +fd17 +003d +0059 +028e +fe9a +fd58 +0436 +ff74 +00a1 +040e +01bc +fe3c +008d +ffd6 +0016 +fff1 +025f +002d +028e +feb6 +0098 +0330 +ff7b +01bd +0014 +0056 +ff05 +ff6d +ffcc +0078 +00c3 +00e6 +0007 +fdbc +00cd +ffb8 +01d2 +00d0 +0060 +0010 +fedf +0039 +0184 +006e +fbb0 +fd7b +0024 +0094 +0047 +03b9 +fd29 +ff50 +01fa +ff75 +015a +0320 +01cf +00cc +009a +039a +fe61 +ff7b +04ec +0234 +01e7 +008d +0124 +0061 +01a1 +023b +00fc +fefc +fc85 +01ac +01fe +0265 +fff7 +ff74 +ffb2 +0281 +fee9 +0248 +0057 +ff9a +fdff +0196 +fdb3 +fdbd +fccc +01f5 +ff37 +004e +fd09 +ff82 +0025 +fd14 +fc3f +fce7 +fd27 +fe27 +febc +00fb +001d +ff8a +ff06 +fb1c +00b0 +0088 +fec2 +fb92 +fdcf +0114 +fec5 +fea3 +fdb2 +fd7c +fba8 +fb5c +fd35 +01ea +f95d +ffdd +0184 +fbd5 +fd78 +038b +026f +00c5 +fe5b +0054 +fd60 +fe59 +fbeb +fec2 +ffd4 +ffe6 +01f7 +fb65 +ffab +fe6e +fa4a +ff98 +ffcf +0116 +fe09 +fc03 +fd79 +fd59 +02f5 +fe2d +0062 +0078 +fead +fe9f +fdd8 +0250 +03dd +0109 +fe3f +fbd6 +0389 +0038 +01ea +0175 +0251 +0219 +fdae +013f +fec3 +ff82 +ff0e +004f +fee4 +0127 +ffc0 +ffda +033a +00c2 +ff90 +00d8 +00f4 +ff13 +012c +029a +ffcc +006d +02a6 +055e +026d +ffa8 +00e5 +01b6 +02d0 +fecc +00ff +017b +0369 +014a +04f6 +ff55 +0107 +067d +025a +ffd0 +04e8 +0069 +fcc0 +02ab +ff96 +021c +043d +015e +ff34 +04bd +ff7c +0054 +0110 +0081 +032c +ffe0 +fcf0 +fd3b +017b +fe7c +ff56 +ff23 +0048 +0247 +fe1c +023a +fec0 +008c +013f +ffb5 +ffd9 +ff8d +fcd4 +fd6c +fda8 +0005 +0420 +01ab +01d3 +fdc1 +ff58 +fdfc +01e4 +030b +ff50 +fd75 +010f +024e +0189 +01ac +0159 +ffb0 +000a +ff8c +00c4 +019c +ff55 +0289 +030a +0224 +01fb +03f5 +ff5a +fe9d +00dc +0326 +0159 +00ee +ffb1 +016a +ffa6 +ff67 +fe3c +0041 +fcc1 +002b +fec3 +fec3 +ffa4 +fd21 +ffdd +ff8d +ffc4 +ff03 +fea7 +f907 +fe13 +fc66 +feab +ff14 +000b +000c +fe2e +fcc1 +ffb6 +fd05 +ff2d +ff23 +fff2 +ff54 +fdbf +fed4 +fc9c +ffed +fa9d +fb69 +fee0 +00c4 +ff5f +fd91 +fe3a +feba +fd68 +ff19 +fd20 +fd85 +fdd4 +fe62 +fe5e +fedd +00a8 +fb97 +fbd4 +00a9 +fcda +fe70 +ff25 +fe97 +fb28 +fd9d +ffd6 +fe40 +fc66 +fe04 +fee2 +fdaa +002d +0077 +0120 +ff4d +0227 +fe6c +fe8b +02d3 +feb4 +01fc +0190 +0283 +0185 +0082 +ff08 +015b +fe9b +fe43 +ff26 +0011 +00f7 +fd7a +ffcc +01e3 +fee8 +01d1 +001a +fea6 +008a +000a +009f +0199 +0391 +0194 +00d8 +0239 +0142 +fe29 +fe5b +001f +fd2b +0039 +01ab +0445 +fd0c +017f +fbb5 +01ff +fe9f +0219 +f99c +0049 +ff70 +fd82 +035a +fe0d +fe96 +00cf +0102 +044d +0215 +fcce +0166 +fee8 +fe7d +fc1b +fea2 +ff52 +ff42 +ffd1 +034a +02be +fecc +fdaa +fd9c +fe9d +013b +0337 +fd04 +fd9e +ffa2 +fdd2 +0264 +fead +00e2 +ff50 +002a +0109 +ffb1 +febd +0126 +fd8c +fabc +006b +00be +02fd +003d +fee1 +0163 +fe6b +01ac +fcad +fdbe +00d8 +01b9 +ff73 +0024 +fff7 +00ec +0157 +ff24 +fe1d +fe9b +0230 +fd41 +0160 +011e +00c3 +fda7 +027c +fe8a +feaf +ff05 +ff1a +0201 +0169 +ffe4 +ffe9 +fe6a +ff30 +ff79 +02c2 +ff9f +ff53 +0030 +011c +ff16 +ff6a +ff7f +01a7 +0116 +ff28 +fe02 +ff97 +003f +00ac +02a3 +01fa +03fb +02a7 +0141 +fdf1 +fcdb +febc +0090 +0200 +fd0f +ffd5 +006e +ffbd +ff1c +fe7f +ff9d +fd82 +020d +01ca +fff5 +0252 +008c +ff0c +00e0 +019d +0180 +004e +0106 +00f1 +036a +ff53 +ff7f +024d +0048 +0019 +fdfe +00d0 +00f7 +ff48 +0257 +ff52 +01b8 +023f +0185 +01e4 +01ff +fc9f +fbf0 +0369 +010e +ff77 +009a +ffdd +fe51 +fd99 +fe56 +ff62 +feda +02af +ff6f +00db +ff1f +ff55 +00cb +fda8 +fee6 +fc77 +fbde +045b +0192 +fde1 +fea6 +ffc1 +0043 +ffef +03c4 +0152 +0291 +fed3 +0252 +024d +fddd +fe69 +060f +0325 +0095 +032d +0001 +015b +fdfe +ff6a +ffd0 +fd29 +00e3 +fedf +0045 +0084 +004c +0527 +fe93 +fe72 +04c8 +fe42 +018a +03a4 +0039 +0270 +01bc +fe10 +02ba +fef0 +fcd2 +feaf +039a +ff2e +00a1 +fcf2 +feaa +005a +fe7c +feaf +012f +fe8c +00a5 +026e +fff8 +0150 +01d0 +fee0 +fcd3 +fed2 +fdae +0027 +02fa +009b +fea5 +0251 +fe6c +0189 +fe39 +ffe7 +fea4 +ff68 +fe09 +002b +fd4f +0115 +01bf +fde5 +0150 +0057 +fdb7 +fd2c +ff40 +fe3d +0096 +ffa7 +fe64 +ff7c +00c0 +004f +ff9b +021e +017b +009b +006e +0292 +0082 +0029 +ffc4 +fe67 +fcdc +0010 +fd8c +fef1 +fe31 +ff34 +fff5 +0164 +ffdf +fea4 +0253 +fe92 +fcf1 +fd46 +fedc +01cd +ff73 +fbb0 +fbb7 +01d3 +fd92 +fd4a +fd76 +fe4d +fc78 +ff03 +013e +fe84 +feb0 +ff12 +ff21 +fe01 +fd08 +febd +fb2a +ffec +fbcc +0058 +ff92 +fcc1 +0051 +fe80 +fe01 +fc0c +fbca +fe9b +fdf9 +fd4a +0035 +fdad +ff4d +fd19 +fdf8 +fcec +ffee +fe6f +feac +ff37 +ff1b +ff2e +fe0a +ff1f +fe37 +fe55 +ff30 +ff98 +0054 +03e3 +ff24 +00e1 +0054 +0014 +fe22 +011c +0067 +fe92 +fe43 +feb6 +024b +fe7b +0028 +036a +0120 +03f4 +0130 +ffa6 +011f +00ea +028b +fd32 +00d3 +ff40 +ffcc +0141 +0018 +0119 +0087 +ff3c +ff20 +fd49 +04db +020d +ffaf +0029 +01ee +01bf +0322 +fdbf +02b8 +fffd +ffd7 +01f9 +ff2e +ff32 +015d +0394 +00d6 +017b +0096 +0206 +ff96 +fed6 +ff66 +0052 +00da +02f0 +00be +013a +02f5 +0049 diff --git a/pynq/hw/ip/lm_head_q12.hex b/pynq/hw/ip/lm_head_q12.hex new file mode 100644 index 0000000..08c1258 --- /dev/null +++ b/pynq/hw/ip/lm_head_q12.hex @@ -0,0 +1,432 @@ +fd85 +0015 +fc15 +0048 +00c4 +0135 +0471 +0106 +fd22 +025d +fdd3 +fe6f +0126 +ff26 +ffcb +042e +01b9 +fe67 +02b2 +ff5a +fd75 +0359 +f971 +034a +027a +0098 +fd47 +ff9a +fcb9 +0345 +0225 +f872 +02b2 +fd3f +006c +fca0 +fda9 +0421 +fa23 +0047 +050c +ffdc +fef0 +fe76 +fe7b +028e +03c5 +fdf7 +0015 +fbed +021c +ff53 +fc14 +04aa +fdf5 +ffa4 +00b0 +0208 +01df +fe11 +fe29 +ffd3 +0113 +fdbf +00d8 +027f +fee9 +00e8 +0016 +0087 +02f3 +ffff +ff95 +ff85 +fdaf +ff4b +017f +01ec +fe6c +0450 +0291 +fba7 +04fa +fe74 +fa9e +0bbe +f8e1 +0032 +08f4 +fe76 +f811 +0112 +fada +057d +0575 +fb45 +01ee +fcb2 +0396 +fca9 +ff70 +0877 +fc42 +00c3 +070d +fb93 +fa95 +fe7b +fd60 +031f +047d +fd78 +00c8 +0402 +fec4 +ffc1 +ffc0 +fbf1 +ff70 +fc96 +0140 +f92e +ff84 +0196 +fb40 +000c +03b1 +fb4d +036c +049e +01ff +0259 +fc14 +ffbd +0288 +0144 +fe73 +013b +fd1f +00e8 +0204 +00fd +0049 +01b2 +022b +f9c3 +fde6 +000b +fe25 +0661 +f7d3 +ffa8 +0695 +ff9a +fef5 +ff7a +fa0c +0421 +06d1 +fc39 +021d +fb0b +ff4b +fe99 +fd8d +024c +f98f +fdaf +0359 +00f2 +001a +fcc0 +fa47 +02a9 +038a +ff2c +011a +fa8e +0386 +ffc5 +ff01 +fe73 +ffca +fb69 +ff2f +028c +0409 +fde9 +ffd7 +0130 +fe38 +015f +0001 +faf8 +ff8e +011c +faec +00e6 +f8fe +fe3f +00e8 +fff2 +00f0 +fe4e +fea3 +03d7 +0117 +fcbf +ff50 +fe76 +055c +fc64 +04af +00df +fd1b +02be +fece +02e5 +0611 +004a +ff29 +fce8 +fce2 +fca3 +fec0 +0129 +011a +ff95 +ffb2 +021c +02f3 +0082 +011c +fca0 +fd6a +fd1a +024e +02ab +0243 +02a1 +037f +fa35 +0271 +0391 +fe6a +085d +f873 +00d8 +08e1 +fab6 +fcd2 +0122 +ff17 +0415 +085b +fb62 +074b +f98d +0ade +03f9 +fa65 +1553 +f6f4 +01a4 +0b8f +f3e6 +f535 +01d9 +fc4e +0944 +115d +f842 +ffc3 +ff1d +02f3 +0093 +ff04 +fe70 +fd35 +fef2 +ff58 +00cd +00c8 +fd96 +02ca +01c4 +fc3f +fc01 +0265 +fc37 +004e +fd17 +011d +0181 +fe31 +00c6 +0426 +0131 +ffd8 +fef6 +fc2a +022a +fee8 +fdde +013b +0151 +00a8 +fdb3 +008b +019e +ff57 +ffa2 +0686 +00a7 +fec5 +01b7 +fb63 +0084 +02f5 +fdd9 +ff78 +ffe9 +053e +fe06 +fdb2 +0403 +027f +0241 +019b +fb71 +fb07 +fe9e +013d +0556 +0789 +fcbe +013e +fbb7 +0566 +fecc +fbb6 +0161 +fb02 +005a +0553 +ff2c +fdd0 +fc91 +fe65 +03e7 +03ce +fb94 +00d0 +fcc3 +0615 +015b +ff08 +065c +f8f3 +034a +0633 +fc8c +fd4d +0045 +fe9c +061c +07f8 +fd23 +031c +fca5 +0956 +ffff +fe5b +0ac7 +f9b0 +ff72 +0c02 +fc02 +ff98 +012e +ffa7 +0961 +0512 +fcca +025f +032a +03c9 +0379 +feaf +006a +00ba +0025 +ffbe +ff80 +feec +fdea +02f5 +021c +ffcc +003f +0126 +fcec +02b1 +fc07 +fae2 +058d +fb20 +0043 +009d +01d5 +fca6 +fedc +fe5e +03b0 +02bb +fcb4 +faf1 +01af +0922 +0230 +045b +016f +00e4 +0172 +001c +fc7a +feb2 +06d1 +fda8 +fd14 +fdb9 +027f diff --git a/pynq/hw/ip/wpe_q12.hex b/pynq/hw/ip/wpe_q12.hex new file mode 100644 index 0000000..2d79c5c --- /dev/null +++ b/pynq/hw/ip/wpe_q12.hex @@ -0,0 +1,256 @@ +ff14 +fd30 +f464 +ffcd +0058 +fe7f +fd57 +00a4 +00ff +04dd +00c9 +0038 +ff9c +ffd4 +fbea +ffbf +fe92 +fefe +fae2 +fed8 +fd30 +ffc0 +04b3 +ff3c +fb08 +0267 +0110 +fa6a +024b +ff81 +ffea +00eb +0398 +017d +fb8b +fc87 +fe07 +fbce +0195 +00e0 +fef3 +040d +ffe1 +fade +0607 +0219 +fb54 +0004 +02eb +0112 +fe9b +fe19 +fd1c +fd78 +03dc +fdc1 +fde6 +06d7 +01e8 +fe66 +0177 +ff8f +fc4e +0034 +00d6 +02b2 +ff6d +024e +0227 +fc9a +058a +014b +fe95 +03e6 +ffb7 +0134 +01ea +fec0 +fb9c +00cd +fe78 +026a +007c +012c +051d +fee6 +029b +01c6 +fc6d +ffa9 +010b +01ad +000c +fcd4 +fc08 +01bd +fbd1 +0248 +033e +0158 +053a +ff68 +01f4 +0031 +fecd +fe23 +0088 +0448 +ffa2 +fc5d +fc92 +03bd +fc13 +0106 +036b +ff0f +055d +ffbb +0109 +ff1f +013a +fcf6 +01cf +05bb +fc52 +fca9 +fce4 +0247 +fcc6 +0100 +034f +0107 +02fa +00e1 +ffc3 +ffa0 +03b6 +fc1c +ff40 +04da +fde4 +fe2b +0114 +0412 +fe9c +ff62 +0348 +026b +031f +0277 +fd60 +0086 +01e0 +fbf4 +0025 +025b +fbde +fed9 +fefb +ffed +fcf1 +01c4 +0209 +0484 +02bb +ff43 +fb03 +ff5a +0393 +faf2 +ffa8 +054b +fcb6 +0081 +01aa +0179 +fe6d +03b0 +00a0 +036c +fe10 +02a7 +004a +fd61 +ff90 +ff92 +0058 +0212 +00c6 +00bc +008d +ffc5 +ff4e +0117 +001d +01e1 +ffef +00b0 +fe26 +fc2c +0008 +ff99 +fe51 +ff79 +0092 +fee3 +ff38 +01bb +fefe +fff5 +00bd +fe18 +0157 +01cf +ff7b +00da +029b +0020 +fe6e +02c1 +fd7c +01b8 +005a +0135 +fee9 +ffc3 +ff5d +0325 +0122 +ff73 +00d5 +fe94 +ff73 +0086 +00b8 +0135 +0026 +022b +0020 +00b4 +0089 +fe28 +ffcb +00c8 +0018 +0086 +fe29 +fdc6 +00d0 +00f9 +0040 +0013 +00d3 +fe76 +fdd6 +01bd diff --git a/pynq/hw/ip/wte_q12.hex b/pynq/hw/ip/wte_q12.hex new file mode 100644 index 0000000..930cd7f --- /dev/null +++ b/pynq/hw/ip/wte_q12.hex @@ -0,0 +1,432 @@ +0216 +00db +03df +ff06 +00af +fb43 +f98c +ffae +0020 +0125 +05a0 +0210 +fe9f +fd58 +fd4b +fa19 +0201 +03d4 +0150 +037f +fc27 +fe43 +fdf9 +0004 +ff06 +fed9 +fba9 +fe68 +07f3 +04bf +fa85 +ff7b +0375 +01f9 +f89d +fed7 +ff36 +fdb1 +011f +fac2 +0052 +fb59 +fe65 +fd2d +fd71 +0212 +0303 +0443 +00fc +041c +fedc +0517 +002c +0154 +0231 +ff97 +fc6f +fd3a +fd44 +ff86 +044a +005a +fdb5 +01c5 +ff3f +faef +041d +0056 +0137 +fbd8 +fbb8 +fc08 +00cf +01f9 +0679 +ff10 +02a9 +ffd3 +f860 +00a5 +fe29 +fe4c +00c5 +ff6c +fb8a +ffdc +0195 +00f6 +01ad +0037 +fb50 +013c +01a1 +002f +0165 +01c5 +0139 +0b0c +f8fb +fdfd +fd37 +fd2d +0388 +fc62 +0009 +0137 +0024 +fea7 +ffd0 +ff7c +fee9 +01f0 +fcd1 +0012 +0057 +00c0 +003d +036b +01d0 +00f9 +fd64 +fe62 +fbec +014c +009d +0097 +0001 +021d +fd06 +fd53 +ffcd +fa20 +05fc +02be +01f0 +fc7a +018c +037f +0116 +fcf6 +ff4c +ffe3 +f52b +00b5 +fdee +ff1a +fb83 +ff2a +ffe7 +02b1 +0555 +ffc0 +fe57 +ff91 +fc89 +fe61 +0029 +01d6 +027c +0514 +fe53 +022e +ff23 +0319 +02e4 +fd94 +03bc +ff34 +ff3e +0067 +fbf0 +0092 +006f +00be +0100 +07d6 +023a +0278 +0109 +06bf +fc9f +005a +06c7 +fe30 +fd87 +0042 +ffc9 +00a3 +03e6 +01cb +0064 +08c7 +fe03 +03ee +fee8 +022d +feb9 +0278 +0041 +0345 +fd1f +ff7d +f9c4 +0213 +01f4 +0131 +ff3f +02ed +fcb9 +007f +016a +fea2 +015f +05e7 +0392 +0520 +feed +02d5 +00c7 +0209 +0000 +fad9 +00d5 +0221 +fe28 +fbf5 +055c +fb25 +00b5 +fe07 +fc0d +04b1 +fe8d +01d4 +0659 +fcdc +0195 +ff1a +f980 +f91a +00e2 +031c +faee +0074 +ff29 +f91d +04d0 +ffce +fbbb +fa3e +fee7 +01d7 +fda8 +fd70 +013b +ffba +0055 +ff78 +0159 +ff9b +0356 +0091 +02c5 +0039 +ffee +001b +ff47 +0135 +fe6c +ff2e +0055 +fcf0 +019d +01a6 +00a2 +04ed +fc6f +01c1 +0579 +0130 +ff65 +008a +fee4 +01c0 +0058 +fe40 +00b3 +0358 +fd21 +00b2 +fc30 +fc07 +0398 +fb0d +04d4 +ff95 +03f5 +fd64 +feea +04f8 +f9cf +fcb3 +0361 +00bc +fc63 +096c +fe62 +020d +0072 +fd06 +06cf +feda +039b +fd04 +fd1d +fcd2 +0219 +ff0d +0100 +fe08 +026d +fe99 +fdd7 +fb3f +01cc +faa4 +fb16 +fb36 +ffbb +05b5 +03eb +fcd8 +fff5 +0007 +f566 +fd06 +02c1 +02a7 +fe77 +068e +fdae +03af +06fc +052e +f9f3 +ffca +f81e +0032 +03ee +01ec +025c +0405 +fef3 +01a5 +fbcb +0089 +008e +02f6 +0905 +00f5 +f8e4 +002b +fe7c +fb8a +016d +0135 +ff9d +04ad +ffce +01fa +02a2 +01db +ff80 +ff6e +0074 +0326 +ff3a +ff7c +fcd3 +01aa +00cb +009f +0169 +0108 +fc5c +f7e2 +01ae +ff0b +01ec +036c +00a9 +023e +fd69 +0468 +031e +ff5f +00ca +fe69 +0077 +019a +00c7 +028a +fdda +00e0 +fe26 +029f +0196 +01ed +fd1e +0050 +fc91 +fecd +0312 +01db +fed1 +02ab +00c0 +fdba +f55a +fe5c +ff55 +007a +ff79 +fec6 +fde6 +0592 +01c5 +fe5a +ff4a +ff68 +00c4 +00ee diff --git a/pynq/hw/sim/cocotb/Makefile b/pynq/hw/sim/cocotb/Makefile new file mode 100644 index 0000000..618484f --- /dev/null +++ b/pynq/hw/sim/cocotb/Makefile @@ -0,0 +1,61 @@ +# ----------------------------------------------------------------------------- +# cocotb Makefile -- microgpt_pynq_top AXI4-Lite slave testbench. +# +# Drives the wrapper directly with cocotbext-axi (no BD, no PS7 model needed) +# so we can catch handshake / register-map regressions before paying for a +# 30-minute Vivado build and a board power cycle. +# +# Usage: +# cd hw/sim/cocotb +# make # all tests, default simulator (icarus) +# make SIM=verilator # use Verilator instead +# make WAVES=1 # dump a waveform alongside the run +# make MODULE=test_microgpt_axi TESTCASE=test_03_write_completes_does_not_hang +# ----------------------------------------------------------------------------- + +TOPLEVEL_LANG ?= verilog + +# Auto-pick whichever simulator is installed (prefer icarus when both are +# present, since it's the cocotb default). Override with `make SIM=...`. +ifeq ($(SIM),) + ifneq ($(shell command -v iverilog 2>/dev/null),) + SIM := icarus + else ifneq ($(shell command -v verilator 2>/dev/null),) + SIM := verilator + else + $(error Neither iverilog nor verilator found in PATH. Install one: \ + sudo pacman -S iverilog OR sudo pacman -S verilator) + endif +endif + +PWD := $(shell pwd) +REPO_ROOT := $(abspath $(PWD)/../../..) +HW_ROOT := $(REPO_ROOT)/hw + +# DUT + the unmodified TALOS-V2 core it instantiates. Children stay +# SystemVerilog (Icarus -g2012 / Verilator handle the dialect). +VERILOG_SOURCES += $(HW_ROOT)/src/top/microgpt_pynq_top.sv +VERILOG_SOURCES += $(wildcard $(HW_ROOT)/src/core/*.sv) + +ifeq ($(SIM),icarus) + COMPILE_ARGS += -g2012 + COMPILE_ARGS += -I$(HW_ROOT)/src/core/include + COMPILE_ARGS += -I$(HW_ROOT)/ip + # Shrink the heartbeat counter for sim so test_08 sees a toggle quickly. + COMPILE_ARGS += -Pmicrogpt_pynq_top.HEARTBEAT_BITS=12 +endif + +ifeq ($(SIM),verilator) + EXTRA_ARGS += --trace --trace-structs + EXTRA_ARGS += -I$(HW_ROOT)/src/core/include + EXTRA_ARGS += -I$(HW_ROOT)/ip + EXTRA_ARGS += --timescale 1ns/1ps + EXTRA_ARGS += -Wno-fatal -Wno-WIDTH -Wno-UNOPTFLAT + # Shrink the heartbeat counter for sim so test_08 sees a toggle quickly. + EXTRA_ARGS += -GHEARTBEAT_BITS=12 +endif + +TOPLEVEL = microgpt_pynq_top +MODULE ?= test_microgpt_axi + +include $(shell cocotb-config --makefiles)/Makefile.sim diff --git a/pynq/hw/sim/cocotb/README.md b/pynq/hw/sim/cocotb/README.md new file mode 100644 index 0000000..5831718 --- /dev/null +++ b/pynq/hw/sim/cocotb/README.md @@ -0,0 +1,93 @@ +# microgpt AXI4-Lite cocotb testbench + +Catches AXI handshake / register-map regressions in +`hw/src/top/microgpt_pynq_top.sv` **before** paying for a 30-minute Vivado +build and a board power cycle. + +The tests drive the wrapper directly (no BD, no PS7 model) using +`cocotbext-axi`. The DUT elaborates against the unmodified TALOS-V2 core +in `hw/src/core/`. + +## Why this exists + +A previous bitstream wedged the entire Zynq PS bus on the first AXI write. +The slave kept `AWREADY`/`WREADY` perpetually high in idle and only +latched the transaction when AW and W were valid in the *same* cycle. The +PS M_AXI_GP0 master staggers them by a cycle, considered both handshakes +complete, and then waited forever for `BVALID`. `test_03_write_completes_does_not_hang` +is the regression for that exact failure mode: it would have caught the +bug pre-bitstream and saved a power cycle. + +## Dependencies + +Arch / CachyOS: + +``` +sudo pacman -S iverilog +# OR (newer simulator with stricter SV support): +sudo pacman -S verilator # paru -S verilator-bin if you want the AUR build + +pip install cocotb cocotbext-axi +``` + +`cocotb` >= 1.8 and `cocotbext-axi` >= 0.1.20 are known good. + +## Run all tests + +``` +cd hw/sim/cocotb +make +``` + +Default simulator is **Icarus Verilog** (`SIM=icarus`), invoked with +`-g2012` so it tolerates the SV-2012 dialect used in the TALOS-V2 core +(unpacked 2-D arrays, etc.). + +To use **Verilator** instead: + +``` +make SIM=verilator +``` + +Verilator is stricter about SV but compiles much faster and traces +better; switch to it if Icarus chokes on something the core uses. + +## Run one test + +``` +make MODULE=test_microgpt_axi TESTCASE=test_03_write_completes_does_not_hang +``` + +## Waveforms + +``` +make WAVES=1 +gtkwave dump.vcd # Icarus default; with WAVES=1 cocotb may emit cocotb.fst +``` + +## Test inventory + +| # | Test | What it proves | +|---|-----------------------------------------------|-------------------------------------------------------------------| +| 1 | `test_01_read_magic_version_bos` | Magic / version / BOS constants read back as documented | +| 2 | `test_02_status_register_initial` | After reset: ready=1, busy=done=error=0 | +| 3 | `test_03_write_completes_does_not_hang` | **Production-bug regression.** A single write must complete BVALID | +| 4 | `test_04_write_then_read_back` | Config write round-trips through the readable representation | +| 5 | `test_05_host_toggle_flips_on_each_transaction` | `status[4]` flips on every successful AXI transaction | +| 6 | `test_06_unmapped_address_does_not_hang` | Out-of-decode writes/reads still complete the handshake | +| 7 | `test_07_start_pulse_via_cmd` | Writing bit 0 of `CMD` fires `start_pulse`, FSM leaves READY | + +Every test imposes a hard cocotb timeout, so a hang in the DUT becomes a +clean test failure rather than an infinite simulation. + +## Caveats + +- `test_07` exercises the full DUT including the unmodified TALOS-V2 + core. If your simulator can't elaborate a core file, switch to + Verilator (`make SIM=verilator`) before debugging the test itself. +- The `.hex` weight files under `hw/ip/` are loaded by `$readmemh` paths + inside the core. The Makefile passes `-I$(HW_ROOT)/ip` for resolution; + if a child changes its readmemh search expectation you may need to add + another `-I`. +- The XDC, BD, and PS7 are intentionally NOT in the loop. Pin-level / BD + wiring failures still need an on-board test. diff --git a/pynq/hw/sim/cocotb/test_microgpt_axi.py b/pynq/hw/sim/cocotb/test_microgpt_axi.py new file mode 100644 index 0000000..239805f --- /dev/null +++ b/pynq/hw/sim/cocotb/test_microgpt_axi.py @@ -0,0 +1,345 @@ +"""Cocotb tests for the microgpt_pynq_top AXI4-Lite slave wrapper. + +Built specifically to catch the production write-path bug that wedged the +Zynq PS bus on the deployed bitstream: the slave kept AWREADY/WREADY high +in idle, the Zynq M_AXI_GP0 master saw both handshakes complete on +separate cycles, but the slave never latched the address and never +asserted BVALID. + +Each test imposes a hard cocotb timeout, so a hang in the DUT becomes a +clean test failure rather than an infinite simulation. +""" + +import cocotb +from cocotb.clock import Clock +from cocotb.triggers import ClockCycles +from cocotbext.axi import AxiLiteBus, AxiLiteMaster + +# 50 MHz, matching FCLK_CLK0 on the deployed board. +CLK_PERIOD_NS = 20 +RESET_CYCLES = 32 +SETTLE_CYCLES = 8 + +# Register map -- byte offsets, mirrors hw/src/top/microgpt_pynq_top.sv. +A_MAGIC = 0x000 +A_VERSION = 0x004 +A_CMD = 0x008 +A_STATUS = 0x00C +A_CONFIG = 0x010 +A_SEED = 0x014 +A_LOGIT_INFO = 0x018 +A_BOS = 0x01C +A_STEP_CFG = 0x020 +A_STEP_TRIG = 0x024 + +# status register field positions +ST_READY_BIT = 0 +ST_BUSY_BIT = 1 +ST_DONE_BIT = 2 +ST_ERROR_BIT = 3 +ST_TOGGLE_BIT = 4 + + +# ----------------------------------------------------------------------------- +# Helpers +# ----------------------------------------------------------------------------- +async def _start_clock(dut): + cocotb.start_soon(Clock(dut.s_axi_aclk, CLK_PERIOD_NS, units="ns").start()) + + +async def _reset(dut): + dut.s_axi_aresetn.value = 0 + await ClockCycles(dut.s_axi_aclk, RESET_CYCLES) + dut.s_axi_aresetn.value = 1 + await ClockCycles(dut.s_axi_aclk, SETTLE_CYCLES) + + +def _make_master(dut): + return AxiLiteMaster( + AxiLiteBus.from_prefix(dut, "s_axi"), + dut.s_axi_aclk, + dut.s_axi_aresetn, + reset_active_level=False, + ) + + +async def _read32(master, addr): + resp = await master.read(addr, 4) + return int.from_bytes(resp.data, "little") + + +async def _write32(master, addr, value): + await master.write(addr, int(value).to_bytes(4, "little")) + + +# ----------------------------------------------------------------------------- +# Tests +# ----------------------------------------------------------------------------- +@cocotb.test(timeout_time=200, timeout_unit="us") +async def test_01_read_magic_version_bos(dut): + """Constants come back as documented in the register map.""" + await _start_clock(dut) + master = _make_master(dut) + await _reset(dut) + + magic = await _read32(master, A_MAGIC) + assert magic == 0x4D475254, f"magic mismatch: {magic:#010x}" + + version = await _read32(master, A_VERSION) + assert version == 0x00020001, f"version mismatch: {version:#010x}" + + bos = await _read32(master, A_BOS) + assert bos == 0x0000001A, f"bos mismatch: {bos:#010x}" + + +@cocotb.test(timeout_time=200, timeout_unit="us") +async def test_02_status_register_initial(dut): + """After reset: ready=1, busy=0, done=0, error=0.""" + await _start_clock(dut) + master = _make_master(dut) + await _reset(dut) + + s = await _read32(master, A_STATUS) + assert (s >> ST_READY_BIT) & 1 == 1, f"ready bit not set: {s:#010x}" + assert (s >> ST_BUSY_BIT) & 1 == 0, f"busy unexpectedly set: {s:#010x}" + assert (s >> ST_DONE_BIT) & 1 == 0, f"done unexpectedly set: {s:#010x}" + assert (s >> ST_ERROR_BIT) & 1 == 0, f"error unexpectedly set: {s:#010x}" + + +@cocotb.test(timeout_time=50, timeout_unit="us") +async def test_03_write_completes_does_not_hang(dut): + """THE production-bug regression test. + + A single AXI4-Lite write must complete its B-channel handshake within + a bounded number of cycles. If the slave wedges (the bug we fixed), + the underlying cocotbext-axi `master.write` coroutine awaits BVALID + forever and the cocotb timeout above kills the test loudly. + """ + await _start_clock(dut) + master = _make_master(dut) + await _reset(dut) + + await _write32(master, A_CONFIG, 0x01000100) + + +@cocotb.test(timeout_time=200, timeout_unit="us") +async def test_04_write_then_read_back(dut): + """Config write round-trips after a CMD start_pulse commits the stage. + + The wrapper deliberately uses a stage-then-commit pattern: A_CONFIG + writes land in `host_temperature_reg`/`host_max_gen_reg` (shadow), + and the A_CONFIG readback returns the *committed* `temperature_reg`/ + `max_gen_reg` -- which only update when start_pulse fires (bit 0 of + A_CMD). So a write -> immediate read returns the prior committed + values (after reset, that's temp=0x0080, max_gen=0x0F). To verify + the round-trip we have to commit the stage by writing A_CMD=1 first. + """ + await _start_clock(dut) + master = _make_master(dut) + await _reset(dut) + + payload = 0x01000100 # temp = 0x0100, max_gen = 0x01, [7:0] reserved + await _write32(master, A_CONFIG, payload) + # Commit: start_pulse copies host_* shadow into the latched register set. + await _write32(master, A_CMD, 0x00000001) + # The latching happens on the cycle start_pulse is observed by the FSM; + # a few cycles of margin make this read race-free. + await ClockCycles(dut.s_axi_aclk, 4) + + readback = await _read32(master, A_CONFIG) + + # Bottom byte is reserved/zero in the spec; mask before comparing. + assert (readback & 0xFFFFFF00) == (payload & 0xFFFFFF00), ( + f"config readback {readback:#010x} != written {payload:#010x}" + ) + + +@cocotb.test(timeout_time=400, timeout_unit="us") +async def test_05_host_toggle_flips_on_each_transaction(dut): + """status[4] (host_toggle) must flip on every successful AXI transaction + (read OR write). Drive a deliberate mix of reads and writes through the + AXI bus, read STATUS via the bus to sample bit[4], and assert that the + sampled toggle equals the running parity of all completed transactions. + + Why the predicted parity is the right oracle: each AXI transaction flips + host_toggle exactly once. A STATUS read SAMPLES the toggle BEFORE the + flip caused by that read itself, so the sampled value at the k-th + completed STATUS read equals the parity of the k transactions that + completed *before* it (initial toggle = 0). + + The previous test only did back-to-back STATUS reads (parity always + alternates), which would silently miss a regression where WRITE + transactions stop flipping the toggle. The hardware-repro pattern + (write+read interleaved) was misleading because two flips per iter alias + to a constant 0 either way -- this version uses a non-uniform mix of + reads and writes so any drop in the write path's flip is loud. + """ + await _start_clock(dut) + master = _make_master(dut) + await _reset(dut) + + txn_count = 0 # AXI transactions completed so far + + async def status_read_assert(label): + nonlocal txn_count + s = await _read32(master, A_STATUS) + sampled = (s >> ST_TOGGLE_BIT) & 1 + expected = txn_count & 1 + assert sampled == expected, ( + f"[{label}] host_toggle sampled={sampled} expected={expected} " + f"(prior txn_count={txn_count}, status={s:#010x}). " + f"A WRITE that fails to flip the toggle is the most likely cause." + ) + txn_count += 1 + + async def writeable(addr, data): + nonlocal txn_count + await _write32(master, addr, data) + txn_count += 1 + + # Phase A: pure-read parity (regression of the original test_05 spirit). + for i in range(5): + await status_read_assert(f"A.{i} pure-read") + + # Phase B: interleaved reads and writes -- catches the case where writes + # would have stopped flipping the toggle (the misdiagnosed prod bug). + await writeable(A_CONFIG, 0xABCD0000) + await status_read_assert("B.0 read after CONFIG write") + await writeable(A_SEED, 0xDEADBEEF) + await status_read_assert("B.1 read after SEED write") + await writeable(A_STEP_CFG, 0x12340000) + await writeable(A_CONFIG, 0x55667700) + await status_read_assert("B.2 read after two writes") + await status_read_assert("B.3 read-only follow-up") + await writeable(A_SEED, 0xCAFEBABE) + await status_read_assert("B.4 read after another write") + + +@cocotb.test(timeout_time=50, timeout_unit="us") +async def test_06_unmapped_address_does_not_hang(dut): + """Unmapped offsets within the 4 KB window must still complete. + + Either OKAY (silent ignore) or SLVERR is acceptable per AXI -- what + is NEVER acceptable is dropping the response and wedging the master. + cocotbext-axi raises on a non-OKAY response; we accept either. + """ + await _start_clock(dut) + master = _make_master(dut) + await _reset(dut) + + try: + await _write32(master, 0xFFC, 0xDEADBEEF) + except Exception: + # SLVERR surfaces as an exception; that's fine -- the handshake + # still completed (otherwise we'd time out, not raise). + pass + + try: + _ = await _read32(master, 0xFF8) + except Exception: + pass + + +@cocotb.test(timeout_time=2, timeout_unit="ms") +async def test_07_start_pulse_via_cmd(dut): + """Writing bit 0 of CMD (0x008) must move the FSM out of ST_READY. + + The unmodified TALOS-V2 core may take many cycles to actually finish + a generation; this test only verifies the pulse fires and the wrapper + leaves the READY state. A larger window is OK because the cocotb + timeout still bounds the run. + """ + await _start_clock(dut) + master = _make_master(dut) + await _reset(dut) + + s0 = await _read32(master, A_STATUS) + assert (s0 >> ST_READY_BIT) & 1 == 1, ( + f"DUT did not start in READY: {s0:#010x}" + ) + + await _write32(master, A_CMD, 0x00000001) + + saw_departure = False + for _ in range(256): + s = await _read32(master, A_STATUS) + ready = (s >> ST_READY_BIT) & 1 + busy = (s >> ST_BUSY_BIT) & 1 + done = (s >> ST_DONE_BIT) & 1 + if (not ready) or busy or done: + saw_departure = True + break + + assert saw_departure, "FSM never transitioned out of ST_READY after CMD start_pulse" + + +@cocotb.test(timeout_time=2, timeout_unit="ms") +async def test_09_done_irq_pulses_and_clears_on_start(dut): + """done_irq must: + * be 0 right after reset, + * stay 0 while the FSM is busy, + * go to 1 once a generation completes, + * go back to 0 when the next start_pulse fires. + """ + await _start_clock(dut) + master = _make_master(dut) + await _reset(dut) + + assert int(dut.done_irq.value) == 0, "done_irq should be 0 right after reset" + + # Force a fast termination: max_gen=0 makes the FSM go straight to + # ST_DONE with error_reg=1, which triggers irq_pending_reg <= 1. + await _write32(master, A_CONFIG, 0x00800000) + await _write32(master, A_CMD, 0x00000001) + + # Within a small window the irq line must come up. + armed = False + for _ in range(64): + await ClockCycles(dut.s_axi_aclk, 4) + if int(dut.done_irq.value): + armed = True + break + assert armed, "done_irq never asserted after generation completed" + + # Read STATUS through the bus to confirm done=1 (this also exercises that + # the irq line stays high across reads). + s = await _read32(master, A_STATUS) + assert (s >> 2) & 1, f"done bit not set in STATUS: {s:#010x}" + assert int(dut.done_irq.value) == 1, "irq line dropped before next start_pulse fired" + + # Next start should clear the irq immediately (in the cycle the FSM sees + # start_pulse). Allow one extra cycle for the synchronous reset. + await _write32(master, A_CMD, 0x00000001) + await ClockCycles(dut.s_axi_aclk, 8) + assert int(dut.done_irq.value) == 0, "done_irq did not clear after next start_pulse" + + +@cocotb.test(timeout_time=80, timeout_unit="ms") +async def test_10_led_heartbeat_toggles(dut): + """led_heartbeat must blink autonomously regardless of AXI activity. + + The Makefile overrides HEARTBEAT_BITS to a small value for sim so the + counter MSB toggles every few microseconds; we sample the led_heartbeat + output port over up to 70 ms simulated time and assert it took both + values at least once. led_heartbeat is a primary output port, not an + internal reg, so this stays consistent with the "no peeking at internal + regs" rule the AXI tests follow. + """ + await _start_clock(dut) + await _reset(dut) + + seen = {int(dut.led_heartbeat.value)} + # Poll in 1 ms chunks (50 000 cycles @ 50 MHz) for up to ~70 ms. + for _ in range(70): + if len(seen) >= 2: + break + await ClockCycles(dut.s_axi_aclk, 50_000) + seen.add(int(dut.led_heartbeat.value)) + + assert len(seen) >= 2, ( + f"led_heartbeat did not toggle within 70 ms simulated time " + f"(only saw value(s) {seen}). Either the heartbeat counter is " + f"stuck (held in reset / optimised away) or HEARTBEAT_BITS was " + f"not overridden by the Makefile -- production HEARTBEAT_BITS=26 " + f"would need ~670 ms of sim time to flip." + ) diff --git a/pynq/hw/src/.gitkeep b/pynq/hw/src/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/hw/src/core/include/microgpt_exact_core_math.svh b/pynq/hw/src/core/include/microgpt_exact_core_math.svh new file mode 100644 index 0000000..c7b7492 --- /dev/null +++ b/pynq/hw/src/core/include/microgpt_exact_core_math.svh @@ -0,0 +1,117 @@ +function signed [15:0] sat16; + input signed [63:0] value; + begin + if (value > 64'sd32767) + sat16 = 16'sd32767; + else if (value < -64'sd32768) + sat16 = 16'sh8000; + else + sat16 = value[15:0]; + end +endfunction + +function signed [15:0] mul_q12; + input signed [15:0] a; + input signed [15:0] b; + reg signed [63:0] p; + begin + p = $signed(a) * $signed(b); + mul_q12 = sat16(p >>> FRAC_BITS); + end +endfunction + +function [31:0] xorshift32; + input [31:0] value; + reg [31:0] x; + begin + x = value; + x = x ^ (x << 13); + x = x ^ (x >> 17); + x = x ^ (x << 5); + xorshift32 = x; + end +endfunction + +function [31:0] isqrt64; + input [63:0] value; + reg [65:0] rem; + reg [32:0] root; + reg [33:0] cand; + integer bit_idx; + begin + rem = 66'd0; + root = 33'd0; + for (bit_idx = 31; bit_idx >= 0; bit_idx = bit_idx - 1) begin + rem = (rem << 2) | ((value >> (bit_idx * 2)) & 64'd3); + root = root << 1; + cand = (root << 1) | 34'd1; + if (rem >= cand) begin + rem = rem - cand; + root = root + 33'd1; + end + end + isqrt64 = root[31:0]; + end +endfunction + +function signed [15:0] rms_scale_from_sum; + input signed [63:0] sumsq; + reg [63:0] ms_q12; + reg [31:0] denom_q12; + reg [63:0] scale_q12; + begin + ms_q12 = (sumsq / EMBED_DIM) + 64'd1; + denom_q12 = isqrt64(ms_q12 * SCALE); + if (denom_q12 == 0) + scale_q12 = 64'd32767; + else + scale_q12 = (64'd4096 * 64'd4096) / denom_q12; + rms_scale_from_sum = sat16(scale_q12); + end +endfunction + +function [31:0] exp_neg_q12; + input signed [31:0] delta_q12; + reg [5:0] index; + begin + if (delta_q12 >= 0) begin + exp_neg_q12 = 32'd4096; + end else begin + index = ((-delta_q12) + 32'sd511) >>> 10; + case (index) + 6'd0: exp_neg_q12 = 32'd4096; + 6'd1: exp_neg_q12 = 32'd3189; + 6'd2: exp_neg_q12 = 32'd2484; + 6'd3: exp_neg_q12 = 32'd1935; + 6'd4: exp_neg_q12 = 32'd1507; + 6'd5: exp_neg_q12 = 32'd1174; + 6'd6: exp_neg_q12 = 32'd914; + 6'd7: exp_neg_q12 = 32'd712; + 6'd8: exp_neg_q12 = 32'd555; + 6'd9: exp_neg_q12 = 32'd432; + 6'd10: exp_neg_q12 = 32'd337; + 6'd11: exp_neg_q12 = 32'd262; + 6'd12: exp_neg_q12 = 32'd204; + 6'd13: exp_neg_q12 = 32'd159; + 6'd14: exp_neg_q12 = 32'd124; + 6'd15: exp_neg_q12 = 32'd97; + 6'd16: exp_neg_q12 = 32'd75; + 6'd17: exp_neg_q12 = 32'd59; + 6'd18: exp_neg_q12 = 32'd46; + 6'd19: exp_neg_q12 = 32'd36; + 6'd20: exp_neg_q12 = 32'd28; + 6'd21: exp_neg_q12 = 32'd22; + 6'd22: exp_neg_q12 = 32'd17; + 6'd23: exp_neg_q12 = 32'd13; + 6'd24: exp_neg_q12 = 32'd10; + 6'd25: exp_neg_q12 = 32'd8; + 6'd26: exp_neg_q12 = 32'd6; + 6'd27: exp_neg_q12 = 32'd5; + 6'd28: exp_neg_q12 = 32'd4; + 6'd29: exp_neg_q12 = 32'd3; + 6'd30: exp_neg_q12 = 32'd2; + default: exp_neg_q12 = 32'd1; + endcase + end + end +endfunction diff --git a/pynq/hw/src/core/include/microgpt_exact_core_params.svh b/pynq/hw/src/core/include/microgpt_exact_core_params.svh new file mode 100644 index 0000000..1a848fc --- /dev/null +++ b/pynq/hw/src/core/include/microgpt_exact_core_params.svh @@ -0,0 +1,52 @@ +localparam integer EMBED_DIM = 16; +localparam integer VOCAB_SIZE = 27; +localparam integer MLP_DIM = 64; +localparam integer N_HEAD = 4; +localparam integer HEAD_DIM = 4; +localparam integer FRAC_BITS = 12; +localparam integer SCALE = 1 << FRAC_BITS; +localparam integer TILE_ROWS = 16; +localparam integer LAST_EMBED_ROW_TILE = EMBED_DIM - TILE_ROWS; +localparam integer LAST_MLP_ROW_TILE = MLP_DIM - TILE_ROWS; +localparam integer LAST_VOCAB_ROW_TILE = ((VOCAB_SIZE - 1) / TILE_ROWS) * TILE_ROWS; +localparam [6:0] TILE_ROW_STEP = 7'd16; + +localparam [5:0] + ST_IDLE = 6'd0, + ST_LOAD_X = 6'd1, + ST_RMS0_SUM = 6'd2, + ST_RMS0_APPLY = 6'd3, + ST_ATTN_SAVE_RES = 6'd4, + ST_ATTN_RMS_SUM = 6'd5, + ST_ATTN_RMS_APPLY = 6'd6, + ST_Q_LINEAR = 6'd7, + ST_K_LINEAR = 6'd8, + ST_V_LINEAR = 6'd9, + ST_CACHE_QKV = 6'd10, + ST_ATTN_DOT = 6'd11, + ST_ATTN_SOFT = 6'd12, + ST_ATTN_WO = 6'd13, + ST_ATTN_ADD = 6'd14, + ST_MLP_SAVE_RES = 6'd15, + ST_MLP_RMS_SUM = 6'd16, + ST_MLP_RMS_APPLY = 6'd17, + ST_FC1 = 6'd18, + ST_FC2 = 6'd19, + ST_MLP_ADD = 6'd20, + ST_LM_HEAD = 6'd21, + ST_SAMPLE = 6'd22, + ST_DONE = 6'd23, + ST_ATTN_SUM = 6'd24, + ST_ATTN_WEIGHT = 6'd25, + ST_ATTN_MAX = 6'd26, + ST_SAMPLE_MAX = 6'd27, + ST_RMS0_WAIT = 6'd30, + ST_ATTN_RMS_WAIT = 6'd31, + ST_MLP_RMS_WAIT = 6'd32, + ST_ATTN_DIV_WAIT = 6'd33, + ST_SAMPLE_SCALE = 6'd35, + ST_ATTN_DIV_PREP = 6'd36, + ST_LM_HEAD_REDUCE = 6'd37, + ST_LM_HEAD_FINISH = 6'd38, + ST_ATTN_WEIGHT_ACC = 6'd39, + ST_ATTN_DOT_COMMIT = 6'd40; diff --git a/pynq/hw/src/core/include/microgpt_exact_core_rom_init.svh b/pynq/hw/src/core/include/microgpt_exact_core_rom_init.svh new file mode 100644 index 0000000..978bc8d --- /dev/null +++ b/pynq/hw/src/core/include/microgpt_exact_core_rom_init.svh @@ -0,0 +1,14 @@ +// PYNQ-Z2 port deviation: original DE1 paths "generated/.hex" replaced +// with bare filenames so Vivado resolves them via the hw/ip include search +// path (set by build.tcl). Functional contents are unchanged. +initial begin + $readmemh("wte_q12.hex", wte_rom); + $readmemh("wpe_q12.hex", wpe_rom); + $readmemh("lm_head_q12.hex", lm_head_rom); + $readmemh("layer0_attn_wq_q12.hex", attn_wq_rom); + $readmemh("layer0_attn_wk_q12.hex", attn_wk_rom); + $readmemh("layer0_attn_wv_q12.hex", attn_wv_rom); + $readmemh("layer0_attn_wo_q12.hex", attn_wo_rom); + $readmemh("layer0_mlp_fc1_q12.hex", mlp_fc1_rom); + $readmemh("layer0_mlp_fc2_q12.hex", mlp_fc2_rom); +end diff --git a/pynq/hw/src/core/matrixmul_unit.sv b/pynq/hw/src/core/matrixmul_unit.sv new file mode 100644 index 0000000..6c9b427 --- /dev/null +++ b/pynq/hw/src/core/matrixmul_unit.sv @@ -0,0 +1,132 @@ +module matrixmul_unit #( + parameter int M = 4, + parameter int DATA_WIDTH = 32, + parameter int ACC_WIDTH = 64 +) ( + input logic clk, + input logic rst_n, + input logic en, + input logic start, + input logic signed [(M*M*DATA_WIDTH)-1:0] matrix_a_flat, + input logic signed [(M*M*DATA_WIDTH)-1:0] matrix_b_flat, + output logic busy, + output logic done, + output logic signed [(M*M*ACC_WIDTH)-1:0] matrix_c_flat +); + + localparam int TOTAL_CYCLES = (3 * M) - 2; + localparam int COUNTER_W = (TOTAL_CYCLES <= 1) ? 1 : $clog2(TOTAL_CYCLES + 1); + + logic [COUNTER_W-1:0] cycle_count; + logic pe_clear; + + logic signed [DATA_WIDTH-1:0] a_mem [0:M-1][0:M-1]; + logic signed [DATA_WIDTH-1:0] b_mem [0:M-1][0:M-1]; + logic signed [DATA_WIDTH-1:0] a_inject [0:M-1]; + logic signed [DATA_WIDTH-1:0] b_inject [0:M-1]; + + logic signed [DATA_WIDTH-1:0] pe_a_out [0:M-1][0:M-1]; + logic signed [DATA_WIDTH-1:0] pe_b_out [0:M-1][0:M-1]; + logic signed [ACC_WIDTH-1:0] pe_acc_out [0:M-1][0:M-1]; + + always_comb begin + for (int i_c = 0; i_c < M; i_c = i_c + 1) begin + int k_a; + k_a = cycle_count - i_c; + if (busy && (k_a >= 0) && (k_a < M)) begin + a_inject[i_c] = a_mem[i_c][k_a]; + end else begin + a_inject[i_c] = '0; + end + end + + for (int j_c = 0; j_c < M; j_c = j_c + 1) begin + int k_b; + k_b = cycle_count - j_c; + if (busy && (k_b >= 0) && (k_b < M)) begin + b_inject[j_c] = b_mem[k_b][j_c]; + end else begin + b_inject[j_c] = '0; + end + end + end + + genvar r; + genvar c; + generate + for (r = 0; r < M; r = r + 1) begin : GEN_ROW + for (c = 0; c < M; c = c + 1) begin : GEN_COL + logic signed [DATA_WIDTH-1:0] a_in_wire; + logic signed [DATA_WIDTH-1:0] b_in_wire; + + assign a_in_wire = (c == 0) ? a_inject[r] : pe_a_out[r][c-1]; + assign b_in_wire = (r == 0) ? b_inject[c] : pe_b_out[r-1][c]; + + processing_element #( + .DATA_WIDTH(DATA_WIDTH), + .ACC_WIDTH(ACC_WIDTH) + ) u_pe ( + .clk(clk), + .rst_n(rst_n), + .en(en), + .clear(pe_clear), + .a_in(a_in_wire), + .b_in(b_in_wire), + .a_out(pe_a_out[r][c]), + .b_out(pe_b_out[r][c]), + .acc_out(pe_acc_out[r][c]) + ); + end + end + endgenerate + + always_ff @(posedge clk or negedge rst_n) begin + if (!rst_n) begin + busy <= 1'b0; + done <= 1'b0; + pe_clear <= 1'b0; + cycle_count <= '0; + matrix_c_flat <= '0; + for (int i_s = 0; i_s < M; i_s = i_s + 1) begin + for (int j_s = 0; j_s < M; j_s = j_s + 1) begin + a_mem[i_s][j_s] <= '0; + b_mem[i_s][j_s] <= '0; + end + end + end else begin + pe_clear <= 1'b0; + + if (en) begin + if (start && !busy) begin + busy <= 1'b1; + done <= 1'b0; + pe_clear <= 1'b1; + cycle_count <= '0; + matrix_c_flat <= '0; + + for (int i_l = 0; i_l < M; i_l = i_l + 1) begin + for (int j_l = 0; j_l < M; j_l = j_l + 1) begin + a_mem[i_l][j_l] <= matrix_a_flat[((i_l*M + j_l)*DATA_WIDTH) +: DATA_WIDTH]; + b_mem[i_l][j_l] <= matrix_b_flat[((i_l*M + j_l)*DATA_WIDTH) +: DATA_WIDTH]; + end + end + end else if (busy) begin + if (pe_clear) begin + pe_clear <= 1'b0; + end else if (cycle_count < TOTAL_CYCLES[COUNTER_W-1:0]) begin + cycle_count <= cycle_count + 1'b1; + end else begin + busy <= 1'b0; + done <= 1'b1; + for (int i_o = 0; i_o < M; i_o = i_o + 1) begin + for (int j_o = 0; j_o < M; j_o = j_o + 1) begin + matrix_c_flat[((i_o*M + j_o)*ACC_WIDTH) +: ACC_WIDTH] <= pe_acc_out[i_o][j_o]; + end + end + end + end + end + end + end + +endmodule diff --git a/pynq/hw/src/core/microgpt_categorical_sampler.sv b/pynq/hw/src/core/microgpt_categorical_sampler.sv new file mode 100644 index 0000000..71c75cc --- /dev/null +++ b/pynq/hw/src/core/microgpt_categorical_sampler.sv @@ -0,0 +1,287 @@ +module microgpt_categorical_sampler #( + parameter integer VOCAB_SIZE = 27 +) ( + input wire clk, + input wire resetn, + input wire start, + input wire [15:0] temperature_q8_8, + input wire [31:0] rng_state, + input wire [7:0] argmax_token, + input wire signed [15:0] top_logit_q12, + input wire signed [(VOCAB_SIZE*16)-1:0] logits_flat, + output reg busy, + output reg done, + output reg [7:0] next_token +); + +localparam [3:0] + ST_IDLE = 4'd0, + ST_SUM = 4'd1, + ST_SUM_WEIGHT = 4'd2, + ST_SUM_ACC = 4'd3, + ST_MIX = 4'd4, + ST_SCALE = 4'd5, + ST_CUT = 4'd6, + ST_PICK = 4'd7, + ST_DONE = 4'd8; + +reg [3:0] state_reg; +reg [6:0] row_reg; +reg [6:0] sum_row_reg; +reg [31:0] weight_sum_reg; +reg [31:0] cut_reg; +reg [31:0] acc_reg; +reg [31:0] mixed_rng_reg; +reg [63:0] scaled_cut_reg; +reg [7:0] choice_reg; +reg found_reg; +reg [31:0] sample_weight [0:VOCAB_SIZE-1]; +reg [31:0] weight_pipe_reg; +reg [7:0] fine_index_reg; +reg delta_nonneg_reg; +reg sum_last_reg; + +reg signed [31:0] delta_tmp; +reg [31:0] weight_tmp; +reg [31:0] sum_tmp; +reg [31:0] acc_tmp; +reg [7:0] choice_tmp; +reg found_tmp; +reg [5:0] coarse_index_tmp; +reg [1:0] frac_index_tmp; +reg [31:0] w0_tmp; +reg [31:0] w1_tmp; +reg [31:0] diff_tmp; +integer i; + +wire signed [15:0] current_logit = + logits_flat[(row_reg * 16) +: 16]; + +function automatic signed [31:0] apply_temperature_delta; + input signed [31:0] delta_q12; + input [15:0] temp_q8_8; + begin + apply_temperature_delta = delta_q12; + if (temp_q8_8 <= 16'd128) + apply_temperature_delta = delta_q12 <<< 1; + else if (temp_q8_8 > 16'd256 && temp_q8_8 <= 16'd512) + apply_temperature_delta = delta_q12 >>> 1; + else if (temp_q8_8 > 16'd512) + apply_temperature_delta = delta_q12 >>> 2; + end +endfunction + +function automatic [31:0] exp_neg_coarse_q12; + input [5:0] index; + begin + case (index) + 6'd0: exp_neg_coarse_q12 = 32'd4096; + 6'd1: exp_neg_coarse_q12 = 32'd3189; + 6'd2: exp_neg_coarse_q12 = 32'd2484; + 6'd3: exp_neg_coarse_q12 = 32'd1935; + 6'd4: exp_neg_coarse_q12 = 32'd1507; + 6'd5: exp_neg_coarse_q12 = 32'd1174; + 6'd6: exp_neg_coarse_q12 = 32'd914; + 6'd7: exp_neg_coarse_q12 = 32'd712; + 6'd8: exp_neg_coarse_q12 = 32'd555; + 6'd9: exp_neg_coarse_q12 = 32'd432; + 6'd10: exp_neg_coarse_q12 = 32'd337; + 6'd11: exp_neg_coarse_q12 = 32'd262; + 6'd12: exp_neg_coarse_q12 = 32'd204; + 6'd13: exp_neg_coarse_q12 = 32'd159; + 6'd14: exp_neg_coarse_q12 = 32'd124; + 6'd15: exp_neg_coarse_q12 = 32'd97; + 6'd16: exp_neg_coarse_q12 = 32'd75; + 6'd17: exp_neg_coarse_q12 = 32'd59; + 6'd18: exp_neg_coarse_q12 = 32'd46; + 6'd19: exp_neg_coarse_q12 = 32'd36; + 6'd20: exp_neg_coarse_q12 = 32'd28; + 6'd21: exp_neg_coarse_q12 = 32'd22; + 6'd22: exp_neg_coarse_q12 = 32'd17; + 6'd23: exp_neg_coarse_q12 = 32'd13; + 6'd24: exp_neg_coarse_q12 = 32'd10; + 6'd25: exp_neg_coarse_q12 = 32'd8; + 6'd26: exp_neg_coarse_q12 = 32'd6; + 6'd27: exp_neg_coarse_q12 = 32'd5; + 6'd28: exp_neg_coarse_q12 = 32'd4; + 6'd29: exp_neg_coarse_q12 = 32'd3; + 6'd30: exp_neg_coarse_q12 = 32'd2; + default: exp_neg_coarse_q12 = 32'd1; + endcase + end +endfunction + +function automatic [31:0] exp_neg_sample_q12; + input signed [31:0] delta_q12; + reg [7:0] fine_index; + reg [5:0] coarse_index; + reg [1:0] frac_index; + reg [31:0] w0; + reg [31:0] w1; + reg [31:0] diff; + begin + if (delta_q12 >= 0) begin + exp_neg_sample_q12 = 32'd4096; + end else begin + fine_index = ((-delta_q12) + 32'sd127) >>> 8; + coarse_index = fine_index[7:2]; + frac_index = fine_index[1:0]; + w0 = exp_neg_coarse_q12(coarse_index); + w1 = exp_neg_coarse_q12(coarse_index + 6'd1); + diff = w0 - w1; + exp_neg_sample_q12 = w0 - ((diff * {30'd0, frac_index}) >>> 2); + if (exp_neg_sample_q12 == 32'd0) + exp_neg_sample_q12 = 32'd1; + end + end +endfunction + +always @(posedge clk) begin + if (!resetn) begin + state_reg <= ST_IDLE; + row_reg <= 7'd0; + sum_row_reg <= 7'd0; + weight_sum_reg <= 32'd0; + cut_reg <= 32'd0; + acc_reg <= 32'd0; + mixed_rng_reg <= 32'd0; + scaled_cut_reg <= 64'd0; + choice_reg <= 8'd0; + found_reg <= 1'b0; + weight_pipe_reg <= 32'd0; + fine_index_reg <= 8'd0; + delta_nonneg_reg <= 1'b0; + sum_last_reg <= 1'b0; + next_token <= 8'd0; + busy <= 1'b0; + done <= 1'b0; + for (i = 0; i < VOCAB_SIZE; i = i + 1) + sample_weight[i] <= 32'd0; + end else begin + done <= 1'b0; + + case (state_reg) + ST_IDLE: begin + busy <= 1'b0; + if (start) begin + row_reg <= 7'd0; + sum_row_reg <= 7'd0; + weight_sum_reg <= 32'd0; + cut_reg <= 32'd0; + acc_reg <= 32'd0; + choice_reg <= argmax_token; + found_reg <= 1'b0; + weight_pipe_reg <= 32'd0; + fine_index_reg <= 8'd0; + delta_nonneg_reg <= 1'b0; + sum_last_reg <= 1'b0; + busy <= 1'b1; + state_reg <= ST_SUM; + end + end + + ST_SUM: begin + delta_tmp = apply_temperature_delta( + $signed(current_logit) - $signed(top_logit_q12), + temperature_q8_8 + ); + sum_row_reg <= row_reg; + sum_last_reg <= (row_reg == VOCAB_SIZE - 1); + if (delta_tmp >= 0) begin + delta_nonneg_reg <= 1'b1; + fine_index_reg <= 8'd0; + end else begin + delta_nonneg_reg <= 1'b0; + fine_index_reg <= ((-delta_tmp) + 32'sd127) >>> 8; + end + state_reg <= ST_SUM_WEIGHT; + end + + ST_SUM_WEIGHT: begin + if (delta_nonneg_reg) begin + weight_pipe_reg <= 32'd4096; + end else begin + coarse_index_tmp = fine_index_reg[7:2]; + frac_index_tmp = fine_index_reg[1:0]; + w0_tmp = exp_neg_coarse_q12(coarse_index_tmp); + w1_tmp = exp_neg_coarse_q12(coarse_index_tmp + 6'd1); + diff_tmp = w0_tmp - w1_tmp; + weight_pipe_reg <= w0_tmp - ((diff_tmp * {30'd0, frac_index_tmp}) >>> 2); + if ((w0_tmp - ((diff_tmp * {30'd0, frac_index_tmp}) >>> 2)) == 32'd0) + weight_pipe_reg <= 32'd1; + end + state_reg <= ST_SUM_ACC; + end + + ST_SUM_ACC: begin + weight_tmp = weight_pipe_reg; + sum_tmp = weight_sum_reg + weight_tmp; + sample_weight[sum_row_reg] <= weight_tmp; + weight_sum_reg <= sum_tmp; + if (sum_last_reg) begin + if (sum_tmp == 32'd0) + weight_sum_reg <= 32'd1; + row_reg <= 7'd0; + state_reg <= ST_MIX; + end else begin + row_reg <= sum_row_reg + 7'd1; + state_reg <= ST_SUM; + end + end + + ST_MIX: begin + mixed_rng_reg <= rng_state * 32'h000149FB; + state_reg <= ST_SCALE; + end + + ST_SCALE: begin + scaled_cut_reg <= {40'd0, mixed_rng_reg[31:8]} * {32'd0, weight_sum_reg}; + state_reg <= ST_CUT; + end + + ST_CUT: begin + cut_reg <= scaled_cut_reg[55:24]; + if (scaled_cut_reg[55:24] >= weight_sum_reg) + cut_reg <= weight_sum_reg - 32'd1; + acc_reg <= 32'd0; + choice_reg <= argmax_token; + found_reg <= 1'b0; + row_reg <= 7'd0; + state_reg <= ST_PICK; + end + + ST_PICK: begin + weight_tmp = sample_weight[row_reg]; + acc_tmp = acc_reg + weight_tmp; + choice_tmp = choice_reg; + found_tmp = found_reg; + if (!found_tmp && (acc_tmp > cut_reg)) begin + choice_tmp = {1'b0, row_reg}; + found_tmp = 1'b1; + end + acc_reg <= acc_tmp; + choice_reg <= choice_tmp; + found_reg <= found_tmp; + if (row_reg == VOCAB_SIZE - 1) begin + next_token <= choice_tmp; + state_reg <= ST_DONE; + end else begin + row_reg <= row_reg + 7'd1; + end + end + + ST_DONE: begin + busy <= 1'b0; + done <= 1'b1; + state_reg <= ST_IDLE; + end + + default: begin + state_reg <= ST_IDLE; + busy <= 1'b0; + end + endcase + end +end + +endmodule diff --git a/pynq/hw/src/core/microgpt_exact_core.sv b/pynq/hw/src/core/microgpt_exact_core.sv new file mode 100644 index 0000000..28d5a0f --- /dev/null +++ b/pynq/hw/src/core/microgpt_exact_core.sv @@ -0,0 +1,852 @@ +module microgpt_exact_core ( + input wire clk, + input wire resetn, + input wire start, + input wire clear_cache, + input wire sample_mode, + input wire [15:0] temperature_q8_8, + input wire [31:0] rng_state_in, + input wire [7:0] token_in, + input wire [7:0] pos_in, + output reg busy, + output reg done, + output reg [7:0] next_token, + output reg [7:0] argmax_token, + output reg [31:0] rng_state_out, + output reg signed [15:0] top_logit_q12, + output wire signed [(27*16)-1:0] logits_flat +); + +`include "microgpt_exact_core_params.svh" + +reg [5:0] state_reg; +reg [7:0] token_reg; +reg [7:0] pos_reg; +reg [6:0] row_reg; +reg [6:0] col_reg; +reg [3:0] idx_reg; +reg [3:0] lm_reduce_idx; +reg [1:0] head_reg; +reg [4:0] time_reg; + +reg signed [63:0] acc_reg; +reg signed [63:0] attn_dot_acc_reg; +reg signed [63:0] sumsq_reg; +reg signed [63:0] linear_acc [0:15]; +reg signed [15:0] rms_scale_reg; +reg signed [15:0] attn_max_reg; +reg [31:0] attn_weight_sum_reg; +reg [31:0] attn_weight_reg; +reg rms_start_reg; +reg [63:0] rms_sumsq_reg; +reg [3:0] attn_div_start_reg; +reg signed [63:0] attn_div_num_reg [0:3]; +reg [31:0] attn_div_den_reg; +reg signed [63:0] attn_value_acc [0:3]; +reg signed [15:0] attn_value_sample [0:3]; + +reg signed [15:0] x_vec [0:15]; +reg signed [15:0] norm_vec [0:15]; +reg signed [15:0] residual_vec [0:15]; +reg signed [15:0] q_vec [0:15]; +reg signed [15:0] k_vec [0:15]; +reg signed [15:0] v_vec [0:15]; +reg signed [15:0] x_attn [0:15]; +reg signed [15:0] mlp_vec [0:63]; +reg signed [15:0] logits [0:26]; +reg signed [15:0] lm_tile_logits [0:TILE_ROWS-1]; +reg signed [15:0] attn_scores [0:15]; + +reg signed [15:0] k_cache [0:15][0:15]; +reg signed [15:0] v_cache [0:15][0:15]; + +reg signed [15:0] wte_rom [0:431]; +reg signed [15:0] wpe_rom [0:255]; +reg signed [15:0] lm_head_rom [0:431]; +reg signed [15:0] attn_wq_rom [0:255]; +reg signed [15:0] attn_wk_rom [0:255]; +reg signed [15:0] attn_wv_rom [0:255]; +reg signed [15:0] attn_wo_rom [0:255]; +reg signed [15:0] mlp_fc1_rom [0:1023]; +reg signed [15:0] mlp_fc2_rom [0:1023]; + +integer i; +integer j; +integer t; +genvar logits_idx; + +reg signed [63:0] acc_next; +reg signed [63:0] prod64; +reg signed [15:0] value16; +reg signed [15:0] max_logit_tmp; +reg signed [15:0] max_score_tmp; +reg signed [31:0] delta_tmp; +reg [31:0] weight_tmp; +reg [7:0] best_token_tmp; +reg systolic_start_reg; +reg sampler_start_reg; + +reg signed [15:0] systolic_vector_value; +reg signed [(TILE_ROWS*16)-1:0] systolic_weights_flat; +wire [4:0] systolic_col_idx; +wire systolic_busy; +wire systolic_done; +wire signed [(TILE_ROWS*64)-1:0] systolic_result_flat; +wire rms_busy; +wire rms_done; +wire signed [15:0] rms_scale_out; +wire [3:0] attn_div_busy; +wire [3:0] attn_div_done; +wire signed [15:0] attn_div_quotient [0:3]; +wire sampler_busy; +wire sampler_done; +wire [7:0] sampler_next_token; + +systolic_matvec16_tile #( + .DATA_WIDTH(16), + .ACC_WIDTH(64), + .LANES(TILE_ROWS) +) linear_tile_inst ( + .clk(clk), + .resetn(resetn), + .start(systolic_start_reg), + .vector_value(systolic_vector_value), + .weights_flat(systolic_weights_flat), + .col_idx(systolic_col_idx), + .busy(systolic_busy), + .done(systolic_done), + .result_flat(systolic_result_flat) +); + +rms_scale_engine rms_scale_inst ( + .clk(clk), + .resetn(resetn), + .start(rms_start_reg), + .sumsq(rms_sumsq_reg), + .busy(rms_busy), + .done(rms_done), + .scale_q12(rms_scale_out) +); + +genvar div_idx; +generate + for (div_idx = 0; div_idx < HEAD_DIM; div_idx = div_idx + 1) begin : GEN_ATTN_DIV + sat_div16_engine attn_div_inst ( + .clk(clk), + .resetn(resetn), + .start(attn_div_start_reg[div_idx]), + .numerator(attn_div_num_reg[div_idx]), + .denominator(attn_div_den_reg), + .busy(attn_div_busy[div_idx]), + .done(attn_div_done[div_idx]), + .quotient(attn_div_quotient[div_idx]) + ); + end +endgenerate + + +microgpt_categorical_sampler #( + .VOCAB_SIZE(VOCAB_SIZE) +) sampler_inst ( + .clk(clk), + .resetn(resetn), + .start(sampler_start_reg), + .temperature_q8_8(temperature_q8_8), + .rng_state(rng_state_out), + .argmax_token(argmax_token), + .top_logit_q12(top_logit_q12), + .logits_flat(logits_flat), + .busy(sampler_busy), + .done(sampler_done), + .next_token(sampler_next_token) +); + +`include "microgpt_exact_core_rom_init.svh" + +always @(*) begin + systolic_vector_value = 16'sd0; + systolic_weights_flat = '0; + + case (state_reg) + ST_Q_LINEAR: begin + systolic_vector_value = norm_vec[systolic_col_idx]; + for (i = 0; i < TILE_ROWS; i = i + 1) + systolic_weights_flat[(i*16) +: 16] = attn_wq_rom[(row_reg + i) * EMBED_DIM + systolic_col_idx]; + end + + ST_K_LINEAR: begin + systolic_vector_value = norm_vec[systolic_col_idx]; + for (i = 0; i < TILE_ROWS; i = i + 1) + systolic_weights_flat[(i*16) +: 16] = attn_wk_rom[(row_reg + i) * EMBED_DIM + systolic_col_idx]; + end + + ST_V_LINEAR: begin + systolic_vector_value = norm_vec[systolic_col_idx]; + for (i = 0; i < TILE_ROWS; i = i + 1) + systolic_weights_flat[(i*16) +: 16] = attn_wv_rom[(row_reg + i) * EMBED_DIM + systolic_col_idx]; + end + + ST_ATTN_WO: begin + systolic_vector_value = x_attn[systolic_col_idx]; + for (i = 0; i < TILE_ROWS; i = i + 1) + systolic_weights_flat[(i*16) +: 16] = attn_wo_rom[(row_reg + i) * EMBED_DIM + systolic_col_idx]; + end + + ST_FC1: begin + systolic_vector_value = norm_vec[systolic_col_idx]; + for (i = 0; i < TILE_ROWS; i = i + 1) + systolic_weights_flat[(i*16) +: 16] = mlp_fc1_rom[((row_reg + i) * EMBED_DIM) + systolic_col_idx]; + end + + ST_FC2: begin + systolic_vector_value = mlp_vec[col_reg + systolic_col_idx]; + for (i = 0; i < TILE_ROWS; i = i + 1) + systolic_weights_flat[(i*16) +: 16] = mlp_fc2_rom[((row_reg + i) * MLP_DIM) + col_reg + systolic_col_idx]; + end + + ST_LM_HEAD: begin + systolic_vector_value = x_vec[systolic_col_idx]; + for (i = 0; i < TILE_ROWS; i = i + 1) begin + if ((row_reg + i) < VOCAB_SIZE) + systolic_weights_flat[(i*16) +: 16] = lm_head_rom[((row_reg + i) * EMBED_DIM) + systolic_col_idx]; + else + systolic_weights_flat[(i*16) +: 16] = 16'sd0; + end + end + + default: begin + end + endcase +end + +`include "microgpt_exact_core_math.svh" + +generate + for (logits_idx = 0; logits_idx < VOCAB_SIZE; logits_idx = logits_idx + 1) begin : GEN_LOGITS_FLAT + assign logits_flat[(logits_idx*16) +: 16] = logits[logits_idx]; + end +endgenerate + +always @(posedge clk) begin + if (!resetn) begin + state_reg <= ST_IDLE; + token_reg <= 8'd0; + pos_reg <= 8'd0; + row_reg <= 7'd0; + col_reg <= 7'd0; + idx_reg <= 4'd0; + lm_reduce_idx <= 4'd0; + head_reg <= 2'd0; + time_reg <= 5'd0; + acc_reg <= 64'sd0; + attn_dot_acc_reg <= 64'sd0; + sumsq_reg <= 64'sd0; + systolic_start_reg <= 1'b0; + sampler_start_reg <= 1'b0; + rms_start_reg <= 1'b0; + attn_div_start_reg <= 4'd0; + rms_scale_reg <= 16'sd0; + attn_max_reg <= 16'sd0; + attn_weight_sum_reg <= 32'd0; + attn_weight_reg <= 32'd0; + rms_sumsq_reg <= 64'd0; + attn_div_den_reg <= 32'd0; + busy <= 1'b0; + done <= 1'b0; + next_token <= 8'd0; + argmax_token <= 8'd0; + rng_state_out <= 32'd1; + top_logit_q12 <= 16'sd0; + for (i = 0; i < EMBED_DIM; i = i + 1) begin + x_vec[i] <= 16'sd0; + norm_vec[i] <= 16'sd0; + residual_vec[i] <= 16'sd0; + q_vec[i] <= 16'sd0; + k_vec[i] <= 16'sd0; + v_vec[i] <= 16'sd0; + x_attn[i] <= 16'sd0; + end + for (i = 0; i < MLP_DIM; i = i + 1) + mlp_vec[i] <= 16'sd0; + for (i = 0; i < VOCAB_SIZE; i = i + 1) + logits[i] <= 16'sd0; + for (i = 0; i < TILE_ROWS; i = i + 1) + lm_tile_logits[i] <= 16'sd0; + for (i = 0; i < 16; i = i + 1) + linear_acc[i] <= 64'sd0; + for (i = 0; i < HEAD_DIM; i = i + 1) begin + attn_div_num_reg[i] <= 64'sd0; + attn_value_acc[i] <= 64'sd0; + attn_value_sample[i] <= 16'sd0; + end + for (i = 0; i < 16; i = i + 1) begin + attn_scores[i] <= 16'sd0; + for (j = 0; j < EMBED_DIM; j = j + 1) begin + k_cache[i][j] <= 16'sd0; + v_cache[i][j] <= 16'sd0; + end + end + end else begin + done <= 1'b0; + systolic_start_reg <= 1'b0; + sampler_start_reg <= 1'b0; + rms_start_reg <= 1'b0; + attn_div_start_reg <= 4'd0; + case (state_reg) + ST_IDLE: begin + busy <= 1'b0; + if (start) begin + token_reg <= token_in; + pos_reg <= pos_in; + idx_reg <= 4'd0; + row_reg <= 7'd0; + col_reg <= 7'd0; + head_reg <= 2'd0; + time_reg <= 5'd0; + acc_reg <= 64'sd0; + sumsq_reg <= 64'sd0; + if (sample_mode) + rng_state_out <= xorshift32(rng_state_in); + else + rng_state_out <= rng_state_in; + if (clear_cache) begin + for (i = 0; i < 16; i = i + 1) begin + for (j = 0; j < EMBED_DIM; j = j + 1) begin + k_cache[i][j] <= 16'sd0; + v_cache[i][j] <= 16'sd0; + end + end + end + busy <= 1'b1; + state_reg <= ST_LOAD_X; + end + end + + ST_LOAD_X: begin + value16 = sat16($signed(wte_rom[token_reg * EMBED_DIM + idx_reg]) + $signed(wpe_rom[pos_reg * EMBED_DIM + idx_reg])); + prod64 = $signed(value16) * $signed(value16); + x_vec[idx_reg] <= value16; + sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + if (idx_reg == EMBED_DIM - 1) begin + idx_reg <= 4'd0; + rms_sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + state_reg <= ST_RMS0_WAIT; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_RMS0_SUM: begin + prod64 = $signed(x_vec[idx_reg]) * $signed(x_vec[idx_reg]); + sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + if (idx_reg == EMBED_DIM - 1) begin + rms_sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + state_reg <= ST_RMS0_WAIT; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_RMS0_WAIT: begin + if (!rms_busy && !rms_done) + rms_start_reg <= 1'b1; + if (rms_done) begin + rms_scale_reg <= rms_scale_out; + idx_reg <= 4'd0; + sumsq_reg <= 64'sd0; + state_reg <= ST_RMS0_APPLY; + end + end + + ST_RMS0_APPLY: begin + value16 = mul_q12(x_vec[idx_reg], rms_scale_reg); + x_vec[idx_reg] <= value16; + residual_vec[idx_reg] <= value16; + if (idx_reg == EMBED_DIM - 1) begin + idx_reg <= 4'd0; + sumsq_reg <= 64'sd0; + state_reg <= ST_ATTN_RMS_SUM; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_ATTN_SAVE_RES: begin + for (i = 0; i < EMBED_DIM; i = i + 1) + residual_vec[i] <= x_vec[i]; + idx_reg <= 4'd0; + sumsq_reg <= 64'sd0; + state_reg <= ST_ATTN_RMS_SUM; + end + + ST_ATTN_RMS_SUM: begin + prod64 = $signed(x_vec[idx_reg]) * $signed(x_vec[idx_reg]); + sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + if (idx_reg == EMBED_DIM - 1) begin + rms_sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + state_reg <= ST_ATTN_RMS_WAIT; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_ATTN_RMS_WAIT: begin + if (!rms_busy && !rms_done) + rms_start_reg <= 1'b1; + if (rms_done) begin + rms_scale_reg <= rms_scale_out; + idx_reg <= 4'd0; + state_reg <= ST_ATTN_RMS_APPLY; + end + end + + ST_ATTN_RMS_APPLY: begin + norm_vec[idx_reg] <= mul_q12(x_vec[idx_reg], rms_scale_reg); + if (idx_reg == EMBED_DIM - 1) begin + row_reg <= 7'd0; + col_reg <= 7'd0; + acc_reg <= 64'sd0; + state_reg <= ST_Q_LINEAR; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_Q_LINEAR: begin + if (!systolic_busy && !systolic_done) + systolic_start_reg <= 1'b1; + if (systolic_done) begin + for (i = 0; i < TILE_ROWS; i = i + 1) + q_vec[i] <= sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + if (row_reg == LAST_EMBED_ROW_TILE) begin + row_reg <= 7'd0; + state_reg <= ST_K_LINEAR; + end else begin + row_reg <= row_reg + TILE_ROW_STEP; + end + end + end + + ST_K_LINEAR: begin + if (!systolic_busy && !systolic_done) + systolic_start_reg <= 1'b1; + if (systolic_done) begin + for (i = 0; i < TILE_ROWS; i = i + 1) + k_vec[i] <= sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + if (row_reg == LAST_EMBED_ROW_TILE) begin + row_reg <= 7'd0; + state_reg <= ST_V_LINEAR; + end else begin + row_reg <= row_reg + TILE_ROW_STEP; + end + end + end + + ST_V_LINEAR: begin + if (!systolic_busy && !systolic_done) + systolic_start_reg <= 1'b1; + if (systolic_done) begin + for (i = 0; i < TILE_ROWS; i = i + 1) + v_vec[i] <= sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + if (row_reg == LAST_EMBED_ROW_TILE) begin + row_reg <= 7'd0; + state_reg <= ST_CACHE_QKV; + end else begin + row_reg <= row_reg + TILE_ROW_STEP; + end + end + end + + ST_CACHE_QKV: begin + for (i = 0; i < EMBED_DIM; i = i + 1) begin + k_cache[pos_reg][i] <= k_vec[i]; + v_cache[pos_reg][i] <= v_vec[i]; + end + head_reg <= 2'd0; + time_reg <= 5'd0; + col_reg <= 7'd0; + acc_reg <= 64'sd0; + state_reg <= ST_ATTN_DOT; + end + + ST_ATTN_DOT: begin + acc_next = 64'sd0 + + ($signed(q_vec[head_reg * HEAD_DIM + 0]) * $signed(k_cache[time_reg][head_reg * HEAD_DIM + 0])) + + ($signed(q_vec[head_reg * HEAD_DIM + 1]) * $signed(k_cache[time_reg][head_reg * HEAD_DIM + 1])) + + ($signed(q_vec[head_reg * HEAD_DIM + 2]) * $signed(k_cache[time_reg][head_reg * HEAD_DIM + 2])) + + ($signed(q_vec[head_reg * HEAD_DIM + 3]) * $signed(k_cache[time_reg][head_reg * HEAD_DIM + 3])); + attn_dot_acc_reg <= acc_next; + state_reg <= ST_ATTN_DOT_COMMIT; + end + + ST_ATTN_DOT_COMMIT: begin + value16 = sat16((attn_dot_acc_reg >>> FRAC_BITS) >>> 1); + attn_scores[time_reg] <= value16; + if (time_reg == 5'd0 || value16 > attn_max_reg) + attn_max_reg <= value16; + acc_reg <= 64'sd0; + col_reg <= 7'd0; + if (time_reg == pos_reg[4:0]) begin + attn_weight_sum_reg <= 32'd0; + time_reg <= 5'd0; + state_reg <= ST_ATTN_SUM; + end else begin + time_reg <= time_reg + 5'd1; + state_reg <= ST_ATTN_DOT; + end + end + + ST_ATTN_SOFT: begin + attn_max_reg <= attn_scores[0]; + if (pos_reg == 8'd0) begin + attn_weight_sum_reg <= 32'd0; + time_reg <= 5'd0; + state_reg <= ST_ATTN_SUM; + end else begin + time_reg <= 5'd1; + state_reg <= ST_ATTN_MAX; + end + end + + ST_ATTN_MAX: begin + max_score_tmp = attn_max_reg; + if (attn_scores[time_reg] > max_score_tmp) + max_score_tmp = attn_scores[time_reg]; + attn_max_reg <= max_score_tmp; + if (time_reg == pos_reg[4:0]) begin + attn_weight_sum_reg <= 32'd0; + time_reg <= 5'd0; + state_reg <= ST_ATTN_SUM; + end else begin + time_reg <= time_reg + 5'd1; + end + end + + ST_ATTN_SUM: begin + delta_tmp = $signed(attn_scores[time_reg]) - $signed(attn_max_reg); + weight_tmp = exp_neg_q12(delta_tmp); + attn_weight_sum_reg <= attn_weight_sum_reg + weight_tmp; + if (time_reg == pos_reg[4:0]) begin + if ((attn_weight_sum_reg + weight_tmp) == 32'd0) + attn_weight_sum_reg <= 32'd1; + col_reg <= 7'd0; + time_reg <= 5'd0; + acc_reg <= 64'sd0; + for (i = 0; i < HEAD_DIM; i = i + 1) + attn_value_acc[i] <= 64'sd0; + state_reg <= ST_ATTN_WEIGHT; + end else begin + time_reg <= time_reg + 5'd1; + end + end + + ST_ATTN_WEIGHT: begin + delta_tmp = $signed(attn_scores[time_reg]) - $signed(attn_max_reg); + weight_tmp = exp_neg_q12(delta_tmp); + attn_weight_reg <= weight_tmp; + for (i = 0; i < HEAD_DIM; i = i + 1) + attn_value_sample[i] <= v_cache[time_reg][head_reg * HEAD_DIM + i]; + state_reg <= ST_ATTN_WEIGHT_ACC; + end + + ST_ATTN_WEIGHT_ACC: begin + for (i = 0; i < HEAD_DIM; i = i + 1) begin + attn_value_acc[i] <= attn_value_acc[i] + + ($signed({1'b0, attn_weight_reg[30:0]}) * $signed(attn_value_sample[i])); + end + if (time_reg == pos_reg[4:0]) begin + attn_div_den_reg <= attn_weight_sum_reg; + acc_reg <= 64'sd0; + time_reg <= 5'd0; + state_reg <= ST_ATTN_DIV_PREP; + end else begin + time_reg <= time_reg + 5'd1; + state_reg <= ST_ATTN_WEIGHT; + end + end + + ST_ATTN_DIV_PREP: begin + for (i = 0; i < HEAD_DIM; i = i + 1) + attn_div_num_reg[i] <= attn_value_acc[i]; + state_reg <= ST_ATTN_DIV_WAIT; + end + + ST_ATTN_DIV_WAIT: begin + if (!(|attn_div_busy) && !(|attn_div_done)) + attn_div_start_reg <= 4'b1111; + if (&attn_div_done) begin + for (i = 0; i < HEAD_DIM; i = i + 1) + x_attn[head_reg * HEAD_DIM + i] <= attn_div_quotient[i]; + col_reg <= 7'd0; + if (head_reg == N_HEAD - 1) begin + row_reg <= 7'd0; + state_reg <= ST_ATTN_WO; + end else begin + head_reg <= head_reg + 2'd1; + time_reg <= 5'd0; + state_reg <= ST_ATTN_DOT; + end + end + end + + ST_ATTN_WO: begin + if (!systolic_busy && !systolic_done) + systolic_start_reg <= 1'b1; + if (systolic_done) begin + for (i = 0; i < TILE_ROWS; i = i + 1) + norm_vec[i] <= sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + if (row_reg == LAST_EMBED_ROW_TILE) begin + row_reg <= 7'd0; + idx_reg <= 4'd0; + sumsq_reg <= 64'sd0; + state_reg <= ST_ATTN_ADD; + end else begin + row_reg <= row_reg + TILE_ROW_STEP; + end + end + end + + ST_ATTN_ADD: begin + value16 = sat16($signed(norm_vec[idx_reg]) + $signed(residual_vec[idx_reg])); + prod64 = $signed(value16) * $signed(value16); + x_vec[idx_reg] <= value16; + residual_vec[idx_reg] <= value16; + sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + if (idx_reg == EMBED_DIM - 1) begin + idx_reg <= 4'd0; + rms_sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + state_reg <= ST_MLP_RMS_WAIT; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_MLP_SAVE_RES: begin + for (i = 0; i < EMBED_DIM; i = i + 1) + residual_vec[i] <= x_vec[i]; + idx_reg <= 4'd0; + sumsq_reg <= 64'sd0; + state_reg <= ST_MLP_RMS_SUM; + end + + ST_MLP_RMS_SUM: begin + prod64 = $signed(x_vec[idx_reg]) * $signed(x_vec[idx_reg]); + sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + if (idx_reg == EMBED_DIM - 1) begin + rms_sumsq_reg <= sumsq_reg + (prod64 >>> FRAC_BITS); + state_reg <= ST_MLP_RMS_WAIT; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_MLP_RMS_WAIT: begin + if (!rms_busy && !rms_done) + rms_start_reg <= 1'b1; + if (rms_done) begin + rms_scale_reg <= rms_scale_out; + idx_reg <= 4'd0; + state_reg <= ST_MLP_RMS_APPLY; + end + end + + ST_MLP_RMS_APPLY: begin + norm_vec[idx_reg] <= mul_q12(x_vec[idx_reg], rms_scale_reg); + if (idx_reg == EMBED_DIM - 1) begin + row_reg <= 7'd0; + col_reg <= 7'd0; + for (i = 0; i < 16; i = i + 1) + linear_acc[i] <= 64'sd0; + state_reg <= ST_FC1; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_FC1: begin + if (!systolic_busy && !systolic_done) + systolic_start_reg <= 1'b1; + if (systolic_done) begin + case (row_reg) + 7'd0: begin + for (i = 0; i < TILE_ROWS; i = i + 1) begin + value16 = sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + mlp_vec[i] <= value16[15] ? 16'sd0 : value16; + end + end + 7'd16: begin + for (i = 0; i < TILE_ROWS; i = i + 1) begin + value16 = sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + mlp_vec[16 + i] <= value16[15] ? 16'sd0 : value16; + end + end + 7'd32: begin + for (i = 0; i < TILE_ROWS; i = i + 1) begin + value16 = sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + mlp_vec[32 + i] <= value16[15] ? 16'sd0 : value16; + end + end + default: begin + for (i = 0; i < TILE_ROWS; i = i + 1) begin + value16 = sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + mlp_vec[48 + i] <= value16[15] ? 16'sd0 : value16; + end + end + endcase + if (row_reg == LAST_MLP_ROW_TILE) begin + row_reg <= 7'd0; + col_reg <= 7'd0; + for (i = 0; i < 16; i = i + 1) + linear_acc[i] <= 64'sd0; + state_reg <= ST_FC2; + end else begin + row_reg <= row_reg + TILE_ROW_STEP; + end + end + end + + ST_FC2: begin + if (!systolic_busy && !systolic_done) + systolic_start_reg <= 1'b1; + if (systolic_done) begin + if (col_reg == 7'd48) begin + for (i = 0; i < TILE_ROWS; i = i + 1) begin + acc_next = linear_acc[i] + $signed(systolic_result_flat[(i*64) +: 64]); + norm_vec[i] <= sat16(acc_next >>> FRAC_BITS); + linear_acc[i] <= 64'sd0; + end + if (row_reg == LAST_EMBED_ROW_TILE) begin + row_reg <= 7'd0; + col_reg <= 7'd0; + idx_reg <= 4'd0; + state_reg <= ST_MLP_ADD; + end else begin + row_reg <= row_reg + TILE_ROW_STEP; + col_reg <= 7'd0; + end + end else begin + for (i = 0; i < TILE_ROWS; i = i + 1) + linear_acc[i] <= linear_acc[i] + $signed(systolic_result_flat[(i*64) +: 64]); + col_reg <= col_reg + 7'd16; + end + end + end + + ST_MLP_ADD: begin + x_vec[idx_reg] <= sat16($signed(norm_vec[idx_reg]) + $signed(residual_vec[idx_reg])); + if (idx_reg == EMBED_DIM - 1) begin + row_reg <= 7'd0; + col_reg <= 7'd0; + top_logit_q12 <= 16'sh8000; + argmax_token <= 8'd0; + state_reg <= ST_LM_HEAD; + end else begin + idx_reg <= idx_reg + 4'd1; + end + end + + ST_LM_HEAD: begin + if (!systolic_busy && !systolic_done) + systolic_start_reg <= 1'b1; + if (systolic_done) begin + for (i = 0; i < TILE_ROWS; i = i + 1) begin + if ((row_reg + i) < VOCAB_SIZE) begin + value16 = sat16($signed(systolic_result_flat[(i*64) +: 64]) >>> FRAC_BITS); + logits[row_reg + i] <= value16; + lm_tile_logits[i] <= value16; + end else begin + lm_tile_logits[i] <= 16'sh8000; + end + end + lm_reduce_idx <= 4'd0; + state_reg <= ST_LM_HEAD_REDUCE; + end + end + + ST_LM_HEAD_REDUCE: begin + max_logit_tmp = top_logit_q12; + best_token_tmp = argmax_token; + if ((row_reg + lm_reduce_idx) < VOCAB_SIZE) begin + if (lm_tile_logits[lm_reduce_idx] > max_logit_tmp) begin + max_logit_tmp = lm_tile_logits[lm_reduce_idx]; + best_token_tmp = row_reg + lm_reduce_idx; + end + end + + top_logit_q12 <= max_logit_tmp; + argmax_token <= best_token_tmp; + + if ((lm_reduce_idx == TILE_ROWS - 1) || ((row_reg + lm_reduce_idx) == VOCAB_SIZE - 1)) begin + lm_reduce_idx <= 4'd0; + if (row_reg == LAST_VOCAB_ROW_TILE) begin + state_reg <= ST_LM_HEAD_FINISH; + end else begin + row_reg <= row_reg + TILE_ROW_STEP; + state_reg <= ST_LM_HEAD; + end + end else begin + lm_reduce_idx <= lm_reduce_idx + 4'd1; + end + end + + ST_LM_HEAD_FINISH: begin + if (sample_mode) begin + sampler_start_reg <= 1'b1; + state_reg <= ST_SAMPLE_SCALE; + end else begin + next_token <= argmax_token; + state_reg <= ST_DONE; + end + end + + ST_SAMPLE: begin + top_logit_q12 <= logits[0]; + argmax_token <= 8'd0; + row_reg <= 7'd1; + state_reg <= ST_SAMPLE_MAX; + end + + ST_SAMPLE_MAX: begin + max_logit_tmp = top_logit_q12; + best_token_tmp = argmax_token; + if (logits[row_reg] > max_logit_tmp) begin + max_logit_tmp = logits[row_reg]; + best_token_tmp = {1'b0, row_reg}; + end + top_logit_q12 <= max_logit_tmp; + argmax_token <= best_token_tmp; + if (row_reg == VOCAB_SIZE - 1) begin + if (sample_mode) begin + sampler_start_reg <= 1'b1; + state_reg <= ST_SAMPLE_SCALE; + end else begin + next_token <= best_token_tmp; + state_reg <= ST_DONE; + end + end else begin + row_reg <= row_reg + 7'd1; + end + end + + ST_SAMPLE_SCALE: begin + if (sampler_done) begin + next_token <= sampler_next_token; + state_reg <= ST_DONE; + end + end + + ST_DONE: begin + busy <= 1'b0; + done <= 1'b1; + state_reg <= ST_IDLE; + end + + default: begin + state_reg <= ST_IDLE; + busy <= 1'b0; + end + endcase + end +end + +endmodule diff --git a/pynq/hw/src/core/processing_element.sv b/pynq/hw/src/core/processing_element.sv new file mode 100644 index 0000000..1abbaa9 --- /dev/null +++ b/pynq/hw/src/core/processing_element.sv @@ -0,0 +1,40 @@ +module processing_element #( + parameter int DATA_WIDTH = 32, + parameter int ACC_WIDTH = 64 +) ( + input logic clk, + input logic rst_n, + input logic en, + input logic clear, + input logic signed [DATA_WIDTH-1:0] a_in, + input logic signed [DATA_WIDTH-1:0] b_in, + output logic signed [DATA_WIDTH-1:0] a_out, + output logic signed [DATA_WIDTH-1:0] b_out, + output logic signed [ACC_WIDTH-1:0] acc_out +); + + logic signed [(2*DATA_WIDTH)-1:0] mul_full; + logic signed [ACC_WIDTH-1:0] mul_ext; + + always_comb begin + mul_full = a_in * b_in; + mul_ext = {{(ACC_WIDTH-(2*DATA_WIDTH)){mul_full[(2*DATA_WIDTH)-1]}}, mul_full}; + end + + always_ff @(posedge clk or negedge rst_n) begin + if (!rst_n) begin + a_out <= '0; + b_out <= '0; + acc_out <= '0; + end else if (clear) begin + a_out <= '0; + b_out <= '0; + acc_out <= '0; + end else if (en) begin + a_out <= a_in; + b_out <= b_in; + acc_out <= acc_out + mul_ext; + end + end + +endmodule diff --git a/pynq/hw/src/core/rms_scale_engine.sv b/pynq/hw/src/core/rms_scale_engine.sv new file mode 100644 index 0000000..ca34eff --- /dev/null +++ b/pynq/hw/src/core/rms_scale_engine.sv @@ -0,0 +1,139 @@ +module rms_scale_engine ( + input logic clk, + input logic resetn, + input logic start, + input logic [63:0] sumsq, + output logic busy, + output logic done, + output logic signed [15:0] scale_q12 +); + + localparam logic [1:0] + ST_IDLE = 2'd0, + ST_SQRT = 2'd1, + ST_DIV = 2'd2; + + logic [1:0] state_reg; + logic [63:0] radicand_reg; + logic [65:0] rem_reg; + logic [32:0] root_reg; + logic [31:0] denom_reg; + logic [5:0] iter_reg; + + logic [32:0] div_rem_reg; + logic [24:0] div_quot_reg; + logic [5:0] div_bit_reg; + + logic [65:0] rem_next; + logic [32:0] root_shift; + logic [33:0] cand_next; + logic [32:0] div_rem_next; + logic [24:0] div_quot_next; + logic div_in_bit; + + function automatic signed [15:0] sat16; + input signed [63:0] value; + begin + if (value > 64'sd32767) + sat16 = 16'sd32767; + else if (value < -64'sd32768) + sat16 = 16'sh8000; + else + sat16 = value[15:0]; + end + endfunction + + always_ff @(posedge clk) begin + if (!resetn) begin + state_reg <= ST_IDLE; + busy <= 1'b0; + done <= 1'b0; + scale_q12 <= 16'sd0; + radicand_reg <= 64'd0; + rem_reg <= 66'd0; + root_reg <= 33'd0; + denom_reg <= 32'd0; + iter_reg <= 6'd0; + div_rem_reg <= 33'd0; + div_quot_reg <= 25'd0; + div_bit_reg <= 6'd0; + end else begin + done <= 1'b0; + + case (state_reg) + ST_IDLE: begin + busy <= 1'b0; + if (start) begin + busy <= 1'b1; + radicand_reg <= (((sumsq >> 4) + 64'd1) << 12); + rem_reg <= 66'd0; + root_reg <= 33'd0; + iter_reg <= 6'd31; + state_reg <= ST_SQRT; + end + end + + ST_SQRT: begin + rem_next = (rem_reg << 2) | ((radicand_reg >> (iter_reg * 2)) & 64'd3); + root_shift = root_reg << 1; + cand_next = (root_shift << 1) | 34'd1; + + if (rem_next >= cand_next) begin + rem_reg <= rem_next - cand_next; + root_reg <= root_shift + 33'd1; + end else begin + rem_reg <= rem_next; + root_reg <= root_shift; + end + + if (iter_reg == 6'd0) begin + denom_reg <= (rem_next >= cand_next) ? (root_shift + 33'd1) : root_shift; + if (((rem_next >= cand_next) ? (root_shift + 33'd1) : root_shift) == 33'd0 || + ((rem_next >= cand_next) ? (root_shift + 33'd1) : root_shift) <= 33'd512) begin + scale_q12 <= 16'sd32767; + busy <= 1'b0; + done <= 1'b1; + state_reg <= ST_IDLE; + end else begin + div_rem_reg <= 33'd0; + div_quot_reg <= 25'd0; + div_bit_reg <= 6'd24; + state_reg <= ST_DIV; + end + end else begin + iter_reg <= iter_reg - 6'd1; + end + end + + ST_DIV: begin + div_in_bit = (div_bit_reg == 6'd24); + div_rem_next = {div_rem_reg[31:0], div_in_bit}; + div_quot_next = div_quot_reg; + + if (div_rem_next >= {1'b0, denom_reg}) begin + div_rem_next = div_rem_next - {1'b0, denom_reg}; + div_quot_next[div_bit_reg] = 1'b1; + end + + div_rem_reg <= div_rem_next; + div_quot_reg <= div_quot_next; + + if (div_bit_reg == 6'd0) begin + scale_q12 <= sat16($signed({39'd0, div_quot_next})); + busy <= 1'b0; + done <= 1'b1; + state_reg <= ST_IDLE; + end else begin + div_bit_reg <= div_bit_reg - 6'd1; + end + end + + default: begin + state_reg <= ST_IDLE; + busy <= 1'b0; + end + endcase + end + end + +endmodule diff --git a/pynq/hw/src/core/sat_div16_engine.sv b/pynq/hw/src/core/sat_div16_engine.sv new file mode 100644 index 0000000..28e069c --- /dev/null +++ b/pynq/hw/src/core/sat_div16_engine.sv @@ -0,0 +1,87 @@ +module sat_div16_engine ( + input logic clk, + input logic resetn, + input logic start, + input logic signed [63:0] numerator, + input logic [31:0] denominator, + output logic busy, + output logic done, + output logic signed [15:0] quotient +); + + logic [63:0] num_reg; + logic [31:0] denom_reg; + logic [64:0] rem_reg; + logic [63:0] quot_reg; + logic [6:0] bit_reg; + logic neg_reg; + + logic [64:0] rem_next; + logic [63:0] quot_next; + logic signed [64:0] signed_quot_next; + + function automatic signed [15:0] sat16; + input signed [63:0] value; + begin + if (value > 64'sd32767) + sat16 = 16'sd32767; + else if (value < -64'sd32768) + sat16 = 16'sh8000; + else + sat16 = value[15:0]; + end + endfunction + + always_ff @(posedge clk) begin + if (!resetn) begin + busy <= 1'b0; + done <= 1'b0; + quotient <= 16'sd0; + num_reg <= 64'd0; + denom_reg <= 32'd0; + rem_reg <= 65'd0; + quot_reg <= 64'd0; + bit_reg <= 7'd0; + neg_reg <= 1'b0; + end else begin + done <= 1'b0; + + if (!busy) begin + if (start) begin + busy <= 1'b1; + neg_reg <= numerator[63]; + num_reg <= numerator[63] ? (~numerator + 64'd1) : numerator[63:0]; + denom_reg <= (denominator == 32'd0) ? 32'd1 : denominator; + rem_reg <= 65'd0; + quot_reg <= 64'd0; + // Attention numerator is sum(exp_q12 * value_q12) across at most + // 16 positions: 4096 * 32768 * 16 = 2^31. Starting at bit 31 + // preserves the exact quotient and removes 32 dead divide cycles. + bit_reg <= 7'd31; + end + end else begin + rem_next = {rem_reg[63:0], num_reg[bit_reg]}; + quot_next = quot_reg; + if (rem_next >= {33'd0, denom_reg}) begin + rem_next = rem_next - {33'd0, denom_reg}; + quot_next[bit_reg] = 1'b1; + end + + rem_reg <= rem_next; + quot_reg <= quot_next; + + if (bit_reg == 7'd0) begin + signed_quot_next = $signed({1'b0, quot_next}); + if (neg_reg) + signed_quot_next = -signed_quot_next; + quotient <= sat16(signed_quot_next); + busy <= 1'b0; + done <= 1'b1; + end else begin + bit_reg <= bit_reg - 7'd1; + end + end + end + end + +endmodule diff --git a/pynq/hw/src/core/systolic_matvec16_tile.sv b/pynq/hw/src/core/systolic_matvec16_tile.sv new file mode 100644 index 0000000..d310e00 --- /dev/null +++ b/pynq/hw/src/core/systolic_matvec16_tile.sv @@ -0,0 +1,67 @@ +module systolic_matvec16_tile #( + parameter int DATA_WIDTH = 16, + parameter int ACC_WIDTH = 64, + parameter int LANES = 4 +) ( + input logic clk, + input logic resetn, + input logic start, + input logic signed [DATA_WIDTH-1:0] vector_value, + input logic signed [(LANES*DATA_WIDTH)-1:0] weights_flat, + output logic [4:0] col_idx, + output logic busy, + output logic done, + output logic signed [(LANES*ACC_WIDTH)-1:0] result_flat +); + + localparam int COLS = 16; + + logic signed [ACC_WIDTH-1:0] acc [0:LANES-1]; + + genvar lane; + generate + for (lane = 0; lane < LANES; lane = lane + 1) begin : GEN_MAC_LANE + wire signed [DATA_WIDTH-1:0] lane_weight; + wire signed [DATA_WIDTH-1:0] lane_value; + wire signed [ACC_WIDTH-1:0] lane_product; + + assign lane_weight = weights_flat[(lane*DATA_WIDTH) +: DATA_WIDTH]; + assign lane_value = vector_value; + assign lane_product = $signed(lane_weight) * $signed(lane_value); + + assign result_flat[(lane*ACC_WIDTH) +: ACC_WIDTH] = acc[lane]; + + always_ff @(posedge clk) begin + if (!resetn) begin + acc[lane] <= '0; + end else if (start && !busy) begin + acc[lane] <= '0; + end else if (busy && (col_idx < COLS[4:0])) begin + acc[lane] <= acc[lane] + lane_product; + end + end + end + endgenerate + + always_ff @(posedge clk) begin + if (!resetn) begin + busy <= 1'b0; + done <= 1'b0; + col_idx <= 5'd0; + end else begin + done <= 1'b0; + if (start && !busy) begin + busy <= 1'b1; + col_idx <= 5'd0; + end else if (busy) begin + if (col_idx == (COLS - 1)) begin + busy <= 1'b0; + done <= 1'b1; + end else begin + col_idx <= col_idx + 5'd1; + end + end + end + end + +endmodule diff --git a/pynq/hw/src/top/microgpt_pynq_top.sv b/pynq/hw/src/top/microgpt_pynq_top.sv new file mode 100644 index 0000000..0d36b8f --- /dev/null +++ b/pynq/hw/src/top/microgpt_pynq_top.sv @@ -0,0 +1,695 @@ +// ----------------------------------------------------------------------------- +// microgpt_pynq_top.sv +// +// PYNQ-Z2 / Zynq-7000 top wrapper for the TALOS-V2 microGPT exact core. +// +// Deviations from the original DE1-SoC top (de1_soc_microgpt_rtl.sv): +// * Avalon-MM JTAG-master replaced with a 12-bit / 32-bit AXI4-Lite slave +// mapped at 0x4000_0000 (4 KB) on the PS GP0 port. +// * Single clock domain (s_axi_aclk == FCLK_CLK0 @ 50 MHz). All toggle-bit +// CDC synchronizers removed; control "toggles" become 1-cycle pulses. +// * Altera PLL (sys_pll_56_25) removed -- the PS supplies the clock. +// * SW[1:0] inputs removed; reset comes from FCLK_RESET0_N (s_axi_aresetn) +// and "enable" is implied (always running, gated only by host commands). +// * 7-segment HEX0..HEX5 outputs removed. +// * Only 4 PL outputs remain, wired to PYNQ-Z2 LD0..LD3: +// led_busy, led_done, led_error, led_heartbeat. +// * WSTRB is ignored -- only aligned 32-bit writes are accepted. +// * host_toggle_reg still toggles on every successful AXI transaction so the +// host can detect that its access landed (kept for behavioural parity). +// ----------------------------------------------------------------------------- + +`default_nettype none + +module microgpt_pynq_top #( + // Width of the free-running heartbeat counter feeding led_heartbeat. + // Synth default 26 gives 50 MHz / 2^26 = ~0.74 Hz visible blink. The cocotb + // Makefile overrides this to a much smaller value so test_08 can observe a + // led_heartbeat toggle within a short sim window. + parameter integer HEARTBEAT_BITS = 26 +) ( + (* X_INTERFACE_PARAMETER = "ASSOCIATED_BUSIF s_axi, ASSOCIATED_RESET s_axi_aresetn, FREQ_HZ 50000000" *) + (* X_INTERFACE_INFO = "xilinx.com:signal:clock:1.0 s_axi_aclk CLK" *) + input wire s_axi_aclk, + (* X_INTERFACE_PARAMETER = "POLARITY ACTIVE_LOW" *) + (* X_INTERFACE_INFO = "xilinx.com:signal:reset:1.0 s_axi_aresetn RST" *) + input wire s_axi_aresetn, + + // AXI4-Lite write address channel + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi AWADDR" *) + input wire [11:0] s_axi_awaddr, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi AWPROT" *) + input wire [2:0] s_axi_awprot, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi AWVALID" *) + input wire s_axi_awvalid, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi AWREADY" *) + output wire s_axi_awready, + + // AXI4-Lite write data channel + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi WDATA" *) + input wire [31:0] s_axi_wdata, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi WSTRB" *) + input wire [3:0] s_axi_wstrb, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi WVALID" *) + input wire s_axi_wvalid, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi WREADY" *) + output wire s_axi_wready, + + // AXI4-Lite write response channel + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi BRESP" *) + output wire [1:0] s_axi_bresp, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi BVALID" *) + output wire s_axi_bvalid, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi BREADY" *) + input wire s_axi_bready, + + // AXI4-Lite read address channel + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi ARADDR" *) + input wire [11:0] s_axi_araddr, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi ARPROT" *) + input wire [2:0] s_axi_arprot, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi ARVALID" *) + input wire s_axi_arvalid, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi ARREADY" *) + output wire s_axi_arready, + + // AXI4-Lite read data channel + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi RDATA" *) + output wire [31:0] s_axi_rdata, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi RRESP" *) + output wire [1:0] s_axi_rresp, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi RVALID" *) + output wire s_axi_rvalid, + (* X_INTERFACE_INFO = "xilinx.com:interface:aximm:1.0 s_axi RREADY" *) + input wire s_axi_rready, + + // PL LEDs (LD0..LD3 on PYNQ-Z2) + output wire led_busy, + output wire led_done, + output wire led_error, + output wire led_heartbeat, + + // Active-high level interrupt asserted when a generation completes + // (rising edge of done_latched_reg). Cleared automatically when the + // host writes a new start_pulse via REG_CMD bit0, so the driver pattern + // is: write start, wait on irq, read tokens, return. No explicit ACK. + (* X_INTERFACE_INFO = "xilinx.com:signal:interrupt:1.0 done_irq INTERRUPT" *) + (* X_INTERFACE_PARAMETER = "SENSITIVITY LEVEL_HIGH" *) + output wire done_irq +); + + // --------------------------------------------------------------------- + // Local parameters + // --------------------------------------------------------------------- + localparam [7:0] BOS_TOKEN = 8'd26; + localparam [2:0] ST_READY = 3'd0; + localparam [2:0] ST_WAIT_CORE = 3'd1; + localparam [2:0] ST_DONE = 3'd2; + + // Word-address map (s_axi_*addr[11:2]) + localparam [9:0] A_MAGIC = 10'h000; + localparam [9:0] A_VERSION = 10'h001; + localparam [9:0] A_CMD = 10'h002; // WO: bit0=start, bit1=clear + localparam [9:0] A_STATUS = 10'h003; + localparam [9:0] A_CONFIG = 10'h004; // {temp[31:16], max_gen[15:8], 8'd0} + localparam [9:0] A_SEED = 10'h005; + localparam [9:0] A_LOGIT_INFO = 10'h006; + localparam [9:0] A_BOS = 10'h007; + localparam [9:0] A_STEP_CFG = 10'h008; + localparam [9:0] A_STEP_TRIG = 10'h009; + localparam [9:0] A_HEARTBEAT = 10'h00A; // RO debug: heartbeat_reg as a 32-b word + localparam [9:0] A_OUT_BASE = 10'h018; // 0x060/4 -> 16 words + localparam [9:0] A_OUT_LAST = 10'h027; + localparam [9:0] A_PERF_CYC = 10'h036; + localparam [9:0] A_TPS = 10'h037; + localparam [9:0] A_LOGITS_BASE = 10'h040; // 27 entries + localparam [9:0] A_LOGITS_LAST = 10'h05A; + + // --------------------------------------------------------------------- + // AXI4-Lite slave FSM (very small, single outstanding txn) + // --------------------------------------------------------------------- + reg awready_reg; + reg wready_reg; + reg bvalid_reg; + reg arready_reg; + reg rvalid_reg; + reg [31:0] rdata_reg; + + reg [11:0] awaddr_reg; + reg [11:0] araddr_reg; + + assign s_axi_awready = awready_reg; + assign s_axi_wready = wready_reg; + assign s_axi_bvalid = bvalid_reg; + assign s_axi_bresp = 2'b00; + assign s_axi_arready = arready_reg; + assign s_axi_rvalid = rvalid_reg; + assign s_axi_rdata = rdata_reg; + assign s_axi_rresp = 2'b00; + + wire read_handshake = s_axi_arvalid && arready_reg; + + // Latch write-strobe of the cycle so register-write logic can use it. + reg write_pulse_reg; + reg [11:0] write_addr_reg; + reg [31:0] write_data_reg; + + // Per-channel "transferred and pending" bits used by the rewritten + // AXI4-Lite write FSM below (Xilinx-template style). + reg aw_latched; + reg w_latched; + + // --------------------------------------------------------------------- + // Host-facing registers + // --------------------------------------------------------------------- + reg [15:0] host_temperature_reg; + reg [7:0] host_max_gen_reg; + reg [31:0] host_seed_reg; + reg host_direct_mode_reg; + reg host_step_clear_reg; + reg [7:0] host_step_token_reg; + reg [7:0] host_step_pos_reg; + reg host_toggle_reg; // toggles on every AXI transaction + + // 1-cycle pulses (replaces Avalon toggle-bit CDC) + reg start_pulse; + reg clear_pulse; + reg step_pulse; + + // --------------------------------------------------------------------- + // Core control plane (single-clock now, was on the 56.25 MHz domain) + // --------------------------------------------------------------------- + reg [2:0] state_reg; + reg [7:0] token_reg; + reg [7:0] pos_reg; + reg [7:0] out_len_reg; + reg [31:0] rng_reg; + reg [15:0] temperature_reg; + reg [7:0] max_gen_reg; + reg start_core_reg; + reg clear_cache_reg; + reg done_latched_reg; + reg [7:0] last_token_reg; + reg [31:0] perf_cycles_reg; + reg [31:0] tokens_per_sec_reg; + reg error_reg; + reg host_run_reg; + reg direct_mode_reg; + reg step_clear_reg; + reg [7:0] step_token_reg; + reg [7:0] step_pos_reg; + reg [7:0] output_mem [0:15]; + // Heartbeat counter widened from 24b to 26b so led_heartbeat blinks + // at ~0.74 Hz (50 MHz / 2^26) -- unambiguously visible to the eye. + // The 24-bit MSB toggled at ~3 Hz which read as a steady half-bright + // glow rather than a blink on the deployed bitstream. + (* DONT_TOUCH = "true" *) reg [HEARTBEAT_BITS-1:0] heartbeat_reg; + + integer out_i; + + // --------------------------------------------------------------------- + // Core instance + // --------------------------------------------------------------------- + wire core_busy; + wire core_done; + wire [7:0] core_next_token; + wire [7:0] core_argmax_token; + wire [31:0] core_rng_state; + wire signed [15:0] core_top_logit; + wire signed [(27*16)-1:0] core_logits_flat; + + // KEEP_HIERARCHY blocks Vivado synth from peering inside the core for + // signal-equivalence merging. Without it, Vivado 2024.1 was tying the + // wrapper's heartbeat_reg synchronous-reset pin to an internal control + // signal of the unmodified core (GEN_ATTN_DIV[0].attn_div_inst/p_0_in), + // which left led_heartbeat dark on hardware even though every netlist + // probe of the synth checkpoint showed the counter wired correctly. + (* KEEP_HIERARCHY = "yes" *) + microgpt_exact_core core_inst ( + .clk (s_axi_aclk), + .resetn (s_axi_aresetn), + .start (start_core_reg), + .clear_cache (clear_cache_reg), + .sample_mode (~direct_mode_reg), + .temperature_q8_8 (temperature_reg), + .rng_state_in (rng_reg), + .token_in (token_reg), + .pos_in (pos_reg), + .busy (core_busy), + .done (core_done), + .next_token (core_next_token), + .argmax_token (core_argmax_token), + .rng_state_out (core_rng_state), + .top_logit_q12 (core_top_logit), + .logits_flat (core_logits_flat) + ); + + // --------------------------------------------------------------------- + // AXI4-Lite write channel (FIXED -- 2026-05-09) + // + // PRODUCTION BUG (wedged the entire Zynq PS bus on the first write): + // The previous implementation kept AWREADY/WREADY perpetually high + // in idle and only latched the transaction when AWVALID, AWREADY, + // WVALID, and WREADY were all high in the SAME cycle (the + // `write_handshake` wire). Per AXI4 spec, however, once the slave + // samples READY high while VALID is high the handshake on that + // channel is COMPLETE, regardless of what the other channel is + // doing. The Zynq PS M_AXI_GP0 (AXI3) through axi_interconnect:2.1 + // routinely staggers AW and W by a cycle, so: + // cycle N : AWVALID=1, AWREADY=1 -> master considers AW done, + // drops AWVALID; slave never latched anything. + // cycle N+1 : WVALID=1, WREADY=1 -> master considers W done, + // drops WVALID; slave still never latched anything. + // cycle N+2..inf : master waits for BVALID. BVALID never asserts. + // The PS master is now wedged on an outstanding write, + // the kernel locks up, hard power cycle required. + // + // FIX: standard Xilinx-template AXI4-Lite slave FSM. AWREADY pulses + // for one cycle when AWVALID is observed and no AW is currently + // pending, and the address is captured in awaddr_reg. WREADY behaves + // identically and independently. When BOTH sides are latched + // (aw_latched && w_latched) we fire the actual register-write pulse + // and assert BVALID. BVALID is held until BREADY is observed, after + // which the latches clear and the slave is ready for the next txn. + // + // BRESP is left at 2'b00 (OKAY); unmapped writes still complete the + // BVALID handshake (they just don't have an effect in the decode + // case statement below) -- which is the contract the cocotb suite + // checks in test_06. + // --------------------------------------------------------------------- + always @(posedge s_axi_aclk) begin + if (!s_axi_aresetn) begin + awready_reg <= 1'b0; + wready_reg <= 1'b0; + bvalid_reg <= 1'b0; + aw_latched <= 1'b0; + w_latched <= 1'b0; + awaddr_reg <= 12'd0; + write_pulse_reg <= 1'b0; + write_addr_reg <= 12'd0; + write_data_reg <= 32'd0; + end else begin + write_pulse_reg <= 1'b0; + + // ----- AW channel: one-cycle AWREADY pulse + latch ------------ + if (!aw_latched && !awready_reg && s_axi_awvalid) begin + awready_reg <= 1'b1; + awaddr_reg <= s_axi_awaddr; + aw_latched <= 1'b1; + end else begin + awready_reg <= 1'b0; + end + + // ----- W channel: independent of AW, same shape -------------- + if (!w_latched && !wready_reg && s_axi_wvalid) begin + wready_reg <= 1'b1; + write_data_reg <= s_axi_wdata; + w_latched <= 1'b1; + end else begin + wready_reg <= 1'b0; + end + + // ----- B channel: fire write + BVALID once both sides in ----- + if (aw_latched && w_latched && !bvalid_reg) begin + bvalid_reg <= 1'b1; + write_pulse_reg <= 1'b1; + write_addr_reg <= awaddr_reg; + end else if (bvalid_reg && s_axi_bready) begin + bvalid_reg <= 1'b0; + aw_latched <= 1'b0; + w_latched <= 1'b0; + end + end + end + + // --------------------------------------------------------------------- + // AXI read channel + // --------------------------------------------------------------------- + reg [31:0] read_data_comb; + reg read_pulse_reg; + + always @(posedge s_axi_aclk) begin + if (!s_axi_aresetn) begin + arready_reg <= 1'b1; + rvalid_reg <= 1'b0; + rdata_reg <= 32'd0; + araddr_reg <= 12'd0; + read_pulse_reg <= 1'b0; + end else begin + read_pulse_reg <= 1'b0; + if (read_handshake) begin + arready_reg <= 1'b0; + araddr_reg <= s_axi_araddr; + rvalid_reg <= 1'b1; + rdata_reg <= read_data_comb; + read_pulse_reg <= 1'b1; + end else if (rvalid_reg && s_axi_rready) begin + rvalid_reg <= 1'b0; + arready_reg <= 1'b1; + end + end + end + + // Word-address slices used by the read mux. For reads we look up against + // the live s_axi_araddr in the same cycle we accept it. + wire [9:0] r_word_addr = s_axi_araddr[11:2]; + wire [9:0] w_word_addr = write_addr_reg[11:2]; + + // --------------------------------------------------------------------- + // Read data mux (combinational) + // --------------------------------------------------------------------- + integer rd_i; + always @* begin + read_data_comb = 32'd0; + rd_i = 0; + case (r_word_addr) + A_MAGIC: read_data_comb = 32'h4D475254; // "MGRT" + A_VERSION: read_data_comb = 32'h00020001; + A_CMD: read_data_comb = 32'd0; // WO -> reads as 0 + A_STATUS: read_data_comb = { + pos_reg, + out_len_reg, + 8'd0, + 2'd0, + direct_mode_reg, + host_toggle_reg, + error_reg, + done_latched_reg, + (state_reg == ST_WAIT_CORE), + (state_reg == ST_READY) + }; + A_CONFIG: read_data_comb = {temperature_reg, max_gen_reg, 8'd0}; + A_SEED: read_data_comb = rng_reg; + A_LOGIT_INFO: read_data_comb = {core_top_logit[15:0], core_argmax_token, last_token_reg}; + A_BOS: read_data_comb = {16'd0, 8'd0, BOS_TOKEN}; + A_STEP_CFG: read_data_comb = {8'd0, step_token_reg, step_pos_reg, step_clear_reg, direct_mode_reg}; + A_STEP_TRIG: read_data_comb = 32'd0; // WO -> reads as 0 + A_HEARTBEAT: read_data_comb = {{(32-HEARTBEAT_BITS){1'b0}}, heartbeat_reg}; + A_PERF_CYC: read_data_comb = perf_cycles_reg; + A_TPS: read_data_comb = tokens_per_sec_reg; + default: begin + if ((r_word_addr >= A_OUT_BASE) && (r_word_addr <= A_OUT_LAST)) begin + rd_i = r_word_addr - A_OUT_BASE; + read_data_comb = {24'd0, output_mem[rd_i[3:0]]}; + end else if ((r_word_addr >= A_LOGITS_BASE) && + (r_word_addr <= A_LOGITS_LAST)) begin + rd_i = r_word_addr - A_LOGITS_BASE; + read_data_comb = {{16{core_logits_flat[(rd_i*16)+15]}}, + core_logits_flat[(rd_i*16) +: 16]}; + end else begin + read_data_comb = 32'd0; + end + end + endcase + end + + // --------------------------------------------------------------------- + // Register write decode + control pulses + // --------------------------------------------------------------------- + always @(posedge s_axi_aclk) begin + if (!s_axi_aresetn) begin + host_temperature_reg <= 16'h0080; + host_max_gen_reg <= 8'd15; + host_seed_reg <= 32'h0000_0001; + host_direct_mode_reg <= 1'b0; + host_step_clear_reg <= 1'b0; + host_step_token_reg <= BOS_TOKEN; + host_step_pos_reg <= 8'd0; + host_toggle_reg <= 1'b0; + start_pulse <= 1'b0; + clear_pulse <= 1'b0; + step_pulse <= 1'b0; + end else begin + // Default: clear pulses every cycle + start_pulse <= 1'b0; + clear_pulse <= 1'b0; + step_pulse <= 1'b0; + + // Toggle on any successful AXI transaction (read or write). + if (write_pulse_reg || read_pulse_reg) + host_toggle_reg <= ~host_toggle_reg; + + if (write_pulse_reg) begin + case (w_word_addr) + A_CMD: begin + if (write_data_reg[0]) start_pulse <= 1'b1; + if (write_data_reg[1]) clear_pulse <= 1'b1; + end + A_CONFIG: begin + host_max_gen_reg <= write_data_reg[15:8]; + host_temperature_reg <= write_data_reg[31:16]; + end + A_SEED: begin + host_seed_reg <= write_data_reg; + end + A_STEP_CFG: begin + host_direct_mode_reg <= write_data_reg[0]; + host_step_clear_reg <= write_data_reg[1]; + host_step_pos_reg <= write_data_reg[15:8]; + host_step_token_reg <= write_data_reg[23:16]; + end + A_STEP_TRIG: begin + if (write_data_reg[0]) step_pulse <= 1'b1; + end + default: ; + endcase + end + end + end + + // --------------------------------------------------------------------- + // Core control state machine (functional twin of DE1 top, single clock) + // --------------------------------------------------------------------- + always @(posedge s_axi_aclk) begin + if (!s_axi_aresetn) begin + state_reg <= ST_READY; + token_reg <= BOS_TOKEN; + pos_reg <= 8'd0; + out_len_reg <= 8'd0; + rng_reg <= host_seed_reg; + temperature_reg <= 16'h0080; + max_gen_reg <= 8'd15; + start_core_reg <= 1'b0; + clear_cache_reg <= 1'b0; + done_latched_reg <= 1'b0; + last_token_reg <= 8'd0; + perf_cycles_reg <= 32'd0; + tokens_per_sec_reg <= 32'd0; + error_reg <= 1'b0; + host_run_reg <= 1'b0; + direct_mode_reg <= 1'b0; + step_clear_reg <= 1'b0; + step_token_reg <= BOS_TOKEN; + step_pos_reg <= 8'd0; + for (out_i = 0; out_i < 16; out_i = out_i + 1) + output_mem[out_i] <= 8'd0; + end else begin + start_core_reg <= 1'b0; + clear_cache_reg <= 1'b0; + + if (state_reg == ST_WAIT_CORE) + perf_cycles_reg <= perf_cycles_reg + 32'd1; + + // Latch host config on a fresh start request. + if (start_pulse) begin + max_gen_reg <= host_max_gen_reg; + temperature_reg <= host_temperature_reg; + rng_reg <= host_seed_reg; + direct_mode_reg <= 1'b0; + host_run_reg <= 1'b1; + end + + if (step_pulse) begin + direct_mode_reg <= host_direct_mode_reg; + step_clear_reg <= host_step_clear_reg; + step_token_reg <= host_step_token_reg; + step_pos_reg <= host_step_pos_reg; + end + + if (clear_pulse) begin + state_reg <= ST_READY; + token_reg <= BOS_TOKEN; + pos_reg <= 8'd0; + out_len_reg <= 8'd0; + done_latched_reg <= 1'b0; + last_token_reg <= 8'd0; + perf_cycles_reg <= 32'd0; + tokens_per_sec_reg <= 32'd0; + error_reg <= 1'b0; + host_run_reg <= 1'b0; + direct_mode_reg <= 1'b0; + for (out_i = 0; out_i < 16; out_i = out_i + 1) + output_mem[out_i] <= 8'd0; + end else begin + case (state_reg) + ST_READY: begin + if (step_pulse && host_direct_mode_reg) begin + token_reg <= host_step_token_reg; + pos_reg <= host_step_pos_reg; + out_len_reg <= 8'd0; + done_latched_reg <= 1'b0; + last_token_reg <= 8'd0; + perf_cycles_reg <= 32'd0; + tokens_per_sec_reg <= 32'd0; + error_reg <= 1'b0; + if (host_step_clear_reg) + rng_reg <= host_seed_reg; + clear_cache_reg <= host_step_clear_reg; + start_core_reg <= 1'b1; + state_reg <= ST_WAIT_CORE; + end else if (start_pulse) begin + token_reg <= BOS_TOKEN; + pos_reg <= 8'd0; + out_len_reg <= 8'd0; + done_latched_reg <= 1'b0; + last_token_reg <= 8'd0; + perf_cycles_reg <= 32'd0; + tokens_per_sec_reg <= 32'd0; + error_reg <= 1'b0; + for (out_i = 0; out_i < 16; out_i = out_i + 1) + output_mem[out_i] <= 8'd0; + clear_cache_reg <= 1'b1; + if (host_max_gen_reg == 8'd0 || host_max_gen_reg > 8'd15) begin + error_reg <= 1'b1; + done_latched_reg <= 1'b1; + state_reg <= ST_DONE; + end else begin + start_core_reg <= 1'b1; + state_reg <= ST_WAIT_CORE; + end + end + end + + ST_WAIT_CORE: begin + if (core_done) begin + rng_reg <= core_rng_state; + last_token_reg <= core_next_token; + if (direct_mode_reg) begin + done_latched_reg <= 1'b1; + state_reg <= ST_DONE; + end else if ((core_next_token == BOS_TOKEN) || (pos_reg == 8'd15)) begin + done_latched_reg <= 1'b1; + state_reg <= ST_DONE; + end else begin + output_mem[out_len_reg[3:0]] <= core_next_token; + token_reg <= core_next_token; + pos_reg <= pos_reg + 8'd1; + out_len_reg <= out_len_reg + 8'd1; + if ((out_len_reg + 8'd1) >= max_gen_reg) begin + done_latched_reg <= 1'b1; + state_reg <= ST_DONE; + end else begin + start_core_reg <= 1'b1; + state_reg <= ST_WAIT_CORE; + end + end + end + end + + ST_DONE: begin + if (step_pulse && host_direct_mode_reg) begin + token_reg <= host_step_token_reg; + pos_reg <= host_step_pos_reg; + out_len_reg <= 8'd0; + done_latched_reg <= 1'b0; + last_token_reg <= 8'd0; + perf_cycles_reg <= 32'd0; + tokens_per_sec_reg <= 32'd0; + error_reg <= 1'b0; + if (host_step_clear_reg) + rng_reg <= host_seed_reg; + clear_cache_reg <= host_step_clear_reg; + start_core_reg <= 1'b1; + state_reg <= ST_WAIT_CORE; + end else if (start_pulse) begin + token_reg <= BOS_TOKEN; + pos_reg <= 8'd0; + out_len_reg <= 8'd0; + done_latched_reg <= 1'b0; + last_token_reg <= 8'd0; + perf_cycles_reg <= 32'd0; + tokens_per_sec_reg <= 32'd0; + error_reg <= 1'b0; + for (out_i = 0; out_i < 16; out_i = out_i + 1) + output_mem[out_i] <= 8'd0; + clear_cache_reg <= 1'b1; + if (host_max_gen_reg == 8'd0 || host_max_gen_reg > 8'd15) begin + error_reg <= 1'b1; + done_latched_reg <= 1'b1; + state_reg <= ST_DONE; + end else begin + start_core_reg <= 1'b1; + state_reg <= ST_WAIT_CORE; + end + end + end + + default: state_reg <= ST_READY; + endcase + end + end + end + + // --------------------------------------------------------------------- + // Free-running heartbeat counter -- isolated in its own always block. + // + // Why a separate block: when this counter lived inside the giant FSM + // control always block above, Vivado synth (2024.1) lifted its reset + // path through the unmodified microgpt_exact_core's sampler and the + // resulting bitstream left led_heartbeat dark on hardware even though + // every checkpoint introspection showed heartbeat_reg_reg[25] driving + // the OBUF input correctly. Moving the counter to a minimal isolated + // always block prevents the optimisation from straying into the core's + // sampler instance. (* KEEP = "true" *) on the reg keeps a future + // synth release from merging it back in. + // --------------------------------------------------------------------- + always @(posedge s_axi_aclk) begin + if (!s_axi_aresetn) + heartbeat_reg <= {HEARTBEAT_BITS{1'b0}}; + else + heartbeat_reg <= heartbeat_reg + 1'b1; + end + + // --------------------------------------------------------------------- + // LED outputs (PYNQ-Z2 LD0..LD3) + // --------------------------------------------------------------------- + assign led_busy = (state_reg == ST_WAIT_CORE); + assign led_done = done_latched_reg; + assign led_error = error_reg; + assign led_heartbeat = heartbeat_reg[HEARTBEAT_BITS-1]; + + // --------------------------------------------------------------------- + // PL->PS interrupt: asserts on rising edge of done_latched_reg, holds + // until the host writes a new start_pulse (or a clear_pulse). The PS + // sees a level-high IRQ; the GIC + PYNQ Interrupt driver wakes the + // userspace coroutine the moment the line goes high. + // + // Why latched-and-cleared-by-start instead of a 1-cycle pulse: + // * a 1-cycle pulse at 50 MHz is technically observable by the GIC + // after synchronisation, but it is more fragile across IRQ_F2P + // metastability than a level signal, + // * letting start_pulse be the implicit ACK saves one MMIO + // transaction per call (no separate IRQ_ACK register write), + // * because every generate() call begins with a start, the line + // is guaranteed to be low when the next wait() is armed. + // --------------------------------------------------------------------- + reg done_latched_reg_d; + reg irq_pending_reg; + always @(posedge s_axi_aclk) begin + if (!s_axi_aresetn) begin + done_latched_reg_d <= 1'b0; + irq_pending_reg <= 1'b0; + end else begin + done_latched_reg_d <= done_latched_reg; + if (start_pulse || clear_pulse) + irq_pending_reg <= 1'b0; + else if (done_latched_reg && !done_latched_reg_d) + irq_pending_reg <= 1'b1; + end + end + + assign done_irq = irq_pending_reg; + +endmodule + +`default_nettype wire diff --git a/pynq/hw/tcl/.gitkeep b/pynq/hw/tcl/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/hw/tcl/build.tcl b/pynq/hw/tcl/build.tcl new file mode 100644 index 0000000..18b7750 --- /dev/null +++ b/pynq/hw/tcl/build.tcl @@ -0,0 +1,207 @@ +# ----------------------------------------------------------------------------- +# build.tcl -- Vivado non-project / project-based build for microgpt_pynq_top +# targeting the PYNQ-Z2 (Zynq-7000, XC7Z010-1CLG400C). +# +# Usage (from the repository root): +# vivado -mode batch -source hw/tcl/build.tcl +# +# Outputs: +# hw/build/microgpt_pynq.xpr Vivado project +# overlays/microgpt.bit Final bitstream +# overlays/microgpt.hwh Hardware handoff for PYNQ +# +# DE1 deviation: replaces Quartus .qpf/.qsf flow. Generates an AXI block +# design instead of the Avalon-MM JTAG bridge / Altera PLL hierarchy. +# ----------------------------------------------------------------------------- + +# --- Paths ----------------------------------------------------------------- +set repo_root [file normalize [file join [file dirname [info script]] .. ..]] +set hw_root [file join $repo_root "hw"] +set src_root [file join $hw_root "src"] +set core_root [file join $src_root "core"] +set core_inc [file join $core_root "include"] +set top_root [file join $src_root "top"] +set ip_root [file join $hw_root "ip"] +set constr_root [file join $hw_root "constraints"] +set build_root [file join $hw_root "build"] +set overlays [file join $repo_root "overlays"] + +file mkdir $build_root +file mkdir $overlays + +# --- Project --------------------------------------------------------------- +set proj_name "microgpt_pynq" +set part "xc7z020clg400-1" + +create_project -force $proj_name $build_root -part $part +set_property board_part tul.com.tw:pynq-z2:part0:1.0 [current_project] + +# --- Sources --------------------------------------------------------------- +# RTL +add_files -norecurse [glob [file join $core_root "*.sv"]] +add_files -norecurse [file join $top_root "microgpt_pynq_top.sv"] + +# SystemVerilog headers (.svh) -- must be added as project files (not just +# resolved via include_dirs) for create_bd_cell -type module -reference to +# accept the referencing RTL module. +set svh_files [glob -nocomplain [file join $core_inc "*.svh"]] +if {[llength $svh_files] > 0} { + add_files -norecurse $svh_files + set_property file_type "Verilog Header" [get_files -of_objects [get_filesets sources_1] *.svh] +} + +# Hex weights -- added so they are tracked, but FILE_TYPE is left as default +# ("Memory File"); $readmemh resolves them through the include search path +# below rather than via the FILE_TYPE = "Memory Initialization Files" hook. +add_files -norecurse [glob [file join $ip_root "*.hex"]] + +# Constraints +add_files -fileset constrs_1 -norecurse [file join $constr_root "pynq_z2.xdc"] + +# Verilog includes (.svh) and the hex search directory +set_property include_dirs [list $core_inc $ip_root] [get_filesets sources_1] +set_property include_dirs [list $core_inc $ip_root] [get_filesets sim_1] + +# Make sure SystemVerilog compile is used +set_property file_type SystemVerilog [get_files -filter {NAME =~ "*.sv"}] + +# Vivado's `create_bd_cell -type module -reference` rejects a SystemVerilog +# file as the top of a module reference (filemgmt-56-195). microgpt_pynq_top.sv +# is Verilog-2001 compatible (no logic/always_ff/interfaces/packages -- only +# wire/reg, +: part-selects, and inline X_INTERFACE_INFO attributes), so we +# re-tag just that one file as Verilog. Children stay SystemVerilog. +set_property file_type Verilog \ + [get_files -of_objects [get_filesets sources_1] "microgpt_pynq_top.sv"] + +# --- Block design ---------------------------------------------------------- +set bd_name "system" +create_bd_design $bd_name + +# Zynq-7000 PS with PYNQ-Z2 preset (falls back to apply_bd_automation if the +# board file is not installed). +set zynq_ps [create_bd_cell -type ip -vlnv xilinx.com:ip:processing_system7:5.5 ps7_0] +apply_bd_automation -rule xilinx.com:bd_rule:processing_system7 \ + -config { make_external "FIXED_IO, DDR" apply_board_preset "1" Master "Disable" Slave "Disable" } \ + $zynq_ps + +# Force FCLK_CLK0 = 50 MHz, single AXI GP master, fabric interrupt enabled +# so we can wire microgpt_0/done_irq up to IRQ_F2P[0]. +set_property -dict [list \ + CONFIG.PCW_FPGA0_PERIPHERAL_FREQMHZ {50} \ + CONFIG.PCW_USE_M_AXI_GP0 {1} \ + CONFIG.PCW_USE_FABRIC_INTERRUPT {1} \ + CONFIG.PCW_IRQ_F2P_INTR {1} \ + CONFIG.PCW_EN_CLK0_PORT {1} \ + CONFIG.PCW_EN_RST0_PORT {1} \ +] $zynq_ps + +# Custom IP wrapper (RTL module brought in as a BD cell) +set top_cell [create_bd_cell -type module -reference microgpt_pynq_top microgpt_0] + +# Processor System Reset +set rst_inst [create_bd_cell -type ip -vlnv xilinx.com:ip:proc_sys_reset:5.0 rst_ps7_50m] + +# AXI Interconnect (1 master, 1 slave) +set axi_ic [create_bd_cell -type ip -vlnv xilinx.com:ip:axi_interconnect:2.1 axi_ic_0] +set_property -dict [list CONFIG.NUM_SI {1} CONFIG.NUM_MI {1}] $axi_ic + +# Connections -- clock / reset +connect_bd_net [get_bd_pins ps7_0/FCLK_CLK0] [get_bd_pins rst_ps7_50m/slowest_sync_clk] +connect_bd_net [get_bd_pins ps7_0/FCLK_RESET0_N] [get_bd_pins rst_ps7_50m/ext_reset_in] +connect_bd_net [get_bd_pins ps7_0/FCLK_CLK0] [get_bd_pins ps7_0/M_AXI_GP0_ACLK] +connect_bd_net [get_bd_pins ps7_0/FCLK_CLK0] [get_bd_pins axi_ic_0/ACLK] +connect_bd_net [get_bd_pins ps7_0/FCLK_CLK0] [get_bd_pins axi_ic_0/S00_ACLK] +connect_bd_net [get_bd_pins ps7_0/FCLK_CLK0] [get_bd_pins axi_ic_0/M00_ACLK] +connect_bd_net [get_bd_pins ps7_0/FCLK_CLK0] [get_bd_pins microgpt_0/s_axi_aclk] +connect_bd_net [get_bd_pins rst_ps7_50m/interconnect_aresetn] [get_bd_pins axi_ic_0/ARESETN] +connect_bd_net [get_bd_pins rst_ps7_50m/peripheral_aresetn] [get_bd_pins axi_ic_0/S00_ARESETN] +connect_bd_net [get_bd_pins rst_ps7_50m/peripheral_aresetn] [get_bd_pins axi_ic_0/M00_ARESETN] +connect_bd_net [get_bd_pins rst_ps7_50m/peripheral_aresetn] [get_bd_pins microgpt_0/s_axi_aresetn] + +# AXI buses +connect_bd_intf_net [get_bd_intf_pins ps7_0/M_AXI_GP0] [get_bd_intf_pins axi_ic_0/S00_AXI] +connect_bd_intf_net [get_bd_intf_pins axi_ic_0/M00_AXI] [get_bd_intf_pins microgpt_0/s_axi] + +# LEDs out to top-level ports +create_bd_port -dir O led_busy +create_bd_port -dir O led_done +create_bd_port -dir O led_error +create_bd_port -dir O led_heartbeat +connect_bd_net [get_bd_pins microgpt_0/led_busy] [get_bd_ports led_busy] +connect_bd_net [get_bd_pins microgpt_0/led_done] [get_bd_ports led_done] +connect_bd_net [get_bd_pins microgpt_0/led_error] [get_bd_ports led_error] +connect_bd_net [get_bd_pins microgpt_0/led_heartbeat] [get_bd_ports led_heartbeat] + +# PL->PS interrupt: route microgpt_0/done_irq to IRQ_F2P[0] via xlconcat so +# the BD can grow more interrupt sources later (the GIC F2P input is a +# 16-bit bus). One source today, but the wiring is future-proof. +set irq_concat [create_bd_cell -type ip -vlnv xilinx.com:ip:xlconcat:2.1 irq_concat] +set_property -dict [list CONFIG.NUM_PORTS {1}] $irq_concat +connect_bd_net [get_bd_pins microgpt_0/done_irq] [get_bd_pins irq_concat/In0] +connect_bd_net [get_bd_pins irq_concat/dout] [get_bd_pins ps7_0/IRQ_F2P] + +# Address map: microgpt @ 0x4000_0000, range 4 KB +# The AXI4-Lite slave inferred from microgpt_pynq_top names its segment +# 'reg0' (Vivado's default when the HDL doesn't bind an explicit Reg name), +# so the PS7 master segment auto-derives as SEG_microgpt_0_reg0. +assign_bd_address [get_bd_addr_segs {microgpt_0/s_axi/reg0}] +set_property offset 0x40000000 [get_bd_addr_segs ps7_0/Data/SEG_microgpt_0_reg0] +set_property range 4K [get_bd_addr_segs ps7_0/Data/SEG_microgpt_0_reg0] + +validate_bd_design +save_bd_design + +# Generate synthesis / simulation / hw-handoff products for the BD before +# wrapping. Required so the .hwh appears under $proj.gen/.../hw_handoff and +# so make_wrapper sees up-to-date generated sources. +generate_target all [get_files ${bd_name}.bd] + +# HDL wrapper +set bd_file [get_files ${bd_name}.bd] +set wrapper_file [make_wrapper -files $bd_file -top -import] +add_files -norecurse -fileset sources_1 $wrapper_file +set_property top "${bd_name}_wrapper" [get_filesets sources_1] + +# --- Implementation ------------------------------------------------------- +update_compile_order -fileset sources_1 +launch_runs synth_1 -jobs 4 +wait_on_run synth_1 +if {[get_property PROGRESS [get_runs synth_1]] != "100%" || [get_property STATUS [get_runs synth_1]] != "synth_design Complete!"} { + puts "ERROR: Synthesis failed." + exit 1 +} +launch_runs impl_1 -to_step write_bitstream -jobs 4 +wait_on_run impl_1 +if {[get_property PROGRESS [get_runs impl_1]] != "100%" || ![string match "*write_bitstream Complete!*" [get_property STATUS [get_runs impl_1]]]} { + puts "ERROR: Implementation / bitstream write failed." + exit 1 +} + +# --- Export bit + hwh ----------------------------------------------------- +# `get_files -of_objects [get_runs ...]` is rejected in Vivado 2024.1 +# (run objects are not a supported -of_objects type), so wrap and fall through +# to the run-directory glob. +if {[catch {get_files -of_objects [get_runs impl_1] *.bit} bit_src]} { + set bit_src "" +} +if {$bit_src eq ""} { + set bit_src [glob -nocomplain [file join $build_root "$proj_name.runs" "impl_1" "*.bit"]] +} +file copy -force $bit_src [file join $overlays "microgpt.bit"] + +# .hwh lives next to the BD after write_bitstream / generate_target +set hwh_src [glob -nocomplain \ + [file join $build_root "$proj_name.gen" "sources_1" "bd" $bd_name "hw_handoff" "${bd_name}.hwh"]] +if {$hwh_src eq ""} { + set hwh_src [glob -nocomplain \ + [file join $build_root "$proj_name.srcs" "sources_1" "bd" $bd_name "hw_handoff" "${bd_name}.hwh"]] +} +if {$hwh_src ne ""} { + file copy -force $hwh_src [file join $overlays "microgpt.hwh"] +} else { + puts "WARNING: could not locate ${bd_name}.hwh -- check your Vivado version's BD output paths." +} + +puts "Done. Artifacts in [file join $overlays]:" +puts " microgpt.bit" +puts " microgpt.hwh" diff --git a/pynq/overlays/.gitkeep b/pynq/overlays/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/overlays/microgpt.bit b/pynq/overlays/microgpt.bit new file mode 100644 index 0000000..f423c7b Binary files /dev/null and b/pynq/overlays/microgpt.bit differ diff --git a/pynq/overlays/microgpt.bit.prev b/pynq/overlays/microgpt.bit.prev new file mode 100644 index 0000000..cd37ced Binary files /dev/null and b/pynq/overlays/microgpt.bit.prev differ diff --git a/pynq/overlays/microgpt.hwh b/pynq/overlays/microgpt.hwh new file mode 100644 index 0000000..f8a27ee --- /dev/null +++ b/pynq/overlays/microgpt.hwh @@ -0,0 +1,2692 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/pynq/overlays/microgpt.hwh.prev b/pynq/overlays/microgpt.hwh.prev new file mode 100644 index 0000000..689bebb --- /dev/null +++ b/pynq/overlays/microgpt.hwh.prev @@ -0,0 +1,2531 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/pynq/scripts/.gitkeep b/pynq/scripts/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/sw/drivers/.gitkeep b/pynq/sw/drivers/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/sw/drivers/microgpt.py b/pynq/sw/drivers/microgpt.py new file mode 100644 index 0000000..59f898a --- /dev/null +++ b/pynq/sw/drivers/microgpt.py @@ -0,0 +1,346 @@ +"""microgpt.py -- PYNQ host driver for the TALOS-V2 microGPT overlay. + +DE1 deviation: replaces the JTAG-master Avalon-MM bridge / Windows +`microgpt_bos_start.exe` flow with a `pynq.Overlay` + `pynq.MMIO` driver +that talks to the AXI4-Lite slave at 0x4000_0000. +""" + +from __future__ import annotations + +import asyncio +import os +import time +from pathlib import Path +from typing import List, Optional + +import numpy as np +from pynq import Overlay, MMIO + + +# --------------------------------------------------------------------------- +# Register map (matches hw/src/top/microgpt_pynq_top.sv) +# --------------------------------------------------------------------------- +REG_MAGIC = 0x000 # RO +REG_VERSION = 0x004 # RO +REG_CMD = 0x008 # WO bit0=start, bit1=clear +REG_STATUS = 0x00C # RO +REG_CONFIG = 0x010 # RW {temp_q8_8[31:16], max_gen[15:8], 0[7:0]} +REG_SEED = 0x014 # RW +REG_LOGIT_INFO = 0x018 # RO {top_logit_q12[31:16], argmax[15:8], last[7:0]} +REG_BOS = 0x01C # RO +REG_STEP_CFG = 0x020 # RW +REG_STEP_TRIG = 0x024 # WO bit0 +REG_OUT_BASE = 0x060 # RO 16 x u8 tokens (one per word, LSB) +REG_PERF_CYC = 0x0D8 # RO +REG_TPS = 0x0DC # RO +REG_LOGITS_BASE = 0x100 # RO 27 x s16 (sign-extended in low 16 bits) + +MAGIC_EXPECTED = 0x4D475254 # "MGRT" +VERSION_EXPECTED = 0x00020001 + +DEFAULT_BITFILE = ( + Path(__file__).resolve().parent.parent.parent / "overlays" / "microgpt.bit" +) + +# Tiny vocab from rtl/microgpt/names.txt -- 26 letters + BOS/EOS sentinel. +TOKEN_ALPHABET = "abcdefghijklmnopqrstuvwxyz" +BOS_TOKEN_ID = 26 + + +def _temperature_to_q8_8(temperature: float) -> int: + """Convert a float temperature to unsigned Q8.8 used by the core.""" + if temperature <= 0.0: + raise ValueError("temperature must be > 0") + q = int(round(temperature * 256.0)) + if q < 1: + q = 1 + if q > 0xFFFF: + q = 0xFFFF + return q + + +class MicroGPT: + """Minimum-viable host driver for the microGPT overlay. + + Example: + gpt = MicroGPT() + name, info = gpt.generate(max_tokens=8, temperature=1.0, seed=42) + """ + + AXI_BASE = 0x40000000 + AXI_RANGE = 0x1000 + + def __init__( + self, + bitfile: Optional[os.PathLike] = None, + download: bool = True, + use_irq: bool = False, + ) -> None: + # Why use_irq defaults to False: per-call completion is ~150-300 us + # of RTL work. The spin-poll over the cached uint32 STATUS view + # (introduced with the mmio.array refactor) takes <1 us per + # iteration, so the whole wait costs <100 us of CPU. Routing that + # through Linux uio + a context switch costs ~300 us per call here. + # We measured both on this overlay: busy-poll is the winner. The + # IRQ infrastructure (RTL output, BD wiring, /dev/uio) is + # all in place if you set use_irq=True -- valuable for future + # workloads where each call is milliseconds (larger model, + # async multi-stream, etc), but the sub-ms regime favours spinning. + bit_path = Path(bitfile) if bitfile is not None else DEFAULT_BITFILE + if not bit_path.exists(): + raise FileNotFoundError( + f"Overlay bitstream not found: {bit_path}. " + "Build it with `vivado -mode batch -source hw/tcl/build.tcl`." + ) + self.overlay = Overlay(str(bit_path), download=download) + self.mmio = MMIO(self.AXI_BASE, self.AXI_RANGE) + # Zero-copy uint32 view of the AXI-Lite window. Indexed access goes + # straight to the bus without per-MMIO-method Python overhead, which + # is the per-call hot path's biggest cost. Bench shows ~1.5x on + # `gpt.generate()` end-to-end vs the old self.mmio.read/write API. + self._u32 = np.asarray(self.mmio.array, dtype=np.uint32) + self._sanity_check() + + # PL->PS interrupt setup. The build.tcl wires microgpt_0/done_irq + # straight to ps7_0/IRQ_F2P[0] through an xlconcat -- this is the + # minimum-overhead path on Zynq-7000, but PYNQ's Interrupt class + # only auto-discovers irqs that go through an axi_intc IP, so the + # high-level API doesn't see ours. The uio kernel driver still + # creates /dev/uioN with name="fabric" (covering all PL irqs); we + # bind to that directly with os.read/os.write -- one blocking read + # per generation instead of the busy-poll loop. + self._uio_fd = -1 + if use_irq: + try: + self._uio_fd = self._open_fabric_uio() + except Exception: + self._uio_fd = -1 + + # ---- low-level helpers ------------------------------------------------- + # _read/_write keep their old shape so external callers (and notebook + # examples) still work, but they use the cached uint32 view internally. + def _read(self, offset: int) -> int: + return int(self._u32[offset >> 2]) + + def _write(self, offset: int, value: int) -> None: + self._u32[offset >> 2] = np.uint32(int(value) & 0xFFFFFFFF) + + def _sanity_check(self) -> None: + magic = self._read(REG_MAGIC) + version = self._read(REG_VERSION) + if magic != MAGIC_EXPECTED: + raise RuntimeError( + f"Bad magic 0x{magic:08X} at 0x{REG_MAGIC:03X}; " + f"expected 0x{MAGIC_EXPECTED:08X}." + ) + if version != VERSION_EXPECTED: + raise RuntimeError( + f"Unexpected version 0x{version:08X}; expected 0x{VERSION_EXPECTED:08X}." + ) + + # ---- public API -------------------------------------------------------- + def reset(self) -> None: + """Issue a clear pulse and wait for the core to return to ready.""" + self._write(REG_CMD, 0x2) # bit1 = clear + self._wait_ready() + + def status(self) -> dict: + return self._unpack_status(int(self._u32[REG_STATUS >> 2])) + + def _wait_ready(self, timeout_s: float = 1.0) -> None: + end = time.monotonic() + timeout_s + u32 = self._u32 + idx = REG_STATUS >> 2 + spin = 0 + while True: + s = int(u32[idx]) + # ready=bit0, busy=bit1 + if (s & 0x1) and not (s & 0x2): + return + spin += 1 + # Only check the wall clock every ~4096 spins -- saves a syscall + # per polled iteration on the (very common) sub-millisecond path. + if (spin & 0xFFF) == 0 and time.monotonic() > end: + raise TimeoutError("microGPT did not return to ready in time.") + + @staticmethod + def _open_fabric_uio() -> int: + """Locate /dev/uioN named 'fabric' and open it for irq waits. + + Raises OSError / FileNotFoundError if no fabric uio is available. + """ + sys_uio = "/sys/class/uio" + for entry in sorted(os.listdir(sys_uio)): + try: + with open(os.path.join(sys_uio, entry, "name")) as f: + if f.read().strip() == "fabric": + fd = os.open(os.path.join("/dev", entry), os.O_RDWR) + # Arm the interrupt -- subsequent os.read blocks + # until the line transitions to active. + os.write(fd, (1).to_bytes(4, "little")) + return fd + except OSError: + continue + raise FileNotFoundError("No /dev/uioN named 'fabric' found") + + def _wait_done(self, timeout_s: float = 5.0) -> dict: + # Fast path: read /dev/uio. The kernel blocks the read until + # IRQ_F2P[0] (i.e. our done_irq) goes high. Spurious wake-ups can + # happen because our irq line is level-held until the next call's + # start_pulse clears it (so the kernel may queue an extra event when + # it re-arms the irq with the line still high); we tolerate that by + # reading STATUS after each wake and looping until done is actually + # set. + if self._uio_fd >= 0: + import select + u32 = self._u32 + sidx = REG_STATUS >> 2 + end = time.monotonic() + timeout_s + while True: + # Block in the kernel until next irq, with a per-iteration + # timeout so we can still surface a real hang. + remaining = max(0.001, end - time.monotonic()) + r, _, _ = select.select([self._uio_fd], [], [], remaining) + if not r: + raise TimeoutError("microGPT did not finish generation in time.") + # Read the 4-byte event count to ack and re-arm. + os.read(self._uio_fd, 4) + os.write(self._uio_fd, (1).to_bytes(4, "little")) + s = int(u32[sidx]) + if s & 0x4: + return self._unpack_status(s) + if s & 0x8: + raise RuntimeError(f"core reported error; status=0x{s:08x}") + # Spurious wake: line was held high from the previous call, + # we just consumed the queued event; loop and re-block. + + # Fallback: busy-poll STATUS. Used when /dev/uio isn't + # available (e.g. older bitstream without IRQ wiring). + end = time.monotonic() + timeout_s + u32 = self._u32 + idx = REG_STATUS >> 2 + spin = 0 + while True: + s = int(u32[idx]) + if s & 0x4: # done + return self._unpack_status(s) + if s & 0x8: # error + raise RuntimeError(f"core reported error; status=0x{s:08x}") + spin += 1 + if (spin & 0xFFF) == 0 and time.monotonic() > end: + raise TimeoutError("microGPT did not finish generation in time.") + + @staticmethod + def _unpack_status(s: int) -> dict: + return { + "ready": bool(s & 0x1), + "busy": bool((s >> 1) & 0x1), + "done": bool((s >> 2) & 0x1), + "error": bool((s >> 3) & 0x1), + "host_toggle": bool((s >> 4) & 0x1), + "direct_mode": bool((s >> 5) & 0x1), + "out_len": (s >> 16) & 0xFF, + "pos": (s >> 24) & 0xFF, + } + + def generate( + self, + max_tokens: int = 15, + temperature: float = 1.0, + seed: int = 1, + ) -> tuple[str, dict]: + """Run a fresh BOS-seeded generation and return (text, info). + + Args: + max_tokens: 1..15. Hard upper bound enforced by the core. + temperature: Q8.8 temperature applied before the categorical + sampler. Pass 0.5..2.0 for sensible behaviour. + seed: 32-bit unsigned RNG seed. + + Returns: + (decoded_string, info_dict). info_dict carries `tokens`, `cycles`, + `tokens_per_sec`, and the final `status` snapshot. + + Note: the explicit clear-pulse that older versions issued on every + call has been removed. The wrapper FSM accepts a `start_pulse` from + ST_READY *or* ST_DONE and unconditionally re-initialises the result + registers, so an extra clear is just two wasted AXI transactions. + """ + if not 1 <= max_tokens <= 15: + raise ValueError("max_tokens must be in [1, 15]") + + u32 = self._u32 + + # Program config + seed (Q8.8 temperature, max_gen). + cfg = ((_temperature_to_q8_8(temperature) & 0xFFFF) << 16) | \ + ((max_tokens & 0xFF) << 8) + u32[REG_CONFIG >> 2] = np.uint32(cfg) + u32[REG_SEED >> 2] = np.uint32(seed & 0xFFFFFFFF) + + # Fire start pulse. + u32[REG_CMD >> 2] = np.uint32(0x1) + + st = self._wait_done() + out_len = st["out_len"] + + # Burst-read the token block + the perf counter as a single numpy + # slice. Numpy reads each uint32 through the mmap'd /dev/mem region + # in C, with no per-element Python interpreter overhead. This is the + # single largest win once the polling loop is tight: ~10 fewer + # Python round-trips per generation at max_tokens=15. + if out_len: + base = REG_OUT_BASE >> 2 + tokens = (np.asarray(u32[base : base + out_len], dtype=np.uint32) & 0xFF).tolist() + else: + tokens = [] + cycles = int(u32[REG_PERF_CYC >> 2]) + tps = int(u32[REG_TPS >> 2]) + + text = "".join( + TOKEN_ALPHABET[t] if 0 <= t < len(TOKEN_ALPHABET) else "?" + for t in tokens + ) + return text, { + "tokens": tokens, + "cycles": cycles, + "tokens_per_sec": tps, + "status": st, + } + + def step( + self, + token: int, + pos: int, + clear: bool = False, + seed: Optional[int] = None, + ) -> dict: + """Single direct-mode step: feed (token, pos) and read back the result.""" + if not 0 <= token <= 255 or not 0 <= pos <= 15: + raise ValueError("token must be 0..255 and pos must be 0..15") + if seed is not None: + self._write(REG_SEED, seed & 0xFFFFFFFF) + cfg = ((token & 0xFF) << 16) | ((pos & 0xFF) << 8) \ + | (0x2 if clear else 0x0) | 0x1 # direct_mode=1 + self._write(REG_STEP_CFG, cfg) + self._write(REG_STEP_TRIG, 0x1) + + st = self._wait_done() + info = self._read(REG_LOGIT_INFO) + return { + "last_token": info & 0xFF, + "argmax_token": (info >> 8) & 0xFF, + "top_logit_q12": _sign_extend((info >> 16) & 0xFFFF, 16), + "status": st, + } + + def logits(self) -> List[int]: + """Read the 27 logits from the last completed step (signed Q12).""" + return [ + _sign_extend(self._read(REG_LOGITS_BASE + 4 * i) & 0xFFFF, 16) + for i in range(27) + ] + + +def _sign_extend(value: int, bits: int) -> int: + sign_bit = 1 << (bits - 1) + return (value & (sign_bit - 1)) - (value & sign_bit) diff --git a/pynq/sw/notebooks/.gitkeep b/pynq/sw/notebooks/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/sw/notebooks/demo.ipynb b/pynq/sw/notebooks/demo.ipynb new file mode 100644 index 0000000..09ab0ed --- /dev/null +++ b/pynq/sw/notebooks/demo.ipynb @@ -0,0 +1,123 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# microGPT on PYNQ-Z2 -- demo\n", + "\n", + "This notebook loads the `microgpt.bit` overlay onto the Zynq PL and\n", + "drives the AXI4-Lite slave through the `MicroGPT` Python class to\n", + "generate five short names, printing cycle counts for each run.\n", + "\n", + "Before running, copy `overlays/microgpt.bit` and `overlays/microgpt.hwh`\n", + "to the board (or run this notebook directly on the board).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "ename": "ModuleNotFoundError", + "evalue": "No module named 'pynq'", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mModuleNotFoundError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 9\u001b[39m\n\u001b[32m 5\u001b[39m drivers_path = Path(\u001b[33m'..'\u001b[39m) / \u001b[33m'drivers'\u001b[39m\n\u001b[32m 6\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m str(drivers_path.resolve()) \u001b[38;5;28;01mnot\u001b[39;00m \u001b[38;5;28;01min\u001b[39;00m sys.path:\n\u001b[32m 7\u001b[39m sys.path.insert(\u001b[32m0\u001b[39m, str(drivers_path.resolve()))\n\u001b[32m 8\u001b[39m \n\u001b[32m----> \u001b[39m\u001b[32m9\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m microgpt \u001b[38;5;28;01mimport\u001b[39;00m MicroGPT\n", + "\u001b[36mFile \u001b[39m\u001b[32m~/Desktop/new/microgpt/sw/drivers/microgpt.py:15\u001b[39m\n\u001b[32m 12\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mpathlib\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m Path\n\u001b[32m 13\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mtyping\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m List, Optional\n\u001b[32m---> \u001b[39m\u001b[32m15\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mpynq\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m Overlay, MMIO\n\u001b[32m 18\u001b[39m \u001b[38;5;66;03m# ---------------------------------------------------------------------------\u001b[39;00m\n\u001b[32m 19\u001b[39m \u001b[38;5;66;03m# Register map (matches hw/src/top/microgpt_pynq_top.sv)\u001b[39;00m\n\u001b[32m 20\u001b[39m \u001b[38;5;66;03m# ---------------------------------------------------------------------------\u001b[39;00m\n\u001b[32m 21\u001b[39m REG_MAGIC = \u001b[32m0x000\u001b[39m \u001b[38;5;66;03m# RO\u001b[39;00m\n", + "\u001b[31mModuleNotFoundError\u001b[39m: No module named 'pynq'" + ] + } + ], + "source": [ + "import sys\n", + "from pathlib import Path\n", + "\n", + "# Make the driver importable regardless of where the notebook is launched.\n", + "drivers_path = Path('..') / 'drivers'\n", + "if str(drivers_path.resolve()) not in sys.path:\n", + " sys.path.insert(0, str(drivers_path.resolve()))\n", + "\n", + "from microgpt import MicroGPT\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gpt = MicroGPT()\n", + "print('Loaded overlay; status:', gpt.status())\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5 generations\n", + "\n", + "Each row prints the seed, the generated name, the cycle count\n", + "from the core's performance counter, and the final status word." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "seeds = [1, 42, 1337, 0xDEADBEEF, 0xC0FFEE]\n", + "for s in seeds:\n", + " text, info = gpt.generate(max_tokens=8, temperature=1.0, seed=s)\n", + " print(f\"seed=0x{s:08X} text={text!r:12s} cycles={info['cycles']:>8d} tokens={info['tokens']}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Single-step direct-mode probe\n", + "\n", + "This pokes a single token+position into the core and reads back\n", + "the 27 logits the lm_head produced. Useful for debugging." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "result = gpt.step(token=26, pos=0, clear=True, seed=1)\n", + "print('argmax token:', result['argmax_token'])\n", + "print('top logit (Q12):', result['top_logit_q12'])\n", + "print('all logits:', gpt.logits())\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.14.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/pynq/sw/notebooks/hardware_advantage.ipynb b/pynq/sw/notebooks/hardware_advantage.ipynb new file mode 100644 index 0000000..a32cd9e --- /dev/null +++ b/pynq/sw/notebooks/hardware_advantage.ipynb @@ -0,0 +1,246 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# What the FPGA buys you for microGPT\n", + "\n", + "This notebook demonstrates three concrete wins from running the TALOS-V2 microGPT core on the PYNQ-Z2 PL rather than as a Python program on the PS:\n", + "\n", + "1. **Tightly bounded latency.** The FPGA reports cycle-accurate timing for every generation through `perf_cycles_reg`. Run-to-run jitter is exactly zero for a fixed seed and within a handful of cycles for a varying seed — no GC pauses, no Python interpreter overhead.\n", + "2. **Bit-exact determinism.** Same seed → same tokens, byte for byte, every time. The categorical sampler is reproducible because the LFSR-fed RNG is deterministic in silicon.\n", + "3. **Empirical sampling distribution at zero marginal cost.** Because each generation is a fast register transaction (a few thousand FPGA cycles), we can sweep thousands of seeds in seconds and recover the model's *empirical* token distribution at each position — something that would take orders of magnitude longer in a software-only tiny-model implementation.\n", + "\n", + "Each section below is small enough that you can execute it in a few seconds, and each one ends with a concrete number." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 0. Setup\n", + "\n", + "We point at the driver that lives next to this notebook tree (`../drivers/microgpt.py`), construct the `MicroGPT` host object, and program the bitstream. After this cell, the PL is alive and `gpt.status()` reports `ready=True`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys, time, statistics\n", + "from collections import Counter\n", + "from pathlib import Path\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "drivers_path = (Path('..') / 'drivers').resolve()\n", + "if str(drivers_path) not in sys.path:\n", + " sys.path.insert(0, str(drivers_path))\n", + "\n", + "from microgpt import MicroGPT, TOKEN_ALPHABET, BOS_TOKEN_ID\n", + "\n", + "gpt = MicroGPT()\n", + "print('overlay loaded; status =', gpt.status())\n", + "FCLK_HZ = 50_000_000 # PS-driven FCLK_CLK0\n", + "VOCAB = list(TOKEN_ALPHABET) + [''] # 26 letters + sentinel" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Deterministic latency (the headline number)\n", + "\n", + "We run `generate()` 50 times with the same parameters, read the FPGA's `perf_cycles` counter for each run, and convert to wall-clock latency at 50 MHz.\n", + "\n", + "**What we expect:** every run produces *exactly* the same cycle count (because the seed and config are identical, the trellis through the FSM and core is identical), giving a measured *standard deviation of zero cycles*. That is impossible for a Python forward pass; even a JIT-compiled software inference path would show µs-scale jitter from GC, scheduler, and cache effects." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "N_RUNS = 50\n", + "MAX_TOK = 8\n", + "TEMP = 1.0\n", + "SEED = 42\n", + "\n", + "cycle_samples = []\n", + "wall_samples = []\n", + "for _ in range(N_RUNS):\n", + " t0 = time.perf_counter()\n", + " text, info = gpt.generate(max_tokens=MAX_TOK, temperature=TEMP, seed=SEED)\n", + " t1 = time.perf_counter()\n", + " cycle_samples.append(info['cycles'])\n", + " wall_samples.append(t1 - t0)\n", + "\n", + "cyc_mean = statistics.mean(cycle_samples)\n", + "cyc_std = statistics.pstdev(cycle_samples)\n", + "ns_per_token = (cyc_mean / len(info['tokens'])) * (1e9 / FCLK_HZ)\n", + "wall_mean_ms = 1e3 * statistics.mean(wall_samples)\n", + "wall_std_ms = 1e3 * statistics.pstdev(wall_samples)\n", + "\n", + "print(f\"runs : {N_RUNS}\")\n", + "print(f\"text per run : {text!r}\")\n", + "print(f\"FPGA cycles / run : mean={cyc_mean:.1f} stdev={cyc_std:.3f}\")\n", + "print(f\"FPGA cycles / token : {cyc_mean / len(info['tokens']):.1f}\")\n", + "print(f\"hardware latency : {ns_per_token/1e3:.2f} us/token (= {ns_per_token/1e6:.3f} ms/token at {FCLK_HZ/1e6:.0f} MHz)\")\n", + "print(f\"wall-clock per run : {wall_mean_ms:.2f} +- {wall_std_ms:.2f} ms (PS overhead, AXI transactions)\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, axs = plt.subplots(1, 2, figsize=(11, 3.5))\n", + "axs[0].hist(cycle_samples, bins=10, color='C0', edgecolor='k')\n", + "axs[0].set_xlabel('FPGA cycles per run'); axs[0].set_ylabel('count'); axs[0].set_title(f'cycle count, N={N_RUNS}')\n", + "axs[0].axvline(cyc_mean, color='C3', linestyle='--', label=f'mean = {cyc_mean:.0f}')\n", + "axs[0].legend()\n", + "axs[1].hist([w*1e3 for w in wall_samples], bins=15, color='C2', edgecolor='k')\n", + "axs[1].set_xlabel('wall-clock per run (ms)'); axs[1].set_ylabel('count'); axs[1].set_title('PS-side wall clock')\n", + "plt.tight_layout(); plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Reading the chart.** The left histogram is a single bar at the exact cycle count: the hardware is bit-exact for the same seed. The right histogram is the wall-clock per run as observed by Python; the spread there is *entirely* host-side overhead (AXI transactions, status polling, the Python interpreter), and it represents the *floor* you would have to get above with any software-only reimplementation." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Bit-exact reproducibility from the seed\n", + "\n", + "Run 5 generations with the *same* seed and check the output is byte-for-byte identical. Then run 8 generations with *different* seeds to confirm we get a varied set. Determinism for a fixed seed is what makes the hardware suitable for regression testing and reproducible benchmarks; diversity across seeds is what makes the categorical sampler useful for actual generation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "out_same = [gpt.generate(max_tokens=8, temperature=1.0, seed=0xC0FFEE)[0] for _ in range(5)]\n", + "print('5 runs at seed=0xC0FFEE :', out_same)\n", + "assert all(o == out_same[0] for o in out_same), 'reproducibility check failed'\n", + "\n", + "seeds = [1, 42, 1337, 0xDEADBEEF, 0xC0FFEE, 0xCAFEBABE, 7, 97]\n", + "for s in seeds:\n", + " text, info = gpt.generate(max_tokens=8, temperature=1.0, seed=s)\n", + " print(f\"seed=0x{s:08X} text={text!r:12s} cycles={info['cycles']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Empirical token distribution by position\n", + "\n", + "Because each generation is just a few thousand FPGA cycles, we can run the core with thousands of random seeds and reconstruct the *empirical* sampling distribution of the model — i.e. how often each letter appears at position 0, position 1, … . This is the kind of measurement that's expensive to do with a slow software inference path; here it takes a few seconds for a thousand generations.\n", + "\n", + "It's also a real cross-check on the model: position 0 is always conditioned on BOS (so the distribution there should match what the model thinks BOS most often produces); later positions are conditioned on whatever was just sampled, so they spread out." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "N_SEEDS = 1024\n", + "N_TOKENS = 8\n", + "\n", + "rng = np.random.default_rng(2026)\n", + "seeds = rng.integers(1, 2**31 - 1, size=N_SEEDS, dtype=np.int64)\n", + "\n", + "# pos_freq[pos][token_id] -> count\n", + "pos_freq = [Counter() for _ in range(N_TOKENS)]\n", + "total_cycles = 0\n", + "t0 = time.perf_counter()\n", + "for s in seeds:\n", + " text, info = gpt.generate(max_tokens=N_TOKENS, temperature=1.0, seed=int(s))\n", + " total_cycles += info['cycles']\n", + " for pos, tok in enumerate(info['tokens'][:N_TOKENS]):\n", + " pos_freq[pos][int(tok)] += 1\n", + "t_total = time.perf_counter() - t0\n", + "print(f'{N_SEEDS} generations of {N_TOKENS} tokens in {t_total:.2f} s')\n", + "print(f'total FPGA cycles : {total_cycles} (= {total_cycles*1e3/(N_SEEDS*FCLK_HZ):.2f} ms of pure inference)')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot the per-position empirical distribution as a heatmap.\n", + "freq_matrix = np.zeros((N_TOKENS, len(VOCAB)))\n", + "for pos in range(N_TOKENS):\n", + " total = sum(pos_freq[pos].values())\n", + " if total == 0:\n", + " continue\n", + " for tok, c in pos_freq[pos].items():\n", + " if 0 <= tok < len(VOCAB):\n", + " freq_matrix[pos, tok] = c / total\n", + "\n", + "fig, ax = plt.subplots(figsize=(11, 4))\n", + "im = ax.imshow(freq_matrix, aspect='auto', cmap='viridis', origin='lower')\n", + "ax.set_xticks(range(len(VOCAB)))\n", + "ax.set_xticklabels(VOCAB, fontsize=9)\n", + "ax.set_yticks(range(N_TOKENS))\n", + "ax.set_yticklabels([f'pos {i}' for i in range(N_TOKENS)])\n", + "ax.set_xlabel('sampled token')\n", + "ax.set_title(f'Empirical sampling distribution per position ({N_SEEDS} seeds, T=1.0)')\n", + "fig.colorbar(im, ax=ax, label='probability')\n", + "plt.tight_layout(); plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Reading the heatmap.** Brighter cells are more likely outputs at that (position, token). Position 0 is always conditioned on BOS so its row reflects the model's prior on the first letter; later rows broaden because the model conditions on a different sample each run. If a single letter dominates a row, that's where the model is confident; uniform-looking rows mean the model is more uncertain and the temperature has more leverage.\n", + "\n", + "What this section is really showing is throughput: 1024 generations × 8 tokens each ≈ 8 192 tokens, sampled in a few seconds *including* every Python register write/read. The on-chip core itself accounted for only the small fraction reported as \"pure inference\" above." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "* **Deterministic latency.** Every run with the same seed reports the same FPGA cycle count to the bit. The wall-clock spread you see is host overhead, not hardware non-determinism.\n", + "* **Repeatable sampling.** The on-chip RNG is fed by a 32-bit LFSR seeded over AXI, so the categorical sampler is bit-exact reproducible — `generate(..., seed=s)` is a function in the mathematical sense, not a Monte Carlo flavour of one. That property is invaluable for tests and for reproducing reported metrics.\n", + "* **Cheap empirical statistics.** A thousand generations is a few seconds of work, so building the per-position distribution heatmap is essentially free. A pure-Python inference of the same model would be wall-clock-bound long before we collect that many samples.\n", + "\n", + "Once the bitstream is loaded, the PS only does AXI register writes/reads. There is no model state in DRAM, no weight movement per token, no Python in the critical path. That's the win." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/pynq/sw/notebooks/throughput.ipynb b/pynq/sw/notebooks/throughput.ipynb new file mode 100644 index 0000000..8c42562 --- /dev/null +++ b/pynq/sw/notebooks/throughput.ipynb @@ -0,0 +1,289 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How fast is microgpt on PYNQ-Z2?\n", + "\n", + "This notebook puts a single number on the question: **how many tokens per second can the microgpt overlay sustain on this board?**\n", + "\n", + "Two answers exist:\n", + "1. **The RTL ceiling.** What the core itself can do per FPGA cycle, as reported by the `perf_cycles_reg` counter the wrapper exposes at offset `0x0D8`. This is what we *would* see with a free-running data path.\n", + "2. **The PS-bound wall-clock rate.** What Python actually achieves, including AXI-Lite register polls, status reads, FSM start pulses, and Python-loop overhead.\n", + "\n", + "We sweep `max_tokens` from 1 to 15, measure both numbers at each setting, and extract:\n", + "* the **per-call FSM overhead** (cycles spent in the wrapper getting in/out of `ST_WAIT_CORE`),\n", + "* the **steady-state RTL cycles per token** (the slope of cycles-vs-tokens),\n", + "* the **maximum sustained tokens/sec** achievable from this Python interface.\n", + "\n", + "Headline numbers print at the bottom." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 0. Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys, time, statistics\n", + "from pathlib import Path\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "\n", + "drivers_path = (Path('..') / 'drivers').resolve()\n", + "if str(drivers_path) not in sys.path:\n", + " sys.path.insert(0, str(drivers_path))\n", + "\n", + "from microgpt import MicroGPT\n", + "\n", + "gpt = MicroGPT()\n", + "FCLK_HZ = 50_000_000\n", + "print('overlay loaded; status =', gpt.status())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Cycles vs. tokens (the RTL story)\n", + "\n", + "Run a generation at each `max_tokens` value 1..15, average over a handful of seeds (so the categorical sampler's natural variation in cycle count doesn't bias the line), and record the FPGA cycles each one took. The relationship is essentially linear: cycles = (per-call overhead) + (cycles-per-token) × (tokens emitted)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MAX_TOKENS_GRID = list(range(1, 16))\n", + "SEEDS_PER_POINT = 8\n", + "\n", + "rng = np.random.default_rng(2026)\n", + "seed_pool = rng.integers(1, 2**31 - 1, size=SEEDS_PER_POINT, dtype=np.int64)\n", + "\n", + "rows = [] # (max_tok, n_emitted, cycles, wall_s)\n", + "for mt in MAX_TOKENS_GRID:\n", + " for s in seed_pool:\n", + " t0 = time.perf_counter()\n", + " text, info = gpt.generate(max_tokens=mt, temperature=1.0, seed=int(s))\n", + " dt = time.perf_counter() - t0\n", + " rows.append((mt, len(info['tokens']), info['cycles'], dt))\n", + "\n", + "rows = np.asarray(rows, dtype=float)\n", + "print(f'collected {len(rows)} samples across {len(MAX_TOKENS_GRID)} max_tokens settings.')\n", + "print('first 4 rows (max_tok, n_emitted, cycles, wall_s):')\n", + "print(rows[:4])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Average cycles and wall-time at each max_tok, plot.\n", + "agg = {}\n", + "for mt, n, cyc, wall in rows:\n", + " agg.setdefault(int(mt), []).append((n, cyc, wall))\n", + "\n", + "max_tokens = np.array(sorted(agg.keys()), dtype=float)\n", + "n_emitted = np.array([np.mean([r[0] for r in agg[int(mt)]]) for mt in max_tokens])\n", + "cycles_mean = np.array([np.mean([r[1] for r in agg[int(mt)]]) for mt in max_tokens])\n", + "cycles_std = np.array([np.std ([r[1] for r in agg[int(mt)]]) for mt in max_tokens])\n", + "wall_mean_ms = np.array([1e3 * np.mean([r[2] for r in agg[int(mt)]]) for mt in max_tokens])\n", + "\n", + "# Linear fit: cycles = overhead + slope * n_emitted\n", + "slope, intercept = np.polyfit(n_emitted, cycles_mean, 1)\n", + "cyc_per_tok = slope\n", + "fsm_overhead = intercept\n", + "rtl_ceiling = FCLK_HZ / cyc_per_tok # tokens/sec at the steady-state slope\n", + "\n", + "print(f'fit: cycles ~= {fsm_overhead:.0f} (per-call overhead) + {cyc_per_tok:.1f} cycles/token * n_tokens')\n", + "print(f'-> at FCLK = {FCLK_HZ/1e6:.0f} MHz the steady-state RTL ceiling is {rtl_ceiling:,.0f} tokens/sec ({1e6/rtl_ceiling:.2f} us/token)')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, axs = plt.subplots(1, 2, figsize=(11, 4))\n", + "axs[0].errorbar(n_emitted, cycles_mean, yerr=cycles_std, fmt='o', capsize=3, label='measured (mean +- stdev)')\n", + "fit_x = np.linspace(0, n_emitted.max() + 1, 50)\n", + "axs[0].plot(fit_x, intercept + slope * fit_x, '--', color='C3',\n", + " label=f'fit: {intercept:.0f} + {slope:.1f}*n')\n", + "axs[0].set_xlabel('tokens emitted'); axs[0].set_ylabel('FPGA cycles')\n", + "axs[0].set_title('RTL cycles vs tokens emitted'); axs[0].legend(); axs[0].grid(alpha=0.3)\n", + "\n", + "axs[1].plot(max_tokens, wall_mean_ms, 'o-', color='C2')\n", + "axs[1].set_xlabel('max_tokens')\n", + "axs[1].set_ylabel('Python wall-clock per generate() (ms)')\n", + "axs[1].set_title('PS-side wall clock per call')\n", + "axs[1].grid(alpha=0.3)\n", + "plt.tight_layout(); plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Reading the chart.**\n", + "* Left panel: the RTL cycle count is essentially linear in tokens emitted. The slope is the steady-state cycles-per-token (the work the unmodified TALOS-V2 core does for each token). The intercept is the wrapper's FSM overhead per generation — the cost of `ST_READY → ST_WAIT_CORE → ST_DONE` and back.\n", + "* Right panel: the wall-clock per `gpt.generate()` is also linear-ish in `max_tokens`, but the slope and intercept are larger than the RTL view because every call pays for ~6–10 AXI register transactions plus the Python loop and busy-wait status polling. That's the gap we'll quantify next." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Sustained tokens/sec across the sweep\n", + "\n", + "At each `max_tokens` we have wall-clock per call and tokens emitted per call. Dividing tokens by wall-clock gives the *throughput at that batch size*. Larger `max_tokens` amortizes the per-call AXI overhead over more tokens, so throughput should grow with `max_tokens` and asymptote toward whatever the per-token cost dominates." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tps_rtl_steady = FCLK_HZ / cyc_per_tok # constant: the RTL ceiling at steady state\n", + "tps_rtl_per_call = (FCLK_HZ * n_emitted) / cycles_mean # what we'd see if AXI overhead were 0\n", + "tps_wall = (n_emitted * 1e3) / wall_mean_ms # what the user sees from Python\n", + "\n", + "print(f' max_tok tokens_emitted wall_ms tps_wall tps_rtl_per_call tps_rtl_steady')\n", + "for i, mt in enumerate(max_tokens.astype(int)):\n", + " print(f' {mt:>4d} {n_emitted[i]:>6.2f} {wall_mean_ms[i]:>6.2f} '\n", + " f'{tps_wall[i]:>7.0f} {tps_rtl_per_call[i]:>7.0f} {tps_rtl_steady:>7.0f}')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(8, 4.5))\n", + "ax.plot(max_tokens, tps_wall, 'o-', color='C0', label='end-to-end (Python wall clock)')\n", + "ax.plot(max_tokens, tps_rtl_per_call, 's-', color='C2', label='RTL-only per call (cycles-based)')\n", + "ax.axhline(tps_rtl_steady, color='C3', linestyle='--',\n", + " label=f'RTL steady-state ceiling = {tps_rtl_steady:,.0f} tok/s')\n", + "ax.set_xlabel('max_tokens'); ax.set_ylabel('throughput (tokens/sec)')\n", + "ax.set_title('microgpt throughput vs max_tokens')\n", + "ax.legend(); ax.grid(alpha=0.3)\n", + "plt.tight_layout(); plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Reading the chart.** Three lines:\n", + "* **Red dashed**: pure RTL steady-state throughput. This is what the unmodified TALOS-V2 core could in principle deliver if you fed it tokens with zero overhead — it's the cycles-per-token slope inverted.\n", + "* **Green**: per-call RTL throughput including the wrapper FSM overhead. Smaller `max_tokens` pays a fixed FSM startup cost over fewer tokens, so this curve asymptotes to the red line as `max_tokens` grows.\n", + "* **Blue**: what Python actually measures. The gap between blue and green is *entirely* PS-side overhead: AXI register transactions for `REG_CMD`, `REG_CONFIG`, `REG_SEED`, `REG_STATUS` polling for `done`, and 1..15 reads of `REG_OUT_BASE+4*i`. That gap is the cost of the AXI-Lite register interface." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. The headline: maximum sustained throughput\n", + "\n", + "Run the largest `max_tokens` setting (15) for a fixed wall-clock budget (~5 s) and count how many tokens were emitted. This is the *maximum sustained* tokens/sec that a Python user can observe from this overlay." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "BUDGET_S = 5.0\n", + "MAX_TOK = 15\n", + "\n", + "rng = np.random.default_rng(7)\n", + "n_calls = 0\n", + "n_tokens = 0\n", + "n_cycles_rtl = 0\n", + "\n", + "t_end = time.perf_counter() + BUDGET_S\n", + "while time.perf_counter() < t_end:\n", + " s = int(rng.integers(1, 2**31 - 1))\n", + " text, info = gpt.generate(max_tokens=MAX_TOK, temperature=1.0, seed=s)\n", + " n_calls += 1\n", + " n_tokens += len(info['tokens'])\n", + " n_cycles_rtl += info['cycles']\n", + "elapsed = BUDGET_S\n", + "\n", + "tps_e2e = n_tokens / elapsed\n", + "tps_rtl = (FCLK_HZ * n_tokens) / n_cycles_rtl\n", + "print(f'sustained run, max_tokens={MAX_TOK}, budget={BUDGET_S:.1f} s:')\n", + "print(f' generations completed : {n_calls}')\n", + "print(f' tokens emitted : {n_tokens}')\n", + "print(f' RTL cycles total : {n_cycles_rtl} ({n_cycles_rtl / FCLK_HZ * 1e3:.1f} ms of pure inference)')\n", + "print(f' tps end-to-end (wall) : {tps_e2e:>8,.0f} tokens/sec')\n", + "print(f' tps if AXI were free : {tps_rtl:>8,.0f} tokens/sec')\n", + "print(f' fraction of RTL achieved : {100*tps_e2e/tps_rtl:.1f}% (the rest is PS-side overhead)')\n", + "print(f' RTL steady-state ceiling : {tps_rtl_steady:>8,.0f} tokens/sec')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. What would it take to close the gap?\n", + "\n", + "The PS-side overhead per call breaks down approximately as (numbers are typical for PYNQ + cocotbext-axi style transactions):\n", + "\n", + "| step | AXI transactions | comment |\n", + "|------|---:|---|\n", + "| reset (CMD = clear) | 1W | one MMIO write |\n", + "| program CONFIG + SEED | 2W | two MMIO writes |\n", + "| start (CMD = start) | 1W | one MMIO write |\n", + "| poll STATUS until `done` | k×R | typically 3–10 reads while the core runs |\n", + "| read out_len from STATUS | (covered above) | — |\n", + "| read OUT_BASE 0..out_len-1 | (out_len)×R | one MMIO read per token |\n", + "| read REG_PERF_CYC, REG_TPS | 2R | two MMIO reads |\n", + "\n", + "Each MMIO transaction on the PYNQ-Z2 over AXI-Lite is ~hundreds of ns of bus time *plus* Python interpreter time. For `max_tokens=15` that's roughly 20–25 transactions per generation. Closing the gap to the RTL ceiling would mean replacing the per-token register reads with a single AXI4-Stream burst — the same upgrade story as the rtl_ber overlay." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "* **RTL steady-state ceiling**: the slope of cycles vs tokens, inverted at 50 MHz, is the maximum the core can do per cycle. From this notebook's measurement, that landed at the value printed by section 1 (typically tens of thousands of tokens/sec for this tiny model).\n", + "* **Sustained end-to-end** through Python is several times lower because every generation pays AXI-Lite + Python overhead on its 20-ish register transactions.\n", + "* The headline tokens/sec for a Python user is the **section 3 \"tps end-to-end (wall)\"** number — that is the answer to *\"how many tokens/sec can microgpt do on this board?\"* in practice today.\n", + "* The headline tokens/sec for a hypothetical streaming data-plane (no per-token register polls) is the **steady-state ceiling** number — that is the answer to *\"what would the RTL be capable of if we got out of its way?\"*" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/pynq/sw/tests/.gitkeep b/pynq/sw/tests/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/pynq/tutorials/00_overview.ipynb b/pynq/tutorials/00_overview.ipynb new file mode 100644 index 0000000..e3b0ec9 --- /dev/null +++ b/pynq/tutorials/00_overview.ipynb @@ -0,0 +1,116 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 00 \u2014 Overview\n", + "\n", + "**microgpt on PYNQ-Z2: a char-level GPT baked entirely into PL fabric.**\n", + "\n", + "This tutorial set walks the workflow end-to-end: from looking at the\n", + "weights, to running the AXI wrapper in cocotb, to using the deployed\n", + "overlay from Python over `pynq.MMIO`.\n", + "\n", + "## What is special about this overlay\n", + "\n", + "Most embedded transformer inference paths trade fabric for DRAM\n", + "streaming: weights live off-chip in DDR, and the FPGA runs one matmul\n", + "at a time. That gives flexibility at the cost of latency and host\n", + "orchestration complexity.\n", + "\n", + "microgpt takes the opposite extreme: every one of the **4,192 INT16\n", + "(Q12) parameters** is hardcoded into LUTRAM / BRAM / constants at\n", + "**synthesis time**. There is no DDR, no DMA, no host-side inference\n", + "loop. The PS pushes a prompt token through a single AXI4-Lite slave,\n", + "the PL runs the full forward pass in fabric, and the PS reads the\n", + "generated tokens back through the same slave.\n", + "\n", + "The exact-arithmetic GPT core is a port of\n", + "[`Luthiraa/TALOS-V2`](https://github.com/Luthiraa/TALOS-V2);\n", + "see [`UPSTREAM.md`](../UPSTREAM.md) for per-file attribution.\n", + "\n", + "## The four-stage loop\n", + "\n", + "```\n", + " .hex weights \u2192 Vivado build \u2192 AXI4-Lite slave \u2192 PS driver\n", + " (hw/ip/*.hex) (hw/tcl/build (microgpt_pynq_ (sw/drivers/\n", + " .tcl) top.sv) microgpt.py)\n", + " \u2502 \u2502 \u2502 \u2502\n", + " \u25bc \u25bc \u25bc \u25bc\n", + " visualise synthesise simulate & generate\n", + " (this notebook, (Vivado 2024.1) verify handshakes tokens\n", + " `01_\u2026`) with cocotb (notebook `03_\u2026`\n", + " (notebook `02_\u2026`) runs on PYNQ)\n", + "```\n", + "\n", + "## Tutorials in this set\n", + "\n", + "| # | Notebook | Runs locally? | Needs |\n", + "|---|---|---|---|\n", + "| 00 | This overview | n/a | nothing |\n", + "| 01 | `01_explore_weights.ipynb` | yes | `numpy`, `matplotlib` |\n", + "| 02 | `02_register_map_and_driver.ipynb` | yes (Python parts) | `numpy` |\n", + "| 03 | Cocotb simulation of the AXI wrapper | yes | `cocotb`, `verilator` (or `iverilog`) |\n", + "\n", + "(03 is documented in [`hw/sim/cocotb/README.md`](../hw/sim/cocotb/README.md) and\n", + "run directly with `make` rather than from a notebook.)\n", + "\n", + "## Reproducing the bitstream\n", + "\n", + "```bash\n", + "source ~/tools/Xilinx/Vivado/2024.1/settings64.sh\n", + "rm -rf hw/build && mkdir hw/build\n", + "vivado -mode batch -source hw/tcl/build.tcl\n", + "```\n", + "\n", + "Produces `overlays/microgpt.bit` and `.hwh` targeting\n", + "`xc7z010clg400-1`. Edit the `part` variable at the top of\n", + "`hw/tcl/build.tcl` to retarget for a `xc7z020` PYNQ-Z2 unit.\n", + "\n", + "## Register map (4 KB AXI4-Lite BAR)\n", + "\n", + "| Offset | Reg | RW | Purpose |\n", + "|--------|--------------|----|--------------------------------------------------|\n", + "| 0x000 | MAGIC | RO | `'MGRT'` = 0x4D475254 |\n", + "| 0x004 | VERSION | RO | 0x00020001 |\n", + "| 0x008 | CMD | WO | bit0=start pulse, bit1=clear pulse |\n", + "| 0x00C | STATUS | RO | ready / busy / done / error / toggle / pos / out_len |\n", + "| 0x010 | CONFIG | RW | temperature Q8.8 (hi 16b), max_gen (next 8b) |\n", + "| 0x014 | SEED | RW | RNG seed for the categorical sampler |\n", + "| 0x018 | LOGIT_INFO | RO | argmax token + last-sampled token + top-logit Q12 |\n", + "| 0x01C | BOS | RO | BOS_TOKEN (= 26) |\n", + "| 0x060+ | OUTPUT_MEM | RO | 16 generated tokens (low byte of each u32) |\n", + "| 0x100+ | LOGITS | RO | 27 sign-extended Q12 logits |\n", + "\n", + "(Full register map and bitfield notes live in\n", + "[`README.md`](../README.md#register-map).)\n", + "\n", + "## Going further\n", + "\n", + "- The current build targets `xc7z010clg400-1` to fit the smaller\n", + " PYNQ-Z2 die. The model fits comfortably; a larger model could\n", + " potentially run on the `xc7z020` die without restructuring.\n", + "- `sw/notebooks/throughput.ipynb` measures per-token latency on the\n", + " deployed board and shows the ~1.7\u00d7 speedup from the burst-readback\n", + " + UIO IRQ driver optimisations.\n", + "- The categorical sampler uses `xorshift32` for reproducibility:\n", + " `gpt.generate(seed=0xC0FFEE)` returns deterministic text.\n" + ], + "id": "c00" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.x" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/pynq/tutorials/01_explore_weights.ipynb b/pynq/tutorials/01_explore_weights.ipynb new file mode 100644 index 0000000..3ae3a3f --- /dev/null +++ b/pynq/tutorials/01_explore_weights.ipynb @@ -0,0 +1,157 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 01 \u2014 Explore the Q12 weights\n", + "\n", + "All 4,192 parameters of microgpt are baked into PL fabric at\n", + "synthesis time. They live as 16-bit fixed-point (Q12) values in\n", + "`hw/ip/*.hex` and are pulled into BRAM/LUTRAM via `$readmemh` in\n", + "`microgpt_exact_core_rom_init.svh`.\n", + "\n", + "This notebook loads the hex files, decodes them as Q12 fixed-point,\n", + "and renders each weight tensor so you can see the structure of the\n", + "**actual** model that ends up in the gates.\n" + ], + "id": "c00" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "IP_DIR = Path('../hw/ip').resolve()\n", + "\n", + "FRAC_BITS = 12 # Q12: 1 sign + 3 integer + 12 fractional bits \u2192 range [-8, 8)\n", + "\n", + "WEIGHTS = [\n", + " ('wte_q12.hex', 'WTE \u2014 token embedding', (27, 16)),\n", + " ('wpe_q12.hex', 'WPE \u2014 positional embedding', (16, 16)),\n", + " ('layer0_attn_wq_q12.hex', 'W_Q \u2014 attention query', (16, 16)),\n", + " ('layer0_attn_wk_q12.hex', 'W_K \u2014 attention key', (16, 16)),\n", + " ('layer0_attn_wv_q12.hex', 'W_V \u2014 attention value', (16, 16)),\n", + " ('layer0_attn_wo_q12.hex', 'W_O \u2014 attention output', (16, 16)),\n", + " ('layer0_mlp_fc1_q12.hex', 'FC1 \u2014 MLP up-projection', (16, 64)),\n", + " ('layer0_mlp_fc2_q12.hex', 'FC2 \u2014 MLP down-projection', (64, 16)),\n", + " ('lm_head_q12.hex', 'LM head \u2014 logits projection', (16, 27)),\n", + "]\n" + ], + "id": "c01" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Decoder\n", + "\n", + "Each `.hex` line is a 16-bit two's-complement word. We convert to a\n", + "Python `int`, take the signed value, and divide by `2**FRAC_BITS` to\n", + "get the real-valued weight.\n" + ], + "id": "c02" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def load_q12_hex(path, shape):\n", + " raw = np.array([int(l.strip(), 16) for l in path.read_text().splitlines() if l.strip()], dtype=np.uint16)\n", + " signed = raw.astype(np.int32)\n", + " signed[signed >= 0x8000] -= 0x10000 # two's-complement \u2192 signed\n", + " fp = signed.astype(np.float64) / (1 << FRAC_BITS)\n", + " assert fp.size == shape[0] * shape[1], f'{path.name}: expected {shape}, got {fp.size}'\n", + " return fp.reshape(shape)\n", + "\n", + "tensors = {label: load_q12_hex(IP_DIR / fname, shape) for fname, label, shape in WEIGHTS}\n", + "total_params = sum(t.size for t in tensors.values())\n", + "print(f'Loaded {len(tensors)} tensors \u00b7 {total_params} parameters total')\n" + ], + "id": "c03" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Visualise\n", + "\n", + "Heatmap each tensor with a diverging colormap centred at zero.\n" + ], + "id": "c04" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(3, 3, figsize=(12, 11), constrained_layout=True)\n", + "for ax, (label, w) in zip(axes.flat, tensors.items()):\n", + " vmax = float(np.abs(w).max()) or 1e-9\n", + " ax.imshow(w, cmap='RdBu_r', vmin=-vmax, vmax=vmax, aspect='auto', interpolation='nearest')\n", + " ax.set_title(f'{label}\\n{w.shape} \u00b7 |w|\u2264{vmax:.2f} \u00b7 \u03c3={w.std():.3f}', fontsize=9)\n", + " ax.set_axis_off()\n", + "plt.show()\n" + ], + "id": "c05" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What you should see\n", + "\n", + "- **WTE / WPE** have visible per-token / per-position structure.\n", + "- **W_Q, W_K, W_V** are 16\u00d716 matrices that the systolic matvec\n", + " tile multiplies the embedded token by, every step.\n", + "- **FC1 (16\u219264)** and **FC2 (64\u219216)** are the MLP block.\n", + "- **LM head (16\u219227)** projects to vocabulary logits.\n", + "\n", + "Every one of these values is baked into LUTRAM / BRAM at synth time \u2014\n", + "there is no DRAM behind any of it. If you change a weight, you have\n", + "to rebuild the bitstream.\n", + "\n", + "## Sanity check vs the build artefacts\n" + ], + "id": "c06" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Histogram of the LM-head weights \u2014 should be roughly centred\n", + "import numpy as np\n", + "lm = tensors['LM head \u2014 logits projection']\n", + "fig, ax = plt.subplots(figsize=(6, 3))\n", + "ax.hist(lm.flatten(), bins=40, color='steelblue', edgecolor='black')\n", + "ax.set_xlabel('Q12 weight value'); ax.set_ylabel('count')\n", + "ax.set_title(f'lm_head weights \u00b7 n={lm.size} \u00b7 \u03c3={lm.std():.3f}')\n", + "plt.show()\n" + ], + "id": "c07" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.x" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/pynq/tutorials/02_register_map_and_driver.ipynb b/pynq/tutorials/02_register_map_and_driver.ipynb new file mode 100644 index 0000000..9cfd9f2 --- /dev/null +++ b/pynq/tutorials/02_register_map_and_driver.ipynb @@ -0,0 +1,178 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 02 \u2014 Register map and driver\n", + "\n", + "This notebook is mostly informational: it walks through the\n", + "AXI4-Lite register map, decodes a STATUS word, and shows the\n", + "driver's hot path. The actual `MicroGPT` driver requires PYNQ +\n", + "the bitstream loaded on a real PYNQ-Z2 board, so the generate()\n", + "calls are guarded \u2014 they will work on the board, not on a dev\n", + "laptop.\n" + ], + "id": "c00" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## STATUS register decoder\n", + "\n", + "The STATUS register at offset `0x00C` packs several fields into a\n", + "single 32-bit word. The exact layout lives in\n", + "`hw/src/top/microgpt_pynq_top.sv`. Here is a small decoder you can\n", + "run on the raw u32 value to see the meaning at a glance.\n" + ], + "id": "c01" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def decode_status(u32):\n", + " return {\n", + " 'ready': bool(u32 & (1 << 0)),\n", + " 'busy': bool(u32 & (1 << 1)),\n", + " 'done': bool(u32 & (1 << 2)),\n", + " 'error': bool(u32 & (1 << 3)),\n", + " 'toggle': bool(u32 & (1 << 4)),\n", + " 'direct_mode': bool(u32 & (1 << 5)),\n", + " 'out_len': (u32 >> 16) & 0xFF,\n", + " 'pos': (u32 >> 24) & 0xFF,\n", + " }\n", + "\n", + "# Example: an idle, post-reset status\n", + "decode_status(0x0000_0001)\n" + ], + "id": "c02" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: 'busy and generating, position 3, no errors yet'\n", + "decode_status(0x0300_0002)\n" + ], + "id": "c03" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Driver hot path\n", + "\n", + "The deployed driver caches a `uint32` view over the MMIO array once\n", + "per instance, then does direct burst reads. Here is the relevant\n", + "snippet from `sw/drivers/microgpt.py`:\n", + "\n", + "```python\n", + "self._u32 = np.asarray(self.mmio.array, dtype=np.uint32)\n", + "...\n", + "# tight wait loop, time-check every 4096 spins (not every spin)\n", + "spins = 0\n", + "while True:\n", + " if self._u32[A_STATUS >> 2] & (1 << ST_DONE_BIT):\n", + " break\n", + " spins += 1\n", + " if spins & 0xFFF == 0 and time.perf_counter() - t0 > self._timeout_s:\n", + " raise TimeoutError('done bit never set')\n", + "\n", + "# burst-read 16 generated tokens with one numpy view\n", + "tokens = (self._u32[A_OUTPUT_MEM >> 2 : (A_OUTPUT_MEM >> 2) + n] & 0xFF).tolist()\n", + "```\n", + "\n", + "The `mmio.array` view + masking pattern is roughly **1.7\u00d7 faster**\n", + "than the obvious `[mmio.read(addr + i*4) for i in range(n)]` loop\n", + "because each `mmio.read()` does a fresh attribute lookup and a\n", + "Python-level branch. See `sw/notebooks/throughput.ipynb` for the\n", + "measurement on the deployed board.\n" + ], + "id": "c04" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## IRQ fast path\n", + "\n", + "An optional `use_irq=True` flag opens `/dev/uio` (PL fabric IRQ)\n", + "and replaces the spin loop with a blocking `os.read`:\n", + "\n", + "```python\n", + "import os, struct\n", + "self._uio_fd = os.open(f'/dev/uio{fabric_uio_index}', os.O_RDWR)\n", + "...\n", + "# enable the IRQ once\n", + "os.write(self._uio_fd, struct.pack(' dict: + return { + "cell_type": "markdown", + "metadata": {}, + "source": [l + "\n" for l in "\n".join(lines).splitlines()], + } + + +def code(*lines: str) -> dict: + return { + "cell_type": "code", + "execution_count": None, + "metadata": {}, + "outputs": [], + "source": [l + "\n" for l in "\n".join(lines).splitlines()], + } + + +def notebook(*cells: dict) -> dict: + cells_with_ids = [] + for i, c in enumerate(cells): + # Stable per-position ids satisfy nbformat ≥ 4.5 without random churn. + cells_with_ids.append({**c, "id": f"c{i:02d}"}) + return { + "cells": cells_with_ids, + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3", + }, + "language_info": {"name": "python", "version": "3.x"}, + }, + "nbformat": 4, + "nbformat_minor": 5, + } + + +# ----------------------------------------------------------------------------- +# 00 — Overview +# ----------------------------------------------------------------------------- + +n00 = notebook( + md( + "# 00 — Overview", + "", + "**microgpt on PYNQ-Z2: a char-level GPT baked entirely into PL fabric.**", + "", + "This tutorial set walks the workflow end-to-end: from looking at the", + "weights, to running the AXI wrapper in cocotb, to using the deployed", + "overlay from Python over `pynq.MMIO`.", + "", + "## What is special about this overlay", + "", + "Most embedded transformer inference paths trade fabric for DRAM", + "streaming: weights live off-chip in DDR, and the FPGA runs one matmul", + "at a time. That gives flexibility at the cost of latency and host", + "orchestration complexity.", + "", + "microgpt takes the opposite extreme: every one of the **4,192 INT16", + "(Q12) parameters** is hardcoded into LUTRAM / BRAM / constants at", + "**synthesis time**. There is no DDR, no DMA, no host-side inference", + "loop. The PS pushes a prompt token through a single AXI4-Lite slave,", + "the PL runs the full forward pass in fabric, and the PS reads the", + "generated tokens back through the same slave.", + "", + "The exact-arithmetic GPT core is a port of", + "[`Luthiraa/TALOS-V2`](https://github.com/Luthiraa/TALOS-V2);", + "see [`UPSTREAM.md`](../UPSTREAM.md) for per-file attribution.", + "", + "## The four-stage loop", + "", + "```", + " .hex weights → Vivado build → AXI4-Lite slave → PS driver", + " (hw/ip/*.hex) (hw/tcl/build (microgpt_pynq_ (sw/drivers/", + " .tcl) top.sv) microgpt.py)", + " │ │ │ │", + " ▼ ▼ ▼ ▼", + " visualise synthesise simulate & generate", + " (this notebook, (Vivado 2024.1) verify handshakes tokens", + " `01_…`) with cocotb (notebook `03_…`", + " (notebook `02_…`) runs on PYNQ)", + "```", + "", + "## Tutorials in this set", + "", + "| # | Notebook | Runs locally? | Needs |", + "|---|---|---|---|", + "| 00 | This overview | n/a | nothing |", + "| 01 | `01_explore_weights.ipynb` | yes | `numpy`, `matplotlib` |", + "| 02 | `02_register_map_and_driver.ipynb` | yes (Python parts) | `numpy` |", + "| 03 | Cocotb simulation of the AXI wrapper | yes | `cocotb`, `verilator` (or `iverilog`) |", + "", + "(03 is documented in [`hw/sim/cocotb/README.md`](../hw/sim/cocotb/README.md) and", + "run directly with `make` rather than from a notebook.)", + "", + "## Reproducing the bitstream", + "", + "```bash", + "source ~/tools/Xilinx/Vivado/2024.1/settings64.sh", + "rm -rf hw/build && mkdir hw/build", + "vivado -mode batch -source hw/tcl/build.tcl", + "```", + "", + "Produces `overlays/microgpt.bit` and `.hwh` targeting", + "`xc7z010clg400-1`. Edit the `part` variable at the top of", + "`hw/tcl/build.tcl` to retarget for a `xc7z020` PYNQ-Z2 unit.", + "", + "## Register map (4 KB AXI4-Lite BAR)", + "", + "| Offset | Reg | RW | Purpose |", + "|--------|--------------|----|--------------------------------------------------|", + "| 0x000 | MAGIC | RO | `'MGRT'` = 0x4D475254 |", + "| 0x004 | VERSION | RO | 0x00020001 |", + "| 0x008 | CMD | WO | bit0=start pulse, bit1=clear pulse |", + "| 0x00C | STATUS | RO | ready / busy / done / error / toggle / pos / out_len |", + "| 0x010 | CONFIG | RW | temperature Q8.8 (hi 16b), max_gen (next 8b) |", + "| 0x014 | SEED | RW | RNG seed for the categorical sampler |", + "| 0x018 | LOGIT_INFO | RO | argmax token + last-sampled token + top-logit Q12 |", + "| 0x01C | BOS | RO | BOS_TOKEN (= 26) |", + "| 0x060+ | OUTPUT_MEM | RO | 16 generated tokens (low byte of each u32) |", + "| 0x100+ | LOGITS | RO | 27 sign-extended Q12 logits |", + "", + "(Full register map and bitfield notes live in", + "[`README.md`](../README.md#register-map).)", + "", + "## Going further", + "", + "- The current build targets `xc7z010clg400-1` to fit the smaller", + " PYNQ-Z2 die. The model fits comfortably; a larger model could", + " potentially run on the `xc7z020` die without restructuring.", + "- `sw/notebooks/throughput.ipynb` measures per-token latency on the", + " deployed board and shows the ~1.7× speedup from the burst-readback", + " + UIO IRQ driver optimisations.", + "- The categorical sampler uses `xorshift32` for reproducibility:", + " `gpt.generate(seed=0xC0FFEE)` returns deterministic text.", + ), +) + + +# ----------------------------------------------------------------------------- +# 01 — Explore weights +# ----------------------------------------------------------------------------- + +n01 = notebook( + md( + "# 01 — Explore the Q12 weights", + "", + "All 4,192 parameters of microgpt are baked into PL fabric at", + "synthesis time. They live as 16-bit fixed-point (Q12) values in", + "`hw/ip/*.hex` and are pulled into BRAM/LUTRAM via `$readmemh` in", + "`microgpt_exact_core_rom_init.svh`.", + "", + "This notebook loads the hex files, decodes them as Q12 fixed-point,", + "and renders each weight tensor so you can see the structure of the", + "**actual** model that ends up in the gates.", + ), + code( + "from pathlib import Path", + "import numpy as np", + "import matplotlib.pyplot as plt", + "", + "IP_DIR = Path('../hw/ip').resolve()", + "", + "FRAC_BITS = 12 # Q12: 1 sign + 3 integer + 12 fractional bits → range [-8, 8)", + "", + "WEIGHTS = [", + " ('wte_q12.hex', 'WTE — token embedding', (27, 16)),", + " ('wpe_q12.hex', 'WPE — positional embedding', (16, 16)),", + " ('layer0_attn_wq_q12.hex', 'W_Q — attention query', (16, 16)),", + " ('layer0_attn_wk_q12.hex', 'W_K — attention key', (16, 16)),", + " ('layer0_attn_wv_q12.hex', 'W_V — attention value', (16, 16)),", + " ('layer0_attn_wo_q12.hex', 'W_O — attention output', (16, 16)),", + " ('layer0_mlp_fc1_q12.hex', 'FC1 — MLP up-projection', (16, 64)),", + " ('layer0_mlp_fc2_q12.hex', 'FC2 — MLP down-projection', (64, 16)),", + " ('lm_head_q12.hex', 'LM head — logits projection', (16, 27)),", + "]", + ), + md( + "## Decoder", + "", + "Each `.hex` line is a 16-bit two's-complement word. We convert to a", + "Python `int`, take the signed value, and divide by `2**FRAC_BITS` to", + "get the real-valued weight.", + ), + code( + "def load_q12_hex(path, shape):", + " raw = np.array([int(l.strip(), 16) for l in path.read_text().splitlines() if l.strip()], dtype=np.uint16)", + " signed = raw.astype(np.int32)", + " signed[signed >= 0x8000] -= 0x10000 # two's-complement → signed", + " fp = signed.astype(np.float64) / (1 << FRAC_BITS)", + " assert fp.size == shape[0] * shape[1], f'{path.name}: expected {shape}, got {fp.size}'", + " return fp.reshape(shape)", + "", + "tensors = {label: load_q12_hex(IP_DIR / fname, shape) for fname, label, shape in WEIGHTS}", + "total_params = sum(t.size for t in tensors.values())", + "print(f'Loaded {len(tensors)} tensors · {total_params} parameters total')", + ), + md("## Visualise", "", "Heatmap each tensor with a diverging colormap centred at zero."), + code( + "fig, axes = plt.subplots(3, 3, figsize=(12, 11), constrained_layout=True)", + "for ax, (label, w) in zip(axes.flat, tensors.items()):", + " vmax = float(np.abs(w).max()) or 1e-9", + " ax.imshow(w, cmap='RdBu_r', vmin=-vmax, vmax=vmax, aspect='auto', interpolation='nearest')", + " ax.set_title(f'{label}\\n{w.shape} · |w|≤{vmax:.2f} · σ={w.std():.3f}', fontsize=9)", + " ax.set_axis_off()", + "plt.show()", + ), + md( + "## What you should see", + "", + "- **WTE / WPE** have visible per-token / per-position structure.", + "- **W_Q, W_K, W_V** are 16×16 matrices that the systolic matvec", + " tile multiplies the embedded token by, every step.", + "- **FC1 (16→64)** and **FC2 (64→16)** are the MLP block.", + "- **LM head (16→27)** projects to vocabulary logits.", + "", + "Every one of these values is baked into LUTRAM / BRAM at synth time —", + "there is no DRAM behind any of it. If you change a weight, you have", + "to rebuild the bitstream.", + "", + "## Sanity check vs the build artefacts", + ), + code( + "# Histogram of the LM-head weights — should be roughly centred", + "import numpy as np", + "lm = tensors['LM head — logits projection']", + "fig, ax = plt.subplots(figsize=(6, 3))", + "ax.hist(lm.flatten(), bins=40, color='steelblue', edgecolor='black')", + "ax.set_xlabel('Q12 weight value'); ax.set_ylabel('count')", + "ax.set_title(f'lm_head weights · n={lm.size} · σ={lm.std():.3f}')", + "plt.show()", + ), +) + + +# ----------------------------------------------------------------------------- +# 02 — Register map and driver +# ----------------------------------------------------------------------------- + +n02 = notebook( + md( + "# 02 — Register map and driver", + "", + "This notebook is mostly informational: it walks through the", + "AXI4-Lite register map, decodes a STATUS word, and shows the", + "driver's hot path. The actual `MicroGPT` driver requires PYNQ +", + "the bitstream loaded on a real PYNQ-Z2 board, so the generate()", + "calls are guarded — they will work on the board, not on a dev", + "laptop.", + ), + md( + "## STATUS register decoder", + "", + "The STATUS register at offset `0x00C` packs several fields into a", + "single 32-bit word. The exact layout lives in", + "`hw/src/top/microgpt_pynq_top.sv`. Here is a small decoder you can", + "run on the raw u32 value to see the meaning at a glance.", + ), + code( + "def decode_status(u32):", + " return {", + " 'ready': bool(u32 & (1 << 0)),", + " 'busy': bool(u32 & (1 << 1)),", + " 'done': bool(u32 & (1 << 2)),", + " 'error': bool(u32 & (1 << 3)),", + " 'toggle': bool(u32 & (1 << 4)),", + " 'direct_mode': bool(u32 & (1 << 5)),", + " 'out_len': (u32 >> 16) & 0xFF,", + " 'pos': (u32 >> 24) & 0xFF,", + " }", + "", + "# Example: an idle, post-reset status", + "decode_status(0x0000_0001)", + ), + code( + "# Example: 'busy and generating, position 3, no errors yet'", + "decode_status(0x0300_0002)", + ), + md( + "## Driver hot path", + "", + "The deployed driver caches a `uint32` view over the MMIO array once", + "per instance, then does direct burst reads. Here is the relevant", + "snippet from `sw/drivers/microgpt.py`:", + "", + "```python", + "self._u32 = np.asarray(self.mmio.array, dtype=np.uint32)", + "...", + "# tight wait loop, time-check every 4096 spins (not every spin)", + "spins = 0", + "while True:", + " if self._u32[A_STATUS >> 2] & (1 << ST_DONE_BIT):", + " break", + " spins += 1", + " if spins & 0xFFF == 0 and time.perf_counter() - t0 > self._timeout_s:", + " raise TimeoutError('done bit never set')", + "", + "# burst-read 16 generated tokens with one numpy view", + "tokens = (self._u32[A_OUTPUT_MEM >> 2 : (A_OUTPUT_MEM >> 2) + n] & 0xFF).tolist()", + "```", + "", + "The `mmio.array` view + masking pattern is roughly **1.7× faster**", + "than the obvious `[mmio.read(addr + i*4) for i in range(n)]` loop", + "because each `mmio.read()` does a fresh attribute lookup and a", + "Python-level branch. See `sw/notebooks/throughput.ipynb` for the", + "measurement on the deployed board.", + ), + md( + "## IRQ fast path", + "", + "An optional `use_irq=True` flag opens `/dev/uio` (PL fabric IRQ)", + "and replaces the spin loop with a blocking `os.read`:", + "", + "```python", + "import os, struct", + "self._uio_fd = os.open(f'/dev/uio{fabric_uio_index}', os.O_RDWR)", + "...", + "# enable the IRQ once", + "os.write(self._uio_fd, struct.pack(' None: + pairs = [ + ("00_overview.ipynb", n00), + ("01_explore_weights.ipynb", n01), + ("02_register_map_and_driver.ipynb", n02), + ] + for name, nb in pairs: + (OUT / name).write_text(json.dumps(nb, indent=1)) + print(f"wrote {name}") + + +if __name__ == "__main__": + main()