Luthiraa · AbdullahAlNafisah · May 15, 2026
diff --git a/.gitignore b/.gitignore
@@ -29,3 +29,14 @@ incremental_db/
 *.sof
 *.sld
 *.jdi
+
+# PYNQ port (`pynq/` subdir) build artefacts
+pynq/hw/build/
+pynq/hw/sim/cocotb/sim_build/
+pynq/hw/sim/cocotb/dump.vcd
+pynq/hw/sim/cocotb/results.xml
+pynq/hw/sim/cocotb/cocotb_env/
+pynq/.Xil/
+pynq/NA/
+pynq/sw/notebooks/.ipynb_checkpoints/
+pynq/.venv/
diff --git a/NOTICE b/NOTICE
@@ -7,3 +7,29 @@ otherwise noted.
 This repository contains third-party reference material, tool-generated files,
 and vendor IP with separate ownership and licensing terms. See
 THIRD_PARTY_NOTICES.md for details.
+
+----------------------------------------------------------------------
+
+PYNQ-Z2 port (`pynq/` subdirectory)
+Copyright 2026 Abdullah Al-Nafisah
+
+The `pynq/` subdirectory adds a port of TALOS-V2 to the Xilinx PYNQ-Z2
+(Zynq-7020 XC7Z020CLG400-1). It is self-contained and does not modify
+the original Intel DE1-SoC flow at the repository root. Both flows can
+coexist; users with both boards can program either from this single
+repository.
+
+Contents of `pynq/`:
+
+  - Files byte-identical to upstream `rtl/` are redistributed under
+    Apache-2.0 (copies kept in `pynq/hw/src/core/` and `pynq/hw/ip/`
+    so the Vivado build is self-contained).
+  - One file modified from upstream — `pynq/hw/src/core/include/
+    microgpt_exact_core_rom_init.svh` — paths adjusted for Vivado's
+    `INCLUDE_DIRS` mechanism. Modifications fall under Apache-2.0 §4.
+  - New original work — AXI4-Lite wrapper (`pynq/hw/src/top/`),
+    Vivado batch build (`pynq/hw/tcl/`), cocotb regression suite
+    (`pynq/hw/sim/cocotb/`), Python PYNQ driver (`pynq/sw/`),
+    notebooks, demos, tutorials, and built bitstream artefacts —
+    licensed under BSD 3-Clause; see `pynq/LICENSE.original` and the
+    per-file attribution in `pynq/UPSTREAM.md`.
diff --git a/pynq/LICENSE.original b/pynq/LICENSE.original
@@ -0,0 +1,34 @@
+BSD 3-Clause License
+
+Copyright (c) 2026, Abdullah Al-Nafisah
+
+Files in this `pynq/` subdirectory that are NOT byte-identical to upstream
+TALOS-V2 (see `UPSTREAM.md` for the per-file attribution) are licensed
+under the BSD 3-Clause License below. Files that ARE byte-identical to
+upstream remain governed by Apache-2.0 (`../LICENSE`).
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its contributors
+   may be used to endorse or promote products derived from this software
+   without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
diff --git a/pynq/README.md b/pynq/README.md
@@ -0,0 +1,141 @@
+# TALOS-V2 — PYNQ-Z2 port
+
+Self-contained Xilinx **PYNQ-Z2** (Zynq-7020 XC7Z020CLG400-1) port of
+the [TALOS-V2](https://github.com/Luthiraa/TALOS-V2) "exact" microGPT
+SystemVerilog accelerator. The PL keeps the upstream
+`microgpt_exact_core` and its sub-blocks **byte-identical**; only the
+host bridge, clocking, I/O, and host-side tooling are rewritten for the
+Vivado / PYNQ flow. Per-file attribution lives in
+[`UPSTREAM.md`](UPSTREAM.md).
+
+The original Intel DE1-SoC flow at the [repository root](..) is
+unchanged. Both flows coexist in this repository so users with either
+board can build and run microGPT from a single clone.
+
+**Licensing:** files byte-identical to upstream are governed by
+Apache-2.0 (`../LICENSE`); original PYNQ-port additions are governed
+by BSD 3-Clause (`LICENSE.original`). See `../NOTICE`.
+
+## Directory layout
+
+```
+docs/                   Design notes + draft upstream license request
+demos/                  Pre-computed weight heatmaps for the portfolio site
+hw/
+  build/                Vivado project (gitignored output)
+  constraints/          pynq_z2.xdc (LD0..LD3 only)
+  ip/                   Q12 weight ROMs (.hex) — origin: TALOS-V2
+  src/
+    core/               Unmodified TALOS-V2 RTL + .svh includes
+    top/                microgpt_pynq_top.sv (AXI4-Lite wrapper — new)
+  sim/cocotb/           cocotb regression suite for the AXI wrapper
+  tcl/                  build.tcl (Vivado batch build — new)
+overlays/               microgpt.bit / microgpt.hwh land here
+sw/
+  drivers/              microgpt.py (pynq.MMIO driver — new)
+  notebooks/            demo.ipynb, hardware_advantage.ipynb, throughput.ipynb
+  tests/
+tutorials/              Three-notebook workflow walkthrough
+UPSTREAM.md             Per-file attribution (TALOS-V2 vs this fork)
+LICENSE_STATUS.md       Why this repo is not yet open-source-redistributable
+```
+
+## Tutorials
+
+Start with [`tutorials/00_overview.ipynb`](tutorials/00_overview.ipynb)
+for the workflow loop, then `01_explore_weights.ipynb` to visualise
+the Q12 ROMs, then `02_register_map_and_driver.ipynb` for the
+AXI4-Lite layout and driver hot path.
+
+## Quick-start
+
+### 1. Build the bitstream
+
+```bash
+# From the pynq/ directory (all paths below are relative to it)
+vivado -mode batch -source hw/tcl/build.tcl
+```
+
+This creates the Vivado project under `hw/build/`, runs synthesis and
+implementation, and copies `microgpt.bit` + `microgpt.hwh` into
+`overlays/`.
+
+### 2. Deploy to the PYNQ-Z2
+
+```bash
+scp overlays/microgpt.bit overlays/microgpt.hwh \
+    xilinx@<board-ip>:/home/xilinx/pynq/overlays/microgpt/
+scp -r sw/drivers sw/notebooks \
+    xilinx@<board-ip>:/home/xilinx/jupyter_notebooks/microgpt/
+```
+
+### 3. Run on the board
+
+Open Jupyter (`http://<board-ip>:9090`) and run
+`sw/notebooks/demo.ipynb`, or from a Python shell:
+
+```python
+from microgpt import MicroGPT
+gpt = MicroGPT()
+text, info = gpt.generate(max_tokens=8, temperature=1.0, seed=42)
+print(text, info["cycles"])
+```
+
+## Register map (AXI4-Lite slave at 0x4000_0000, 4 KB)
+
+| Offset  | RW | Field                                                         |
+|--------:|:--:|:--------------------------------------------------------------|
+| 0x000   | RO | Magic = `0x4D475254` ("MGRT")                                 |
+| 0x004   | RO | Version = `0x00020001`                                        |
+| 0x008   | WO | bit0 = start pulse, bit1 = clear pulse                        |
+| 0x00C   | RO | Status `{pos, out_len, 0, 0, direct_mode, toggle, error, done, busy, ready}` |
+| 0x010   | RW | Config `{temp_q8_8[31:16], max_gen[15:8], 0[7:0]}`            |
+| 0x014   | RW | RNG seed                                                      |
+| 0x018   | RO | `{top_logit_q12[31:16], argmax_token[15:8], last_token[7:0]}` |
+| 0x01C   | RO | BOS_TOKEN (`26`)                                              |
+| 0x020   | RW | Step config `{0, step_token, step_pos, step_clear, direct_mode}` |
+| 0x024   | WO | Step trigger pulse (bit0)                                     |
+| 0x028   | RO | heartbeat_reg snapshot (debug; zero-padded to 32b)            |
+| 0x060.. | RO | `output_mem[0..15]` -- 16 generated tokens                    |
+| 0x0D8   | RO | perf_cycles                                                   |
+| 0x0DC   | RO | tokens_per_sec                                                |
+| 0x100.. | RO | 27 sign-extended logits (Q12)                                 |
+
+PL LEDs LD0..LD3 expose `{heartbeat, busy, done, error}` (heartbeat moved
+to LD0 so it stays visible even on boards where LD3/M14 has a physical
+fault, as observed on the deployed PYNQ-Z2 unit).
+
+## Avalon-MM -> AXI4-Lite translation summary
+
+| DE1-SoC (Avalon-MM)                                | PYNQ-Z2 (AXI4-Lite)                              |
+|----------------------------------------------------|--------------------------------------------------|
+| `jtag_microgpt_bridge` master + `waitrequest`/`readdatavalid` handshakes | Standard AXI4-Lite slave on PS GP0 (`s_axi_*`).  |
+| 50 MHz `CLOCK_50` host domain + 56.25 MHz core PLL | Single domain `s_axi_aclk = FCLK_CLK0 = 50 MHz`. |
+| Toggle-bit triggers (`host_start_toggle_50` etc.) crossed via 2-FF synchronizers | 1-cycle `start_pulse` / `clear_pulse` / `step_pulse` decoded inline. |
+| `host_toggle_reg` flips on every JTAG read or write | `host_toggle_reg` flips on every successful AXI read or write. |
+| WSTRB / byte enables driven by JTAG bridge (4'b1111) | `s_axi_wstrb` accepted but ignored; aligned 32-bit writes only. |
+| Resets: `~SW[1] && pll_locked`                      | `s_axi_aresetn` from `proc_sys_reset` driven by `FCLK_RESET0_N`. |
+| Outputs: 10x LEDR + 6x HEX                         | 4x PL LEDs (LD0..LD3): `heartbeat`, `busy`, `done`, `error`. |
+| Weights via `$readmemh("generated/...hex", ...)`    | Weights live in `hw/ip/`; build.tcl adds it to `INCLUDE_DIRS` and `rom_init.svh` references bare filenames. |
+
+## Notes
+
+- The core RTL in `hw/src/core/` is unmodified. Of its three includes,
+  `microgpt_exact_core_params.svh` and `microgpt_exact_core_math.svh`
+  are also unmodified; `microgpt_exact_core_rom_init.svh` has adjusted
+  `$readmemh` paths. Parameters (e.g. `EMBED_DIM`, `VOCAB_SIZE`, `FRAC_BITS`) are unchanged from the DE1 build.
+- The build script targets `xc7z010clg400-1`, but the PYNQ-Z2 carries
+  the larger XC7Z020 (`xc7z020clg400-1`); set the `part` variable at
+  the top of `hw/tcl/build.tcl` to match before building.
+- All RTL uses 4-space indentation, no tabs.
+
+
+## Run the build with logs
+```bash
+mkdir -p hw/build
+vivado -mode batch -source hw/tcl/build.tcl \
+  -log hw/build/vivado_build.log \
+  -journal hw/build/vivado_build.jou \
+  2>&1 | tee hw/build/build_console.log
+```
+
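Review note on the register map in `pynq/README.md`: the Config word at offset 0x010 and the result word at 0x018 each pack several fields into 32 bits. The sketch below shows one way that packing could be expressed in pure Python. The helper names `pack_config` and `unpack_result` are hypothetical and are not taken from `sw/drivers/microgpt.py`. Treating `temp_q8_8` as unsigned Q8.8 and `top_logit_q12` as signed 16-bit Q12 is an assumption read off the field names, not something the diff confirms.

```python
"""Illustrative field packing for two registers from the README register map."""


def pack_config(temperature: float, max_gen: int) -> int:
    """Build the 0x010 Config word: {temp_q8_8[31:16], max_gen[15:8], 0[7:0]}."""
    temp_q8_8 = int(round(temperature * 256))  # assumed unsigned Q8.8
    if not 0 <= temp_q8_8 <= 0xFFFF:
        raise ValueError("temperature does not fit in 16-bit Q8.8")
    if not 0 <= max_gen <= 0xFF:
        raise ValueError("max_gen must fit in 8 bits")
    return (temp_q8_8 << 16) | (max_gen << 8)


def unpack_result(word: int) -> dict:
    """Split the 0x018 word: {top_logit_q12[31:16], argmax_token[15:8], last_token[7:0]}."""
    top_raw = (word >> 16) & 0xFFFF
    if top_raw >= 0x8000:  # assumed two's-complement, as for the Q12 logits
        top_raw -= 0x10000
    return {
        "top_logit": top_raw / 4096,  # Q12: 12 fractional bits
        "argmax_token": (word >> 8) & 0xFF,
        "last_token": word & 0xFF,
    }


if __name__ == "__main__":
    print(f"{pack_config(1.0, 8):#010x}")
    print(unpack_result(0xF0001A03))
```

Keeping the packing in small pure functions like these would let the field layout be unit-tested on a workstation, without a board or `pynq.MMIO`.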
diff --git a/pynq/UPSTREAM.md b/pynq/UPSTREAM.md
@@ -0,0 +1,45 @@
+# Upstream attribution
+
+This repository is a **port** of the SystemVerilog inference core from
+[`Luthiraa/TALOS-V2`](https://github.com/Luthiraa/TALOS-V2), an RTL
+implementation of a Karpathy-style microGPT for the Intel DE1-SoC
+(Cyclone V), to the **Xilinx PYNQ-Z2** (Zynq-7020 XC7Z020CLG400-1).
+
+All credit for the inference core RTL and the underlying numerical
+design (Q12 fixed-point, systolic matvec tile, processing-element
+array, RMS-norm + saturating-divider engines, categorical sampler)
+belongs to the upstream author(s) of TALOS-V2.
+
+This fork's contribution is the *host-side bridge to PYNQ*: an
+AXI4-Lite slave wrapper, a Vivado batch build, a cocotb regression
+suite for the wrapper, and a Python (`pynq.MMIO` + UIO IRQ) driver.
+
+## Per-subtree origin
+
+| Subtree in this repo                | Origin                  | Modifications                                                                                                                              |
+| ----------------------------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `hw/src/core/*.sv` (7 files)        | `Luthiraa/TALOS-V2/rtl/src/`         | **Unmodified, byte-identical** to upstream.                                                                                                |
+| `hw/src/core/include/microgpt_exact_core_math.svh` | `Luthiraa/TALOS-V2/rtl/src/include/` | **Unmodified, byte-identical** to upstream.                                                                                                |
+| `hw/src/core/include/microgpt_exact_core_params.svh` | `Luthiraa/TALOS-V2/rtl/src/include/` | **Unmodified, byte-identical** to upstream.                                                                                                |
+| `hw/src/core/include/microgpt_exact_core_rom_init.svh` | `Luthiraa/TALOS-V2/rtl/src/include/` | Modified: paths updated for Vivado `INCLUDE_DIRS` and bare-filename `$readmemh` references.                                                |
+| `hw/ip/*.hex` (9 weight ROMs)       | `Luthiraa/TALOS-V2/rtl/generated/`   | Unmodified Q12 fixed-point exports of the upstream-trained microGPT weights.                                                               |
+| `hw/src/top/microgpt_pynq_top.sv`   | **New (this fork)**     | AXI4-Lite slave wrapper exposing the upstream core via the Zynq PS GP0.                                                                    |
+| `hw/tcl/build.tcl`                  | **New (this fork)**     | Vivado batch build (Zynq + AXI Interconnect + top + constraints).                                                                          |
+| `hw/sim/cocotb/`                    | **New (this fork)**     | cocotb regression suite targeting `microgpt_pynq_top` (caught a production write-path bug pre-bitstream).                                  |
+| `sw/drivers/microgpt.py`            | **New (this fork)**     | Python MMIO driver, IRQ fast path via `/dev/uio<n>`.                                                                                       |
+| `sw/notebooks/*.ipynb`              | **New (this fork)**     | Demo, hardware-advantage, throughput notebooks for the deployed overlay.                                                                   |
+| `overlays/*.bit`, `*.hwh`           | **New (this fork)**     | Vivado-built artefacts targeting `xc7z010clg400-1` / `xc7z020clg400-1`.                                                                    |
+| `demos/build.py`                    | **New (this fork)**     | Weight-tensor heatmap renderer for the companion portfolio site.                                                                           |
+| `tutorials/`                        | **New (this fork)**     | Workflow walkthrough notebooks.                                                                                                            |
+
+## Conventions adopted
+
+- All upstream files retain their original headers, naming, parameter
+  values (`EMBED_DIM`, `VOCAB_SIZE`, `FRAC_BITS`, …), and behaviour.
+- The cocotb tests target only the **new** AXI wrapper (`hw/src/top/`);
+  upstream core behaviour is **not** retested here — that responsibility
+  remains with the upstream ModelSim testbenches.
+- No upstream file in `hw/src/core/` should be edited in this repo.
+  If the upstream core needs a fix, the fix belongs in upstream and
+  this repo pulls it in via a fresh copy + a noted update in this
+  file's modification log.
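Review note on `pynq/UPSTREAM.md`: the "byte-identical" claims in the per-subtree table can be checked mechanically by comparing file hashes between the upstream tree and the ported copies. A minimal sketch follows. The function names are hypothetical, and the directory arguments in the usage note assume a checkout where upstream RTL sits in `rtl/src/` as the table's Origin column suggests.

```python
"""Compare ported files against their upstream originals by SHA-256."""
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def diff_copies(upstream_dir: Path, port_dir: Path, pattern: str = "*.sv") -> list[str]:
    """Names of files in port_dir matching pattern that are missing upstream or differ."""
    mismatched = []
    for ported in sorted(port_dir.glob(pattern)):
        original = upstream_dir / ported.name
        if not original.is_file() or sha256_of(original) != sha256_of(ported):
            mismatched.append(ported.name)
    return mismatched
```

For example, `diff_copies(Path("rtl/src"), Path("pynq/hw/src/core"))` would be expected to return an empty list if the seven `.sv` files are unmodified, while running it with `pattern="*.svh"` on the `include/` directories should report only `microgpt_exact_core_rom_init.svh`.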
diff --git a/pynq/demos/build.py b/pynq/demos/build.py
@@ -0,0 +1,120 @@
+"""
+Build precomputed demo assets for the website project page.
+
+Renders each of the nine INT16 (Q12) weight tensors (4,192 values in total)
+baked into the PL fabric as a heatmap PNG, plus a metadata JSON consumed by the
+website's HeatmapViewer. No inference run — just a faithful visualisation of
+what's literally hardcoded into LUTRAM/BRAM at synthesis time.
+
+Refresh:
+
+    python3 demos/build.py
+    cp demos/out/* ../AbdullahAlNafisah.github.io/public/demos/pynq-microgpt/
+
+Pure numpy + matplotlib. Q12 sign convention: each value is a 16-bit
+two's-complement int read as Q12 fixed-point (one sign bit, 3 integer
+bits, 12 fractional bits → range [-8, 8)).
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import matplotlib
+
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+IP_DIR = REPO_ROOT / "hw" / "ip"
+OUT = Path(__file__).parent / "out"
+OUT.mkdir(parents=True, exist_ok=True)
+
+FRAC_BITS = 12               # matches hw/src/core/include/microgpt_exact_core_params.svh
+EMBED_DIM = 16
+VOCAB_SIZE = 27              # 26 letters + BOS sentinel
+MLP_DIM = 64
+SEQ_LEN = 16                 # wpe_q12.hex has 16*16 entries
+
+
+def load_q12_hex(path: Path, shape: tuple[int, int]) -> np.ndarray:
+    """Read a .hex file (one 16-bit hex word per line) into a float Q12 matrix."""
+    raw = np.array(
+        [int(line.strip(), 16) for line in path.read_text().splitlines() if line.strip()],
+        dtype=np.uint16,
+    )
+    signed = raw.astype(np.int32)
+    signed[signed >= 0x8000] -= 0x10000  # two's complement → signed int16 in int32 storage
+    fp = signed.astype(np.float64) / (1 << FRAC_BITS)
+    assert fp.size == shape[0] * shape[1], f"{path.name}: expected {shape}, got {fp.size} values"
+    return fp.reshape(shape)
+
+
+# (filename, label, shape, blurb)
+WEIGHTS: list[tuple[str, str, tuple[int, int], str]] = [
+    ("wte_q12.hex",            "WTE — token embedding",        (VOCAB_SIZE, EMBED_DIM), "27 tokens × 16-dim embedding"),
+    ("wpe_q12.hex",            "WPE — positional embedding",   (SEQ_LEN, EMBED_DIM),    "16 positions × 16-dim embedding"),
+    ("layer0_attn_wq_q12.hex", "W_Q — attention query",        (EMBED_DIM, EMBED_DIM),  "16 × 16"),
+    ("layer0_attn_wk_q12.hex", "W_K — attention key",          (EMBED_DIM, EMBED_DIM),  "16 × 16"),
+    ("layer0_attn_wv_q12.hex", "W_V — attention value",        (EMBED_DIM, EMBED_DIM),  "16 × 16"),
+    ("layer0_attn_wo_q12.hex", "W_O — attention output",       (EMBED_DIM, EMBED_DIM),  "16 × 16"),
+    ("layer0_mlp_fc1_q12.hex", "FC1 — MLP up-projection",      (EMBED_DIM, MLP_DIM),    "16 × 64"),
+    ("layer0_mlp_fc2_q12.hex", "FC2 — MLP down-projection",    (MLP_DIM, EMBED_DIM),    "64 × 16"),
+    ("lm_head_q12.hex",        "LM head — logits projection",  (EMBED_DIM, VOCAB_SIZE), "16 × 27"),
+]
+
+
+def render_heatmap(arr: np.ndarray, label: str, out_path: Path) -> None:
+    """Render a symmetric-around-zero heatmap with axes hidden; `label` is unused here (it goes into weights.json)."""
+    vmax = float(np.max(np.abs(arr))) or 1e-9
+    fig, ax = plt.subplots(figsize=(4.0, 4.0 * arr.shape[0] / max(arr.shape[1], 1)), dpi=200)
+    ax.imshow(arr, cmap="RdBu_r", vmin=-vmax, vmax=vmax, interpolation="nearest", aspect="auto")
+    ax.set_axis_off()
+    fig.subplots_adjust(left=0.02, right=0.98, top=0.98, bottom=0.02)
+    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
+    plt.close(fig)
+
+
+def main() -> None:
+    presets = []
+    total_params = 0
+    for fname, label, shape, blurb in WEIGHTS:
+        path = IP_DIR / fname
+        w = load_q12_hex(path, shape)
+        total_params += w.size
+        png_name = fname.replace("_q12.hex", ".png")
+        png_path = OUT / png_name
+        render_heatmap(w, label, png_path)
+        presets.append({
+            "name": label,
+            "image_url": f"/demos/pynq-microgpt/{png_name}",
+            "subtitle": (
+                f"{blurb} · {w.size} INT16 (Q12) values · "
+                f"|w| ≤ {float(np.max(np.abs(w))):.2f} · σ = {float(np.std(w)):.3f}"
+            ),
+        })
+
+    meta = {
+        "kind": "heatmap",
+        "title": (
+            f"microgpt fabric weights · char-level GPT · {EMBED_DIM}-dim · 1 block · "
+            f"{total_params} INT16 (Q12) params total"
+        ),
+        "image_label": "weight matrix (red = positive · blue = negative · scaled per matrix)",
+        "presets": presets,
+        "caption": (
+            "Each preset renders one of the nine weight tensors literally "
+            "hardcoded into PL fabric (LUTRAM / BRAM / constants) by the "
+            "synthesizer — no DRAM, no DMA. Values read straight from "
+            "hw/ip/*.hex and interpreted as 16-bit Q12 fixed-point. "
+            "Generated by demos/build.py."
+        ),
+    }
+    (OUT / "weights.json").write_text(json.dumps(meta, indent=2))
+    print(f"Wrote {len(presets)} presets ({total_params} params) to {OUT}")
+
+
+if __name__ == "__main__":
+    main()
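Review note on `load_q12_hex` in `pynq/demos/build.py`: the Q12 convention in the module docstring (one sign bit, 3 integer bits, 12 fractional bits) can be illustrated on single words. The sketch below applies the same two's-complement conversion to a scalar. The function name `q12_to_float` and the sample hex words are made up for illustration and are not values from `hw/ip/*.hex`.

```python
"""Scalar version of the Q12 decoding used in demos/build.py."""

FRAC_BITS = 12  # same value as in demos/build.py


def q12_to_float(word: int) -> float:
    """Interpret a 16-bit word as two's-complement Q12 fixed-point."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned word")
    signed = word - 0x10000 if word >= 0x8000 else word
    return signed / (1 << FRAC_BITS)


if __name__ == "__main__":
    # Hypothetical sample words covering both signs and both range limits.
    for text in ("0800", "1000", "f000", "7fff", "8000"):
        print(text, q12_to_float(int(text, 16)))
```

The extremes `0x8000` and `0x7fff` decode to -8.0 and 7.999755859375, which matches the stated range [-8, 8).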
diff --git a/pynq/demos/out/layer0_attn_wk.png b/pynq/demos/out/layer0_attn_wk.png
diff --git a/pynq/demos/out/layer0_attn_wo.png b/pynq/demos/out/layer0_attn_wo.png
diff --git a/pynq/demos/out/layer0_attn_wq.png b/pynq/demos/out/layer0_attn_wq.png
diff --git a/pynq/demos/out/layer0_attn_wv.png b/pynq/demos/out/layer0_attn_wv.png
diff --git a/pynq/demos/out/layer0_mlp_fc1.png b/pynq/demos/out/layer0_mlp_fc1.png
diff --git a/pynq/demos/out/layer0_mlp_fc2.png b/pynq/demos/out/layer0_mlp_fc2.png
diff --git a/pynq/demos/out/lm_head.png b/pynq/demos/out/lm_head.png
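Review note on the heatmaps under `pynq/demos/out/`: they are rendered from the `WEIGHTS` table in `demos/build.py`, so the parameter total reported in `weights.json` follows directly from the tensor shapes. As a cross-check, the sketch below recomputes that total. The shapes and dimension constants are copied from the script; the `SHAPES` dictionary itself is a hypothetical restatement of them.

```python
"""Recompute the total parameter count from the tensor shapes in demos/build.py."""
from math import prod

EMBED_DIM, VOCAB_SIZE, MLP_DIM, SEQ_LEN = 16, 27, 64, 16

SHAPES = {
    "wte": (VOCAB_SIZE, EMBED_DIM),
    "wpe": (SEQ_LEN, EMBED_DIM),
    "layer0_attn_wq": (EMBED_DIM, EMBED_DIM),
    "layer0_attn_wk": (EMBED_DIM, EMBED_DIM),
    "layer0_attn_wv": (EMBED_DIM, EMBED_DIM),
    "layer0_attn_wo": (EMBED_DIM, EMBED_DIM),
    "layer0_mlp_fc1": (EMBED_DIM, MLP_DIM),
    "layer0_mlp_fc2": (MLP_DIM, EMBED_DIM),
    "lm_head": (EMBED_DIM, VOCAB_SIZE),
}

# Each tensor contributes rows * cols 16-bit values.
total_params = sum(prod(shape) for shape in SHAPES.values())

if __name__ == "__main__":
    print(len(SHAPES), "tensors,", total_params, "parameters")
```

The result is nine tensors and 4,192 parameters, so the figure 4,192 refers to individual weights rather than to the number of tensors.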