Figure 1 of the paper: the brain's uniform / reusable structure and multi-time-scale updates motivate Nested Learning.
A TensorFlow implementation of Nested Learning: The Illusion of Deep Learning Architectures (Behrouz, Razaviyayn, Zhong, Mirrokni; Google Research; NeurIPS 2025). arXiv:2512.24695
Status: study log + educational re-implementation. Self-Modifying layer is currently linear-attention scale (paper Eq. 18); CMS is a single outer-product memory (not the MLP-chain of Eq. 70-71). See Limitations.
The local working directory is hope-architecture/; the published repo is HOPE-tensorflow and the importable package is hope.
Figure 5 of the paper: HOPE's Self-Modifying Titans → multi-frequency FFN stack vs the standard Transformer Attention → FFN stack. This repo implements the left-hand side.
HOPE pairs the Nested Learning paradigm with a recurrent backbone: a self-modifying layer plus a Continuum Memory System (CMS) that updates memory banks at multiple frequencies. PyTorch reimplementations exist; this repo fills the TF / Keras gap and doubles as a study log.
Every component in hope/ cites the corresponding paper equation / section number in its docstring. Notebook 01 maps each paper concept to a file.
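To make "updates memory banks at multiple frequencies" concrete, here is a minimal sketch in plain Python. It assumes that each bank has an update period (a bank with period p is written only on every p-th step) and that a decay shrinks the old contents on each write. Reading `cms_banks` as periods, and the helper names `bank_update_steps` and `decayed_write`, are assumptions made for illustration; they are not the API of hope/.

```python
def bank_update_steps(periods, n_steps):
    """Return, for each bank period, the steps at which that bank is written.

    Assumption: a bank with period p is updated when step % p == 0.
    """
    return {p: [t for t in range(n_steps) if t % p == 0] for p in periods}


def decayed_write(memory, update, decay):
    """One hypothetical bank write: shrink the old memory, then add the update."""
    return [(1.0 - decay) * m + u for m, u in zip(memory, update)]


# A fast bank (every step) next to a slow bank (every 4th step).
schedule = bank_update_steps(periods=(1, 4), n_steps=8)
for period, steps in schedule.items():
    print(f"period {period}: written at steps {steps}")
```

Under these assumptions the slow bank sees fewer writes, so whatever it stores is overwritten less often than the contents of the fast bank.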
git clone https://github.com/rlaope/HOPE-tensorflow.git
cd HOPE-tensorflow
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
bash scripts/download_paper.sh # arXiv 2512.24695 → papers/
python scripts/download_data.py # TinyShakespeare → data/

Tested with Python 3.12 and TensorFlow 2.20. Should work on any TF ≥ 2.15 / Python ≥ 3.10.
import tensorflow as tf
from hope.model import HOPE
from hope.baseline import MiniTransformer
hope = HOPE(
vocab_size=65,
d_model=32,
n_self_mod_layers=1,
cms_banks=(1, 4),
cms_decays=(0.01, 0.005),
n_heads=2,
max_seq_len=64,
)
# A MiniTransformer with the same parameter budget (+/- 5%):
baseline = MiniTransformer.matched_to(hope, tolerance=0.05)
x = tf.constant([[1, 2, 3, 4]], dtype=tf.int32)
print(hope(x).shape, baseline(x).shape) # both (1, 4, 65)

python scripts/train.py --model hope --dataset tinyshakespeare \
    --steps 200 --seq-len 64 --batch-size 8 --d-model 64 --n-layers 1
python scripts/train.py --model transformer --dataset tinyshakespeare \
    --steps 200 --seq-len 64 --batch-size 8 --d-model 64 --n-layers 2

Both branches share the same Adam loop and print the parameter count at init, so the two models can be compared head-to-head on equal compute.
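The "+/- 5%" budget used by `matched_to(hope, tolerance=0.05)` can be read as a relative difference between two parameter counts. The sketch below shows that check on plain integers; `within_budget` is a hypothetical helper written for this note, and how `matched_to` actually searches for a matching configuration is not shown here.

```python
def within_budget(n_params_ref, n_params_other, tolerance=0.05):
    """True if the other model's parameter count is within `tolerance`
    (as a fraction of the reference count) of the reference model."""
    if n_params_ref <= 0:
        raise ValueError("reference parameter count must be positive")
    return abs(n_params_other - n_params_ref) / n_params_ref <= tolerance


print(within_budget(100_000, 104_000))  # 4% apart
print(within_budget(100_000, 110_000))  # 10% apart
```

For Keras models, the two counts could come from `model.count_params()` once each model has been built.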
python scripts/benchmark.py --scenario all --steps 50 --seq-len 64 --batch-size 4

Three scenarios, each emitting a PNG into assets/:
A (key, value) pair is planted at the start of the sequence, and recall is queried near the end. The claim under test is that CMS lets long-range information survive.
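One way to picture a sample for this retrieval scenario is the sketch below. The layout (key, value, filler, then the key again as the query) and the helper `make_retrieval_sample` are assumptions for illustration; the actual format used by scripts/benchmark.py may differ.

```python
import random


def make_retrieval_sample(seq_len, vocab_size, rng):
    """Build one token sequence with a planted (key, value) pair.

    Assumed layout: [key, value, filler ..., key] with target = value.
    """
    if seq_len < 4 or vocab_size < 3:
        raise ValueError("need seq_len >= 4 and vocab_size >= 3")
    key, value = rng.sample(range(vocab_size), 2)
    # Filler never contains the key, so the final query is unambiguous.
    filler_tokens = [t for t in range(vocab_size) if t != key]
    filler = [rng.choice(filler_tokens) for _ in range(seq_len - 3)]
    tokens = [key, value] + filler + [key]
    return tokens, value


tokens, target = make_retrieval_sample(seq_len=16, vocab_size=8, rng=random.Random(0))
print(tokens, "->", target)
```

The longer the filler, the more steps the planted pair has to survive before the query arrives.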
Train on TinyShakespeare (domain A), then on random alphabet sequences (domain B), then re-measure cross-entropy on A. The plot reports both the raw before/after loss on A and the standard continual-learning metrics from Lopez-Paz & Ranzato 2017: BWT (Backward Transfer; closer to 0 = less forgetting, in loss-space) and ACC (mean final-checkpoint loss across A and B; lower = better).
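For reference, Lopez-Paz & Ranzato define both metrics from a matrix R, where R[i][j] is the performance on task j after training through task i: ACC averages the last row, and BWT averages R[T-1][j] - R[j][j] over all tasks except the last. The sketch below applies those formulas to losses rather than accuracies. The sign convention (positive BWT means the loss on an earlier task went up) is an assumption here, and the numbers are made up for illustration, not results from this repo.

```python
def bwt_and_acc(R):
    """Compute (BWT, ACC) from a T x T matrix of losses.

    R[i][j] = loss on task j measured after training through task i.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    acc = sum(R[T - 1][j] for j in range(T)) / T
    bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
    return bwt, acc


# Illustrative two-task case: A = task 0, B = task 1 (made-up losses).
R = [
    [1.5, 3.0],  # after training on A: loss on A, loss on B
    [2.0, 1.0],  # after training on B: loss on A, loss on B
]
bwt, acc = bwt_and_acc(R)
print(f"BWT={bwt:.2f}  ACC={acc:.2f}")
```

With two tasks, BWT reduces to the change in loss on A between the two checkpoints, which is the same before/after comparison shown in the raw plot.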
The prompt contains k examples of a random character substitution, and the model is asked to apply the same substitution to a query. This scenario probes the self-modifying layer.
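A sample for this substitution scenario could be generated roughly as follows. The helper `make_substitution_prompt`, the lowercase alphabet, and the choice to draw the query from the k shown characters (so the answer is recoverable from the prompt) are all assumptions for illustration, not the benchmark's actual data format.

```python
import random
import string


def make_substitution_prompt(k, rng, alphabet=string.ascii_lowercase):
    """Return (examples, query, answer) for one random character substitution.

    The mapping is a random permutation of `alphabet`. k (source, target)
    pairs are revealed as in-context examples; the query reuses one source.
    """
    letters = list(alphabet)
    if not 1 <= k <= len(letters):
        raise ValueError("k must be between 1 and len(alphabet)")
    shuffled = letters[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(letters, shuffled))
    shown = rng.sample(letters, k)
    examples = [(c, mapping[c]) for c in shown]
    query = rng.choice(shown)
    return examples, query, mapping[query]


examples, query, answer = make_substitution_prompt(k=4, rng=random.Random(0))
print(examples, "| query:", query, "| expected:", answer)
```

Because the mapping is redrawn for every prompt, the model cannot memorize it in its slow weights and has to pick it up from the k examples.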
These plots use tiny models and tiny training budgets — the shape of the comparison is the takeaway, not the absolute numbers.
- Self-Modifying layer is implemented as linear attention with a Hebbian fast-weight update (paper Eq. 18), NOT the full Self-Referential Titans of paper §8.1 / Eq. 94-97.
- CMS is a single dim×dim outer-product memory, NOT the MLP chain of paper §7.1 / Eq. 70-71. Nested / Sequential / Head-wise CMS variants are not implemented.
- M3 (Multi-scale Momentum Muon) optimizer from paper §7.2 is not implemented.
- DGD / DeepOptimizer classes exist in hope/optimizers.py but scripts/train.py uses Adam. They are reference/study implementations, not currently wired into training.
- Benchmarks use tiny vocab/seq (d_model=32, vocab=8) and TinyShakespeare only. No RULER / BABILong / WikiText / CLINC evaluation.
- Educational scope: see "Hardware" section.
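For readers unfamiliar with the simplified form named in the first two bullets, the sketch below shows the textbook version of linear attention with a Hebbian fast-weight update: a single matrix M accumulates outer products of values and keys and is read with the current query. It is written in NumPy for brevity and is a generic illustration; it has not been checked term by term against paper Eq. 18 or the code in hope/, and the function name is hypothetical.

```python
import numpy as np


def hebbian_linear_attention(q, k, v):
    """Causal, unnormalized linear attention via a fast-weight matrix.

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    Per token: M <- M + outer(v_t, k_t), then y_t = M @ q_t.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    M = np.zeros((d_v, d_k))  # the outer-product memory
    out = np.empty((T, d_v))
    for t in range(T):
        M += np.outer(v[t], k[t])  # Hebbian write
        out[t] = M @ q[t]          # read with the current query
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 3)) for _ in range(3))
y = hebbian_linear_attention(q, k, v)

# Same result from the explicit causal sum: y_t = sum over s <= t of (q_t . k_s) v_s
scores = np.tril(q @ k.T)  # scores[t, s] = q_t . k_s, zeroed for s > t
assert np.allclose(y, scores @ v)
print(y.shape)
```

The final check confirms that the recurrent fast-weight view and the masked attention view compute the same outputs, which is why this layer counts as linear attention.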
| # | Topic |
|---|---|
| 01 | Paper overview + map of paper concepts to repo modules |
| 02 | AssociativeMemory (Hebbian / Delta / Oja) |
| 03 | ContinuumMemorySystem (multi-frequency banks) |
| 04 | SelfModifyingLayer (per-token fast weight) |
| 05 | Full HOPE model + a training loop |
| 06 | Long-context retrieval scenario |
| 07 | Continual LM forgetting scenario |
Run them all in one shot:
jupyter nbconvert --to notebook --execute --inplace notebooks/*.ipynb

| Phase | What | Status |
|---|---|---|
| 0 | Repo scaffolding, paper downloader, first push | done |
| 1 | AssociativeMemory, SelfModifyingLayer + tests | done |
| 2 | ContinuumMemorySystem + visualization notebooks | done |
| 3 | HOPE model assembly (LM head included) | done |
| 4 | MiniTransformer baseline + DGD / DeepOptimizer + scripts/train.py + char-level loaders | done |
| 5 | Three-scenario benchmark + assets/*.png | done |
| 6 | Documentation polish, notebook 01, final push | done |
| 7+ | Paper-faithful pass: Self-Mod Eq. 94-97, MLP-chain CMS (Eq. 70-71), M3 optimizer, standard benches (RULER / BABILong) | not started — see Limitations |
Single GPU. Minimum Colab T4 / 8 GB+ local VRAM recommended. The repo also runs on CPU for smoke tests (pytest -v exercises a CPU-only path).
hope-tensorflow deliberately stays at nanoGPT scale (a few million to tens of millions of parameters). No multi-GPU, no XLA tricks, no custom CUDA, no LLM-scale training.
Behrouz, A., Razaviyayn, M., Zhong, P., Mirrokni, V. Nested Learning: The Illusion of Deep Learning Architectures. NeurIPS 2025.
arXiv:2512.24695 — Blog — local PDF: bash scripts/download_paper.sh
@inproceedings{Behrouz2025NestedLearning,
title = {Nested Learning: The Illusion of Deep Learning Architectures},
author = {Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peilin and Mirrokni, Vahab},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://arxiv.org/abs/2512.24695}
}

MIT. See LICENSE.


