HOPE-tensorflow

Nested Learning — brain analogy and multi-frequency update
Figure 1 of the paper: the brain's uniform / reusable structure and multi-time-scale updates motivate Nested Learning.

A TensorFlow implementation of Nested Learning: The Illusion of Deep Learning Architectures (Behrouz, Razaviyayn, Zhong, Mirrokni; Google Research; NeurIPS 2025), arXiv:2512.24695.

Status: study log + educational re-implementation. The Self-Modifying layer is currently at linear-attention scale (paper Eq. 18), and CMS is a single outer-product memory (not the MLP chain of Eq. 70-71). See Limitations.

Local working directory is hope-architecture/; published repo and importable package are HOPE-tensorflow / hope.


Why this repo

HOPE vs Transformer backbone (paper Figure 5)
Figure 5 of the paper: HOPE's Self-Modifying Titans → multi-frequency FFN stack vs the standard Transformer Attention → FFN stack. This repo implements the left-hand side.

HOPE pairs the Nested Learning paradigm with a recurrent backbone: a self-modifying layer plus a Continuum Memory System (CMS) that updates memory banks at multiple frequencies. PyTorch reimplementations exist; this repo fills the TF / Keras gap and doubles as a study log.
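To make "updates memory banks at multiple frequencies" concrete, here is a minimal sketch that is not taken from hope/. It assumes each bank has an integer update period (one possible reading of the Quickstart value cms_banks=(1, 4)) and that a bank is written only on steps divisible by its period. The function name banks_due is hypothetical.

```python
def banks_due(step: int, periods: tuple[int, ...]) -> list[int]:
    """Return the indices of the banks that update at `step` (0-based)."""
    return [i for i, p in enumerate(periods) if step % p == 0]


# Assumed meaning: bank 0 updates every step, bank 1 every 4th step.
periods = (1, 4)
schedule = [banks_due(t, periods) for t in range(8)]
for t, due in enumerate(schedule):
    print(f"step {t}: update banks {due}")
```

Under this assumption the fast bank tracks recent tokens while the slow bank changes only occasionally, which is the intuition behind keeping long-range information alive.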

Every component in hope/ cites the corresponding paper equation / section number in its docstring. Notebook 01 maps each paper concept to a file.


Install

git clone https://github.com/rlaope/HOPE-tensorflow.git
cd HOPE-tensorflow
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
bash scripts/download_paper.sh        # arXiv 2512.24695 → papers/
python scripts/download_data.py       # TinyShakespeare → data/

Tested with Python 3.12 and TensorFlow 2.20. Should work on any TF ≥ 2.15 / Python ≥ 3.10.


Quickstart

import tensorflow as tf
from hope.model import HOPE
from hope.baseline import MiniTransformer

hope = HOPE(
    vocab_size=65,
    d_model=32,
    n_self_mod_layers=1,
    cms_banks=(1, 4),
    cms_decays=(0.01, 0.005),
    n_heads=2,
    max_seq_len=64,
)

# A MiniTransformer with the same parameter budget (+/- 5%):
baseline = MiniTransformer.matched_to(hope, tolerance=0.05)

x = tf.constant([[1, 2, 3, 4]], dtype=tf.int32)
print(hope(x).shape, baseline(x).shape)   # both (1, 4, 65)
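
The "+/- 5%" budget above can be read as a relative bound on parameter counts. The sketch below shows that reading only; how MiniTransformer.matched_to actually picks its sizes is not described here, and within_budget is a hypothetical helper.

```python
def within_budget(n_ref: int, n_other: int, tolerance: float = 0.05) -> bool:
    """True if n_other lies within +/- tolerance (relative) of n_ref."""
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    return abs(n_other - n_ref) <= tolerance * n_ref


# For Keras models the two counts would come from model.count_params().
ok_close = within_budget(100_000, 104_000)  # 4% apart
ok_far = within_budget(100_000, 106_000)    # 6% apart
print(ok_close, ok_far)
```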

Train

python scripts/train.py --model hope        --dataset tinyshakespeare \
    --steps 200 --seq-len 64 --batch-size 8 --d-model 64 --n-layers 1

python scripts/train.py --model transformer --dataset tinyshakespeare \
    --steps 200 --seq-len 64 --batch-size 8 --d-model 64 --n-layers 2

Both branches share the same Adam loop and print the parameter count at init, so the two models can be compared head-to-head under the same training settings and a comparable parameter budget.


Benchmark — HOPE vs MiniTransformer

python scripts/benchmark.py --scenario all --steps 50 --seq-len 64 --batch-size 4

Three scenarios, each emitting a PNG into assets/:

Long-context retrieval

A (key, value) pair is planted at the start of the sequence, and recall is queried near the end. CMS's claim is that long-range information survives.
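
A minimal sketch of how such an example could be built, assuming integer tokens and a query placed at the very last position. The layout and the name make_retrieval_example are illustrative assumptions, not the generator used by scripts/benchmark.py.

```python
import random


def make_retrieval_example(seq_len: int, vocab_size: int, seed: int = 0):
    """Build (tokens, target): key/value at the start, key repeated at the end."""
    if seq_len < 3 or vocab_size < 2:
        raise ValueError("need seq_len >= 3 and vocab_size >= 2")
    rng = random.Random(seed)
    key = rng.randrange(vocab_size)
    value = rng.randrange(vocab_size)
    # Filler tokens avoid the key so the final query is unambiguous.
    filler_choices = [t for t in range(vocab_size) if t != key]
    filler = [rng.choice(filler_choices) for _ in range(seq_len - 3)]
    tokens = [key, value] + filler + [key]  # the model should predict `value` next
    return tokens, value


tokens, target = make_retrieval_example(seq_len=64, vocab_size=8)
print(len(tokens), tokens[:2], tokens[-1], target)
```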

[Figure: longctx benchmark plot]

Continual LM (catastrophic forgetting)

Train on TinyShakespeare (domain A), then on random alphabet sequences (domain B), then re-measure cross-entropy on A. The plot reports both the raw before/after loss on A and the standard continual-learning metrics from Lopez-Paz & Ranzato 2017: BWT (Backward Transfer; closer to 0 = less forgetting, in loss-space) and ACC (mean final-checkpoint loss across A and B; lower = better).
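
The two metrics can be sketched as follows, using the Lopez-Paz & Ranzato definitions applied to losses instead of accuracies. In loss-space a positive BWT means the earlier task got worse (forgetting). The numbers are illustrative only, not results from this repo, and continual_metrics is a hypothetical helper.

```python
def continual_metrics(loss):
    """loss[i][j] = loss on task j after training on task i (T x T, T >= 2)."""
    T = len(loss)
    final = loss[T - 1]
    # ACC: mean final-checkpoint loss over all tasks.
    acc = sum(final) / T
    # BWT: mean change on earlier tasks between "just trained" and "final".
    bwt = sum(final[j] - loss[j][j] for j in range(T - 1)) / (T - 1)
    return bwt, acc


# Illustrative numbers only.
loss = [
    [1.5, 3.0],  # after training on A: loss on A, loss on B
    [2.0, 1.0],  # after training on B: loss on A, loss on B
]
bwt, acc = continual_metrics(loss)
print(f"BWT={bwt:.2f}  ACC={acc:.2f}")
```

With two tasks, BWT reduces to the after-minus-before loss on A, which is the same quantity the raw before/after bars show.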

[Figure: continual benchmark plot]

In-context adaptation

The prompt contains k examples of a random character substitution, and the model is asked to apply the same substitution to a query. This scenario probes the self-modifying layer.
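
A minimal sketch of such a prompt, assuming lowercase letters, a "source>target" demonstration format, and a query drawn from the demonstrated characters so the answer is recoverable from context. The format and the name make_substitution_prompt are assumptions for illustration, not the benchmark's actual encoding.

```python
import random
import string


def make_substitution_prompt(k: int, seed: int = 0):
    """Return (prompt, answer) for k demonstrations of a random substitution."""
    if not 1 <= k <= 26:
        raise ValueError("k must be between 1 and 26")
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(letters, shuffled))  # random substitution table
    demos = rng.sample(letters, k)          # k distinct demonstrated characters
    query = rng.choice(demos)               # the query was seen in context
    prompt = " ".join(f"{c}>{mapping[c]}" for c in demos) + f" {query}>"
    return prompt, mapping[query]


prompt, answer = make_substitution_prompt(k=4)
print(prompt, "->", answer)
```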

[Figure: incontext benchmark plot]

These plots use tiny models and tiny training budgets — the shape of the comparison is the takeaway, not the absolute numbers.


Limitations

  • Self-Modifying layer is implemented as linear attention with a Hebbian fast-weight update (paper Eq. 18), NOT the full Self-Referential Titans of paper §8.1 / Eq. 94-97.
  • CMS is a single dim×dim outer-product memory, NOT the MLP chain of paper §7.1 / Eq. 70-71. Nested / Sequential / Head-wise CMS variants are not implemented.
  • M3 (Multi-scale Momentum Muon) optimizer from paper §7.2 is not implemented.
  • DGD / DeepOptimizer classes exist in hope/optimizers.py but scripts/train.py uses Adam — they are reference/study implementations, not currently wired into training.
  • Benchmarks use tiny vocab/seq (d_model=32, vocab=8) and TinyShakespeare only. No RULER / BABILong / WikiText / CLINC evaluation.
  • Educational scope — see "Hardware" section.
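
As a reading aid for the first two bullets, the sketch below shows a generic Hebbian fast-weight memory and checks that it matches causal linear attention without softmax. It is written in NumPy under textbook assumptions (no decay, no normalization) and is not the code in hope/ or the exact notation of paper Eq. 18. Both function names are hypothetical.

```python
import numpy as np


def hebbian_fast_weights(q, k, v):
    """Recurrent form: M_t = M_{t-1} + v_t k_t^T, y_t = M_t q_t."""
    T, d = k.shape
    M = np.zeros((v.shape[1], d))       # single outer-product memory
    out = np.zeros_like(v)
    for t in range(T):
        M = M + np.outer(v[t], k[t])    # Hebbian write of the (key, value) pair
        out[t] = M @ q[t]               # read with the current query
    return out


def causal_linear_attention(q, k, v):
    """Parallel form: y_t = sum over s <= t of (q_t . k_s) v_s."""
    scores = np.tril(q @ k.T)           # lower triangle keeps only s <= t
    return scores @ v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
same = np.allclose(hebbian_fast_weights(q, k, v), causal_linear_attention(q, k, v))
print(same)
```

The equivalence is why a per-token Hebbian update is described as linear-attention scale: the memory is one matrix, and reading it is one matrix-vector product.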

Notebooks

| # | Topic |
|---|---|
| 01 | Paper overview + map of paper concepts to repo modules |
| 02 | AssociativeMemory (Hebbian / Delta / Oja) |
| 03 | ContinuumMemorySystem (multi-frequency banks) |
| 04 | SelfModifyingLayer (per-token fast weight) |
| 05 | Full HOPE model + a training loop |
| 06 | Long-context retrieval scenario |
| 07 | Continual LM forgetting scenario |

Run them all in one shot:

jupyter nbconvert --to notebook --execute --inplace notebooks/*.ipynb

Roadmap

| Phase | What | Status |
|---|---|---|
| 0 | Repo scaffolding, paper downloader, first push | done |
| 1 | AssociativeMemory, SelfModifyingLayer + tests | done |
| 2 | ContinuumMemorySystem + visualization notebooks | done |
| 3 | HOPE model assembly (LM head included) | done |
| 4 | MiniTransformer baseline + DGD / DeepOptimizer + scripts/train.py + char-level loaders | done |
| 5 | Three-scenario benchmark + assets/*.png | done |
| 6 | Documentation polish, notebook 01, final push | done |
| 7+ | Paper-faithful pass: Self-Mod Eq. 94-97, MLP-chain CMS (Eq. 70-71), M3 optimizer, standard benches (RULER / BABILong) | not started (see Limitations) |

Hardware

A single GPU is enough. The recommended minimum is a Colab T4 or a local GPU with 8 GB+ VRAM. The repo also runs on CPU for smoke tests (pytest -v exercises a CPU-only path).

HOPE-tensorflow deliberately stays at nanoGPT scale (a few million to tens of millions of parameters). No multi-GPU, no XLA tricks, no custom CUDA, no LLM-scale training.


Paper

Behrouz, A., Razaviyayn, M., Zhong, P., Mirrokni, V. Nested Learning: The Illusion of Deep Learning Architectures. NeurIPS 2025.

arXiv:2512.24695 | Blog | local PDF: bash scripts/download_paper.sh

@inproceedings{Behrouz2025NestedLearning,
  title     = {Nested Learning: The Illusion of Deep Learning Architectures},
  author    = {Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peilin and Mirrokni, Vahab},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://arxiv.org/abs/2512.24695}
}

License

MIT. See LICENSE.
