nanoParadigm

Four language modeling paradigms, one codebase.

Train, sample, and compare Autoregressive (GPT), Masked Diffusion (MDLM), Flow Matching, and Mamba (SSM) models on the same data with the same interface.

No GPUs required. Pure PyTorch. Runs on CPU.

Why

There is no clean, unified comparison of these four fundamentally different approaches to language modeling. Papers typically compare models within a paradigm, not across paradigms.

Paradigm	Generation	Attention	Key property
AR (GPT)	Left-to-right, one token at a time	Causal	Strong sequential modeling
Diffusion (MDLM)	All tokens at once, iteratively denoised	Bidirectional	Native infilling, parallel generation
Flow Matching	Random tokens refined into text	Bidirectional	Time-conditioned, no mask tokens
Mamba (SSM)	Left-to-right, one token at a time	None (state space)	Linear-time, no attention

Quick Start

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Train all four (2000 steps each)
python train.py --model ar --max_steps 2000 --eval_interval 500
python train.py --model diffusion --max_steps 2000 --eval_interval 500
python train.py --model flow --max_steps 2000 --eval_interval 500
python train.py --model mamba --max_steps 2000 --eval_interval 500 --n_embd 64 --d_state 8 --n_layer 2

# Sample
python sample.py --model ar --checkpoint out/ar/best.pt --prompt "ROMEO:"
python sample.py --model diffusion --checkpoint out/diffusion/best.pt --length 128
python sample.py --model flow --checkpoint out/flow/best.pt --length 128
python sample.py --model mamba --checkpoint out/mamba/best.pt --prompt "ROMEO:"

# Diffusion infilling
python sample.py --model diffusion --checkpoint out/diffusion/best.pt \
    --infill --prefix "To be" --suffix "that is the question" --fill_length 30

# Evaluate all
python eval.py

# Interactive demo
python demo.py

Datasets and Tokenizers

# Default: character-level Shakespeare
python train.py --model ar --max_steps 2000

# WikiText-2 (13M chars of Wikipedia)
python train.py --model ar --dataset wikitext --max_steps 2000

# BPE tokenizer (trained from scratch on the corpus)
python train.py --model ar --dataset shakespeare --tokenizer bpe --bpe_vocab_size 500
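
For readers unfamiliar with BPE, the idea behind "trained from scratch on the corpus" can be sketched in a few lines. This is a minimal illustration, not the code in data.py: the function name `train_bpe` and its return values are hypothetical, and a real tokenizer would also handle word boundaries and encoding of unseen text.

```python
from collections import Counter

def train_bpe(text, vocab_size):
    """Learn BPE merges over a character sequence (illustrative sketch only)."""
    tokens = list(text)
    base = len(set(tokens))  # the initial vocabulary is the set of characters
    merges = []
    while base + len(merges) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is worth merging
        merges.append((a, b))
        # Replace every non-overlapping occurrence of (a, b), left to right.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, tokens

merges, tokens = train_bpe("abababab", vocab_size=4)
print(merges, tokens)
```

Each merge adds one vocabulary entry, so a target such as --bpe_vocab_size 500 corresponds to roughly 500 minus the number of distinct characters merges.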

CPU Training Times

Benchmarked on an Apple M4 Pro (12 cores, 24 GB), with PyTorch using 4 threads.

Model	Config	Params	Speed	2000 steps
AR	default (n_embd=128, n_layer=4)	797K	~9 steps/s	~4 min
Diffusion	default (n_embd=128, n_layer=4)	806K	~9 steps/s	~4 min
Flow	default (n_embd=128, n_layer=4)	830K	~9 steps/s	~4 min
Mamba	small (n_embd=64, n_layer=2, d_state=8)	63K	~2 steps/s	~17 min

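Mamba is slower on CPU, despite having far fewer parameters, because the SSM scan has a sequential dependency across time steps that pure PyTorch on CPU cannot hide. On GPU with custom CUDA kernels (not included here) it would likely be competitive.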
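
The "2000 steps" column follows directly from the measured throughput. A small helper (the name `minutes_for` is illustrative) makes it easy to estimate other run lengths, assuming the steps/s figure stays roughly constant over the run:

```python
def minutes_for(steps, steps_per_sec):
    """Estimated wall-clock minutes for a run at a constant throughput."""
    return steps / steps_per_sec / 60

print(round(minutes_for(2000, 9), 1))  # transformer-based models in the table
print(round(minutes_for(2000, 2), 1))  # small Mamba config in the table
```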

Files

models/
    ar.py                Autoregressive Transformer (GPT-style)
    diffusion.py         Masked Diffusion Language Model (MDLM-style)
    flow.py              Discrete Flow Matching
    mamba.py             Mamba SSM (pure PyTorch, no custom CUDA)
data.py                  Tokenizers (char + BPE) + datasets (Shakespeare + WikiText-2)
train.py                 Unified training loop
sample.py                Text generation for all paradigms
eval.py                  Head-to-head benchmarks
demo.py                  Gradio interactive demo
notebooks/
    visualize.ipynb      Training curves, denoising visualization, comparison charts

Models

AR Transformer (models/ar.py)

Causal transformer following nanoGPT. CausalSelfAttention with is_causal=True, weight tying, top-k sampling.
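
Top-k sampling restricts each step to the k most likely next tokens before sampling. The sketch below uses only the standard library so it runs without PyTorch; the function name `sample_top_k` and the `temperature` parameter are assumptions for illustration and need not match sample.py.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token index from the k highest logits (illustrative sketch)."""
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Softmax over the surviving logits, shifted by the max for stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 1.5]
print(sample_top_k(logits, k=2))  # always index 1 or index 3
```

With k=1 this reduces to greedy decoding; larger k trades determinism for diversity.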

Diffusion Transformer (models/diffusion.py)

Masked Diffusion LM following MDLM (Sahoo et al. 2024). SUBS parameterization, log-linear noise schedule, iterative unmasking, native infilling.
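
The masking and unmasking steps can be sketched without a neural network. This is a toy illustration under stated assumptions, not the code in models/diffusion.py: `MASK`, `mask_prob`, `forward_mask`, and `unmask` are hypothetical names, the schedule is assumed to be sigma(t) = -log(1 - (1 - eps) * t), and `denoise` stands in for the trained model.

```python
import math
import random

MASK = -1  # hypothetical mask token id

def mask_prob(t, eps=1e-3):
    """Probability that a token is masked at time t under the assumed schedule."""
    sigma = -math.log1p(-(1.0 - eps) * t)
    return 1.0 - math.exp(-sigma)  # equals (1 - eps) * t, i.e. linear in t

def forward_mask(tokens, t, rng):
    """Training-time corruption: mask each token independently."""
    p = mask_prob(t)
    return [MASK if rng.random() < p else tok for tok in tokens]

def unmask(seq, denoise, steps, rng):
    """Sampling: reveal masked positions while t goes from 1 down to 0."""
    seq = list(seq)
    for i in range(steps, 0, -1):
        t, s = i / steps, (i - 1) / steps
        guess = denoise(seq, t)  # one predicted token per position
        for pos, tok in enumerate(seq):
            # A masked token stays masked with probability s / t;
            # tokens that are already revealed are never changed.
            if tok == MASK and rng.random() >= s / t:
                seq[pos] = guess[pos]
    return seq

rng = random.Random(0)
toy_denoiser = lambda seq, t: [7] * len(seq)  # stand-in for the model
print(unmask([1, MASK, MASK, 2], toy_denoiser, steps=8, rng=rng))
```

Infilling falls out of the same loop: start from prefix tokens, then `MASK` repeated for the fill length, then suffix tokens, and only the masked span is generated.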

Flow Matching (models/flow.py)

Discrete Flow Matching following Campbell et al. 2024. Uniform random corruption (no mask tokens), sinusoidal time conditioning, cross-entropy training, iterative refinement sampling. Key difference from diffusion: corrupted positions hold ordinary vocabulary tokens rather than a mask symbol, so the model cannot see which positions are corrupted and must use the time signal to infer the corruption level.
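
A rough sketch of the two ingredients named above, uniform corruption and sinusoidal time conditioning. The names `corrupt_uniform` and `time_embedding` are hypothetical, and the time convention (t=0 is pure noise, t=1 is clean data) is an assumption; models/flow.py may differ.

```python
import math
import random

def corrupt_uniform(tokens, t, vocab_size, rng):
    """Keep each token with probability t, else draw a uniform random token.

    Assumed convention: t=0 is pure noise, t=1 is clean data. A replacement
    can coincide with the original token, so nothing marks a position as noisy.
    """
    return [tok if rng.random() < t else rng.randrange(vocab_size)
            for tok in tokens]

def time_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar time, transformer-style."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

rng = random.Random(0)
clean = [3, 1, 4, 1, 5]
print(corrupt_uniform(clean, 0.5, vocab_size=10, rng=rng))
print(len(time_embedding(0.5, dim=8)))
```

Because the corrupted sequence looks like ordinary text, the time embedding is the only direct signal of how much of the input to trust.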

Mamba SSM (models/mamba.py)

Selective State Space Model following Mamba (Gu & Dao 2023). Input-dependent selection, depthwise conv1d, SiLU gating, Blelloch parallel scan, RMSNorm, S4D-real initialization.
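
The scan computes the recurrence h_t = a_t * h_{t-1} + b_t, where in Mamba a_t and b_t depend on the input. A parallel scan is possible because composing two such steps is associative. The sketch below uses scalars and a simple doubling (Hillis-Steele style) scan rather than the work-efficient Blelloch variant; it is an illustration of the shared operator, not the code in models/mamba.py, and all function names are hypothetical.

```python
def combine(left, right):
    """Compose two steps of h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference implementation: one step at a time, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def doubling_scan(a, b):
    """Inclusive scan in about log2(n) rounds using the associative operator."""
    elems = list(zip(a, b))
    n, offset = len(elems), 1
    while offset < n:
        # Positions within a round are independent, so a round can run in parallel.
        elems = [elems[i] if i < offset else combine(elems[i - offset], elems[i])
                 for i in range(n)]
        offset *= 2
    # With h_0 = 0 the state is the accumulated b component.
    return [b_acc for _, b_acc in elems]

a = [0.5, 0.25, 2.0, 0.5, 1.0]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sequential_scan(a, b))
print(doubling_scan(a, b))
```

Both functions return the same states; the doubling version does more total work but needs only a logarithmic number of dependent rounds, which is what makes scans attractive on parallel hardware.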

Evaluation

eval.py compares models on:

Metric	What it measures
Val Loss / Perplexity	Language modeling quality
Generation Speed	Tokens/sec on CPU
Forward Accuracy	Given A, predict B
Reverse Accuracy	Given B, predict A (tests bidirectionality)
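
Perplexity is the exponential of the mean per-token loss, assuming the loss is a cross-entropy measured in nats. The helpers below are a sketch with hypothetical names, not functions from eval.py. Bits per character is included because per-token perplexities are not directly comparable between the character-level and BPE tokenizers.

```python
import math

def perplexity(mean_nll):
    """Perplexity from a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)

def bits_per_char(mean_nll, n_tokens, n_chars):
    """Tokenizer-independent loss: total nats spread over characters, in bits."""
    return mean_nll * n_tokens / n_chars / math.log(2)

print(perplexity(1.5))
print(bits_per_char(1.5, n_tokens=1000, n_chars=1000))
```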

Hyperparameters

python train.py --model ar \
    --n_layer 4 --n_head 4 --n_embd 128 \
    --block_size 128 --batch_size 32 \
    --epochs 5 --lr 3e-4 \
    --dropout 0.1 --grad_clip 1.0 \
    --max_steps 2000 --eval_interval 500

Mamba-specific: --d_state 16 --expand_factor 2 --d_conv 4

Diffusion/Flow-specific: --sampling_steps 64

References

nanoGPT - Karpathy's minimal GPT
MDLM - Masked Diffusion Language Models (Sahoo et al. 2024)
Discrete Flow Matching - Campbell et al. 2024
Mamba - Linear-Time Sequence Modeling (Gu & Dao 2023)
mamba.py - Pure PyTorch Mamba
Mercury - Diffusion LLMs at 1100+ tok/s (Inception Labs 2025)
