Four language modeling paradigms, one codebase.
Train, sample, and compare Autoregressive (GPT), Masked Diffusion (MDLM), Flow Matching, and Mamba (SSM) models on the same data with the same interface.
No GPUs required. Pure PyTorch. Runs on CPU.
No clean, unified comparison of these four fundamentally different approaches to language modeling exists. Papers compare within paradigms, not across them.
| Paradigm | Generation | Attention | Key property |
|---|---|---|---|
| AR (GPT) | Left-to-right, one token at a time | Causal | Strong sequential modeling |
| Diffusion (MDLM) | All tokens at once, iteratively denoised | Bidirectional | Native infilling, parallel generation |
| Flow Matching | Random tokens refined into text | Bidirectional | Time-conditioned, no mask tokens |
| Mamba (SSM) | Left-to-right, one token at a time | None (state space) | Linear-time, no attention |
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Train all four (2000 steps each)
python train.py --model ar --max_steps 2000 --eval_interval 500
python train.py --model diffusion --max_steps 2000 --eval_interval 500
python train.py --model flow --max_steps 2000 --eval_interval 500
python train.py --model mamba --max_steps 2000 --eval_interval 500 --n_embd 64 --d_state 8 --n_layer 2
# Sample
python sample.py --model ar --checkpoint out/ar/best.pt --prompt "ROMEO:"
python sample.py --model diffusion --checkpoint out/diffusion/best.pt --length 128
python sample.py --model flow --checkpoint out/flow/best.pt --length 128
python sample.py --model mamba --checkpoint out/mamba/best.pt --prompt "ROMEO:"
# Diffusion infilling
python sample.py --model diffusion --checkpoint out/diffusion/best.pt \
--infill --prefix "To be" --suffix "that is the question" --fill_length 30
# Evaluate all
python eval.py
# Interactive demo
python demo.py
# Default: character-level Shakespeare
python train.py --model ar --max_steps 2000
# WikiText-2 (13M chars of Wikipedia)
python train.py --model ar --dataset wikitext --max_steps 2000
# BPE tokenizer (trained from scratch on the corpus)
python train.py --model ar --dataset shakespeare --tokenizer bpe --bpe_vocab_size 500
Benchmarked on Apple M4 Pro (12 cores, 24 GB). PyTorch uses 4 threads.
| Model | Config | Params | Speed | Time for 2000 steps |
|---|---|---|---|---|
| AR | default (n_embd=128, n_layer=4) | 797K | ~9 steps/s | ~4 min |
| Diffusion | default (n_embd=128, n_layer=4) | 806K | ~9 steps/s | ~4 min |
| Flow | default (n_embd=128, n_layer=4) | 830K | ~9 steps/s | ~4 min |
| Mamba | small (n_embd=64, n_layer=2, d_state=8) | 63K | ~2 steps/s | ~17 min |
Mamba is slower on CPU because the SSM scan has a sequential dependency along the sequence and gains little from CPU parallelism. On GPU with custom CUDA kernels (not included here), it would likely be competitive.
models/
  ar.py             Autoregressive Transformer (GPT-style)
  diffusion.py      Masked Diffusion Language Model (MDLM-style)
  flow.py           Discrete Flow Matching
  mamba.py          Mamba SSM (pure PyTorch, no custom CUDA)
data.py             Tokenizers (char + BPE) + datasets (Shakespeare + WikiText-2)
train.py            Unified training loop
sample.py           Text generation for all paradigms
eval.py             Head-to-head benchmarks
demo.py             Gradio interactive demo
notebooks/
  visualize.ipynb   Training curves, denoising visualization, comparison charts
Causal transformer following nanoGPT. CausalSelfAttention with is_causal=True, weight tying, top-k sampling.
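To make the top-k sampling step concrete, here is a minimal sketch in plain Python. It is not the code in `models/ar.py`; the function name `sample_top_k` and its parameters are hypothetical, and it works on a plain list of logits rather than a tensor.

```python
import math
import random


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token id from the k highest-scoring logits (hypothetical helper)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only, shifted by the max for stability.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy decoding; larger `k` trades determinism for diversity.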
Masked Diffusion LM following MDLM (Sahoo et al. 2024). Subs parameterization, LogLinear noise schedule, iterative unmasking, native infilling.
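The forward (noising) side of masked diffusion can be sketched in a few lines. This assumes the log-linear schedule takes the form sigma(t) = -log(1 - (1 - eps) * t), which makes the per-token masking probability linear in t. The names `MASK`, `mask_probability`, and `corrupt_with_mask` are hypothetical and not taken from `models/diffusion.py`.

```python
import random

MASK = -1  # hypothetical id standing in for the mask token


def mask_probability(t, eps=1e-3):
    """Chance that a token is masked at time t in [0, 1].

    Assumed schedule: sigma(t) = -log(1 - (1 - eps) * t), so
    1 - exp(-sigma(t)) simplifies to (1 - eps) * t.
    """
    return (1.0 - eps) * t


def corrupt_with_mask(tokens, t, rng=random):
    """Independently replace each token with MASK with probability mask_probability(t)."""
    p = mask_probability(t)
    return [MASK if rng.random() < p else tok for tok in tokens]
```

Sampling runs this in reverse: start from an all-`MASK` sequence and unmask positions over several steps, which is also why infilling comes for free (known tokens simply stay unmasked).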
Discrete Flow Matching following Campbell et al. 2024. Uniform random corruption (no mask tokens), sinusoidal time conditioning, cross-entropy training, iterative refinement sampling. Key difference from diffusion: the model sees corrupted real tokens and must use the time signal to know the corruption level.
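The contrast with masking is easiest to see in code. The sketch below is an illustration in plain Python, not the contents of `models/flow.py`. It assumes the convention that t = 0 is pure noise and t = 1 is clean data, and the names `corrupt_uniform` and `time_embedding` are hypothetical.

```python
import math
import random


def corrupt_uniform(tokens, t, vocab_size, rng=random):
    """Replace each token with a uniformly random id with probability 1 - t.

    Assumed convention: t = 0 is pure noise, t = 1 is clean data.
    """
    return [rng.randrange(vocab_size) if rng.random() >= t else tok
            for tok in tokens]


def time_embedding(t, dim):
    """Sinusoidal embedding of a scalar time t (dim assumed even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

Because a corrupted position holds an ordinary token id rather than a mask, nothing in the input marks it as noise; the time embedding is the only cue the model has about how much of the sequence to trust.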
Selective State Space Model following Mamba (Gu & Dao 2023). Input-dependent selection, depthwise conv1d, SiLU gating, Blelloch parallel scan, RMSNorm, S4D-real initialization.
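The recurrence at the heart of the scan is h_t = a_t * h_{t-1} + b_t, where a_t and b_t depend on the input. The sketch below uses scalars in plain Python and shows why the recurrence can be parallelized: composing two steps is associative. For brevity it uses a simple doubling scan rather than the Blelloch up-sweep/down-sweep named above, and all function names are hypothetical.

```python
def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying left first, then right."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def doubling_scan(a, b):
    """Inclusive scan in about log2(n) rounds; each round's combines are independent."""
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:
        # Every position combines with the element `step` places to its left.
        elems = [elems[i] if i < step else combine(elems[i - step], elems[i])
                 for i in range(n)]
        step *= 2
    # With h starting at 0, the accumulated offset b is the hidden state.
    return [h for _, h in elems]
```

On a CPU the list comprehension inside each round still runs serially, so this form only pays off on hardware that can execute a round's combines in parallel.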
eval.py compares models on:
| Metric | What it measures |
|---|---|
| Val Loss / Perplexity | Language modeling quality |
| Generation Speed | Tokens/sec on CPU |
| Forward Accuracy | Given A, predict B |
| Reverse Accuracy | Given B, predict A (tests bidirectionality) |
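For reference, perplexity is the exponential of the mean cross-entropy loss. The helper below is a minimal sketch that assumes the loss is reported in nats per token; the name `perplexity` is illustrative and not necessarily what `eval.py` uses.

```python
import math


def perplexity(mean_nll):
    """Perplexity from a mean negative log-likelihood in nats per token."""
    return math.exp(mean_nll)
```

A model that guesses uniformly over a vocabulary of size V has loss log(V) and perplexity V, which gives a useful ceiling when reading the numbers.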
python train.py --model ar \
--n_layer 4 --n_head 4 --n_embd 128 \
--block_size 128 --batch_size 32 \
--epochs 5 --lr 3e-4 \
--dropout 0.1 --grad_clip 1.0 \
--max_steps 2000 --eval_interval 500
Mamba-specific: --d_state 16 --expand_factor 2 --d_conv 4
Diffusion/Flow-specific: --sampling_steps 64