jeffelin/nanoParadigm

nanoParadigm

Four language modeling paradigms, one codebase.

Train, sample, and compare Autoregressive (GPT), Masked Diffusion (MDLM), Flow Matching, and Mamba (SSM) models on the same data with the same interface.

No GPUs required. Pure PyTorch. Runs on CPU.

Why

No clean, unified comparison of these four fundamentally different approaches to language modeling exists. Papers compare within paradigms, not across them.

| Paradigm | Generation | Attention | Key property |
|---|---|---|---|
| AR (GPT) | Left-to-right, one token at a time | Causal | Strong sequential modeling |
| Diffusion (MDLM) | All tokens at once, iteratively denoised | Bidirectional | Native infilling, parallel generation |
| Flow Matching | Random tokens refined into text | Bidirectional | Time-conditioned, no mask tokens |
| Mamba (SSM) | Left-to-right, one token at a time | None (state space) | Linear-time, no attention |
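
The Attention column can be made concrete with a small sketch. This is plain Python for illustration only, not code from the repository; the function names are hypothetical. A causal mask lets position i see only positions j <= i, while a bidirectional mask lets every position see every other.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position."""
    return [[True] * n for _ in range(n)]


# Print the causal pattern for a length-4 sequence ("x" = allowed).
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```

The AR model uses the first pattern, the Diffusion and Flow models use the second, and Mamba uses neither because it has no attention layers.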

Quick Start

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Train all four (2000 steps each)
python train.py --model ar --max_steps 2000 --eval_interval 500
python train.py --model diffusion --max_steps 2000 --eval_interval 500
python train.py --model flow --max_steps 2000 --eval_interval 500
python train.py --model mamba --max_steps 2000 --eval_interval 500 --n_embd 64 --d_state 8 --n_layer 2

# Sample
python sample.py --model ar --checkpoint out/ar/best.pt --prompt "ROMEO:"
python sample.py --model diffusion --checkpoint out/diffusion/best.pt --length 128
python sample.py --model flow --checkpoint out/flow/best.pt --length 128
python sample.py --model mamba --checkpoint out/mamba/best.pt --prompt "ROMEO:"

# Diffusion infilling
python sample.py --model diffusion --checkpoint out/diffusion/best.pt \
    --infill --prefix "To be" --suffix "that is the question" --fill_length 30

# Evaluate all
python eval.py

# Interactive demo
python demo.py
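
As a rough picture of what the infilling command asks for: the prefix and suffix are fixed, and the positions between them start out masked. The sketch below assumes character-level tokens and uses a hypothetical `MASK` placeholder and function name; how sample.py actually builds its input is not shown in this README.

```python
MASK = "<mask>"  # hypothetical placeholder for the mask token


def build_infill_canvas(prefix, suffix, fill_length):
    """Return (tokens, frozen) for character-level infilling.

    tokens: prefix characters, fill_length mask placeholders, suffix characters.
    frozen: True where a position is given and should not be resampled.
    """
    tokens = list(prefix) + [MASK] * fill_length + list(suffix)
    frozen = [True] * len(prefix) + [False] * fill_length + [True] * len(suffix)
    return tokens, frozen


tokens, frozen = build_infill_canvas("To be", "that is the question", 30)
print(len(tokens), "positions,", sum(frozen), "fixed")
```

A denoising sampler would then only ever rewrite the positions where `frozen` is False.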

Datasets and Tokenizers

# Default: character-level Shakespeare
python train.py --model ar --max_steps 2000

# WikiText-2 (13M chars of Wikipedia)
python train.py --model ar --dataset wikitext --max_steps 2000

# BPE tokenizer (trained from scratch on the corpus)
python train.py --model ar --dataset shakespeare --tokenizer bpe --bpe_vocab_size 500
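
For readers unfamiliar with training BPE from scratch, here is a minimal sketch of the merge-learning loop in plain Python. It is an illustration of the general technique, not the implementation in data.py; the function name and stopping rule are assumptions.

```python
from collections import Counter


def train_bpe(text, vocab_size):
    """Learn BPE merges until the vocabulary reaches vocab_size or no pair repeats."""
    tokens = list(text)          # start from single characters
    vocab = set(tokens)
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                # no pair occurs often enough to merge
        merges.append((a, b))
        vocab.add(a + b)
        # Replace non-overlapping occurrences of (a, b), scanning left to right.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, tokens


merges, tokens = train_bpe("abababab", vocab_size=3)
print(merges, tokens)
```

With `--bpe_vocab_size 500`, a loop of this kind would keep merging until roughly 500 distinct tokens exist.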

CPU Training Times

Benchmarked on Apple M4 Pro (12 cores, 24GB). PyTorch uses 4 threads.

| Model | Config | Params | Speed | 2000 steps |
|---|---|---|---|---|
| AR | default (n_embd=128, n_layer=4) | 797K | ~9 steps/s | ~4 min |
| Diffusion | default (n_embd=128, n_layer=4) | 806K | ~9 steps/s | ~4 min |
| Flow | default (n_embd=128, n_layer=4) | 830K | ~9 steps/s | ~4 min |
| Mamba | small (n_embd=64, n_layer=2, d_state=8) | 63K | ~2 steps/s | ~17 min |

Mamba is slower on CPU because the SSM recurrence is inherently sequential across time steps, and this pure-PyTorch implementation has no fused kernel. On a GPU with custom CUDA kernels (not included here), it would likely be competitive.
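
The wall-clock figures follow directly from the measured speeds. The helper below is a hypothetical convenience for estimating other step counts; it assumes the speed stays constant over the run.

```python
def minutes_for(steps, steps_per_sec):
    """Estimated training time in minutes at a constant step rate."""
    return steps / steps_per_sec / 60


for name, rate in [("AR", 9), ("Diffusion", 9), ("Flow", 9), ("Mamba", 2)]:
    print(f"{name:10s} {minutes_for(2000, rate):5.1f} min")
```

At ~9 steps/s, 2000 steps take about 3.7 minutes; at ~2 steps/s they take about 16.7 minutes, which rounds to the ~4 min and ~17 min reported here.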

Files

models/
    ar.py                Autoregressive Transformer (GPT-style)
    diffusion.py         Masked Diffusion Language Model (MDLM-style)
    flow.py              Discrete Flow Matching
    mamba.py             Mamba SSM (pure PyTorch, no custom CUDA)
data.py                  Tokenizers (char + BPE) + datasets (Shakespeare + WikiText-2)
train.py                 Unified training loop
sample.py                Text generation for all paradigms
eval.py                  Head-to-head benchmarks
demo.py                  Gradio interactive demo
notebooks/
    visualize.ipynb      Training curves, denoising visualization, comparison charts

Models

AR Transformer (models/ar.py)

Causal transformer following nanoGPT. CausalSelfAttention with is_causal=True, weight tying, top-k sampling.
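
Top-k sampling restricts each draw to the k most likely next tokens. The sketch below shows the idea in plain Python; the function name, the `temperature` argument, and the `rng` parameter are assumptions for illustration, and models/ar.py may differ in detail.

```python
import math
import random


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample an index from the k highest logits after a softmax."""
    k = min(k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                                  # subtract max for stability
    weights = [math.exp(s - m) for s in scaled]      # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
print([sample_top_k([5.0, 1.0, 4.0, -2.0], k=2, rng=rng) for _ in range(10)])
```

With k=2 only indices 0 and 2 can ever be drawn here, since they hold the two largest logits.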

Diffusion Transformer (models/diffusion.py)

Masked Diffusion LM following MDLM (Sahoo et al. 2024). Subs parameterization, LogLinear noise schedule, iterative unmasking, native infilling.
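
A sketch of the forward (noising) process may help. It assumes the log-linear schedule takes the form sigma(t) = -log(1 - (1 - eps) t), which makes the per-token masking probability linear in t; the constant `EPS`, the `MASK_ID` value, and the function names are hypothetical and may not match models/diffusion.py.

```python
import math
import random

MASK_ID = -1   # hypothetical id for the mask token
EPS = 1e-3     # assumed small constant that keeps sigma finite at t = 1


def sigma(t):
    """Log-linear total noise at time t in [0, 1]."""
    return -math.log1p(-(1.0 - EPS) * t)


def mask_prob(t):
    """Probability that a token is masked at time t: 1 - exp(-sigma(t))."""
    return 1.0 - math.exp(-sigma(t))


def corrupt(tokens, t, rng):
    """Independently replace each token with MASK_ID with probability mask_prob(t)."""
    p = mask_prob(t)
    return [MASK_ID if rng.random() < p else tok for tok in tokens]


rng = random.Random(0)
print(mask_prob(0.5))
print(corrupt(list(range(10)), 0.5, rng))
```

Training would sample t, corrupt a batch this way, and ask the model to predict the original tokens at masked positions; sampling runs the process in reverse, unmasking a few positions per step.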

Flow Matching (models/flow.py)

Discrete Flow Matching following Campbell et al. 2024. Uniform random corruption (no mask tokens), sinusoidal time conditioning, cross-entropy training, iterative refinement sampling. Key difference from diffusion: the model sees corrupted real tokens and must use the time signal to know the corruption level.
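
The two ingredients named above can be sketched in plain Python. The convention that t = 1 means clean data and t = 0 means pure noise is an assumption, as are the function names; real implementations often also rescale t before embedding it.

```python
import math
import random


def time_embedding(t, dim):
    """Sinusoidal embedding of a scalar time t; dim must be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def corrupt_uniform(tokens, t, vocab_size, rng):
    """Keep each token with probability t; otherwise draw a uniform random token."""
    return [tok if rng.random() < t else rng.randrange(vocab_size) for tok in tokens]


rng = random.Random(0)
print(corrupt_uniform([3, 1, 4, 1, 5, 9, 2, 6], t=0.5, vocab_size=10, rng=rng))
print(time_embedding(0.5, 4))
```

Because a corrupted token looks like any other real token, the model cannot tell from the input alone how noisy it is, which is why the time embedding is passed in alongside it.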

Mamba SSM (models/mamba.py)

Selective State Space Model following Mamba (Gu & Dao 2023). Input-dependent selection, depthwise conv1d, SiLU gating, Blelloch parallel scan, RMSNorm, S4D-real initialization.
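
The recurrence behind the selective scan can be written as a sequential reference loop. This sketch covers a single channel with a diagonal state matrix, in plain Python; it is a simplified illustration rather than the code in models/mamba.py, and all names are assumptions.

```python
import math


def selective_scan(x, delta, A, B, C):
    """Sequential reference scan for one channel.

    x, delta: length-T lists (input and input-dependent step size).
    A: length-N list of negative floats (diagonal state matrix).
    B, C: length-T lists of length-N lists (input-dependent projections).
    Returns y with y[t] = sum_n C[t][n] * h[t][n].
    """
    n_state = len(A)
    h = [0.0] * n_state
    y = []
    for t in range(len(x)):
        for n in range(n_state):
            a_bar = math.exp(delta[t] * A[n])   # discretized decay
            b_bar = delta[t] * B[t][n]          # simplified (Euler) input term
            h[n] = a_bar * h[n] + b_bar * x[t]
        y.append(sum(C[t][n] * h[n] for n in range(n_state)))
    return y


N, T = 4, 3
A = [-(n + 1.0) for n in range(N)]              # S4D-real style: A_n = -(n + 1)
ones = [[1.0] * N for _ in range(T)]
print(selective_scan([1.0, 0.0, 0.0], [0.1] * T, A, ones, ones))
```

Each step has the form h_t = a_t * h_{t-1} + b_t, and composing two such steps gives another step of the same form, which is the property that lets a Blelloch-style parallel scan replace this loop.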

Evaluation

eval.py compares models on:

| Metric | What it measures |
|---|---|
| Val Loss / Perplexity | Language modeling quality |
| Generation Speed | Tokens/sec on CPU |
| Forward Accuracy | Given A, predict B |
| Reverse Accuracy | Given B, predict A (tests bidirectionality) |
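
Perplexity is the exponential of the mean per-token loss, assuming the loss is a cross-entropy measured in nats. The helper name below is hypothetical, and how eval.py computes the loss for each paradigm is not described here.

```python
import math


def perplexity(mean_nll_nats):
    """Perplexity from a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll_nats)


# A model that guesses uniformly over a 65-symbol vocabulary has loss log(65).
print(perplexity(math.log(65)))
print(perplexity(1.5))
```

For masked diffusion models the loss is typically a variational bound rather than an exact likelihood, so cross-paradigm perplexity numbers are best read with that caveat in mind.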

Hyperparameters

python train.py --model ar \
    --n_layer 4 --n_head 4 --n_embd 128 \
    --block_size 128 --batch_size 32 \
    --epochs 5 --lr 3e-4 \
    --dropout 0.1 --grad_clip 1.0 \
    --max_steps 2000 --eval_interval 500

Mamba-specific: --d_state 16 --expand_factor 2 --d_conv 4

Diffusion/Flow-specific: --sampling_steps 64

References

  • nanoGPT - Karpathy's minimal GPT
  • MDLM - Masked Diffusion Language Models (Sahoo et al. 2024)
  • Discrete Flow Matching - Campbell et al. 2024
  • Mamba - Linear-Time Sequence Modeling (Gu & Dao 2023)
  • mamba.py - Pure PyTorch Mamba
  • Mercury - Diffusion LLMs at 1100+ tok/s (Inception Labs 2025)
