This repository is my personal, from-scratch implementation of Stanford's CS336 Assignment 1 (Basics): building a Transformer language model from the ground up, starting with training a BPE tokenizer. I am not a Stanford student; I'm working through the publicly available course materials on my own, purely to learn.
My own code lives under src/assignment1/, with the scripts that drive it under scripts/. The only things used to build this are the assignment handout (cs336_assignment1_basics.pdf), the tests, and the starter file pretokenization_example.py. Beyond these three things, everything is written by me.
Note that I do not aim for the optimal implementation of every step; I aim for implementations that are performant enough for my needs.
I am building everything so that it runs on my MacBook (M3 Pro, 18GB of RAM) and on my desktop computer (which runs Arch Linux with a Ryzen 7 9800X3D, 32GB of RAM, and an RTX 5070 Ti).
In keeping with the course's AI policy, all of the implementation code here is written by me. None of it is written or autocompleted by AI. I only use an AI assistant the way the policy permits: to occasionally ask high-level conceptual questions or low-level documentation questions about specific Python libraries and syntax. All implementations and architectural designs are my own work.
The original assignment setup instructions follow.
We manage our environments with uv to ensure reproducibility, portability, and ease of use.
Install uv here (recommended), or run pip install uv or brew install uv.
We recommend reading a bit about managing projects in uv here (you will not regret it!).
You can now run any code in the repo using
uv run <python_file_path>
and the environment will be automatically solved and activated when necessary.
Run the unit tests with
uv run pytest
Initially, all tests should fail with NotImplementedErrors.
To connect your implementation to the tests, complete the
functions in ./tests/adapters.py.
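As a rough illustration of what completing an adapter means: an adapter is a thin wrapper that the tests call, and its body forwards to your own code. The sketch below is self-contained and uses made-up names (my_byte_counts and run_byte_counts are hypothetical); the real adapter names and signatures are whatever ./tests/adapters.py defines.

```python
from collections import Counter


def my_byte_counts(text: str) -> Counter:
    """Hypothetical implementation function; in this repo it would
    live somewhere under src/assignment1/."""
    # Iterating over a bytes object yields integer byte values.
    return Counter(text.encode("utf-8"))


def run_byte_counts(text: str) -> dict[int, int]:
    """Hypothetical adapter in the style of tests/adapters.py.

    Before it is filled in, the body would simply be
    `raise NotImplementedError`. Afterwards it only delegates,
    so the tests never import the implementation directly.
    """
    return dict(my_byte_counts(text))
```

Keeping the adapter this thin means the implementation can be reorganized freely, as long as the adapter still returns what the tests expect.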
Download the TinyStories data and a subsample of OpenWebText:
mkdir -p data
cd data
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz
gunzip owt_train.txt.gz
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz
gunzip owt_valid.txt.gz
cd ..
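On a machine without wget (macOS does not ship it by default), the same files can be fetched with the Python standard library. This is a minimal sketch, not part of the original instructions: the helper names local_path and fetch are hypothetical, and it assumes it is run from the repository root with network access.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# The same four URLs used by the wget commands above.
URLS = [
    "https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt",
    "https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt",
    "https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz",
    "https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz",
]


def local_path(url: str, data_dir: Path = Path("data")) -> Path:
    """Return where a URL ends up on disk, with any .gz suffix dropped."""
    name = url.rsplit("/", 1)[-1]
    return data_dir / name.removesuffix(".gz")


def fetch(url: str, data_dir: Path = Path("data")) -> Path:
    """Download one file, decompressing .gz sources while streaming."""
    data_dir.mkdir(parents=True, exist_ok=True)
    target = local_path(url, data_dir)
    if target.exists():
        return target  # already downloaded; skip

    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response:
        if url.endswith(".gz"):
            source = gzip.GzipFile(fileobj=response, mode="rb")
        else:
            source = response
        with open(tmp, "wb") as out:
            shutil.copyfileobj(source, out)
    tmp.replace(target)
    return target
```

Calling fetch(url) for each entry in URLS is intended to leave the same four .txt files in data/ as the commands above.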