isaiahtx/cs336_assignment1

My Implementation of CS336 Assignment 1: Basics

This repository is my personal, from-scratch implementation of Stanford's CS336 Assignment 1 (Basics): building a Transformer language model from the ground up, starting with training a BPE tokenizer. I am not a Stanford student; I'm working through the publicly available course materials on my own, purely to learn.

My own code lives under src/assignment1/, with the scripts that drive it under scripts/. The only things used to build this are the assignment handout (cs336_assignment1_basics.pdf), the tests, and the starter file pretokenization_example.py. Beyond these three things, everything is written by me.

Note that I do not aim for the most optimal implementation of every step; I simply aim for implementations that are performant enough for my needs.

I am building everything so that it runs on both my MacBook (M3 Pro, 18GB of RAM) and my desktop computer (Arch Linux, Ryzen 7 9800X3D, 32GB of RAM, RTX 5070 Ti).

In keeping with the course's AI policy, all of the implementation code here is written by me. None of it is written or autocompleted by AI. I only use an AI assistant the way the policy permits: to occasionally ask high-level conceptual questions or low-level documentation questions about specific Python libraries or syntax. All implementations and architectural designs are my own work.


The original assignment setup instructions follow.

Setup

Environment

We manage our environments with uv to ensure reproducibility, portability, and ease of use. Install uv here (recommended), or run pip install uv or brew install uv. We recommend reading a bit about managing projects in uv here (you will not regret it!).

You can now run any code in the repo using

uv run <python_file_path>

and the environment will be automatically solved and activated when necessary.

Run unit tests

uv run pytest

Initially, all tests should fail with NotImplementedErrors. To connect your implementation to the tests, complete the functions in ./tests/adapters.py.
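As a rough illustration of what "completing" an adapter means, the sketch below shows a stub that raises NotImplementedError next to a filled-in version that delegates to an implementation. Every name here (softmax, run_example_stub, run_example) is hypothetical and chosen for illustration only; the real adapters and their signatures are the ones defined in ./tests/adapters.py.

```python
import math


def softmax(values: list[float]) -> list[float]:
    """Hypothetical stand-in for a function from your own package."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def run_example_stub(values: list[float]) -> list[float]:
    """What an unfinished adapter looks like: the test fails here."""
    raise NotImplementedError


def run_example(values: list[float]) -> list[float]:
    """A completed adapter: a thin wrapper that calls the implementation."""
    return softmax(values)
```

Keeping adapters this thin means the tests exercise the implementation directly, and the implementation itself stays free of test-specific code.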

Download data

Download the TinyStories data and a subsample of OpenWebText:

mkdir -p data
cd data

wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt

wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz
gunzip owt_train.txt.gz
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz
gunzip owt_valid.txt.gz

cd ..
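After running the commands above, a quick check can confirm that the four text files ended up in data/. The following is a minimal sketch, not part of the assignment: the helper name missing_data_files is made up for illustration, and the file names are taken from the download URLs above (with .gz dropped after gunzip).

```python
from pathlib import Path

# File names implied by the wget/gunzip commands above.
EXPECTED_FILES = (
    "TinyStoriesV2-GPT4-train.txt",
    "TinyStoriesV2-GPT4-valid.txt",
    "owt_train.txt",
    "owt_valid.txt",
)


def missing_data_files(data_dir: str = "data") -> list[str]:
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling missing_data_files() from the repository root returns an empty list when all four files are present.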
