This project implements standard Attention and FlashAttention in CUDA, with Python bindings for benchmarking and comparison.
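For reference, standard (scaled dot-product) attention computes softmax(QK^T / sqrt(d)) V over query, key, and value matrices; FlashAttention produces the same result but tiles the computation through on-chip memory with an online softmax, so the full seq_len x seq_len score matrix is never materialized. Below is a minimal NumPy sketch of the reference computation for a single head. It is not code from this repository (the function name and shapes are our own choices for illustration), just the math the CUDA kernels implement:

```python
# Minimal NumPy sketch of standard scaled dot-product attention
# for a single head. Illustrative only; not this repository's code.
import numpy as np

def standard_attention(Q, K, V):
    """Q, K, V: (seq_len, head_dim) arrays; returns (seq_len, head_dim)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (seq_len, seq_len) scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```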
To run our benchmarking scripts, first set up a Python environment; then you can run the Python files in the `python` folder.
- Load modules. Run the following in the root of the project:
```
module load cuda/12.8 python/3.12.8 gcc/13.2.0 ninja/1.8.2
```

- Set up the virtual Python environment. Run the following in the root of the project to create a Python virtual environment and install all requirements from `requirements.txt`:

```
./setup_env.sh
```

This might take a while. When done, you should see a `venv` directory in the root of the project.
- Activate the environment. Run the following to activate the newly created environment:

```
source venv/bin/activate
```

You have to be in the root of the project folder to run the Python scripts. Run the following for a basic benchmark example with verification:

```
python3 python/run_benchmark.py --verify
```

or

```
python3 python/run_benchmark.py --seq_lengths "64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576" --head_dim 128 --num_runs 4 --verify --output out_128.pdf --sweep
```

to reproduce the benchmarking results from our report (this takes a while to run).
The following describes the options you can pass to our Python scripts:
```
usage: run_benchmark.py [-h] [--seq_lengths SEQ_LENGTHS] [--head_dim HEAD_DIM] [--num_runs NUM_RUNS] [--verify]
                        [--output OUTPUT] [--sweep] [--hyperparam-search]

Benchmark attention implementations

options:
  -h, --help            show this help message and exit
  --seq_lengths SEQ_LENGTHS
                        Comma-separated list of sequence lengths (default: 1024)
  --head_dim HEAD_DIM   Head dimension (default: 64)
  --num_runs NUM_RUNS   Number of benchmark runs (default: 10)
  --verify              Verify correctness between implementations
  --output OUTPUT       Output path for benchmark plot
  --sweep               Run sequence length sweep instead of single benchmark
  --hyperparam-search   Run hyperparameter search to find optimal block sizes and thread dimensions
```
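For example, to run the hyperparameter search at a non-default head dimension (an illustrative invocation; how `--hyperparam-search` interacts with the other flags is determined by the script):

```
python3 python/run_benchmark.py --hyperparam-search --head_dim 128
```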