AURA is a quantization method that quantizes both weights and activations into low-bit augmented matrices.
We use an accuracy-aware strategy to determine which channels are most likely to suffer severe accuracy loss under low-bit quantization. We then quantize the weights and activations into NVFP4 augmented matrices, concatenating additional channels onto the activation matrices to contain the quantization error.
In contrast to traditional mixed-precision quantization methods, AURA decouples the GEMM kernel from the quantization process. This design supports various data formats, such as MXFP4 and NVFP4, and makes adapting to future data types easier, establishing AURA as a more universal strategy.
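The augmented-matrix idea can be pictured with a minimal sketch. Everything below is illustrative rather than the repo's API: `fake_quant` is a plain symmetric-integer stand-in for NVFP4, and the per-channel norm is just one possible sensitivity proxy.

```python
import torch

def fake_quant(x, bits=4):
    # Symmetric per-tensor fake quantization; an integer stand-in for NVFP4.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def build_augmented_activation(x, num_extra):
    # Rank channels by a simple sensitivity proxy (per-channel L2 norm) and
    # treat the top-`num_extra` channels as the ones most at risk.
    sensitivity = x.norm(dim=0)
    extra_idx = sensitivity.topk(num_extra).indices
    xq = fake_quant(x)                    # low-bit main matrix
    residual = (x - xq)[:, extra_idx]     # quantization error on risky channels
    # Augmented matrix: the low-bit body plus extra channels carrying the
    # error, so a single GEMM can consume both.
    return torch.cat([xq, residual], dim=1), extra_idx

x = torch.randn(8, 64)
x_aug, idx = build_augmented_activation(x, num_extra=4)
print(x_aug.shape)  # torch.Size([8, 68])
```

Because the error lives in ordinary extra columns, the GEMM kernel never needs to know which format produced them, which is what lets the same kernel serve different data formats.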
```bash
conda create -n aura python=3.10 -y
conda activate aura
```

Please make sure that CUDA 12.8 is available in your environment.

```bash
git clone --recurse-submodules https://github.com/actypedef/AURA.git
cd AURA
pip install -r requirements.txt
```

```bash
sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

Then build the AURA kernels:

```bash
cd kernels/
bash remake.sh
```

This might take a few minutes.
The reorder indices and `select_num` needed for quantization are generated by:

```bash
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 32 --seqlen 2048 --act_sort_metric frobenius
```

Results are saved in `./saved/`.
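For intuition, here is a minimal sketch of what a `frobenius` sort metric might compute over calibration activations; the function name, tensor shapes, and output filename are illustrative, not the script's actual interface.

```python
import os
import torch

def reorder_indices_by_frobenius(act_samples):
    # Accumulate per-channel squared activation norms over all calibration
    # samples, then sort channels by the resulting Frobenius norm.
    energy = torch.zeros(act_samples[0].shape[-1])
    for x in act_samples:
        energy += x.float().pow(2).sum(dim=0)
    return torch.argsort(energy.sqrt(), descending=True)

# Random stand-ins for calibration activations (--samples 32 --seqlen 2048).
samples = [torch.randn(2048, 4096) for _ in range(32)]
indices = reorder_indices_by_frobenius(samples)
os.makedirs("./saved", exist_ok=True)
torch.save(indices, "./saved/reorder_indices.pt")  # illustrative filename
```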
Run quantization:

```bash
bash run_micromix.sh /PATH/TO/YOUR/MODEL/
```

End-to-end efficiency:

```bash
python benchmarks/benchmark_e2e_aura.py --model 'llama-2-7b' --batch_size 8 --prefill_seq_len 1024 --decode_steps 50
```
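For reference, below is a minimal timing sketch of what such an end-to-end benchmark typically measures, assuming a Hugging Face-style causal LM already on a CUDA device; it is not the repo's `benchmark_e2e_aura.py`.

```python
import time
import torch

@torch.inference_mode()
def benchmark_e2e(model, batch_size=8, prefill_seq_len=1024, decode_steps=50):
    # Times one prefill pass, then `decode_steps` single-token decode passes,
    # mirroring the flags of the benchmark command above.
    device = next(model.parameters()).device
    ids = torch.randint(0, 32000, (batch_size, prefill_seq_len), device=device)

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model(ids, use_cache=True)                  # prefill
    torch.cuda.synchronize()
    prefill_s = time.perf_counter() - t0

    past, tok = out.past_key_values, ids[:, -1:]
    t0 = time.perf_counter()
    for _ in range(decode_steps):                     # autoregressive decode
        out = model(tok, past_key_values=past, use_cache=True)
        past = out.past_key_values
        tok = out.logits[:, -1:].argmax(dim=-1)
    torch.cuda.synchronize()
    decode_ms = (time.perf_counter() - t0) / decode_steps * 1e3

    print(f"prefill: {prefill_s:.3f} s, decode: {decode_ms:.1f} ms/token")
```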
TensorRT efficiency:

```bash
pip install tensorrt
python benchmark/trt-fp8-prefill-llama.py
```