AURA is a quantization method that quantizes both weights and activations into low-bit augmented matrices.
We use an accuracy-aware strategy to determine which channels are most likely to suffer severe accuracy loss under low-bit quantization. We then quantize the weights and activations into NVFP4 augmented matrices, concatenating additional channels onto the activation matrices to contain the quantization error.
In contrast to traditional mixed-precision quantization methods, AURA decouples the GEMM kernel from the quantization process. This design supports various data formats, such as MXFP4 and NVFP4, and makes adapting to future data types easier, establishing AURA as a more universal strategy.
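The augmented-matrix idea can be pictured with a minimal sketch. Everything below is illustrative rather than the repo's API: `fake_quant` is a plain symmetric-integer stand-in for NVFP4, and the per-channel norm is just one possible sensitivity proxy.

```python
import torch

def fake_quant(x, bits=4):
    # Symmetric per-tensor fake quantization; an integer stand-in for NVFP4.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def build_augmented_activation(x, num_extra):
    # Rank channels by a simple sensitivity proxy (per-channel L2 norm) and
    # treat the top-`num_extra` channels as the ones most at risk.
    sensitivity = x.norm(dim=0)
    extra_idx = sensitivity.topk(num_extra).indices
    xq = fake_quant(x)                    # low-bit main matrix
    residual = (x - xq)[:, extra_idx]     # quantization error on risky channels
    # Augmented matrix: the low-bit body plus extra channels carrying the
    # error, so a single GEMM can consume both.
    return torch.cat([xq, residual], dim=1), extra_idx

x = torch.randn(8, 64)
x_aug, idx = build_augmented_activation(x, num_extra=4)
print(x_aug.shape)  # torch.Size([8, 68])
```

Because the error lives in ordinary extra columns, the GEMM kernel never needs to know which format produced them, which is what lets the same kernel serve different data formats.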
```bash
conda create -n aura python=3.10 -y
conda activate aura
```

Please make sure that CUDA 12.8 is available in your environment.

```bash
git clone --recurse-submodules https://github.com/actypedef/AURA.git
cd AURA
pip install -r requirements.txt
```

```bash
sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

Then build the AURA kernels:

```bash
cd kernels/
bash remake.sh
```

This might take a few minutes.
The reorder indices and `select_num` needed for quantization are generated by:

```bash
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 32 --seqlen 2048 --act_sort_metric frobenius
```

Results are saved in `./saved/`.
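For intuition, here is a minimal sketch of what a `frobenius` sort metric might compute over calibration activations; the function name, tensor shapes, and output filename are illustrative, not the script's actual interface.

```python
import os
import torch

def reorder_indices_by_frobenius(act_samples):
    # Accumulate per-channel squared activation norms over all calibration
    # samples, then sort channels by the resulting Frobenius norm.
    energy = torch.zeros(act_samples[0].shape[-1])
    for x in act_samples:
        energy += x.float().pow(2).sum(dim=0)
    return torch.argsort(energy.sqrt(), descending=True)

# Random stand-ins for calibration activations (--samples 32 --seqlen 2048).
samples = [torch.randn(2048, 4096) for _ in range(32)]
indices = reorder_indices_by_frobenius(samples)
os.makedirs("./saved", exist_ok=True)
torch.save(indices, "./saved/reorder_indices.pt")  # illustrative filename
```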
Run quantization:

```bash
bash run_micromix.sh /PATH/TO/YOUR/MODEL/
```

End-to-end efficiency:

```bash
python benchmarks/benchmark_e2e_aura.py --model 'llama-2-7b' --batch_size 8 --prefill_seq_len 1024 --decode_steps 50
```
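For reference, below is a minimal timing sketch of what such an end-to-end benchmark typically measures, assuming a Hugging Face-style causal LM already on a CUDA device; it is not the repo's `benchmark_e2e_aura.py`.

```python
import time
import torch

@torch.inference_mode()
def benchmark_e2e(model, batch_size=8, prefill_seq_len=1024, decode_steps=50):
    # Times one prefill pass, then `decode_steps` single-token decode passes,
    # mirroring the flags of the benchmark command above.
    device = next(model.parameters()).device
    ids = torch.randint(0, 32000, (batch_size, prefill_seq_len), device=device)

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model(ids, use_cache=True)                  # prefill
    torch.cuda.synchronize()
    prefill_s = time.perf_counter() - t0

    past, tok = out.past_key_values, ids[:, -1:]
    t0 = time.perf_counter()
    for _ in range(decode_steps):                     # autoregressive decode
        out = model(tok, past_key_values=past, use_cache=True)
        past = out.past_key_values
        tok = out.logits[:, -1:].argmax(dim=-1)
    torch.cuda.synchronize()
    decode_ms = (time.perf_counter() - t0) / decode_steps * 1e3

    print(f"prefill: {prefill_s:.3f} s, decode: {decode_ms:.1f} ms/token")
```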
TensorRT efficiency:

```bash
pip install tensorrt
python benchmark/trt-fp8-prefill-llama.py
```