ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs

ARCQuant is a high-performance quantization framework designed to resolve the conflict between accuracy and inference efficiency in low-bit LLMs.

While fine-grained quantization (e.g., block-wise formats such as NVFP4) effectively isolates quantization noise, activation outliers still degrade accuracy in critical channels. Traditional mixed-precision methods address this by splitting computations into separate branches, which introduces significant kernel launch overhead and memory fragmentation.

ARCQuant takes a different approach. Instead of treating outliers separately, we leverage the structural sparsity of quantization errors in fine-grained settings. We capture the quantization residuals of these critical channels and fuse them back into the computation as Augmented Residual Channels (ARC).
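
To make the idea concrete, here is a minimal NumPy sketch of fusing residual channels into a single matrix multiplication. It is an illustration under stated assumptions, not the repository's kernels: `fake_quant_blockwise` is a hypothetical stand-in that uses a simple symmetric integer grid rather than the real NVFP4 format, weights are left unquantized for clarity, channels are ranked by their maximum absolute activation, and the names `arc_matmul` and `num_arc` are invented for this example.

```python
import numpy as np

def fake_quant_blockwise(x, block=16, levels=7):
    """Symmetric per-block fake quantization along the last axis.

    Assumption: a plain integer grid stands in for NVFP4 here.
    """
    rows, cols = x.shape
    assert cols % block == 0, "channel count must be a multiple of the block size"
    xb = x.reshape(rows, cols // block, block)
    scale = np.abs(xb).max(axis=-1, keepdims=True) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return (np.round(xb / scale) * scale).reshape(rows, cols)

def arc_matmul(x, w, num_arc, block=16):
    """Compute x @ w with quantized activations plus augmented residual channels.

    x: (tokens, K) activations, w: (K, N) weights.
    num_arc: number of top-ranked channels whose residuals are appended
             (hypothetical parameter, must be a multiple of `block`).
    """
    # Rank channels so the largest-magnitude ones come first.
    order = np.argsort(-np.abs(x).max(axis=0))
    x_r, w_r = x[:, order], w[order, :]

    # Quantize all channels, then quantize the residual of the top channels.
    xq = fake_quant_blockwise(x_r, block)
    residual = x_r[:, :num_arc] - xq[:, :num_arc]
    rq = fake_quant_blockwise(residual, block)

    # Append residual channels along K and duplicate the matching weight rows,
    # so a single GEMM yields xq @ w_r + rq @ w_r[:num_arc].
    x_aug = np.concatenate([xq, rq], axis=1)
    w_aug = np.concatenate([w_r, w_r[:num_arc, :]], axis=0)
    return x_aug @ w_aug

# Synthetic activations with a few outlier channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
x[:, [3, 17, 40, 59]] *= 50.0
w = rng.standard_normal((64, 32))

exact = x @ w
baseline = fake_quant_blockwise(x) @ w
with_arc = arc_matmul(x, w, num_arc=16)
err_base = np.linalg.norm(baseline - exact)
err_arc = np.linalg.norm(with_arc - exact)
print(f"baseline error {err_base:.3f}, ARC-style error {err_arc:.3f}")
```

In this sketch the extra cost is a wider reduction dimension (80 instead of 64 channels) inside one GEMM, rather than a second computation branch. Part of the error reduction on this toy data comes from the reordering step, which groups the outlier channels into the same blocks.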

1. Installation

conda create -n arcquant python=3.10 -y
conda activate arcquant

Make sure that CUDA 12.8 is available in your environment.
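
One way to check the toolkit version before building is to read the release reported by `nvcc --version`. The helper below is a small sketch and not part of the repository; the function names are hypothetical, and it assumes `nvcc` is on your `PATH` and prints a line containing `release <major>.<minor>`.

```python
import re
import shutil
import subprocess

def parse_nvcc_release(output):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None

def installed_cuda_release():
    """Return the CUDA toolkit release found via nvcc, or None if nvcc is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)

if __name__ == "__main__":
    release = installed_cuda_release()
    if release == (12, 8):
        print("CUDA 12.8 toolkit found")
    else:
        print(f"Expected CUDA 12.8, found: {release}")
```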

git clone --recurse-submodules https://github.com/actypedef/ARCQuant.git
cd ARCQuant
pip install -r requirements.txt

2. Usage

2.1 Building Kernels

sudo apt-get update
sudo apt-get install python3-dev

conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

cd kernels/
bash remake.sh

This might take a few minutes.

2.2 Preprocessing

reorder_indices and select_num are required for quantization. Generate them with:

python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 128 --seqlen 2048 --act_sort_metric max

Results are saved in ./saved/.
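
The internals of reorder_indices.py are not described here. As a rough illustration of what a `max` sort metric over calibration samples could look like, the sketch below keeps a running per-channel maximum of absolute activations across batches and sorts channels by it. Everything in it is an assumption made for the example: the function names, the `ratio` threshold, and in particular the rule used to derive a `select_num` value, which is invented and may differ from what the script does.

```python
import numpy as np

def running_channel_max(batches):
    """Per-channel max |activation| accumulated over calibration batches.

    Each batch is an array of shape (tokens, channels).
    """
    stat = None
    for acts in batches:
        current = np.abs(acts).max(axis=0)
        stat = current if stat is None else np.maximum(stat, current)
    return stat

def reorder_and_select(stat, ratio=8.0, block=16):
    """Sort channels by the metric (largest first) and pick how many to select.

    Hypothetical rule: count channels whose metric exceeds `ratio` times the
    median, then round up to a multiple of `block`.
    """
    order = np.argsort(-stat, kind="stable")
    num_outliers = int((stat > ratio * np.median(stat)).sum())
    select_num = -(-num_outliers // block) * block  # ceiling to a block multiple
    return order, min(select_num, stat.size)

# Two synthetic calibration batches with outliers in channels 5 and 20.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((64, 32)) for _ in range(2)]
for acts in batches:
    acts[:, [5, 20]] *= 100.0

stat = running_channel_max(batches)
order, select_num = reorder_and_select(stat)
print("first channels after reordering:", order[:4].tolist())
print("select_num:", select_num)
```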

2.3 Accuracy Evaluation

bash evaluate.sh /PATH/TO/YOUR/MODEL/

3. Efficiency Evaluation

Install FlashInfer from the bundled submodule:

cd third-party/flashinfer
python -m pip install -v .

End-to-end efficiency:

python benchmarks/benchmark_e2e_arc.py --model 'qwen2.5-7b' --batch_size 4 --prefill_seq_len 1024
