
# LLM Quantization

Maintainers: Aaron Zhao and Cheng Zhang

A curated list of LLM quantization papers, drawing in part on Sudarsharm Sreeram's initial effort.

## Table of Contents

- [2024](#2024)
- [2023](#2023)
- [2022](#2022)
- [Surveys](#surveys)
- [Implementation references](#implementation-references)
- [Notes](#notes)
## 2024

| Title | Venue | Code | Notes |
| --- | --- | --- | --- |
| LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices | Arxiv | - | PTQ; a low-rank version of FlexRound with fewer trainable parameters |

## 2023

| Title | Venue | Code | Notes |
| --- | --- | --- | --- |
| AffineQuant: Affine Transformation Quantization for Large Language Models | ICLR | Github | PTQ, SVD |
| AWQ: Activation-aware weight quantization for LLM compression and acceleration | Arxiv | Github | - |
| BitNet: Scaling 1-bit Transformers for Large Language Models | Arxiv | - | - |
| Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization | ACL | - | - |
| EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs | EMNLP | - | PTQ; weight-only; keeps salient weights in high precision and uses SGD to optimize per-column scales for the low-precision weights |
| Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization | EMNLP | - | SmoothQuant for some layers, AWQ for others? |
| FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization | PMLR | Github | Does this method really benefit hardware? I doubt it does |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR | Github | - |
| INT-FP-QSim: Mixed precision and formats for large language models and vision transformers | Arxiv | Github | - |
| Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models | Arxiv | - | - |
| LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | Arxiv | Github | - |
| LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | Arxiv | Github | Uses iterative SVD to initialize the low-rank A, B matrices in QLoRA-style PEFT |
| Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantisation | Arxiv | Github | - |
| Microscaling Data Formats for Deep Learning | Arxiv | Github | - |
| OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR | Github | PTQ; uses SGD to learn row/column scaling factors and fuses them back, SmoothQuant-style |
| Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling | EMNLP | Github | Note |
| PB-LLM: Partially Binarized Large Language Models | ICLR | Github | QAT; partially binarized-weight LLM |
| PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models | ACL | - | - |
| QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | ICLR | Github | Quantized PEFT |
| QLoRA: Efficient Finetuning of Quantized LLMs | NeurIPS | Github | - |
| Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing | Arxiv | Github | - |
| QuIP: 2-bit quantisation of large language models with guarantees | Arxiv | Github | - |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML | Github | - |
| SpQR: A sparse-quantised representation for near-lossless LLM weight compression | Arxiv | Github | PTQ; smaller block sizes with more quantization scales and zero-points |
| SqueezeLLM: Dense-and-Sparse Quantization | Arxiv | Github | Note |
| The case for 4-bit precision: k-bit Inference Scaling Laws | ICML | - | - |
| Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases | ICML | - | - |
| With Shared Microexponents, A Little Shifting Goes a Long Way | ISCA | - | - |
| Z-FOLD: A Frustratingly Easy Post-Training Quantization Scheme for LLMs | EMNLP | Github | PTQ; following GPTQ, uses the Hessian to optimize scaling factors that can be fused into LayerNorm/previous linear layers |
| Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models | EMNLP | - | QAT; W4A4/W8A8; small models; adversarial loss (pretrained vs. quantized) |
| ZeroQuant-FP: A leap forward in LLMs post-training W4A8 quantisation using floating-point formats | Arxiv | Github | - |
| ZeroQuant-V2: Exploring post-training quantisation in LLMs from comprehensive study to low rank compensation | Arxiv | - | - |

## 2022

| Title | Venue | Code | Notes |
| --- | --- | --- | --- |
| LLM.int8(): 8-bit matrix multiplication for transformers at scale | Arxiv | Github | - |
| ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Arxiv | Github | - |

## Surveys

| Title | Venue | Type | Code |
| --- | --- | --- | --- |
| A survey on model compression for large language models | - | - | - |

## Implementation references

CUDA kernels for quantization:

| Name | Notes |
| --- | --- |
| BitsandBytes | Tim Dettmers; LLM.int8(); single GPU |
| DeepSpeed | Microsoft; multi-GPU |
| Flash Attention | Dao-AILab (CS Princeton); single GPU |
| Megatron-LM | NVIDIA; multi-GPU |
| MXScaling | Microsoft; MX formats |
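As a minimal reference for the kind of operation these kernels accelerate, here is a plain-NumPy sketch (my own illustration, not code from any of the libraries above) of symmetric per-row absmax int8 quantization, the scheme LLM.int8() applies to non-outlier features:

```python
import numpy as np

def quantize_rowwise_int8(w: np.ndarray):
    """Symmetric per-row absmax quantization to int8."""
    # One scale per row: maps the largest magnitude in the row to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_rowwise_int8(q: np.ndarray, scale: np.ndarray):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_rowwise_int8(w)
w_hat = dequantize_rowwise_int8(q, s)
# Rounding error per element is bounded by half the row's scale.
```

The real kernels fuse quantize/matmul/dequantize on the GPU; this only shows the numerics.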

## Notes

### SqueezeLLM: Dense-and-Sparse Quantization

This paper addresses outlier sensitivity with a Dense-and-Sparse decomposition: each weight matrix is split into a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive outlier weights in full precision. This makes it possible to serve larger models with a smaller memory footprint at the same latency while retaining accuracy and quality. For instance, the Squeeze variant of the Vicuna models can be served within 6 GB of memory and reaches 2% higher MMLU than the FP16 baseline, which has a 2x larger memory footprint.
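The split described above can be sketched as follows (a simplified NumPy illustration, not the authors' implementation; the actual method additionally uses sensitivity-based non-uniform quantization for the dense part):

```python
import numpy as np

def dense_and_sparse_split(w: np.ndarray, outlier_pct: float = 0.5):
    """Split w into a dense part (to be quantized) and a sparse outlier part."""
    # Treat the largest-magnitude outlier_pct percent of weights as outliers.
    thresh = np.percentile(np.abs(w), 100.0 - outlier_pct)
    mask = np.abs(w) > thresh
    sparse_part = np.where(mask, w, 0.0)  # outliers kept in full precision
    dense_part = np.where(mask, 0.0, w)   # remainder is quantized aggressively
    return dense_part, sparse_part

def quantize_dense_int4(dense: np.ndarray):
    """Symmetric per-row absmax quantization to the 4-bit range [-7, 7]."""
    scale = np.abs(dense).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(dense / scale), -7, 7)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
dense, sparse = dense_and_sparse_split(w, outlier_pct=0.5)
q, scale = quantize_dense_int4(dense)
w_hat = q * scale + sparse  # dequantized dense part + exact sparse outliers
```

Removing the outliers shrinks the dynamic range of the dense part, so its per-row scales (and hence the quantization error) are smaller than when quantizing the full matrix directly.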

### Outlier Suppression+

PTQ; weight-activation; INT4/6/8. Uses activation statistics to compute channel-wise zero-points (shifts) and scaling factors that can be fused into the weights of adjacent layers.
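A minimal sketch of the shift-and-scale idea (NumPy, illustrative only; the function name is mine, and the actual method also handles migration into preceding LayerNorm/linear layers):

```python
import numpy as np

def fold_shift_scale(x_calib: np.ndarray, w: np.ndarray, b: np.ndarray):
    """Channel-wise shift/scale from calibration stats, folded into a linear layer.

    x_calib: (tokens, in_features) calibration activations
    w:       (in_features, out_features) weight, b: (out_features,) bias
    """
    hi, lo = x_calib.max(axis=0), x_calib.min(axis=0)
    z = (hi + lo) / 2.0                      # channel-wise shift (centers each channel)
    s = np.maximum(hi - lo, 1e-6) / 2.0      # channel-wise scale (normalizes the range)
    w_folded = s[:, None] * w                # fold the scale into the weight
    b_folded = b + z @ w                     # fold the shift into the bias

    def forward(x):
        x_t = (x - z) / s                    # transformed activations: centered, unit range,
        return x_t @ w_folded + b_folded     # hence much easier to quantize
    return forward

# Equivalence check: the folded layer matches the original in full precision.
x = np.random.randn(16, 8) * np.array([1, 10, 1, 1, 1, 1, 1, 100])  # outlier channels
w = np.random.randn(8, 4)
b = np.random.randn(4)
f = fold_shift_scale(x, w, b)
assert np.allclose(f(x), x @ w + b)
```

Because `x = x_t * s + z`, we have `x @ w = x_t @ (diag(s) @ w) + z @ w`, so the transformation is exact; only the (now outlier-free) transformed activations get quantized.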