
# LLM Quantization

Maintainers: Aaron Zhao and Cheng Zhang

A curated list of LLM quantization papers, drawing in part on Sudarsharm Sreeram's initial effort.

## Table of Contents

- [2024](#2024)
- [2023](#2023)
- [2022](#2022)
- [Surveys](#surveys)
- [Implementation references](#implementation-references)
- [Notes](#notes)
## 2024

| Title | Venue | Code | Notes |
| --- | --- | --- | --- |
| LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices | Arxiv | - | PTQ; a low-rank version of FlexRound with fewer trainable parameters |

## 2023

| Title | Venue | Code | Notes |
| --- | --- | --- | --- |
| AffineQuant: Affine Transformation Quantization for Large Language Models | ICLR | Github | PTQ, SVD |
| AWQ: Activation-aware weight quantization for LLM compression and acceleration | Arxiv | Github | - |
| BitNet: Scaling 1-bit Transformers for Large Language Models | Arxiv | - | - |
| Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization | ACL | - | - |
| EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs | EMNLP | - | PTQ; weight-only; keeps salient weights in high precision and uses SGD to optimize per-column scales for the low-precision weights |
| Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization | EMNLP | - | SmoothQuant for some layers, AWQ for others? |
| FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization | PMLR | Github | Does this method really benefit hardware? I doubt it does |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR | Github | - |
| INT-FP-QSim: Mixed precision and formats for large language models and vision transformers | Arxiv | Github | - |
| Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models | Arxiv | - | - |
| LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | Arxiv | Github | - |
| LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | Arxiv | Github | Uses iterative SVD to initialize the low-rank A, B matrices in QLoRA-style PEFT |
| Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantisation | Arxiv | Github | - |
| Microscaling Data Formats for Deep Learning | Arxiv | Github | - |
| OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR | Github | PTQ; uses SGD to learn row/column scaling factors and fuses them back, SmoothQuant-style |
| Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling | EMNLP | Github | Note |
| PB-LLM: Partially Binarized Large Language Models | ICLR | Github | QAT; partially binarized-weight LLM |
| PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models | ACL | - | - |
| QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | ICLR | Github | Quantized PEFT |
| QLoRA: Efficient Finetuning of Quantized LLMs | NeurIPS | Github | - |
| Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing | Arxiv | Github | - |
| QuIP: 2-bit quantisation of large language models with guarantees | Arxiv | Github | - |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML | Github | - |
| SpQR: A sparse-quantised representation for near-lossless LLM weight compression | Arxiv | Github | PTQ; smaller block sizes with more quantization scales and zero-points |
| SqueezeLLM: Dense-and-Sparse Quantization | Arxiv | Github | Note |
| The case for 4-bit precision: k-bit Inference Scaling Laws | ICML | - | - |
| Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases | ICML | - | - |
| With Shared Microexponents, A Little Shifting Goes a Long Way | ISCA | - | - |
| Z-FOLD: A Frustratingly Easy Post-Training Quantization Scheme for LLMs | EMNLP | Github | PTQ; following GPTQ, uses the Hessian to optimize scaling factors that can be fused into LayerNorm/previous linear layers |
| Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models | EMNLP | - | QAT; W4A4/W8A8; small models; adversarial loss (pretrained vs. quantized) |
| ZeroQuant-FP: A leap forward in LLMs post-training W4A8 quantisation using floating-point formats | Arxiv | Github | - |
| ZeroQuant-V2: Exploring post-training quantisation in LLMs from comprehensive study to low rank compensation | Arxiv | - | - |

## 2022

| Title | Venue | Code | Notes |
| --- | --- | --- | --- |
| LLM.int8(): 8-bit matrix multiplication for transformers at scale | Arxiv | Github | - |
| ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Arxiv | Github | - |

## Surveys

| Title | Venue | Type | Code |
| --- | --- | --- | --- |
| A survey on model compression for large language models | - | - | - |

## Implementation references

CUDA kernels for quantization:

| Name | Notes |
| --- | --- |
| BitsandBytes | Tim Dettmers; LLM.int8(); single GPU |
| DeepSpeed | Microsoft; multi-GPU |
| Flash Attention | Dao-AILab (CS Princeton); single GPU |
| Megatron-LM | NVIDIA; multi-GPU |
| MXScaling | Microsoft; MX formats |
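As a minimal reference for the kind of operation these kernels accelerate, here is a plain-NumPy sketch (my own illustration, not code from any of the libraries above) of symmetric per-row absmax int8 quantization, the scheme LLM.int8() applies to non-outlier features:

```python
import numpy as np

def quantize_rowwise_int8(w: np.ndarray):
    """Symmetric per-row absmax quantization to int8."""
    # One scale per row: maps the largest magnitude in the row to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_rowwise_int8(q: np.ndarray, scale: np.ndarray):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_rowwise_int8(w)
w_hat = dequantize_rowwise_int8(q, s)
# Rounding error per element is bounded by half the row's scale.
```

The real kernels fuse quantize/matmul/dequantize on the GPU; this only shows the numerics.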

## Notes

### SqueezeLLM: Dense-and-Sparse Quantization

This paper addresses outlier sensitivity with a Dense-and-Sparse decomposition: each weight matrix is split into a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive outlier weights in full precision. This makes it possible to serve larger models with a smaller memory footprint at the same latency while retaining accuracy and quality. For instance, the Squeeze variant of the Vicuna models can be served within 6 GB of memory and reaches 2% higher MMLU than the FP16 baseline, which has a 2x larger memory footprint.
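The split described above can be sketched as follows (a simplified NumPy illustration, not the authors' implementation; the actual method additionally uses sensitivity-based non-uniform quantization for the dense part):

```python
import numpy as np

def dense_and_sparse_split(w: np.ndarray, outlier_pct: float = 0.5):
    """Split w into a dense part (to be quantized) and a sparse outlier part."""
    # Treat the largest-magnitude outlier_pct percent of weights as outliers.
    thresh = np.percentile(np.abs(w), 100.0 - outlier_pct)
    mask = np.abs(w) > thresh
    sparse_part = np.where(mask, w, 0.0)  # outliers kept in full precision
    dense_part = np.where(mask, 0.0, w)   # remainder is quantized aggressively
    return dense_part, sparse_part

def quantize_dense_int4(dense: np.ndarray):
    """Symmetric per-row absmax quantization to the 4-bit range [-7, 7]."""
    scale = np.abs(dense).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(dense / scale), -7, 7)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
dense, sparse = dense_and_sparse_split(w, outlier_pct=0.5)
q, scale = quantize_dense_int4(dense)
w_hat = q * scale + sparse  # dequantized dense part + exact sparse outliers
```

Removing the outliers shrinks the dynamic range of the dense part, so its per-row scales (and hence the quantization error) are smaller than when quantizing the full matrix directly.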

### Outlier Suppression+

PTQ; weight-activation; INT4/6/8. Uses activation statistics to compute channel-wise zero-points (shifts) and scaling factors that can be fused into the weights of adjacent layers.
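A minimal sketch of the shift-and-scale idea (NumPy, illustrative only; the function name is mine, and the actual method also handles migration into preceding LayerNorm/linear layers):

```python
import numpy as np

def fold_shift_scale(x_calib: np.ndarray, w: np.ndarray, b: np.ndarray):
    """Channel-wise shift/scale from calibration stats, folded into a linear layer.

    x_calib: (tokens, in_features) calibration activations
    w:       (in_features, out_features) weight, b: (out_features,) bias
    """
    hi, lo = x_calib.max(axis=0), x_calib.min(axis=0)
    z = (hi + lo) / 2.0                      # channel-wise shift (centers each channel)
    s = np.maximum(hi - lo, 1e-6) / 2.0      # channel-wise scale (normalizes the range)
    w_folded = s[:, None] * w                # fold the scale into the weight
    b_folded = b + z @ w                     # fold the shift into the bias

    def forward(x):
        x_t = (x - z) / s                    # transformed activations: centered, unit range,
        return x_t @ w_folded + b_folded     # hence much easier to quantize
    return forward

# Equivalence check: the folded layer matches the original in full precision.
x = np.random.randn(16, 8) * np.array([1, 10, 1, 1, 1, 1, 1, 100])  # outlier channels
w = np.random.randn(8, 4)
b = np.random.randn(4)
f = fold_shift_scale(x, w, b)
assert np.allclose(f(x), x @ w + b)
```

Because `x = x_t * s + z`, we have `x @ w = x_t @ (diag(s) @ w) + z @ w`, so the transformation is exact; only the (now outlier-free) transformed activations get quantized.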