Ultra-efficient post-training quantization for large language models — no fine-tuning required. Group-wise INT4 + AWQ-style activation scaling + Protected FP16 channels + Optional low-rank SVD correction
Achieve ~4× memory reduction and 2–3× inference speedup with near-zero accuracy degradation — no fine-tuning required.
- Group-wise INT4 quantization (per-row groups)
- AWQ-style activation-aware scaling using channel importance
- Protected FP16 channels for outlier-sensitive input dimensions
- Optional low-rank SVD correction to recover quantization error
- HuggingFace integration with TurboQuantizer
- PyTorch QuantizedLinear wrapper for easy model conversion
- CUDA kernels for high-performance inference
git clone https://github.com/Kubenew/TurboQuant-v3.git
cd TurboQuant-v3
pip install -e ".[dev]"cd TurboQuant-v3
python setup_cuda.py installimport numpy as np
from turboquant import QuantConfig, turboquant_v3_compress, turboquant_v3_decompress
# Example weight matrix (e.g., from nn.Linear.weight)
W = np.random.normal(0, 0.02, size=(512, 1024)).astype(np.float32)
config = QuantConfig(
group_size=64,
outlier_keep_ratio=0.02,
rank=8, # set to 0 to disable SVD correction
activation_aware=True
)
# Compress
comp = turboquant_v3_compress(W, config)
# Decompress / reconstruct
W_rec = turboquant_v3_decompress(comp)
# Metrics
from turboquant import compute_metrics
metrics = compute_metrics(W, W_rec)
print(f"MSE: {metrics['mse']:.8f}")
print(f"PSNR: {metrics['psnr_db']:.2f} dB")import torch
import torch.nn as nn
from turboquant import QuantConfig, QuantizedLinear
# Example: Replace a linear layer
linear = nn.Linear(1024, 512, bias=False)
# Quantize it
config = QuantConfig(group_size=64, outlier_keep_ratio=0.02, rank=8)
quantized_layer = QuantizedLinear.from_linear(linear, config=config)
# Use exactly like a normal nn.Linear
x = torch.randn(4, 1024)
output = quantized_layer(x)
print(output.shape) # torch.Size([4, 512])
# Get compression stats
stats = quantized_layer.get_weight_stats()
print(f"Compression ratio: {stats['compression_ratio']:.2f}x")from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantConfig, quantize_model
# Load model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Configure quantization
quant_config = TurboQuantConfig(
bits=4,
group_size=128,
version="gemm", # or "exllama" for faster inference
zero_point=True,
activation_aware=True,
outlier_keep_ratio=0.02,
rank=8,
)
# Quantize the model
quantized_model = quantize_model(model, quantization_config=quant_config)
# Save and load
from turboquant import save_quantized_model, load_quantized_model
save_quantized_model(quantized_model, "./quantized-llama/", quant_config)
loaded_model = load_quantized_model("./quantized-llama/", device_map="auto")from turboquant.cuda_ops import QuantizedLinearCUDA, is_cuda_available
if is_cuda_available():
# Create CUDA-optimized quantized layer
cuda_layer = QuantizedLinearCUDA(1024, 512, group_size=128)
cuda_layer.load_from_quantized_linear(quantized_layer)
# Use on GPU
x = torch.randn(4, 1024, device="cuda")
output = cuda_layer(x)TurboQuant-v3
├── src/turboquant/
│ ├── __init__.py # Main package exports
│ ├── config.py # QuantConfig, TurboQuantConfig
│ ├── core.py # Core compression/decompression algorithms
│ ├── linear.py # QuantizedLinear, TurboQuantLinear
│ ├── hf.py # HuggingFace TurboQuantizer integration
│ ├── hf_modules.py # Quantized attention and MLP modules
│ └── cuda_ops.py # CUDA kernel Python bindings
├── cuda/
│ ├── include/ # CUDA headers
│ ├── int4_cuda.cpp # PyTorch bindings
│ └── int4_cuda_kernel.cuh # CUDA kernel implementations
├── tests/ # Unit tests
└── examples/ # Jupyter notebooks and examples
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐
│ FP32 Weights│────>│ AWQ Scaling │────>│ Channel Analysis │
└─────────────┘ └──────────────┘ └────────┬────────┘
│
┌──────────────────────────────┼──────────────────────────────┐
│ │ │
v v v
┌───────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Identify │ │ INT4 Quantize │ │ SVD Correction │
│ Outliers (2%) │ │ (Group-wise) │ │ (Low-rank) │
└───────┬───────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
v v │
┌───────────────┐ ┌─────────────────┐ │
│ Protect FP16 │ │ Pack INT4 │ │
│ Channels │ │ (2 weights/byte)│ │
└───────┬───────┘ └─────────────────┘ │
│ │ │
└──────────────────────────────┼──────────────────────────────┘
│
v
┌─────────────────┐
│ Compressed │
│ Weights │
└─────────────────┘
| Parameter | Type | Default | Description |
|---|---|---|---|
group_size |
int | 64 | Size of quantization groups |
outlier_keep_ratio |
float | 0.02 | Ratio of channels to keep in FP16 |
rank |
int | 8 | SVD correction rank (0 to disable) |
activation_aware |
bool | True | Use activation-aware weight quantization |
zero_point |
bool | True | Use asymmetric quantization |
| Parameter | Type | Default | Description |
|---|---|---|---|
bits |
int | 4 | Quantization bits (4 or 8) |
group_size |
int | 128 | Quantization group size |
version |
str | "gemm" | Kernel version: "gemm", "exllama", or "ipex" |
do_fuse |
bool | False | Fuse modules for better performance |
fuse_max_seq_len |
int | None | Max sequence length for fusing |
The CUDA extension provides optimized kernels for:
- INT4 Dequantization: Fast on-the-fly dequantization
- INT4 GEMM: Fused matmul with dequantization
- AWQ Scale Computation: Activation-aware weight quantization
cd TurboQuant-v3
python setup_cuda.py installRequires: CUDA Toolkit 11.0+, PyTorch with CUDA support
# Run all tests
pytest tests/ -v
# Run specific test class
pytest tests/test_turboquant.py::TestQuantizedLinear -v
# Run with coverage
pytest tests/ --cov=turboquant --cov-report=html| Model | Original Size | Quantized Size | Compression | MSE |
|---|---|---|---|---|
| Llama-2-7B | 13.5 GB | ~3.5 GB | ~4x | < 0.001 |
| Llama-2-13B | 26 GB | ~6.5 GB | ~4x | < 0.001 |
| Mistral-7B | 14.5 GB | ~3.7 GB | ~4x | < 0.001 |
Contributions are welcome! Areas where help is needed:
- Real activation scale collection from HF models
- PyTorch QuantizedLinear wrapper optimization
- CUDA kernels for INT4 GEMM
- Additional model architecture support
- Performance benchmarks
Please read our Contributing Guide before submitting PRs.
@misc{turboquant-v3,
author = {Kubenew},
title = {TurboQuant-v3: INT4 + AWQ + Protected Channels + Low-Rank SVD},
year = {2026},
url = {https://github.com/Kubenew/TurboQuant-v3}
}MIT License - see LICENSE for details.
- AWQ: MIT-HAN-Lab/llm-awq
- PyTorch Quantization: pytorch/ao
- HuggingFace Transformers: huggingface/transformers