Microsoft's BitNet b1.58, a 1.58-bit large language model (LLM) capable of running on commodity CPUs, reignited interest in ultra-low-precision inference.
Inspired by this work, I explored single-bit (and ternary) quantization on the SST-2 sentiment-analysis task.
This repo walks through eight progressively refined approaches, starting from scratch-built transformers and culminating in a quantized, fine-tuned BERT.
- Baseline: scratch-built transformer + 1-bit weights.
- Incremental tricks: add positional encoding, dropout, mixed precision, and QAT.
- Advanced tricks: median scaling, Straight-Through Estimator (STE), progressive mixed-precision quantization (MoQ).
- Pre-trained models: swap in BERT, then apply STE / ternary + activation quantization.
At each stage I addressed shortcomings of the previous approach while monitoring accuracy/F1, model size, and training stability.
Approach 1: Simple Quantized Transformer (Classifier)
- Goal: prove 1-bit feasibility.
- Key steps:
  - Scratch implementation of a miniature Transformer encoder.
  - Replaced all linear layers with a custom `BitLinear` (sign-only weights); a minimal sketch follows this list.
  - Adam + cross-entropy loss; no fancy schedulers.
- Results: Accuracy 76.38 % | F1 76.38 %.
- Takeaway: it works, but capacity is tiny and there are no positional cues, so the ceiling is limited.
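For reference, here is a minimal sketch of what a sign-only `BitLinear` layer could look like. The class name matches the repo, but the body is illustrative; in particular, the detach-based pass-through that keeps the latent full-precision weights trainable is my assumption, not necessarily what Approach 1 does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer whose weights are binarized to {-1, +1} in the forward pass."""
    def forward(self, x):
        w = self.weight
        # Sign-only quantization; the detach trick keeps a gradient path to the
        # full-precision latent weights (sign() alone has zero gradient a.e.).
        w_q = w + (torch.sign(w) - w).detach()
        return F.linear(x, w_q, self.bias)
```

Dropping this in wherever the scratch encoder uses `nn.Linear` is all the baseline needs: full-precision weights are stored and updated, while the forward pass only ever sees ±1 values.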
Approach 2: + Positional Encoding & Mixed Precision
- Added sinusoidal positional encoding, automatic mixed precision (AMP), a learning-rate scheduler, and gradient clipping (a sketch of the positional encoding follows this list).
- Results: 62.27 % / 61.26 %.
- Why worse? AMP introduced instability with the sign-only weights, and model capacity was still low.
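The sinusoidal positional encoding added here is the standard textbook formulation; the module below is a sketch of that formulation rather than the repo's exact code.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))    # shape: (1, max_len, d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]
```

AMP itself is the usual `torch.cuda.amp.autocast` + `GradScaler` pairing around the training step; the drop noted above suggests fp16 loss scaling and sign-only weights interact badly.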
Approach 3: + Dropout & Quantization-Aware Training (QAT)
- Injected dropout; trained with fake-quant ops (PyTorch QAT), as sketched after this list.
- Results: 63.88 % / 63.65 %.
- Takeaway: a small bump, but the model still under-fits.
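For intuition, here is a hedged sketch of the fake-quant idea, using `torch.fake_quantize_per_tensor_affine` directly for clarity; the repo may instead rely on PyTorch's higher-level QAT flow (observers + `prepare_qat`), so treat the helper below as illustrative.

```python
import torch

def fake_quant(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate integer rounding in the forward pass while keeping float tensors."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = int((-x.min() / scale).round().clamp(qmin, qmax))
    # PyTorch's fake-quant op already has a straight-through backward built in.
    return torch.fake_quantize_per_tensor_affine(x, scale.item(), zero_point, qmin, qmax)

# Passing activations (or weights) through fake_quant during training lets the
# network learn to tolerate the rounding error it will see at inference time.
x_q = fake_quant(torch.randn(4, 16), num_bits=8)
```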
Approach 4: Median Scaling + Straight-Through Estimator (STE)
- Normalised activations via median scaling; back-propagated through the 1-bit weights with the STE (sketch after this list).
- Results: 69.84 % / 69.65 %.
- Takeaway: a big jump; scaling + STE help gradients flow in 1-bit nets.
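A sketch of how these two pieces could fit together in one layer, under the assumption that "median scaling" means dividing activations by their median absolute value; the repo's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Binarize in the forward pass; pass gradients straight through in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Clip the straight-through gradient to |w| <= 1 to keep training stable.
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

class MedianScaledBitLinear(nn.Linear):
    def forward(self, x):
        # Normalise activations by their median magnitude before the 1-bit matmul.
        x = x / x.abs().median().clamp(min=1e-8)
        return F.linear(x, SignSTE.apply(self.weight), self.bias)
```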
Approach 5: Variant of (4)
- Tweaked the scaling factor & clipping range.
- Results: 70.76 % / 70.66 %.
- Takeaway: careful hyper-parameter tuning matters even in low-bit land.
Approach 6: Multi-Head Attention & Progressive MoQ
- Upgraded to a full multi-head-attention encoder; progressively lowered precision (8 → 4 → 1 bit) during fine-tuning (see the schedule sketch after this list).
- Results: 70.18 % / 70.15 %.
- Takeaway: capacity went up, but the gains from the extra heads were partly cancelled by quantization loss.
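A sketch of a progressive precision schedule, assuming "progressive MoQ" means lowering the weight bit-width over fine-tuning epochs; the uniform quantizer and the epoch-to-bits mapping below are illustrative, not the repo's exact values.

```python
import torch

def quantize_weights(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake-quantization with a detach-based straight-through backward."""
    if bits == 1:
        w_q = torch.sign(w) * w.abs().mean()          # 1-bit case falls back to sign * scale
    else:
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()                     # forward: quantized, backward: identity

# Hypothetical schedule: 8-bit for the first epoch, 4-bit for the second, 1-bit afterwards.
BIT_SCHEDULE = {0: 8, 1: 4}

def bits_for_epoch(epoch: int) -> int:
    return BIT_SCHEDULE.get(epoch, 1)
```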
Approach 7: Pre-trained BERT (+ STE)
- Started from `bert-base-uncased`; swapped every dense / attention projection for `BitLinear`; used the STE for back-prop (a sketch of the module swap follows this list).
- Results: 85.67 % / 85.65 % (best).
- Takeaway: pre-training supplies strong linguistic priors, and the 1-bit layers fine-tune well with the STE.
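A sketch of the module swap, assuming Hugging Face `transformers` and the `BitLinear` layer from the Approach 1 sketch; which projections to replace (and whether the embeddings and classifier head stay full precision, as assumed here) is a judgment call, so treat this as illustrative.

```python
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

def replace_linears_with_bitlinear(module: nn.Module) -> None:
    """Recursively swap nn.Linear submodules for BitLinear, copying their weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # BitLinear: the sign-only nn.Linear subclass from the Approach 1 sketch.
            bit = BitLinear(child.in_features, child.out_features, bias=child.bias is not None)
            bit.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                bit.bias.data.copy_(child.bias.data)
            setattr(module, name, bit)
        else:
            replace_linears_with_bitlinear(child)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
replace_linears_with_bitlinear(model.bert.encoder)  # embeddings and classifier stay full precision
```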
Approach 8: Ternary BERT (+ Activation Quant)
- Pushed further: ternary weights {-1, 0, +1} + per-layer activation quantization + sub-layer norm (sketch after this list).
- Results: 50.92 % / 34.36 %.
- Takeaway: too aggressive; the activation quantization hurt expressive power.
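A sketch of the weight/activation quantizers, under the assumption that the ternary threshold is a fraction of the mean |W| (as in ternary weight networks) and that activations use symmetric absmax 8-bit scaling; the repo's exact choices, including the sub-layer norm placement, may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ternarize(w: torch.Tensor, delta_ratio: float = 0.7) -> torch.Tensor:
    """Map weights to {-1, 0, +1} with a threshold proportional to mean |w| (STE backward)."""
    delta = delta_ratio * w.abs().mean()
    w_t = torch.zeros_like(w)
    w_t[w > delta] = 1.0
    w_t[w < -delta] = -1.0
    # Scale surviving weights so the ternary tensor keeps a similar magnitude.
    scale = w.abs()[w_t != 0].mean() if bool((w_t != 0).any()) else w.new_tensor(1.0)
    return w + (w_t * scale - w).detach()

def quantize_activations(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric absmax fake-quantization of activations (STE backward)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_q = (x / scale).round().clamp(-qmax, qmax) * scale
    return x + (x_q - x).detach()

class TernaryLinear(nn.Linear):
    def forward(self, x):
        return F.linear(quantize_activations(x), ternarize(self.weight), self.bias)
```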
Note: Each approach in this repository was trained for only 3 epochs due to time and resource constraints. Despite this, the results already reveal promising trends in low-bit training. I believe the community can build on these implementations by running longer training schedules, tuning hyperparameters, and applying these ideas to larger tasks, unlocking even better performance and deeper insights into ultra-low-precision NLP.
| # | Model / Technique | Acc. (%) | F1 (%) |
|---|---|---|---|
| 1 | Scratch Transformer + 1-bit weights | 76.38 | 76.38 |
| 2 | + PosEnc & AMP | 62.27 | 61.26 |
| 3 | + Dropout & QAT | 63.88 | 63.65 |
| 4 | + Median Scaling & STE | 69.84 | 69.65 |
| 5 | Variant of 4 | 70.76 | 70.66 |
| 6 | + MH-Attention & Progressive MoQ | 70.18 | 70.15 |
| 7 | BERT-base + STE-quantized | 85.67 | 85.65 |
| 8 | Ternary BERT (+ Activation Quant) | 50.92 | 34.36 |
Due to a known compatibility issue with Jupyter widgets metadata (`metadata.widgets.state` missing), GitHub is currently unable to render the notebook properly on the web interface.
Workaround:
To view and run the notebook without errors, please clone the repository locally and open the notebook in VS Code, JupyterLab, or another local IDE.