Fine-tuning LLMs and transformer baselines to classify natural-language math problems into 8 topics. 11th of 338 teams in KAChallenges Series 1: Classifying Math Problems (Kaggle · Kasut Academy · 2025).
Context. This was a 2-person team competition. I researched and built the top-scoring solution documented here — the Qwen2.5-Math-7B + LoRA pipeline that produced our final ranking (private leaderboard 0.9079). The competition concluded in May 2025; this repository documents that work.
Given a math problem written in natural language (often containing LaTeX), predict its topic from eight categories:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Algebra | Geometry & Trig | Calculus & Analysis | Probability & Stats | Number Theory | Combinatorics | Linear Algebra | Abstract Algebra & Topology |
The training set has 10,189 labeled problems and is heavily imbalanced (largest class ~2,600, smallest ~86).
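One common way to turn this imbalance into loss weights is "balanced" inverse-frequency weighting, where each class gets `n_samples / (num_classes * class_count)`. The sketch below is illustrative only: the exact weighting formula used in the notebook is not stated here, the function name `balanced_class_weights` is hypothetical, and the labels are toy data rather than the competition set.

```python
from collections import Counter


def balanced_class_weights(labels, num_classes=8):
    """Inverse-frequency weights: n_samples / (num_classes * count_c).

    Assumption: this mirrors the common "balanced" heuristic; the
    notebook's actual weighting scheme may differ.
    """
    counts = Counter(labels)
    n = len(labels)
    # Classes absent from `labels` get weight 0.0 instead of dividing by zero.
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


# Toy labels (not the competition data): class 0 is common, class 7 is rare.
toy_labels = [0] * 6 + [1] * 3 + [7] * 1
weights = balanced_class_weights(toy_labels)

for topic_id, w in enumerate(weights):
    print(f"class {topic_id}: weight {w:.3f}")
```

Weights like these can be passed to PyTorch as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))`. With the approximate class sizes quoted above (~2,600 vs ~86), the rarest class would be weighted roughly 30 times more heavily than the largest one under this scheme.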
Predicted topic distribution across the 3,044 test problems — the model's actual output, which recovers the same imbalanced topic mix seen in training:
Training-set class distribution — motivates the class-weighted loss:
Question length distribution — motivates the 512-token cap:
- Cleans LaTeX-heavy text into plain text before tokenizing (src/cleaner.py, using pylatexenc), while preserving URLs.
- Fine-tunes Qwen2.5-Math-7B for 8-way sequence classification with LoRA (r=16, α=32, dropout=0.05 on the q/k/v/o projections) and 4-bit quantization, so a 7B model trains on a single GPU.
- Counters class imbalance with class-weighted cross-entropy.
- Ensembles multiple seeds (softmax averaging / majority vote) for more stable predictions.
- Ships a lighter DistilBERT baseline (src/train_distilbert.py, src/predict.py) for comparison.
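The LoRA and quantization settings listed above map fairly directly onto PEFT and bitsandbytes configuration objects. The following is a minimal sketch, not the notebook's actual code: the hyperparameters (r, α, dropout, target projections) come from this README, while the Hub id `Qwen/Qwen2.5-Math-7B`, the NF4 quantization type, the bfloat16 compute dtype, and the helper name `build_model` are assumptions.

```python
NUM_LABELS = 8

# Hyperparameters as documented in this README.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def build_model(model_name="Qwen/Qwen2.5-Math-7B"):
    """Load a 4-bit base model and wrap it with LoRA adapters (hypothetical helper)."""
    # Heavy imports are kept local so the config above can be inspected
    # on a machine without a GPU stack installed.
    import torch
    from peft import (
        LoraConfig,
        TaskType,
        get_peft_model,
        prepare_model_for_kbit_training,
    )
    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

    # Assumption: NF4 + bfloat16 compute; the README only says "4-bit".
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=NUM_LABELS,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Decoder-only models may need model.config.pad_token_id set to the
    # tokenizer's pad token before batched classification works.
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_KWARGS)
    return get_peft_model(model, lora_config)
```

Calling `build_model()` requires a CUDA GPU with bitsandbytes support and downloads the base weights; only the small adapter matrices and the classification head are trained, which is what lets a 7B model fit on one device.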
flowchart LR
A["Raw math problem<br/>(natural language + LaTeX)"] --> B["Clean LaTeX to text<br/>(pylatexenc)"]
B --> C["Tokenize<br/>(max length 512)"]
C --> D["Qwen2.5-Math-7B<br/>+ LoRA, 4-bit"]
D --> E["Class-weighted<br/>fine-tuning"]
E --> F["Multi-seed ensemble<br/>(softmax avg / vote)"]
F --> G["Predicted topic (0-7)"]
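The two ensembling options in the last pipeline step can disagree, which is why the choice matters. Here is a small, dependency-free sketch of both; the function names and the three-class toy logits are made up for illustration and do not come from the notebook.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_seed_logits):
    """Average softmax probabilities across seeds, then take the argmax."""
    probs = [softmax(logits) for logits in per_seed_logits]
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)


def majority_vote(per_seed_logits):
    """Each seed casts one vote for its argmax class; ties go to the first seen."""
    preds = [max(range(len(l)), key=l.__getitem__) for l in per_seed_logits]
    return Counter(preds).most_common(1)[0][0]


# Toy example: two seeds weakly prefer class 0, one seed strongly prefers class 1.
toy_logits = [
    [2.0, 1.9, 0.0],
    [2.0, 1.9, 0.0],
    [0.0, 5.0, 0.0],
]
print("soft vote:    ", soft_vote(toy_logits))
print("majority vote:", majority_vote(toy_logits))
```

Softmax averaging lets one confident seed outweigh two hesitant ones, while majority voting ignores confidence entirely; on the toy logits above the two methods pick different classes.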
Final standing: 11th of 338 teams — private LB 0.9079 (public 0.8965).
| Approach | Public LB | Private LB |
|---|---|---|
| RoBERTa-base | 0.8162 | 0.8268 |
| DeBERTa-v3-base | 0.8172 | 0.8343 |
| DistilBERT (cleaned text) | 0.8201 | 0.8298 |
| Qwen2.5-Math-7B + LoRA (best single model) | 0.8907 | 0.9094 |
| Qwen2.5-Math-7B ensemble (final, selected) | 0.8965 | 0.9079 |
For reference, the 1st-place solution scored 0.9253 using full 14B models (Qwen2.5-14B + Qwen3-14B) on multi-GPU hardware. This solution follows the same core recipe (a Qwen model fine-tuned for classification with LoRA, class-weighted loss, and ensembling) but fits on a single Colab GPU by 4-bit-quantizing a 7B model.
Python · PyTorch · Hugging Face Transformers / Datasets / PEFT (LoRA) · bitsandbytes (4-bit) · scikit-learn · pandas / NumPy / SciPy · pylatexenc · Google Colab (A100)
    pip install -r requirements.txt

Download the competition data into ./data (see data/README.md), then from the repository root:

    python src/inspect_data.py 0      # preview a cleaned problem + its topic label
    python src/train_distilbert.py    # train the DistilBERT baseline (writes ./results/...)
    python src/predict.py             # generate submission.csv from a trained checkpoint

The main solution, Qwen2.5-Math-7B + LoRA, is in notebooks/qwen_math_classification.ipynb. It was built for Google Colab + Drive on an A100 and needs a GPU that supports 4-bit quantization.
All of my code is here, but two things are intentionally excluded:
- The dataset (data/*.csv): it belongs to the Kaggle competition host and can't be redistributed. See data/README.md to download it.
- Trained weights: the LoRA adapters + optimizer states (~290 MB) exceed GitHub's file limits and aren't the interesting part. The exact LoRA configuration is documented above for reproducibility. (The trained adapter could optionally be published to the Hugging Face Hub and linked here.)
- Stratified K-fold cross-validation instead of a single train/test split — the winning solution used this for more reliable validation across the rare classes.
- Systematic learning-rate search (e.g. with a small proxy model), which the top solution called "crucial."
- Cross-architecture ensembling (e.g. weighted Qwen2.5 + Qwen3) rather than multi-seed ensembling of one model — and better final-submission selection (a single model actually scored higher on the private LB than the ensemble I selected).
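To make the first item concrete, here is a rough sketch of how stratified fold assignment keeps rare classes represented in every validation fold. It is a hand-rolled, standard-library illustration with a hypothetical function name and toy labels; in practice scikit-learn's `StratifiedKFold` would be the usual choice, and nothing here reflects the winning team's actual code.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Return a fold id per sample so each class is spread evenly across folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin into the folds.
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % n_splits
    return fold_of


# Toy labels (not the competition data): 20 common-class and 5 rare-class samples.
toy_labels = [0] * 20 + [7] * 5
folds = stratified_folds(toy_labels, n_splits=5)

for k in range(5):
    members = [toy_labels[i] for i, f in enumerate(folds) if f == k]
    print(f"fold {k}: {members.count(0)} of class 0, {members.count(7)} of class 7")
```

With a plain random split, a class with only a few dozen examples can end up almost absent from the validation set, which makes per-class scores noisy. One limitation of this simple version is that leftover samples always land in the lowest-numbered folds, so fold sizes can differ slightly.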
MIT © 2025 Omar Althobaiti


