Omar0042/math-problem-topic-classification

Math Problem Topic Classification

Fine-tuning LLMs and transformer baselines to classify natural-language math problems into 8 topics. 11th of 338 teams in KAChallenges Series 1: Classifying Math Problems (Kaggle · Kasut Academy · 2025).

Context. This was a 2-person team competition. I researched and built the top-scoring solution documented here — the Qwen2.5-Math-7B + LoRA pipeline that produced our final ranking (private leaderboard 0.9079). The competition concluded in May 2025; this repository documents that work.

The task

Given a math problem written in natural language (often containing LaTeX), predict its topic from eight categories:

| Label | Topic |
| --- | --- |
| 0 | Algebra |
| 1 | Geometry & Trig |
| 2 | Calculus & Analysis |
| 3 | Probability & Stats |
| 4 | Number Theory |
| 5 | Combinatorics |
| 6 | Linear Algebra |
| 7 | Abstract Algebra & Topology |

The training set has 10,189 labeled problems and is heavily imbalanced (largest class ~2,600, smallest ~86).

Demo

Predicted topic distribution across the 3,044 test problems — the model's actual output, which recovers the same imbalanced topic mix seen in training:

[Figure: Predicted topic distribution]

Training-set class distribution — motivates the class-weighted loss:

[Figure: Class distribution]
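As a rough illustration of class weighting, here is a minimal sketch in plain Python. It assumes the common "balanced" heuristic (`n_samples / (num_classes * class_count)`); this README does not state which weighting formula the notebook uses, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Weight each class by n_samples / (num_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, target, weights):
    """Per-example loss: weights[target] * -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return weights[target] * (log_sum_exp - logits[target])


# Toy labels (made up): class 0 is 10x more frequent than class 1.
toy_labels = [0] * 100 + [1] * 10
weights = balanced_class_weights(toy_labels, num_classes=2)
loss_rare = weighted_cross_entropy([0.0, 0.0], target=1, weights=weights)
```

Under this heuristic, the ratio between two class weights is the inverse ratio of their counts, so with the counts above (~2,600 vs ~86) a mistake on the rarest class would cost roughly 30x more than one on the largest. The per-example term matches what `torch.nn.CrossEntropyLoss(weight=...)` computes before reduction.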

Question length distribution — motivates the 512-token cap:

[Figure: Question length distribution]
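One way to sanity-check a length cap is to measure how much of the data it truncates. The sketch below is an illustration only: `truncation_report` and the toy lengths are hypothetical, and in practice the lengths would come from the tokenizer (for example `len(tokenizer(text)["input_ids"])`).

```python
def truncation_report(token_lengths, cap=512):
    """Return (share of examples that fit under cap, share of tokens kept)."""
    if not token_lengths:
        raise ValueError("token_lengths must be non-empty")
    fits = sum(1 for n in token_lengths if n <= cap)
    kept = sum(min(n, cap) for n in token_lengths)
    return fits / len(token_lengths), kept / sum(token_lengths)


# Made-up token counts for five problems; only the last one exceeds the cap.
toy_lengths = [40, 120, 300, 512, 1024]
fit_share, token_share = truncation_report(toy_lengths, cap=512)
```

A cap is a reasonable trade-off when the first number is close to 1.0: almost no problem loses text, while padding and memory stay bounded.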

What it does

  • Cleans LaTeX-heavy text into plain text before tokenizing (src/cleaner.py, using pylatexenc), while preserving URLs.
  • Fine-tunes Qwen2.5-Math-7B for 8-way sequence classification with LoRA (r=16, α=32, dropout=0.05 on the q/k/v/o projections) and 4-bit quantization, so a 7B model trains on a single GPU.
  • Counters class imbalance with class-weighted cross-entropy.
  • Ensembles multiple seeds (softmax averaging / majority vote) for more stable predictions.
  • Ships a lighter DistilBERT baseline (src/train_distilbert.py, src/predict.py) for comparison.

How it works

flowchart LR
    A["Raw math problem<br/>(natural language + LaTeX)"] --> B["Clean LaTeX to text<br/>(pylatexenc)"]
    B --> C["Tokenize<br/>(max length 512)"]
    C --> D["Qwen2.5-Math-7B<br/>+ LoRA, 4-bit"]
    D --> E["Class-weighted<br/>fine-tuning"]
    E --> F["Multi-seed ensemble<br/>(softmax avg / vote)"]
    F --> G["Predicted topic (0-7)"]
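
The final ensemble step of the flowchart can be sketched in plain Python. This is a minimal illustration of softmax averaging and majority voting, not the notebook's code; the function names, the tie-breaking rule, and the toy logits are assumptions.

```python
import math
from collections import Counter


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_seed_logits):
    """Average softmax probabilities across seeds, then take the argmax."""
    probs = [softmax(logits) for logits in per_seed_logits]
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)


def hard_vote(per_seed_logits):
    """Majority vote over each seed's argmax; ties go to the lowest class id."""
    votes = [max(range(len(l)), key=l.__getitem__) for l in per_seed_logits]
    counts = Counter(votes)
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)


# Made-up logits for one problem from three seeds over three classes.
seed_logits = [[2.0, 1.9, 0.0], [2.0, 1.9, 0.0], [0.0, 5.0, 0.0]]
soft_choice = soft_vote(seed_logits)
hard_choice = hard_vote(seed_logits)
```

In this toy case the two rules disagree: two seeds narrowly prefer class 0, but the third is very confident in class 1, so averaging picks class 1 while the majority vote picks class 0.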

Results

Final standing: 11th of 338 teams — private LB 0.9079 (public 0.8965).

| Approach | Public LB | Private LB |
| --- | --- | --- |
| RoBERTa-base | 0.8162 | 0.8268 |
| DeBERTa-v3-base | 0.8172 | 0.8343 |
| DistilBERT (cleaned text) | 0.8201 | 0.8298 |
| Qwen2.5-Math-7B + LoRA (best single model) | 0.8907 | 0.9094 |
| Qwen2.5-Math-7B ensemble (final, selected) | 0.8965 | 0.9079 |

For reference, the 1st-place solution scored 0.9253 using full 14B models (Qwen2.5-14B + Qwen3-14B) on multi-GPU hardware. This solution follows the same core recipe (a Qwen model fine-tuned for classification with LoRA, class-weighted loss, and ensembling) but fits on a single Colab GPU by 4-bit-quantizing a 7B model.

Tech stack

Python · PyTorch · Hugging Face Transformers / Datasets / PEFT (LoRA) · bitsandbytes (4-bit) · scikit-learn · pandas / NumPy / SciPy · pylatexenc · Google Colab (A100)

Setup & run

pip install -r requirements.txt

Download the competition data into ./data (see data/README.md), then from the repository root:

python src/inspect_data.py 0     # preview a cleaned problem + its topic label
python src/train_distilbert.py   # train the DistilBERT baseline (writes ./results/...)
python src/predict.py            # generate submission.csv from a trained checkpoint

The main solution — Qwen2.5-Math-7B + LoRA — is in notebooks/qwen_math_classification.ipynb. It was built for Google Colab + Drive on an A100 and needs a GPU that supports 4-bit quantization.
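
For orientation, the LoRA settings listed under "What it does" map onto PEFT and Transformers configuration objects roughly as follows. This is a configuration sketch, not an excerpt from the notebook: the NF4 quantization type, the bfloat16 compute dtype, and the `q_proj`/`k_proj`/`v_proj`/`o_proj` module names are assumptions, since the README only states "4-bit" and "q/k/v/o projections".

```python
import torch
from peft import LoraConfig, TaskType
from transformers import BitsAndBytesConfig

# 4-bit quantization settings. NF4 and bfloat16 compute are assumptions;
# the README only states that 4-bit quantization was used.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA hyperparameters as stated in this README.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification head
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Module names assumed to match the attention projections in Qwen2 layers.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```

These objects would typically be passed to `AutoModelForSequenceClassification.from_pretrained(..., num_labels=8, quantization_config=bnb_config)` and then `peft.get_peft_model(model, lora_config)`; see the notebook for the actual loading code.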

What's not included

All of my code is here, but two things are intentionally excluded:

  • The dataset (data/*.csv) — it belongs to the Kaggle competition host and can't be redistributed. See data/README.md to download it.
  • Trained weights — the LoRA adapters + optimizer states (~290 MB) exceed GitHub's file limits and aren't the interesting part. The exact LoRA configuration is documented above for reproducibility. (The trained adapter could optionally be published to the Hugging Face Hub and linked here.)

What I'd improve next

  • Stratified K-fold cross-validation instead of a single train/test split — the winning solution used this for more reliable validation across the rare classes.
  • Systematic learning-rate search (e.g. with a small proxy model), which the top solution called "crucial."
  • Cross-architecture ensembling (e.g. weighted Qwen2.5 + Qwen3) rather than multi-seed ensembling of one model — and better final-submission selection (a single model actually scored higher on the private LB than the ensemble I selected).
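
As a sketch of the first item, stratified fold assignment can be written in a few lines of plain Python. This is an illustration of the idea only (`stratified_folds` and the toy labels are hypothetical); scikit-learn's `StratifiedKFold` is the usual library choice.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each example index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k
    return fold_of


# Toy labels (made up): 50 examples of class 0 and 10 of class 1.
toy_labels = [0] * 50 + [1] * 10
folds = stratified_folds(toy_labels, k=5)
```

With the smallest class at roughly 86 training examples, five folds would leave about 17 of them in each validation fold, which is still a small sample but far steadier than whatever a single split happens to draw.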

License

MIT © 2025 Omar Althobaiti
