Math Problem Topic Classification

Fine-tuning LLMs and transformer baselines to classify natural-language math problems into 8 topics. Placed 11th of 338 teams in KAChallenges Series 1: Classifying Math Problems (Kaggle · Kasut Academy · 2025).

Context. This was a 2-person team competition. I researched and built the top-scoring solution documented here — the Qwen2.5-Math-7B + LoRA pipeline that produced our final ranking (private leaderboard 0.9079). The competition concluded in May 2025; this repository documents that work.

The task

Given a math problem written in natural language (often containing LaTeX), predict its topic from eight categories:

0	1	2	3	4	5	6	7
Algebra	Geometry & Trig	Calculus & Analysis	Probability & Stats	Number Theory	Combinatorics	Linear Algebra	Abstract Algebra & Topology

The training set has 10,189 labeled problems and is heavily imbalanced (largest class ~2,600, smallest ~86).

Demo

Predicted topic distribution across the 3,044 test problems — the model's actual output, which recovers the same imbalanced topic mix seen in training:

[Figure: Predicted topic distribution]

Training-set class distribution — motivates the class-weighted loss:

[Figure: Class distribution]
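A minimal sketch of how weights for a class-weighted loss could be derived from label counts. It assumes the common inverse-frequency ("balanced") heuristic, weight = N / (K * n_c); this README does not state which weighting formula the notebook uses, and the function names below are illustrative, not taken from the repository.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights: total / (num_classes * count_of_class)."""
    counts = Counter(labels)
    total = len(labels)
    # Classes that never occur get weight 0.0 to avoid dividing by zero.
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, target, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weights[target] * (logits[target] - log_z)


# Toy label list (not the competition data): 6 examples of class 0, 2 of class 1.
toy_labels = [0] * 6 + [1] * 2
toy_weights = balanced_class_weights(toy_labels, num_classes=2)
print(toy_weights)
```

Under this heuristic, and with the approximate class sizes quoted earlier (10,189 problems, largest class ~2,600, smallest ~86), the largest class would get a weight of roughly 0.49 and the smallest roughly 14.8. A weight vector of this shape is what `torch.nn.CrossEntropyLoss(weight=...)` accepts.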

Question length distribution — motivates the 512-token cap:

[Figure: Question length distribution]
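One simple way to sanity-check a length cap is to measure what fraction of tokenized problems fit under it. The sketch below is an assumption about how such a check might look, not code from this repository; in practice `token_lengths` would come from running the model's tokenizer over the training questions, and the numbers here are invented purely for illustration.

```python
def coverage_at_cap(token_lengths, cap=512):
    """Return the fraction of sequences whose token count is <= cap."""
    if not token_lengths:
        raise ValueError("token_lengths must be non-empty")
    return sum(1 for n in token_lengths if n <= cap) / len(token_lengths)


# Toy lengths, invented for illustration only (not the competition data).
toy_lengths = [40, 85, 120, 300, 512, 700]
print(coverage_at_cap(toy_lengths))
```

Sequences longer than the cap are truncated by the tokenizer, so a cap that covers nearly all problems loses little information while bounding memory use.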

What it does

Cleans LaTeX-heavy text into plain text before tokenizing (src/cleaner.py, using pylatexenc), while preserving URLs.
Fine-tunes Qwen2.5-Math-7B for 8-way sequence classification with LoRA (r=16, α=32, dropout=0.05 on the q/k/v/o projections) and 4-bit quantization, so a 7B model trains on a single GPU.
Counters class imbalance with class-weighted cross-entropy.
Ensembles multiple seeds (softmax averaging / majority vote) for more stable predictions.
Ships a lighter DistilBERT baseline (src/train_distilbert.py, src/predict.py) for comparison.
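As a sketch of the first step in the list above, the snippet below shows one way to convert LaTeX to plain text with pylatexenc while leaving URLs untouched. The contents of src/cleaner.py are not reproduced in this README, so the placeholder-based URL protection, the regex, and the function name `clean_problem` are all assumptions made for illustration.

```python
import re

# Deliberately simple URL pattern; an assumption, not the repository's regex.
URL_RE = re.compile(r"https?://\S+")


def clean_problem(text, latex_to_text=None):
    """Convert LaTeX markup to plain text while keeping URLs verbatim."""
    if latex_to_text is None:
        # Imported lazily so the helper can be exercised with a stand-in converter.
        from pylatexenc.latex2text import LatexNodes2Text
        latex_to_text = LatexNodes2Text().latex_to_text

    urls = []

    def stash(match):
        # Swap each URL for an alphanumeric token the converter will not alter.
        urls.append(match.group(0))
        return f"URLTOKEN{len(urls) - 1}END"

    plain = latex_to_text(URL_RE.sub(stash, text))

    # Put the original URLs back in place of their tokens.
    for i, url in enumerate(urls):
        plain = plain.replace(f"URLTOKEN{i}END", url)
    return plain
```

Protecting URLs first matters because characters such as `_` and `%` are meaningful in LaTeX and could otherwise be altered or dropped by the conversion.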

How it works

flowchart LR
    A["Raw math problem<br/>(natural language + LaTeX)"] --> B["Clean LaTeX to text<br/>(pylatexenc)"]
    B --> C["Tokenize<br/>(max length 512)"]
    C --> D["Qwen2.5-Math-7B<br/>+ LoRA, 4-bit"]
    D --> E["Class-weighted<br/>fine-tuning"]
    E --> F["Multi-seed ensemble<br/>(softmax avg / vote)"]
    F --> G["Predicted topic (0-7)"]
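The last two stages of the flowchart can be illustrated with a small pure-Python sketch of the two ensembling rules it names, softmax averaging and majority vote. The helper names and the tie-breaking rule (lowest class id wins) are assumptions for illustration; the notebook's actual implementation is not shown in this README.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_seed_logits):
    """Average the per-seed probabilities, then take the argmax."""
    probs = [softmax(logits) for logits in per_seed_logits]
    num_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=avg.__getitem__)


def majority_vote(per_seed_logits):
    """Each seed votes for its argmax; ties go to the lowest class id."""
    votes = [max(range(len(l)), key=l.__getitem__) for l in per_seed_logits]
    counts = Counter(votes)
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)


# Toy logits for one problem from three seeds (illustrative values only).
seeds = [[2.0, 1.9, 0.0], [2.0, 1.9, 0.0], [0.0, 5.0, 0.0]]
print(soft_vote(seeds), majority_vote(seeds))  # prints "1 0"
```

The toy example shows how the two rules can disagree: two seeds narrowly prefer class 0, but the third is very confident in class 1, so averaging probabilities picks class 1 while the vote picks class 0.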

Results

Final standing: 11th of 338 teams — private LB 0.9079 (public 0.8965).

Approach	Public LB	Private LB
RoBERTa-base	0.8162	0.8268
DeBERTa-v3-base	0.8172	0.8343
DistilBERT (cleaned text)	0.8201	0.8298
Qwen2.5-Math-7B + LoRA (best single model)	0.8907	0.9094
Qwen2.5-Math-7B ensemble (final, selected)	0.8965	0.9079

For reference, the 1st-place solution scored 0.9253 using full 14B models (Qwen2.5-14B + Qwen3-14B) on multi-GPU hardware. This solution follows the same core recipe (a Qwen model fine-tuned for classification with LoRA, class-weighted loss, and ensembling) but fits on a single Colab GPU by 4-bit-quantizing a 7B model.

Tech stack

Python · PyTorch · Hugging Face Transformers / Datasets / PEFT (LoRA) · bitsandbytes (4-bit) · scikit-learn · pandas / NumPy / SciPy · pylatexenc · Google Colab (A100)

Setup & run

pip install -r requirements.txt

Download the competition data into ./data (see data/README.md), then from the repository root:

python src/inspect_data.py 0     # preview a cleaned problem + its topic label
python src/train_distilbert.py   # train the DistilBERT baseline (writes ./results/...)
python src/predict.py            # generate submission.csv from a trained checkpoint

The main solution — Qwen2.5-Math-7B + LoRA — is in notebooks/qwen_math_classification.ipynb. It was built for Google Colab + Drive on an A100 and needs a GPU that supports 4-bit quantization.
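For orientation, here is a hedged sketch of how the LoRA settings listed in this README (r=16, α=32, dropout=0.05, q/k/v/o projections, 4-bit) might be expressed with PEFT and Transformers. It is not copied from the notebook: the Hub model id, the `q_proj`/`k_proj`/`v_proj`/`o_proj` module names, the NF4 quantization type, and the bfloat16 compute dtype are assumptions, and `build_model` is a hypothetical helper name.

```python
# Hyperparameters stated in this README.
LORA_HPARAMS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # Assumed Hugging Face module names for the q/k/v/o attention projections.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}
NUM_LABELS = 8
MODEL_ID = "Qwen/Qwen2.5-Math-7B"  # assumed Hub id for the base model


def build_model():
    """Load the base model in 4-bit and attach LoRA adapters (needs a CUDA GPU)."""
    # Heavy imports are kept local so the constants above can be read without a GPU stack.
    import torch
    from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # assumption: not stated in the README
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not stated in the README
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=NUM_LABELS, quantization_config=quant_config
    )
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_HPARAMS)
    return get_peft_model(model, lora_config)
```

With α=32 and r=16, the LoRA update is scaled by α/r = 2. Only the adapter weights and the classification head are trained, which is what lets a quantized 7B model fit on one GPU.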

What's not included

All of my code is here, but two things are intentionally excluded:

The dataset (data/*.csv) — it belongs to the Kaggle competition host and can't be redistributed. See data/README.md to download it.
Trained weights — the LoRA adapters + optimizer states (~290 MB) exceed GitHub's file limits and aren't the interesting part. The exact LoRA configuration is documented above for reproducibility. (The trained adapter could optionally be published to the Hugging Face Hub and linked here.)

What I'd improve next

Stratified K-fold cross-validation instead of a single train/validation split; the winning solution used this for more reliable validation across the rare classes.
Systematic learning-rate search (e.g. with a small proxy model), which the top solution called "crucial."
Cross-architecture ensembling (e.g. weighted Qwen2.5 + Qwen3) rather than multi-seed ensembling of one model — and better final-submission selection (a single model actually scored higher on the private LB than the ensemble I selected).
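To make the first item above concrete, here is a small pure-Python sketch of stratified K-fold splitting, where each class is dealt round-robin across folds so that rare classes appear in every validation fold. It is an illustration of the idea under stated assumptions, not the winning team's code; scikit-learn's `StratifiedKFold` is a ready-made alternative.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_indices, val_indices) with per-class balance across folds."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into folds.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[i % n_splits].append(idx)

    # Each fold serves once as validation; the rest form the training set.
    for k in range(n_splits):
        val = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, val


# Toy labels (illustrative only): 10 examples of class 0 and 5 of class 1.
toy_labels = [0] * 10 + [1] * 5
for train_idx, val_idx in stratified_kfold(toy_labels, n_splits=5):
    print(len(train_idx), len(val_idx))
```

With a smallest class of about 86 examples, a 5-fold split of this kind would leave roughly 17 examples of that class in each validation fold, rather than depending on how a single random split happens to fall.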

