Training a Mathematical Reasoning Model with GRPO

This repository uses QLoRA to fine-tune an LLM into a reasoning model with GRPO reinforcement learning.
Training on the GSM8K dataset took less than 24 hours on an H100 GPU.

This code uses the Group Relative Policy Optimization (GRPO) reinforcement learning (RL) method introduced by the DeepSeek team, as described in:

DeepSeekMath:
https://arxiv.org/abs/2402.03300

DeepSeek-R1:
https://arxiv.org/abs/2501.12948
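
As a rough illustration of the "group relative" idea in GRPO: several completions are sampled for the same prompt, each is scored by a reward function, and each reward is normalized against the mean and standard deviation of its own group. The sketch below is a minimal, standalone version of that normalization and is not taken from this repository. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of completions for a single prompt.

    Hypothetical helper: each advantage is (reward - group mean) / group std.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for one prompt: two correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

With two correct and two incorrect completions, the correct ones receive an advantage of approximately +1 and the incorrect ones approximately -1. When every completion in a group gets the same reward, all advantages are zero, so that group contributes no learning signal.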

Some resources:

Why GRPO is Important and How it Works:

https://www.oxen.ai/blog/why-grpo-is-important-and-how-it-works
https://www.oxen.ai/blog/training-a-rust-1-5b-coder-lm-with-reinforcement-learning-grpo

These videos are an amazing resource:

The Math Behind GRPO

https://medium.com/yugen-ai-technology-blog/understanding-the-math-behind-grpo-deepseek-r1-zero-9fb15e103a0a

The QLoRA Fine Tuning

The fine-tuning was made much easier by using Unsloth (https://unsloth.ai).

Specifically, Unsloth leverages the GRPOTrainer class from the Transformer Reinforcement Learning (TRL) package.

Finally, the folks at Unsloth have a great blog post and Google Colab notebook where they do something very similar to what I've done here.
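
GRPO training needs a reward function that scores each sampled completion. The sketch below shows what a correctness reward for GSM8K might look like; it is not the reward used in grpo.py. It relies on the fact that GSM8K reference solutions end with `#### <final answer>`. The `<answer>...</answer>` output format, the function names, and the `answer` column name are assumptions for illustration. The `(completions, **kwargs)` shape follows the custom reward function convention documented for TRL's GRPOTrainer, and it assumes plain-string completions rather than chat-style message lists.

```python
import re
from typing import List, Optional


def extract_gsm8k_answer(reference: str) -> str:
    """Return the final answer that follows '####' in a GSM8K solution."""
    return reference.split("####")[-1].strip().replace(",", "")


def extract_tagged_answer(completion: str) -> Optional[str]:
    """Return the text inside <answer>...</answer>, or None if absent.

    The tag format is a hypothetical convention, not taken from this repo.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return match.group(1).strip().replace(",", "") if match else None


def correctness_reward(completions: List[str], answer: List[str], **kwargs) -> List[float]:
    """Score 1.0 when the tagged answer matches the reference, else 0.0."""
    rewards = []
    for completion, reference in zip(completions, answer):
        predicted = extract_tagged_answer(completion)
        target = extract_gsm8k_answer(reference)
        rewards.append(1.0 if predicted == target else 0.0)
    return rewards
```

A function like this could be passed in the `reward_funcs` list of a GRPOTrainer, usually alongside softer rewards (for example, for following the output format) so that early completions still receive some signal.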
