ui-insight/reasoning

Training a Mathematical Reasoning Model with GRPO

QLoRA fine-tuning of an LLM into a reasoning model using GRPO reinforcement learning.
Trained in under 24 hours on an H100 GPU on the GSM8K dataset.

This code uses the Group Relative Policy Optimization (GRPO) reinforcement learning (RL) method invented by the DeepSeek team, as described in:

DeepSeekMath:
https://arxiv.org/abs/2402.03300

DeepSeek-R1:
https://arxiv.org/abs/2501.12948
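
The core idea behind the "group relative" part of GRPO can be sketched in a few lines. For each prompt, several completions are sampled and scored, and each completion's advantage is its reward normalized by the mean and standard deviation of its own group, so no separate value model is needed. This is a minimal illustration and not the code used in this repository: the function name `group_relative_advantages`, the use of the population standard deviation, and the small `eps` term are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of completions sampled for the same prompt.

    `eps` is an assumed stabilizer that avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one GSM8K question: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions end up with a positive advantage and incorrect ones with a negative advantage. If every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal.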

Some resources:

Why GRPO is Important and How it Works:

https://www.oxen.ai/blog/why-grpo-is-important-and-how-it-works
https://www.oxen.ai/blog/training-a-rust-1-5b-coder-lm-with-reinforcement-learning-grpo

These videos are an amazing resource:

[two linked video thumbnails]

The Math Behind GRPO

https://medium.com/yugen-ai-technology-blog/understanding-the-math-behind-grpo-deepseek-r1-zero-9fb15e103a0a
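
As a compact summary of that math, the GRPO objective can be written as below. The notation follows the DeepSeekMath paper as I understand it: $q$ is a question, $\{o_i\}_{i=1}^{G}$ is a group of $G$ sampled outputs with rewards $r_1, \dots, r_G$, $\varepsilon$ is the clipping range, and $\beta$ weights the KL penalty against a reference policy $\pi_{\mathrm{ref}}$. Treat it as a reading aid and check it against the paper before relying on it.

```latex
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}

\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}
\left[
\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\Big(
\min\!\big[ \rho_{i,t}\, \hat{A}_{i,t},\
\operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\, \hat{A}_{i,t} \big]
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right]
\Big)
\right]
```

With outcome rewards, every token of output $o_i$ shares the same advantage $\hat{A}_{i,t}$, which is why only one score per completion is required.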

The QLoRA Fine Tuning

The fine-tuning was made much easier by using Unsloth (https://unsloth.ai).

Specifically, Unsloth leverages the GRPOTrainer class from the Transformer Reinforcement Learning (TRL) package.
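
GRPOTrainer scores each sampled completion with one or more user-supplied reward functions. The sketch below shows what a correctness reward for GSM8K might look like. It is not the reward used in this repository: the helper names, the "last number in the completion" heuristic, and the 1.0/0.0 reward values are all assumptions. It also assumes completions arrive as plain strings and that the dataset's `answer` column is forwarded to the function as a keyword argument. GSM8K reference solutions end with a line of the form `#### <number>`, which is what the parser relies on.

```python
import re
from typing import List, Optional


def extract_gsm8k_answer(solution: str) -> str:
    """Return the final answer that follows '####' in a GSM8K reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def extract_last_number(text: str) -> Optional[str]:
    """Heuristic (assumed): treat the last number in the completion as its answer."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    if not matches:
        return None
    return matches[-1].replace(",", "").rstrip(".")


def _to_float(value: Optional[str]) -> Optional[float]:
    """Convert a numeric string to float, returning None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def correctness_reward(prompts, completions, answer, **kwargs) -> List[float]:
    """Return 1.0 for each completion whose final number matches the reference, else 0.0."""
    rewards = []
    for completion, reference in zip(completions, answer):
        predicted = _to_float(extract_last_number(completion))
        expected = _to_float(extract_gsm8k_answer(reference))
        matched = predicted is not None and predicted == expected
        rewards.append(1.0 if matched else 0.0)
    return rewards


reference = "Natalia sold 48/2 = 24 clips in May.\n#### 72"
print(correctness_reward(
    prompts=["q", "q"],
    completions=["48 + 24 = 72, so the answer is 72.", "I think it is 70"],
    answer=[reference, reference],
))
```

A function like this would be passed to the trainer alongside any format rewards. A binary reward is the simplest choice; partial credit for well-formed reasoning is a common addition.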

Finally, the folks at Unsloth have a great blog post and Google Colab notebook where they do something very similar to what I've done here.
