QLoRA fine-tuning of an LLM into a reasoning model using GRPO reinforcement learning.
Trained in under 24 hours on an H100 GPU on the GSM8K dataset.

This code uses the Group Relative Policy Optimization (GRPO) reinforcement learning (RL) method invented by the DeepSeek team, as described in:
DeepSeekMath:
https://arxiv.org/abs/2402.03300
DeepSeek-R1:
https://arxiv.org/abs/2501.12948
Why GRPO is Important and How it Works:
https://www.oxen.ai/blog/why-grpo-is-important-and-how-it-works
https://www.oxen.ai/blog/training-a-rust-1-5b-coder-lm-with-reinforcement-learning-grpo
These videos are an amazing resource:
The fine-tuning was made much easier by using Unsloth (https://unsloth.ai).
Specifically, Unsloth leverages the GRPOTrainer class from the Transformer Reinforcement Learning (TRL) package.
Finally, the folks at Unsloth have a great blog post and Google Colab notebook where they do something very similar to what I've done here.
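
For anyone new to GRPO, the "group relative" part can be sketched in a few lines. This is not the code from this repo or from TRL, just a minimal illustration of the idea in the DeepSeekMath paper: several completions are sampled for the same prompt, each gets a reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the eps value, and the example rewards are my own assumptions, and I use the population standard deviation here; real implementations may differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    `rewards` holds one scalar reward per completion sampled for a single prompt.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see the note above
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one GSM8K question:
# 1.0 if the final answer was correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # roughly [1, -1, -1, 1]
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed to provide a baseline. If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.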


