Summary
Create a skill documenting GRPO (Group Relative Policy Optimization) training with verifiable rewards for math and code tasks.
Context
GRPO works well for tasks whose outcomes can be checked automatically, such as math and code. ReAlign supports GRPO, but the workflow is not documented.
Implementation Approach
Create .claude/skills/grpo-verifiable-training.md documenting:
What is GRPO?
- Online RL training on responses the model generates itself
- Samples a group of responses per prompt
- Scores each response with a verifier and reinforces those that beat their group's average
- Needs no separate reward model (uses verifiable signals)
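The group-relative step can be sketched as follows. This is a minimal illustration, not ReAlign's implementation: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    Hypothetical helper: population std and `eps` are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses, scored 1.0 when the verifier accepts them.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Verified responses end up with positive advantages and rejected ones with negative advantages. A group in which every response receives the same reward yields all-zero advantages, so that prompt contributes no learning signal.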
When to Use GRPO
| Use Case | Verifier | Example |
|---|---|---|
| Math problems | Symbolic solver | GSM8K, MATH |
| Code generation | Execution sandbox | HumanEval |
| Factual QA | Knowledge base lookup | TriviaQA |
| Format compliance | Regex/parser | JSON output |
GRPO Hyperparameters (DeepSeek-R1 Optimized)
GRPO_CONFIG = {
    "group_size": 16,       # Responses per prompt
    "beta": 0.001,          # KL coefficient (NOT 0.1!)
    "clip_epsilon": 10.0,   # Clip range (NOT 0.2; DeepSeek finding)
    "learning_rate": 3e-7,  # Conservative
    "max_length": 2048,
}
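To show where `beta` and `clip_epsilon` enter the objective, here is a minimal per-token sketch. It assumes the clipped-surrogate form with a KL penalty estimated as r - log r - 1, where r is the ratio of the reference policy to the current policy. The function name and argument names are hypothetical, and ReAlign's actual loss may differ.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_epsilon=10.0, beta=0.001):
    """Per-token objective (to be maximized) for one sampled token.

    Hypothetical helper; inputs are log-probabilities of the same token under
    the current, sampling-time, and reference policies.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
    surrogate = min(ratio * advantage, clipped * advantage)
    # KL estimate: r - log(r) - 1 with r = pi_ref / pi_new (always >= 0).
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return surrogate - beta * kl


# Same probability ratio (3x), different clip ranges.
wide = grpo_token_objective(math.log(3.0), 0.0, math.log(3.0), 1.0, clip_epsilon=10.0)
narrow = grpo_token_objective(math.log(3.0), 0.0, math.log(3.0), 1.0, clip_epsilon=0.2)
```

With `clip_epsilon=10.0` the lower bound `1 - clip_epsilon` is negative, so it never binds for a positive ratio; only ratios above 11 are clipped. With 0.2 the same update is capped at a ratio of 1.2.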
Data Format
{
  "prompt": "Solve: 2x + 5 = 15",
  "responses": [
    {"text": "x = 5", "score": 1.0, "correct": true},
    {"text": "x = 10", "score": 0.0, "correct": false},
    {"text": "x = 5 because 2(5) + 5 = 15", "score": 1.0, "correct": true}
  ]
}
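A record in this shape can be checked before training. The sketch below is one possible validator, not part of ReAlign: `load_grpo_record` is a hypothetical name, and the rules it enforces (at least two responses per group, scores within 0.0-1.0) are assumptions drawn from the example above.

```python
import json


def load_grpo_record(line):
    """Parse one JSONL line and check it against the format shown above."""
    record = json.loads(line)
    if not isinstance(record.get("prompt"), str):
        raise ValueError("'prompt' must be a string")
    responses = record.get("responses")
    # Assumption: a group needs at least two responses to be compared.
    if not isinstance(responses, list) or len(responses) < 2:
        raise ValueError("'responses' must be a list with at least two entries")
    for i, resp in enumerate(responses):
        if not isinstance(resp.get("text"), str):
            raise ValueError(f"responses[{i}].text must be a string")
        score = resp.get("score")
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"responses[{i}].score must be a number")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"responses[{i}].score must be within 0.0-1.0")
        if not isinstance(resp.get("correct"), bool):
            raise ValueError(f"responses[{i}].correct must be true or false")
    return record


line = json.dumps({
    "prompt": "Solve: 2x + 5 = 15",
    "responses": [
        {"text": "x = 5", "score": 1.0, "correct": True},
        {"text": "x = 10", "score": 0.0, "correct": False},
    ],
})
record = load_grpo_record(line)
```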
Verifier Types
- Math Verifier
  - Symbolic comparison (sympy)
  - Numerical tolerance (1e-6)
  - Step-by-step verification optional
- Code Verifier
  - Sandbox execution
  - Test case pass/fail
  - Timeout handling (30s default)
- Custom Verifier
  - User-defined function
  - Returns bool or score
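The three verifier types could look roughly like this. It is a standard-library sketch under stated assumptions, not ReAlign's verifier API: all function names are hypothetical, the math check covers only the numerical-tolerance path (no sympy), and the code check runs a plain child process, which is not a real sandbox.

```python
import subprocess
import sys


def verify_math(answer, expected, tol=1e-6):
    """Numeric comparison within an absolute tolerance; unparseable answers fail."""
    try:
        return abs(float(answer) - float(expected)) <= tol
    except (TypeError, ValueError):
        return False


def verify_code(source, test_code, timeout_s=30.0):
    """Run candidate code plus its tests in a child Python process.

    WARNING: a subprocess offers no isolation; use a real sandbox for
    untrusted model output.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source + "\n" + test_code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0


def to_reward(result):
    """Map a custom verifier's result (bool or score) to a float in [0, 1]."""
    if isinstance(result, bool):
        return 1.0 if result else 0.0
    return min(max(float(result), 0.0), 1.0)


passed = verify_code("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")
reward = to_reward(passed)
```

Normalizing every verifier result through a single function such as `to_reward` lets boolean and scored verifiers feed the same training loop.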
Commands
# Generate GRPO data (extract prompts from SFT)
realign data extract-prompts \
  --input sft.jsonl \
  --filter-verifiable \
  --output grpo_prompts.jsonl

# Train with GRPO
realign train --method grpo \
  --data grpo_data.jsonl \
  --verifier math \
  --group-size 16 \
  --beta 0.001
GRPO vs RLVR
| Aspect | GRPO | RLVR |
|---|---|---|
| Scoring | Relative (group) | Absolute (verify) |
| Reward | Score 0.0-1.0 | Correct/Incorrect |
| Use case | Ranking quality | Binary verification |
Acceptance Criteria
Related