feat(skills): Create grpo-verifiable-training skill for math/code verification #309

@akaszubski

Summary

Create a skill documenting GRPO (Group Relative Policy Optimization) training with verifiable rewards for math and code tasks.

Context

GRPO works well for tasks whose outcomes can be checked automatically, such as math answers and code that must pass tests. ReAlign supports GRPO, but the workflow isn't documented.

Implementation Approach

Create .claude/skills/grpo-verifiable-training.md documenting:

What is GRPO?

  • Online RL training on responses the model generates itself
  • Samples a group of responses per prompt
  • Scores each response with a verifier and reinforces those that score above the group average
  • No separate reward model needed (uses verifiable signals)
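
The group-based scoring described above can be sketched as follows. This is a minimal illustration, not ReAlign's implementation: the function name `group_relative_advantages` and the `eps` stabiliser are assumptions, and the choice of population standard deviation is one convention among several.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise verifier rewards within one prompt's group of responses.

    Hypothetical helper: each response's advantage is its reward minus the
    group mean, divided by the group standard deviation. `eps` avoids
    division by zero when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two correct responses and one incorrect response for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0])
```

Correct responses end up with a positive advantage and the incorrect one with a negative advantage. A group in which every response gets the same reward yields all-zero advantages, so it contributes no learning signal.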

When to Use GRPO

| Use Case | Verifier | Example |
|---|---|---|
| Math problems | Symbolic solver | GSM8K, MATH |
| Code generation | Execution sandbox | HumanEval |
| Factual QA | Knowledge base lookup | TriviaQA |
| Format compliance | Regex/parser | JSON output |

GRPO Hyperparameters (DeepSeek-R1 Optimized)

GRPO_CONFIG = {
    "group_size": 16,        # Responses per prompt
    "beta": 0.001,           # KL coefficient (NOT 0.1!)
    "clip_epsilon": 10.0,    # NOT 0.2 (DeepSeek finding)
    "learning_rate": 3e-7,   # Conservative
    "max_length": 2048,
}
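
To show where `beta` and `clip_epsilon` enter the update, here is a hedged sketch of a per-token GRPO-style objective: a clipped policy-ratio surrogate minus a KL penalty toward a reference model. The function name and argument names are hypothetical, and ReAlign's actual loss may differ in details such as the KL estimator.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_epsilon=10.0, beta=0.001):
    """Per-token objective to maximise (illustrative sketch only).

    logp_new: log-prob of the token under the policy being trained
    logp_old: log-prob under the policy that generated the sample
    logp_ref: log-prob under the frozen reference model
    """
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps]; with eps = 10 the lower bound is
    # negative, so only the upper bound can ever bind.
    clipped = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimate: r - log(r) - 1 with r = pi_ref / pi_new.
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return surrogate - beta * kl


# The same sample under a wide clip (10.0) and a narrow clip (0.2).
wide = grpo_token_objective(-1.0, -4.0, -1.0, advantage=1.0)
narrow = grpo_token_objective(-1.0, -4.0, -1.0, advantage=1.0, clip_epsilon=0.2)
```

In this example the policy ratio is about 20, so the wide clip caps the surrogate at 11 while the narrow clip caps it at 1.2. A small `beta` keeps the KL term from dominating the verifier signal.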

Data Format

{
  "prompt": "Solve: 2x + 5 = 15",
  "responses": [
    {"text": "x = 5", "score": 1.0, "correct": true},
    {"text": "x = 10", "score": 0.0, "correct": false},
    {"text": "x = 5 because 2(5) + 5 = 15", "score": 1.0, "correct": true}
  ]
}
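
A record in the format above could be validated before training with something like the sketch below. `load_group` and `has_learning_signal` are hypothetical helpers, and the checks (at least two responses, scores in the 0.0-1.0 range) are assumptions about what the trainer expects.

```python
import json


def load_group(line):
    """Parse one JSONL line and check it matches the record shape shown above."""
    record = json.loads(line)
    if not isinstance(record, dict) or not isinstance(record.get("prompt"), str):
        raise ValueError("record needs a 'prompt' string")
    responses = record.get("responses")
    if not isinstance(responses, list) or len(responses) < 2:
        raise ValueError("record needs a list of at least two responses")
    for i, resp in enumerate(responses):
        if not isinstance(resp, dict) or not isinstance(resp.get("text"), str):
            raise ValueError(f"response {i}: missing 'text' string")
        score = resp.get("score")
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"response {i}: 'score' must be a number")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"response {i}: 'score' must be in [0.0, 1.0]")
        if not isinstance(resp.get("correct"), bool):
            raise ValueError(f"response {i}: 'correct' must be a boolean")
    return record


def has_learning_signal(record):
    """Return True if the group's scores differ, so relative scoring is informative."""
    return len({resp["score"] for resp in record["responses"]}) > 1
```

Groups where every response has the same score can be filtered out with `has_learning_signal`, since relative scoring cannot separate them.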

Verifier Types

  1. Math Verifier

    • Symbolic comparison (sympy)
    • Numerical tolerance (1e-6)
    • Step-by-step verification optional
  2. Code Verifier

    • Sandbox execution
    • Test case pass/fail
    • Timeout handling (30s default)
  3. Custom Verifier

    • User-defined function
    • Returns bool or score
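
The verifier types above might look roughly like the following. All function names are hypothetical. The math check covers only the numerical-tolerance case (symbolic comparison with sympy is not shown), and the code check runs a plain subprocess with a timeout, which is not a real security sandbox.

```python
import math
import os
import subprocess
import sys
import tempfile


def verify_math_numeric(predicted, expected, tol=1e-6):
    """Compare two numeric answer strings within an absolute tolerance."""
    try:
        return math.isclose(float(predicted), float(expected),
                            rel_tol=0.0, abs_tol=tol)
    except ValueError:
        return False  # at least one answer is not a plain number


def verify_code(solution, tests, timeout_s=30.0):
    """Run solution plus assert-based tests in a child Python process.

    Returns True only if the process exits with status 0 before the timeout.
    Assumption: this is process isolation only, not a hardened sandbox.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run([sys.executable, path], cwd=tmp,
                                    capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def as_score(result):
    """Map a custom verifier's bool or numeric result to a score in [0.0, 1.0]."""
    if isinstance(result, bool):
        return 1.0 if result else 0.0
    return min(1.0, max(0.0, float(result)))
```

For example, `as_score(verify_math_numeric("5", "5.0"))` gives `1.0`, while a free-text answer such as `"x = 5"` fails the numeric check until the final value is extracted from it.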

Commands

# Generate GRPO data (extract prompts from SFT)
realign data extract-prompts \
  --input sft.jsonl \
  --filter-verifiable \
  --output grpo_prompts.jsonl

# Train with GRPO
realign train --method grpo \
  --data grpo_data.jsonl \
  --verifier math \
  --group-size 16 \
  --beta 0.001

GRPO vs RLVR

| Aspect | GRPO | RLVR |
|---|---|---|
| Scoring | Relative (group) | Absolute (verify) |
| Reward | Score 0.0-1.0 | Correct/Incorrect |
| Use case | Ranking quality | Binary verification |

Acceptance Criteria

  • GRPO concept explained
  • Hyperparameters documented (DeepSeek-R1 optimized)
  • Data format with examples
  • Verifier types documented
  • GRPO vs RLVR comparison
  • CLI commands provided
