feat(skills): Create grpo-verifiable-training skill for math/code verification

## Summary

Create a skill documenting GRPO (Group Relative Policy Optimization) training with verifiable rewards for math and code tasks.

## Context

GRPO works well for tasks whose outputs can be checked automatically, such as math answers or code that must pass tests. ReAlign supports GRPO, but the workflow is not documented.

## Implementation Approach

Create `.claude/skills/grpo-verifiable-training.md` documenting:

### What is GRPO?

- Online RL training on responses sampled from the model being trained
- Samples a group of responses for each prompt
- Scores each response with a verifier and reinforces those that score above their group's average
- No separate reward model needed (rewards come from verifiable signals)
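
The "relative" part of GRPO can be illustrated with a minimal sketch: each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` guard, and the use of the population standard deviation are assumptions made for illustration, not ReAlign's actual implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread.

    Hypothetical helper: `eps` avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a verifier (1.0 = correct).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

# A group where every response is correct carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses end up with positive advantages and incorrect ones with negative advantages, while a group with identical rewards yields all zeros. That is one reason larger groups help: they make a mixed outcome more likely.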

### When to Use GRPO

| Use Case | Verifier | Example |
|----------|----------|---------|
| Math problems | Symbolic solver | GSM8K, MATH |
| Code generation | Execution sandbox | HumanEval |
| Factual QA | Knowledge base lookup | TriviaQA |
| Format compliance | Regex/parser | JSON output |

### GRPO Hyperparameters (DeepSeek-R1 Optimized)

```python
GRPO_CONFIG = {
    "group_size": 16,        # Responses per prompt
    "beta": 0.001,           # KL coefficient (NOT 0.1!)
    "clip_epsilon": 10.0,    # NOT 0.2 (DeepSeek finding)
    "learning_rate": 3e-7,   # Conservative
    "max_length": 2048,
}
```
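
To show where `beta` and `clip_epsilon` enter, here is a rough per-token sketch of the clipped surrogate with a KL penalty, following the objective described in the DeepSeekMath GRPO formulation. The function name and signature are hypothetical, and whether ReAlign computes its loss exactly this way is an assumption.

```python
import math


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_epsilon=10.0, beta=0.001):
    """Per-token GRPO loss (hypothetical helper, to be minimized).

    logp_new: log-prob of the token under the policy being trained
    logp_old: log-prob under the policy that sampled the response
    logp_ref: log-prob under the frozen reference policy
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
    surrogate = min(ratio * advantage, clipped * advantage)

    # Non-negative KL estimator: r - log(r) - 1 with r = pi_ref / pi_new.
    d = logp_ref - logp_new
    kl = math.exp(d) - d - 1.0

    return -(surrogate - beta * kl)


# Identical policies: ratio is 1 and KL is 0, so the loss is -advantage.
loss_same = grpo_token_loss(-1.0, -1.0, -1.0, advantage=0.5)

# A large ratio (e^3, about 20) is capped at 1 + clip_epsilon.
loss_wide = grpo_token_loss(0.0, -3.0, 0.0, advantage=1.0, clip_epsilon=10.0)
loss_tight = grpo_token_loss(0.0, -3.0, 0.0, advantage=1.0, clip_epsilon=0.2)
```

With `clip_epsilon` at 10.0 the lower bound `1 - clip_epsilon` is negative, so only the upper bound of 11 can ever bind, since the probability ratio is always positive. With 0.2 the same update would be capped at a ratio of 1.2.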

### Data Format

```json
{
  "prompt": "Solve: 2x + 5 = 15",
  "responses": [
    {"text": "x = 5", "score": 1.0, "correct": true},
    {"text": "x = 10", "score": 0.0, "correct": false},
    {"text": "x = 5 because 2(5) + 5 = 15", "score": 1.0, "correct": true}
  ]
}
```
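
A record in this shape can be checked before training with a small validator. The sketch below assumes one JSON object per line (suggested by the `.jsonl` extension) and at least two responses per group. `load_grpo_record` is a hypothetical helper, not part of ReAlign.

```python
import json


def load_grpo_record(line):
    """Parse one JSONL line and check it against the format shown above."""
    record = json.loads(line)
    if not isinstance(record, dict) or not isinstance(record.get("prompt"), str):
        raise ValueError("record needs a string 'prompt'")

    responses = record.get("responses")
    # Assumption: a group needs at least two responses to be compared.
    if not isinstance(responses, list) or len(responses) < 2:
        raise ValueError("'responses' must be a list with at least 2 entries")

    for r in responses:
        if not isinstance(r, dict) or not isinstance(r.get("text"), str):
            raise ValueError("each response needs a string 'text'")
        score = r.get("score")
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError("'score' must be a number")
        if not 0.0 <= score <= 1.0:
            raise ValueError("'score' must lie in [0.0, 1.0]")
        if not isinstance(r.get("correct"), bool):
            raise ValueError("'correct' must be true or false")
    return record


line = json.dumps({
    "prompt": "Solve: 2x + 5 = 15",
    "responses": [
        {"text": "x = 5", "score": 1.0, "correct": True},
        {"text": "x = 10", "score": 0.0, "correct": False},
    ],
})
record = load_grpo_record(line)
```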

### Verifier Types

1. **Math Verifier**
   - Symbolic comparison (sympy)
   - Numerical tolerance (1e-6)
   - Step-by-step verification optional

2. **Code Verifier**
   - Sandbox execution
   - Test case pass/fail
   - Timeout handling (30s default)

3. **Custom Verifier**
   - User-defined function
   - Returns bool or score
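
As an illustration of the custom-verifier contract (a function returning a bool or a score), here is a minimal numeric math verifier with the 1e-6 tolerance mentioned above. It uses only the standard library rather than sympy. The answer-extraction convention (the first `x = <number>` in the response) and all names are assumptions for this sketch, not ReAlign's API.

```python
import math
import re
from typing import Callable, Union

# Assumed contract: a verifier maps a response string to a bool or a score.
Verifier = Callable[[str], Union[bool, float]]


def make_math_verifier(expected: float, variable: str = "x",
                       tol: float = 1e-6) -> Verifier:
    """Build a verifier that compares the first '<variable> = <number>'."""
    pattern = re.compile(rf"\b{re.escape(variable)}\s*=\s*(-?\d+(?:\.\d+)?)")

    def verify(response: str) -> bool:
        match = pattern.search(response)
        if match is None:
            return False  # no parseable answer counts as incorrect
        value = float(match.group(1))
        return math.isclose(value, expected, rel_tol=0.0, abs_tol=tol)

    return verify


def to_score(result: Union[bool, float]) -> float:
    """Map a bool or numeric verifier result onto a float score."""
    return float(result)


verify = make_math_verifier(expected=5.0)
texts = ["x = 5", "x = 10", "x = 5 because 2(5) + 5 = 15"]
scores = [to_score(verify(t)) for t in texts]
```

A symbolic verifier would replace the regex and `math.isclose` with an expression comparison, and a code verifier would replace them with sandboxed test execution under a timeout. In both cases the outer contract can stay the same.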

### Commands

```bash
# Generate GRPO data (extract prompts from SFT)
realign data extract-prompts \
  --input sft.jsonl \
  --filter-verifiable \
  --output grpo_prompts.jsonl

# Train with GRPO
realign train --method grpo \
  --data grpo_data.jsonl \
  --verifier math \
  --group-size 16 \
  --beta 0.001
```

### GRPO vs RLVR

| Aspect | GRPO | RLVR |
|--------|------|------|
| Scoring | Relative (group) | Absolute (verify) |
| Reward | Score 0.0-1.0 | Correct/Incorrect |
| Use case | Ranking quality | Binary verification |

## Acceptance Criteria

- [ ] GRPO concept explained
- [ ] Hyperparameters documented (DeepSeek-R1 optimized)
- [ ] Data format with examples
- [ ] Verifier types documented
- [ ] GRPO vs RLVR comparison
- [ ] CLI commands provided

## Related

- Issue #306 (training-methods)
- Issue #307 (dpo-rlvr-generation)
