Summary
Create a skill for generating DPO preference pairs and RLVR verification data from existing SFT datasets.
Context
ReAlign already provides RefusalDPOPairGenerator, FinanceDPOGenerator, and FinanceRLVRGenerator, but they are hard to discover. This skill documents the workflow for using them.
What Does NOT Work
- DPO pair generation has no documented workflow
- RLVR data generation is not documented
- Quality thresholds for preference data are scattered
Implementation Approach
Create .claude/skills/dpo-rlvr-generation.md documenting:
DPO Pair Generation
Available Generators:
RefusalDPOPairGenerator - Refusal vs compliance pairs
FinanceDPOGenerator - Domain-specific with intentional flaws
Flaw Types for Rejected Responses:
risky_advice - Dangerous recommendations
oversimplified - Missing important details
incomplete - Truncated or partial
hallucinated - Made up facts
irrelevant - Off-topic tangent
overconfident - Uncalibrated certainty
Quality Thresholds:
| Field | Threshold |
|---|---|
| Chosen quality | ≥9.0 |
| Rejected quality | ≤6.0 |
| Preference gap | ≥3.0 |
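The thresholds above can be read as a simple per-pair filter. The following is a minimal sketch, not ReAlign's implementation: the `PreferencePair` dataclass, its field names, and `passes_thresholds` are hypothetical, and it assumes each response already carries a quality score on the same scale as the table.

```python
from dataclasses import dataclass

# Thresholds taken from the Quality Thresholds table.
CHOSEN_MIN = 9.0
REJECTED_MAX = 6.0
GAP_MIN = 3.0


@dataclass
class PreferencePair:
    """Hypothetical record shape; the real generators may use other fields."""
    prompt: str
    chosen: str
    rejected: str
    chosen_score: float
    rejected_score: float


def passes_thresholds(pair: PreferencePair) -> bool:
    """Return True only if the pair meets all three thresholds."""
    gap = pair.chosen_score - pair.rejected_score
    return (
        pair.chosen_score >= CHOSEN_MIN
        and pair.rejected_score <= REJECTED_MAX
        and gap >= GAP_MIN
    )
```

With these exact values, a pair that satisfies the chosen and rejected thresholds always has a gap of at least 3.0, so the gap check only becomes decisive if the thresholds are tuned independently.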
RLVR Data Generation
Available Generators:
FinanceRLVRGenerator - Finance calculations with verification
- Code execution verification (via sandbox)
- Math answer verification (symbolic)
Verification Types:
math - Symbolic math verification
code - Code execution sandbox
custom - User-defined verifier
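One way to picture the three verification types is a small registry that maps a type name to a pass/fail function. This is a sketch under assumptions: `Verifier`, `VERIFIERS`, `register_verifier`, and `verify` are hypothetical names, and the `math` verifier here is a numeric stand-in rather than true symbolic verification. A `code` verifier would be registered the same way, wrapping whatever sandbox is actually used.

```python
import math
from typing import Callable, Dict

# A verifier takes (model_answer, reference) and returns pass/fail.
Verifier = Callable[[str, str], bool]


def verify_math(answer: str, reference: str) -> bool:
    """Numeric stand-in for symbolic verification: compare as floats."""
    try:
        return math.isclose(
            float(answer), float(reference), rel_tol=1e-9, abs_tol=1e-9
        )
    except ValueError:
        # Unparseable answers count as incorrect.
        return False


# Hypothetical registry keyed by verification type.
VERIFIERS: Dict[str, Verifier] = {"math": verify_math}


def register_verifier(name: str, fn: Verifier) -> None:
    """Add or replace a verifier, e.g. a user-defined 'custom' check."""
    VERIFIERS[name] = fn


def verify(kind: str, answer: str, reference: str) -> bool:
    """Dispatch to the verifier registered for this verification type."""
    if kind not in VERIFIERS:
        raise KeyError(f"no verifier registered for {kind!r}")
    return VERIFIERS[kind](answer, reference)
```

A user-defined verifier could then be added with, for example, `register_verifier("custom", lambda a, r: a.strip().lower() == r.strip().lower())`.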
Categories:
- Position sizing, risk calculations
- Options Greeks, P&L calculations
- Code execution (pass/fail)
- Math solutions (correct/incorrect)
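To make the position sizing category concrete, here is a sketch of how a verifiable example could be built. It assumes the common fixed-fractional formula (risk budget divided by per-share risk); whether FinanceRLVRGenerator uses this formula is not stated here, and `position_size`, `make_record`, and the record fields are hypothetical.

```python
import json
import math


def position_size(equity: float, risk_fraction: float,
                  entry: float, stop: float) -> int:
    """Whole shares such that a stop-out loses at most equity * risk_fraction."""
    risk_per_share = abs(entry - stop)
    if risk_per_share == 0:
        raise ValueError("entry and stop must differ")
    return math.floor(equity * risk_fraction / risk_per_share)


def make_record(equity: float, risk_fraction: float,
                entry: float, stop: float) -> dict:
    """Build one hypothetical RLVR record with a checkable reference answer."""
    return {
        "prompt": (
            f"Account equity is ${equity:,.0f} and risk per trade is "
            f"{risk_fraction:.1%}. Entry is ${entry:.2f}, stop is ${stop:.2f}. "
            "How many whole shares can be bought?"
        ),
        "verification_type": "math",
        "answer": position_size(equity, risk_fraction, entry, stop),
    }


record = make_record(100_000, 0.01, 50.0, 48.0)
line = json.dumps(record)  # one JSONL line
```

Because the reference answer is computed from the same parameters that produce the prompt, a model's response can be scored as correct or incorrect without human review.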
Commands
# Generate DPO pairs
python -m realign.data.refusal_dpo_generator \
--input sft.jsonl \
--output dpo.jsonl \
--chosen-threshold 9.0 \
--rejected-threshold 6.0
# Generate RLVR data
python -m realign.data.finance_rlvr_generator \
--domain finance \
--count 10000 \
--output rlvr.jsonl
Validation
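from realign.data.preference_quality_validator import PreferenceQualityValidator
# dpo_pairs is assumed to hold the preference pairs loaded from dpo.jsonl
validator = PreferenceQualityValidator()
result = validator.validate(dpo_pairs)
# Enforce the minimum preference gap from Quality Thresholds
assert result.preference_gap >= 3.0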
Acceptance Criteria
Related