feat(skills): Create dpo-rlvr-generation skill for preference and verification data #307

@akaszubski

Summary

Create a skill for generating DPO preference pairs and RLVR verification data from existing SFT datasets.

Context

ReAlign already has RefusalDPOPairGenerator, FinanceDPOGenerator, and FinanceRLVRGenerator, but they're not easily discoverable. This skill documents the workflow for using them.

What Does NOT Work

  • No documented workflow exists for DPO pair generation
  • RLVR data generation is not documented
  • Quality thresholds for preference data are scattered instead of collected in one place

Implementation Approach

Create .claude/skills/dpo-rlvr-generation.md documenting:

DPO Pair Generation

Available Generators:

  1. RefusalDPOPairGenerator - Refusal vs compliance pairs
  2. FinanceDPOGenerator - Domain-specific with intentional flaws

Flaw Types for Rejected Responses:

  • risky_advice - Dangerous recommendations
  • oversimplified - Missing important details
  • incomplete - Truncated or partial
  • hallucinated - Made-up facts
  • irrelevant - Off-topic tangent
  • overconfident - Uncalibrated certainty
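
As a rough illustration of how the flaw labels above could be kept consistent across generators and validators, here is a minimal sketch. The `FlawType` enum and `parse_flaw` helper are hypothetical names for this example and are not taken from the ReAlign codebase; only the six label strings come from the list above.

```python
from enum import Enum


class FlawType(str, Enum):
    """Flaw labels for rejected responses (hypothetical enum, labels from this issue)."""

    RISKY_ADVICE = "risky_advice"
    OVERSIMPLIFIED = "oversimplified"
    INCOMPLETE = "incomplete"
    HALLUCINATED = "hallucinated"
    IRRELEVANT = "irrelevant"
    OVERCONFIDENT = "overconfident"


def parse_flaw(label: str) -> FlawType:
    """Normalize a free-form label and map it to a FlawType.

    Raises ValueError for labels that are not in the list.
    """
    return FlawType(label.strip().lower())
```

Rejecting unknown labels early keeps typos such as `halucinated` out of the generated dataset.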

Quality Thresholds:

Field             Threshold
Chosen quality    ≥9.0
Rejected quality  ≤6.0
Preference gap    ≥3.0
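
A minimal sketch of how these thresholds might be applied to a single scored pair. `ScoredPair` and `passes_thresholds` are hypothetical names for illustration; the actual field names used by the ReAlign generators may differ.

```python
from dataclasses import dataclass

# Threshold values from this issue
CHOSEN_MIN = 9.0
REJECTED_MAX = 6.0
GAP_MIN = 3.0


@dataclass
class ScoredPair:
    """A preference pair with quality scores (hypothetical structure)."""

    chosen_score: float
    rejected_score: float


def passes_thresholds(pair: ScoredPair) -> bool:
    """Return True only if all three thresholds hold."""
    gap = pair.chosen_score - pair.rejected_score
    return (
        pair.chosen_score >= CHOSEN_MIN
        and pair.rejected_score <= REJECTED_MAX
        and gap >= GAP_MIN
    )
```

With these particular values, any pair that meets both score thresholds already has a gap of at least 3.0, so the gap check only becomes binding if the chosen or rejected threshold is relaxed.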

RLVR Data Generation

Available Generators:

  1. FinanceRLVRGenerator - Finance calculations with verification
  2. Code execution verification (via sandbox)
  3. Math answer verification (symbolic)

Verification Types:

  • math - Symbolic math verification
  • code - Code execution sandbox
  • custom - User-defined verifier
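
One way to picture the three verification types is a registry that maps a type name to a verifier function and turns the result into a binary reward. This is a sketch under assumptions: `VERIFIERS`, `register`, and `reward` are hypothetical names, and the `math` verifier here is a simple numeric-tolerance stand-in, not the symbolic verification the skill would document. A `code` verifier is omitted because it depends on the sandbox.

```python
from typing import Callable, Dict

Verifier = Callable[[str, str], bool]

# Hypothetical registry: verification type -> verifier function
VERIFIERS: Dict[str, Verifier] = {}


def register(kind: str) -> Callable[[Verifier], Verifier]:
    """Decorator that registers a verifier under a type name."""

    def wrap(fn: Verifier) -> Verifier:
        VERIFIERS[kind] = fn
        return fn

    return wrap


@register("math")
def verify_math(answer: str, expected: str) -> bool:
    # Stand-in for symbolic verification: compare as floats within a tolerance.
    try:
        return abs(float(answer) - float(expected)) <= 1e-6
    except ValueError:
        return False


@register("custom")
def verify_exact(answer: str, expected: str) -> bool:
    # Example user-defined verifier: exact match after trimming whitespace.
    return answer.strip() == expected.strip()


def reward(kind: str, answer: str, expected: str) -> float:
    """Binary verifiable reward: 1.0 if the verifier accepts, else 0.0."""
    return 1.0 if VERIFIERS[kind](answer, expected) else 0.0
```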

Categories:

  • Position sizing, risk calculations
  • Options Greeks, P&L calculations
  • Code execution (pass/fail)
  • Math solutions (correct/incorrect)
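
To make the position-sizing category concrete, here is a sketch of one verifiable example using the common fixed-fractional rule (shares = risk budget / per-share risk, rounded down). The record fields (`prompt`, `verification_type`, `ground_truth`) and the helper names are assumptions for illustration; they are not the documented output format of FinanceRLVRGenerator.

```python
import math


def position_size(account: float, risk_fraction: float, entry: float, stop: float) -> int:
    """Whole shares such that a stop-out loses at most account * risk_fraction."""
    risk_per_share = abs(entry - stop)
    if risk_per_share == 0:
        raise ValueError("entry and stop must differ")
    return math.floor(account * risk_fraction / risk_per_share)


def make_example(account: float, risk_fraction: float, entry: float, stop: float) -> dict:
    """Build a hypothetical RLVR record with a computed ground truth."""
    prompt = (
        f"Account: ${account:,.0f}. Risk per trade: {risk_fraction:.1%}. "
        f"Entry: ${entry:.2f}. Stop: ${stop:.2f}. How many shares?"
    )
    return {
        "prompt": prompt,
        "verification_type": "math",
        "ground_truth": position_size(account, risk_fraction, entry, stop),
    }


def score(example: dict, model_answer: int) -> float:
    """Correct/incorrect reward against the stored ground truth."""
    return 1.0 if model_answer == example["ground_truth"] else 0.0
```

For an account of 100,000 risking 1% with entry 50.00 and stop 48.00, the risk budget is 1,000 and the per-share risk is 2.00, so the ground truth is 500 shares.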

Commands

# Generate DPO pairs
python -m realign.data.refusal_dpo_generator \
  --input sft.jsonl \
  --output dpo.jsonl \
  --chosen-threshold 9.0 \
  --rejected-threshold 6.0

# Generate RLVR data
python -m realign.data.finance_rlvr_generator \
  --domain finance \
  --count 10000 \
  --output rlvr.jsonl

Validation

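from realign.data.preference_quality_validator import PreferenceQualityValidator

# dpo_pairs: the preference pairs loaded from dpo.jsonl
validator = PreferenceQualityValidator()
result = validator.validate(dpo_pairs)
# The preference gap must meet the ≥3.0 quality threshold
assert result.preference_gap >= 3.0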

Acceptance Criteria

  • Both DPO generators documented
  • RLVR generators documented
  • Flaw types listed
  • Quality thresholds specified
  • CLI commands provided
  • Validation process documented
