Summary
Create a skill documenting all quality scoring methods available in ReAlign for data curation.
Context
ReAlign has 6+ quality dimensions and multiple scorers but they're not documented together.
Implementation Approach
Create .claude/skills/quality-scoring.md documenting:
Quality Scorers
| Scorer | Purpose | Speed | File |
|---|---|---|---|
| FastIFDScorer | Instruction Following Difficulty | 10-20x faster | ifd_scorer_fast.py |
| QualityScorer | LLM-based quality (Qwen3-30B) | 0.85 ex/s | quality_scorer.py |
| MultiDimensionalScorer | 5-dimension composite | Medium | quality/multi_scorer.py |
| LLMQualityScorer | Multi-backend (MLX/OpenRouter) | Variable | llm_quality_scorer.py |
| EnsembleQualityScorer | Cross-model ensemble | Slow | ensemble_scorer.py |
Quality Dimensions
- IFD Score (0.0-1.0)
  - Measures instruction complexity
  - Higher = more diverse/challenging
  - Formula: PPL(response|instruction) / PPL(response)
- Factuality Score (0.0-1.0)
  - Detects hallucinations
  - Checks factual consistency
- Reasoning Score (0.0-1.0)
  - Quality of reasoning chains
  - Step-by-step logic
- Diversity Score (0.0-1.0)
  - Dataset-level diversity
  - Avoids mode collapse
- Domain Score (0.0-1.0)
  - Domain-specific relevance
  - Expertise alignment
- LLM Quality Score (1-10)
  - Tulu3 dimensions: helpfulness, accuracy, clarity, completeness
  - Each dimension: 1-5, composite: 1-10
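
To make the IFD score concrete for the skill doc, here is a minimal sketch. It assumes the common convention in which IFD is the perplexity of the response given the instruction divided by the perplexity of the response alone, which keeps the score below 1.0 whenever the instruction makes the response easier to predict. The function names and the per-token loss values are hypothetical and are not taken from `ifd_scorer_fast.py`.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def ifd_score(nll_conditioned, nll_unconditioned):
    """Assumed IFD convention: PPL(response | instruction) / PPL(response)."""
    return perplexity(nll_conditioned) / perplexity(nll_unconditioned)


# Hypothetical per-token losses for the same 4-token response.
with_instruction = [0.5, 0.4, 0.6, 0.5]      # mean loss 0.5
without_instruction = [1.0, 0.9, 1.1, 1.0]   # mean loss 1.0

score = ifd_score(with_instruction, without_instruction)
print(round(score, 3))  # 0.607
```

A score near 1.0 means the instruction barely helps the model predict the response, which is why higher values are read as more challenging examples.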
Thresholds by Training Type
| Type | Quality | IFD | Use |
|---|---|---|---|
| SFT | ≥8.0 | ≥0.3 | Base training |
| DPO chosen | ≥9.0 | ≥0.5 | High quality only |
| DPO rejected | ≤6.0 | any | Low quality |
| RLVR | ≥9.0 | ≥0.5 | Verified solutions |
| Calibration | ≥8.0 | ≥0.4 | Uncertainty examples |
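
The thresholds above could be applied to scored records along these lines. This is only a sketch: the `THRESHOLDS` dictionary, the `passes` helper, and the `quality` and `ifd` field names are assumptions made for illustration, not the actual schema written by the ReAlign scorers.

```python
# Thresholds copied from the table; the key names are hypothetical.
THRESHOLDS = {
    "sft": {"min_quality": 8.0, "min_ifd": 0.3},
    "dpo_chosen": {"min_quality": 9.0, "min_ifd": 0.5},
    "dpo_rejected": {"max_quality": 6.0},  # IFD: any
    "rlvr": {"min_quality": 9.0, "min_ifd": 0.5},
    "calibration": {"min_quality": 8.0, "min_ifd": 0.4},
}


def passes(example, training_type):
    """Return True if a scored example meets the thresholds for a training type."""
    rule = THRESHOLDS[training_type]
    if "max_quality" in rule:
        # DPO rejected: only an upper bound on quality, no IFD constraint.
        return example["quality"] <= rule["max_quality"]
    return (
        example["quality"] >= rule["min_quality"]
        and example["ifd"] >= rule["min_ifd"]
    )


# Hypothetical scored records (assumed field names).
scored = [
    {"id": "a", "quality": 9.2, "ifd": 0.55},
    {"id": "b", "quality": 8.3, "ifd": 0.35},
    {"id": "c", "quality": 5.0, "ifd": 0.10},
]

sft_ids = [ex["id"] for ex in scored if passes(ex, "sft")]
rejected_ids = [ex["id"] for ex in scored if passes(ex, "dpo_rejected")]
print(sft_ids, rejected_ids)  # ['a', 'b'] ['c']
```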
Commands

    # Fast IFD scoring
    python -m realign.data.ifd_scorer_fast \
        --input data.jsonl \
        --output scored.jsonl \
        --batch-size 32

    # Multi-dimensional scoring
    python scripts/qwen3_quality_pipeline.py \
        --input data.jsonl \
        --task score \
        --scorer multi \
        --output scored.jsonl

    # Distributed scoring (M3 Ultra + M4 Max)
    python scripts/qwen3_quality_pipeline.py \
        --input data.jsonl \
        --task score \
        --distributed \
        --output scored.jsonl

Distributed Scoring Performance
| Machine | Rate | Notes |
|---|---|---|
| M4 Max | ~0.85 ex/s | Pipeline is CPU-bound |
| M3 Ultra | ~0.85 ex/s | Equal due to tokenization overhead |
| Combined | ~1.7 ex/s | 50/50 split |
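Note: Use a 50/50 split when distributing pipeline work across the two machines, not the 65/35 split from benchmarks, because both machines run the pipeline at the same rate.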
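
The reasoning behind the split can be checked with a little arithmetic. The sketch below assumes each machine processes its share independently at the measured pipeline rate, so the job finishes when the slower share finishes. The helper names are hypothetical, and which machine would take the larger share in a 65/35 split is also an assumption.

```python
def split_by_rate(n_examples, rates):
    """Assign examples to workers in proportion to their measured ex/s."""
    total = sum(rates.values())
    names = list(rates)
    shares = {name: int(n_examples * (rates[name] / total)) for name in names}
    # Hand any rounding remainder to the first workers, one example each.
    remainder = n_examples - sum(shares.values())
    for name in names[:remainder]:
        shares[name] += 1
    return shares


def wall_clock_seconds(shares, rates):
    """The job is done when the slowest worker finishes its share."""
    return max(shares[name] / rates[name] for name in shares)


# Measured pipeline rates from the table above (ex/s).
rates = {"m4_max": 0.85, "m3_ultra": 0.85}

even = split_by_rate(10_000, rates)
print(even)  # {'m4_max': 5000, 'm3_ultra': 5000}

# Benchmark-style 65/35 split (assumed assignment of the larger share).
skewed = {"m4_max": 3500, "m3_ultra": 6500}

print(wall_clock_seconds(even, rates) < wall_clock_seconds(skewed, rates))  # True
```

With equal rates, any uneven split leaves one machine idle while the other finishes, so the even split minimizes total time.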
Acceptance Criteria
Related