Summary
Update the data-curation-workflow skill to document the complete A-grade 9-stage data curation pipeline discovered during the ReAlign training capabilities audit.
Context
The ReAlign audit identified 8 stable training methods and 10+ data curation stages. The data-curation-workflow skill needs to reflect this comprehensive pipeline.
What Does NOT Work
- Current skill doesn't document all 9 stages
- Missing KenLM pre-filtering stage
- Missing decontamination stage
- Missing specialized data generation (DPO, RLVR, anti-hallucination)
Implementation Approach
Update the skill to document:
9-Stage Pipeline
- Persona-driven extraction - PersonaGenerator + HybridCurationCoordinator
- KenLM pre-filtering - Remove bottom 30-50% by perplexity (~10K/sec)
- Multi-dimensional scoring - IFD + Factuality + Reasoning + Diversity + Domain
- Bloom deduplication - Probabilistic + fuzzy dedup (1GB for 100M docs)
- Benchmark decontamination - 13-gram matching against MMLU, GSM8K, etc.
- Quality threshold filtering - Score ≥8.0, IFD ≥0.3
- Specialized data generation - DPO, RLVR, anti-hallucination, calibration
- Dataset mixing - Weighted combination with curriculum scheduling
- Validation - Format, contamination, bias, duplicate checks
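As an illustration of the benchmark decontamination stage above, here is a minimal sketch of 13-gram matching. It assumes plain lowercase whitespace tokenization and an in-memory set of benchmark n-grams; the function names (`ngrams`, `build_benchmark_index`, `is_contaminated`, `decontaminate`) are hypothetical and are not taken from the ReAlign codebase, whose actual implementation may normalize text or store the index differently.

```python
from typing import Iterable, List, Set, Tuple

NGRAM_SIZE = 13  # n-gram length named in the decontamination stage

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[NGram]:
    """Return the set of lowercase whitespace-token n-grams in `text`.

    Texts shorter than `n` tokens yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str],
                          n: int = NGRAM_SIZE) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark text."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(sample: str, index: Set[NGram],
                    n: int = NGRAM_SIZE) -> bool:
    """True if the sample shares at least one n-gram with the benchmarks."""
    return not ngrams(sample, n).isdisjoint(index)


def decontaminate(samples: Iterable[str], index: Set[NGram],
                  n: int = NGRAM_SIZE) -> List[str]:
    """Keep only the samples that share no n-gram with the benchmark index."""
    return [s for s in samples if not is_contaminated(s, index, n)]
```

In this sketch a single shared 13-gram is enough to drop a sample, which is a deliberately strict choice; a production version might require a minimum overlap ratio instead.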
Key Classes
| Purpose | Class | Location |
|---|---|---|
| IFD scoring | FastIFDScorer | src/realign/data/ifd_scorer_fast.py |
| Multi-dimensional | MultiDimensionalScorer | src/realign/data/quality/multi_scorer.py |
| Deduplication | BloomDeduplicator | src/realign/data/processors/bloom_deduplicator.py |
| DPO generation | RefusalDPOPairGenerator | src/realign/data/refusal_dpo_generator.py |
| RLVR generation | FinanceRLVRGenerator | src/realign/data/finance_rlvr_generator.py |
| Anti-hallucination | AntiHallucinationGenerator | src/realign/data/antihallucination_generator.py |
| Mixing | DatasetMixer | src/realign/data/dataset_mixer.py |
Quality Thresholds by Training Type
| Type | Quality Score | IFD Score |
|---|---|---|
| SFT | ≥8.0 | ≥0.3 |
| DPO chosen | ≥9.0 | ≥0.5 |
| DPO rejected | ≤6.0 | any |
| RLVR | ≥9.0 | ≥0.5 |
| Calibration | ≥8.0 | ≥0.4 |
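The thresholds above could be encoded as a small lookup plus a predicate, sketched below. The `Threshold` dataclass, the `THRESHOLDS` mapping, its keys, and the `passes` function are hypothetical names chosen for illustration; only the numeric bounds come from the table, and how ReAlign actually stores or applies them is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Threshold:
    """Bounds for one training type; None means the bound is not applied."""
    min_quality: Optional[float] = None
    max_quality: Optional[float] = None
    min_ifd: Optional[float] = None


# Values copied from the "Quality Thresholds by Training Type" table.
THRESHOLDS: Dict[str, Threshold] = {
    "sft": Threshold(min_quality=8.0, min_ifd=0.3),
    "dpo_chosen": Threshold(min_quality=9.0, min_ifd=0.5),
    "dpo_rejected": Threshold(max_quality=6.0),  # IFD: any
    "rlvr": Threshold(min_quality=9.0, min_ifd=0.5),
    "calibration": Threshold(min_quality=8.0, min_ifd=0.4),
}


def passes(training_type: str, quality: float, ifd: float) -> bool:
    """Return True if a sample's scores satisfy the bounds for its type."""
    t = THRESHOLDS[training_type]
    if t.min_quality is not None and quality < t.min_quality:
        return False
    if t.max_quality is not None and quality > t.max_quality:
        return False
    if t.min_ifd is not None and ifd < t.min_ifd:
        return False
    return True
```

For example, `passes("sft", 8.0, 0.3)` accepts a sample exactly on both bounds, while `passes("dpo_rejected", 6.5, 0.9)` rejects it because rejected responses are capped at a quality score of 6.0.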
Acceptance Criteria
Related
- ReAlign: docs/guides/A_GRADE_DATA_CURATION_PIPELINE.md