feat(skills): Update data-curation-workflow skill with A-grade 9-stage pipeline #305

@akaszubski

Summary

Update the data-curation-workflow skill to document the complete A-grade 9-stage data curation pipeline discovered during the ReAlign training capabilities audit.

Context

The ReAlign audit identified 8 stable training methods and 10+ data curation stages. The data-curation-workflow skill needs to be updated to reflect this pipeline.

What Does NOT Work

  • Current skill doesn't document all 9 stages
  • Missing KenLM pre-filtering stage
  • Missing decontamination stage
  • Missing specialized data generation (DPO, RLVR, anti-hallucination)

Implementation Approach

Update the skill to document:

9-Stage Pipeline

  1. Persona-driven extraction - PersonaGenerator + HybridCurationCoordinator
  2. KenLM pre-filtering - Remove bottom 30-50% by perplexity (~10K/sec)
  3. Multi-dimensional scoring - IFD + Factuality + Reasoning + Diversity + Domain
  4. Bloom deduplication - Probabilistic + fuzzy dedup (1GB for 100M docs)
  5. Benchmark decontamination - 13-gram matching against MMLU, GSM8K, etc.
  6. Quality threshold filtering - Quality score ≥8.0, IFD ≥0.3
  7. Specialized data generation - DPO, RLVR, anti-hallucination, calibration
  8. Dataset mixing - Weighted combination with curriculum scheduling
  9. Validation - Format, contamination, bias, duplicate checks
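
As an illustration of stage 5, the sketch below drops any document that shares a word-level 13-gram with a benchmark text. It is a minimal sketch under stated assumptions, not the ReAlign implementation: the function names (`ngrams`, `build_benchmark_index`, `decontaminate`) are hypothetical, tokenization is assumed to be lowercased whitespace splitting, and the benchmark texts are assumed to be already loaded as plain strings.

```python
from typing import Iterable, List, Set, Tuple

NGRAM_SIZE = 13  # n-gram length named in the pipeline description

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[NGram]:
    """Return the set of word-level n-grams of a text.

    Assumption: tokens are lowercased, whitespace-separated words.
    Texts shorter than n tokens yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int = NGRAM_SIZE) -> Set[NGram]:
    """Collect every n-gram that occurs in any benchmark text."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(docs: List[str], index: Set[NGram], n: int = NGRAM_SIZE) -> List[str]:
    """Keep only documents that share no n-gram with the benchmark index."""
    return [doc for doc in docs if ngrams(doc, n).isdisjoint(index)]


if __name__ == "__main__":
    # Hypothetical stand-in for a benchmark item (14 tokens).
    benchmark = ["a b c d e f g h i j k l m n"]
    index = build_benchmark_index(benchmark)
    docs = [
        "x a b c d e f g h i j k l m y",    # contains a benchmark 13-gram
        "completely unrelated short text",  # clean
    ]
    print(decontaminate(docs, index))
```

Exact set membership keeps the check simple; a production pipeline would likely normalize punctuation and hash the n-grams to bound memory, which this sketch leaves out.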

Key Classes

| Purpose | Class | Location |
|---|---|---|
| IFD scoring | FastIFDScorer | src/realign/data/ifd_scorer_fast.py |
| Multi-dimensional | MultiDimensionalScorer | src/realign/data/quality/multi_scorer.py |
| Deduplication | BloomDeduplicator | src/realign/data/processors/bloom_deduplicator.py |
| DPO generation | RefusalDPOPairGenerator | src/realign/data/refusal_dpo_generator.py |
| RLVR generation | FinanceRLVRGenerator | src/realign/data/finance_rlvr_generator.py |
| Anti-hallucination | AntiHallucinationGenerator | src/realign/data/antihallucination_generator.py |
| Mixing | DatasetMixer | src/realign/data/dataset_mixer.py |

Quality Thresholds by Training Type

| Type | Quality Score | IFD Score |
|---|---|---|
| SFT | ≥8.0 | ≥0.3 |
| DPO chosen | ≥9.0 | ≥0.5 |
| DPO rejected | ≤6.0 | any |
| RLVR | ≥9.0 | ≥0.5 |
| Calibration | ≥8.0 | ≥0.4 |
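
The thresholds above could be encoded as a small lookup so that one filter function serves every training type. This is a hedged sketch rather than ReAlign code: the `Threshold` dataclass, the `THRESHOLDS` mapping, the type keys, and `passes` are all hypothetical names, and the quality and IFD scores are assumed to be precomputed floats.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Threshold:
    """Bounds for one training type; None means the bound is not applied."""
    min_quality: Optional[float] = None
    max_quality: Optional[float] = None
    min_ifd: Optional[float] = None


# Values copied from the thresholds table; the keys are hypothetical labels.
THRESHOLDS: Dict[str, Threshold] = {
    "sft": Threshold(min_quality=8.0, min_ifd=0.3),
    "dpo_chosen": Threshold(min_quality=9.0, min_ifd=0.5),
    "dpo_rejected": Threshold(max_quality=6.0),  # IFD: any
    "rlvr": Threshold(min_quality=9.0, min_ifd=0.5),
    "calibration": Threshold(min_quality=8.0, min_ifd=0.4),
}


def passes(training_type: str, quality: float, ifd: float) -> bool:
    """Return True if a sample meets the thresholds for its training type."""
    t = THRESHOLDS[training_type]
    if t.min_quality is not None and quality < t.min_quality:
        return False
    if t.max_quality is not None and quality > t.max_quality:
        return False
    if t.min_ifd is not None and ifd < t.min_ifd:
        return False
    return True


if __name__ == "__main__":
    print(passes("sft", 8.4, 0.35))          # meets both SFT bounds
    print(passes("rlvr", 9.2, 0.4))          # IFD below the RLVR minimum
    print(passes("dpo_rejected", 5.0, 0.0))  # only the quality ceiling applies
```

Keeping the bounds in data rather than in branching code means a new training type only needs a new table row.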

Acceptance Criteria

  • All 9 pipeline stages documented with commands
  • Key classes listed with file paths
  • Quality thresholds table by training type
  • Example commands for each stage
  • Recommended mix weights documented

Related

  • ReAlign: docs/guides/A_GRADE_DATA_CURATION_PIPELINE.md
