feat(skills): Create quality-scoring skill for multi-dimensional data assessment #310

@akaszubski

Summary

Create a skill documenting all quality scoring methods available in ReAlign for data curation.

Context

ReAlign has 6+ quality dimensions and multiple scorers, but they are not documented in one place.

Implementation Approach

Create .claude/skills/quality-scoring.md documenting:

Quality Scorers

| Scorer | Purpose | Speed | File |
| --- | --- | --- | --- |
| FastIFDScorer | Instruction Following Difficulty | 10-20x faster | ifd_scorer_fast.py |
| QualityScorer | LLM-based quality (Qwen3-30B) | 0.85 ex/s | quality_scorer.py |
| MultiDimensionalScorer | 5-dimension composite | Medium | quality/multi_scorer.py |
| LLMQualityScorer | Multi-backend (MLX/OpenRouter) | Variable | llm_quality_scorer.py |
| EnsembleQualityScorer | Cross-model ensemble | Slow | ensemble_scorer.py |

Quality Dimensions

  1. IFD Score (0.0-1.0)

    • Measures how hard the instruction is for the scoring model to follow
    • Higher = more challenging (the instruction helps the model less)
    • Formula: PPL(response|instruction) / PPL(response)
  2. Factuality Score (0.0-1.0)

    • Detects hallucinations
    • Checks factual consistency
  3. Reasoning Score (0.0-1.0)

    • Quality of reasoning chains
    • Step-by-step logic
  4. Diversity Score (0.0-1.0)

    • Dataset-level diversity
    • Avoids mode collapse
  5. Domain Score (0.0-1.0)

    • Domain-specific relevance
    • Expertise alignment
  6. LLM Quality Score (1-10)

    • Tulu3 dimensions: helpfulness, accuracy, clarity, completeness
    • Each dimension: 1-5, composite: 1-10
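
The IFD ratio in item 1 can be sketched from per-token log-probabilities. This is a minimal illustration, assuming the common definition of IFD as conditioned perplexity divided by unconditioned perplexity; the function names and the sample log-probabilities are hypothetical and are not taken from ifd_scorer_fast.py.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(logprob))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd_score(cond_logprobs: Sequence[float],
              uncond_logprobs: Sequence[float]) -> float:
    """IFD = PPL(response | instruction) / PPL(response).

    cond_logprobs:   response-token log-probs with the instruction as context
    uncond_logprobs: log-probs of the same response tokens without it
    """
    return perplexity(cond_logprobs) / perplexity(uncond_logprobs)


# Hypothetical log-probs for a 4-token response (illustrative values only).
with_instruction = [-0.5, -0.4, -0.6, -0.5]
without_instruction = [-1.0, -1.2, -0.8, -1.0]

score = ifd_score(with_instruction, without_instruction)
print(f"IFD = {score:.3f}")
```

A score near 1.0 means the instruction barely lowers the response perplexity, so the example is hard for the model; a score near 0.0 means the instruction makes the response easy to predict.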

Thresholds by Training Type

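| Type | Quality (1-10) | IFD (0.0-1.0) | Use |
| --- | --- | --- | --- |
| SFT | ≥8.0 | ≥0.3 | Base training |
| DPO chosen | ≥9.0 | ≥0.5 | High quality only |
| DPO rejected | ≤6.0 | any | Low quality |
| RLVR | ≥9.0 | ≥0.5 | Verified solutions |
| Calibration | ≥8.0 | ≥0.4 | Uncertainty examples |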
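
One way to apply these thresholds is a small lookup table and a filter. The sketch below copies the threshold values from the table above; the record field names `quality` and `ifd`, the training-type keys, and the sample records are assumptions, since the issue does not specify the schema of scored.jsonl.

```python
from typing import NamedTuple, Optional


class Threshold(NamedTuple):
    quality_min: Optional[float]  # lower bound on LLM quality (1-10), if any
    quality_max: Optional[float]  # upper bound on LLM quality (1-10), if any
    ifd_min: Optional[float]      # lower bound on IFD, None means "any"


# Values taken from the table above; the keys are hypothetical names.
THRESHOLDS = {
    "sft": Threshold(8.0, None, 0.3),
    "dpo_chosen": Threshold(9.0, None, 0.5),
    "dpo_rejected": Threshold(None, 6.0, None),
    "rlvr": Threshold(9.0, None, 0.5),
    "calibration": Threshold(8.0, None, 0.4),
}


def passes(record: dict, training_type: str) -> bool:
    """Return True if a scored record meets the thresholds for a training type."""
    t = THRESHOLDS[training_type]
    quality = record["quality"]  # assumed field name
    if t.quality_min is not None and quality < t.quality_min:
        return False
    if t.quality_max is not None and quality > t.quality_max:
        return False
    if t.ifd_min is not None and record["ifd"] < t.ifd_min:  # assumed field name
        return False
    return True


# Hypothetical scored records.
records = [
    {"quality": 9.2, "ifd": 0.55},
    {"quality": 8.1, "ifd": 0.35},
    {"quality": 5.0, "ifd": 0.10},
]

sft = [r for r in records if passes(r, "sft")]
chosen = [r for r in records if passes(r, "dpo_chosen")]
rejected = [r for r in records if passes(r, "dpo_rejected")]
print(len(sft), len(chosen), len(rejected))
```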

Commands

# Fast IFD scoring
python -m realign.data.ifd_scorer_fast \
  --input data.jsonl \
  --output scored.jsonl \
  --batch-size 32

# Multi-dimensional scoring
python scripts/qwen3_quality_pipeline.py \
  --input data.jsonl \
  --task score \
  --scorer multi \
  --output scored.jsonl

# Distributed scoring (M3 Ultra + M4 Max)
python scripts/qwen3_quality_pipeline.py \
  --input data.jsonl \
  --task score \
  --distributed \
  --output scored.jsonl

Distributed Scoring Performance

| Machine | Rate | Notes |
| --- | --- | --- |
| M4 Max | ~0.85 ex/s | Pipeline is CPU-bound |
| M3 Ultra | ~0.85 ex/s | Equal due to tokenization overhead |
| Combined | ~1.7 ex/s | 50/50 split |

Note: Split pipeline work 50/50 between the two machines, not the 65/35 split suggested by benchmarks, because both machines sustain the same end-to-end rate.
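
The effect of the split can be checked with simple arithmetic: a distributed run finishes when the slower-finishing machine completes its share. The sketch below uses the ~0.85 ex/s rates from the table; the dataset size of 10,000 examples, the machine keys, and the assumption that the 65% share would go to the M3 Ultra are illustrative only.

```python
from typing import Dict


def wall_clock_seconds(total_examples: int,
                       shares: Dict[str, float],
                       rates: Dict[str, float]) -> float:
    """Time until the last machine finishes its share of the examples."""
    return max(total_examples * shares[m] / rates[m] for m in rates)


# Sustained pipeline rates in examples per second, from the table above.
RATES = {"m4_max": 0.85, "m3_ultra": 0.85}
TOTAL = 10_000  # hypothetical dataset size

even = wall_clock_seconds(TOTAL, {"m4_max": 0.50, "m3_ultra": 0.50}, RATES)
skewed = wall_clock_seconds(TOTAL, {"m4_max": 0.35, "m3_ultra": 0.65}, RATES)

print(f"50/50 split: {even / 3600:.2f} h")
print(f"65/35 split: {skewed / 3600:.2f} h")
```

With equal per-machine rates, any uneven split leaves one machine idle while the other finishes, so the even split gives the shortest wall-clock time.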

Acceptance Criteria

  • All scorers documented
  • Quality dimensions explained
  • Thresholds listed by training type
  • CLI commands provided
  • Distributed performance notes included
