Skip to content

feat(agents): Create data-curator agent for A-grade pipeline orchestration #311

@akaszubski

Description

@akaszubski

Summary

Create a data-curator agent that orchestrates the complete A-grade 9-stage data curation pipeline.

Context

The A-grade pipeline has 9 stages with multiple tools. An agent can orchestrate this workflow automatically.

Implementation Approach

Create agent in .claude/agents/data-curator.md:

Agent Responsibilities

  1. Assess source data - Analyze input files, estimate pipeline requirements
  2. Run 9-stage pipeline - Execute each stage in order with checkpoints
  3. Generate reports - Quality distribution, dedup stats, contamination results
  4. Handle failures - Resume from checkpoints, retry failed items

Pipeline Stages

PIPELINE_STAGES = [
    "1_extract",         # Persona-driven extraction
    "2_prefilter",       # KenLM perplexity filter
    "3_score",           # Multi-dimensional quality scoring
    "4_dedup",           # Bloom + fuzzy deduplication
    "5_decontaminate",   # Benchmark contamination removal
    "6_filter",          # Quality threshold filtering
    "7_generate",        # DPO, RLVR, anti-hallucination
    "8_mix",             # Weighted dataset mixing
    "9_validate",        # Final validation checks
]

Agent Tools

The agent should have access to:

  • Bash - Run curation commands
  • Read/Write - Handle data files
  • Grep/Glob - Find and analyze files
  • Task - Spawn sub-agents for parallel work

Invocation

/curate --source books/ --output data/curated/ --budget 20.0

Agent Prompts

Initial assessment:

Analyze the source data at {source_path}:
1. Count total examples
2. Estimate processing time
3. Check for existing checkpoints
4. Recommend pipeline configuration

Per-stage execution:

Execute stage {stage_name}:
1. Run the appropriate command
2. Check for errors
3. Report statistics
4. Save checkpoint

Checkpointing

CHECKPOINT_SCHEMA = {
    "stage": int,
    "stage_name": str,
    "processed": int,
    "total": int,
    "stats": {
        "kept": int,
        "filtered": int,
        "errors": int,
    },
    "timestamp": str,
}

Reports

Agent should generate:

  • curation_report.json - Full statistics
  • quality_distribution.png - Score histogram
  • stage_timings.json - Performance metrics

Acceptance Criteria

  • Agent definition in .claude/agents/
  • All 9 stages implemented
  • Checkpoint/resume support
  • Quality report generation
  • Error handling and retry
  • Parallel stage support where applicable

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions