Summary
Create a data-curator agent that orchestrates the complete A-grade 9-stage data curation pipeline.
Context
The A-grade pipeline has 9 stages with multiple tools. An agent can orchestrate this workflow automatically.
Implementation Approach
Create the agent in .claude/agents/data-curator.md:
Agent Responsibilities
- Assess source data - Analyze input files, estimate pipeline requirements
- Run 9-stage pipeline - Execute each stage in order with checkpoints
- Generate reports - Quality distribution, dedup stats, contamination results
- Handle failures - Resume from checkpoints, retry failed items
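The "retry failed items" responsibility could be sketched as below. This is an illustration only: the helper name `process_with_retries`, the `process` callable, and the default of three attempts are assumptions, not part of the existing pipeline.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def process_with_retries(
    items: Iterable[T],
    process: Callable[[T], None],
    max_attempts: int = 3,  # assumed default; the issue does not specify one
) -> Tuple[List[T], List[T]]:
    """Run `process` on every item, retrying only the items that raised.

    Returns (succeeded, still_failing) after at most `max_attempts` passes.
    """
    pending = list(items)
    succeeded: List[T] = []
    for _ in range(max_attempts):
        failed: List[T] = []
        for item in pending:
            try:
                process(item)
                succeeded.append(item)
            except Exception:
                # Keep the item for the next pass instead of aborting the stage.
                failed.append(item)
        pending = failed
        if not pending:
            break
    return succeeded, pending
```

Items that still fail after the last pass are returned rather than raised, so they can be counted under "errors" in the stage statistics.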
Pipeline Stages
PIPELINE_STAGES = [
    "1_extract",        # Persona-driven extraction
    "2_prefilter",      # KenLM perplexity filter
    "3_score",          # Multi-dimensional quality scoring
    "4_dedup",          # Bloom + fuzzy deduplication
    "5_decontaminate",  # Benchmark contamination removal
    "6_filter",         # Quality threshold filtering
    "7_generate",       # DPO, RLVR, anti-hallucination
    "8_mix",            # Weighted dataset mixing
    "9_validate",       # Final validation checks
]
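A minimal sketch of running the stages in order and resuming after the last completed stage. The functions `stages_to_run` and `run_pipeline`, and the `run_stage` callable, are hypothetical names used for illustration; the real per-stage commands are not defined in this issue.

```python
from typing import Callable, Dict, List, Optional

PIPELINE_STAGES: List[str] = [
    "1_extract", "2_prefilter", "3_score", "4_dedup", "5_decontaminate",
    "6_filter", "7_generate", "8_mix", "9_validate",
]


def stages_to_run(last_completed: Optional[str] = None) -> List[str]:
    """Return the stages that still need to run.

    `last_completed` is the name of the last stage that finished,
    or None for a fresh run.
    """
    if last_completed is None:
        return list(PIPELINE_STAGES)
    # list.index raises ValueError for an unknown stage name.
    index = PIPELINE_STAGES.index(last_completed)
    return PIPELINE_STAGES[index + 1:]


def run_pipeline(
    run_stage: Callable[[str], Dict[str, int]],  # hypothetical stage runner
    last_completed: Optional[str] = None,
) -> Dict[str, Dict[str, int]]:
    """Execute the remaining stages in order and collect their stats."""
    results: Dict[str, Dict[str, int]] = {}
    for name in stages_to_run(last_completed):
        results[name] = run_stage(name)
    return results
```

Passing `last_completed="3_score"` would start the run at `4_dedup`, which is how a resume from a checkpoint might look.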
Agent Tools
The agent should have access to:
- Bash - Run curation commands
- Read/Write - Handle data files
- Grep/Glob - Find and analyze files
- Task - Spawn sub-agents for parallel work
Invocation
/curate --source books/ --output data/curated/ --budget 20.0
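If the command arguments end up being handled by a Python helper, parsing could look like the sketch below. This is an assumption about the design: the issue only shows the flags `--source`, `--output`, and `--budget`, and does not state the unit of the budget or whether any flag is optional.

```python
import argparse
import shlex
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser for the three flags shown in the invocation example."""
    parser = argparse.ArgumentParser(prog="curate")
    parser.add_argument("--source", type=Path, required=True,
                        help="Directory containing the source data")
    parser.add_argument("--output", type=Path, required=True,
                        help="Directory for curated output")
    # Assumption: budget is a float cap; its unit is not given in the issue.
    parser.add_argument("--budget", type=float, default=None,
                        help="Spending cap for the run")
    return parser


# Split the argument string the way a shell would, then parse it.
raw_args = "--source books/ --output data/curated/ --budget 20.0"
args = build_parser().parse_args(shlex.split(raw_args))
```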
Agent Prompts
Initial assessment:
Analyze the source data at {source_path}:
1. Count total examples
2. Estimate processing time
3. Check for existing checkpoints
4. Recommend pipeline configuration
Per-stage execution:
Execute stage {stage_name}:
1. Run the appropriate command
2. Check for errors
3. Report statistics
4. Save checkpoint
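The two prompts above contain `{source_path}` and `{stage_name}` placeholders, so they could be stored as templates and filled in with `str.format`. The constant and function names here are hypothetical.

```python
ASSESSMENT_PROMPT = """Analyze the source data at {source_path}:
1. Count total examples
2. Estimate processing time
3. Check for existing checkpoints
4. Recommend pipeline configuration"""

STAGE_PROMPT = """Execute stage {stage_name}:
1. Run the appropriate command
2. Check for errors
3. Report statistics
4. Save checkpoint"""


def render_assessment_prompt(source_path: str) -> str:
    """Fill the initial-assessment template with the source location."""
    return ASSESSMENT_PROMPT.format(source_path=source_path)


def render_stage_prompt(stage_name: str) -> str:
    """Fill the per-stage template with a stage name such as "4_dedup"."""
    return STAGE_PROMPT.format(stage_name=stage_name)
```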
Checkpointing
CHECKPOINT_SCHEMA = {
    "stage": int,
    "stage_name": str,
    "processed": int,
    "total": int,
    "stats": {
        "kept": int,
        "filtered": int,
        "errors": int,
    },
    "timestamp": str,
}
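One way to use this schema is to validate a checkpoint before writing it and after reading it back. The sketch below assumes checkpoints are stored as JSON files; the function names and the write-then-rename approach are illustrative choices, not something this issue prescribes.

```python
import json
import os
from pathlib import Path
from typing import Optional

CHECKPOINT_SCHEMA = {
    "stage": int, "stage_name": str, "processed": int, "total": int,
    "stats": {"kept": int, "filtered": int, "errors": int},
    "timestamp": str,
}


def validate_checkpoint(checkpoint: dict, schema: dict = CHECKPOINT_SCHEMA) -> None:
    """Raise KeyError or TypeError if the checkpoint does not match the schema."""
    for key, expected in schema.items():
        if key not in checkpoint:
            raise KeyError(f"missing checkpoint field: {key}")
        value = checkpoint[key]
        if isinstance(expected, dict):
            if not isinstance(value, dict):
                raise TypeError(f"{key} must be a dict")
            validate_checkpoint(value, expected)
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is a subclass of int, so reject it explicitly.
            raise TypeError(f"{key} must be {expected.__name__}")


def save_checkpoint(path: Path, checkpoint: dict) -> None:
    """Validate, then write via a temporary file so a crash cannot leave a partial file."""
    validate_checkpoint(checkpoint)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(checkpoint, indent=2), encoding="utf-8")
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the stored checkpoint, or None if no checkpoint exists yet."""
    if not path.exists():
        return None
    checkpoint = json.loads(path.read_text(encoding="utf-8"))
    validate_checkpoint(checkpoint)
    return checkpoint
```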
Reports
The agent should generate:
- curation_report.json - Full statistics
- quality_distribution.png - Score histogram
- stage_timings.json - Performance metrics
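A rough sketch of producing the two JSON reports from per-stage checkpoints. The report layout (a `stages` mapping plus a `total_errors` count) and the `timings` argument are assumptions, since the issue names the files but not their contents. The histogram is left out because it would need a plotting library.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_reports(
    output_dir: Path,
    checkpoints: List[dict],     # one checkpoint dict per completed stage
    timings: Dict[str, float],   # assumed: seconds spent per stage name
) -> None:
    """Write curation_report.json and stage_timings.json into output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    report = {
        # Hypothetical layout: per-stage stats keyed by stage name.
        "stages": {c["stage_name"]: c["stats"] for c in checkpoints},
        "total_errors": sum(c["stats"]["errors"] for c in checkpoints),
    }
    (output_dir / "curation_report.json").write_text(
        json.dumps(report, indent=2), encoding="utf-8")
    (output_dir / "stage_timings.json").write_text(
        json.dumps(timings, indent=2), encoding="utf-8")
```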
Acceptance Criteria
Related