feat(agents): Create data-curator agent for A-grade pipeline orchestration #311

## Summary

Create a `data-curator` agent that orchestrates the complete A-grade 9-stage data curation pipeline.

## Context

The A-grade pipeline has 9 stages, each relying on its own tooling. A dedicated agent can run the stages in order, track checkpoints, and recover from failures without manual coordination.

## Implementation Approach

Create the agent in `.claude/agents/data-curator.md`:

### Agent Responsibilities

1. **Assess source data** - Analyze input files, estimate pipeline requirements
2. **Run 9-stage pipeline** - Execute each stage in order with checkpoints
3. **Generate reports** - Quality distribution, dedup stats, contamination results
4. **Handle failures** - Resume from checkpoints, retry failed items
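
The "retry failed items" responsibility could look roughly like the sketch below. The helper name `retry_item`, the attempt count, and the exponential backoff are assumptions for illustration; the issue does not specify a retry policy.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_item(
    fn: Callable[[], T],
    attempts: int = 3,            # assumed default, not specified in the issue
    base_delay: float = 1.0,      # seconds before the first retry (assumed)
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or the attempts are exhausted.

    The delay doubles after each failure. The last exception is re-raised
    so the caller can record the item as an error in the checkpoint stats.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.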

### Pipeline Stages

```python
PIPELINE_STAGES = [
    "1_extract",         # Persona-driven extraction
    "2_prefilter",       # KenLM perplexity filter
    "3_score",           # Multi-dimensional quality scoring
    "4_dedup",           # Bloom + fuzzy deduplication
    "5_decontaminate",   # Benchmark contamination removal
    "6_filter",          # Quality threshold filtering
    "7_generate",        # DPO, RLVR, anti-hallucination
    "8_mix",             # Weighted dataset mixing
    "9_validate",        # Final validation checks
]
```
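
As a minimal sketch of "execute each stage in order", the loop below walks the stage list and skips anything already completed. The `handlers` mapping and the `resume_after` parameter are hypothetical; the real stage commands are not defined in this issue.

```python
from typing import Callable, Dict, List

PIPELINE_STAGES = [
    "1_extract", "2_prefilter", "3_score", "4_dedup", "5_decontaminate",
    "6_filter", "7_generate", "8_mix", "9_validate",
]


def run_pipeline(
    handlers: Dict[str, Callable[[], None]],
    resume_after: int = 0,
) -> List[str]:
    """Run stages in order and return the names of the stages executed.

    handlers maps each stage name to a callable that runs it (hypothetical).
    resume_after is the number of the last completed stage, 0 for a fresh run.
    An exception from a handler stops the run at that stage.
    """
    executed = []
    for number, stage in enumerate(PIPELINE_STAGES, start=1):
        if number <= resume_after:
            continue  # already completed in an earlier run
        handlers[stage]()
        executed.append(stage)
    return executed
```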

### Agent Tools

The agent should have access to:
- `Bash` - Run curation commands
- `Read`, `Write` - Handle data files
- `Grep`, `Glob` - Find and analyze files
- `Task` - Spawn sub-agents for parallel work

### Invocation

```
/curate --source books/ --output data/curated/ --budget 20.0
```
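
If the arguments end up being parsed by a Python entry point, the flags above could be handled as in this sketch. The parser itself is an assumption; the issue only shows the flags, and it does not state the unit of `--budget`, so the value is kept as a plain float.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the three flags shown in the invocation example."""
    parser = argparse.ArgumentParser(prog="curate")
    parser.add_argument("--source", required=True, help="directory of source files")
    parser.add_argument("--output", required=True, help="directory for curated data")
    # Unit of the budget is not specified in the issue; parsed as a float.
    parser.add_argument("--budget", type=float, default=None, help="budget limit")
    return parser


args = build_parser().parse_args(
    ["--source", "books/", "--output", "data/curated/", "--budget", "20.0"]
)
```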

### Agent Prompts

**Initial assessment:**
```
Analyze the source data at {source_path}:
1. Count total examples
2. Estimate processing time
3. Check for existing checkpoints
4. Recommend pipeline configuration
```

**Per-stage execution:**
```
Execute stage {stage_name}:
1. Run the appropriate command
2. Check for errors
3. Report statistics
4. Save checkpoint
```
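
The two templates can be filled with plain `str.format`, as sketched below. The constant and function names are illustrative only.

```python
ASSESSMENT_PROMPT = """Analyze the source data at {source_path}:
1. Count total examples
2. Estimate processing time
3. Check for existing checkpoints
4. Recommend pipeline configuration"""

STAGE_PROMPT = """Execute stage {stage_name}:
1. Run the appropriate command
2. Check for errors
3. Report statistics
4. Save checkpoint"""


def render_assessment_prompt(source_path: str) -> str:
    """Fill the initial-assessment template with the source path."""
    return ASSESSMENT_PROMPT.format(source_path=source_path)


def render_stage_prompt(stage_name: str) -> str:
    """Fill the per-stage template with a stage name such as '4_dedup'."""
    return STAGE_PROMPT.format(stage_name=stage_name)
```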

### Checkpointing

```python
CHECKPOINT_SCHEMA = {
    "stage": int,
    "stage_name": str,
    "processed": int,
    "total": int,
    "stats": {
        "kept": int,
        "filtered": int,
        "errors": int,
    },
    "timestamp": str,
}
```
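
One way to persist checkpoints that follow this schema is sketched below. The function names, the JSON file format, and the write-then-rename strategy are assumptions rather than part of the issue.

```python
import json
import os
import tempfile
from pathlib import Path

CHECKPOINT_SCHEMA = {
    "stage": int, "stage_name": str, "processed": int, "total": int,
    "stats": {"kept": int, "filtered": int, "errors": int},
    "timestamp": str,
}


def validate(data, schema=CHECKPOINT_SCHEMA):
    """Raise ValueError if data does not match the schema's keys and types."""
    if not isinstance(data, dict) or set(data) != set(schema):
        raise ValueError(f"expected keys {sorted(schema)}")
    for key, expected in schema.items():
        if isinstance(expected, dict):
            validate(data[key], expected)  # nested section such as "stats"
        elif not isinstance(data[key], expected):
            raise ValueError(f"{key!r} should be {expected.__name__}")


def save_checkpoint(path, checkpoint):
    """Validate, then write to a temp file and rename it into place."""
    validate(checkpoint)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(checkpoint, handle, indent=2)
    os.replace(tmp_name, path)  # avoids leaving a half-written checkpoint


def load_checkpoint(path):
    """Return the checkpoint dict, or None if no checkpoint file exists."""
    path = Path(path)
    if not path.exists():
        return None
    checkpoint = json.loads(path.read_text(encoding="utf-8"))
    validate(checkpoint)
    return checkpoint
```

On resume, the agent could read the `stage` field of the loaded checkpoint and continue with the next stage.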

### Reports

The agent should generate:
- `curation_report.json` - Full statistics
- `quality_distribution.png` - Score histogram
- `stage_timings.json` - Performance metrics
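
The two JSON reports could be assembled from per-stage checkpoints roughly as follows. The report layout (`stages`, `final_kept`, `total_errors`) is an assumption, since the issue names the files but not their structure; the histogram image is left out because it needs a plotting library.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_curation_report(checkpoints: List[dict]) -> dict:
    """Summarize checkpoints (one per stage, in pipeline order)."""
    stages = {
        c["stage_name"]: {**c["stats"], "processed": c["processed"], "total": c["total"]}
        for c in checkpoints
    }
    return {
        "stages": stages,
        # Items surviving the last recorded stage (0 if nothing ran).
        "final_kept": checkpoints[-1]["stats"]["kept"] if checkpoints else 0,
        "total_errors": sum(c["stats"]["errors"] for c in checkpoints),
    }


def write_reports(output_dir: str, checkpoints: List[dict], timings: Dict[str, float]) -> None:
    """Write curation_report.json and stage_timings.json into output_dir.

    timings maps stage name to elapsed seconds (assumed unit).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = build_curation_report(checkpoints)
    (out / "curation_report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
    (out / "stage_timings.json").write_text(json.dumps(timings, indent=2), encoding="utf-8")
```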

## Acceptance Criteria

- [ ] Agent definition in `.claude/agents/`
- [ ] All 9 stages implemented
- [ ] Checkpoint/resume support
- [ ] Quality report generation
- [ ] Error handling and retry
- [ ] Parallel stage support where applicable

## Related

- Issue #305 (data-curation-workflow skill)
- Issue #310 (quality-scoring skill)
