Status: 🚧 Work in Progress — Research project in active development.
A reinforcement learning framework that learns transferable data cleaning policies optimized for downstream machine learning task performance — across different datasets and error distributions.
Real-world datasets commonly contain missing values, incorrect values, duplicate records, and outliers. Current cleaning pipelines are largely manual and heuristic-driven. The best cleaning strategy depends on the dataset's characteristics, its error distribution, and the downstream ML task, so a pipeline hand-tuned for one dataset rarely generalizes to another.
Central research question:
Can a machine automatically learn transferable cleaning policies that optimize downstream ML performance across different datasets?
An RL agent observes a dirty dataset and iteratively selects cleaning operations. The reward signal is derived directly from downstream ML model performance, not from heuristic data quality metrics. Training across diverse datasets is intended to teach the agent when to apply which strategy, with the goal of a policy that transfers to unseen datasets without retraining.
Dirty Dataset
      │
      ▼
  RL Agent ──► Cleaning Action (impute / drop / clip / deduplicate)
      │
      ▼
Cleaned Dataset ──► Train ML Model ──► Performance Score
                                               │
                                Reward ◄───────┘
                                   │
                             Policy Update
| System | RL-based | Downstream-aware | Generalizes to unseen datasets |
|---|---|---|---|
| ActiveClean | ❌ | Partial (labels only) | ❌ |
| RLclean | ✅ | ❌ | ❌ |
| CleanSurvival | ✅ | Partial (survival tasks) | ❌ |
| This work | ✅ | ✅ | ✅ (goal) |
- RL environment wrapping tabular datasets as MDP states
- Cleaning action space (imputation, outlier handling, deduplication, type correction)
- Downstream reward computation (accuracy / F1 / AUC on a held-out split)
- Transferable policy training across multiple source datasets
- Zero-shot / few-shot evaluation on unseen datasets
- SHAP-based explainability layer for cleaning decisions
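For a policy to transfer across datasets with different schemas, the MDP state likely needs a fixed-size representation that does not depend on column names or counts. The sketch below shows one way this could look, summarizing a table as four dataset-level ratios. The function name `dataset_state` and the chosen features are assumptions for illustration; the project may encode states differently.

```python
import numpy as np
import pandas as pd


def dataset_state(df: pd.DataFrame) -> np.ndarray:
    """Summarize a table as a fixed-length vector of error-related ratios."""
    numeric = df.select_dtypes(include="number")

    # Share of cells that are missing.
    missing_ratio = float(df.isna().to_numpy().mean()) if df.size else 0.0

    # Share of rows that exactly repeat an earlier row.
    duplicate_ratio = float(df.duplicated().mean()) if len(df) else 0.0

    # Share of numeric cells outside the 1.5 * IQR fences.
    if numeric.size:
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
        outlier_ratio = float(outside.to_numpy().mean())
    else:
        outlier_ratio = 0.0

    # Share of columns that are numeric.
    numeric_share = numeric.shape[1] / df.shape[1] if df.shape[1] else 0.0

    return np.array(
        [missing_ratio, duplicate_ratio, outlier_ratio, numeric_share],
        dtype=np.float32,
    )


if __name__ == "__main__":
    narrow = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})
    wide = pd.DataFrame(np.random.default_rng(0).normal(size=(50, 8)))
    # Both tables map to vectors of the same length, whatever their schema.
    print(dataset_state(narrow).shape, dataset_state(wide).shape)
```

Because every dataset maps to a vector of the same length, the same policy network can be applied to tables it has never seen. Richer states (per-column statistics, the history of applied actions) would carry more information but need a scheme for handling variable column counts.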
- Python 3.10+
- Stable-Baselines3 — RL algorithms
- Gymnasium — RL environment API
- scikit-learn — downstream ML models
- SHAP — explainability
- pandas / numpy — data handling
rl-data-cleaning/
├── README.md
├── docs/
│   ├── research-overview.md    # Detailed problem + approach
│   ├── related-work.md         # Literature review notes
│   └── learning-notes.md       # Study notes (RL, MDPs, SHAP, etc.)
├── src/                        # Source code (coming soon)
├── notebooks/                  # Experiments and exploration (coming soon)
├── data/                       # Sample/benchmark datasets (coming soon)
└── results/                    # Evaluation results (coming soon)
- ActiveClean: Interactive Data Cleaning For Statistical Models
- RLclean: An Unsupervised Integrated Data Cleaning Framework
- CleanSurvival (DOI:10.1016/j.ins.2024.121281)
- LLaPipe / HaiPipe – LLM and Human-AI pipelines
Updates will be added here as the project develops.
| Date | Milestone |
|---|---|
| TBD | Initial repo setup |
MIT License — see LICENSE for details.