Transferable RL Policies for Downstream-Aware Automated Data Cleaning

Status: 🚧 Work in Progress — Research project in active development.

A reinforcement learning framework that learns data cleaning policies optimized for downstream machine learning task performance and transferable across datasets and error distributions.

Problem

Real-world datasets commonly contain missing values, incorrect values, duplicate records, and outliers. Current cleaning pipelines are largely manual and heuristic-driven. The best cleaning strategy depends on the dataset's characteristics, its error distribution, and the downstream ML task, so a pipeline hand-tuned for one dataset rarely carries over to another.

Central research question:

Can a machine automatically learn transferable cleaning policies that optimize downstream ML performance across different datasets?

Proposed Approach

An RL agent observes a dirty dataset and iteratively selects cleaning operations. The reward signal is derived directly from downstream ML model performance rather than from heuristic data quality metrics. Training across diverse datasets is intended to teach the agent when to apply which strategy, with the goal of a policy that transfers to unseen datasets without retraining.

Dirty Dataset
     │
     ▼
 RL Agent ──► Cleaning Action (impute / drop / clip / deduplicate)
     │
     ▼
Cleaned Dataset ──► Train ML Model ──► Performance Score
                                             │
                              Reward ◄───────┘
                                 │
                          Policy Update

Research Gap

| System | RL-based | Downstream-aware | Generalizes to unseen datasets |
|---|---|---|---|
| ActiveClean | ❌ | Partial (labels only) | ❌ |
| RLclean | ✅ | ❌ | ❌ |
| CleanSurvival | ✅ | Partial (survival tasks) | ❌ |
| This work | ✅ | ✅ | ✅ (goal) |

Planned Components

RL environment wrapping tabular datasets as MDP states
Cleaning action space (imputation, outlier handling, deduplication, type correction)
Downstream reward computation (accuracy / F1 / AUC on held-out split)
Transferable policy training across multiple source datasets
Zero-shot / few-shot evaluation on unseen datasets
SHAP-based explainability layer for cleaning decisions
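
For the first component above, one way to make MDP states comparable across datasets is to summarize each table as a fixed-length vector of dataset-level statistics, so the same policy network can read a table with any number of rows or columns. The sketch below is an assumption about how that could look, not the planned design: `dataset_state` and the five features it returns are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

def dataset_state(df: pd.DataFrame) -> np.ndarray:
    """Summarize a table as a fixed-length vector, independent of its shape.

    Features (all hypothetical choices): fraction of missing cells, fraction
    of duplicated rows, fraction of numeric cells outside the 1.5*IQR fences,
    fraction of numeric columns, and log(1 + number of rows).
    """
    numeric = df.select_dtypes(include="number")
    missing_frac = float(df.isna().to_numpy().mean()) if df.size else 0.0
    duplicate_frac = float(df.duplicated().mean()) if len(df) else 0.0

    if numeric.shape[1] and len(numeric):
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        # Comparisons against NaN are False, so missing cells are not outliers.
        outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
        outlier_frac = float(outside.to_numpy().mean())
    else:
        outlier_frac = 0.0

    numeric_frac = numeric.shape[1] / df.shape[1] if df.shape[1] else 0.0
    return np.array(
        [missing_frac, duplicate_frac, outlier_frac, numeric_frac, np.log1p(len(df))],
        dtype=np.float32,
    )

# Two tables with different shapes map to vectors of the same length.
small = pd.DataFrame({"x": [1.0, np.nan, 3.0, 3.0], "y": ["a", "b", "c", "c"]})
wide = pd.DataFrame(np.arange(60, dtype=float).reshape(10, 6))
print(dataset_state(small).shape, dataset_state(wide).shape)
```

A vector like this could serve as the observation returned by a Gymnasium-style environment, with the cleaning operations as a discrete action space. Whether such coarse statistics carry enough information for a policy to transfer is an open question for the project.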

Tech Stack (planned)

Python 3.10+
Stable-Baselines3 — RL algorithms
Gymnasium — RL environment API
scikit-learn — downstream ML models
SHAP — explainability
pandas / numpy — data handling

Repository Structure (evolving)

rl-data-cleaning/
├── README.md
├── docs/
│   ├── research-overview.md     # Detailed problem + approach
│   ├── related-work.md          # Literature review notes
│   └── learning-notes.md        # Study notes (RL, MDPs, SHAP, etc.)
├── src/                         # Source code (coming soon)
├── notebooks/                   # Experiments and exploration (coming soon)
├── data/                        # Sample/benchmark datasets (coming soon)
└── results/                     # Evaluation results (coming soon)

References

Progress Log

Updates will be added here as the project develops.

| Date | Milestone |
|---|---|
| TBD | Initial repo setup |

License

MIT License — see LICENSE for details.
