MrAshwin2142/rl-data-cleaning

Transferable RL Policies for Downstream-Aware Automated Data Cleaning

Status: 🚧 Work in Progress — Research project in active development.

A reinforcement learning framework that aims to learn transferable data-cleaning policies, optimized for downstream machine learning performance, across different datasets and error distributions.


Problem

Real-world datasets commonly contain missing values, incorrect values, duplicate records, and outliers. Current cleaning pipelines are largely manual and heuristic-driven. The optimal cleaning strategy depends on the dataset's characteristics, its error distribution, and the downstream ML task, so a pipeline hand-tuned for one dataset rarely carries over to another.

Central research question:

Can a machine automatically learn transferable cleaning policies that optimize downstream ML performance across different datasets?
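One way to make this question precise is to write it as an optimization objective. The notation below is an assumption introduced here for illustration and is not taken from the project: $\mathcal{P}$ is a distribution over dirty datasets, $C_\pi(D)$ is the dataset obtained by applying the cleaning actions chosen by policy $\pi$ to $D$, $f_{C_\pi(D)}$ is a downstream model trained on that cleaned data, and $M$ is a performance metric evaluated on a held-out split $D^{\text{test}}$.

```latex
\pi^{*} \;=\; \arg\max_{\pi} \;
\mathbb{E}_{D \sim \mathcal{P}}
\Big[ M\big(f_{C_\pi(D)},\; D^{\text{test}}\big) \Big]
```

Under this reading, "transferable" means that $\pi^{*}$ should also score well on datasets drawn from $\mathcal{P}$ that were never seen during training.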


Proposed Approach

An RL agent observes a dirty dataset and iteratively selects cleaning operations. The reward signal is derived directly from downstream ML model performance — not from heuristic data quality metrics. Training across diverse datasets allows the agent to learn when to apply which strategy, producing a policy that transfers to unseen datasets without retraining.

Dirty Dataset
     │
     ▼
 RL Agent ──► Cleaning Action (impute / drop / clip / deduplicate)
     │
     ▼
Cleaned Dataset ──► Train ML Model ──► Performance Score
                                             │
                              Reward ◄───────┘
                                 │
                          Policy Update
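The loop in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the `CleaningEnv` class, the toy rows, the fixed clipping bounds, the nearest-centroid "downstream model", and the choice of reward as the change in held-out accuracy are all hypothetical stand-ins.

```python
from statistics import mean

# Each row is (feature, label); None marks a missing feature value.

def impute_mean(rows):
    """Replace missing features with the mean of the observed ones."""
    observed = [x for x, _ in rows if x is not None]
    fill = mean(observed) if observed else 0.0
    return [(fill if x is None else x, y) for x, y in rows]

def drop_missing(rows):
    """Drop rows whose feature is missing."""
    return [(x, y) for x, y in rows if x is not None]

def clip_outliers(rows, low=0.0, high=10.0):
    """Clip observed features to a fixed range (bounds are illustrative)."""
    return [(x if x is None else min(max(x, low), high), y) for x, y in rows]

def deduplicate(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    return list(dict.fromkeys(rows))

ACTIONS = {
    "impute": impute_mean,
    "drop": drop_missing,
    "clip": clip_outliers,
    "deduplicate": deduplicate,
}

def downstream_score(train_rows, test_rows):
    """Held-out accuracy of a nearest-centroid classifier (toy downstream model)."""
    usable = [(x, y) for x, y in train_rows if x is not None]
    centroids = {
        label: mean(x for x, y in usable if y == label)
        for label in {y for _, y in usable}
    }
    if not centroids:
        return 0.0
    correct = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in test_rows
    )
    return correct / len(test_rows)

class CleaningEnv:
    """Applies one cleaning action per step; reward is the change in downstream score."""

    def __init__(self, dirty_rows, test_rows):
        self.rows = list(dirty_rows)
        self.test_rows = test_rows
        self.score = downstream_score(self.rows, self.test_rows)

    def step(self, action):
        self.rows = ACTIONS[action](self.rows)
        new_score = downstream_score(self.rows, self.test_rows)
        reward = new_score - self.score
        self.score = new_score
        return reward

# Toy data: one extreme outlier, one missing value, one duplicate row.
dirty_train = [(1.0, 0), (2.0, 0), (1000.0, 0), (8.0, 1), (9.0, 1), (None, 1), (8.0, 1)]
held_out = [(1.5, 0), (2.5, 0), (8.5, 1), (7.5, 1)]

env = CleaningEnv(dirty_train, held_out)
for action in ["clip", "deduplicate"]:
    print(action, env.step(action))
```

In this toy setup the outlier drags the class-0 centroid far from the class-0 test points, so clipping earns a positive reward while deduplication changes nothing. A learned policy would replace the fixed action list in the final loop.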

Research Gap

System | RL-based | Downstream-aware | Generalizes to unseen datasets
ActiveClean: Partial (labels only)
RLclean
CleanSurvival: Partial (survival tasks)
This work: ✅ (goal)

Planned Components

  • RL environment wrapping tabular datasets as MDP states
  • Cleaning action space (imputation, outlier handling, deduplication, type correction)
  • Downstream reward computation (accuracy / F1 / AUC on held-out split)
  • Transferable policy training across multiple source datasets
  • Zero-shot / few-shot evaluation on unseen datasets
  • SHAP-based explainability layer for cleaning decisions
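For a policy to transfer across tables with different schemas, the MDP state probably needs to be a fixed-length description of the dataset rather than the raw rows. The sketch below shows one possible encoding, assuming three hypothetical meta-features (missing rate, duplicate rate, and a simple 2-sigma outlier rate). The function name `dataset_state` and the choice of features are illustrative assumptions, not the project's design.

```python
from statistics import mean, pstdev

def dataset_state(rows):
    """Summarize a table as (missing_rate, duplicate_rate, outlier_rate).

    `rows` is a list of equal-length tuples; None marks a missing cell.
    The result has the same length whatever the table's shape.
    """
    n_cells = sum(len(r) for r in rows)
    if not rows or n_cells == 0:
        return (0.0, 0.0, 0.0)

    missing = sum(v is None for r in rows for v in r)
    duplicates = len(rows) - len(set(rows))

    # Outliers: numeric cells more than 2 population std devs from the column mean.
    outliers = 0
    numeric_cells = 0
    for column in zip(*rows):
        values = [v for v in column if isinstance(v, (int, float))]
        if len(values) < 2:
            continue
        mu, sigma = mean(values), pstdev(values)
        numeric_cells += len(values)
        if sigma > 0:
            outliers += sum(abs(v - mu) > 2 * sigma for v in values)

    outlier_rate = outliers / numeric_cells if numeric_cells else 0.0
    return (missing / n_cells, duplicates / len(rows), outlier_rate)

# Two tables with different shapes map to states of the same length.
narrow = [(1.0,)] * 9 + [(100.0,)]
wide = [(1.0, None), (2.0, 3.0)]
print(dataset_state(narrow))
print(dataset_state(wide))  # → (0.25, 0.0, 0.0)
```

Because the state depends only on error statistics and not on column names or counts, the same policy network or table could in principle be applied to a dataset it has never seen.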

Tech Stack (planned)


Repository Structure (evolving)

rl-data-cleaning/
├── README.md
├── docs/
│   ├── research-overview.md     # Detailed problem + approach
│   ├── related-work.md          # Literature review notes
│   └── learning-notes.md        # Study notes (RL, MDPs, SHAP, etc.)
├── src/                         # Source code (coming soon)
├── notebooks/                   # Experiments and exploration (coming soon)
├── data/                        # Sample/benchmark datasets (coming soon)
└── results/                     # Evaluation results (coming soon)

References


Progress Log

Updates will be added here as the project develops.

Date | Milestone
TBD | Initial repo setup

License

MIT License — see LICENSE for details.
