MedAdhereAI is a research-grade machine learning pipeline built to predict the risk of medication non-adherence among patients with chronic conditions like diabetes and hypertension. Using real-world claims and refill data from a developing nation's healthcare system, the project aims to deliver interpretable, deployable, and publishable adherence prediction models.
Preprint published on medRxiv
Citation:
Subash Yadav, Saijal Rajbhandari. MedAdhereAI: An Interpretable Machine Learning Pipeline for Predicting Medication Non-Adherence in Chronic Disease Patients Using Real-World Refill Data. medRxiv 2025.07.01.25330675; doi: 10.1101/2025.07.01.25330675
Read the preprint: https://doi.org/10.1101/2025.07.01.25330675
- Predict whether a patient will adhere to prescribed medications
- Engineer features from real claim-level refill data
- Focus on chronic conditions in resource-constrained settings
- Use interpretable AI via SHAP to understand drivers of non-adherence
- Support reproducible publication with notebooks and scripts
- Cleaned and loaded raw diabetes adherence dataset
- Created binary adherence target using domain threshold (≥ 8)
- Converted date columns to datetime for time-based feature creation
- Engineered temporal features: time between service, assess, and refill dates
- Computed refill gaps per patient and visualized refill behavior trends
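The target and temporal-feature steps above can be sketched roughly as follows. This is a minimal illustration on toy rows, not the project's notebook code: the column names `PATIENT_ID`, `SERVICE_DATE`, `REFILL_DATE`, and `ADHERENCE_SCORE` are assumptions, and the README does not say which score the "≥ 8" threshold applies to.

```python
import pandas as pd

# Toy claim-level rows; every column name here is a hypothetical stand-in.
claims = pd.DataFrame({
    "PATIENT_ID": [1, 1, 1, 2, 2, 3],
    "SERVICE_DATE": ["2023-01-03", "2023-02-01", "2023-03-18",
                     "2023-01-10", "2023-02-20", "2023-03-30"],
    "REFILL_DATE": ["2023-01-05", "2023-02-04", "2023-03-20",
                    "2023-01-10", "2023-02-25", "2023-04-01"],
    "ADHERENCE_SCORE": [9, 8, 6, 7, 8, 10],
})

ADHERENCE_THRESHOLD = 8  # the ">= 8" cutoff; the scored quantity is an assumption

# Binary target: 1 = adherent, 0 = non-adherent.
claims["ADHERENT_BINARY"] = (claims["ADHERENCE_SCORE"] >= ADHERENCE_THRESHOLD).astype(int)

# Parse date strings so that subtraction yields real time deltas.
for col in ["SERVICE_DATE", "REFILL_DATE"]:
    claims[col] = pd.to_datetime(claims[col])

# Temporal feature: days between the service date and the refill date.
claims["days_service_to_refill"] = (claims["REFILL_DATE"] - claims["SERVICE_DATE"]).dt.days

# Refill gap: days since the same patient's previous refill (NaN on the first visit).
claims = claims.sort_values(["PATIENT_ID", "REFILL_DATE"])
claims["refill_gap"] = claims.groupby("PATIENT_ID")["REFILL_DATE"].diff().dt.days
```

Sorting before `diff()` matters here: the gap is only meaningful when each patient's rows are in chronological order.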
- Aggregated refill behavior features per patient: `avg_refill_gap`, `max_refill_gap`, `total_visits`
- Merged most recent binary adherence label (`ADHERENT_BINARY`) per patient
- Enriched dataset with demographic features: `GENDER` and `AGE`
- Handled missing values:
  - Refill gaps filled with 0.0 for single-visit patients
- Dropped intermediate date fields after transformation
- Exported cleaned dataset as `.csv` and `.pkl` for modeling
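A rough sketch of this patient-level aggregation, assuming a visit-level table that already carries a per-visit `refill_gap`. The `PATIENT_ID` and `REFILL_DATE` columns, the toy values, and the output file names are all hypothetical; only the feature and label names come from the list above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Visit-level toy data (hypothetical values).
visits = pd.DataFrame({
    "PATIENT_ID": [1, 1, 1, 2, 3, 3],
    "REFILL_DATE": pd.to_datetime(["2023-01-05", "2023-02-04", "2023-03-20",
                                   "2023-01-10", "2023-02-01", "2023-03-03"]),
    "refill_gap": [float("nan"), 30.0, 44.0, float("nan"), float("nan"), 30.0],
    "ADHERENT_BINARY": [1, 1, 0, 1, 0, 1],
    "GENDER": ["F", "F", "F", "M", "F", "F"],
    "AGE": [61, 61, 61, 54, 47, 47],
}).sort_values(["PATIENT_ID", "REFILL_DATE"])

# One row per patient: refill-behavior aggregates.
features = visits.groupby("PATIENT_ID").agg(
    avg_refill_gap=("refill_gap", "mean"),
    max_refill_gap=("refill_gap", "max"),
    total_visits=("REFILL_DATE", "count"),
)

# Most recent label and demographics, taken from each patient's last visit.
latest = visits.groupby("PATIENT_ID")[["ADHERENT_BINARY", "GENDER", "AGE"]].last()
patients = features.join(latest).reset_index()

# Single-visit patients have no gap to measure, so their gap features become 0.0.
gap_cols = ["avg_refill_gap", "max_refill_gap"]
patients[gap_cols] = patients[gap_cols].fillna(0.0)

# Persist in both formats (a temporary directory and placeholder file names).
out_dir = Path(tempfile.mkdtemp())
patients.to_csv(out_dir / "patient_features.csv", index=False)
patients.to_pickle(out_dir / "patient_features.pkl")
```

Filling with 0.0 keeps single-visit patients in the table, at the cost of making "no gap observed" look like "zero-day gap"; `total_visits` stays available to tell the two apart.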
- Trained baseline models:
  - Logistic Regression (ROC AUC: 0.82)
  - Random Forest (ROC AUC: 0.77)
- Performed evaluation with accuracy, F1-score, and ROC AUC
- Validated model stability using 5-fold cross-validation
- Assessed probability calibration via Brier score and calibration curve
- Applied SHAP for local and global explainability
- Exported trained models for deployment (`.pkl` format)
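The training and evaluation steps above might look roughly like this with scikit-learn. The data below is synthetic and generated purely for illustration, so the numbers it produces are unrelated to the AUC and Brier values reported in this README; the hyperparameters and the 75/25 split are assumptions as well.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient-level table (NOT the project's data).
rng = np.random.default_rng(0)
n = 600
X = np.column_stack([
    rng.gamma(2.0, 15.0, n),   # avg_refill_gap in days
    rng.integers(1, 12, n),    # total_visits
    rng.integers(30, 85, n),   # AGE
])
# Assumed signal: longer gaps lower the odds of adherence, more visits raise them.
logit = 1.5 - 0.04 * X[:, 0] + 0.15 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated ROC AUC on the training split as a stability check.
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)
    results[name] = {
        "cv_auc_mean": cv_auc.mean(),
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "test_auc": roc_auc_score(y_test, proba),
        "brier": brier_score_loss(y_test, proba),
    }

# Points of the calibration curve for the logistic model (observed vs. predicted rate per bin).
lr_proba = models["logistic_regression"].predict_proba(X_test)[:, 1]
frac_positive, mean_predicted = calibration_curve(y_test, lr_proba, n_bins=5)

# Reference point: a constant prediction at the training prevalence.
prevalence = y_train.mean()
baseline_brier = brier_score_loss(y_test, np.full(len(y_test), prevalence))
```

A Brier score is easiest to read against such a reference: a constant prediction at prevalence p scores about p(1 - p), which for the 78% / 22% label split reported in this README is 0.78 × 0.22 ≈ 0.1716.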
- All visualizations (SHAP plots, feature importance, calibration curve) completed
- Public health impact framing and interpretation added
- Ready for research publication, deployment, and stakeholder engagement
All project phases are complete as per the attached documentation and deliverables.
MedAdhereAI/
├── dataset/
│   └── raw/                              # Real-world data (CSV files, not committed)
├── notebooks/
│   ├── 01_data_exploration.ipynb         # ✅ EDA + target engineering (complete)
│   ├── 02_feature_engineering.ipynb      # ⏳ Feature aggregation (in progress)
│   ├── 03_model_training.ipynb           # ⏳ Model building
│   └── 04_model_explainability.ipynb     # ⏳ SHAP analysis
├── scripts/
│   ├── data_cleaning.py                  # Placeholder for modular code
│   ├── feature_engineering.py
│   ├── model_utils.py
│   └── shap_explainer.py
├── reports/
│   └── figures/                          # Output graphs, charts
├── requirements.txt                      # Python packages
├── .gitignore
├── README.md                             # This file
└── LICENSE                               # MIT License

- Python 3.11
- Pandas, NumPy
- scikit-learn, XGBoost
- SHAP
- Jupyter Notebook
# 1. Clone the repo
git clone https://github.com/mathachew7/MedAdhereAI.git
cd MedAdhereAI
# 2. Create & activate virtual environment
python3.11 -m venv .venv
source .venv/bin/activate
# 3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 4. Launch notebooks
jupyter notebook

- ADHERENT_BINARY label (78% adherent / 22% non-adherent)
- Logistic Regression AUC: 0.82 | Random Forest AUC: 0.77
- Brier Score: 0.1749 (well-calibrated)
- SHAP summary: `total_visits`, `AGE`, `refill_gap` as top predictors
- Logistic Regression Coefficients: Bar plot
- Random Forest Feature Importance: Bar plot
- SHAP Local Explanations: Individual-level interpretability
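To illustrate what the local and global SHAP outputs listed above represent, here is a small NumPy sketch for a linear model on the log-odds scale. It does not call the `shap` package; it uses the closed form that holds for linear models when features are treated as independent, where each contribution is the coefficient times the feature's deviation from its background mean. The coefficients, intercept, and patient rows are made-up values.

```python
import numpy as np

feature_names = ["total_visits", "AGE", "refill_gap"]

# Hypothetical patient rows and logistic-regression parameters (log-odds scale).
X = np.array([
    [3.0, 61.0, 37.0],
    [1.0, 54.0, 0.0],
    [8.0, 47.0, 30.0],
    [5.0, 70.0, 52.0],
])
coef = np.array([0.20, -0.01, -0.03])
intercept = 0.5


def linear_shap(X, coef):
    """SHAP values of a linear model under feature independence: w_j * (x_j - mean_j)."""
    return (X - X.mean(axis=0)) * coef


shap_values = linear_shap(X, coef)

# Expected log-odds over the background data, and each patient's actual log-odds.
base_value = intercept + X.mean(axis=0) @ coef
log_odds = intercept + X @ coef

# Local explanation: per-feature contributions for the first patient.
local_explanation = dict(zip(feature_names, shap_values[0]))

# Global importance: mean absolute contribution of each feature across patients.
global_importance = dict(zip(feature_names, np.abs(shap_values).mean(axis=0)))
```

The defining property is additivity: for every patient, `base_value` plus the sum of that patient's SHAP values equals the model's log-odds output. Tree models such as the Random Forest need a different algorithm, which is where a dedicated SHAP implementation becomes necessary.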
All outputs, visuals, and impact framing have been generated and included in the documentation.
Medication non-adherence contributes to over $300 billion in preventable U.S. healthcare costs annually.
This project provides an interpretable system to flag patients at risk of skipping essential medications using refill behavior and minimal demographic data.
This supports:
- Early risk stratification
- Targeted outreach and follow-ups
- Clinically explainable, data-driven care optimization
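As a sketch of how risk stratification and targeted outreach could follow from model output, the snippet below bins predicted non-adherence probabilities into tiers and builds an outreach list. The cut points (0.3 and 0.6), the column names, and the scores are hypothetical; real thresholds would need to be set with clinical input and outreach capacity in mind.

```python
import pandas as pd

# Hypothetical model output: one predicted non-adherence probability per patient.
scored = pd.DataFrame({
    "PATIENT_ID": [101, 102, 103, 104, 105],
    "p_non_adherent": [0.08, 0.35, 0.62, 0.91, 0.49],
})

# Illustrative tier boundaries: [0, 0.3], (0.3, 0.6], (0.6, 1.0].
bins = [0.0, 0.3, 0.6, 1.0]
labels = ["low", "medium", "high"]
scored["risk_tier"] = pd.cut(
    scored["p_non_adherent"], bins=bins, labels=labels, include_lowest=True
)

# Outreach list: high-risk patients, highest predicted risk first.
outreach_list = (
    scored[scored["risk_tier"] == "high"]
    .sort_values("p_non_adherent", ascending=False)
)
```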
This project is licensed under the MIT License.
- Dataset by researchers on Mendeley Data
- Built by Subash Yadav for real-world predictive health research
For collaboration or publication inquiries:
Email: subashyadav7@outlook.com