🔮 Drug Side Effect Predictor

This project predicts the potential side effects of a drug molecule from its name (used as a text proxy for its chemical structure) and its biological targets. It uses Natural Language Processing (NLP) techniques and multi-label supervised learning.

📋 Table of Contents

Project Overview
Repository Structure
Installation
Data Pipeline
Modeling and Performance
Dashboard Usage
Contributions of each member

🚀 Project Overview

The project merges data from STITCH, SIDER (MedDRA), and DrugBank to create a comprehensive dataset. This dataset associates:

Drug names.
Anatomical Therapeutic Chemical (ATC) codes.
Biological targets.
Indications and side effects.

An optimized Logistic Regression (OneVsRest) model is trained to predict the presence or absence of the 300 most frequent side effects.
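
This README does not show how the 300 labels are selected. One plausible approach, sketched below, counts how many drugs carry each side effect and keeps only the most frequent ones. The function names and the toy data are hypothetical, and `k=2` stands in for the project's 300.

```python
from collections import Counter


def top_k_labels(label_lists, k):
    """Return the k labels that occur in the largest number of drugs."""
    # set() ensures a label is counted at most once per drug.
    counts = Counter(label for labels in label_lists for label in set(labels))
    return [label for label, _ in counts.most_common(k)]


def filter_labels(label_lists, kept):
    """Drop every label that is not in the kept set, preserving order."""
    kept = set(kept)
    return [[label for label in labels if label in kept] for labels in label_lists]


# Hypothetical toy data: one list of side effects per drug.
side_effects = [
    ["nausea", "headache"],
    ["nausea", "rash"],
    ["nausea", "headache", "dizziness"],
]

kept = top_k_labels(side_effects, k=2)  # the project keeps 300 labels
filtered = filter_labels(side_effects, kept)
```

Restricting the label set this way trades coverage of rare side effects for labels that have enough positive examples to train on.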

📂 Repository Structure

Fusionetconcat.py: Script for initial merging of STITCH and DrugBank databases, featuring XML parsing and Fuzzy Matching.
Firstcleaning.py: Script for cleaning and formatting ATC codes.
data_cleaning.ipynb: Exploratory data analysis and final dataset cleaning.
word_embedding.ipynb: Data encoding (TF-IDF on character n-grams) and model training/comparison.
dashboard.py: Interactive Dash application to test the model in real-time.
models_export/: Directory containing trained models and vectorizers in .pkl format.
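
As an illustration of how .pkl artifacts such as those in models_export/ are typically written and read back, here is a minimal joblib round trip. The file name `name_vectorizer.pkl` and the two-drug training list are assumptions; the actual file names in the repository are not listed in this README.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a small vectorizer so there is something to export.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
vectorizer.fit(["omeprazole", "fluconazole"])

# A temporary directory stands in for the repository's models_export/ folder.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp) / "models_export"
    export_dir.mkdir()
    path = export_dir / "name_vectorizer.pkl"  # hypothetical file name
    joblib.dump(vectorizer, path)
    restored = joblib.load(path)

same_vocabulary = restored.vocabulary_ == vectorizer.vocabulary_
```

Pickled scikit-learn objects are best loaded with the same library version that wrote them, and only from sources you trust.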

🛠 Installation

Clone the repository.
Install the necessary dependencies:

pip install pandas numpy scikit-learn dash joblib matplotlib plotly rapidfuzz

Ensure you have the source data files or the generated cleaned_drug_side_effects.csv.

📊 Data Pipeline

1. Preprocessing

XML Parsing: Extraction of targets and descriptions from the DrugBank database.
Fuzzy Matching: Reconciling drug names between STITCH and DrugBank using the token_sort_ratio algorithm.
Cleaning: Removal of missing values and filtering of non-exploitable columns such as IDs and long descriptions.
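
The name-reconciliation step can be illustrated with a dependency-free sketch. The project uses the token_sort_ratio algorithm (available in rapidfuzz); the version below swaps in Python's standard difflib while keeping the same token-sorting idea, so its scores will not match rapidfuzz exactly. The function names, the cutoff of 85, and the sample names are assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, split on whitespace, and sort tokens so word order is ignored."""
    return " ".join(sorted(name.lower().split()))


def token_sort_similarity(a, b):
    """Similarity in [0, 100] between two names after token sorting."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name, candidates, cutoff=85.0):
    """Return (candidate, score) for the best match at or above cutoff, else None."""
    scored = [(c, token_sort_similarity(name, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best is not None and best[1] >= cutoff else None


# Hypothetical names; the real inputs would come from STITCH and DrugBank.
drugbank_names = ["Acetylsalicylic acid", "Ibuprofen", "Metformin"]

match = best_match("acid acetylsalicylic", drugbank_names)
no_match = best_match("unknown compound", drugbank_names)
```

Sorting the tokens before comparing makes "acid acetylsalicylic" and "Acetylsalicylic acid" identical, while the cutoff keeps unrelated names from being merged.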

2. Encoding (Embedding)

Biological Targets: Binary encoding via MultiLabelBinarizer.
Molecule Names: TF-IDF on character n-grams (3 to 5 characters) to capture chemical roots such as -zole or meth-.
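
The two encodings above can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions: `analyzer="char"` is a guess (the notebook may use `"char_wb"` or other settings), the target identifiers are placeholders, and stacking the two feature blocks side by side is one plausible way to combine them.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy inputs: real drug names, placeholder target identifiers.
names = ["omeprazole", "fluconazole", "methotrexate"]
targets = [["target_a"], ["target_b"], ["target_c", "target_d"]]

# Character n-grams of length 3 to 5, weighted by TF-IDF.
name_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
X_names = name_vectorizer.fit_transform(names)

# One binary column per distinct target.
target_binarizer = MultiLabelBinarizer()
X_targets = target_binarizer.fit_transform(targets)

# Concatenate both blocks into a single sparse feature matrix.
X = hstack([X_names, csr_matrix(X_targets)]).tocsr()
```

Because the n-grams are taken over characters, fragments such as "zole" and "meth" become features of their own, which is how names sharing a chemical root end up with similar vectors.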

🤖 Modeling and Performance

The project compared several approaches:

Baseline (Logistic Regression): Optimized via GridSearchCV.
Random Forest.
Voting Classifier.

Result: The Logistic Regression model achieved the best performance, with a micro-averaged F1 score of approximately 0.49 on the top 300 side effects. With 300 labels to predict for every drug, this is a difficult multi-label problem, and the score should be read in that context.
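
A minimal sketch of the baseline setup, assuming a One-vs-Rest wrapper around Logistic Regression tuned with GridSearchCV and scored with micro-averaged F1. The synthetic data, the grid over `C`, and the 3-fold split are illustrative choices, not the project's actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data: 200 samples, 10 features, 3 labels.
# Each label is 1 when the matching feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = (X[:, :3] > 0).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# One binary Logistic Regression per label; C is tuned by cross-validation.
search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    param_grid={"estimator__C": [0.1, 1.0, 10.0]},
    scoring="f1_micro",
    cv=3,
)
search.fit(X_train, Y_train)

Y_pred = search.predict(X_test)
micro_f1 = f1_score(Y_test, Y_pred, average="micro")
```

Micro averaging pools true positives, false positives, and false negatives over all labels before computing F1, so frequent side effects weigh more heavily in the score than rare ones.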

🖥 Dashboard Usage

To launch the prediction interface:

python dashboard.py

Once launched, access http://127.0.0.1:8050/ in your web browser. You can:

Enter a molecule name (e.g., Aspirin).
Select one or more biological targets from the dropdown menu.
Receive a list of potential side effects predicted by the model.
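
This README does not show how dashboard.py turns the two inputs into predictions. A plausible sketch of that inference step follows: encode the name and targets the same way as at training time, then keep the side effects whose predicted probability reaches a threshold. All function names, the 0.5 threshold, and the toy training data (placeholder targets, arbitrary labels) are assumptions; in the real application the fitted objects would be loaded from models_export/ rather than trained inline.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer


def build_features(names, targets, name_vec, target_mlb):
    """Concatenate name TF-IDF features with binary target features."""
    return hstack([
        name_vec.transform(names),
        csr_matrix(target_mlb.transform(targets)),
    ]).tocsr()


def predict_side_effects(name, targets, name_vec, target_mlb, model,
                         effect_mlb, threshold=0.5):
    """Return (side_effect, probability) pairs at or above threshold, highest first."""
    features = build_features([name], [targets], name_vec, target_mlb)
    proba = model.predict_proba(features)[0]
    ranked = sorted(zip(effect_mlb.classes_, proba),
                    key=lambda pair: pair[1], reverse=True)
    return [(str(effect), float(p)) for effect, p in ranked if p >= threshold]


# Toy stand-ins for the exported artifacts; the labels are made up.
train_names = ["omeprazole", "pantoprazole", "methotrexate", "methyldopa"]
train_targets = [["target_a"], ["target_a"], ["target_b"], ["target_c"]]
train_effects = [["headache"], ["headache"], ["nausea"], ["nausea"]]

name_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5)).fit(train_names)
target_mlb = MultiLabelBinarizer().fit(train_targets)
effect_mlb = MultiLabelBinarizer().fit(train_effects)
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(
    build_features(train_names, train_targets, name_vec, target_mlb),
    effect_mlb.transform(train_effects),
)

result = predict_side_effects(
    "esomeprazole", ["target_a"], name_vec, target_mlb, model, effect_mlb
)
```

In a Dash application, a function like `predict_side_effects` would typically be called from the callback bound to the name input and the target dropdown.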

👥 Contributions of each member

Ghita Fadili: Dataset selection and merging, Scientific Papers & References
Thomas Jego: Model training, Ensemble models, Voting, Model Comparison
Victoire Dezombre: Data Exploration & Data Cleaning
Léandre Durand-Terasson: Word Embedding (TF-IDF, syntactic similarity, MultiLabelBinarizer)
