This project aims to predict the potential side effects of a drug molecule based on its name (encoded chemical structure) and its biological targets. It utilizes Natural Language Processing (NLP) techniques and multi-label supervised learning.
- Project Overview
- Repository Structure
- Installation
- Data Pipeline
- Modeling and Performance
- Dashboard Usage
- Contribution of each member
The project merges data from STITCH, SIDER (MedDRA), and DrugBank to create a comprehensive dataset. This dataset associates:
- Drug names.
- Anatomical Therapeutic Chemical (ATC) codes.
- Biological targets.
- Indications and side effects.
An optimized Logistic Regression (OneVsRest) model is trained to predict the presence or absence of the 300 most frequent side effects.
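Restricting the label space to the most frequent side effects can be sketched as follows. This is a minimal illustration on toy data, not the project's code: the `records` list, the `top_k_labels` and `binarize` helpers, and `k=2` (standing in for 300) are assumptions made for the example.

```python
from collections import Counter


def top_k_labels(records, k):
    """Return the k most frequent labels across all records."""
    counts = Counter(label for labels in records for label in labels)
    return [label for label, _ in counts.most_common(k)]


def binarize(records, labels):
    """Encode each record as a 0/1 row with one column per kept label."""
    return [[int(label in record) for label in labels] for record in records]


# Toy side-effect lists, one per drug (hypothetical values).
records = [
    ["nausea", "headache"],
    ["nausea", "rash"],
    ["nausea", "headache", "dizziness"],
    ["headache"],
    ["nausea"],
]

top_labels = top_k_labels(records, k=2)  # the project keeps 300 labels
Y = binarize(records, top_labels)        # rare labels are simply dropped
```

Dropping rare labels keeps every remaining column populated with enough positive examples for a per-label classifier to learn from.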
- Fusionetconcat.py: Script for the initial merging of the STITCH and DrugBank databases, featuring XML parsing and fuzzy matching.
- Firstcleaning.py: Script for cleaning and formatting ATC codes.
- data_cleaning.ipynb: Exploratory data analysis and final dataset cleaning.
- word_embedding.ipynb: Data encoding (TF-IDF on character n-grams) and model training/comparison.
- dashboard.py: Interactive Dash application to test the model in real time.
- models_export/: Directory containing trained models and vectorizers in .pkl format.
- Clone the repository.
- Install the necessary dependencies:
pip install pandas numpy scikit-learn dash joblib matplotlib plotly rapidfuzz
- Ensure you have the source data files or the generated cleaned_drug_side_effects.csv.
- XML Parsing: Extraction of targets and descriptions from the DrugBank database.
- Fuzzy Matching: Reconciling drug names between STITCH and DrugBank using the token_sort_ratio algorithm.
- Cleaning: Removal of missing values and filtering of columns that are not usable as features, such as IDs and long descriptions.
- Biological Targets: Binary encoding via MultiLabelBinarizer.
- Molecule Names: Use of TF-IDF on characters (n-grams of 3 to 5 letters) to capture chemical roots like -zole or -meth.
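The two encodings above can be combined into a single feature matrix along these lines. This is a hedged sketch: the drug names and target labels are toy placeholders, and the choice of `analyzer="char"` and of horizontally stacking the two blocks are assumptions, since the README does not state how the notebook joins them.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy inputs (placeholders, not project data).
names = ["omeprazole", "fluconazole", "methotrexate"]
targets = [["target_a"], ["target_b"], ["target_a", "target_c"]]

# Character n-grams of length 3 to 5 pick up shared substrings such as "zole".
name_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
X_names = name_vectorizer.fit_transform(names)

# One binary column per distinct biological target.
target_binarizer = MultiLabelBinarizer(sparse_output=True)
X_targets = target_binarizer.fit_transform(targets)

# Assumed combination step: stack both blocks side by side.
X = hstack([X_names, X_targets]).tocsr()
```

Both fitted objects need to be saved alongside the model so that new inputs are encoded with the same vocabulary and the same target columns.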
The project compared several approaches:
- Baseline (Logistic Regression): Optimized via GridSearchCV.
- Random Forest.
- Voting Classifier.
Result: The Logistic Regression model achieved the best performance, with a micro-averaged F1 score of approximately 0.49 on the 300 most frequent side effects. For a multi-label problem with this many labels, this is an encouraging result.
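The baseline setup can be illustrated with the sketch below. It runs on synthetic data from `make_multilabel_classification` rather than the project's dataset, and the parameter grid, the number of folds, and the train/test split are assumptions, so the score it produces says nothing about the 0.49 reported above.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the encoded features and the side-effect label matrix.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=5, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# One independent logistic regression per label.
base_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# Tune the regularization strength, selecting on micro-averaged F1.
search = GridSearchCV(
    base_model,
    param_grid={"estimator__C": [0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="f1_micro",
    cv=3,
)
search.fit(X_train, Y_train)

Y_pred = search.predict(X_test)
micro_f1 = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
print(f"best C: {search.best_params_['estimator__C']}, micro F1: {micro_f1:.2f}")
```

Micro-averaging pools true positives, false positives, and false negatives over all labels, so frequent side effects weigh more heavily in the score than rare ones.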
To launch the prediction interface:
python dashboard.py
Once launched, access http://127.0.0.1:8050/ in your web browser. You can:
- Enter a molecule name (e.g., Aspirin).
- Select one or more biological targets from the dropdown menu.
- Instantly receive a list of potential side effects detected by the model.
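The prediction step behind this interface might look like the sketch below. It is an assumption-laden illustration, not the code in dashboard.py: the components are fitted on toy data instead of being loaded from models_export/, and the `predict_side_effects` helper and its 0.5 threshold are hypothetical.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data (placeholders, not project data).
names = ["drugazole", "methodrug", "otherazole", "methothing"]
targets = [["target_a"], ["target_b"], ["target_a"], ["target_b"]]
side_effects = [["nausea"], ["headache"], ["nausea"], ["headache", "nausea"]]

name_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
target_bin = MultiLabelBinarizer(sparse_output=True)
label_bin = MultiLabelBinarizer()

X = hstack(
    [name_vec.fit_transform(names), target_bin.fit_transform(targets)]
).tocsr()
Y = label_bin.fit_transform(side_effects)
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)


def predict_side_effects(name, selected_targets, threshold=0.5):
    """Encode one user input and return the labels predicted above the threshold."""
    features = hstack(
        [name_vec.transform([name]), target_bin.transform([selected_targets])]
    ).tocsr()
    probabilities = model.predict_proba(features)[0]
    return [
        label
        for label, probability in zip(label_bin.classes_, probabilities)
        if probability >= threshold
    ]


if __name__ == "__main__":
    print(predict_side_effects("drugazole", ["target_a"]))
```

The user input must go through the same fitted vectorizer and binarizer that were used during training, which is why those objects are stored next to the model.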
- Ghita Fadili: Dataset selection and merging, Scientific Papers & References
- Thomas Jego: Model training, Ensemble models, Voting, Model Comparison
- Victoire Dezombre: Data Exploration & Data Cleaning
- Léandre durand-Terasson: Word Embedding (TF-IDF, syntactic similarity, MultiLabelBinarizer)