Synapse Forge: Disease Prediction from Symptoms

Project Overview

This project, "Synapse Forge," demonstrates an end-to-end machine learning pipeline for predicting diseases based on user-provided symptoms. It showcases skills in data acquisition, text preprocessing, machine learning model training, evaluation, and interactive web application development using Streamlit.

Problem Statement

In healthcare, identifying potential diseases early based on reported symptoms is crucial for timely intervention. This project aims to build a predictive model that can suggest a likely disease given a textual description of symptoms, serving as a demonstration of how AI can be applied in health informatics.

Dataset

Source: Kaggle ("Disease Prediction")
Columns Used: symptoms (input feature), disease (target variable).
Characteristics: The dataset contains 99 rows and 5 columns, with 88 unique disease classes, making this a multi-class classification problem. That is roughly 1.1 rows per class on average, so the model sees most diseases only once and has little basis for generalization, which limits accuracy.
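The class-sparsity problem can be made concrete with a little arithmetic. The sketch below uses only the row and class counts quoted above; the variable names are illustrative and not taken from the notebooks.

```python
# Figures taken from the dataset description above.
n_rows = 99
n_classes = 88

# Every class needs at least one row, so only the surplus rows can give
# any class a second (or third, ...) example.
surplus_rows = n_rows - n_classes

# At most `surplus_rows` classes can have two or more rows, so at least
# this many classes must have exactly one row.
min_singleton_classes = n_classes - surplus_rows

print(f"Surplus rows beyond one per class: {surplus_rows}")
print(f"Classes with exactly one example: at least {min_singleton_classes}")
```

Any class with a single row ends up entirely in either the training set or the test set, so the model is either never shown it or never evaluated on it.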

Project Phases & Key Learnings

Phase 0: Foundation & Setup

Goal: Project ideation, environment setup, and Git version control.
Learnings: Defining project scope, virtual environments (conda/venv), Git commands (git init, add, commit, push).

Phase 1: Data Acquisition & Exploratory Data Analysis (EDA)

Goal: Load the dataset and understand its structure and characteristics.
Key Activities: Loading CSV data with pandas, inspecting data types (.info()), checking for missing values (.isnull().sum()), analyzing unique values in disease and symptoms, calculating text lengths for symptoms, and identifying common words.
Learnings: Importance of initial data inspection, understanding text data properties, identifying class imbalance.
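The inspection steps listed above might look roughly like the following. This is a minimal sketch on a toy DataFrame, not the notebook's actual code: the rows are invented for illustration, and in the notebook the frame would presumably come from pd.read_csv(FILE_NAME) instead.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the real CSV (invented rows, for illustration only).
df = pd.DataFrame({
    "symptoms": ["fever cough fatigue", "headache nausea", "fever chills", None],
    "disease": ["Flu", "Migraine", "Malaria", "Flu"],
})

df.info()                                     # column dtypes and non-null counts
missing = df.isnull().sum()                   # missing values per column
n_diseases = df["disease"].nunique()          # number of distinct classes
class_counts = df["disease"].value_counts()   # rows per class, reveals imbalance

# Length of each symptom description in characters (missing text counts as empty).
text_len = df["symptoms"].fillna("").str.len()

# Word frequencies across all non-missing symptom descriptions.
words = Counter(" ".join(df["symptoms"].dropna()).split())

print(missing, class_counts, text_len.describe(), words.most_common(3), sep="\n\n")
```

On the real data, class_counts is the quickest way to see how many diseases have only a single row.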

Phase 2: Data Preprocessing & Feature Engineering

Goal: Transform raw text data into numerical features suitable for machine learning.
Key Activities:
- Text Cleaning: Lowercasing, removing punctuation/numbers, stop word removal, lemmatization (nltk).
- Text Vectorization: Converting cleaned symptom text into numerical TF-IDF vectors (TfidfVectorizer).
- Target Encoding: Converting categorical disease names into numerical labels (LabelEncoder).
- Data Splitting: Dividing data into training and testing sets (train_test_split). Because most disease classes have too few rows for a stratified split, the stratify argument was removed.
Learnings: Text preprocessing techniques, TF-IDF for text representation, label encoding, handling data sparsity in splits.
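The preprocessing steps above can be sketched as follows. Several things here are assumptions rather than the project's code: the sample rows are invented, the small STOP_WORDS set stands in for the nltk stop word list so the sketch runs without downloads, lemmatization is omitted, and fitting the vectorizer on the training split only is a choice made for this sketch.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hand-written stand-in for the nltk stop word list (assumption).
STOP_WORDS = {"a", "an", "and", "the", "with", "of", "i", "have"}


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation and digits with spaces, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


# Invented sample rows, one per disease, mirroring the sparse real data.
symptoms = [
    "I have a Fever, and a dry cough!",
    "Severe headache with nausea",
    "Fever with chills (3 days)",
    "Itchy rash and swelling",
]
diseases = ["Flu", "Migraine", "Malaria", "Allergy"]

cleaned = [clean_text(s) for s in symptoms]

# Map disease names to integer labels.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(diseases)

# Each class occurs once, so stratify=y would raise a ValueError;
# a plain random split is used instead.
text_train, text_test, y_train, y_test = train_test_split(
    cleaned, y, test_size=0.25, random_state=42
)

# Learn the TF-IDF vocabulary from the training text, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

print(cleaned)
print(X_train.shape, X_test.shape)
```

Calling transform (not fit_transform) on the test text keeps the feature columns identical across both splits.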

Phase 3: Model Training & Evaluation

Goal: Train a classification model and assess its performance.
Key Activities:
- Model Selection: Experimented with LogisticRegression and RandomForestClassifier.
- Training: Fitting the model to the training data.
- Prediction: Making predictions on the unseen test data.
- Evaluation: Calculating accuracy_score and generating a classification_report (Precision, Recall, F1-score, Support).
Learnings: Model training workflow, interpreting classification metrics, understanding model limitations due to data characteristics (small size, high number of unique classes, class imbalance). Observed low accuracy due to these inherent data challenges.
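A minimal sketch of the train, predict, and evaluate loop described above, using LogisticRegression on invented toy data. The toy set has two well-separated classes with two rows each, so it does not reproduce the low accuracy observed on the real dataset; it only shows the workflow. The zero_division=0 argument is an addition for this sketch to silence warnings when a class is never predicted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Invented toy data (not from the project dataset).
train_text = [
    "fever cough fatigue",
    "fever cough chills",
    "headache nausea dizziness",
    "headache nausea light",
]
train_y = ["Flu", "Flu", "Migraine", "Migraine"]
test_text = ["cough fever", "nausea headache"]
test_y = ["Flu", "Migraine"]

# Vectorize: fit on training text, reuse the vocabulary for test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

# Train the classifier and predict on the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_y)
pred = model.predict(X_test)

# Overall accuracy plus per-class precision, recall, F1-score, and support.
accuracy = accuracy_score(test_y, pred)
report = classification_report(test_y, pred, zero_division=0)

print(f"accuracy = {accuracy:.2f}")
print(report)
```

Swapping in RandomForestClassifier only changes the line that constructs the model; the fit, predict, and metric calls stay the same.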

Phase 4: Interactive Dashboard Development

Goal: Create a user-friendly web interface for the disease prediction model.
Key Activities:
- Model Persistence: Saving the trained model, TF-IDF vectorizer, and label encoder using joblib.
- Streamlit App: Building the UI, integrating model loading and prediction logic.
- User Input: Accepting symptom text from the user.
- Displaying Results: Showing the predicted disease and top probabilities.
Learnings: Building interactive web applications with Streamlit, model serialization/deserialization, creating a user-friendly front-end for an ML model.
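The persistence and prediction logic behind the dashboard could be sketched like this. The file names, the predict_top_k helper, and the toy training rows are all hypothetical; the actual app.py and the contents of models/ may differ. A temporary directory is used here so the sketch does not write into the project folder.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Invented toy training data, two rows per disease.
texts = [
    "fever cough fatigue", "fever cough chills",
    "headache nausea dizziness", "headache nausea light",
    "rash itching swelling", "rash itching sneezing",
]
labels = ["Flu", "Flu", "Migraine", "Migraine", "Allergy", "Allergy"]

vectorizer = TfidfVectorizer()
label_encoder = LabelEncoder()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), label_encoder.fit_transform(labels))


def predict_top_k(text, model, vectorizer, label_encoder, k=3):
    """Return the k most probable (disease, probability) pairs, best first."""
    proba = model.predict_proba(vectorizer.transform([text]))[0]
    top_idx = np.argsort(proba)[::-1][:k]
    names = label_encoder.inverse_transform(model.classes_[top_idx])
    return [(str(n), float(p)) for n, p in zip(names, proba[top_idx])]


with tempfile.TemporaryDirectory() as tmp:
    models_dir = Path(tmp) / "models"
    models_dir.mkdir()

    # Save all three fitted objects (hypothetical file names).
    joblib.dump(model, models_dir / "model.joblib")
    joblib.dump(vectorizer, models_dir / "vectorizer.joblib")
    joblib.dump(label_encoder, models_dir / "label_encoder.joblib")

    # Load them back, as the app would do at startup.
    loaded_model = joblib.load(models_dir / "model.joblib")
    loaded_vectorizer = joblib.load(models_dir / "vectorizer.joblib")
    loaded_encoder = joblib.load(models_dir / "label_encoder.joblib")

top = predict_top_k("fever and cough", loaded_model, loaded_vectorizer, loaded_encoder, k=2)
for disease, prob in top:
    print(f"{disease}: {prob:.2f}")
```

All three objects have to be saved together: the model only understands feature columns produced by the same fitted vectorizer, and its integer outputs only map back to disease names through the same label encoder. In a Streamlit app, a helper like this would typically be called with the text the user types into an input widget.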

Technologies Used

Programming Language: Python
Data Manipulation: pandas, numpy
Text Preprocessing: nltk, re
Machine Learning: scikit-learn (LogisticRegression, RandomForestClassifier, TfidfVectorizer, LabelEncoder)
Model Persistence: joblib
Interactive Dashboard: Streamlit
Version Control: Git, GitHub
Environment Management: conda / venv
Development Environment: JupyterLab, VS Code

How to Run This Project

Clone the repository:

```
git clone https://github.com/Kalpnaa/synapse-forge-project.git
cd synapse-forge-project
```

Create and activate a virtual environment:

```
conda create -n synapse_forge_env python=3.9  # or python=3.10
conda activate synapse_forge_env
# OR for venv:
# python -m venv synapse_forge_env
# .\synapse_forge_env\Scripts\activate  # Windows
# source ./synapse_forge_env/bin/activate  # macOS/Linux
```

Install required Python packages:
```
pip install pandas numpy matplotlib seaborn scikit-learn jupyterlab streamlit joblib nltk
```
You may also need to run nltk.download('stopwords') and nltk.download('wordnet') once in a Python console if these corpora are not already downloaded.
Acquire the dataset:
- Download the dataset CSV file from its original source (Kaggle; see the Dataset section).
- Place the downloaded CSV file into your project directory (or create a data/ folder and place it there, then update 01_EDA_Disease_Prediction.ipynb accordingly).
Run the Jupyter Notebooks (in order):
- jupyter lab
- Open and run all cells in 01_EDA_Disease_Prediction.ipynb (update FILE_NAME to your CSV path).
- Open and run all cells in 02_Feature_Engineering.ipynb (ensure FILE_NAME is correct).
- Open and run all cells in 03_Model_Training_Evaluation.ipynb (ensure FILE_NAME is correct). This notebook will save the necessary .joblib files into a models/ folder.
Run the Streamlit application:
```
streamlit run app.py
```
This will open the application in your web browser.

Future Enhancements

Larger Dataset: Acquire a significantly larger and more balanced dataset for improved model accuracy.
Advanced NLP: Experiment with word embeddings (Word2Vec, GloVe) or transformer-based models (BERT, GPT-2) for symptom representation.
Model Optimization: Hyperparameter tuning, cross-validation, and trying other advanced classification models (e.g., XGBoost, LightGBM).
MLOps Integration: Implement Docker for containerization, set up CI/CD pipelines (e.g., GitHub Actions) for automated deployment, and add model monitoring (e.g., for data drift, performance degradation).
User Feedback Loop: Incorporate a mechanism in the dashboard to collect user feedback on predictions, which could be used to retrain/improve the model.
More Features: Incorporate cures, doctor, and risk level information as additional model features.
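As an illustration of the Model Optimization item, hyperparameter tuning with cross-validation might be set up as below. This is a sketch under assumptions: the toy rows and the parameter grid are invented, and it presupposes the larger dataset mentioned above, since stratified cross-validation needs several rows per class and would not work on data where most diseases appear once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data with four rows per class so 2-fold stratified CV is possible.
texts = [
    "fever cough fatigue", "fever cough chills",
    "high fever sore throat", "cough fever body ache",
    "headache nausea dizziness", "headache nausea light sensitivity",
    "throbbing headache nausea", "headache blurred vision nausea",
]
labels = ["Flu"] * 4 + ["Migraine"] * 4

# Putting the vectorizer inside the pipeline refits it on each training fold,
# so no vocabulary leaks from the validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: unigrams vs. unigrams plus bigrams, and three regularization strengths.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```

The fitted search.best_estimator_ is a complete pipeline, so it could be saved with joblib as a single object in place of separate model and vectorizer files.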
