This project, "Synapse Forge," demonstrates an end-to-end machine learning pipeline for predicting diseases based on user-provided symptoms. It showcases skills in data acquisition, text preprocessing, machine learning model training, evaluation, and interactive web application development using Streamlit.
In healthcare, identifying potential diseases early based on reported symptoms is crucial for timely intervention. This project aims to build a predictive model that can suggest a likely disease given a textual description of symptoms, serving as a demonstration of how AI can be applied in health informatics.
- Source: Kaggle, "Disease Prediction" dataset
- Columns Used: `symptoms` (input feature), `disease` (target variable).
- Characteristics: The dataset contains 99 rows and 5 columns, with 88 unique disease classes, making this a multi-class classification problem. Because there are almost as many classes as rows, most diseases appear only once, so the model struggles to generalize and accuracy is limited.
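A quick back-of-the-envelope check, using only the row and class counts quoted above, shows how thin the data is per class. The variable names are illustrative and not taken from the notebooks.

```python
# Figures taken from the dataset description above.
n_rows = 99
n_classes = 88

# Average number of examples available per disease.
avg_per_class = n_rows / n_classes

# Every class needs at least one row, so only the leftover rows
# can give any class a second example.
spare_rows = n_rows - n_classes
min_single_example_classes = n_classes - spare_rows

print(f"average rows per class: {avg_per_class:.3f}")
print(f"classes with exactly one row (at least): {min_single_example_classes}")
```

With at most 11 spare rows, at least 77 of the 88 diseases can have only a single example, which is why a held-out test set will mostly contain classes the model has never seen.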
- Goal: Project ideation, environment setup, and Git version control.
- Learnings: Defining project scope, virtual environments (`conda`/`venv`), Git commands (`git init`, `add`, `commit`, `push`).
- Goal: Load the dataset and understand its structure and characteristics.
- Key Activities: Loading CSV data with `pandas`, inspecting data types (`.info()`), checking for missing values (`.isnull().sum()`), analyzing unique values in `disease` and `symptoms`, calculating text lengths for `symptoms`, and identifying common words.
- Learnings: Importance of initial data inspection, understanding text data properties, identifying class imbalance.
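The inspection steps listed above might look roughly like the sketch below. It uses a tiny made-up DataFrame in place of the real CSV (the notebook would presumably call `pd.read_csv` on its `FILE_NAME`), so the rows and values are assumptions for illustration only.

```python
from collections import Counter

import pandas as pd

# Tiny made-up stand-in for the real CSV file.
df = pd.DataFrame(
    {
        "symptoms": [
            "fever, cough and sore throat",
            "itchy skin rash",
            "fever with severe headache",
            None,
        ],
        "disease": ["flu", "eczema", "malaria", "flu"],
    }
)

df.info()                                     # column dtypes and non-null counts
missing = df.isnull().sum()                   # missing values per column
class_counts = df["disease"].value_counts()   # rows per disease (class imbalance)

# Length of each symptom description in characters (missing stays missing).
text_lengths = df["symptoms"].str.len()

# Most common words across all non-missing symptom descriptions.
words = " ".join(df["symptoms"].dropna()).lower().replace(",", " ").split()
common_words = Counter(words).most_common(3)

print(missing, class_counts, text_lengths.describe(), common_words, sep="\n\n")
```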
- Goal: Transform raw text data into numerical features suitable for machine learning.
- Key Activities:
  - Text Cleaning: Lowercasing, removing punctuation/numbers, stop word removal, lemmatization (`nltk`).
  - Text Vectorization: Converting cleaned symptom text into numerical TF-IDF vectors (`TfidfVectorizer`).
  - Target Encoding: Converting categorical disease names into numerical labels (`LabelEncoder`).
  - Data Splitting: Dividing data into training and testing sets (`train_test_split`). The `stratify` argument was removed because most classes have too few rows for a stratified split.
- Learnings: Text preprocessing techniques, TF-IDF for text representation, label encoding, handling data sparsity in splits.
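The feature engineering steps above can be sketched as follows. To stay self-contained, this version swaps nltk's stop-word list for a small hard-coded set and skips lemmatization; `STOP_WORDS`, `clean_text`, and the sample rows are hypothetical and not taken from the notebooks.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini stop-word list; the project uses nltk's list and a lemmatizer.
STOP_WORDS = {"a", "and", "the", "with", "of", "in"}


def clean_text(text: str) -> str:
    """Lowercase, drop punctuation and digits, and remove stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


symptoms = [
    "Fever, cough and a sore throat",
    "Itchy skin rash with 2 red patches",
    "Severe headache and fever",
    "Pain in the chest",
]
diseases = ["flu", "eczema", "malaria", "angina"]  # one row per class

cleaned = [clean_text(s) for s in symptoms]

X = TfidfVectorizer().fit_transform(cleaned)   # sparse TF-IDF matrix
encoder = LabelEncoder()
y = encoder.fit_transform(diseases)            # integer labels 0..n_classes-1

# No stratify=y here: a stratified split needs at least two rows per class,
# which single-example classes cannot provide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(cleaned[0], X.shape, list(encoder.classes_))
```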
- Goal: Train a classification model and assess its performance.
- Key Activities:
  - Model Selection: Experimented with `LogisticRegression` and `RandomForestClassifier`.
  - Training: Fitting the model to the training data.
  - Prediction: Making predictions on the unseen test data.
  - Evaluation: Calculating `accuracy_score` and generating a `classification_report` (Precision, Recall, F1-score, Support).
- Learnings: Model training workflow, interpreting classification metrics, understanding model limitations due to data characteristics (small size, high number of unique classes, class imbalance). Observed low accuracy due to these inherent data challenges.
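As a rough illustration of the train, predict, and evaluate workflow described above, the sketch below fits a `LogisticRegression` on a handful of made-up, already-cleaned examples. The data and variable names are assumptions; the real notebook works on the 99-row dataset and will report very different numbers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

# Made-up, already-cleaned examples covering two diseases.
train_texts = ["fever cough sore throat", "high fever cough chills",
               "itchy skin rash", "dry itchy red skin"]
train_labels = ["flu", "flu", "eczema", "eczema"]
test_texts = ["cough fever", "red itchy rash"]
test_labels = ["flu", "eczema"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)   # reuse the fitted vocabulary

encoder = LabelEncoder()
y_train = encoder.fit_transform(train_labels)
y_test = encoder.transform(test_labels)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Passing labels explicitly keeps the report aligned with the encoder even
# when some classes are absent from the test set; zero_division=0 avoids
# warnings for classes that are never predicted.
report = classification_report(
    y_test,
    y_pred,
    labels=list(range(len(encoder.classes_))),
    target_names=list(encoder.classes_),
    zero_division=0,
)
print(f"accuracy: {accuracy:.2f}")
print(report)
```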
- Goal: Create a user-friendly web interface for the disease prediction model.
- Key Activities:
  - Model Persistence: Saving the trained model, TF-IDF vectorizer, and label encoder using `joblib`.
  - Streamlit App: Building the UI, integrating model loading and prediction logic.
  - User Input: Accepting symptom text from the user.
  - Displaying Results: Showing the predicted disease and top probabilities.
- Learnings: Building interactive web applications with Streamlit, model serialization/deserialization, creating a user-friendly front-end for an ML model.
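The persistence and prediction logic behind the app could be sketched as below: the three artifacts are saved with `joblib`, reloaded, and used to rank diseases by predicted probability. The helper `top_k_predictions`, the file names, and the toy training data are hypothetical; the Streamlit UI itself is left out so the snippet runs as a plain script.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Train a toy model on made-up data so the example is self-contained.
texts = ["fever cough sore throat", "high fever cough chills",
         "itchy skin rash", "dry itchy red skin",
         "chest pain short breath", "tight chest pain"]
labels = ["flu", "flu", "eczema", "eczema", "angina", "angina"]
vectorizer = TfidfVectorizer()
encoder = LabelEncoder()
model = LogisticRegression(max_iter=1000).fit(
    vectorizer.fit_transform(texts), encoder.fit_transform(labels)
)


def top_k_predictions(model, vectorizer, encoder, text, k=3):
    """Return the k most probable (disease, probability) pairs for `text`."""
    probabilities = model.predict_proba(vectorizer.transform([text]))[0]
    best = np.argsort(probabilities)[::-1][:k]
    # model.classes_ holds the encoded labels in predict_proba column order.
    names = encoder.inverse_transform(model.classes_[best])
    return [(str(n), float(probabilities[i])) for n, i in zip(names, best)]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; the notebook saves its artifacts under models/.
    paths = {name: Path(tmp) / f"{name}.joblib"
             for name in ("model", "vectorizer", "encoder")}
    for name, obj in zip(paths, (model, vectorizer, encoder)):
        joblib.dump(obj, paths[name])
    # What the app would do at startup: load the three saved artifacts.
    loaded = {name: joblib.load(path) for name, path in paths.items()}

results = top_k_predictions(
    loaded["model"], loaded["vectorizer"], loaded["encoder"],
    "fever and cough", k=2,
)
print(results)
```

In a Streamlit app, the text passed to `top_k_predictions` would come from an input widget and the returned pairs would be rendered on the page.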
- Programming Language: Python
- Data Manipulation: `pandas`, `numpy`
- Text Preprocessing: `nltk`, `re`
- Machine Learning: `scikit-learn` (`LogisticRegression`, `RandomForestClassifier`, `TfidfVectorizer`, `LabelEncoder`)
- Model Persistence: `joblib`
- Interactive Dashboard: Streamlit
- Version Control: Git, GitHub
- Environment Management: `conda`/`venv`
- Development Environment: JupyterLab, VS Code
- Clone the repository:

      git clone https://github.com/Kalpnaa/synapse-forge-project.git
      cd synapse-forge-project

- Create and activate a virtual environment:

      conda create -n synapse_forge_env python=3.9  # or python=3.10
      conda activate synapse_forge_env
      # OR for venv:
      # python -m venv synapse_forge_env
      # .\synapse_forge_env\Scripts\activate       # Windows
      # source ./synapse_forge_env/bin/activate    # macOS/Linux

- Install required Python packages:

      pip install pandas numpy matplotlib seaborn scikit-learn jupyterlab streamlit joblib nltk

  (You may need to run `nltk.download('stopwords')` and `nltk.download('wordnet')` separately in a Python console if they are not already downloaded.)

- Acquire the dataset:
  - Download your specific dataset CSV file from [Original Dataset Source Link, e.g., Kaggle].
  - Place the downloaded CSV file into your project directory (or create a `data/` folder and place it there, then update `01_EDA_Disease_Prediction.ipynb` accordingly).

- Run the Jupyter Notebooks (in order):

      jupyter lab

  - Open and run all cells in `01_EDA_Disease_Prediction.ipynb` (update `FILE_NAME` to your CSV path).
  - Open and run all cells in `02_Feature_Engineering.ipynb` (ensure `FILE_NAME` is correct).
  - Open and run all cells in `03_Model_Training_Evaluation.ipynb` (ensure `FILE_NAME` is correct). This notebook will save the necessary `.joblib` files into a `models/` folder.

- Run the Streamlit application:

      streamlit run app.py

  This will open the application in your web browser.

- Larger Dataset: Acquire a significantly larger and more balanced dataset for improved model accuracy.
- Advanced NLP: Experiment with word embeddings (Word2Vec, GloVe) or transformer-based models (BERT, GPT-2) for symptom representation.
- Model Optimization: Hyperparameter tuning, cross-validation, and trying other advanced classification models (e.g., XGBoost, LightGBM).
- MLOps Integration: Implement Docker for containerization, set up CI/CD pipelines (e.g., GitHub Actions) for automated deployment, and add model monitoring (e.g., for data drift, performance degradation).
- User Feedback Loop: Incorporate a mechanism in the dashboard to collect user feedback on predictions, which could be used to retrain/improve the model.
- More Features: Incorporate `cures`, `doctor`, and `risk level` as additional features in the model.
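For the model optimization item above, one possible starting point is a cross-validated grid search over a TF-IDF plus logistic regression pipeline. This is only a sketch on made-up data with an arbitrary parameter grid; on the real dataset, stratified cross-validation would first require enough rows per class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up examples: two diseases with four rows each, so 2-fold
# stratified cross-validation has enough samples per class.
texts = [
    "fever cough sore throat", "high fever cough chills",
    "fever body ache cough", "runny nose fever cough",
    "itchy skin rash", "dry itchy red skin",
    "red rash itchy patches", "skin rash dry patches",
]
labels = ["flu"] * 4 + ["eczema"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Arbitrary illustrative grid: unigram vs. unigram+bigram features and
# three regularization strengths.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the validation fold never influences the vocabulary.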