Synapse Forge: Disease Prediction from Symptoms

Project Overview

This project, "Synapse Forge," demonstrates an end-to-end machine learning pipeline for predicting diseases based on user-provided symptoms. It showcases skills in data acquisition, text preprocessing, machine learning model training, evaluation, and interactive web application development using Streamlit.

Problem Statement

In healthcare, identifying potential diseases early based on reported symptoms is crucial for timely intervention. This project aims to build a predictive model that can suggest a likely disease given a textual description of symptoms, serving as a demonstration of how AI can be applied in health informatics.

Dataset

  • Source: Kaggle ("Disease Prediction" dataset).
  • Columns Used: symptoms (input feature), disease (target variable).
  • Characteristics: The dataset contains 99 rows and 5 columns, with 88 unique disease classes, making this a multi-class classification problem. That is roughly 1.1 rows per class (99 / 88), so most diseases appear only once. With so few examples per class, the model has very little to generalize from, and accuracy is limited.

Project Phases & Key Learnings

Phase 0: Foundation & Setup

  • Goal: Project ideation, environment setup, and Git version control.
  • Learnings: Defining project scope, virtual environments (conda/venv), Git commands (git init, add, commit, push).

Phase 1: Data Acquisition & Exploratory Data Analysis (EDA)

  • Goal: Load the dataset and understand its structure and characteristics.
  • Key Activities: Loading CSV data with pandas, inspecting data types (.info()), checking for missing values (.isnull().sum()), analyzing unique values in disease and symptoms, calculating text lengths for symptoms, and identifying common words.
  • Learnings: Importance of initial data inspection, understanding text data properties, identifying class imbalance.
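
The inspection steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the inline CSV is a hypothetical stand-in for the real file, and only the symptoms and disease column names come from this README.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV (the actual file has 99 rows and 5 columns).
csv_text = """symptoms,disease
"fever, cough, sore throat",Influenza
"itchy skin, red rash",Eczema
"headache, nausea, sensitivity to light",Migraine
"fever, cough, fatigue",Influenza
"""
df = pd.read_csv(io.StringIO(csv_text))

df.info()                                     # column dtypes and non-null counts
missing = df.isnull().sum()                   # missing values per column
class_counts = df["disease"].value_counts()   # rows per disease (shows class imbalance)
text_lengths = df["symptoms"].str.len()       # characters per symptom description

# Most frequent words across all symptom descriptions
words = df["symptoms"].str.lower().str.findall(r"[a-z]+").explode()
common_words = words.value_counts().head(5)

print(class_counts)
print(common_words)
```

On the real dataset, `class_counts` is the quickest way to see how many diseases have only a single row.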

Phase 2: Data Preprocessing & Feature Engineering

  • Goal: Transform raw text data into numerical features suitable for machine learning.
  • Key Activities:
    • Text Cleaning: Lowercasing, removing punctuation/numbers, stop word removal, lemmatization (nltk).
    • Text Vectorization: Converting cleaned symptom text into numerical TF-IDF vectors (TfidfVectorizer).
    • Target Encoding: Converting categorical disease names into numerical labels (LabelEncoder).
    • Data Splitting: Dividing data into training and testing sets (train_test_split). Because many diseases appear only once, a stratified split is not possible (it needs at least two samples per class), so the stratify argument was removed.
  • Learnings: Text preprocessing techniques, TF-IDF for text representation, label encoding, handling data sparsity in splits.
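
A minimal sketch of the four preprocessing steps above, under stated assumptions: the toy data is hypothetical, a small hard-coded stop word set stands in for the nltk stopwords corpus, and lemmatization is left out so the example runs without any downloads. The helper name clean_text is illustrative, not taken from the notebooks.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Assumption: this tiny set stands in for nltk's English stop word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "with", "in", "i", "have"}


def clean_text(text):
    """Lowercase, replace punctuation and digits with spaces, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


# Hypothetical toy data; most diseases appear only once, as in the real dataset.
symptoms = [
    "I have a fever, cough and sore throat",
    "Itchy skin with a red rash",
    "Headache, nausea and sensitivity to light",
    "Fever and cough for 3 days",
    "Chest pain and shortness of breath",
    "Frequent urination and excessive thirst",
]
diseases = ["Influenza", "Eczema", "Migraine", "Influenza", "Angina", "Diabetes"]

cleaned = [clean_text(s) for s in symptoms]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)   # sparse TF-IDF matrix, one row per sample

encoder = LabelEncoder()
y = encoder.fit_transform(diseases)     # one integer label per disease name

# Passing stratify=y would raise a ValueError here: some classes have one sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
```

This follows the order listed above (vectorize, then split). A common alternative is to split the raw text first and fit the vectorizer on the training portion only, so that no test-set vocabulary leaks into the features.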

Phase 3: Model Training & Evaluation

  • Goal: Train a classification model and assess its performance.
  • Key Activities:
    • Model Selection: Experimented with LogisticRegression and RandomForestClassifier.
    • Training: Fitting the model to the training data.
    • Prediction: Making predictions on the unseen test data.
    • Evaluation: Calculating accuracy_score and generating a classification_report (Precision, Recall, F1-score, Support).
  • Learnings: Model training workflow, interpreting classification metrics, understanding model limitations due to data characteristics (small size, high number of unique classes, class imbalance). Observed low accuracy due to these inherent data challenges.
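
The train, predict, and evaluate loop above might look like the sketch below. The toy data is hypothetical and deliberately easier than the real dataset (three diseases with four descriptions each), so the scores it prints say nothing about the project's actual results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: 3 diseases, 4 symptom descriptions each.
texts = [
    "fever cough sore throat", "high fever dry cough body ache",
    "cough chills fever fatigue", "fever runny nose cough",
    "itchy skin red rash", "dry itchy skin patches",
    "red rash itching elbows", "skin rash dry itching",
    "headache nausea light sensitivity", "throbbing headache nausea",
    "severe headache blurred vision", "headache vomiting sensitivity sound",
]
labels = ["Influenza"] * 4 + ["Eczema"] * 4 + ["Migraine"] * 4

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Stratifying works here only because every class has several samples.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.25, random_state=0, stratify=y
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(
    y_test, y_pred, target_names=encoder.classes_, zero_division=0
)
print(f"Accuracy: {accuracy:.2f}")
print(report)
```

Swapping in `RandomForestClassifier` from `sklearn.ensemble` only changes the line that constructs the model; the fit, predict, and reporting calls stay the same.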

Phase 4: Interactive Dashboard Development

  • Goal: Create a user-friendly web interface for the disease prediction model.
  • Key Activities:
    • Model Persistence: Saving the trained model, TF-IDF vectorizer, and label encoder using joblib.
    • Streamlit App: Building the UI, integrating model loading and prediction logic.
    • User Input: Accepting symptom text from the user.
    • Displaying Results: Showing the predicted disease and top probabilities.
  • Learnings: Building interactive web applications with Streamlit, model serialization/deserialization, creating a user-friendly front-end for an ML model.
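
A minimal sketch of the persistence and prediction logic behind the dashboard, assuming a toy model and hypothetical file names (model.joblib, vectorizer.joblib, encoder.joblib); the repository's actual names are not stated in this README. The helper predict_top_k is likewise illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy training data.
texts = [
    "fever cough sore throat", "fever cough fatigue",
    "itchy skin red rash", "dry skin rash itching",
    "headache nausea light sensitivity", "throbbing headache nausea",
]
labels = ["Influenza", "Influenza", "Eczema", "Eczema", "Migraine", "Migraine"]

vectorizer = TfidfVectorizer()
encoder = LabelEncoder()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), encoder.fit_transform(labels))

# Save all three artifacts; a temporary directory stands in for models/.
models_dir = Path(tempfile.mkdtemp()) / "models"
models_dir.mkdir()
joblib.dump(model, models_dir / "model.joblib")
joblib.dump(vectorizer, models_dir / "vectorizer.joblib")
joblib.dump(encoder, models_dir / "encoder.joblib")

# Load them back, as the app would at startup.
loaded_model = joblib.load(models_dir / "model.joblib")
loaded_vectorizer = joblib.load(models_dir / "vectorizer.joblib")
loaded_encoder = joblib.load(models_dir / "encoder.joblib")


def predict_top_k(text, k=3):
    """Return the k most probable diseases as (name, probability) pairs."""
    probs = loaded_model.predict_proba(loaded_vectorizer.transform([text]))[0]
    top = np.argsort(probs)[::-1][:k]  # indices of the k largest probabilities
    names = loaded_encoder.inverse_transform(loaded_model.classes_[top])
    return [(str(name), float(p)) for name, p in zip(names, probs[top])]


results = predict_top_k("fever and cough", k=2)
for name, prob in results:
    print(f"{name}: {prob:.2f}")
```

In a Streamlit script, the input could come from `st.text_area` and the pairs returned by `predict_top_k` could be shown with `st.write`; that wiring is left out here so the sketch runs as plain Python.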

Technologies Used

  • Programming Language: Python
  • Data Manipulation: pandas, numpy
  • Text Preprocessing: nltk, re
  • Machine Learning: scikit-learn (LogisticRegression, RandomForestClassifier, TfidfVectorizer, LabelEncoder)
  • Model Persistence: joblib
  • Interactive Dashboard: Streamlit
  • Version Control: Git, GitHub
  • Environment Management: conda / venv
  • Development Environment: JupyterLab, VS Code

How to Run This Project

  1. Clone the repository:

    git clone https://github.com/Kalpnaa/synapse-forge-project.git
    cd synapse-forge-project
  2. Create and activate a virtual environment:

    conda create -n synapse_forge_env python=3.9  # or python=3.10
    conda activate synapse_forge_env
    # OR for venv:
    # python -m venv synapse_forge_env
    # .\synapse_forge_env\Scripts\activate  # Windows
    # source ./synapse_forge_env/bin/activate  # macOS/Linux
  3. Install required Python packages:

    pip install pandas numpy matplotlib seaborn scikit-learn jupyterlab streamlit joblib nltk

    If the NLTK corpora are not already downloaded, run nltk.download('stopwords') and nltk.download('wordnet') in a Python console.

  4. Acquire the dataset:

    • Download the dataset CSV file from Kaggle (the "Disease Prediction" dataset listed under Dataset).
    • Place the downloaded CSV file into your project directory (or create a data/ folder and place it there, then update 01_EDA_Disease_Prediction.ipynb accordingly).
  5. Run the Jupyter Notebooks (in order):

    • jupyter lab
    • Open and run all cells in 01_EDA_Disease_Prediction.ipynb (update FILE_NAME to your CSV path).
    • Open and run all cells in 02_Feature_Engineering.ipynb (ensure FILE_NAME is correct).
    • Open and run all cells in 03_Model_Training_Evaluation.ipynb (ensure FILE_NAME is correct). This notebook will save the necessary .joblib files into a models/ folder.
  6. Run the Streamlit application:

    streamlit run app.py

    This will open the application in your web browser.

Future Enhancements

  • Larger Dataset: Acquire a significantly larger and more balanced dataset for improved model accuracy.
  • Advanced NLP: Experiment with word embeddings (Word2Vec, GloVe) or transformer-based models (BERT, GPT-2) for symptom representation.
  • Model Optimization: Hyperparameter tuning, cross-validation, and trying other advanced classification models (e.g., XGBoost, LightGBM).
  • MLOps Integration: Implement Docker for containerization, set up CI/CD pipelines (e.g., GitHub Actions) for automated deployment, and add model monitoring (e.g., for data drift, performance degradation).
  • User Feedback Loop: Incorporate a mechanism in the dashboard to collect user feedback on predictions, which could be used to retrain/improve the model.
  • More Features: Incorporate cures, doctor, and risk level as additional model features.
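
As one possible starting point for the Model Optimization item, hyperparameter tuning with cross-validation could be sketched with a scikit-learn Pipeline and GridSearchCV. The toy data and the grid of C values are assumptions chosen for illustration. Stratified cross-validation needs at least as many samples per class as there are folds, which the current dataset does not provide for most diseases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 3 diseases, 4 symptom descriptions each.
texts = [
    "fever cough sore throat", "high fever dry cough body ache",
    "cough chills fever fatigue", "fever runny nose cough",
    "itchy skin red rash", "dry itchy skin patches",
    "red rash itching elbows", "skin rash dry itching",
    "headache nausea light sensitivity", "throbbing headache nausea",
    "severe headache blurred vision", "headache vomiting sensitivity sound",
]
labels = ["Influenza"] * 4 + ["Eczema"] * 4 + ["Migraine"] * 4

# Keeping the vectorizer inside the pipeline refits it on each training fold,
# so no vocabulary from the validation fold leaks into the features.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search grid over the regularization strength C.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(texts, labels)

print("Best C:", search.best_params_["clf__C"])
print(f"Mean cross-validated accuracy: {search.best_score_:.2f}")
```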

Contact
