arv-anshul/yt-comment-sentiment

YouTube Comment Sentiment

An end-to-end project to predict the sentiment of YouTube video comments using Machine Learning.

Overview

This project builds a sentiment analysis system for YouTube comments, complete with a FastAPI-based inference endpoint and additional API endpoints that serve insights. The development process included robust experimentation, experiment tracking, and pipeline reproduction (using MLflow and DVC).

(Architecture diagram)

Key Features

  • Inference Endpoint: Built with the FastAPI framework to classify the sentiment of comments.
  • Insights Endpoints: Additional APIs that provide analytics around comment sentiments.
  • Experiment Tracking: Leveraged MLflow for tracking experiments.
  • Pipeline Reproduction: Utilized DVC (Data Version Control) for reproducibility.
  • Text Vectorization: Used TfidfVectorizer to transform text data into feature vectors.
  • Model Selection: Experimented with various models and selected HistGradientBoostingClassifier as the best-performing classifier.

Experimentation

The experimentation phase focused on optimizing hyperparameters for the TfidfVectorizer and HistGradientBoostingClassifier model. Below is a screenshot showcasing how different hyperparameter combinations impacted accuracy:

Experiment Results

Tech Stack

  • Data Handling: Polars
  • Backend Tools: MLflow, DVC, FastAPI
  • Machine Learning: scikit-learn, NLTK
  • Frontend: pnpm, shadcn/ui, Tailwind CSS, Vite, Vue.js
  • Dev Tools: uv, pre-commit, Ruff, Zed, Loguru

Setup

Model Training

  1. Model training code is located in the backend directory.
    cd backend
  2. Use uv to sync the project's dependencies along with the training dependencies.
    uv sync --extra=training --compile-bytecode --locked --no-dev
  3. You may want to update the params.yaml file before training:
    • To update dataset source and column names.
    • To update text vectorizer class or hyperparameters.
    • To update model or hyperparameters.
  4. (Optional) You may also want to set a remote tracking URI for MLflow (the MLFLOW_TRACKING_URI environment variable) so that logs and artifacts are stored there. This project uses Dagshub for this. If you don't set it, an mlruns directory is created locally and everything is stored there.
    # https://dagshub.com/docs/integration_guide/mlflow_tracking
    export MLFLOW_TRACKING_URI=<tracking-uri>
  5. (Optional) You can also set the experiment name using the MLFLOW_EXPERIMENT_NAME environment variable.
  6. Use the DVC CLI to start the training pipeline.
    # https://dvc.org/doc/command-reference/repro
    uv run dvc repro
  7. After some time, your first model will be trained. You can then inspect the logs and artifacts in the MLflow UI.
    uv run mlflow ui
    This will start a server at http://localhost:5000 (by default).
  8. Now that you know how to train a model, you can train further models with different parameters and compare their metrics and params in the MLflow UI with its intuitive charts and graphs. See the Experimentation section.
  9. After comparing models, select the best one and register it in the Model Registry so that it can be used by the backend API server.
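
For step 3, params.yaml might look roughly like the fragment below. Every key and value here is a hypothetical sketch of the schema the steps describe (dataset source and columns, vectorizer, model); check the actual file in the repository for the real structure:

```yaml
# Hypothetical sketch of params.yaml -- not the project's actual schema.
dataset:
  source: data/comments.csv
  text_column: comment
  target_column: sentiment
vectorizer:
  class: TfidfVectorizer
  params:
    ngram_range: [1, 2]
    max_features: 20000
model:
  class: HistGradientBoostingClassifier
  params:
    learning_rate: 0.1
    max_iter: 200
```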

Backend API Server

  1. Set the MLFLOW_MODEL_URI environment variable (required by the backend server).
    export MLFLOW_MODEL_URI=<model-uri>
    See the mlflow.sklearn.load_model API reference to learn how to obtain a model URI.
  2. Sync dependencies using uv.
    cd backend
    uv sync --compile-bytecode --no-dev --locked
  3. Start FastAPI server using fastapi-cli.
    uv run fastapi run src/app.py
    The server starts at http://localhost:8000 (by default).
