An end-to-end project to predict the sentiment of YouTube video comments using Machine Learning.
This project focuses on building a sentiment analysis system for YouTube comments, complete with a FastAPI-based inference endpoint and insights-providing API endpoints. The development process included robust experimentation, tracking, and pipeline reproduction (using MLFlow and DVC).
- Inference Endpoint: Built using the FastAPI framework to classify sentiment of comments.
- Insights Endpoints: Additional APIs to provide analytics around comment sentiments.
- Experiment Tracking: Leveraged MLFlow for tracking experiments.
- Pipeline Reproduction: Utilized DVC (Data Version Control) for reproducibility.
- Text Vectorization: Used `TfidfVectorizer` for transforming text data into feature vectors.
- Model Selection: Experimented with various models and selected `HistGradientBoostingClassifier` as the best-performing classifier.
The experimentation phase focused on optimizing hyperparameters for the `TfidfVectorizer` and the `HistGradientBoostingClassifier` model. Below is a screenshot showcasing how different hyperparameter combinations impacted accuracy:
| Tech | Stack |
|---|---|
| Data Handling | |
| Backend Tools | |
| Machine Learning | |
| Frontend | |
| Dev Tools | |
- Model training code is located in the `backend` directory.

  ```shell
  cd backend
  ```

- Use uv to sync the project's dependencies as well as the training dependencies.

  ```shell
  uv sync --extra=training --compile-bytecode --locked --no-dev
  ```
- You may want to update the `params.yaml` file before training:
  - To update dataset source and column names.
  - To update text vectorizer class or hyperparameters.
  - To update model or hyperparameters.
- (Optional) You may also want to set a remote tracking URI for MLFlow (i.e. the `MLFLOW_TRACKING_URI` environment variable) so that logs and artifacts are stored there. I use Dagshub for this. If you don't set it, an `mlruns` directory will be created locally and everything will be stored there.

  ```shell
  # https://dagshub.com/docs/integration_guide/mlflow_tracking
  export MLFLOW_TRACKING_URI=<tracking-uri>
  ```
- (Optional) You can also set the experiment name using the `MLFLOW_EXPERIMENT_NAME` environment variable.
- Use the DVC CLI to start the training pipeline.

  ```shell
  # https://dvc.org/doc/command-reference/repro
  uv run dvc repro
  ```

- After some time, your first model will be trained. You can then inspect the logs and artifacts in the MLFlow UI. This will start a server at http://localhost:5000 (by default).

  ```shell
  uv run mlflow ui
  ```
- Now that you know how to train a model, you can re-train with different parameters and compare the metrics and params of each run in the MLFlow UI with intuitive charts and graphs. See the Experimentation section.
- After comparing models, select the best one and register it to the Model Registry so that the Backend API Server can use it.
- Set the `MLFLOW_MODEL_URI` environment variable (specifically for the backend server). See the `mlflow.sklearn.load_model` API reference to learn how to obtain the model URI.

  ```shell
  export MLFLOW_MODEL_URI=<model-uri>
  ```

- Sync dependencies using uv.

  ```shell
  cd backend
  uv sync --compile-bytecode --no-dev --locked
  ```

- Start the FastAPI server using `fastapi-cli`. The server is started at http://localhost:8000 (by default).

  ```shell
  uv run fastapi run src/app.py
  ```

