An AI-Powered Pipeline for Historical Review Classification (1999–2012)
This project is a high-precision sentiment analysis pipeline designed to categorize customer feedback for a coconut water brand. Using OpenAI's GPT-4o-mini, the system transforms raw JSON review data into sentiment labels (positive, negative, neutral, irrelevant).
The pipeline utilizes a 'Ralph Wiggum' agentic workflow (seen in label.py); this pattern ensures that the classification logic and API calls are autonomously iterated upon until they satisfy all rigorous unit tests before deployment.
"I'm helping!" - Keep looping until tests pass.
while not tests_passed:
rerun_sentiment_analysis()
The pipeline is modularized into three main components:
label.py(AI Engine): Interfaces with the OpenAI API using advanced prompt engineering. It features robust input validation to handle data-type anomalies and empty datasets.visualize.py: Aggregates sentiment distribution and generates a simple pie chart, automatically exporting them to a dedicatedimages/directory.main.py(Pipeline Orchestration): The "brain" of the project that handles file I/O and executes the end-to-end flow from raw JSON to final classification.
Fig 1. Output from one execution of the visualize.py script.
Instead of basic queries, I implemented a System-Prompt strategy that provides the LLM with cultural context and specific examples of nuanced sentiment. This ensures that a review like "its a ring" is correctly identified as irrelevant rather than neutral.
To ensure long-term maintainability, the project includes a comprehensive suite of automated tests (test_*.py). These verify:
- API response consistency.
- Correct visualization output formatting.
- Error handling for "Wrong input" scenarios.
- Python 3.10+
- OpenAI API Key (Stored securely via environment variables)
- Clone the repository:
git clone https://github.com/your-username/sentiment-pipeline.git - Install dependencies:
pip install -r requirements.txt - Run the core pipeline:
python main.py - Execute tests:
python test_run.py
├── images/ # Generated sentiment distribution plots
├── reviews.json # Source dataset (Coconut water reviews 1999-2012)
├── label.py # GPT-4o-mini integration logic
├── visualize.py # Data visualization module
├── main.py # Pipeline entry point
├── writeup.md # Qualitative analysis of results
└── .gitignore # Safeguards for API keys and data artifacts

