Predict whether a customer is happy (Y=1) or unhappy (Y=0) based on survey responses.
This project was built as an applied ML decision-support system, with an emphasis on
interpretability, evaluation discipline, and deployability.
Customer satisfaction surveys often collect many signals, but not all of them are equally useful for decision-making.
The goal of this project is to:
- Predict customer happiness reliably
- Identify the minimal set of features required to support that prediction
- Package the resulting model as a deployable inference service
- Source:
data/ACME-HappinessSurvey2020.csv - Target:
Y(1 = happy, 0 = unhappy) - Features:
X1–X6(Likert scale: 1–5)
- Train baseline classifiers using Stratified 5-Fold Cross-Validation
- Perform exhaustive feature subset evaluation (64 possible subsets)
- Select the smallest feature set that preserves predictive performance
- Prefer interpretability and stability over marginal accuracy gains
A Decision Tree classifier was selected due to its balance of performance and explainability.
- Achieved strong cross-validated accuracy after feature reduction
- Captures threshold-based decision logic common in survey responses
- Produces transparent, auditable decision rules
- Enables survey simplification without sacrificing predictive power
Although cross-validated accuracy (~70%) is a conservative estimate due to dataset size, the model generalizes well and meets the required accuracy threshold on the private test set.
Feature importance and subset evaluation showed that optimal features depend on model inductive bias:
- Tree-based models perform best using:
X1— delivery timelinessX5— satisfaction with courier
- Distance-based and ensemble models rely more heavily on:
X6— app usability
Given its performance and interpretability, the final deployed model uses only X1 and X5.
Two accuracy estimates are reported:
-
Cross-validated accuracy (5-fold stratified)
Used as the primary estimate of generalization performance -
Hold-out accuracy (80/20 stratified split)
Reported to align with the original problem’s evaluation criteria
Cross-validation provides a conservative estimate, while the hold-out score reflects single-split performance comparable to private test evaluation.
| Model | All Features | Selected Subset |
|---|---|---|
| Logistic Regression | 0.5717 | 0.6025 |
| KNN | 0.6258 | 0.6895 |
| Decision Tree | 0.6351 | 0.6982 |
| Random Forest | 0.6345 | 0.6751 |
| Gradient Boosting | 0.6428 | 0.6905 |
Saved artifact: models/model.joblib
- CV accuracy:
0.6818 ± 0.0737 - Hold-out accuracy (80/20):
0.8077
Beyond model training, this project includes a containerized ML inference service to demonstrate how a validated model can be served reliably.
- FastAPI-based prediction API
- Strict request/response schemas (Pydantic)
- Explicit model + metadata loading
- Lightweight request metrics
- Fully containerized runtime (Docker)
- Container-level test execution
The serving layer enforces:
- Feature contracts
- Model version metadata
- Reproducible runtime behavior
Accepts the minimal feature set used by the deployed model.
Request
{
"X1": 3,
"X5": 5
}Response
{
"prediction": 1,
"probability": 0.78,
"model_version": "2026-02-05T12:34:56Z",
"features_used": ["X1", "X5"]
}Additional endpoints:
GET /healthGET /model-infoGET /metrics
pip install -r requirements.txt
python -m src.feature_select
python -m src.train
python -m src.predict --x1 3 --x5 5docker build -t happiness-ml .
docker run -p 8000:8000 happiness-mlRun tests inside the container:
docker run -it happiness-ml pytest -q