METCS 767 Advanced Machine Learning · Boston University
Dr. Farshid Alizadeh-Shabdiz
Real-time driver drowsiness detection using a 4-stage modular computer vision + temporal ML pipeline that processes 30 FPS video and produces a live alert decision with interpretable intermediate outputs at every stage.
The problem: Drowsy driving causes ~100,000 crashes annually in the US (NHTSA), resulting in 1,550 fatalities, 71,000 injuries, and $15.9B in economic losses. Existing physiological solutions (EEG, EOG) are intrusive and can't scale to millions of vehicles. A cabin-camera approach is practical and deployable.
Why this architecture: The system is decomposed into 4 sequential tasks rather than a single black-box end-to-end model. Each stage is independently interpretable, debuggable, and replaceable. This is intentional design, not a limitation.
Video Input (30 FPS, RGB 640×480)
│
▼ Task 1: Face Detection
RetinaFace (ResNet-50 + FPN)
→ Bounding box + face landmarks
→ Loss: Focal Loss + Smooth L1
│
▼ Task 2: Facial Keypoint Regression
ResNet-18 backbone (ImageNet pre-trained)
→ 20 landmarks (6 per eye, 8 inner mouth) + confidence scores
→ Loss: MSE + Confidence-Aware Loss (novel)
│
▼ Task 3: EAR / MAR Feature Extraction
Eye Aspect Ratio (EAR) — quantifies eye opening
Mouth Aspect Ratio (MAR) — quantifies yawning
→ Per-second aggregation: median, max, fraction closed, yawn count
│
▼ Task 4: Temporal Drowsiness Classification
5-second sliding window → feature vector
Random Forest Classifier
→ P(drowsy), threshold τ*, ALARM decision
Datasets: WIDER Face (train/val) → FDDB (cross-dataset test)
Architecture: RetinaFace with ResNet-50 backbone + Feature Pyramid Network (FPN)
Loss: Focal Loss (handles anchor class imbalance) + Smooth L1 (stabilizes box regression)
Model selection: RetinaFace chosen after FasterRCNN-ResNet18 failed to generalize.
| Model | Dataset | mAP@0.5 | DICE | Notes |
|---|---|---|---|---|
| Custom ResNet-50 | WIDER | 0.601 | 0.643 | Initial attempt |
| RetinaFace (pre-trained) | WIDER | 0.6123 | 0.7159 | Selected |
| RetinaFace (pre-trained) | FDDB | 0.9528 | 0.9143 | Cross-dataset test |
The 0.9528 mAP on FDDB demonstrates strong cross-dataset generalization. The gap between WIDER (0.6123) and FDDB (0.9528) reflects domain shift: WIDER contains extreme scales and occlusions. For driving scenarios, 0.6123 is more than sufficient — false negatives are recoverable via temporal smoothing.
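As a rough illustration of the two detection loss terms named above, the sketch below implements the standard binary focal loss and Smooth L1 for scalar inputs. The defaults `alpha=0.25`, `gamma=2.0`, and `beta=1.0` are common choices from the literature and are assumptions here, not values reported for this project.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one anchor.

    p: predicted probability that the anchor is a face, y: label (0 or 1).
    alpha and gamma are assumed defaults, not this project's settings.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights anchors that are already classified well
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 for one box coordinate: quadratic near zero, linear beyond beta."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

# An easy background anchor contributes far less than a hard one.
easy = focal_loss(0.01, 0)
hard = focal_loss(0.90, 0)
print(easy < hard)  # True
```

The down-weighting of easy anchors is what makes the focal term suitable for dense detectors, where background anchors vastly outnumber faces.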
Dataset: 300W (600 training images, 68 landmarks → 20 selected: 6 per eye, 8 inner mouth)
Architecture: ResNet-18 backbone (ImageNet pre-trained, 11.5M parameters) → Global Average Pooling → FC(512) → ReLU → Dropout(0.3) → FC(60) → 20 landmarks × (x, y, confidence)
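To make the FC(60) output concrete, here is a minimal sketch of how a flat 60-value vector could be split into 20 `(x, y, confidence)` triples. The interleaved layout and the sigmoid applied to the confidence value are assumptions for illustration; the source only states the 20 × 3 shape.

```python
import math

NUM_LANDMARKS = 20  # 6 per eye + 8 inner mouth, as stated above

def decode_head_output(raw):
    """Split a flat 60-value head output into 20 (x, y, confidence) tuples.

    Assumes values are interleaved as x0, y0, c0, x1, y1, c1, ... and that
    the confidence channel is a logit squashed with a sigmoid (hypothetical).
    """
    if len(raw) != NUM_LANDMARKS * 3:
        raise ValueError(f"expected {NUM_LANDMARKS * 3} values, got {len(raw)}")
    landmarks = []
    for i in range(NUM_LANDMARKS):
        x, y, logit = raw[3 * i: 3 * i + 3]
        conf = 1.0 / (1.0 + math.exp(-logit))  # map logit to (0, 1)
        landmarks.append((x, y, conf))
    return landmarks

points = decode_head_output([0.0] * 60)
print(len(points), points[0])  # 20 (0.0, 0.0, 0.5)
```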
Standard MSE regression treats all predictions equally. Our loss function explicitly trains the network to assign high confidence when accurate and low confidence when uncertain:
L = Σ [ confidence_i × ||pred_i - gt_i||² + λ × (1 - confidence_i) × max(0, threshold - ||pred_i - gt_i||) ]
This enables downstream EAR/MAR computation to weight unreliable landmarks less — critical for occlusion (glasses, hair, poor lighting).
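The loss above can be sketched in plain Python to show how the two terms pull against each other. This is a direct transcription of the formula, not the project's training code; the values `lam=1.0` and `threshold=0.05` are placeholders chosen for the example.

```python
import math

def confidence_aware_loss(pred, gt, conf, lam=1.0, threshold=0.05):
    """Sum over landmarks of
    conf * err^2 + lam * (1 - conf) * max(0, threshold - err).

    pred, gt: lists of (x, y) points; conf: list of values in [0, 1].
    lam and threshold are hypothetical settings for illustration.
    """
    total = 0.0
    for (px, py), (gx, gy), c in zip(pred, gt, conf):
        err = math.hypot(px - gx, py - gy)
        accuracy_term = c * err ** 2                               # confident and wrong is costly
        hedging_term = lam * (1.0 - c) * max(0.0, threshold - err)  # accurate but unsure is costly
        total += accuracy_term + hedging_term
    return total

gt = [(0.0, 0.0)]
print(confidence_aware_loss([(0.0, 0.0)], gt, [1.0]))  # 0.0: accurate and confident
print(confidence_aware_loss([(0.0, 0.0)], gt, [0.0]))  # 0.05: accurate but unconfident
print(confidence_aware_loss([(1.0, 0.0)], gt, [1.0]))  # 1.0: wrong but confident
print(confidence_aware_loss([(1.0, 0.0)], gt, [0.0]))  # 0.0: wrong and unconfident
```

The four cases show the intended behavior: the loss is lowest when confidence tracks accuracy, in either direction.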
Results on 300W validation set:
| Metric | Value |
|---|---|
| Validation MSE (per sample) | 0.0002 – 0.0009 |
| Average landmark confidence | ~0.958 |
| Pixel-level error at 256×256 | ~0.71 px (negligible for EAR/MAR) |
Sample predictions (green = ground truth, red = predicted):
Samples 1–5 show tight overlap between GT and predictions across diverse faces, poses, lighting conditions, and ages. Average confidence 0.958 indicates the model learned to distinguish reliable from unreliable predictions.
Eye Aspect Ratio (EAR), based on Soukupová & Čech (2016):
EAR = (||p2-p6|| + ||p3-p5||) / (2 × ||p1-p4||)
- EAR ~0.20–0.25 → Eyes open
- EAR ~0.15–0.18 → Eyes partially closed
- EAR ~0.10–0.15 → Eyes closed
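A minimal sketch of the EAR formula above, assuming the six eye landmarks arrive ordered p1..p6 with p1 and p4 at the eye corners, p2 and p3 on the upper lid, and p5 and p6 on the lower lid. The function name and the sample coordinates are illustrative only.

```python
import math

def eye_aspect_ratio(eye):
    """EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|) for six (x, y) points p1..p6."""
    p1, p2, p3, p4, p5, p6 = eye
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("eye corners coincide; EAR is undefined")
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    return vertical / (2.0 * horizontal)

# Synthetic eye: 4 units wide, lids 1 unit apart at both measurement points.
open_eye = [(0, 0), (1, 0.5), (3, 0.5), (4, 0), (3, -0.5), (1, -0.5)]
print(eye_aspect_ratio(open_eye))  # 0.25
```

Because the ratio normalizes lid distance by eye width, it stays roughly stable as the face moves closer to or farther from the camera.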
Mouth Aspect Ratio (MAR):
MAR = Σ(vertical distances) / (2 × horizontal mouth width)
- MAR ~0.10–0.12 → Mouth closed
- MAR ~0.13+ → Mouth open / yawning
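The MAR formula can be sketched the same way. The source does not say which vertical distances are summed, so this example assumes the eight inner-mouth points are ordered p1..p8 around the lip (p1 and p5 at the corners, p2..p4 on the upper lip, p6..p8 on the lower lip) and pairs them as (p2, p8), (p3, p7), (p4, p6).

```python
import math

def mouth_aspect_ratio(mouth):
    """MAR = sum of three vertical lip distances / (2 * mouth width).

    mouth: eight (x, y) inner-lip points p1..p8; the ordering and the
    pairing of upper and lower points are assumptions for this sketch.
    """
    p1, p2, p3, p4, p5, p6, p7, p8 = mouth
    width = math.dist(p1, p5)
    if width == 0:
        raise ValueError("mouth corners coincide; MAR is undefined")
    vertical = math.dist(p2, p8) + math.dist(p3, p7) + math.dist(p4, p6)
    return vertical / (2.0 * width)

# Synthetic mouth: 4 units wide, lips 0.4 units apart at three points.
mouth = [(0, 0), (1, 0.2), (2, 0.2), (3, 0.2), (4, 0), (3, -0.2), (2, -0.2), (1, -0.2)]
print(round(mouth_aspect_ratio(mouth), 3))  # 0.15
```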
Thresholds calibrated on 300W validation set (100 alert subjects), then tuned for NTHU-DDD driving conditions:
EAR threshold: 0.220 (more sensitive to closed eyes with glasses/backlighting)
MAR threshold: 0.160
Task 3 Classification Performance on NTHU-DDD:
| Condition | Eye Closed Accuracy | Yawn Accuracy |
|---|---|---|
| Drowsy frames | 74.78% | 68.11% |
| Alert frames | 82.43% | 53.28% |
Live output example: Eye:0 Yawn:0 EAR:0.221 MAR:0.079 overlaid on video frame in real-time.
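The per-frame flags in the overlay can be derived from the two thresholds as sketched below. The comparison directions (eye closed when the averaged EAR falls below 0.220, yawn when MAR exceeds 0.160) are assumptions, and the function names are hypothetical.

```python
EAR_THRESHOLD = 0.220  # from the calibration above
MAR_THRESHOLD = 0.160

def frame_flags(ear_left, ear_right, mar):
    """Return (ear_avg, eye_closed, yawn) for one frame."""
    ear_avg = (ear_left + ear_right) / 2.0
    eye_closed = int(ear_avg < EAR_THRESHOLD)  # assumed: strictly below means closed
    yawn = int(mar > MAR_THRESHOLD)            # assumed: strictly above means yawn
    return ear_avg, eye_closed, yawn

def overlay_text(ear_avg, eye_closed, yawn, mar):
    """Format the status line in the style of the live output example."""
    return f"Eye:{eye_closed} Yawn:{yawn} EAR:{ear_avg:.3f} MAR:{mar:.3f}"

ear_avg, eye_closed, yawn = frame_flags(0.221, 0.221, 0.079)
print(overlay_text(ear_avg, eye_closed, yawn, 0.079))
# Eye:0 Yawn:0 EAR:0.221 MAR:0.079
```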
Input: 5-second sliding windows of per-second EAR/MAR aggregated features
Feature vector: [eye_closed_avg, eye_closed_max, eye_closed_frac, MAR_avg, MAR_max, yawn_count, yawn_prob]
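One plausible way to build the 7-value feature vector from per-second summaries is sketched below. The exact definition of each feature is not given in the source, so the per-second keys (`closed_frac`, `mar_median`, `mar_max`, `yawn`) and the formulas in the comments are assumptions; only the 5-second window length and the feature names come from the text.

```python
from statistics import mean

WINDOW_SECONDS = 5  # window length stated in the text

def window_features(seconds):
    """Collapse a list of per-second summaries into one feature vector."""
    closed = [s["closed_frac"] for s in seconds]   # share of closed-eye frames per second
    mar_medians = [s["mar_median"] for s in seconds]
    yawns = sum(s["yawn"] for s in seconds)
    return [
        mean(closed),                                # eye_closed_avg
        max(closed),                                 # eye_closed_max
        sum(c > 0.5 for c in closed) / len(closed),  # eye_closed_frac (assumed: mostly-closed seconds)
        mean(mar_medians),                           # MAR_avg
        max(s["mar_max"] for s in seconds),          # MAR_max
        yawns,                                       # yawn_count
        yawns / len(seconds),                        # yawn_prob (assumed: yawning seconds / window)
    ]

def sliding_windows(per_second, size=WINDOW_SECONDS, step=1):
    """Slide a fixed-size window over the per-second summaries."""
    return [
        window_features(per_second[i:i + size])
        for i in range(0, len(per_second) - size + 1, step)
    ]

second = {"closed_frac": 1.0, "mar_median": 0.1, "mar_max": 0.2, "yawn": 0}
print(len(sliding_windows([second] * 7)))  # 3 windows from 7 seconds
```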
Five models were compared on the NTHU-DDD test set (89 sequences, 441 five-second windows):
| Model | Test F1 | Val F1 | Decision |
|---|---|---|---|
| Logistic Regression | 0.752 | — | Baseline |
| SVM | ~0.78 | — | Rejected |
| Gradient Boosting | ~0.85 | — | Rejected |
| LSTM | ~0.90 | 1.000 | ❌ Rejected — overfitting |
| Random Forest | 0.923 | 0.923 | ✅ Selected |
The LSTM trap: the LSTM achieved a perfect F1 = 1.000 on validation, which looks ideal, but fell to roughly 0.90 on the test set, a clear sign of overfitting. Random Forest maintained F1 = 0.923 across train/val/test, demonstrating genuine generalization rather than memorization. For a safety-critical system, consistent generalization matters more than peak validation performance.
Optimal threshold: τ* = 0.510 (F1-optimized on validation set)
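An F1-optimized threshold of this kind can be found with a simple sweep over candidate values on the validation set. The sketch below uses toy probabilities rather than the project's data, and assumes the alarm rule is `P(drowsy) >= tau`; the function names are hypothetical.

```python
def f1_at_threshold(probs, labels, tau):
    """F1 score when predicting drowsy (1) whenever p >= tau."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= tau and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= tau and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < tau and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, labels, step=0.001):
    """Return the candidate tau in (0, 1) with the highest F1 (first one on ties)."""
    candidates = [i * step for i in range(1, int(round(1 / step)))]
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))

# Toy validation set: two awake windows, two drowsy windows.
val_probs = [0.10, 0.40, 0.60, 0.90]
val_labels = [0, 0, 1, 1]
tau = best_threshold(val_probs, val_labels)
print(f1_at_threshold(val_probs, val_labels, tau))  # 1.0
```

Choosing the threshold on validation data and then freezing it for the test set keeps the reported test F1 honest.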
Confusion matrix at τ* = 0.510:
| | Predicted Awake | Predicted Drowsy |
|---|---|---|
| Actual Awake | 40 (TN) | 2 (FP) |
| Actual Drowsy | 5 (FN) | 42 (TP) |
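As a quick check, the headline metrics can be recomputed from the four cells of the matrix. The cells sum to 89, which matches the number of test sequences quoted earlier.

```python
# Cells from the confusion matrix above
tn, fp, fn, tp = 40, 2, 5, 42

precision = tp / (tp + fp)                  # 42 / 44
recall = tp / (tp + fn)                     # 42 / 47
f1 = 2 * tp / (2 * tp + fp + fn)            # 84 / 91
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 82 / 89

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.955 recall=0.894 f1=0.923 accuracy=0.921
```

The recomputed F1 of 0.923 agrees with the Random Forest test score in the model comparison table.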
[FRAME METRICS] EAR_R=0.1760 EAR_L=0.1629
EAR_avg=0.1694 Eye_closed=1
MAR=0.1371 Yawn=0
[5-SEC WINDOW] Eye_closed_avg=0.947
MAR_avg=0.1359 Yawn_prob=0.920
[DROWSINESS MODEL] P(drowsy)=0.412
Threshold=0.619 F1_Score=0.725 ALARM=0
| Scenario | RF Prediction | LSTM Prediction | Ground Truth |
|---|---|---|---|
| Alert driver (eyes open) | AWAKE P=0.060 ✅ | DROWSY P=0.429 ❌ | Awake |
| Drowsy driver (eyes closed) | DROWSY P=1.000 ✅ | DROWSY P=0.524 ✅ | Drowsy |
RF correctly identifies the alert driver, whereas the LSTM's false positive would trigger an unnecessary alert and, over time, desensitize the driver. Both models catch the drowsy driver, but RF does so with higher confidence (P=1.000 vs P=0.524).
| Stage | Metric | Value |
|---|---|---|
| Task 1: Face Detection | mAP@0.5 (FDDB cross-dataset) | 0.9528 |
| Task 1: Face Detection | DICE (FDDB) | 0.9143 |
| Task 2: Keypoint Regression | Avg landmark confidence | 0.958 |
| Task 2: Keypoint Regression | Validation MSE | 0.0002–0.0009 |
| Task 3: EAR/MAR | Drowsy frame eye-closed accuracy | 74.78% |
| Task 3: EAR/MAR | Alert frame eye-closed accuracy | 82.43% |
| Task 4: Temporal Classification | F1-Score | 0.923 |
| Task 4: Temporal Classification | ROC-AUC | 0.964 |
| Stage | Loss | Reason |
|---|---|---|
| Face Detection | Focal Loss + Smooth L1 | Focal handles anchor class imbalance in dense detectors; Smooth L1 stabilizes bounding-box regression |
| Keypoint Regression | MSE + Confidence-Aware | MSE for coordinate accuracy; confidence modeling improves reliability for downstream geometric estimation |
| Eye/Yawn Classification | Binary Cross-Entropy | Natural loss for binary probabilistic outputs |
| Temporal Classification | BCE / Gini Impurity (RF) | Threshold-calibrated probabilities; RF splits on Gini impurity, a node-purity criterion closely related to cross-entropy |
pip install -r requirements.txt
# Full pipeline on video input
python pipeline/task1_face_detection.py --input video.mp4
python pipeline/task2_keypoint_regression.py
python pipeline/task3_ear_mar.py
python pipeline/task4_classification.py --threshold 0.510
- Soukupová & Čech (2016). Real-Time Eye Blink Detection using Facial Landmarks.
- Deng et al. (2019). RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. CVPR.
- He et al. (2016). Deep Residual Learning for Image Recognition. CVPR.
- NTHU-DDD Dataset: Driver drowsiness detection benchmark.
Aryan Meena · LinkedIn · araj7042@gmail.com · Boston University, METCS 767 Advanced Machine Learning