This document outlines the evolution of the Vehicle Risk Prediction model, the challenges of working with synthetic data, and the path to achieving a realistic predictive system.
Initially, the XGBoost model achieved an AUC-ROC score of 1.0000. While statistically perfect, this indicated a critical flaw in the machine learning design: Label Leakage.
In the first iteration, the model was "cheating" by observing the exact same variables used to calculate the ground truth health score.
- The Target: `is_at_risk` was derived from a linear SQL formula in the Gold layer.
- The Features: The model was given the exact averages (e.g., `avg_coolant_temp`) used in that formula.
- The Result: The model didn't learn mechanical failure patterns; it simply reverse-engineered the SQL thresholds.
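
The leakage effect can be reproduced in a few lines. The sketch below is illustrative only: the fleet data is randomly generated, and `THRESHOLD_C` is a hypothetical stand-in for the Gold-layer formula, which is not reproduced in this document. It shows that a feature used to compute the label yields a perfect AUC on its own, while an unrelated feature stays near chance.

```python
import random

def auc_roc(labels, scores):
    """Rank-based AUC: probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)

# Hypothetical fleet: one averaged coolant reading per vehicle.
avg_coolant_temp = [rng.uniform(80.0, 120.0) for _ in range(500)]

# Hypothetical stand-in for the Gold-layer rule: the label is computed
# directly from the same column the model later receives as a feature.
THRESHOLD_C = 105.0
is_at_risk = [1 if t > THRESHOLD_C else 0 for t in avg_coolant_temp]

# Scoring with the leaked column separates the classes perfectly.
leaky_auc = auc_roc(is_at_risk, avg_coolant_temp)

# A feature unrelated to the label lands near 0.5.
noise = [rng.random() for _ in avg_coolant_temp]
noise_auc = auc_roc(is_at_risk, noise)

print(f"leaky feature AUC: {leaky_auc:.4f}")
print(f"noise feature AUC: {noise_auc:.4f}")
```

Any sufficiently flexible model, XGBoost included, will find such a threshold, which is why a score of exactly 1.0 is better read as a warning sign than as a result.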
To break the deterministic link, we refactored the simulator to move from random offsets to a physics-based correlation engine.
- Air Flow (MAF): Realistically scales with both RPM and Throttle Position.
- Thermal Dynamics: Engine temperature is now a function of Engine Load. High RPMs under heavy throttle cause the coolant temperature to rise above average.
- Alternator Charging: Battery voltage fluctuations are tied to RPM, simulating the alternator's charging cycle.
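
A minimal sketch of what such a correlation engine might look like follows. The function name `simulate_reading` and every coefficient in it are illustrative placeholders, not the simulator's actual values; only the direction of each relationship (MAF rising with RPM and throttle, coolant temperature rising with load, voltage tied to RPM) comes from the list above.

```python
import random

def simulate_reading(rpm, throttle, rng):
    """Generate one sensor reading whose channels depend on engine state.

    All coefficients are made-up placeholders chosen for readability.
    """
    # Crude engine-load proxy in [0, 1]: high RPM under heavy throttle.
    load = (rpm / 6000.0) * throttle

    # Air flow scales with both RPM and throttle position (g/s).
    maf = 2.0 + 0.02 * rpm * (0.3 + 0.7 * throttle) + rng.gauss(0, 0.5)

    # Coolant temperature rises with engine load (deg C).
    coolant_temp = 88.0 + 20.0 * load + rng.gauss(0, 1.0)

    # Alternator output rises with RPM and saturates near its regulated voltage.
    battery_voltage = 12.4 + 1.8 * min(rpm / 2000.0, 1.0) + rng.gauss(0, 0.05)

    return {
        "rpm": rpm,
        "throttle": throttle,
        "maf": maf,
        "coolant_temp": coolant_temp,
        "battery_voltage": battery_voltage,
    }

rng = random.Random(42)
idle = simulate_reading(rpm=800, throttle=0.05, rng=rng)
hard = simulate_reading(rpm=5000, throttle=0.9, rng=rng)
print(idle)
print(hard)
```

Because each channel is driven by shared inputs rather than independent random offsets, the sensors move together in physically plausible ways, and a deviation from that joint pattern becomes something a model can learn.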
Latent Factors (Hidden Variables)
We introduced "Latent Factors" that affect the sensors but are not provided to the model:
- Ambient Temperature: A hidden weather variable that offsets engine heat. The model must now distinguish between "Normal high heat" (a hot day) and "Anomalous high heat" (engine failure).
- Latent Pre-Failure Noise: Added a "latent instability" state where sensors begin to "smell bad" (increased jitter) before they actually trigger a hard anomaly.
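
The two latent factors can be sketched as follows. This is an assumption-laden illustration: `coolant_trace`, its parameters, and the numeric offsets are hypothetical. The point is that a hot day shifts the mean of the trace while leaving it smooth, whereas latent instability leaves the mean alone and widens the jitter.

```python
import random
import statistics

def coolant_trace(ambient_c, unstable, rng, n=200):
    """Coolant readings shaped by two hidden inputs.

    ambient_c and unstable are latent: they influence the trace but
    would never be passed to the model as features.
    """
    base = 90.0 + 0.4 * (ambient_c - 20.0)   # hot weather raises the baseline
    jitter = 2.5 if unstable else 0.5         # pre-failure instability widens the noise
    return [base + rng.gauss(0, jitter) for _ in range(n)]

rng = random.Random(7)
hot_day = coolant_trace(ambient_c=38.0, unstable=False, rng=rng)
pre_failure = coolant_trace(ambient_c=20.0, unstable=True, rng=rng)

print(f"hot day:     mean={statistics.fmean(hot_day):.1f}  stdev={statistics.stdev(hot_day):.2f}")
print(f"pre-failure: mean={statistics.fmean(pre_failure):.1f}  stdev={statistics.stdev(pre_failure):.2f}")
```

A model that only sees an average temperature cannot tell these two vehicles apart reliably, which motivates features that capture variability and rate of change.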
We removed the "Smoking Gun" features from train.py:
- Removed: `avg_coolant_temp` and `avg_battery_voltage` (the direct SQL inputs).
- Added: `maf_rpm_ratio` (efficiency), `volt_volatility` (electrical stability), and `max_coolant_temp_delta` (thermal rate of change).
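
One plausible way to compute the three added features is sketched below. The exact definitions in `train.py` are not shown in this document, so the formulas here (mean MAF/RPM ratio, population standard deviation of voltage, largest step between consecutive coolant readings) and the input schema are assumptions.

```python
import statistics

def engineer_features(readings):
    """Build the three derived features from one vehicle's time-ordered readings.

    Each reading is a dict with rpm, maf, coolant_temp, and battery_voltage
    keys (an assumed schema).
    """
    ratios = [r["maf"] / r["rpm"] for r in readings if r["rpm"] > 0]
    volts = [r["battery_voltage"] for r in readings]
    temps = [r["coolant_temp"] for r in readings]
    # Absolute change between consecutive coolant readings.
    deltas = [abs(b - a) for a, b in zip(temps, temps[1:])]
    return {
        "maf_rpm_ratio": statistics.fmean(ratios) if ratios else 0.0,
        "volt_volatility": statistics.pstdev(volts),
        "max_coolant_temp_delta": max(deltas, default=0.0),
    }

readings = [
    {"rpm": 1000, "maf": 10.0, "coolant_temp": 90.0, "battery_voltage": 13.8},
    {"rpm": 2000, "maf": 20.0, "coolant_temp": 92.0, "battery_voltage": 14.0},
    {"rpm": 4000, "maf": 40.0, "coolant_temp": 97.0, "battery_voltage": 14.2},
]
features = engineer_features(readings)
print(features)
```

None of these values appears in the scoring formula directly, so the model has to infer risk from behavior rather than read it off a threshold.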
To prevent a "Data Swamp" where 70% of vehicles were marked as "Fair," we implemented Elastic Scoring:
- Dead Zones: No penalties are applied for normal physical fluctuations (e.g., up to 108°C under load).
- Exponential Scaling: Penalties for degraded vehicles now scale non-linearly (`pow(delta, 1.8)`), ensuring that truly failing vehicles are clearly separated from healthy ones.
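
The two rules combine naturally into a single penalty function. In the sketch below, the 108°C dead zone and the 1.8 exponent come from the list above; the function names, the base score of 100, and the clamping at zero are assumptions made for illustration.

```python
def thermal_penalty(coolant_temp_c, dead_zone_c=108.0, exponent=1.8):
    """Return zero inside the dead zone, then a penalty of delta ** exponent."""
    delta = max(0.0, coolant_temp_c - dead_zone_c)
    return pow(delta, exponent)

def health_score(coolant_temp_c, base=100.0):
    """Hypothetical health score: base minus penalty, clamped at zero."""
    return max(0.0, base - thermal_penalty(coolant_temp_c))

for temp in (95.0, 108.0, 110.0, 118.0):
    print(f"{temp:6.1f} C  penalty={thermal_penalty(temp):7.2f}  score={health_score(temp):6.2f}")
```

Strictly speaking, this curve is a power law rather than an exponential, but the effect is the one described: an excursion 10°C past the dead zone is penalized roughly 18 times more than one 2°C past it, even though the excursion is only 5 times larger.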
After these refactors, the XGBoost model achieved a more realistic AUC-ROC score of 0.9786.
| Metric | Result |
|---|---|
| Accuracy | 94% |
| Healthy (N) | 446 |
| At-Risk (N) | 178 |
| AUC-ROC | 0.9786 |
The model is now learning latent mechanical signatures rather than simple thresholds. It can identify a vehicle as "At-Risk" even before it triggers a hard anomaly by observing subtle changes in efficiency ratios and sensor volatility, which is how a production-grade predictive maintenance system is expected to operate.