Bandit assumes stationarity in a non-stationary system

The system environment is non-stationary due to cache warm-up, traffic drift, evolving heuristics, and routing changes. However, the bandit currently assumes stable reward distributions.

This mismatch can cause policy decay or overfitting to early conditions.