The system environment is non-stationary due to cache warm-up, traffic drift, evolving heuristics, and routing changes. However, the bandit currently assumes stable reward distributions.
This mismatch can cause policy decay or overfitting to early conditions.
The system environment is non-stationary due to cache warm-up, traffic drift, evolving heuristics, and routing changes. However, the bandit currently assumes stable reward distributions.
This mismatch can cause policy decay or overfitting to early conditions.