fix(api): suppress stuck-instance alerts for hosts in maintenance mode #1828
Signed-off-by: Krishna Dhulipala <kdhulipala@nvidia.com>
classifications: vec![
    health_report::HealthAlertClassification::prevent_allocations(),
    health_report::HealthAlertClassification::suppress_external_alerting(),
    health_report::HealthAlertClassification::exclude_from_state_machine_sla(),
Just semantics, but this will also make time_in_state_above_sla false for non-assigned machines. There’s a Grafana dashboard panel that uses forge_machines_per_state_above_sla{fresh="true", state!="assigned"}, so putting one of those machines into maintenance mode will now remove it from that view too.
I think that’s expected behavior, but maybe update the PR description/comments to call out that this applies to all machine state SLA tracking, not just assigned machines. The unit test also doesn’t create an instance, so it’s already validating this broader behavior 😄
/ok to test fe74b65
Description
Stuck Assigned-substate machines sometimes take days to resolve, and operators put them into maintenance mode to silence the alert. The existing setup does not suppress alerts for these machines when they are put into maintenance mode, and we are still receiving the PD alerts.
This PR makes SetMaintenance::Enable also write ExcludeFromStateMachineSla, matching what the admin-cli InternalMaintenance template has been doing, so state_sla() short-circuits to no_sla() and a host in maintenance stops contributing to stuck-instance alerts regardless of which state or substate it's in.
Scope
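As a rough illustration of the short-circuit described above, and of why it is state-agnostic, here is a minimal, self-contained sketch. It is not the code from this PR: the `Classification`, `State`, and `StateSla` types, the SLA thresholds, and the function signatures are all hypothetical stand-ins chosen for the example. Only the names `state_sla`, `no_sla`, and the three classifications come from the description and diff.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the alert classifications on a health report.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Classification {
    PreventAllocations,
    SuppressExternalAlerting,
    ExcludeFromStateMachineSla,
}

/// Hypothetical, heavily simplified machine states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Ready,
    Assigned,
}

/// Hypothetical SLA result; `sla == None` means no SLA applies.
#[derive(Debug, PartialEq, Eq)]
struct StateSla {
    sla: Option<Duration>,
    time_in_state_above_sla: bool,
}

/// A machine with no SLA can never be above it.
fn no_sla() -> StateSla {
    StateSla { sla: None, time_in_state_above_sla: false }
}

/// Sketch of the SLA lookup; the thresholds are made up for illustration.
fn state_sla(state: State, time_in_state: Duration, classifications: &[Classification]) -> StateSla {
    // The exclusion is checked before the state is inspected, which is why
    // it applies to every state and substate, not only Assigned.
    if classifications.contains(&Classification::ExcludeFromStateMachineSla) {
        return no_sla();
    }
    let sla = match state {
        State::Ready => Duration::from_secs(30 * 60),
        State::Assigned => Duration::from_secs(60 * 60),
    };
    StateSla { sla: Some(sla), time_in_state_above_sla: time_in_state > sla }
}

/// Classifications written when maintenance is enabled (per the diff in this PR).
fn maintenance_classifications() -> Vec<Classification> {
    vec![
        Classification::PreventAllocations,
        Classification::SuppressExternalAlerting,
        Classification::ExcludeFromStateMachineSla,
    ]
}

fn main() {
    let stuck_for = Duration::from_secs(3 * 24 * 60 * 60); // three days
    for state in [State::Ready, State::Assigned] {
        let normal = state_sla(state, stuck_for, &[]);
        let in_maintenance = state_sla(state, stuck_for, &maintenance_classifications());
        println!(
            "{:?}: above SLA normally = {}, in maintenance = {}",
            state, normal.time_in_state_above_sla, in_maintenance.time_in_state_above_sla
        );
    }
}
```

In this sketch the early return happens before the `match` on the state, so a machine in maintenance reports no SLA breach whichever state it is in.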
The motivating case is Assigned-substate machines paging via stuckInstanceCritical, but the fix is applied at the state_sla() layer, which means it applies to all machine state-SLA tracking, not only the Assigned family.
stuckInstanceCritical and manyStuckInstancesCritical already filter on {state="assigned"}, so the on-call behavior is exactly what the bug report asked for.
Queries on forge_machines_per_state_above_sla with broader filters (e.g. state!="assigned") will also stop counting a machine while it's in maintenance.
Type of Change
Testing