fix(api): suppress stuck-instance alerts for hosts in maintenance mode by kdhulipala-wq · Pull Request #1828 · NVIDIA/infra-controller

kdhulipala-wq · 2026-05-19T23:57:03Z

Description

Stuck Assigned-substate machines sometimes take days to resolve, and operators put them into maintenance mode to silence the alert. The existing setup does not suppress alerts for these machines when they are put into maintenance mode, and we are still receiving the PD alerts.

This PR makes SetMaintenance::Enable also write ExcludeFromStateMachineSla — matching what the admin-cli InternalMaintenance template has been doing, so state_sla() short-circuits to no_sla() and a host in maintenance stops contributing to stuck-instance alerts regardless of which state or substate it's in.

Scope

The motivating case is Assigned-substate machines paging via stuckInstanceCritical, but the fix is applied at the state_sla() layer, which means it applies to all machine state-SLA trackingm, not only the Assigned family.

Alerts : stuckInstanceCritical and manyStuckInstancesCritical already filter on {state="assigned"}, so the on-call behavior is exactly what the bug report asked for.
Grafana / observability : panels that query forge_machines_per_state_above_sla with broader filters (e.g. state!="assigned") will also stop counting a machine while it's in maintenance.

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Signed-off-by: Krishna Dhulipala <kdhulipala@nvidia.com>

copy-pr-bot · 2026-05-19T23:57:06Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

krish-nvidia · 2026-05-21T15:33:41Z

                                classifications: vec![
                                    health_report::HealthAlertClassification::prevent_allocations(),
                                    health_report::HealthAlertClassification::suppress_external_alerting(),
+                                    health_report::HealthAlertClassification::exclude_from_state_machine_sla(),


Just semantics, but this will also make time_in_state_above_sla false for non-assigned machines. There’s a Grafana dashboard panel that uses forge_machines_per_state_above_sla{fresh="true", state!="assigned"}, so putting one of those machines into maintenance mode will now remove it from that view too.

I think that’s expected behavior, but maybe update the PR description/comments to call out that this applies to all machine state SLA tracking, not just assigned machines. The unit test also doesn’t create an instance, so it’s already validating this broader behavior 😄

ajf · 2026-05-26T20:20:42Z

/ok to test fe74b65

kdhulipala-wq added 2 commits May 19, 2026 20:35

Suppress alerts for machines in maintenance mode

f521226

Signed-off-by: Krishna Dhulipala <kdhulipala@nvidia.com>

Add regression test for maintenance-mode SLA suppression

fe74b65

Signed-off-by: Krishna Dhulipala <kdhulipala@nvidia.com>

kdhulipala-wq marked this pull request as ready for review May 20, 2026 19:41

kdhulipala-wq requested review from a team and Coco-Ben as code owners May 20, 2026 19:41

kdhulipala-wq requested a review from krish-nvidia May 20, 2026 19:44

krish-nvidia approved these changes May 21, 2026

View reviewed changes

Coco-Ben approved these changes May 26, 2026

View reviewed changes

Merge branch 'main' into kcd-suppress-maintenance-alerts

cdac431

ajf enabled auto-merge (squash) May 26, 2026 21:22

ajf merged commit 6b6db15 into NVIDIA:main May 26, 2026
52 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(api): suppress stuck-instance alerts for hosts in maintenance mode#1828

fix(api): suppress stuck-instance alerts for hosts in maintenance mode#1828
ajf merged 3 commits into
NVIDIA:mainfrom
kdhulipala-wq:kcd-suppress-maintenance-alerts

kdhulipala-wq commented May 19, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 19, 2026

Uh oh!

krish-nvidia May 21, 2026 •

edited

Loading

Uh oh!

kdhulipala-wq May 26, 2026

Uh oh!

ajf commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

kdhulipala-wq commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Scope

Type of Change

Testing

Uh oh!

copy-pr-bot Bot commented May 19, 2026

Uh oh!

krish-nvidia May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kdhulipala-wq May 26, 2026

Choose a reason for hiding this comment

Uh oh!

ajf commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kdhulipala-wq commented May 19, 2026 •

edited

Loading

krish-nvidia May 21, 2026 •

edited

Loading