Skip to content

fix(api): suppress stuck-instance alerts for hosts in maintenance mode#1828

Merged
ajf merged 3 commits into
NVIDIA:mainfrom
kdhulipala-wq:kcd-suppress-maintenance-alerts
May 26, 2026
Merged

fix(api): suppress stuck-instance alerts for hosts in maintenance mode#1828
ajf merged 3 commits into
NVIDIA:mainfrom
kdhulipala-wq:kcd-suppress-maintenance-alerts

Conversation

@kdhulipala-wq
Copy link
Copy Markdown
Contributor

@kdhulipala-wq kdhulipala-wq commented May 19, 2026

Description

Stuck Assigned-substate machines sometimes take days to resolve, and operators put them into maintenance mode to silence the alert. The existing setup does not suppress alerts for these machines when they are put into maintenance mode, and we are still receiving the PD alerts.

This PR makes SetMaintenance::Enable also write ExcludeFromStateMachineSla — matching what the admin-cli InternalMaintenance template has been doing, so state_sla() short-circuits to no_sla() and a host in maintenance stops contributing to stuck-instance alerts regardless of which state or substate it's in.

Scope

The motivating case is Assigned-substate machines paging via stuckInstanceCritical, but the fix is applied at the state_sla() layer, which means it applies to all machine state-SLA trackingm, not only the Assigned family.

  • Alerts : stuckInstanceCritical and manyStuckInstancesCritical already filter on {state="assigned"}, so the on-call behavior is exactly what the bug report asked for.
  • Grafana / observability : panels that query forge_machines_per_state_above_sla with broader filters (e.g. state!="assigned") will also stop counting a machine while it's in maintenance.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Signed-off-by: Krishna Dhulipala <kdhulipala@nvidia.com>
Signed-off-by: Krishna Dhulipala <kdhulipala@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 19, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kdhulipala-wq kdhulipala-wq marked this pull request as ready for review May 20, 2026 19:41
@kdhulipala-wq kdhulipala-wq requested review from a team and Coco-Ben as code owners May 20, 2026 19:41
classifications: vec![
health_report::HealthAlertClassification::prevent_allocations(),
health_report::HealthAlertClassification::suppress_external_alerting(),
health_report::HealthAlertClassification::exclude_from_state_machine_sla(),
Copy link
Copy Markdown
Contributor

@krish-nvidia krish-nvidia May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just semantics, but this will also make time_in_state_above_sla false for non-assigned machines. There’s a Grafana dashboard panel that uses forge_machines_per_state_above_sla{fresh="true", state!="assigned"}, so putting one of those machines into maintenance mode will now remove it from that view too.

I think that’s expected behavior, but maybe update the PR description/comments to call out that this applies to all machine state SLA tracking, not just assigned machines. The unit test also doesn’t create an instance, so it’s already validating this broader behavior 😄

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

@ajf
Copy link
Copy Markdown
Collaborator

ajf commented May 26, 2026

/ok to test fe74b65

@ajf ajf enabled auto-merge (squash) May 26, 2026 21:22
@ajf ajf merged commit 6b6db15 into NVIDIA:main May 26, 2026
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants