[Feature]: Add remediation-enabled overlay profile for clusters with cloud-provider credentials #1014

@atif1996

Prerequisites

  • I searched existing issues

Feature Summary

Add an opt-in overlay profile that enables NVSentinel's full remediation path (persistent datastore + automatic node quarantine/drain/reboot/termination) for clusters where operators have provisioned the required cloud-provider credentials. The base recipe stays detection-only.

Problem/Use Case

AICR inherits NVSentinel's upstream monitoring-only defaults (v1.3.0). Health collection is enabled out of the box (gpuHealthMonitor, syslogHealthMonitor, labeler, metadataCollector, platformConnector), but the datastore and remediation chain (mongodbStore, janitor, janitorProvider, faultQuarantine, nodeDrainer, faultRemediation) are off.

This is intentional upstream behavior: AICR doesn't explicitly disable remediation; it inherits the safe default. The practical effect is that when a GPU node needs a reboot or replacement, operators must intervene manually. For a production GPU cluster, that manual step is a poor operator experience.

Flipping these on in the base recipe is not safe today for two reasons:

  • janitor can reboot or terminate nodes and requires cloud-provider credentials that may not be present on all clusters.
  • Enabling mongodbStore pulls in bitnamilegacy/mongodb:8.0.3-debian-12-r0, which has no linux/arm64 manifest and breaks ARM clusters.
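The ARM concern above can be checked mechanically. A minimal sketch, assuming the input is the parsed JSON of a multi-arch image index (the structure tools such as `docker manifest inspect` print, with a `manifests[].platform` entry per platform). The function name and the sample data are illustrative only and do not reproduce the real manifest of any published image:

```python
from typing import Any


def supports_platform(image_index: dict[str, Any], os_name: str, arch: str) -> bool:
    """Return True if the image index lists a manifest for os_name/arch."""
    for entry in image_index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return True
    return False


# Illustrative index only (assumption): a single linux/amd64 entry.
SAMPLE_INDEX = {
    "schemaVersion": 2,
    "manifests": [
        {"platform": {"os": "linux", "architecture": "amd64"}},
    ],
}

if __name__ == "__main__":
    print(supports_platform(SAMPLE_INDEX, "linux", "arm64"))
```

A check like this could gate the profile on ARM once an arm64-capable MongoDB image is available upstream, rather than hard-coding the restriction.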

Proposed Solution

Introduce a named remediation overlay profile (e.g. nvsentinel-remediation) that users opt into explicitly via the AICR overlay mechanism. The profile would:

  • Enable mongodbStore, janitor, janitorProvider, faultQuarantine, nodeDrainer, and faultRemediation
  • Document required cloud-provider credential prerequisites
  • Be explicitly unsupported on ARM clusters until the upstream NVSentinel ARM/MongoDB issue is resolved (tracked separately in the NVSentinel repo)

The base recipe remains detection-only and is unchanged.
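The overlay idea above can be sketched as a deep merge of an override map onto the base values. This is a rough illustration, not AICR's actual overlay mechanism: the `<component>.enabled` key layout is an assumption (the real NVSentinel chart may nest these keys differently), and `base_values`, `remediation_overlay`, and `deep_merge` are hypothetical helpers. Only the component names come from this issue.

```python
import copy

# Component names as listed in this issue.
DETECTION = ["gpuHealthMonitor", "syslogHealthMonitor", "labeler",
             "metadataCollector", "platformConnector"]
REMEDIATION = ["mongodbStore", "janitor", "janitorProvider",
               "faultQuarantine", "nodeDrainer", "faultRemediation"]


def base_values() -> dict:
    """Detection-only defaults (assumed '<component>.enabled' layout)."""
    values = {name: {"enabled": True} for name in DETECTION}
    values.update({name: {"enabled": False} for name in REMEDIATION})
    return values


def remediation_overlay() -> dict:
    """Hypothetical overlay: only the keys it changes."""
    return {name: {"enabled": True} for name in REMEDIATION}


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict with overlay applied on top of base; base is not mutated."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    rendered = deep_merge(base_values(), remediation_overlay())
    print(sorted(k for k, v in rendered.items() if v["enabled"]))
```

Because the overlay carries only the keys it flips, the base recipe stays untouched and any future change to detection defaults flows through without editing the profile.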

Success Criteria

  • A named overlay profile enables the full NVSentinel remediation path
  • Docs enumerate required cloud-provider credential prerequisites
  • Profile validation rejects ARM cluster targets until ARM support is confirmed upstream
  • KWOK e2e smoke test validates the overlay renders without error
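The ARM-rejection criterion above could look roughly like the following. This is a sketch under stated assumptions: the profile name `nvsentinel-remediation` is the example name from this issue, while `UNSUPPORTED_ARCHS`, `ProfileValidationError`, and `validate_profile` are hypothetical and not part of the existing aicr CLI. Architectures are assumed to arrive as strings such as the values of the `kubernetes.io/arch` node label.

```python
from typing import Iterable

# Hypothetical rule table: profile name -> architectures it must reject.
UNSUPPORTED_ARCHS = {"nvsentinel-remediation": {"arm64"}}

# Common spellings mapped to the names used by the kubernetes.io/arch label.
ARCH_ALIASES = {"aarch64": "arm64", "x86_64": "amd64"}


class ProfileValidationError(ValueError):
    """Raised when a profile is requested for an unsupported target."""


def normalize_arch(arch: str) -> str:
    arch = arch.strip().lower()
    return ARCH_ALIASES.get(arch, arch)


def validate_profile(profile: str, node_archs: Iterable[str]) -> None:
    """Raise ProfileValidationError if any target architecture is blocked for the profile."""
    blocked = UNSUPPORTED_ARCHS.get(profile, set())
    offending = sorted({normalize_arch(a) for a in node_archs} & blocked)
    if offending:
        raise ProfileValidationError(
            f"profile {profile!r} is not supported on: {', '.join(offending)}"
        )


if __name__ == "__main__":
    validate_profile("nvsentinel-remediation", ["amd64"])  # passes silently
    try:
        validate_profile("nvsentinel-remediation", ["amd64", "aarch64"])
    except ProfileValidationError as err:
        print(err)
```

Keeping the restriction in a table rather than in branching logic means lifting it later is a one-line change once ARM support is confirmed upstream.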

Alternatives Considered

No response

Component

CLI (aicr)

Priority

Nice to have

Compatibility / Breaking Changes

No response

Operational Considerations

No response

Are you willing to contribute?

No response
