Prerequisites
Feature Summary
Add an opt-in overlay profile that enables NVSentinel's full remediation path (persistent datastore + automatic node quarantine/drain/reboot/termination) for clusters where operators have provisioned the required cloud-provider credentials. The base recipe stays detection-only.
Problem/Use Case
AICR inherits NVSentinel's upstream monitoring-only defaults (v1.3.0). Health collection is enabled out of the box (gpuHealthMonitor, syslogHealthMonitor, labeler, metadataCollector, platformConnector), but the datastore and remediation chain (mongodbStore, janitor, janitorProvider, faultQuarantine, nodeDrainer, faultRemediation) are off.
This is intentional upstream behavior: AICR does not explicitly disable remediation; it inherits the safe default. The practical effect, however, is that when a GPU node needs a reboot or replacement, operators must intervene manually, which is a poor experience on a production GPU cluster.
Flipping these on in the base recipe is not safe today for two reasons:
- janitor can reboot or terminate nodes and requires cloud-provider credentials that may not be present on all clusters.
- Enabling mongodbStore pulls in bitnamilegacy/mongodb:8.0.3-debian-12-r0, which has no linux/arm64 manifest and breaks ARM clusters.
Proposed Solution
Introduce a named remediation overlay profile (e.g. nvsentinel-remediation) that users opt into explicitly via the AICR overlay mechanism. The profile would:
- Enable mongodbStore, janitor, janitorProvider, faultQuarantine, nodeDrainer, and faultRemediation
- Document required cloud-provider credential prerequisites
- Be explicitly unsupported on ARM clusters until the upstream NVSentinel ARM/MongoDB issue is resolved (tracked separately in the NVSentinel repo)
The base recipe remains detection-only and is unchanged.
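
To illustrate the intended layering, here is a minimal Python sketch of how an opt-in overlay could be merged over the detection-only defaults. The component names come from this issue; the `enabled` key layout and the `base_values`, `remediation_overlay`, `deep_merge`, and `render` helpers are hypothetical and do not reflect the actual AICR overlay implementation or the NVSentinel chart schema.

```python
from copy import deepcopy

# Component names are taken from this issue.
DETECTION_COMPONENTS = [
    "gpuHealthMonitor", "syslogHealthMonitor", "labeler",
    "metadataCollector", "platformConnector",
]
REMEDIATION_COMPONENTS = [
    "mongodbStore", "janitor", "janitorProvider",
    "faultQuarantine", "nodeDrainer", "faultRemediation",
]
PROFILE_NAME = "nvsentinel-remediation"  # example name from the proposal


def base_values():
    """Detection-only defaults (assumed `enabled` flag per component)."""
    values = {name: {"enabled": True} for name in DETECTION_COMPONENTS}
    values.update({name: {"enabled": False} for name in REMEDIATION_COMPONENTS})
    return values


def remediation_overlay():
    """Overlay that only touches the datastore and remediation chain."""
    return {name: {"enabled": True} for name in REMEDIATION_COMPONENTS}


def deep_merge(base, overlay):
    """Return a new dict with overlay values layered over base."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def render(profile=None):
    """Render values; the base is untouched unless the profile is requested."""
    values = base_values()
    if profile is None:
        return values
    if profile != PROFILE_NAME:
        raise ValueError(f"unknown overlay profile: {profile}")
    return deep_merge(values, remediation_overlay())
```

Calling `render()` yields the unchanged detection-only values, while `render("nvsentinel-remediation")` turns on the six remediation components without altering the detection components.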
Success Criteria
- A named overlay profile enables the full NVSentinel remediation path
- Docs enumerate required cloud-provider credential prerequisites
- Profile validation rejects ARM cluster targets until ARM support is confirmed upstream
- KWOK e2e smoke test validates the overlay renders without error
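
The ARM rejection criterion could be checked along the following lines. This is only a sketch under stated assumptions: `validate_profile`, `ProfileValidationError`, and the `UNSUPPORTED_ARCHS` table are hypothetical names, and the architecture strings are assumed to match the values of the standard `kubernetes.io/arch` node label (for example `amd64`, `arm64`).

```python
# Hypothetical table of architectures each overlay profile cannot target.
UNSUPPORTED_ARCHS = {
    "nvsentinel-remediation": {"arm64"},
}


class ProfileValidationError(ValueError):
    """Raised when a profile is requested for an unsupported target."""


def validate_profile(profile, node_archs):
    """Reject the profile if any target node architecture is unsupported.

    node_archs: iterable of architecture strings, assumed to be taken
    from the kubernetes.io/arch label of the target nodes.
    """
    requested = {arch.strip().lower() for arch in node_archs}
    blocked = UNSUPPORTED_ARCHS.get(profile, set()) & requested
    if blocked:
        raise ProfileValidationError(
            f"profile {profile!r} is not supported on: {sorted(blocked)}"
        )
```

A cluster with mixed `amd64` and `arm64` nodes would be rejected under this sketch, which is the conservative choice while the MongoDB image lacks a linux/arm64 manifest.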
Alternatives Considered
No response
Component
CLI (aicr)
Priority
Nice to have
Compatibility / Breaking Changes
No response
Operational Considerations
No response
Are you willing to contribute?
No response