[Feature]: Add remediation-enabled overlay profile for clusters with cloud-provider credentials #1014

### Prerequisites

- [x] I searched existing issues

### Feature Summary

Add an opt-in overlay profile that enables NVSentinel's full remediation path (persistent datastore + automatic node quarantine/drain/reboot/termination) for clusters where operators have provisioned the required cloud-provider credentials. The base recipe stays detection-only.

### Problem/Use Case

AICR inherits NVSentinel's upstream monitoring-only defaults (v1.3.0). Health collection is enabled out of the box (gpuHealthMonitor, syslogHealthMonitor, labeler, metadataCollector, platformConnector), but the datastore and remediation chain (mongodbStore, janitor, janitorProvider, faultQuarantine, nodeDrainer, faultRemediation) are off.

This is intentional upstream behavior: AICR isn't explicitly disabling remediation, it's inheriting the safe default. The practical effect, however, is that when a GPU node needs a reboot or replacement, operators must intervene manually. For a production GPU cluster, that is a suboptimal customer experience.

Flipping these on in the base recipe is not safe today for two reasons:
- janitor can reboot or terminate nodes and requires cloud-provider credentials that may not be present on all clusters.
- Enabling mongodbStore pulls in bitnamilegacy/mongodb:8.0.3-debian-12-r0, which has no linux/arm64 manifest and breaks ARM clusters.


### Proposed Solution

Introduce a named remediation overlay profile (e.g. nvsentinel-remediation) that users opt into explicitly via the AICR overlay mechanism. The profile would:

- Enable mongodbStore, janitor, janitorProvider, faultQuarantine, nodeDrainer, and faultRemediation
- Document required cloud-provider credential prerequisites
- Be explicitly unsupported on ARM clusters until the upstream NVSentinel ARM/MongoDB issue is resolved (tracked separately in the NVSentinel repo)

The base recipe itself remains detection-only and is unchanged.
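
To illustrate the intended behavior, here is a minimal Python sketch of an overlay being merged over detection-only defaults. The component names are taken from this issue; the `{"enabled": bool}` layout and the helper names (`base_values`, `remediation_overlay`, `apply_overlay`) are hypothetical and do not reflect the actual AICR overlay mechanism or the NVSentinel chart schema.

```python
from copy import deepcopy

# Component names come from this issue. The {"enabled": bool} layout is an
# assumption for illustration, not the real NVSentinel values schema.
DETECTION_COMPONENTS = [
    "gpuHealthMonitor", "syslogHealthMonitor", "labeler",
    "metadataCollector", "platformConnector",
]
REMEDIATION_COMPONENTS = [
    "mongodbStore", "janitor", "janitorProvider",
    "faultQuarantine", "nodeDrainer", "faultRemediation",
]


def base_values():
    """Detection-only defaults: health collection on, remediation off."""
    values = {name: {"enabled": True} for name in DETECTION_COMPONENTS}
    values.update({name: {"enabled": False} for name in REMEDIATION_COMPONENTS})
    return values


def remediation_overlay():
    """Hypothetical opt-in overlay: only flips the remediation chain on."""
    return {name: {"enabled": True} for name in REMEDIATION_COMPONENTS}


def apply_overlay(base, overlay):
    """Recursively merge overlay into a copy of base; base is not mutated."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

Because the merge works on a copy, the base recipe stays untouched and a cluster only gets remediation when the overlay is applied explicitly.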

### Success Criteria

- A named overlay profile enables the full NVSentinel remediation path
- Docs enumerate required cloud-provider credential prerequisites
- Profile validation rejects ARM cluster targets until ARM support is confirmed upstream
- KWOK e2e smoke test validates the overlay renders without error
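
The ARM-rejection criterion could look roughly like the following Python sketch. It assumes validation receives the set of node architectures in the target cluster (for example, values of the `kubernetes.io/arch` node label); the function and exception names are hypothetical and not part of the existing `aicr` CLI.

```python
# Architectures the remediation profile cannot run on today, because the
# MongoDB image named in this issue has no linux/arm64 manifest.
UNSUPPORTED_ARCHS = {"arm64"}


class ProfileValidationError(ValueError):
    """Raised when the remediation profile targets an unsupported cluster."""


def validate_remediation_profile(node_archs):
    """Reject targets that include any unsupported node architecture.

    node_archs: iterable of architecture strings, assumed to follow the
    kubernetes.io/arch label convention (e.g. "amd64", "arm64").
    """
    archs = {arch.strip().lower() for arch in node_archs}
    blocked = sorted(archs & UNSUPPORTED_ARCHS)
    if blocked:
        raise ProfileValidationError(
            "remediation profile is unsupported on: " + ", ".join(blocked)
        )
```

A mixed cluster is rejected as well, since the sketch has no way to know where the datastore pod would be scheduled.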

### Alternatives Considered

_No response_

### Component

CLI (aicr)

### Priority

Nice to have

### Compatibility / Breaking Changes

_No response_

### Operational Considerations

_No response_

### Are you willing to contribute?

_No response_
