diff --git a/README.md b/README.md
index c04b735f..cd261e2b 100644
--- a/README.md
+++ b/README.md
@@ -129,7 +129,7 @@ The autopilot uses a **"Patched Baseline"** approach:
 - **Convention over Configuration**: Opinionated defaults, customizable when needed
 
 **Three-Tier Management:**
-1. **Always-On**: Critical baseline configurations (MachineConfig, NodeHealthCheck, Kubelet settings)
+1. **Always-On**: Critical baseline configurations (MachineConfig, Kubelet settings)
 2. **Context-Aware**: Activated based on conditions (KubeDescheduler, CPU Manager)
 3. **Advanced**: Specialized features (VFIO, USB passthrough, AAQ operator)
 
diff --git a/assets/active/metadata.yaml b/assets/active/metadata.yaml
index 4e5bddbb..99604023 100644
--- a/assets/active/metadata.yaml
+++ b/assets/active/metadata.yaml
@@ -76,15 +76,6 @@ assets:
     component: KubeletConfig
     reconcile_order: 1
 
-  # Phase 1: Always installed - NodeHealthCheck
-  - name: node-health-check
-    path: active/node-health/standard-remediation.yaml
-    phase: 1
-    install: always
-    component: NodeHealthCheck
-    reconcile_order: 1
-    conditions: []
-
   # Phase 1: Optional Operators (opt-in for clusters with CRDs)
   - name: mtv-operator
     path: active/operators/mtv.yaml.tpl
diff --git a/assets/active/node-health/OWNERS b/assets/active/node-health/OWNERS
deleted file mode 100644
index 1b5ef937..00000000
--- a/assets/active/node-health/OWNERS
+++ /dev/null
@@ -1,4 +0,0 @@
-approvers:
-  - dominikholler
-reviewers:
-  - dominikholler
diff --git a/assets/active/node-health/standard-remediation.yaml b/assets/active/node-health/standard-remediation.yaml
deleted file mode 100644
index 6231a3f7..00000000
--- a/assets/active/node-health/standard-remediation.yaml
+++ /dev/null
@@ -1,18 +0,0 @@
-apiVersion: remediation.medik8s.io/v1alpha1
-kind: NodeHealthCheck
-metadata:
-  name: virt-node-health-check
-  namespace: openshift-operators
-spec:
-  minHealthy: 51%
-  selector:
-    matchExpressions:
-      - key: node-role.kubernetes.io/worker
-        operator: Exists
-  unhealthyConditions:
-    - duration: 5m
-      status: "False"
-      type: Ready
-    - duration: 5m
-      status: Unknown
-      type: Ready
diff --git a/assets/tombstones/v0-cleanup/node-health-check.yaml b/assets/tombstones/v0-cleanup/node-health-check.yaml
new file mode 100644
index 00000000..daa4cb73
--- /dev/null
+++ b/assets/tombstones/v0-cleanup/node-health-check.yaml
@@ -0,0 +1,7 @@
+apiVersion: remediation.medik8s.io/v1alpha1
+kind: NodeHealthCheck
+metadata:
+  name: virt-node-health-check
+  namespace: openshift-operators
+  labels:
+    platform.kubevirt.io/managed-by: virt-platform-autopilot
diff --git a/cmd/rbac-gen/README.md b/cmd/rbac-gen/README.md
index 07315edc..e33641ad 100644
--- a/cmd/rbac-gen/README.md
+++ b/cmd/rbac-gen/README.md
@@ -127,7 +127,7 @@ Templates use Go's `text/template` syntax. The generator replaces all `{{ ... }}
 ### Resource Pluralization
 The generator pluralizes resource kinds using simple heuristics:
 - Standard: `Example` → `examples`
-- Special cases: `MachineConfig` → `machineconfigs`, `NodeHealthCheck` → `nodehealthchecks`
+- Special cases: `MachineConfig` → `machineconfigs`
 
 ### Deduplication
 Resources are deduplicated by `apiVersion/kind` to avoid duplicate RBAC rules even if the same resource type appears in multiple assets.
diff --git a/config/rbac/role.yaml b/config/rbac/role.yaml
index 02feafa0..56b0a7b5 100644
--- a/config/rbac/role.yaml
+++ b/config/rbac/role.yaml
@@ -187,13 +187,14 @@ rules:
       - patch
       - update
       - watch
-  # NodeHealthCheck
+  # NodeHealthCheck (includes tombstone cleanup)
   - apiGroups:
       - remediation.medik8s.io
     resources:
       - nodehealthchecks
     verbs:
       - create
+      - delete
       - get
       - list
       - patch
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index 77498180..bd7ff7a1 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -66,12 +66,12 @@ Only the named assets are considered for reconciliation. All other assets — in
 
 ```yaml
 annotations:
-  platform.kubevirt.io/autopilot: "swap-enable,descheduler-loadaware,node-health-check"
+  platform.kubevirt.io/autopilot: "swap-enable,descheduler-loadaware"
 ```
 
 ```bash
 kubectl annotate hyperconverged kubevirt-hyperconverged -n openshift-cnv \
-  "platform.kubevirt.io/autopilot=swap-enable,descheduler-loadaware,node-health-check"
+  "platform.kubevirt.io/autopilot=swap-enable,descheduler-loadaware"
 ```
 
 Asset names correspond to the `name` field in `assets/active/metadata.yaml`. The current set includes:
@@ -84,7 +84,6 @@ Asset names correspond to the `name` field in `assets/active/metadata.yaml`. The
 | `pci-passthrough` | | MachineConfig | Opt-in: hardware + annotation condition |
 | `kubelet-perf-settings` | | KubeletConfig | Always-on baseline |
 | `kubelet-cpu-manager` | | KubeletConfig | Opt-in: CPUManager feature gate |
-| `node-health-check` | | NodeHealthCheck | Always-on baseline |
 | `descheduler-loadaware` | | KubeDescheduler | Soft dependency on KubeDescheduler CRD |
 | `monitoring-ui-plugin` | | UIPlugin | Soft dependency on COO CRD; enables Perses dashboards in the OpenShift console |
 | `mtv-operator` | | ForkliftController | Opt-in: annotation condition |
@@ -120,7 +119,6 @@ The autopilot manages resources across three tiers based on criticality and acti
 
 Critical baseline configurations applied to all clusters:
 
-- **NodeHealthCheck**: Automatic node remediation for failed hosts
 - **MachineConfig**: OS-level optimizations
   - Swap optimization for memory management
   - NUMA topology awareness
@@ -479,7 +477,6 @@ virt-platform-autopilot/
 │   │   ├── machine-config/   # OS-level configs
 │   │   ├── kubelet/          # Kubelet settings
 │   │   ├── descheduler/      # KubeDescheduler
-│   │   ├── node-health/      # NodeHealthCheck
 │   │   ├── observability/    # PrometheusRules
 │   │   ├── operators/        # Third-party operator CRs (UIPlugin, MetalLB, MTV…)
 │   │   └── metadata.yaml     # Asset catalog
diff --git a/docs/adding-assets.md b/docs/adding-assets.md
index c066fa9c..caa03913 100644
--- a/docs/adding-assets.md
+++ b/docs/adding-assets.md
@@ -25,7 +25,6 @@ assets/active/
 ├── hco/                 # HyperConverged resource (only one, order: 0)
 ├── machine-config/      # MachineConfig resources
 ├── kubelet/             # KubeletConfig resources
-├── node-health/         # NodeHealthCheck resources
 ├── descheduler/         # Descheduler resources
 └── operators/           # Third-party operator CRs
 
@@ -98,32 +97,18 @@ This scans all assets and generates the necessary ClusterRole permissions.
 
 For resources that don't need dynamic values:
 
-**File:** `assets/active/node-health/standard-remediation.yaml`
+**File:** `assets/active/operators/monitoring-uiplugin.yaml`
 
 ```yaml
-apiVersion: remediation.medik8s.io/v1alpha1
-kind: NodeHealthCheck
+apiVersion: observability.openshift.io/v1alpha1
+kind: UIPlugin
 metadata:
-  name: virt-node-health-check
-  namespace: openshift-operators
+  name: monitoring
 spec:
-  minHealthy: 51%
-  remediationTemplate:
-    apiVersion: self-node-remediation.medik8s.io/v1alpha1
-    kind: SelfNodeRemediationTemplate
-    name: self-node-remediation-automatic-strategy-template
-    namespace: openshift-operators
-  selector:
-    matchExpressions:
-      - key: node-role.kubernetes.io/worker
-        operator: Exists
-  unhealthyConditions:
-    - duration: 5m
-      status: "False"
-      type: Ready
-    - duration: 5m
-      status: Unknown
-      type: Ready
+  type: Monitoring
+  monitoring:
+    perses:
+      enabled: true
 ```
 
 No templating needed - this is applied as-is.
@@ -321,14 +306,13 @@ The `assets/active/metadata.yaml` catalog defines all managed assets.
 - `HyperConverged`
 - `MachineConfig`
 - `KubeletConfig`
-- `NodeHealthCheck`
 - `KubeDescheduler`
 - `ForkliftController`
 - `MetalLB`
 
 **reconcile_order**: Processing order (lower numbers first).
 - `0`: HCO only (must be first - serves as RenderContext source)
-- `1-9`: Critical baseline (MachineConfig, Kubelet, NodeHealthCheck)
+- `1-9`: Critical baseline (MachineConfig, Kubelet)
 - `10-19`: Scheduling and placement (Descheduler)
 - `20+`: Optional operators and advanced features
 
@@ -610,7 +594,7 @@ rules:
 ### 2. Set Appropriate Reconcile Order
 
 - `0`: HCO only
-- `1-9`: Infrastructure (MachineConfig, Kubelet, NodeHealthCheck)
+- `1-9`: Infrastructure (MachineConfig, Kubelet)
 - `10-19`: Scheduling and placement
 - `20+`: Optional features
 
@@ -810,12 +794,6 @@ createdAt: {{ $timestamp }}
 
 Production-ready HCO configuration with opinionated defaults. Must have `reconcile_order: 0`.
 
-### NodeHealthCheck
-
-**File:** `assets/active/node-health/standard-remediation.yaml`
-
-Simple static YAML - no templating needed.
-
 ### Descheduler (Conditional)
 
 **File:** `assets/active/descheduler/recommended.yaml.tpl`
diff --git a/docs/debug-endpoints.md b/docs/debug-endpoints.md
index 068588ae..9d0679af 100644
--- a/docs/debug-endpoints.md
+++ b/docs/debug-endpoints.md
@@ -327,14 +327,13 @@ swap-enable INCLUDED MachineConfig -
 pci-passthrough          EXCLUDED   MachineConfig        Conditions not met
 numa-topology            EXCLUDED   MachineConfig        Conditions not met
 kubelet-perf-settings    INCLUDED   KubeletConfig        -
-node-health-check        INCLUDED   NodeHealthCheck      -
 mtv-operator             EXCLUDED   ForkliftController   Conditions not met
 metallb-operator         EXCLUDED   MetalLB              Conditions not met
 observability-operator   EXCLUDED   UIPlugin             Conditions not met
 descheduler-loadaware    FILTERED   KubeDescheduler      Root exclusion
 kubelet-cpu-manager      EXCLUDED   KubeletConfig        Conditions not met
 ----------------------------------------------------------------------------------------------------
-Summary: 3 included, 7 excluded, 1 filtered, 0 errors
+Summary: 2 included, 7 excluded, 1 filtered, 0 errors
 ```
 
 ## Use Cases
diff --git a/docs/runbooks/VirtPlatformDependencyMissing.md b/docs/runbooks/VirtPlatformDependencyMissing.md
index 97a67434..864e249b 100644
--- a/docs/runbooks/VirtPlatformDependencyMissing.md
+++ b/docs/runbooks/VirtPlatformDependencyMissing.md
@@ -74,49 +74,7 @@ spec:
 EOF
 ```
 
-### 2. NodeHealthCheck (Node Auto-Remediation)
-
-**CRD:** `nodehealthchecks.remediation.medik8s.io`
-**Provided by:** `node-healthcheck-operator` (via MediK8s)
-
-**Features affected:**
-- Automatic node health monitoring
-- Self-node remediation integration
-- Fence-agents remediation integration
-- Unhealthy node detection
-
-**Check if installed:**
-```bash
-# Check CRD exists
-kubectl get crd nodehealthchecks.remediation.medik8s.io
-
-# Check operator is installed
-oc get csv -n openshift-operators | grep node-healthcheck
-
-# Check NHC instance
-kubectl get nodehealthcheck -A
-```
-
-**Install if missing:**
-```bash
-# Install via OperatorHub (OpenShift Console)
-# Search for "Node Health Check Operator"
-# Or via CLI:
-cat <