5 changes: 3 additions & 2 deletions charts/kubex-automation-engine/README.md
@@ -36,8 +36,8 @@ The Helm chart supports both Helm-managed configuration and manually managed cus

Important:

- The Helm-managed `scope` and `policy.policies` values preserve the existing values-driven flow from `values-edit.yaml` by generating `AutomationStrategy` and `ClusterProactivePolicy`, but those CRs can also be created and managed independently of Helm
- `ProactivePolicy`, `StaticPolicy`, `ClusterStaticPolicy`, and `ClusterAutomationStrategy` are supported by the controller but are managed as separate CR manifests today
- The Helm-managed `scope` and `policy.policies` values preserve the existing values-driven flow from `values-edit.yaml` by generating `ClusterAutomationStrategy` and `ClusterProactivePolicy`, but those CRs can also be created and managed independently of Helm
- `ProactivePolicy`, `StaticPolicy`, `ClusterStaticPolicy`, and namespaced `AutomationStrategy` are supported by the controller but are managed as separate CR manifests today

## Core Components

@@ -101,6 +101,7 @@ This guide covers:
| **[Global Configuration Reference](./docs/Global-Configuration.md)** | Field-by-field reference for the `GlobalConfiguration` custom resource |
| **[Policy Configuration](./docs/Policy-Configuration.md)** | Configure strategies, policy scope, precedence, and Helm-managed policy generation |
| **[Policy Evaluation Reference](./docs/Policy-Evaluation.md)** | Policy type precedence configuration via the `PolicyEvaluation` singleton |
| **[GPU Sharing with KAI](./docs/GPU-Sharing-with-KAI.md)** | Configure KAI-backed GPU sharing, rebalancing, and early consolidation |
| **[Apply Updates](./docs/Getting-Started.md#apply-configuration-updates)** | Re-run `helm upgrade` after configuration changes |

## Advanced Topics
@@ -1,5 +1,7 @@
# Automation Strategies

> Experimental: GPU/KAI-related fields in this resource are subject to breaking changes. When using them, set `spec.experimental.gpuKaiContract: v1alpha1-2026-04`.

`AutomationStrategy` defines how resizing is allowed to happen within a namespace.

Use it when a team owns its own namespace and should manage resize behavior locally.
@@ -1,5 +1,7 @@
# Cluster Automation Strategies

> Experimental: GPU/KAI-related fields in this resource are subject to breaking changes. When using them, set `spec.experimental.gpuKaiContract: v1alpha1-2026-04`.

`ClusterAutomationStrategy` defines how resizing is allowed to happen for cluster-scoped policy flows.

Use it when a platform team wants one reusable resize behavior that can be referenced by `ClusterProactivePolicy` and `ClusterStaticPolicy` across multiple namespaces.
@@ -151,6 +151,7 @@ Use [Global Configuration Reference](./Global-Configuration.md) for the CR field
| `globalConfiguration.webhookProbe.resources` | `{}` | Resource requests and limits for the dry-run webhook probe container |
| `globalConfiguration.webhookProbe.podSecurityContext` | `{}` | Pod security context for the dry-run webhook probe Pod |
| `globalConfiguration.webhookProbe.securityContext` | `{}` | Container security context for the dry-run webhook probe container |
| `experimental.gpuKaiContract` | `v1alpha1-2026-04` | Required acknowledgement token for experimental GPU/KAI CR fields rendered by the chart |

## Helm-Managed Policy Values

175 changes: 175 additions & 0 deletions charts/kubex-automation-engine/docs/GPU-Sharing-with-KAI.md
@@ -0,0 +1,175 @@
# GPU Sharing with KAI

This guide shows how to configure GPU sharing with KAI and Kubex Automation Engine.

Tested with KAI `v0.12.16`.

> [!IMPORTANT]
> GPU/KAI fields and related custom resources are experimental and subject to breaking changes. Set `spec.experimental.gpuKaiContract: v1alpha1-2026-04` on GPU/KAI resources.

## Prerequisites

- KAI is already installed in the cluster
- `kubex-crds` and `kubex-automation-engine` are already installed
- Prometheus is available for GPU utilization metrics if you want to use `GpuRebalancingPolicy`

This guide works with either:

- a new KAI installation
- an existing KAI installation

For existing KAI-managed workloads, Kubex Automation Engine can update the `gpu-fraction` annotation without replacing the existing `kai.scheduler/queue` label.

## Starter Example

The following example creates:

- an `AutomationStrategy` for KAI-enabled workloads in namespace `ml-team-a`
- a `StaticPolicy` that sets an initial shared GPU request for matching `Deployment` workloads
- a `GpuRebalancingPolicy` that adjusts that shared GPU request based on Prometheus GPU metrics

Both policies target `Deployment` workloads in a specific namespace that carry `nvidia.com/gpu.present: "true"`.

```yaml
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: AutomationStrategy
metadata:
  name: kai-gpu-sharing
  namespace: ml-team-a
spec:
  experimental:
    gpuKaiContract: v1alpha1-2026-04
  enablement:
    gpu:
      overrideScheduler: "kai"
      requests:
        downsize: true
        upsize: true
        setFromUnspecified: false
  kai:
    queue: kubex-unlimited-gpu-queue
    setQueueWhenSpecified: false
  inPlaceResize:
    enabled: false
  podEviction:
    enabled: true
---
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: StaticPolicy
metadata:
  name: kai-gpu-sharing-baseline
  namespace: ml-team-a
spec:
  scope:
    labelSelector:
      matchLabels:
        nvidia.com/gpu.present: "true"
    workloadTypes:
      - Deployment
  resources:
    containers:
      "*":
        requests:
          gpu: "0.25"
  automationStrategyRef:
    name: kai-gpu-sharing
---
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: GpuRebalancingPolicy
metadata:
  name: kai-gpu-sharing-rebalancing
  namespace: ml-team-a
spec:
  experimental:
    gpuKaiContract: v1alpha1-2026-04
  scope:
    labelSelector:
      matchLabels:
        nvidia.com/gpu.present: "true"
    workloadTypes:
      - Deployment
  minPodMetricsAge: 15m
  metrics:
    compute:
      upsize:
        thresholdPercent: 125
        metricsWindow: 10m
        headroomPercent: 20
        maxPercent: 200
      scaleBack:
        thresholdPercent: 60
        metricsWindow: 10m
        headroomPercent: 20
      prometheus:
        metric: kubex_gpu_container_compute_utilization_percent
        namespaceLabel: namespace
        podLabel: pod
        containerLabel: container
    memory:
      upsize:
        thresholdPercent: 125
        metricsWindow: 10m
        headroomPercent: 20
        maxPercent: 200
      scaleBack:
        thresholdPercent: 60
        metricsWindow: 10m
        headroomPercent: 20
      prometheus:
        metric: kubex_gpu_container_memory_utilization_percent
        namespaceLabel: namespace
        podLabel: pod
        containerLabel: container
  automationStrategyRef:
    name: kai-gpu-sharing
```

## Automation Strategy Notes

For KAI-enabled workloads, start with `spec.inPlaceResize.enabled: false`.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: Medium]

Location: charts/kubex-automation-engine/docs/GPU-Sharing-with-KAI.md:130 (the setQueueWhenSpecified description)

Issue: The GPU-Sharing-with-KAI.md guide states:

If you want queue assignment to be done via Kubex, set spec.kai.setQueueWhenSpecified: false in your AutomationStrategy.

This is likely a copy/paste documentation error. Setting the field to false means Kubex will not overwrite an existing queue label — the opposite of enabling Kubex to do queue assignment. The correct instruction to allow Kubex to own queue assignment would be setQueueWhenSpecified: true.

Suggested fix:

If you want Kubex to manage queue assignment (overwriting an existing `kai.scheduler/queue` label), set `spec.kai.setQueueWhenSpecified: true` in your AutomationStrategy.

- Eviction-based resize is the safer path today for KAI-enabled workloads.
- In-place resizing for KAI-enabled workloads can be experimented with, but it is currently unstable.

## Existing KAI Installations

For workloads that are already scheduled through KAI:

- keep the existing `kai.scheduler/queue` label on the workload template
- let Kubex Automation Engine update `gpu-fraction` as policies are applied

That allows Kubex Automation Engine to participate in GPU sharing without taking over queue assignment.

If you want queue assignment to be done via Kubex, set `spec.kai.setQueueWhenSpecified: false` in your AutomationStrategy.
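
The interaction between an existing `kai.scheduler/queue` label and `spec.kai.setQueueWhenSpecified` can be sketched in Python. This is a hypothetical illustration only. It assumes the reading given in the review comment on this file (that `false` leaves an existing label untouched and `true` overwrites it), and it further assumes that the strategy's `spec.kai.queue` is applied when the workload has no queue label. Neither assumption has been checked against the controller, and `resolve_queue_label` is an invented helper, not part of Kubex.

```python
QUEUE_LABEL = "kai.scheduler/queue"


def resolve_queue_label(labels: dict, strategy_queue: str,
                        set_queue_when_specified: bool) -> dict:
    """Return a copy of the pod-template labels with the queue label resolved.

    Assumed semantics (not verified against the controller):
      - label absent                -> the strategy's queue is applied
      - label present, flag False   -> the existing label is kept
      - label present, flag True    -> the label is overwritten
    """
    result = dict(labels)  # never mutate the caller's labels
    if QUEUE_LABEL not in result or set_queue_when_specified:
        result[QUEUE_LABEL] = strategy_queue
    return result


# An existing KAI workload keeps its queue when the flag is False.
existing = {QUEUE_LABEL: "team-a-queue"}
print(resolve_queue_label(existing, "kubex-unlimited-gpu-queue", False))
```

Under these assumptions, the starter example (`setQueueWhenSpecified: false`) would leave `team-a-queue` in place and only assign `kubex-unlimited-gpu-queue` to workloads that carry no queue label.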

## GPU Node Consolidation

`GpuConsolidationPolicy` can be used to consolidate KAI GPU workloads onto fewer GPU nodes.

Example targeting a specific worker pool:

```yaml
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: GpuConsolidationPolicy
metadata:
  name: kai-gpu-workers-a
spec:
  experimental:
    gpuKaiContract: v1alpha1-2026-04
  nodeSelector:
    matchLabels:
      nodepool: gpu-workers-a
  utilizationThresholdPercent: 70
  requeueAfter: 2m
```

## Consolidation Limitations

GPU node consolidation is at a very early stage and has known limitations.

- It assumes pods will be schedulable on other nodes if they fit by GPU fraction.
- It does not yet fully model all other scheduler constraints.
- That can lead to frequent evictions when the controller chooses a node that looks drainable from GPU capacity alone but whose pods cannot actually be rescheduled cleanly.
- It may behave unpredictably with nodes that have multiple GPUs.

Use it carefully and start with a narrowly scoped worker pool.
2 changes: 2 additions & 0 deletions charts/kubex-automation-engine/docs/Global-Configuration.md
@@ -1,5 +1,7 @@
# Global Configuration

> Experimental: GPU/KAI-related fields in this resource are subject to breaking changes. When using them, set `spec.experimental.gpuKaiContract: v1alpha1-2026-04`.

`GlobalConfiguration` defines cluster-wide controller behavior that applies across strategies and policies.

Use it to control recommendation refresh timing, proactive rescans, heartbeat reporting, global automation switches, protected namespaces, and webhook health thresholds.
86 changes: 86 additions & 0 deletions charts/kubex-automation-engine/docs/gpu-consolidation-policy.md
@@ -0,0 +1,86 @@
# GPU Consolidation Policy

> Experimental: GPU/KAI fields and related custom resources are subject to breaking changes. Set `spec.experimental.gpuKaiContract: v1alpha1-2026-04`.

`GpuConsolidationPolicy` is a cluster-scoped controller that looks at scheduled pods carrying the `gpu-fraction` annotation and tries to consolidate them off an underutilized node.

## Behavior

- The controller scans all scheduled, non-terminal pods with `metadata.annotations["gpu-fraction"]`.
- `spec.nodeSelector` is required and uses standard Kubernetes label selector semantics.
- Each policy defines one compatibility pool. Create multiple policies when you need multiple compatible node pools.
- Only nodes selected by `spec.nodeSelector` are considered compatible for candidate selection and destination placement.
- Selected nodes are expected to be mutually compatible for GPU workload movement.
- Node GPU capacity is taken from `status.allocatable["nvidia.com/gpu"]`.
- Nodes with utilization below `spec.utilizationThresholdPercent` are candidates, but nodes with no GPU-fraction pods are ignored.
- Candidates are evaluated from most underutilized to least underutilized.
- A node is consolidated only when every GPU-fraction pod on that node can fit onto other non-empty GPU nodes without exceeding their allocatable capacity.
- The controller evicts all pods from the first drainable candidate node it finds in a reconcile loop.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-automation-engine/docs/gpu-consolidation-policy.md:18 (the node-drain behavior description)

Issue: The documentation correctly calls out that consolidation evicts all evictable pods from a selected node, including pods without workload owners (static pods). However, this behavior — draining ownerless/static pods — is a significant operational risk that is buried in a notes section. Ownerless pods will not be rescheduled after eviction, which can permanently remove cluster infrastructure components (e.g., a kube-proxy or node-local-dns static pod) with no automated recovery.

The GPU-Sharing-with-KAI.md guide does not cross-reference this limitation at all.

Suggested fix: Elevate this to a > [!WARNING] callout block near the top of the gpu-consolidation-policy.md behavior section, and add a cross-reference in GPU-Sharing-with-KAI.md under "Consolidation Limitations".

- Eviction is node-wide for a selected consolidation candidate: once a node is marked for consolidation, every evictable pod on that node is targeted, including pods without workload owners such as static pods.
- Reconciliation is policy-driven: the controller runs on `GpuConsolidationPolicy` changes and on the periodic timer from `spec.requeueAfter`.
- Pod and Node changes do not trigger immediate rescans.
- If no node can be fully drained, the controller records that outcome in status and waits for the next `spec.requeueAfter`.
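
The candidate-selection rules in the list above can be sketched in Python. This is an illustrative model, not the controller's code: `Node`, `fits_elsewhere`, and `pick_node_to_drain` are hypothetical names, utilization is assumed to be the sum of `gpu-fraction` values divided by allocatable GPUs, and the first-fit-decreasing placement is an assumption, since the document does not say how destination nodes are chosen. Like the behavior described, it looks at GPU fraction only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified view of a node matched by spec.nodeSelector."""
    name: str
    allocatable_gpus: float  # status.allocatable["nvidia.com/gpu"]
    pod_fractions: list = field(default_factory=list)  # one gpu-fraction per pod

    @property
    def used(self) -> float:
        return sum(self.pod_fractions)

    @property
    def utilization_percent(self) -> float:
        return 100.0 * self.used / self.allocatable_gpus


def fits_elsewhere(candidate: Node, nodes: list) -> bool:
    """True if every GPU-fraction pod on `candidate` fits on other non-empty nodes."""
    free = {
        n.name: n.allocatable_gpus - n.used
        for n in nodes
        if n is not candidate and n.pod_fractions  # empty nodes are not destinations
    }
    # Assumption: place the largest pods first, on the first node with room.
    for fraction in sorted(candidate.pod_fractions, reverse=True):
        target = next((name for name, cap in free.items() if cap >= fraction - 1e-9), None)
        if target is None:
            return False
        free[target] -= fraction
    return True


def pick_node_to_drain(nodes: list, threshold_percent: float):
    """Return the name of the first drainable candidate, or None."""
    candidates = [
        n for n in nodes
        if n.pod_fractions  # nodes with no GPU-fraction pods are ignored
        and n.allocatable_gpus > 0
        and n.utilization_percent < threshold_percent
    ]
    candidates.sort(key=lambda n: n.utilization_percent)  # most underutilized first
    for candidate in candidates:
        if fits_elsewhere(candidate, nodes):
            return candidate.name
    return None


nodes = [
    Node("a", 1.0, [0.25]),
    Node("b", 1.0, [0.5, 0.25]),
    Node("c", 1.0, []),
]
print(pick_node_to_drain(nodes, threshold_percent=70))  # a
```

In this example node `a` sits at 25% and its single pod fits into the 0.25 GPU still free on `b`, so `a` is selected. Node `c` is never a destination because it is empty. The sketch stops at selection; per the list above, the real drain then targets every evictable pod on the chosen node, not only GPU-fraction pods.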

## Examples

```yaml
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: GpuConsolidationPolicy
metadata:
  name: gpu-consolidation-pool-a
spec:
  experimental:
    gpuKaiContract: v1alpha1-2026-04
  nodeSelector:
    matchLabels:
      kubex.ai/gpu-pool: pool-a
  utilizationThresholdPercent: 75
  requeueAfter: 1m
```

Use one policy per compatibility pool:

```yaml
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: GpuConsolidationPolicy
metadata:
  name: gpu-consolidation-l40s
spec:
  experimental:
    gpuKaiContract: v1alpha1-2026-04
  nodeSelector:
    matchExpressions:
      - key: kubex.ai/gpu-pool
        operator: In
        values:
          - batch-l40s
      - key: accelerator.nvidia.com/class
        operator: In
        values:
          - l40s
  utilizationThresholdPercent: 70
  requeueAfter: 2m
---
apiVersion: rightsizing.kubex.ai/v1alpha1
kind: GpuConsolidationPolicy
metadata:
  name: gpu-consolidation-h100
spec:
  experimental:
    gpuKaiContract: v1alpha1-2026-04
  nodeSelector:
    matchLabels:
      kubex.ai/gpu-pool: training-h100
  utilizationThresholdPercent: 80
  requeueAfter: 1m
```

## Notes

- This policy is cluster-scoped only.
- `spec.nodeSelector` is the compatibility boundary for consolidation.
- It is self-contained and does not reference `AutomationStrategy`.
- Consolidation is based on GPU-fraction capacity only; it does not model CPU, memory, or scheduler affinity constraints.
- Consolidation drain behavior is not limited to GPU-fraction pods. After a node is selected, the node is drained by evicting all evictable pods on it, even when some of those pods do not have owners.
- If `spec.nodeSelector` matches no nodes, the policy reports `NoMatchingNodeSelector` and performs no evictions.
- If you need faster reaction to workload churn, lower `spec.requeueAfter`.