Sync kubex charts from automation-controller main @ 168df7d #107
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,175 @@ | ||
| # GPU Sharing with KAI | ||
|
|
||
| This guide shows how to configure GPU sharing with KAI and Kubex Automation Engine. | ||
|
|
||
| Tested with KAI `v0.12.16`. | ||
|
|
||
| > [!IMPORTANT] | ||
| > GPU/KAI fields and related custom resources are experimental and subject to breaking changes. Set `spec.experimental.gpuKaiContract: v1alpha1-2026-04` on GPU/KAI resources. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - KAI is already installed in the cluster | ||
| - `kubex-crds` and `kubex-automation-engine` are already installed | ||
| - Prometheus is available for GPU utilization metrics if you want to use `GpuRebalancingPolicy` | ||
|
|
||
| This guide works with either: | ||
|
|
||
| - a new KAI installation | ||
| - an existing KAI installation | ||
|
|
||
| For existing KAI-managed workloads, Kubex Automation Engine can update the `gpu-fraction` annotation without replacing the existing `kai.scheduler/queue` label. | ||
|
|
||
| ## Starter Example | ||
|
|
||
| The following example creates: | ||
|
|
||
| - an `AutomationStrategy` for KAI-enabled workloads in namespace `ml-team-a` | ||
| - a `StaticPolicy` that sets an initial shared GPU request for matching `Deployment` workloads | ||
| - a `GpuRebalancingPolicy` that adjusts that shared GPU request based on Prometheus GPU metrics | ||
|
|
||
| Both policies target `Deployment` workloads in namespace `ml-team-a` that carry the label `nvidia.com/gpu.present: "true"`. | ||
|
|
||
| ```yaml | ||
| apiVersion: rightsizing.kubex.ai/v1alpha1 | ||
| kind: AutomationStrategy | ||
| metadata: | ||
| name: kai-gpu-sharing | ||
| namespace: ml-team-a | ||
| spec: | ||
| experimental: | ||
| gpuKaiContract: v1alpha1-2026-04 | ||
| enablement: | ||
| gpu: | ||
| overrideScheduler: "kai" | ||
| requests: | ||
| downsize: true | ||
| upsize: true | ||
| setFromUnspecified: false | ||
| kai: | ||
| queue: kubex-unlimited-gpu-queue | ||
| setQueueWhenSpecified: false | ||
| inPlaceResize: | ||
| enabled: false | ||
| podEviction: | ||
| enabled: true | ||
| --- | ||
| apiVersion: rightsizing.kubex.ai/v1alpha1 | ||
| kind: StaticPolicy | ||
| metadata: | ||
| name: kai-gpu-sharing-baseline | ||
| namespace: ml-team-a | ||
| spec: | ||
| scope: | ||
| labelSelector: | ||
| matchLabels: | ||
| nvidia.com/gpu.present: "true" | ||
| workloadTypes: | ||
| - Deployment | ||
| resources: | ||
| containers: | ||
| "*": | ||
| requests: | ||
| gpu: "0.25" | ||
| automationStrategyRef: | ||
| name: kai-gpu-sharing | ||
| --- | ||
| apiVersion: rightsizing.kubex.ai/v1alpha1 | ||
| kind: GpuRebalancingPolicy | ||
| metadata: | ||
| name: kai-gpu-sharing-rebalancing | ||
| namespace: ml-team-a | ||
| spec: | ||
| experimental: | ||
| gpuKaiContract: v1alpha1-2026-04 | ||
| scope: | ||
| labelSelector: | ||
| matchLabels: | ||
| nvidia.com/gpu.present: "true" | ||
| workloadTypes: | ||
| - Deployment | ||
| minPodMetricsAge: 15m | ||
| metrics: | ||
| compute: | ||
| upsize: | ||
| thresholdPercent: 125 | ||
| metricsWindow: 10m | ||
| headroomPercent: 20 | ||
| maxPercent: 200 | ||
| scaleBack: | ||
| thresholdPercent: 60 | ||
| metricsWindow: 10m | ||
| headroomPercent: 20 | ||
| prometheus: | ||
| metric: kubex_gpu_container_compute_utilization_percent | ||
| namespaceLabel: namespace | ||
| podLabel: pod | ||
| containerLabel: container | ||
| memory: | ||
| upsize: | ||
| thresholdPercent: 125 | ||
| metricsWindow: 10m | ||
| headroomPercent: 20 | ||
| maxPercent: 200 | ||
| scaleBack: | ||
| thresholdPercent: 60 | ||
| metricsWindow: 10m | ||
| headroomPercent: 20 | ||
| prometheus: | ||
| metric: kubex_gpu_container_memory_utilization_percent | ||
| namespaceLabel: namespace | ||
| podLabel: pod | ||
| containerLabel: container | ||
| automationStrategyRef: | ||
| name: kai-gpu-sharing | ||
| ``` | ||
|
|
||
| ## Automation Strategy Notes | ||
|
|
||
| For KAI-enabled workloads, start with `spec.inPlaceResize.enabled: false`. | ||
|
|
||
| - Eviction-based resize is the safer path today for KAI-enabled workloads. | ||
| - In-place resizing for KAI-enabled workloads can be experimented with, but it is currently unstable. | ||
|
|
||
| ## Existing KAI Installations | ||
|
|
||
| For workloads that are already scheduled through KAI: | ||
|
|
||
| - keep the existing `kai.scheduler/queue` label on the workload template | ||
| - let Kubex Automation Engine update `gpu-fraction` as policies are applied | ||
|
|
||
| That allows Kubex Automation Engine to participate in GPU sharing without taking over queue assignment. | ||
|
|
||
| If you want Kubex to own queue assignment, including for workloads that already specify a queue, set `spec.kai.setQueueWhenSpecified: true` in your `AutomationStrategy`. | ||
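The queue and annotation handling described in this section can be sketched in a few lines of Python. This is only an illustration, not the engine's code: the helper `apply_kai_metadata` and the queue name `team-a-queue` are hypothetical, and the sketch assumes that `setQueueWhenSpecified` controls whether an existing `kai.scheduler/queue` label is overwritten and that a missing label is always filled in.

```python
from copy import deepcopy

QUEUE_LABEL = "kai.scheduler/queue"
FRACTION_ANNOTATION = "gpu-fraction"


def apply_kai_metadata(template_metadata, gpu_fraction, queue, set_queue_when_specified):
    """Return a copy of pod-template metadata with KAI sharing fields applied.

    Hypothetical helper that mirrors the documented behavior; the real
    controller logic may differ.
    """
    updated = deepcopy(template_metadata)
    annotations = updated.setdefault("annotations", {})
    labels = updated.setdefault("labels", {})

    # The shared GPU request is always written to the gpu-fraction annotation.
    annotations[FRACTION_ANNOTATION] = gpu_fraction

    # Assumption: an existing queue label is kept unless the strategy opts in
    # to overriding it; a missing label is always set.
    if QUEUE_LABEL not in labels or set_queue_when_specified:
        labels[QUEUE_LABEL] = queue
    return updated


existing = {"labels": {QUEUE_LABEL: "team-a-queue"}, "annotations": {}}
kept = apply_kai_metadata(existing, "0.25", "kubex-unlimited-gpu-queue", False)
owned = apply_kai_metadata(existing, "0.25", "kubex-unlimited-gpu-queue", True)
```

With `set_queue_when_specified=False` the workload keeps its original queue and only gains the annotation; with `True` the queue label is replaced as well.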
|
|
||
| ## GPU Node Consolidation | ||
|
|
||
| `GpuConsolidationPolicy` can be used to consolidate KAI GPU workloads onto fewer GPU nodes. | ||
|
|
||
| Example targeting a specific worker pool: | ||
|
|
||
| ```yaml | ||
| apiVersion: rightsizing.kubex.ai/v1alpha1 | ||
| kind: GpuConsolidationPolicy | ||
| metadata: | ||
|   name: kai-gpu-workers-a | ||
| spec: | ||
|   experimental: | ||
|     gpuKaiContract: v1alpha1-2026-04 | ||
|   nodeSelector: | ||
|     matchLabels: | ||
|       nodepool: gpu-workers-a | ||
|   utilizationThresholdPercent: 70 | ||
|   requeueAfter: 2m | ||
| ``` | ||
|
|
||
| ## Consolidation Limitations | ||
|
|
||
| GPU node consolidation is very early and has known limitations. | ||
|
|
||
| - It assumes pods will be schedulable on other nodes if they fit by GPU fraction. | ||
| - It does not yet fully model all other scheduler constraints. | ||
| - That can lead to frequent evictions when the controller chooses a node that looks drainable from GPU capacity alone but whose pods cannot actually be rescheduled cleanly. | ||
| - It may behave unpredictably with nodes that have multiple GPUs. | ||
|
|
||
| Use it carefully and start with a narrowly scoped worker pool. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| # GPU Consolidation Policy | ||
|
|
||
| > Experimental: GPU/KAI fields and related custom resources are subject to breaking changes. Set `spec.experimental.gpuKaiContract: v1alpha1-2026-04`. | ||
|
|
||
| `GpuConsolidationPolicy` is a cluster-scoped resource. Its controller looks at scheduled pods carrying the `gpu-fraction` annotation and tries to consolidate them off an underutilized node. | ||
|
|
||
| ## Behavior | ||
|
|
||
| - The controller scans all scheduled, non-terminal pods with `metadata.annotations["gpu-fraction"]`. | ||
| - `spec.nodeSelector` is required and uses standard Kubernetes label selector semantics. | ||
| - Each policy defines one compatibility pool. Create multiple policies when you need multiple compatible node pools. | ||
| - Only nodes selected by `spec.nodeSelector` are considered compatible for candidate selection and destination placement. | ||
| - Selected nodes are expected to be mutually compatible for GPU workload movement. | ||
| - Node GPU capacity is taken from `status.allocatable["nvidia.com/gpu"]`. | ||
| - Nodes with utilization below `spec.utilizationThresholdPercent` are candidates, but nodes with no GPU-fraction pods are ignored. | ||
| - Candidates are evaluated from most underutilized to least underutilized. | ||
| - A node is consolidated only when every GPU-fraction pod on that node can fit onto other non-empty GPU nodes without exceeding their allocatable capacity. | ||
| - The controller evicts all pods from the first drainable candidate node it finds in a reconcile loop. | ||
|
Contributor
Author
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Minor] [Confidence: High]
Location:
Issue: The documentation correctly calls out that consolidation evicts all evictable pods from a selected node, including pods without workload owners (static pods). However, this behavior (draining ownerless/static pods) is a significant operational risk that is buried in a notes section. Ownerless pods will not be rescheduled after eviction, which can permanently remove cluster infrastructure components (e.g., a kube-proxy or node-local-dns static pod) with no automated recovery. The
Suggested fix: Elevate this to a
||
| - Eviction is node-wide for a selected consolidation candidate: once a node is marked for consolidation, every evictable pod on that node is targeted, including pods without workload owners such as static pods. | ||
| - Reconciliation is policy-driven: the controller runs on `GpuConsolidationPolicy` changes and on the periodic timer from `spec.requeueAfter`. | ||
| - Pod and Node changes do not trigger immediate rescans. | ||
| - If no node can be fully drained, the controller records that outcome in status and waits for the next `spec.requeueAfter`. | ||
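The selection rules in the list above can be sketched as a small Python model. It is a simplified illustration under stated assumptions, not the controller's implementation: the names `Node`, `fits_elsewhere`, and `pick_node_to_drain` are hypothetical, each node is treated as a single pool of GPU capacity, and the first-fit-decreasing packing is an assumed strategy because the guide does not say how fit is computed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    allocatable_gpus: float  # from status.allocatable["nvidia.com/gpu"]
    pod_fractions: list = field(default_factory=list)  # gpu-fraction per pod

    @property
    def used(self):
        return sum(self.pod_fractions)

    @property
    def utilization_percent(self):
        return 100.0 * self.used / self.allocatable_gpus


def fits_elsewhere(candidate, others):
    """Check that every pod on `candidate` fits onto other non-empty nodes."""
    # Only nodes that already host GPU-fraction pods count as destinations.
    free = {n.name: n.allocatable_gpus - n.used for n in others if n.pod_fractions}
    # Assumed packing strategy: place the largest fractions first (first fit).
    for fraction in sorted(candidate.pod_fractions, reverse=True):
        target = next((name for name, room in free.items() if room >= fraction - 1e-9), None)
        if target is None:
            return False
        free[target] -= fraction
    return True


def pick_node_to_drain(nodes, threshold_percent):
    """Return the name of the first drainable candidate, or None."""
    candidates = [
        n for n in nodes
        if n.allocatable_gpus > 0
        and n.pod_fractions  # nodes with no GPU-fraction pods are ignored
        and n.utilization_percent < threshold_percent
    ]
    # Most underutilized candidates are evaluated first.
    candidates.sort(key=lambda n: n.utilization_percent)
    for candidate in candidates:
        others = [n for n in nodes if n is not candidate]
        if fits_elsewhere(candidate, others):
            return candidate.name
    return None


nodes = [Node("a", 1.0, [0.25]), Node("b", 1.0, [0.5, 0.25]), Node("c", 1.0, [])]
drained = pick_node_to_drain(nodes, threshold_percent=70)
```

In this example node `a` is the only candidate below the threshold, and its single pod fits into the free quarter on node `b`; the empty node `c` is neither a candidate nor a destination.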
|
|
||
| ## Examples | ||
|
|
||
| ```yaml | ||
| apiVersion: rightsizing.kubex.ai/v1alpha1 | ||
| kind: GpuConsolidationPolicy | ||
| metadata: | ||
|   name: gpu-consolidation-pool-a | ||
| spec: | ||
|   experimental: | ||
|     gpuKaiContract: v1alpha1-2026-04 | ||
|   nodeSelector: | ||
|     matchLabels: | ||
|       kubex.ai/gpu-pool: pool-a | ||
|   utilizationThresholdPercent: 75 | ||
|   requeueAfter: 1m | ||
| ``` | ||
|
|
||
| Use one policy per compatibility pool: | ||
|
|
||
| ```yaml | ||
| apiVersion: rightsizing.kubex.ai/v1alpha1 | ||
| kind: GpuConsolidationPolicy | ||
| metadata: | ||
|   name: gpu-consolidation-l40s | ||
| spec: | ||
|   experimental: | ||
|     gpuKaiContract: v1alpha1-2026-04 | ||
|   nodeSelector: | ||
|     matchExpressions: | ||
|       - key: kubex.ai/gpu-pool | ||
|         operator: In | ||
|         values: | ||
|           - batch-l40s | ||
|       - key: accelerator.nvidia.com/class | ||
|         operator: In | ||
|         values: | ||
|           - l40s | ||
|   utilizationThresholdPercent: 70 | ||
|   requeueAfter: 2m | ||
| --- | ||
| apiVersion: rightsizing.kubex.ai/v1alpha1 | ||
| kind: GpuConsolidationPolicy | ||
| metadata: | ||
|   name: gpu-consolidation-h100 | ||
| spec: | ||
|   experimental: | ||
|     gpuKaiContract: v1alpha1-2026-04 | ||
|   nodeSelector: | ||
|     matchLabels: | ||
|       kubex.ai/gpu-pool: training-h100 | ||
|   utilizationThresholdPercent: 80 | ||
|   requeueAfter: 1m | ||
| ``` | ||
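To make the two selector forms concrete, here is a minimal Python sketch of standard Kubernetes label selector matching, where all `matchLabels` and `matchExpressions` requirements must hold. The function `matches_selector` and the sample node labels are illustrative assumptions; the controller itself presumably relies on the Kubernetes client libraries rather than code like this.

```python
def matches_selector(node_labels, selector):
    """Evaluate a Kubernetes-style label selector against a node's labels."""
    # matchLabels: every key must be present with exactly the given value.
    for key, value in selector.get("matchLabels", {}).items():
        if node_labels.get(key) != value:
            return False

    # matchExpressions: every expression must hold (logical AND).
    for expr in selector.get("matchExpressions", []):
        key, operator = expr["key"], expr["operator"]
        values = expr.get("values", [])
        present = key in node_labels
        if operator == "In":
            ok = present and node_labels[key] in values
        elif operator == "NotIn":
            ok = not present or node_labels[key] not in values
        elif operator == "Exists":
            ok = present
        elif operator == "DoesNotExist":
            ok = not present
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if not ok:
            return False
    return True


l40s_selector = {
    "matchExpressions": [
        {"key": "kubex.ai/gpu-pool", "operator": "In", "values": ["batch-l40s"]},
        {"key": "accelerator.nvidia.com/class", "operator": "In", "values": ["l40s"]},
    ]
}
h100_selector = {"matchLabels": {"kubex.ai/gpu-pool": "training-h100"}}

# Hypothetical node labels for illustration only.
l40s_node = {"kubex.ai/gpu-pool": "batch-l40s", "accelerator.nvidia.com/class": "l40s"}
h100_node = {"kubex.ai/gpu-pool": "training-h100"}
```

Each sample node matches only its own pool's selector, so the two policies never treat each other's nodes as candidates or destinations.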
|
|
||
| ## Notes | ||
|
|
||
| - This policy is cluster-scoped only. | ||
| - `spec.nodeSelector` is the compatibility boundary for consolidation. | ||
| - It is self-contained and does not reference `AutomationStrategy`. | ||
| - Consolidation is based on GPU-fraction capacity only; it does not model CPU, memory, or scheduler affinity constraints. | ||
| - Consolidation drain behavior is not limited to GPU-fraction pods. After a node is selected, the node is drained by evicting all evictable pods on it, even when some of those pods do not have owners. | ||
| - If `spec.nodeSelector` matches no nodes, the policy reports `NoMatchingNodeSelector` and performs no evictions. | ||
| - If you need faster reaction to workload churn, lower `spec.requeueAfter`. | ||
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Minor] [Confidence: Medium]
Location: `charts/kubex-automation-engine/docs/GPU-Sharing-with-KAI.md:130` (the `setQueueWhenSpecified` description)
Issue: The `GPU-Sharing-with-KAI.md` guide states:

> If you want queue assignment to be done via Kubex, set `spec.kai.setQueueWhenSpecified: false` in your AutomationStrategy.

This is likely a copy/paste documentation error. Setting the field to `false` means Kubex will not overwrite an existing queue label, which is the opposite of enabling Kubex to do queue assignment. The correct instruction to allow Kubex to own queue assignment would be `setQueueWhenSpecified: true`.
Suggested fix: