Skip to content

Sync kubex charts from automation-controller main @ 168df7d#107

Merged
gasarekubex merged 1 commit into
masterfrom
sync/kubex-charts/main-168df7d
May 12, 2026
Merged

Sync kubex charts from automation-controller main @ 168df7d#107
gasarekubex merged 1 commit into
masterfrom
sync/kubex-charts/main-168df7d

Conversation

@kubexautomation
Copy link
Copy Markdown
Contributor

This PR was created automatically to sync managed Kubex chart content.

Source commit:

  • 168df7d

Managed chart scope:

  • charts/kubex-automation-engine
  • charts/kubex-crds

@kubexautomation kubexautomation Bot requested a review from a team as a code owner May 8, 2026 19:25
Copy link
Copy Markdown
Contributor Author

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall Assessment

This PR introduces the GPU/KAI experimental feature set (rebalancing, consolidation, KAI queue integration) for the kubex-automation-engine and kubex-crds charts. The scope is large (22 files, ~2,100 additions) but the code is generally well-structured and the experimental feature gating via gpuKaiContract acknowledgement token is a sound design. Three issues deserve attention before merge: an ungated densify-config volume that will break pod starts when gateway.enabled: false, unconditional kai/prometheus fields in GlobalConfiguration that create a hard CRD upgrade-ordering dependency, and a documentation inversion in the setQueueWhenSpecified guide.

Risk level: Medium


Critical issues

None.


Major issues

  • deployment.yamldensify-config volume always rendered regardless of gateway.enabled.
    When gateway.enabled: false, the densify-config Secret volume is still declared in volumes:, and Kubernetes will try to resolve the kubex-gateway-config Secret. If the Secret is absent (normal when gateway is disabled), the pod will fail to start with CreateContainerConfigError. The volume must be wrapped in the same {{- if .Values.gateway.enabled }} guard as the container.
    Location: charts/kubex-automation-engine/templates/deployment.yaml — the volumes: section after line 166.

  • globalconfiguration.yamlkai and prometheus blocks always emitted unconditionally.
    These new fields are rendered into the GlobalConfiguration CR even when GPU/KAI is not being used. If kubex-crds has not been upgraded to the version that adds kai/prometheus to the GlobalConfiguration CRD, kubectl apply will reject the CR with unknown-field validation errors. This silently breaks the entire controller configuration. Either guard with a condition or document the mandatory upgrade ordering (kubex-crds first).
    Location: charts/kubex-automation-engine/templates/globalconfiguration.yaml:33

  • globalconfiguration.yaml — configurable singleton name risks orphaned CRs on rename.
    If an operator changes globalConfiguration.name from the default global-config, Helm creates a second GlobalConfiguration without removing the original. Controller behavior with two GlobalConfiguration CRs is undefined. A NOTES.txt warning or a Helm pre-upgrade hook is advisable.
    Location: charts/kubex-automation-engine/templates/globalconfiguration.yaml:9

  • clusterautomationstrategy.yamlexperimental block not emitted when kai: is set without enablement.gpu:.
    The CRD marks spec.experimental.gpuKaiContract as required when experimental is present. If a user configures spec.kai.* but omits enablement.gpu, the experimental block is absent and the CR will fail validation with a confusing error.
    Location: charts/kubex-automation-engine/templates/clusterautomationstrategy.yaml:11


Minor issues

  • GPU-Sharing-with-KAI.md:130setQueueWhenSpecified instruction is inverted.
    The guide says "set setQueueWhenSpecified: false if you want Kubex to manage queue assignment", but false means Kubex will not overwrite an existing queue label. The correct value is true.

  • clusterautomationstrategy.yaml:147request/requests alias precedence not documented.
    default .gpu.request .gpu.requests gives requests (plural) priority when both are set, but values.yaml only documents request (singular). Undocumented dual-key aliasing will confuse users.

  • kai-queues.yaml — Queue resource parameters hard-coded with no values exposure.
    quota: -1, limit: -1, and overQuotaWeight: 1 are fixed. Operators needing customization must disable the built-in queues entirely. Consider documenting this limitation or exposing at least overQuotaWeight.

  • gpu-consolidation-policy.md — static pod drain risk under-emphasized.
    The fact that consolidation evicts all pods including ownerless/static pods is noted only in a nested bullet. This should be a top-level > [!WARNING] block, and GPU-Sharing-with-KAI.md should cross-reference the limitation.


DRY improvement opportunities

  • The gpu.requests CRD schema block (properties: ceiling, containers, downsize, floor, setFromUnspecified, upsize, plus the floor≤ceiling CEL validation) is copy-pasted verbatim between rightsizing.kubex.ai_automationstrategies.yaml and rightsizing.kubex.ai_clusterautomationstrategies.yaml. Since these are generated CRDs this is likely unavoidable at the source-of-truth level, but worth noting if the generator supports shared schema fragments.

  • The experimental + kai Helm template blocks in clusterautomationstrategy.yaml are presumably identical (or very similar) to what would be needed for a namespaced automationstrategy.yaml template. If a namespaced template exists or will exist, extracting these into a shared partial (_helpers.tpl) would prevent future drift.


Suggested next steps

  1. Fix the densify-config volume gating in deployment.yaml — this is the highest-risk functional regression.
  2. Either guard kai/prometheus blocks in globalconfiguration.yaml or add explicit upgrade-ordering documentation.
  3. Correct the setQueueWhenSpecified documentation inversion in GPU-Sharing-with-KAI.md.
  4. Expand the experimental block condition in clusterautomationstrategy.yaml to also fire on kai: config.
  5. Add a > [!WARNING] callout in gpu-consolidation-policy.md for the static pod drain behavior.

{{- end }}
volumes:
- name: cert
secret:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-engine/templates/deployment.yaml:166 (the {{- end }} closing the gateway.enabled block)

Issue: The densify-config and densify-data volumes in the volumes: section are always rendered (outside the {{- if .Values.gateway.enabled }} block), but densify-config is only mounted inside the gateway container which is conditionally included. When gateway.enabled: false, densify-config will be defined as a volume pointing to the kubex-gateway-config Secret — Kubernetes will still attempt to resolve that Secret at pod creation time. If the Secret does not exist (which is likely when gateway is disabled), the pod will fail to start.

Additionally, the manager container directly mounts densify-data (line ~113), which is unconditionally defined outside the if block — that is fine. But densify-config has no mount in the manager container and is purely a gateway volume. It should be conditionally rendered alongside the gateway container.

Suggested fix: Wrap the densify-config volume definition with the same gateway.enabled guard:

volumes:
- name: cert
  ...
- name: densify-data
  emptyDir: {}
{{- if .Values.gateway.enabled }}
- name: densify-config
  secret:
    secretName: {{ .Values.gateway.configSecretName | default "kubex-gateway-config" }}
{{- end }}

@@ -31,6 +32,11 @@ spec:
failureThreshold: {{ .Values.globalConfiguration.webhookHealth.failureThreshold }}
successThreshold: {{ .Values.globalConfiguration.webhookHealth.successThreshold }}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-engine/templates/globalconfiguration.yaml:33 (the kai: block)

Issue: The kai and prometheus blocks are always emitted unconditionally into the GlobalConfiguration CR spec, regardless of whether the GPU/KAI feature is enabled. If an operator has upgraded kubex-automation-engine but has not yet upgraded kubex-crds to the version in this PR (which adds kai and prometheus fields to the GlobalConfiguration CRD), applying this chart will produce a CR that fails CRD schema validation with "unknown field" errors, breaking the GlobalConfiguration reconciliation loop entirely.

This creates a hard upgrade-ordering dependency that is not called out in any documentation.

Suggested fix: Either:

  1. Guard these blocks with a condition (e.g., {{- if .Values.globalConfiguration.kai.schedulerName }}), or
  2. Add an explicit note in the chart's README.md / upgrade guide that kubex-crds must be upgraded before kubex-automation-engine when moving to this version.

metadata:
name: global-config
name: {{ .Values.globalConfiguration.name }}
labels:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-engine/templates/globalconfiguration.yaml:9 (the name: field)

Issue: GlobalConfiguration is a singleton CR (there is typically only one per cluster). Making the name configurable via globalConfiguration.name (defaulting to global-config) is functionally correct, but during a Helm upgrade if an operator sets a different name, Helm will create a new CR with the new name while the old global-config CR remains. Two GlobalConfiguration CRs could then both be reconciled, causing undefined controller behavior.

The controller would need to handle or reject duplicate singletons, and there is no chart-level guard against this.

Suggested fix: Either:

  1. Add a NOTES.txt or validation hook warning that changing globalConfiguration.name requires manual deletion of the old CR, or
  2. Consider whether this field actually needs to be configurable; most controllers treat this singleton by a well-known name.

labels:
{{- include "kubex-automation-engine.labels" $ | nindent 4 }}
spec:
{{- if and $policyConfig.enablement $policyConfig.enablement.gpu }}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-engine/templates/clusterautomationstrategy.yaml:11 (the experimental: block)

Issue: The experimental.gpuKaiContract field is only emitted when $policyConfig.enablement.gpu is set. However, the CRD for both AutomationStrategy and ClusterAutomationStrategy marks spec.experimental.gpuKaiContract as required whenever the experimental object is present. More critically, if a user configures spec.kai (e.g. queue, setQueueWhenSpecified) without also setting enablement.gpu, the experimental block will be absent from the rendered YAML but the CRD will still require it for any GPU/KAI usage — resulting in a validation failure that gives no clear signal to the user about what is missing.

Suggested fix: Either expand the condition to also include kai fields:

{{- if or (and $policyConfig.enablement $policyConfig.enablement.gpu) $policyConfig.kai }}
experimental:
  gpuKaiContract: {{ $.Values.experimental.gpuKaiContract | quote }}
{{- end }}

Or document clearly in values.yaml that kai: cannot be set without enablement.gpu:.

{{- end }}
{{- end }}
{{- end }}
{{- if $policyConfig.enablement.gpu }}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-automation-engine/templates/clusterautomationstrategy.yaml:147 (the $gpuReq assignment line)

Issue: The Helm default function has the semantics default <default_value> <test_value> — it returns <test_value> if non-empty, otherwise <default_value>. The current expression:

{{- $gpuReq := default $policyConfig.enablement.gpu.request $policyConfig.enablement.gpu.requests }}

Returns .gpu.requests if it is non-empty, otherwise falls back to .gpu.request. This means .gpu.requests (plural) takes priority, which is the intended new canonical form — good. However, the comment in values.yaml documents only request: (singular) without acknowledging the requests: alias or specifying which takes priority. A user who sets both will silently have requests win, which may be surprising.

Suggested fix: In values.yaml, document the alias behavior:

# gpu:
#   # 'requests' is the canonical field; 'request' (singular) is a deprecated alias kept for backward compatibility.
#   requests:
#     downsize: true

## Automation Strategy Notes

For KAI-enabled workloads, start with `spec.inPlaceResize.enabled: false`.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: Medium]

Location: charts/kubex-automation-engine/docs/GPU-Sharing-with-KAI.md:130 (the setQueueWhenSpecified description)

Issue: The GPU-Sharing-with-KAI.md guide states:

If you want queue assignment to be done via Kubex, set spec.kai.setQueueWhenSpecified: false in your AutomationStrategy.

This is likely a copy/paste documentation error. Setting the field to false means Kubex will not overwrite an existing queue label — the opposite of enabling Kubex to do queue assignment. The correct instruction to allow Kubex to own queue assignment would be setQueueWhenSpecified: true.

Suggested fix:

If you want Kubex to manage queue assignment (overwriting an existing `kai.scheduler/queue` label), set `spec.kai.setQueueWhenSpecified: true` in your AutomationStrategy.

@@ -0,0 +1,47 @@
{{- if .Values.kaiQueues.enabled -}}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-automation-engine/templates/kai-queues.yaml:1

Issue: The kubex-parent-queue and kubex-unlimited-gpu-queue resources have their quota, limit, and overQuotaWeight values hard-coded (quota: -1, limit: -1, overQuotaWeight: 1). There is no values-level exposure to customize these, meaning any operator who needs different quota limits must either disable kaiQueues.enabled entirely and manage the Queues manually, or patch them post-install (which Helm will overwrite on the next upgrade).

This is acceptable for a v1 opt-in feature but limits the usefulness of the built-in queue creation.

Suggested fix (minor, optional): Expose at least overQuotaWeight through values to allow priority tuning, or call out in the documentation that the queues are intentionally unlimited and operators needing custom quotas should manage them outside Helm.

- Nodes with utilization below `spec.utilizationThresholdPercent` are candidates, but nodes with no GPU-fraction pods are ignored.
- Candidates are evaluated from most underutilized to least underutilized.
- A node is consolidated only when every GPU-fraction pod on that node can fit onto other non-empty GPU nodes without exceeding their allocatable capacity.
- The controller evicts all pods from the first drainable candidate node it finds in a reconcile loop.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-automation-engine/docs/gpu-consolidation-policy.md:18 (the node-drain behavior description)

Issue: The documentation correctly calls out that consolidation evicts all evictable pods from a selected node, including pods without workload owners (static pods). However, this behavior — draining ownerless/static pods — is a significant operational risk that is buried in a notes section. Ownerless pods will not be rescheduled after eviction, which can permanently remove cluster infrastructure components (e.g., a kube-proxy or node-local-dns static pod) with no automated recovery.

The GPU-Sharing-with-KAI.md guide does not cross-reference this limitation at all.

Suggested fix: Elevate this to a > [!WARNING] callout block near the top of the gpu-consolidation-policy.md behavior section, and add a cross-reference in GPU-Sharing-with-KAI.md under "Consolidation Limitations".

@gasarekubex gasarekubex merged commit f4ea901 into master May 12, 2026
1 check passed
@gasarekubex gasarekubex deleted the sync/kubex-charts/main-168df7d branch May 12, 2026 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant