Skip to content

Sync kubex charts from automation-controller main @ d8bde98#108

Merged
gasarekubex merged 2 commits into
masterfrom
sync/kubex-charts/main-d8bde98
May 12, 2026
Merged

Sync kubex charts from automation-controller main @ d8bde98#108
gasarekubex merged 2 commits into
masterfrom
sync/kubex-charts/main-d8bde98

Conversation

@kubexautomation
Copy link
Copy Markdown
Contributor

This PR was created automatically to sync managed Kubex chart content.

Source commit:

  • d8bde98

Managed chart scope:

  • charts/kubex-automation-engine
  • charts/kubex-crds

@kubexautomation kubexautomation Bot requested a review from a team as a code owner May 12, 2026 17:16
Copy link
Copy Markdown
Contributor Author

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: This PR is a well-structured, large-scope feature addition introducing GPU/KAI (Run:ai Kai scheduler) integration as an experimental capability. The experimental gating mechanism (gpuKaiContract acknowledgement token), CRD validations, RBAC completeness, and webhook coverage are well thought out. A handful of correctness issues in the Helm templates and documentation need attention before merging.

Risk level: Medium


Critical issues

None.


Major issues

  • experimental block in clusterautomationstrategy.yaml is structurally decoupled from the gpu block it guards (line 14, clusterautomationstrategy.yaml): The experimental section is rendered at the top of the spec before the enablement block, separated by ~130 lines from the GPU rendering logic. An empty enablement.gpu: {} map would still trigger the experimental block even though no GPU fields follow. Co-locating the two guards will eliminate silent partial renders.

  • Silent key precedence for gpu.request vs gpu.requests (line 146, clusterautomationstrategy.yaml): If both the deprecated request key and the new requests key are provided, request wins silently. The same pattern exists for cpu and memory. Either add a fail guard for the conflict case or clearly document the precedence order in values.yaml.

  • globalConfiguration.name change is a breaking in-place upgrade (line 8, globalconfiguration.yaml): The name was previously hardcoded as global-config. Making it configurable is correct, but without a migration note, any operator who upgrades from a prior version and supplies a different name will create an orphaned duplicate GlobalConfiguration CR, causing unpredictable controller behavior.

  • prometheus.url and kai.schedulerName are rendered unconditionally (line 35, globalconfiguration.yaml): Non-GPU users will have a hardcoded http://prometheus.monitoring.svc:9090 baked into their GlobalConfiguration on every upgrade, regardless of whether they use GPU rebalancing. This is a silent regression for non-GPU deployments where Prometheus is absent.


Minor issues

  • imagePullSecrets value is defined but never rendered (values.yaml line 72): The field is documented in values.yaml with a helpful example, but no template reads or applies it. It is silently a no-op.

  • overrideScheduler field is missing from the Helm template (clusterautomationstrategy.yaml line 150): The CRD defines spec.enablement.gpu.overrideScheduler (enum: none/kai), but the template does not render it. Users cannot activate KAI GPU mutation mode through Helm-managed strategies.

  • spec.kai block is missing from the Helm template (clusterautomationstrategy.yaml line 183): The CRD defines spec.kai.queue and spec.kai.setQueueWhenSpecified, but neither is rendered. KAI queue assignment cannot be configured through Helm-managed strategies.

  • Documentation inversion in GPU-Sharing-with-KAI.md (line 120): The guide incorrectly states that setting spec.kai.setQueueWhenSpecified: false enables Kubex to take over queue assignment. The correct value is true; false preserves the existing label.

  • Capabilities.APIVersions.Has check in kai-queues.yaml (line 2): The check validates API version group presence, not that the Queue CRD kind is actually registered. A partial Run:ai install could pass the guard and still fail at server-side apply. This is low risk but worth noting in the error message.


DRY improvement opportunities

  • GPU enablement rendering is duplicated across clusterautomationstrategy.yaml: The identical requests sub-block pattern (downsize/upsize/setFromUnspecified/floor/ceiling/containers) is repeated for CPU, memory, and now GPU. Consider extracting a named template partial (e.g., {{- define "kubex.renderResourceEnablement" }}) to reduce the 200+ line repetition and ensure future field additions (e.g., overrideScheduler, kai) are applied consistently to all resource types.

  • experimental block pattern is duplicated across CRD docs: The identical > Experimental: GPU/KAI-related fields... callout appears verbatim in Automation-Strategies.md, Cluster-Automation-Strategies.md, and Global-Configuration.md. This is appropriate for docs that may drift independently, but worth noting.


Suggested next steps

  1. Add rendering for overrideScheduler and the kai block to clusterautomationstrategy.yaml (missing fields from the CRD schema).
  2. Wire imagePullSecrets into deployment.yaml or remove the value stub until it is implemented.
  3. Add an upgrade note about globalConfiguration.name in the README or CHANGELOG.
  4. Gate prometheus and kai in globalconfiguration.yaml to render only when non-empty/non-default values are set.
  5. Fix the setQueueWhenSpecified documentation inversion in GPU-Sharing-with-KAI.md.
  6. Confirm Chart.yaml version bumps for both kubex-crds and kubex-automation-engine are present (none appear in this diff).

{{- if and $policyConfig.enablement $policyConfig.enablement.gpu }}
experimental:
gpuKaiContract: {{ $.Values.experimental.gpuKaiContract | quote }}
{{- end }}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Issue: The experimental block is rendered when $policyConfig.enablement.gpu is truthy, but the gpu enablement block itself is rendered under the broader {{- if $policyConfig.enablement }} guard further down. These two guards are structurally independent, meaning if enablement is set without a gpu key, no experimental block is emitted — correct. However, the experimental block is placed outside the enablement block (rendered before it), so if enablement.gpu is set, the rendered YAML will have experimental: appear before enablement:. This is valid YAML, but it means removing gpu from enablement in a future values update will silently drop the experimental block without any warning, while the gpu sub-block inside enablement continues rendering until the full enablement guard fires.

Why it matters: If a user sets enablement.gpu to {} (empty map), Helm evaluates it as truthy, so experimental.gpuKaiContract gets emitted even though no actual GPU fields follow. Conversely, a template author editing the file later may not notice the experimental guard is separated from the gpu rendering block by ~130 lines, leading to drift between the two.

Suggested fix: Move the experimental block to render inside the {{- if $policyConfig.enablement }} conditional, right before the gpu block, or consolidate both guards into a single {{- if and $policyConfig.enablement $policyConfig.enablement.gpu }} wrapping the entire section. Example:

{{- if $policyConfig.enablement }}
enablement:
  ...
  {{- if $policyConfig.enablement.gpu }}
  experimental:
    gpuKaiContract: {{ $.Values.experimental.gpuKaiContract | quote }}
  gpu:
    ...
  {{- end }}
{{- end }}

Note: experimental is a sibling of enablement in the CRD spec, so it must remain at the spec level — but it should still be co-located structurally with the GPU block that triggers it.

{{- end }}
{{- end }}
{{- end }}
{{- end }}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Issue: Silent precedence when both request (singular) and requests (plural) keys are present in values — request wins silently.

Why it matters: default $policyConfig.enablement.gpu.request $policyConfig.enablement.gpu.requests returns the second argument only when the first is falsy/empty. If a user supplies both gpu.request and gpu.requests (e.g. during a migration from old to new key), gpu.request silently wins and gpu.requests is completely ignored. There is no warning emitted and no documentation noting which key takes precedence. This exact pattern is also used for cpu.request/cpu.requests and memory.request/memory.requests in the same template.

Suggested fix: Add an explicit mutual-exclusivity check at the top of each block (or document the precedence clearly in values.yaml comments):

{{- if and $policyConfig.enablement.gpu.request $policyConfig.enablement.gpu.requests }}
{{- fail (printf "Policy %s: set either enablement.gpu.request or enablement.gpu.requests, not both" $policyName) }}
{{- end }}
{{- $gpuReq := coalesce $policyConfig.enablement.gpu.requests $policyConfig.enablement.gpu.request }}

At minimum, add a comment to values.yaml making it clear that requests is the canonical key and request is the deprecated alias.

kind: GlobalConfiguration
metadata:
name: global-config
name: {{ .Values.globalConfiguration.name }}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Issue: Changing metadata.name from the hardcoded string global-config to {{ .Values.globalConfiguration.name }} is a breaking change for in-place upgrades when users have customized this value or when any external resource (e.g., automationStrategyRef, documentation links, or manually-created CRs) references the name global-config.

Why it matters: On helm upgrade, Helm tracks resources by their rendered name. If globalConfiguration.name is ever set to anything other than global-config (the default), Helm will attempt to create a new GlobalConfiguration CR with the new name while leaving the old global-config CR orphaned in the cluster. The controller may now manage two GlobalConfiguration CRs simultaneously (since the resource is a singleton by convention, not by CRD enforcement), causing unpredictable behavior.

Suggested fix:

  1. Add an upgrade note/warning in README.md explaining that globalConfiguration.name must match the name of any previously deployed GlobalConfiguration CR.
  2. Consider adding a Helm pre-upgrade hook or a _helpers.tpl note asserting that renaming this value from its previous default requires manual deletion of the old CR.
  3. If backward compatibility must be preserved, keep the default as global-config but document the field clearly.

failureThreshold: {{ .Values.globalConfiguration.webhookHealth.failureThreshold }}
successThreshold: {{ .Values.globalConfiguration.webhookHealth.successThreshold }}
transitionCheckInterval: {{ .Values.globalConfiguration.webhookHealth.transitionCheckInterval | quote }}
kai:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Issue: prometheus.url and kai.schedulerName are rendered unconditionally into the GlobalConfiguration CR even when no GPU/KAI features are configured.

Why it matters: A user who upgrades without any GPU workloads will now have prometheus.url: "http://prometheus.monitoring.svc:9090" baked into their GlobalConfiguration CR. If Prometheus is not installed, the controller may attempt to connect and generate warning logs or failed health checks on every reconcile cycle, even when no GpuRebalancingPolicy exists. This is a silent operational regression for non-GPU users.

Suggested fix: Gate these blocks behind a {{- if .Values.globalConfiguration.prometheus.url }} (or a dedicated feature flag) so they are omitted from the CR when not needed. Alternatively, leave the fields optional at the CRD level and only render them when non-empty:

{{- with .Values.globalConfiguration.prometheus.url }}
prometheus:
  url: {{ . | quote }}
  requestTimeout: {{ $.Values.globalConfiguration.prometheus.requestTimeout | quote }}
{{- end }}

@@ -0,0 +1,47 @@
{{- if .Values.kaiQueues.enabled -}}
{{- if not (.Capabilities.APIVersions.Has "scheduling.run.ai/v2/Queue") -}}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Issue: The Capabilities.APIVersions.Has check uses "scheduling.run.ai/v2/Queue" but the apiVersion in the rendered Queue resources is scheduling.run.ai/v2. The Has function expects the format <group>/<version>/<kind> when kind is included, which is scheduling.run.ai/v2/Queue. This should work correctly with Helm's capabilities check — but note that Capabilities.APIVersions.Has checks for installed API versions, not installed CRDs by kind. If the Run:ai operator installs scheduling.run.ai/v2 but does not register the Queue kind under it (e.g., a partial install), this check will pass and the apply will still fail at the server side.

Why it matters: The guard gives a false sense of pre-flight safety. In practice, a partial or misconfigured Run:ai installation could pass the check but still fail resource creation.

Suggested fix: Consider adding a note in the error message that the check validates API group presence, not full Queue CRD availability. Alternatively, use a lookup to verify a Queue resource exists, though that requires cluster connectivity:

{{- if not (lookup "scheduling.run.ai/v2" "Queue" "" "") }}
{{- fail "..." }}
{{- end }}

This is a minor concern since the fail message already guides users to install the CRDs first.

# -- Image pull secrets for private registries
# Example:
# imagePullSecrets:
# - name: regcred
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: Medium]

Issue: imagePullSecrets is added to values.yaml but there is no corresponding rendering in templates/deployment.yaml (or any other template).

Why it matters: A user who sets imagePullSecrets expecting it to be applied to the deployment pod spec will get no effect — the value is silently ignored. This is misleading documentation.

Suggested fix: Either wire imagePullSecrets into the deployment template:

{{- with .Values.imagePullSecrets }}
imagePullSecrets:
  {{- toYaml . | nindent 8 }}
{{- end }}

Or remove the field from values.yaml until the template support is in place, and add it in a follow-up PR with a proper implementation.

headroomPercent: 20
prometheus:
metric: kubex_gpu_container_memory_utilization_percent
namespaceLabel: namespace
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Issue: The example in the GPU-Sharing-with-KAI.md "Existing KAI Installations" section contains a logic inversion in the description.

The prose says:

"If you want queue assignment to be done via Kubex, set spec.kai.setQueueWhenSpecified: false in your AutomationStrategy."

But setQueueWhenSpecified: false means Kubex will not overwrite an existing queue label — i.e. queue assignment stays with whatever label the workload already has. To have Kubex take over queue assignment, the value should be true.

Why it matters: Users following this guide to enable Kubex-driven queue assignment will set the wrong value, resulting in no queue override being applied, with no error to indicate the misconfiguration.

Suggested fix: Correct the sentence to either:

  • "If you want queue assignment to be done via Kubex, set spec.kai.setQueueWhenSpecified: true"
  • Or clarify that false preserves the existing queue label (opt-out mode).

{{- if $policyConfig.enablement.gpu }}
{{- $gpuReq := default $policyConfig.enablement.gpu.request $policyConfig.enablement.gpu.requests }}
gpu:
{{- if $gpuReq }}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Issue: The overrideScheduler field is not rendered in the clusterautomationstrategy.yaml template, even though it is a defined field in the CRD schema for spec.enablement.gpu.overrideScheduler.

Why it matters: Users who set policy.policies.<name>.enablement.gpu.overrideScheduler: kai in their values will have the field silently dropped from the rendered CR. The KAI GPU mutation mode will never be activated via Helm-managed strategies, requiring users to manage raw CRs manually for this field — an inconsistency with all other enablement sub-fields.

Suggested fix: Add rendering for overrideScheduler in the gpu: block of the template, analogous to how downsize, upsize, etc. are rendered:

{{- if $gpuReq.overrideScheduler }}
overrideScheduler: {{ $gpuReq.overrideScheduler }}
{{- end }}

Note: overrideScheduler sits at the gpu: level, not inside gpu.requests:, so it should be added before the requests: sub-block.

{{- end }}
{{- end }}
{{- if $policyConfig.inPlaceResize }}
inPlaceResize:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Issue: The kai block (spec.kai.queue and spec.kai.setQueueWhenSpecified) defined in the CRD for AutomationStrategy / ClusterAutomationStrategy is also not rendered in the Helm template. This is the same class of omission as overrideScheduler.

Why it matters: Users relying on Helm-managed strategies to control which KAI queue is assigned to GPU workloads will have this configuration silently dropped. This means the only way to configure spec.kai is via manually managed raw CRs, which is inconsistent with the overall Helm-driven workflow advertised in the documentation and README.

Suggested fix: Add a {{- if $policyConfig.kai }} block in the template after the enablement section, rendering queue and setQueueWhenSpecified:

{{- if $policyConfig.kai }}
kai:
  {{- if $policyConfig.kai.queue }}
  queue: {{ $policyConfig.kai.queue | quote }}
  {{- end }}
  {{- if hasKey $policyConfig.kai "setQueueWhenSpecified" }}
  setQueueWhenSpecified: {{ $policyConfig.kai.setQueueWhenSpecified }}
  {{- end }}
{{- end }}

@gasarekubex gasarekubex merged commit b4b62b7 into master May 12, 2026
1 check passed
@gasarekubex gasarekubex deleted the sync/kubex-charts/main-d8bde98 branch May 12, 2026 17:24
Copy link
Copy Markdown
Contributor Author

@kubexautomation kubexautomation Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

Overall assessment: This PR's final merged diff is a clean, documentation-only change — 4 lines added to Safety-Controls.md documenting the new floor-clamped and ceiling-clamped filter runtime names. The additions are accurate, internally consistent with the existing table and example format, and add clear operational value for users debugging bounds-enforcement behavior.

Risk level: Low


Critical issues

None.

Major issues

None.

Minor issues

None.

DRY improvement opportunities

None in this diff.


Notes on prior review comments

The 9 existing inline review comments on this PR (covering clusterautomationstrategy.yaml, globalconfiguration.yaml, kai-queues.yaml, values.yaml, and GPU-Sharing-with-KAI.md) are all marked outdated — they reference an earlier commit (f62c18418d) that is not reflected in the final merged state. The items they raise (missing overrideScheduler/kai template rendering, unconditional Prometheus rendering, imagePullSecrets stub, documentation inversion in GPU-Sharing-with-KAI.md, etc.) are worth tracking as follow-up issues if those files were reverted or not carried forward in the merge.

Suggested next steps

  1. Confirm that the issues raised in the outdated review comments (from the earlier commit) were intentionally deferred or are tracked in follow-up issues.
  2. No changes required for the merged documentation content — it is correct as-is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant