Sync kubex charts from automation-controller main @ d8bde98#108
Conversation
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
Overall assessment: This PR is a well-structured, large-scope feature addition introducing GPU/KAI (Run:ai Kai scheduler) integration as an experimental capability. The experimental gating mechanism (gpuKaiContract acknowledgement token), CRD validations, RBAC completeness, and webhook coverage are well thought out. A handful of correctness issues in the Helm templates and documentation need attention before merging.
Risk level: Medium
Critical issues
None.
Major issues
-
experimentalblock inclusterautomationstrategy.yamlis structurally decoupled from thegpublock it guards (line 14,clusterautomationstrategy.yaml): Theexperimentalsection is rendered at the top of the spec before theenablementblock, separated by ~130 lines from the GPU rendering logic. An emptyenablement.gpu: {}map would still trigger theexperimentalblock even though no GPU fields follow. Co-locating the two guards will eliminate silent partial renders. -
Silent key precedence for
gpu.requestvsgpu.requests(line 146,clusterautomationstrategy.yaml): If both the deprecatedrequestkey and the newrequestskey are provided,requestwins silently. The same pattern exists forcpuandmemory. Either add afailguard for the conflict case or clearly document the precedence order invalues.yaml. -
globalConfiguration.namechange is a breaking in-place upgrade (line 8,globalconfiguration.yaml): The name was previously hardcoded asglobal-config. Making it configurable is correct, but without a migration note, any operator who upgrades from a prior version and supplies a different name will create an orphaned duplicateGlobalConfigurationCR, causing unpredictable controller behavior. -
prometheus.urlandkai.schedulerNameare rendered unconditionally (line 35,globalconfiguration.yaml): Non-GPU users will have a hardcodedhttp://prometheus.monitoring.svc:9090baked into theirGlobalConfigurationon every upgrade, regardless of whether they use GPU rebalancing. This is a silent regression for non-GPU deployments where Prometheus is absent.
Minor issues
-
imagePullSecretsvalue is defined but never rendered (values.yamlline 72): The field is documented invalues.yamlwith a helpful example, but no template reads or applies it. It is silently a no-op. -
overrideSchedulerfield is missing from the Helm template (clusterautomationstrategy.yamlline 150): The CRD definesspec.enablement.gpu.overrideScheduler(enum:none/kai), but the template does not render it. Users cannot activate KAI GPU mutation mode through Helm-managed strategies. -
spec.kaiblock is missing from the Helm template (clusterautomationstrategy.yamlline 183): The CRD definesspec.kai.queueandspec.kai.setQueueWhenSpecified, but neither is rendered. KAI queue assignment cannot be configured through Helm-managed strategies. -
Documentation inversion in
GPU-Sharing-with-KAI.md(line 120): The guide incorrectly states that settingspec.kai.setQueueWhenSpecified: falseenables Kubex to take over queue assignment. The correct value istrue;falsepreserves the existing label. -
Capabilities.APIVersions.Hascheck inkai-queues.yaml(line 2): The check validates API version group presence, not that theQueueCRD kind is actually registered. A partial Run:ai install could pass the guard and still fail at server-side apply. This is low risk but worth noting in the error message.
DRY improvement opportunities
-
GPU enablement rendering is duplicated across
clusterautomationstrategy.yaml: The identicalrequestssub-block pattern (downsize/upsize/setFromUnspecified/floor/ceiling/containers) is repeated for CPU, memory, and now GPU. Consider extracting a named template partial (e.g.,{{- define "kubex.renderResourceEnablement" }}) to reduce the 200+ line repetition and ensure future field additions (e.g.,overrideScheduler,kai) are applied consistently to all resource types. -
experimentalblock pattern is duplicated across CRD docs: The identical> Experimental: GPU/KAI-related fields...callout appears verbatim inAutomation-Strategies.md,Cluster-Automation-Strategies.md, andGlobal-Configuration.md. This is appropriate for docs that may drift independently, but worth noting.
Suggested next steps
- Add rendering for
overrideSchedulerand thekaiblock toclusterautomationstrategy.yaml(missing fields from the CRD schema). - Wire
imagePullSecretsintodeployment.yamlor remove the value stub until it is implemented. - Add an upgrade note about
globalConfiguration.namein the README or CHANGELOG. - Gate
prometheusandkaiinglobalconfiguration.yamlto render only when non-empty/non-default values are set. - Fix the
setQueueWhenSpecifieddocumentation inversion inGPU-Sharing-with-KAI.md. - Confirm
Chart.yamlversion bumps for bothkubex-crdsandkubex-automation-engineare present (none appear in this diff).
| {{- if and $policyConfig.enablement $policyConfig.enablement.gpu }} | ||
| experimental: | ||
| gpuKaiContract: {{ $.Values.experimental.gpuKaiContract | quote }} | ||
| {{- end }} |
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Major] [Confidence: High]
Issue: The experimental block is rendered when $policyConfig.enablement.gpu is truthy, but the gpu enablement block itself is rendered under the broader {{- if $policyConfig.enablement }} guard further down. These two guards are structurally independent, meaning if enablement is set without a gpu key, no experimental block is emitted — correct. However, the experimental block is placed outside the enablement block (rendered before it), so if enablement.gpu is set, the rendered YAML will have experimental: appear before enablement:. This is valid YAML, but it means removing gpu from enablement in a future values update will silently drop the experimental block without any warning, while the gpu sub-block inside enablement continues rendering until the full enablement guard fires.
Why it matters: If a user sets enablement.gpu to {} (empty map), Helm evaluates it as truthy, so experimental.gpuKaiContract gets emitted even though no actual GPU fields follow. Conversely, a template author editing the file later may not notice the experimental guard is separated from the gpu rendering block by ~130 lines, leading to drift between the two.
Suggested fix: Move the experimental block to render inside the {{- if $policyConfig.enablement }} conditional, right before the gpu block, or consolidate both guards into a single {{- if and $policyConfig.enablement $policyConfig.enablement.gpu }} wrapping the entire section. Example:
{{- if $policyConfig.enablement }}
enablement:
...
{{- if $policyConfig.enablement.gpu }}
experimental:
gpuKaiContract: {{ $.Values.experimental.gpuKaiContract | quote }}
gpu:
...
{{- end }}
{{- end }}Note: experimental is a sibling of enablement in the CRD spec, so it must remain at the spec level — but it should still be co-located structurally with the GPU block that triggers it.
| {{- end }} | ||
| {{- end }} | ||
| {{- end }} | ||
| {{- end }} |
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Major] [Confidence: High]
Issue: Silent precedence when both request (singular) and requests (plural) keys are present in values — request wins silently.
Why it matters: default $policyConfig.enablement.gpu.request $policyConfig.enablement.gpu.requests returns the second argument only when the first is falsy/empty. If a user supplies both gpu.request and gpu.requests (e.g. during a migration from old to new key), gpu.request silently wins and gpu.requests is completely ignored. There is no warning emitted and no documentation noting which key takes precedence. This exact pattern is also used for cpu.request/cpu.requests and memory.request/memory.requests in the same template.
Suggested fix: Add an explicit mutual-exclusivity check at the top of each block (or document the precedence clearly in values.yaml comments):
{{- if and $policyConfig.enablement.gpu.request $policyConfig.enablement.gpu.requests }}
{{- fail (printf "Policy %s: set either enablement.gpu.request or enablement.gpu.requests, not both" $policyName) }}
{{- end }}
{{- $gpuReq := coalesce $policyConfig.enablement.gpu.requests $policyConfig.enablement.gpu.request }}At minimum, add a comment to values.yaml making it clear that requests is the canonical key and request is the deprecated alias.
| kind: GlobalConfiguration | ||
| metadata: | ||
| name: global-config | ||
| name: {{ .Values.globalConfiguration.name }} |
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Major] [Confidence: High]
Issue: Changing metadata.name from the hardcoded string global-config to {{ .Values.globalConfiguration.name }} is a breaking change for in-place upgrades when users have customized this value or when any external resource (e.g., automationStrategyRef, documentation links, or manually-created CRs) references the name global-config.
Why it matters: On helm upgrade, Helm tracks resources by their rendered name. If globalConfiguration.name is ever set to anything other than global-config (the default), Helm will attempt to create a new GlobalConfiguration CR with the new name while leaving the old global-config CR orphaned in the cluster. The controller may now manage two GlobalConfiguration CRs simultaneously (since the resource is a singleton by convention, not by CRD enforcement), causing unpredictable behavior.
Suggested fix:
- Add an upgrade note/warning in
README.mdexplaining thatglobalConfiguration.namemust match the name of any previously deployedGlobalConfigurationCR. - Consider adding a Helm pre-upgrade hook or a
_helpers.tplnote asserting that renaming this value from its previous default requires manual deletion of the old CR. - If backward compatibility must be preserved, keep the default as
global-configbut document the field clearly.
| failureThreshold: {{ .Values.globalConfiguration.webhookHealth.failureThreshold }} | ||
| successThreshold: {{ .Values.globalConfiguration.webhookHealth.successThreshold }} | ||
| transitionCheckInterval: {{ .Values.globalConfiguration.webhookHealth.transitionCheckInterval | quote }} | ||
| kai: |
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Major] [Confidence: High]
Issue: prometheus.url and kai.schedulerName are rendered unconditionally into the GlobalConfiguration CR even when no GPU/KAI features are configured.
Why it matters: A user who upgrades without any GPU workloads will now have prometheus.url: "http://prometheus.monitoring.svc:9090" baked into their GlobalConfiguration CR. If Prometheus is not installed, the controller may attempt to connect and generate warning logs or failed health checks on every reconcile cycle, even when no GpuRebalancingPolicy exists. This is a silent operational regression for non-GPU users.
Suggested fix: Gate these blocks behind a {{- if .Values.globalConfiguration.prometheus.url }} (or a dedicated feature flag) so they are omitted from the CR when not needed. Alternatively, leave the fields optional at the CRD level and only render them when non-empty:
{{- with .Values.globalConfiguration.prometheus.url }}
prometheus:
url: {{ . | quote }}
requestTimeout: {{ $.Values.globalConfiguration.prometheus.requestTimeout | quote }}
{{- end }}| @@ -0,0 +1,47 @@ | |||
| {{- if .Values.kaiQueues.enabled -}} | |||
| {{- if not (.Capabilities.APIVersions.Has "scheduling.run.ai/v2/Queue") -}} | |||
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Minor] [Confidence: High]
Issue: The Capabilities.APIVersions.Has check uses "scheduling.run.ai/v2/Queue" but the apiVersion in the rendered Queue resources is scheduling.run.ai/v2. The Has function expects the format <group>/<version>/<kind> when kind is included, which is scheduling.run.ai/v2/Queue. This should work correctly with Helm's capabilities check — but note that Capabilities.APIVersions.Has checks for installed API versions, not installed CRDs by kind. If the Run:ai operator installs scheduling.run.ai/v2 but does not register the Queue kind under it (e.g., a partial install), this check will pass and the apply will still fail at the server side.
Why it matters: The guard gives a false sense of pre-flight safety. In practice, a partial or misconfigured Run:ai installation could pass the check but still fail resource creation.
Suggested fix: Consider adding a note in the error message that the check validates API group presence, not full Queue CRD availability. Alternatively, use a lookup to verify a Queue resource exists, though that requires cluster connectivity:
{{- if not (lookup "scheduling.run.ai/v2" "Queue" "" "") }}
{{- fail "..." }}
{{- end }}This is a minor concern since the fail message already guides users to install the CRDs first.
| # -- Image pull secrets for private registries | ||
| # Example: | ||
| # imagePullSecrets: | ||
| # - name: regcred |
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Minor] [Confidence: Medium]
Issue: imagePullSecrets is added to values.yaml but there is no corresponding rendering in templates/deployment.yaml (or any other template).
Why it matters: A user who sets imagePullSecrets expecting it to be applied to the deployment pod spec will get no effect — the value is silently ignored. This is misleading documentation.
Suggested fix: Either wire imagePullSecrets into the deployment template:
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}Or remove the field from values.yaml until the template support is in place, and add it in a follow-up PR with a proper implementation.
| headroomPercent: 20 | ||
| prometheus: | ||
| metric: kubex_gpu_container_memory_utilization_percent | ||
| namespaceLabel: namespace |
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Minor] [Confidence: High]
Issue: The example in the GPU-Sharing-with-KAI.md "Existing KAI Installations" section contains a logic inversion in the description.
The prose says:
"If you want queue assignment to be done via Kubex, set
spec.kai.setQueueWhenSpecified: falsein your AutomationStrategy."
But setQueueWhenSpecified: false means Kubex will not overwrite an existing queue label — i.e. queue assignment stays with whatever label the workload already has. To have Kubex take over queue assignment, the value should be true.
Why it matters: Users following this guide to enable Kubex-driven queue assignment will set the wrong value, resulting in no queue override being applied, with no error to indicate the misconfiguration.
Suggested fix: Correct the sentence to either:
- "If you want queue assignment to be done via Kubex, set
spec.kai.setQueueWhenSpecified: true" - Or clarify that
falsepreserves the existing queue label (opt-out mode).
| {{- if $policyConfig.enablement.gpu }} | ||
| {{- $gpuReq := default $policyConfig.enablement.gpu.request $policyConfig.enablement.gpu.requests }} | ||
| gpu: | ||
| {{- if $gpuReq }} |
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Minor] [Confidence: High]
Issue: The overrideScheduler field is not rendered in the clusterautomationstrategy.yaml template, even though it is a defined field in the CRD schema for spec.enablement.gpu.overrideScheduler.
Why it matters: Users who set policy.policies.<name>.enablement.gpu.overrideScheduler: kai in their values will have the field silently dropped from the rendered CR. The KAI GPU mutation mode will never be activated via Helm-managed strategies, requiring users to manage raw CRs manually for this field — an inconsistency with all other enablement sub-fields.
Suggested fix: Add rendering for overrideScheduler in the gpu: block of the template, analogous to how downsize, upsize, etc. are rendered:
{{- if $gpuReq.overrideScheduler }}
overrideScheduler: {{ $gpuReq.overrideScheduler }}
{{- end }}Note: overrideScheduler sits at the gpu: level, not inside gpu.requests:, so it should be added before the requests: sub-block.
| {{- end }} | ||
| {{- end }} | ||
| {{- if $policyConfig.inPlaceResize }} | ||
| inPlaceResize: |
There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
[Severity: Minor] [Confidence: High]
Issue: The kai block (spec.kai.queue and spec.kai.setQueueWhenSpecified) defined in the CRD for AutomationStrategy / ClusterAutomationStrategy is also not rendered in the Helm template. This is the same class of omission as overrideScheduler.
Why it matters: Users relying on Helm-managed strategies to control which KAI queue is assigned to GPU workloads will have this configuration silently dropped. This means the only way to configure spec.kai is via manually managed raw CRs, which is inconsistent with the overall Helm-driven workflow advertised in the documentation and README.
Suggested fix: Add a {{- if $policyConfig.kai }} block in the template after the enablement section, rendering queue and setQueueWhenSpecified:
{{- if $policyConfig.kai }}
kai:
{{- if $policyConfig.kai.queue }}
queue: {{ $policyConfig.kai.queue | quote }}
{{- end }}
{{- if hasKey $policyConfig.kai "setQueueWhenSpecified" }}
setQueueWhenSpecified: {{ $policyConfig.kai.setQueueWhenSpecified }}
{{- end }}
{{- end }}There was a problem hiding this comment.
CONTENT OF THIS REVIEW IS AI GENERATED
Overall assessment: This PR's final merged diff is a clean, documentation-only change — 4 lines added to Safety-Controls.md documenting the new floor-clamped and ceiling-clamped filter runtime names. The additions are accurate, internally consistent with the existing table and example format, and add clear operational value for users debugging bounds-enforcement behavior.
Risk level: Low
Critical issues
None.
Major issues
None.
Minor issues
None.
DRY improvement opportunities
None in this diff.
Notes on prior review comments
The 9 existing inline review comments on this PR (covering clusterautomationstrategy.yaml, globalconfiguration.yaml, kai-queues.yaml, values.yaml, and GPU-Sharing-with-KAI.md) are all marked outdated — they reference an earlier commit (f62c18418d) that is not reflected in the final merged state. The items they raise (missing overrideScheduler/kai template rendering, unconditional Prometheus rendering, imagePullSecrets stub, documentation inversion in GPU-Sharing-with-KAI.md, etc.) are worth tracking as follow-up issues if those files were reverted or not carried forward in the merge.
Suggested next steps
- Confirm that the issues raised in the outdated review comments (from the earlier commit) were intentionally deferred or are tracked in follow-up issues.
- No changes required for the merged documentation content — it is correct as-is.
This PR was created automatically to sync managed Kubex chart content.
Source commit:
d8bde98Managed chart scope:
charts/kubex-automation-enginecharts/kubex-crds