chore(recipes): check and update runtime component versions across all recipes #698

@yuanchen8911

Summary

Audit the pinned versions of every runtime component referenced by AICR recipes and update them to current stable upstream releases where appropriate.

Motivation

Component versions in recipes/registry.yaml, recipes/overlays/*.yaml, and recipes/components/*/values.yaml drift over time as upstream projects ship security fixes, bug fixes, and Kubernetes compatibility updates. Without a periodic refresh, AICR-generated bundles deploy stale versions that may miss CVE fixes or fail against newer Kubernetes versions.

Scope

Check and (where appropriate) update versions for every component declared in recipes/registry.yaml, including (non-exhaustive list — registry is authoritative):

  • GPU stack: gpu-operator, nvidia-device-plugin, dcgm-exporter
  • Schedulers / CRDs: kai-scheduler, kueue, volcano
  • Inference: dynamo-platform, kgateway
  • Training: kubeflow-trainer
  • Observability: kube-prometheus-stack, prometheus-adapter
  • Platform: cert-manager, node-feature-discovery (nfd)
  • Any other components in recipes/registry.yaml not listed here

Also check overlay-level pins in recipes/overlays/*.yaml and any version constraints embedded in mixins (recipes/mixins/*.yaml).

Definition of done

  • For each component, the current pinned version is compared against the latest stable upstream release.
  • Where a newer version is selected, the version pin is updated and the change is justified (CVE fix, K8s compatibility, etc.) in the PR description.
  • Where a version is intentionally held back, the reason is documented (in recipes/registry.yaml comments or a short note in docs/contributor/data.md).
  • make qualify passes locally on the updated recipes.
  • KWOK simulated cluster tests pass (make kwok-test-all).
  • At least one full GPU CI run on H100 nvkind passes against the updated bundle.
  • Documentation in docs/user/component-catalog.md is regenerated or updated if any displayName, repository, or default chart changes.
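The first item above (comparing each pin against the latest stable upstream release) could be scripted roughly as follows. This is only a sketch: it assumes pins and upstream tags follow semantic versioning, it treats any tag with a pre-release suffix as not stable, and the helper names (`parse_version`, `latest_stable`, `is_newer`, `audit`) and the component names and versions in the example are hypothetical placeholders, not the actual AICR pins or tooling.

```python
import re

# Assumption: versions look like "v1.2.3" or "1.2.3-rc.1" (semantic versioning).
_SEMVER = re.compile(
    r"^v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


def parse_version(text):
    """Return ((major, minor, patch), prerelease_or_None) for a version string."""
    m = _SEMVER.match(text.strip())
    if m is None:
        raise ValueError(f"not a semantic version: {text!r}")
    core = (int(m.group(1)), int(m.group(2)), int(m.group(3)))
    return core, m.group(4)


def latest_stable(candidates):
    """Pick the highest version that has no pre-release suffix, or None."""
    stable = [c for c in candidates if parse_version(c)[1] is None]
    if not stable:
        return None
    return max(stable, key=lambda c: parse_version(c)[0])


def is_newer(stable_candidate, pin):
    """True if a stable candidate supersedes the pin (including an rc of the same core)."""
    cand_core, _ = parse_version(stable_candidate)
    pin_core, pin_pre = parse_version(pin)
    return cand_core > pin_core or (cand_core == pin_core and pin_pre is not None)


def audit(pinned, upstream):
    """Return (component, pinned, latest) rows where a newer stable release exists."""
    rows = []
    for name, pin in sorted(pinned.items()):
        latest = latest_stable(upstream.get(name, []))
        if latest is not None and is_newer(latest, pin):
            rows.append((name, pin, latest))
    return rows


# Hypothetical example data; real pins would be read from recipes/registry.yaml
# and real tags from each upstream project.
pinned = {"component-a": "v1.2.3", "component-b": "0.9.0"}
upstream = {
    "component-a": ["v1.2.3", "v1.3.0", "v2.0.0-rc.1"],
    "component-b": ["0.9.0"],
}

for name, pin, latest in audit(pinned, upstream):
    print(f"{name}: pinned {pin}, latest stable {latest}")
```

A newer stable release is only a candidate. Whether to take it or hold the pin back is still decided per component and justified in the PR description.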

Out of scope

  • Adding new components (handled in separate issues).
  • Upgrading recipes to a different Kubernetes minor (separate compatibility effort).
  • Removing components.

Notes

  • Compatibility constraint: AICR targets Kubernetes 1.33+. Verify each upgraded component supports this floor.
  • Constraints in recipes (spec.constraints[*].name = "K8s.server.version") should be re-evaluated when upgrading.
  • For Helm-installed components, confirm the upstream chart version matches the app version expected by nodeScheduling and any value overrides we set.
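For the Kubernetes floor check in the first note, a small helper along these lines might be enough. It is a sketch under assumptions: the supported range for each component is taken from its upstream compatibility notes and entered by hand, only major.minor is compared, and `supports_floor` and `parse_minor` are hypothetical names rather than part of AICR.

```python
import re

K8S_FLOOR = (1, 33)  # from the note above: AICR targets Kubernetes 1.33+


def parse_minor(text):
    """Return (major, minor) from strings such as "1.33", "v1.33.2"."""
    m = re.match(r"^v?(\d+)\.(\d+)", text.strip())
    if m is None:
        raise ValueError(f"not a Kubernetes version: {text!r}")
    return int(m.group(1)), int(m.group(2))


def supports_floor(min_supported, max_supported=None, floor=K8S_FLOOR):
    """True if the floor lies inside the component's supported Kubernetes range.

    min_supported / max_supported are the oldest and (optionally) newest
    Kubernetes versions the upstream release states it supports.
    """
    if parse_minor(min_supported) > floor:
        return False  # component requires a newer cluster than the floor
    if max_supported is not None and parse_minor(max_supported) < floor:
        return False  # component has not been validated up to the floor
    return True


# Illustrative ranges only, not taken from any real component.
print(supports_floor("1.29"))          # open-ended range starting below the floor
print(supports_floor("1.28", "1.32"))  # range that ends before the floor
```

A component that fails this check is a candidate for being held back, with the reason documented as described under Definition of done.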
