chore(recipes): migrate kgateway -> agentgateway for v2.2 inference routing#871
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR replaces the inference gateway implementation and all references from kgateway to agentgateway. Changes touch Helm charts and generated manifests (new AgentgatewayParameters CR), the recipes registry and mixins, deployment/undeploy preflight logic and tests, Chainsaw/UAT/CI test fixtures, validators and evidence collection (namespace/resource targets), diagnostics script and workflow triggers, container-image inventory, and documentation. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Three small observations inline, none blocking:
- pod vs Deployment scope of the
managed-by: aicrlabel (no consumers select on it, so cosmetic) - substantial deploymentOrder reshuffles in the UAT fixtures (presumably regenerator-driven, worth a sentence)
- doc/CRD drift on
InferenceModelRewritein the evidence script (no runtime impact)
Rollout notes for existing v2.0 clusters are clearly documented (manual helm uninstall kgateway* + namespace cleanup). LGTM once you've confirmed the inline questions.
…outing kgateway v2.2 removed InferencePool routing from its Envoy data plane (PR kgateway-dev/kgateway#12689, deprecated in v2.1). The Gateway API Inference Extension support that AICR uses for CNCF AI Conformance Advanced Ingress moved entirely to the separate agentgateway project, which ships its own Helm charts, GatewayClass, controller, and AgentgatewayParameters CRD. AICR's only kgateway consumer was the inference Gateway resource — zero HTTPRoutes, no TrafficPolicy/BackendConfigPolicy/etc. — so this PR replaces kgateway entirely rather than running both side-by-side. Changes: - registry + mixin: kgateway/kgateway-crds (v2.0.0, cr.kgateway.dev) -> agentgateway/agentgateway-crds (v2.2.1, cr.agentgateway.dev) - inference-gateway.yaml: GatewayParameters (gateway.kgateway.dev) -> AgentgatewayParameters (agentgateway.dev/v1alpha1, strategic-merge patch on Deployment spec); gatewayClassName: kgateway -> agentgateway; namespace kgateway-system -> agentgateway-system - health check + component dirs renamed under git mv to preserve history - conformance validator (validators/conformance/inference_gateway_check.go) updated to gate on `agentgateway` component and query the new GatewayClass / namespace / Deployment names - evidence collector script + requirement title updated - GPU H100 inference workflow path filter follows the renamed component directories so PRs touching only agentgateway* still trigger CI - Go test fixtures, undeploy.sh.tmpl + golden files, chainsaw assertions, UAT recipe snapshots regenerated against the new component graph - docs/user (component-catalog, container-images, api-reference, cli-reference) refreshed; container-images.md regenerated via `make bom-docs` - demos/cuj2-* + demos/query.md + demos/images/meta.md updated so kubectl commands and architecture diagrams match the new namespace and component names Conformance-equivalence: AICR's `ai_inference` requirement is data-plane agnostic — it asserts a GatewayClass is Accepted, a Gateway is Programmed, and the InferencePool CRDs exist. Swapping to agentgateway preserves all five evidence checks; only the names in the captured output change. Historical evidence snapshots under docs/conformance/cncf/v1.35/nim-eks/, demos/examples/CUJ2-Test-Report.md, design doc 005, and CHANGELOG entries are intentionally left referencing kgateway as frozen records. Closes follow-up #1 from issue NVIDIA#698. Validation: - make qualify (Go unit tests, golangci-lint, yamllint, 20/20 chainsaw including cli-bundle-agentgateway-templates, vulnerability scan, license headers): all green - Bundle generation verified end-to-end (recipe -> bundle -> deploy.sh layout under /tmp); rendered post chart confirmed to emit AgentgatewayParameters + Gateway with class agentgateway in namespace agentgateway-system - validators/conformance package tests pass (22.6s)
9cf5226 to
97adcd8
Compare
…-crds comment The slinky-slurm-operator-crds values file's comment references `components/kgateway-crds/values.yaml` as an example of the "AICR umbrella convention for CRD-only components." That path no longer exists — the kgateway component was renamed to agentgateway in NVIDIA#871, and the file is now at `components/agentgateway-crds/ values.yaml`. One-line comment fix; no behavior change. Surfaced during a post-NVIDIA#871 audit for residual kgateway references in active code paths.
…-crds comment The slinky-slurm-operator-crds values file's comment references `components/kgateway-crds/values.yaml` as an example of the "AICR umbrella convention for CRD-only components." That path no longer exists — the kgateway component was renamed to agentgateway in NVIDIA#871, and the file is now at `components/agentgateway-crds/ values.yaml`. One-line comment fix; no behavior change. Surfaced during a post-NVIDIA#871 audit for residual kgateway references in active code paths.
…-crds comment The slinky-slurm-operator-crds values file's comment references `components/kgateway-crds/values.yaml` as an example of the "AICR umbrella convention for CRD-only components." That path no longer exists — the kgateway component was renamed to agentgateway in NVIDIA#871, and the file is now at `components/agentgateway-crds/ values.yaml`. One-line comment fix; no behavior change. Surfaced during a post-NVIDIA#871 audit for residual kgateway references in active code paths.
Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and examples/ against current main. Fixes break into four buckets: ERROR (breaking or contradicts code): - examples/recipes/eks-training.yaml: deploymentOrder listed kube-prometheus-stack with no matching componentRefs entry — bundle generation rejected the recipe. Added the missing entry. - examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same defect; performance phase referenced nonexistent checks (nccl-bandwidth-test, fabric-health-check) and an unknown `infrastructure: nccl-doctor` line — replaced with the real nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults. - docs/contributor/validator.md: validators.Context struct field `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput *v1.ValidationInput`; gated skip example updated to match. Added an RBAC subsection documenting per-run aicr-validator-<runID> naming introduced in PR NVIDIA#888. - docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to match RecipeCacheTTL (10 min); root endpoint `routes` list expanded from [/v1/recipe] to all three registered routes; Query parameter table gained the missing `platform` row; broken anchor index.md#cicd-architecture replaced with a working link. - docs/contributor/cli.md: Factory interface snippet went from 4 to 5 methods (CreateNodeTopologyCollector was missing); snapshot measurements list gained `topology`; bogus [INTERNAL]-wrapped invalid-accelerator example corrected to [INVALID_REQUEST]; duplicate log line and empty section removed. - docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the validate exit-code table; `--best-effort` (a real flag) is no longer used as the "rejected typo" example. - docs/user/api-reference.md: Bundle Components table gained the three missing components (nfd, slinky-slurm-operator, slinky-slurm-operator-crds) and was re-sorted alphabetically. - docs/conformance/cncf/index.md: corrected script path pkg/evidence/scripts/collect-evidence.sh → pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the directory tree; section-name list updated to match the script's case block. INCONSISTENCY / drift: - platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in docs/README.md, docs/contributor/validations.md. - OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux, talos) in docs/integrator/data-flow.md. - Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in docs/integrator/{recipe-development,automation,data-flow}.md. - demos/cuj1-eks.md: --system-node-selector dedicated=system-workload (taint key as a label selector) → nodeGroup=system-worker; matching prose adjustment. - demos/cuj2.md: toleration grammar field [operation] → [effect]. - demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box borders re-aligned. - demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per project artifact-location convention; stale "18 + 1 = 19" component count replaced with placeholders. - demos/data.md: "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm". - demos/README.md: index table gained the missing CUJ pages. - docs/integrator/index.md and docs/README.md: integrator-pages tables gained missing rows. STYLE / slug hygiene (CLAUDE.md doc-style rules): - docs/user/cli-reference.md: stripped trailing (--flag) parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides) — the Deploy/Undeploy variants are left alone in this PR because their slugs have many inbound references in bundle templates and golden-files which would need a coordinated follow-up. - docs/contributor/api-server.md: three headings rewritten with "and" in place of "&" to fix double-hyphen slugs. - docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)" from a heading. TYPO / NIT: - docs/README.md "a automated" → "an automated". - docs/integrator/recipe-development.md "end to end" → "end-to-end". - demos/cuj2.md "comma delimination" → "comma-separated". - demos/valid.md sentence fragment completed. - docs/user/component-catalog.md `***Note:***` → `**Note:**`. - docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr → github.com/NVIDIA/aicr. - docs/user/agent-deployment.md and docs/user/cli-reference.md literal `app.kubernetes.io/version: v0.17.0` → <aicr-version> placeholder. - docs/contributor/cli.md CronJob example pin v0.6.4 → <release-tag> placeholder. - examples/recipes/README.md driver version 580.82.07 → 580.105.08 to match current gpu-operator pin. - examples/recipes/{eks-training,eks-gb200-...,kind}.yaml metadata.version v0.26.7-next → dev (hand-written-example convention shared with aks-training.yaml). - demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as a pre-NVIDIA#871 historical capture (kgateway → agentgateway migration). Also adds the missing CNCF feature list to docs/user/cli-reference.md --feature row and a "Scoping conformance to specific features" section in docs/user/validation.md, since the 9 ValidFeatures names from pkg/evidence/cncf/collector.go were not documented for users. Skipped (out of scope or risk-deferred): - ADR-002 cross-references to per-run RBAC names beyond the existing 2026-05-14 implementation note — ADRs are frozen historical records. - Deploy Script Behavior / Undeploy Script Behavior heading parentheticals — ~23 inbound anchor links in bundle templates and golden-files would need a coordinated PR. - demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both files already explicitly note "No podTemplateOverrides / runtimePatches needed" since the torch-distributed runtime bakes scheduling at bundle time. - pre-existing site/docs/* renders (gitignored, auto-generated). Doc-only PR; full `make qualify` skipped per CLAUDE.local.md ("doc-only / infra-only change ... cheap checks are enough"). Ran `make lint` (yamllint, gofmt, license-check, sidebar check, agents sync, chart-pin verification) — all green.
Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and examples/ against current main. Fixes break into four buckets: ERROR (breaking or contradicts code): - examples/recipes/eks-training.yaml: deploymentOrder listed kube-prometheus-stack with no matching componentRefs entry — bundle generation rejected the recipe. Added the missing entry. - examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same defect; performance phase referenced nonexistent checks (nccl-bandwidth-test, fabric-health-check) and an unknown `infrastructure: nccl-doctor` line — replaced with the real nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults. - docs/contributor/validator.md: validators.Context struct field `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput *v1.ValidationInput`; gated skip example updated to match. Added an RBAC subsection documenting per-run aicr-validator-<runID> naming introduced in PR NVIDIA#888. - docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to match RecipeCacheTTL (10 min); root endpoint `routes` list expanded from [/v1/recipe] to all three registered routes; Query parameter table gained the missing `platform` row; broken anchor index.md#cicd-architecture replaced with a working link. - docs/contributor/cli.md: Factory interface snippet went from 4 to 5 methods (CreateNodeTopologyCollector was missing); snapshot measurements list gained `topology`; bogus [INTERNAL]-wrapped invalid-accelerator example corrected to [INVALID_REQUEST]; duplicate log line and empty section removed. - docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the validate exit-code table; `--best-effort` (a real flag) is no longer used as the "rejected typo" example. - docs/user/api-reference.md: Bundle Components table gained the three missing components (nfd, slinky-slurm-operator, slinky-slurm-operator-crds) and was re-sorted alphabetically. - docs/conformance/cncf/index.md: corrected script path pkg/evidence/scripts/collect-evidence.sh → pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the directory tree; section-name list updated to match the script's case block. INCONSISTENCY / drift: - platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in docs/README.md, docs/contributor/validations.md. - OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux, talos) in docs/integrator/data-flow.md. - Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in docs/integrator/{recipe-development,automation,data-flow}.md. - demos/cuj1-eks.md: --system-node-selector dedicated=system-workload (taint key as a label selector) → nodeGroup=system-worker; matching prose adjustment. - demos/cuj2.md: toleration grammar field [operation] → [effect]. - demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box borders re-aligned. - demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per project artifact-location convention; stale "18 + 1 = 19" component count replaced with placeholders. - demos/data.md: "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm". - demos/README.md: index table gained the missing CUJ pages. - docs/integrator/index.md and docs/README.md: integrator-pages tables gained missing rows. STYLE / slug hygiene (CLAUDE.md doc-style rules): - docs/user/cli-reference.md: stripped trailing (--flag) parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides) — the Deploy/Undeploy variants are left alone in this PR because their slugs have many inbound references in bundle templates and golden-files which would need a coordinated follow-up. - docs/contributor/api-server.md: three headings rewritten with "and" in place of "&" to fix double-hyphen slugs. - docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)" from a heading. TYPO / NIT: - docs/README.md "a automated" → "an automated". - docs/integrator/recipe-development.md "end to end" → "end-to-end". - demos/cuj2.md "comma delimination" → "comma-separated". - demos/valid.md sentence fragment completed. - docs/user/component-catalog.md `***Note:***` → `**Note:**`. - docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr → github.com/NVIDIA/aicr. - docs/user/agent-deployment.md and docs/user/cli-reference.md literal `app.kubernetes.io/version: v0.17.0` → <aicr-version> placeholder. - docs/contributor/cli.md CronJob example pin v0.6.4 → <release-tag> placeholder. - examples/recipes/README.md driver version 580.82.07 → 580.105.08 to match current gpu-operator pin. - examples/recipes/{eks-training,eks-gb200-...,kind}.yaml metadata.version v0.26.7-next → dev (hand-written-example convention shared with aks-training.yaml). - demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as a pre-NVIDIA#871 historical capture (kgateway → agentgateway migration). Also adds the missing CNCF feature list to docs/user/cli-reference.md --feature row and a "Scoping conformance to specific features" section in docs/user/validation.md, since the 9 ValidFeatures names from pkg/evidence/cncf/collector.go were not documented for users. Skipped (out of scope or risk-deferred): - ADR-002 cross-references to per-run RBAC names beyond the existing 2026-05-14 implementation note — ADRs are frozen historical records. - Deploy Script Behavior / Undeploy Script Behavior heading parentheticals — ~23 inbound anchor links in bundle templates and golden-files would need a coordinated PR. - demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both files already explicitly note "No podTemplateOverrides / runtimePatches needed" since the torch-distributed runtime bakes scheduling at bundle time. - pre-existing site/docs/* renders (gitignored, auto-generated). Doc-only PR; full `make qualify` skipped per CLAUDE.local.md ("doc-only / infra-only change ... cheap checks are enough"). Ran `make lint` (yamllint, gofmt, license-check, sidebar check, agents sync, chart-pin verification) — all green.
Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and examples/ against current main. Fixes break into four buckets: ERROR (breaking or contradicts code): - examples/recipes/eks-training.yaml: deploymentOrder listed kube-prometheus-stack with no matching componentRefs entry — bundle generation rejected the recipe. Added the missing entry. - examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same defect; performance phase referenced nonexistent checks (nccl-bandwidth-test, fabric-health-check) and an unknown `infrastructure: nccl-doctor` line — replaced with the real nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults. - docs/contributor/validator.md: validators.Context struct field `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput *v1.ValidationInput`; gated skip example updated to match. Added an RBAC subsection documenting per-run aicr-validator-<runID> naming introduced in PR NVIDIA#888. - docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to match RecipeCacheTTL (10 min); root endpoint `routes` list expanded from [/v1/recipe] to all three registered routes; Query parameter table gained the missing `platform` row; broken anchor index.md#cicd-architecture replaced with a working link. - docs/contributor/cli.md: Factory interface snippet went from 4 to 5 methods (CreateNodeTopologyCollector was missing); snapshot measurements list gained `topology`; bogus [INTERNAL]-wrapped invalid-accelerator example corrected to [INVALID_REQUEST]; duplicate log line and empty section removed. - docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the validate exit-code table; `--best-effort` (a real flag) is no longer used as the "rejected typo" example. - docs/user/api-reference.md: Bundle Components table gained the three missing components (nfd, slinky-slurm-operator, slinky-slurm-operator-crds) and was re-sorted alphabetically. - docs/conformance/cncf/index.md: corrected script path pkg/evidence/scripts/collect-evidence.sh → pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the directory tree; section-name list updated to match the script's case block. INCONSISTENCY / drift: - platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in docs/README.md, docs/contributor/validations.md. - OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux, talos) in docs/integrator/data-flow.md. - Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in docs/integrator/{recipe-development,automation,data-flow}.md. - demos/cuj1-eks.md: --system-node-selector dedicated=system-workload (taint key as a label selector) → nodeGroup=system-worker; matching prose adjustment. - demos/cuj2.md: toleration grammar field [operation] → [effect]. - demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box borders re-aligned. - demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per project artifact-location convention; stale "18 + 1 = 19" component count replaced with placeholders. - demos/data.md: "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm". - demos/README.md: index table gained the missing CUJ pages. - docs/integrator/index.md and docs/README.md: integrator-pages tables gained missing rows. STYLE / slug hygiene (CLAUDE.md doc-style rules): - docs/user/cli-reference.md: stripped trailing (--flag) parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides) — the Deploy/Undeploy variants are left alone in this PR because their slugs have many inbound references in bundle templates and golden-files which would need a coordinated follow-up. - docs/contributor/api-server.md: three headings rewritten with "and" in place of "&" to fix double-hyphen slugs. - docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)" from a heading. TYPO / NIT: - docs/README.md "a automated" → "an automated". - docs/integrator/recipe-development.md "end to end" → "end-to-end". - demos/cuj2.md "comma delimination" → "comma-separated". - demos/valid.md sentence fragment completed. - docs/user/component-catalog.md `***Note:***` → `**Note:**`. - docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr → github.com/NVIDIA/aicr. - docs/user/agent-deployment.md and docs/user/cli-reference.md literal `app.kubernetes.io/version: v0.17.0` → <aicr-version> placeholder. - docs/contributor/cli.md CronJob example pin v0.6.4 → <release-tag> placeholder. - examples/recipes/README.md driver version 580.82.07 → 580.105.08 to match current gpu-operator pin. - examples/recipes/{eks-training,eks-gb200-...,kind}.yaml metadata.version v0.26.7-next → dev (hand-written-example convention shared with aks-training.yaml). - demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as a pre-NVIDIA#871 historical capture (kgateway → agentgateway migration). Also adds the missing CNCF feature list to docs/user/cli-reference.md --feature row and a "Scoping conformance to specific features" section in docs/user/validation.md, since the 9 ValidFeatures names from pkg/evidence/cncf/collector.go were not documented for users. Skipped (out of scope or risk-deferred): - ADR-002 cross-references to per-run RBAC names beyond the existing 2026-05-14 implementation note — ADRs are frozen historical records. - Deploy Script Behavior / Undeploy Script Behavior heading parentheticals — ~23 inbound anchor links in bundle templates and golden-files would need a coordinated PR. - demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both files already explicitly note "No podTemplateOverrides / runtimePatches needed" since the torch-distributed runtime bakes scheduling at bundle time. - pre-existing site/docs/* renders (gitignored, auto-generated). Doc-only PR; full `make qualify` skipped per CLAUDE.local.md ("doc-only / infra-only change ... cheap checks are enough"). Ran `make lint` (yamllint, gofmt, license-check, sidebar check, agents sync, chart-pin verification) — all green.
Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and examples/ against current main. Fixes break into four buckets: ERROR (breaking or contradicts code): - examples/recipes/eks-training.yaml: deploymentOrder listed kube-prometheus-stack with no matching componentRefs entry — bundle generation rejected the recipe. Added the missing entry. - examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same defect; performance phase referenced nonexistent checks (nccl-bandwidth-test, fabric-health-check) and an unknown `infrastructure: nccl-doctor` line — replaced with the real nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults. - docs/contributor/validator.md: validators.Context struct field `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput *v1.ValidationInput`; gated skip example updated to match. Added an RBAC subsection documenting per-run aicr-validator-<runID> naming introduced in PR NVIDIA#888. - docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to match RecipeCacheTTL (10 min); root endpoint `routes` list expanded from [/v1/recipe] to all three registered routes; Query parameter table gained the missing `platform` row; broken anchor index.md#cicd-architecture replaced with a working link. - docs/contributor/cli.md: Factory interface snippet went from 4 to 5 methods (CreateNodeTopologyCollector was missing); snapshot measurements list gained `topology`; bogus [INTERNAL]-wrapped invalid-accelerator example corrected to [INVALID_REQUEST]; duplicate log line and empty section removed. - docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the validate exit-code table; `--best-effort` (a real flag) is no longer used as the "rejected typo" example. - docs/user/api-reference.md: Bundle Components table gained the three missing components (nfd, slinky-slurm-operator, slinky-slurm-operator-crds) and was re-sorted alphabetically. - docs/conformance/cncf/index.md: corrected script path pkg/evidence/scripts/collect-evidence.sh → pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the directory tree; section-name list updated to match the script's case block. INCONSISTENCY / drift: - platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in docs/README.md, docs/contributor/validations.md. - OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux, talos) in docs/integrator/data-flow.md. - Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in docs/integrator/{recipe-development,automation,data-flow}.md. - demos/cuj1-eks.md: --system-node-selector dedicated=system-workload (taint key as a label selector) → nodeGroup=system-worker; matching prose adjustment. - demos/cuj2.md: toleration grammar field [operation] → [effect]. - demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box borders re-aligned. - demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per project artifact-location convention; stale "18 + 1 = 19" component count replaced with placeholders. - demos/data.md: "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm". - demos/README.md: index table gained the missing CUJ pages. - docs/integrator/index.md and docs/README.md: integrator-pages tables gained missing rows. STYLE / slug hygiene (CLAUDE.md doc-style rules): - docs/user/cli-reference.md: stripped trailing (--flag) parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides) — the Deploy/Undeploy variants are left alone in this PR because their slugs have many inbound references in bundle templates and golden-files which would need a coordinated follow-up. - docs/contributor/api-server.md: three headings rewritten with "and" in place of "&" to fix double-hyphen slugs. - docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)" from a heading. TYPO / NIT: - docs/README.md "a automated" → "an automated". - docs/integrator/recipe-development.md "end to end" → "end-to-end". - demos/cuj2.md "comma delimination" → "comma-separated". - demos/valid.md sentence fragment completed. - docs/user/component-catalog.md `***Note:***` → `**Note:**`. - docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr → github.com/NVIDIA/aicr. - docs/user/agent-deployment.md and docs/user/cli-reference.md literal `app.kubernetes.io/version: v0.17.0` → <aicr-version> placeholder. - docs/contributor/cli.md CronJob example pin v0.6.4 → <release-tag> placeholder. - examples/recipes/README.md driver version 580.82.07 → 580.105.08 to match current gpu-operator pin. - examples/recipes/{eks-training,eks-gb200-...,kind}.yaml metadata.version v0.26.7-next → dev (hand-written-example convention shared with aks-training.yaml). - demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as a pre-NVIDIA#871 historical capture (kgateway → agentgateway migration). Also adds the missing CNCF feature list to docs/user/cli-reference.md --feature row and a "Scoping conformance to specific features" section in docs/user/validation.md, since the 9 ValidFeatures names from pkg/evidence/cncf/collector.go were not documented for users. Skipped (out of scope or risk-deferred): - ADR-002 cross-references to per-run RBAC names beyond the existing 2026-05-14 implementation note — ADRs are frozen historical records. - Deploy Script Behavior / Undeploy Script Behavior heading parentheticals — ~23 inbound anchor links in bundle templates and golden-files would need a coordinated PR. - demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both files already explicitly note "No podTemplateOverrides / runtimePatches needed" since the torch-distributed runtime bakes scheduling at bundle time. - pre-existing site/docs/* renders (gitignored, auto-generated). Doc-only PR; full `make qualify` skipped per CLAUDE.local.md ("doc-only / infra-only change ... cheap checks are enough"). Ran `make lint` (yamllint, gofmt, license-check, sidebar check, agents sync, chart-pin verification) — all green.
Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and examples/ against current main. Fixes break into four buckets: ERROR (breaking or contradicts code): - examples/recipes/eks-training.yaml: deploymentOrder listed kube-prometheus-stack with no matching componentRefs entry — bundle generation rejected the recipe. Added the missing entry. - examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same defect; performance phase referenced nonexistent checks (nccl-bandwidth-test, fabric-health-check) and an unknown `infrastructure: nccl-doctor` line — replaced with the real nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults. - docs/contributor/validator.md: validators.Context struct field `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput *v1.ValidationInput`; gated skip example updated to match. Added an RBAC subsection documenting per-run aicr-validator-<runID> naming introduced in PR NVIDIA#888. - docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to match RecipeCacheTTL (10 min); root endpoint `routes` list expanded from [/v1/recipe] to all three registered routes; Query parameter table gained the missing `platform` row; broken anchor index.md#cicd-architecture replaced with a working link. - docs/contributor/cli.md: Factory interface snippet went from 4 to 5 methods (CreateNodeTopologyCollector was missing); snapshot measurements list gained `topology`; bogus [INTERNAL]-wrapped invalid-accelerator example corrected to [INVALID_REQUEST]; duplicate log line and empty section removed. - docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the validate exit-code table; `--best-effort` (a real flag) is no longer used as the "rejected typo" example. - docs/user/api-reference.md: Bundle Components table gained the three missing components (nfd, slinky-slurm-operator, slinky-slurm-operator-crds) and was re-sorted alphabetically. - docs/conformance/cncf/index.md: corrected script path pkg/evidence/scripts/collect-evidence.sh → pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the directory tree; section-name list updated to match the script's case block. INCONSISTENCY / drift: - platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in docs/README.md, docs/contributor/validations.md. - OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux, talos) in docs/integrator/data-flow.md. - Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in docs/integrator/{recipe-development,automation,data-flow}.md. - demos/cuj1-eks.md: --system-node-selector dedicated=system-workload (taint key as a label selector) → nodeGroup=system-worker; matching prose adjustment. - demos/cuj2.md: toleration grammar field [operation] → [effect]. - demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box borders re-aligned. - demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per project artifact-location convention; stale "18 + 1 = 19" component count replaced with placeholders. - demos/data.md: "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm". - demos/README.md: index table gained the missing CUJ pages. - docs/integrator/index.md and docs/README.md: integrator-pages tables gained missing rows. STYLE / slug hygiene (CLAUDE.md doc-style rules): - docs/user/cli-reference.md: stripped trailing (--flag) parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides) — the Deploy/Undeploy variants are left alone in this PR because their slugs have many inbound references in bundle templates and golden-files which would need a coordinated follow-up. - docs/contributor/api-server.md: three headings rewritten with "and" in place of "&" to fix double-hyphen slugs. - docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)" from a heading. TYPO / NIT: - docs/README.md "a automated" → "an automated". - docs/integrator/recipe-development.md "end to end" → "end-to-end". - demos/cuj2.md "comma delimination" → "comma-separated". - demos/valid.md sentence fragment completed. - docs/user/component-catalog.md `***Note:***` → `**Note:**`. - docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr → github.com/NVIDIA/aicr. - docs/user/agent-deployment.md and docs/user/cli-reference.md literal `app.kubernetes.io/version: v0.17.0` → <aicr-version> placeholder. - docs/contributor/cli.md CronJob example pin v0.6.4 → <release-tag> placeholder. - examples/recipes/README.md driver version 580.82.07 → 580.105.08 to match current gpu-operator pin. - examples/recipes/{eks-training,eks-gb200-...,kind}.yaml metadata.version v0.26.7-next → dev (hand-written-example convention shared with aks-training.yaml). - demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as a pre-NVIDIA#871 historical capture (kgateway → agentgateway migration). Also adds the missing CNCF feature list to docs/user/cli-reference.md --feature row and a "Scoping conformance to specific features" section in docs/user/validation.md, since the 9 ValidFeatures names from pkg/evidence/cncf/collector.go were not documented for users. Skipped (out of scope or risk-deferred): - ADR-002 cross-references to per-run RBAC names beyond the existing 2026-05-14 implementation note — ADRs are frozen historical records. - Deploy Script Behavior / Undeploy Script Behavior heading parentheticals — ~23 inbound anchor links in bundle templates and golden-files would need a coordinated PR. - demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both files already explicitly note "No podTemplateOverrides / runtimePatches needed" since the torch-distributed runtime bakes scheduling at bundle time. - pre-existing site/docs/* renders (gitignored, auto-generated). Doc-only PR; full `make qualify` skipped per CLAUDE.local.md ("doc-only / infra-only change ... cheap checks are enough"). Ran `make lint` (yamllint, gofmt, license-check, sidebar check, agents sync, chart-pin verification) — all green.
Summary
Replace kgateway v2.0.0 with agentgateway v2.2.1 for the inference Gateway. kgateway v2.2 removed InferencePool routing from its Envoy data plane (deprecated v2.1, removed v2.2); the Gateway API Inference Extension support AICR uses for CNCF AI Conformance Advanced Ingress moved to the separate agentgateway project (different chart registry, GatewayClass, controller, and a new
AgentgatewayParametersCRD that replacesGatewayParameters).Motivation / Context
AICR uses kgateway for exactly one resource — the
inference-gatewayGateway — and zero kgateway-Envoy-specific CRs (noTrafficPolicy,HTTPRoute,BackendConfigPolicy, etc.). Running both kgateway + agentgateway side-by-side would double CRDs, namespaces, controllers, and RBAC to serve a single Gateway. Replacing entirely is the simpler operational story going forward, and the project's investment in AI/ML routing (AgentgatewayPolicy, prompt guards, MCP auth) all lives in agentgateway now.Fixes: follow-up #1 from #698 (kgateway / kgateway-crds
v2.0.0→v2.2.3— the kgateway chart is at v2.2.3 but the relocated agentgateway chart's latest is v2.2.1, so this PR pins v2.2.1)Related: #698, #715, #720, #724, #725
Type of Change
gatewayClassName: kgateway→agentgateway; namespacekgateway-system→agentgateway-system;gateway.kgateway.dev/v1alpha1.GatewayParameters→agentgateway.dev/v1alpha1.AgentgatewayParameters(strategic-merge-patch schema on Deployment spec). Existing deployed clusters require a manual cleanup step — see Rollout notes.Component(s) Affected
recipes/registry.yaml,recipes/mixins/platform-inference.yaml,recipes/components/agentgateway*,recipes/checks/agentgateway/)pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl+ golden fixtures)validators/conformance/inference_gateway_check.go)pkg/evidence/cncf/{requirements.go,scripts/collect-evidence.sh})docs/user/*,demos/cuj2-*,recipes/validators/README.md,tests/chainsaw/ai-conformance/README.md).github/workflows/gpu-h100-inference-test.yaml,.github/scripts/gpu-debug-diagnostics.sh)Implementation Notes
Conformance equivalence. AICR's
ai_inferencerequirement (pkg/evidence/cncf/requirements.go) is data-plane-agnostic — it asserts aGatewayClassisAccepted, aGatewayisProgrammed, and the InferencePool CRDs exist. Swapping kgateway → agentgateway preserves all five evidence checks; only the names in the captured output change. The requirement title was updated fromInference API Gateway (kgateway)→Inference API Gateway (agentgateway).Inference Gateway manifest rewrite.
recipes/components/agentgateway/manifests/inference-gateway.yaml:GatewayParameters(gateway.kgateway.dev/v1alpha1) →AgentgatewayParameters(agentgateway.dev/v1alpha1). The new schema uses a strategic-merge-patch on the generated Deployment viaspec.deployment.spec.template.spec.{nodeSelector,tolerations}rather than the previous structuredspec.kube.podTemplate.*fields.gatewayClassName: kgateway→agentgateway;parametersRef.group: gateway.kgateway.dev→agentgateway.dev.kgateway-system→agentgateway-systemthroughout.Manual CRDs preserved. AICR ships standard Gateway API CRDs and the Inference Extension CRDs via vendored manifest files (
gateway-api-crds.yaml,inference-extension-crds.yaml); neither the old kgateway-crds nor the new agentgateway-crds chart provides them. The manifests move unchanged underrecipes/components/agentgateway-crds/manifests/viagit mv(rename detection preserved at 99% similarity).Conformance validator rewired.
validators/conformance/inference_gateway_check.gopreviously gated onrecipeHasComponent(ctx, "kgateway")and queriedGatewayClass kgateway/Gateway inference-gateway -n kgateway-system. After this PR, upgraded inference recipes contain onlyagentgateway/agentgateway-crds, so the gate, the GatewayClass query, and every namespace reference (5 sites: Gateway Get, EndpointSlices List, Deployments List, Pods List, error message) were updated. Without this fix the conformance check would silently skip.GPU CI path filter.
.github/workflows/gpu-h100-inference-test.yamlwatchedrecipes/components/kgateway/**andrecipes/components/kgateway-crds/**. Now watchesagentgateway*paths so future PRs touching only the renamed component files still trigger H100 inference CI.Version note. The audit in #698 listed v2.2.3, but the agentgateway chart's release cadence trails kgateway-Envoy's; latest published is v2.2.1 (verified via
helm pull oci://cr.agentgateway.dev/charts/agentgateway --version vX). v2.2.2 and v2.2.3 don't exist for agentgateway yet. Pinning v2.2.1.Historical records intentionally untouched:
docs/conformance/cncf/v1.35/nim-eks/**(frozen evidence snapshots),demos/examples/CUJ2-Test-Report.md(frozen test report),docs/design/005-overlay-refactoring.md(historical design doc),CHANGELOG.md/site/docs/project/changelog.md. Future conformance runs will produce evidence with the new namespace/names; old snapshots stay as historical accuracy.Testing
Results (full gate, post-rebase onto main):
pkg/recipe,pkg/evidence/cncf76.5%,validators/conformance22.6s,pkg/bundler/deployer/helm14s — all with-race)cli-bundle-agentgateway-templatestest which verifies the rendered post-chart containskind: AgentgatewayParameters,kind: Gateway, and bundler-injectednodeSelector/nodeGroup: system-poolmake bom-docs: regenerateddocs/user/container-images.mdagainst the newcr.agentgateway.devregistry;agentgatewayandagentgateway-crdsappear at v2.2.1Manual bundle verification (out-of-tree):
aicr recipe --service eks --accelerator h100 --intent inference --os ubuntu --platform dynamoaicr bundle ... --system-node-selector nodeGroup=system-worker --accelerated-node-selector nodeGroup=gpu-worker007-agentgateway-post/templates/inference-gateway.yaml: confirmedAgentgatewayParameters(agentgateway.dev/v1alpha1),gatewayClassName: agentgateway,parametersRef.group: agentgateway.dev, namespaceagentgateway-system, and bundler-injected scheduling.Codex cross-review: ran twice (initial pass found P1 missed conformance validator + P2 missed GPU CI path filter; rerun confirmed both addressed, no new findings; Codex also independently generated a recipe and bundle and verified the rendered shape).
Risk Assessment
Rollout notes:
kgateway-systemnamespace are now orphaned after regenerating a bundle. After the new bundle'sdeploy.shsucceeds, operators should manually clean up the old install:TrafficPolicy,BackendConfigPolicy) on top of an AICR bundle will need to migrate those toAgentgatewayPolicy/AgentgatewayBackend. AICR itself ships none of these CRs.agentgatewayinagentgateway-systemrather thankgatewayinkgateway-system. The frozenv1.35/nim-ekssnapshots are intentionally preserved as historical record.Checklist
make testwith-race)make lint)cli-bundle-agentgateway-templates, updated chainsaw + UAT fixtures + Go test fixtures + golden undeploy.sh files)git commit -S)