Skip to content

chore(recipes): migrate kgateway -> agentgateway for v2.2 inference routing#871

Merged
mchmarny merged 3 commits into
NVIDIA:mainfrom
yuanchen8911:chore/kgateway-v2.2-agentgateway-migration
May 14, 2026
Merged

chore(recipes): migrate kgateway -> agentgateway for v2.2 inference routing#871
mchmarny merged 3 commits into
NVIDIA:mainfrom
yuanchen8911:chore/kgateway-v2.2-agentgateway-migration

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

Summary

Replace kgateway v2.0.0 with agentgateway v2.2.1 for the inference Gateway. kgateway v2.2 removed InferencePool routing from its Envoy data plane (deprecated v2.1, removed v2.2); the Gateway API Inference Extension support AICR uses for CNCF AI Conformance Advanced Ingress moved to the separate agentgateway project (different chart registry, GatewayClass, controller, and a new AgentgatewayParameters CRD that replaces GatewayParameters).

Motivation / Context

AICR uses kgateway for exactly one resource — the inference-gateway Gateway — and zero kgateway-Envoy-specific CRs (no TrafficPolicy, HTTPRoute, BackendConfigPolicy, etc.). Running both kgateway + agentgateway side-by-side would double CRDs, namespaces, controllers, and RBAC to serve a single Gateway. Replacing entirely is the simpler operational story going forward, and the project's investment in AI/ML routing (AgentgatewayPolicy, prompt guards, MCP auth) all lives in agentgateway now.

Fixes: follow-up #1 from #698 (kgateway / kgateway-crds v2.0.0v2.2.3 — the kgateway chart is at v2.2.3 but the relocated agentgateway chart's latest is v2.2.1, so this PR pins v2.2.1)
Related: #698, #715, #720, #724, #725

Type of Change

  • Breaking changegatewayClassName: kgatewayagentgateway; namespace kgateway-systemagentgateway-system; gateway.kgateway.dev/v1alpha1.GatewayParametersagentgateway.dev/v1alpha1.AgentgatewayParameters (strategic-merge-patch schema on Deployment spec). Existing deployed clusters require a manual cleanup step — see Rollout notes.

Component(s) Affected

  • Recipe engine / data (recipes/registry.yaml, recipes/mixins/platform-inference.yaml, recipes/components/agentgateway*, recipes/checks/agentgateway/)
  • Bundlers (pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl + golden fixtures)
  • Validator (validators/conformance/inference_gateway_check.go)
  • CLI evidence collection (pkg/evidence/cncf/{requirements.go,scripts/collect-evidence.sh})
  • Docs/examples (docs/user/*, demos/cuj2-*, recipes/validators/README.md, tests/chainsaw/ai-conformance/README.md)
  • CI workflow path filter (.github/workflows/gpu-h100-inference-test.yaml, .github/scripts/gpu-debug-diagnostics.sh)

Implementation Notes

Conformance equivalence. AICR's ai_inference requirement (pkg/evidence/cncf/requirements.go) is data-plane-agnostic — it asserts a GatewayClass is Accepted, a Gateway is Programmed, and the InferencePool CRDs exist. Swapping kgateway → agentgateway preserves all five evidence checks; only the names in the captured output change. The requirement title was updated from Inference API Gateway (kgateway)Inference API Gateway (agentgateway).

Inference Gateway manifest rewrite. recipes/components/agentgateway/manifests/inference-gateway.yaml:

  • GatewayParameters (gateway.kgateway.dev/v1alpha1) → AgentgatewayParameters (agentgateway.dev/v1alpha1). The new schema uses a strategic-merge-patch on the generated Deployment via spec.deployment.spec.template.spec.{nodeSelector,tolerations} rather than the previous structured spec.kube.podTemplate.* fields.
  • Gateway gatewayClassName: kgatewayagentgateway; parametersRef.group: gateway.kgateway.devagentgateway.dev.
  • Namespace kgateway-systemagentgateway-system throughout.

Manual CRDs preserved. AICR ships standard Gateway API CRDs and the Inference Extension CRDs via vendored manifest files (gateway-api-crds.yaml, inference-extension-crds.yaml); neither the old kgateway-crds nor the new agentgateway-crds chart provides them. The manifests move unchanged under recipes/components/agentgateway-crds/manifests/ via git mv (rename detection preserved at 99% similarity).

Conformance validator rewired. validators/conformance/inference_gateway_check.go previously gated on recipeHasComponent(ctx, "kgateway") and queried GatewayClass kgateway / Gateway inference-gateway -n kgateway-system. After this PR, upgraded inference recipes contain only agentgateway/agentgateway-crds, so the gate, the GatewayClass query, and every namespace reference (5 sites: Gateway Get, EndpointSlices List, Deployments List, Pods List, error message) were updated. Without this fix the conformance check would silently skip.

GPU CI path filter. .github/workflows/gpu-h100-inference-test.yaml watched recipes/components/kgateway/** and recipes/components/kgateway-crds/**. Now watches agentgateway* paths so future PRs touching only the renamed component files still trigger H100 inference CI.

Version note. The audit in #698 listed v2.2.3, but the agentgateway chart's release cadence trails kgateway-Envoy's; latest published is v2.2.1 (verified via helm pull oci://cr.agentgateway.dev/charts/agentgateway --version vX). v2.2.2 and v2.2.3 don't exist for agentgateway yet. Pinning v2.2.1.

Historical records intentionally untouched: docs/conformance/cncf/v1.35/nim-eks/** (frozen evidence snapshots), demos/examples/CUJ2-Test-Report.md (frozen test report), docs/design/005-overlay-refactoring.md (historical design doc), CHANGELOG.md / site/docs/project/changelog.md. Future conformance runs will produce evidence with the new namespace/names; old snapshots stay as historical accuracy.

Testing

make qualify

Results (full gate, post-rebase onto main):

  • Go unit tests: all packages pass (incl. pkg/recipe, pkg/evidence/cncf 76.5%, validators/conformance 22.6s, pkg/bundler/deployer/helm 14s — all with -race)
  • Chainsaw: 20 passed / 0 failed / 0 skipped, including the renamed cli-bundle-agentgateway-templates test which verifies the rendered post-chart contains kind: AgentgatewayParameters, kind: Gateway, and bundler-injected nodeSelector / nodeGroup: system-pool
  • Lint (golangci-lint + yamllint): clean
  • Vulnerability scan: only pre-existing npm/python doc-build CVEs, unrelated to this change
  • License headers: clean
  • make bom-docs: regenerated docs/user/container-images.md against the new cr.agentgateway.dev registry; agentgateway and agentgateway-crds appear at v2.2.1

Manual bundle verification (out-of-tree):

  • Generated aicr recipe --service eks --accelerator h100 --intent inference --os ubuntu --platform dynamo
  • Generated aicr bundle ... --system-node-selector nodeGroup=system-worker --accelerated-node-selector nodeGroup=gpu-worker
  • Inspected 007-agentgateway-post/templates/inference-gateway.yaml: confirmed AgentgatewayParameters (agentgateway.dev/v1alpha1), gatewayClassName: agentgateway, parametersRef.group: agentgateway.dev, namespace agentgateway-system, and bundler-injected scheduling.

Codex cross-review: ran twice (initial pass found P1 missed conformance validator + P2 missed GPU CI path filter; rerun confirmed both addressed, no new findings; Codex also independently generated a recipe and bundle and verified the rendered shape).

Risk Assessment

  • Medium — Touches the inference data plane (swap from Envoy + inference extension to standalone agentgateway Rust binary), but blast radius is bounded: AICR uses only the InferencePool routing surface, no Envoy-specific filters or CRs. KWOK + offline chainsaw coverage protects the bundle-generation path; the H100 GPU CI workflow exercises the deployed cluster.

Rollout notes:

  • Migration for existing AICR-deployed clusters running v2.0: the old kgateway Helm releases + kgateway-system namespace are now orphaned after regenerating a bundle. After the new bundle's deploy.sh succeeds, operators should manually clean up the old install:
    helm uninstall kgateway kgateway-crds -n kgateway-system 2>/dev/null
    kubectl delete namespace kgateway-system 2>/dev/null
    Greenfield deploys don't need this.
  • Downstream consumers who layered kgateway-Envoy-specific CRs (e.g. TrafficPolicy, BackendConfigPolicy) on top of an AICR bundle will need to migrate those to AgentgatewayPolicy / AgentgatewayBackend. AICR itself ships none of these CRs.
  • CNCF AI Conformance evidence baseline shifts. Future evidence captures will show agentgateway in agentgateway-system rather than kgateway in kgateway-system. The frozen v1.35/nim-eks snapshots are intentionally preserved as historical record.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (renamed cli-bundle-agentgateway-templates, updated chainsaw + UAT fixtures + Go test fixtures + golden undeploy.sh files)
  • I updated docs if user-facing behavior changed (component-catalog, container-images, api-reference, cli-reference, demo guides, validators README, chainsaw README, evidence script)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: dcbff694-6159-47bb-8009-d8a21cfcd67b

📥 Commits

Reviewing files that changed from the base of the PR and between 1b2f4b6 and 5541758.

📒 Files selected for processing (2)
  • docs/user/container-images.md
  • recipes/registry.yaml

📝 Walkthrough

Walkthrough

This PR replaces the inference gateway implementation and all references from kgateway to agentgateway. Changes touch Helm charts and generated manifests (new AgentgatewayParameters CR), the recipes registry and mixins, deployment/undeploy preflight logic and tests, Chainsaw/UAT/CI test fixtures, validators and evidence collection (namespace/resource targets), diagnostics script and workflow triggers, container-image inventory, and documentation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/aicr#869: Modifies the "How Components Are Selected" section in docs/user/component-catalog.md with overlapping gateway component documentation changes.

Suggested labels

documentation

Suggested reviewers

  • mchmarny
  • lockwobr
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly summarizes the main change: migrating from kgateway to agentgateway for inference routing, including the version context (v2.2). It is concise and specific.
Description check ✅ Passed Description is comprehensive and directly related to the changeset. It explains the motivation (kgateway v2.2 removed InferencePool routing), implementation details (component migration, CRD updates, validator changes), testing results, risk assessment, and rollout notes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three small observations inline, none blocking:

  • pod vs Deployment scope of the managed-by: aicr label (no consumers select on it, so cosmetic)
  • substantial deploymentOrder reshuffles in the UAT fixtures (presumably regenerator-driven, worth a sentence)
  • doc/CRD drift on InferenceModelRewrite in the evidence script (no runtime impact)

Rollout notes for existing v2.0 clusters are clearly documented (manual helm uninstall kgateway* + namespace cleanup). LGTM once you've confirmed the inline questions.

Comment thread recipes/components/agentgateway/manifests/inference-gateway.yaml
Comment thread tests/uat/azure/tests/cuj2-inference/assert-recipe.yaml
Comment thread pkg/evidence/cncf/scripts/collect-evidence.sh Outdated
…outing

kgateway v2.2 removed InferencePool routing from its Envoy data plane
(PR kgateway-dev/kgateway#12689, deprecated in v2.1). The Gateway API
Inference Extension support that AICR uses for CNCF AI Conformance
Advanced Ingress moved entirely to the separate agentgateway project,
which ships its own Helm charts, GatewayClass, controller, and
AgentgatewayParameters CRD.

AICR's only kgateway consumer was the inference Gateway resource — zero
HTTPRoutes, no TrafficPolicy/BackendConfigPolicy/etc. — so this PR
replaces kgateway entirely rather than running both side-by-side.

Changes:
- registry + mixin: kgateway/kgateway-crds (v2.0.0, cr.kgateway.dev)
  -> agentgateway/agentgateway-crds (v2.2.1, cr.agentgateway.dev)
- inference-gateway.yaml: GatewayParameters (gateway.kgateway.dev) ->
  AgentgatewayParameters (agentgateway.dev/v1alpha1, strategic-merge
  patch on Deployment spec); gatewayClassName: kgateway -> agentgateway;
  namespace kgateway-system -> agentgateway-system
- health check + component dirs renamed under git mv to preserve history
- conformance validator (validators/conformance/inference_gateway_check.go)
  updated to gate on `agentgateway` component and query the new
  GatewayClass / namespace / Deployment names
- evidence collector script + requirement title updated
- GPU H100 inference workflow path filter follows the renamed component
  directories so PRs touching only agentgateway* still trigger CI
- Go test fixtures, undeploy.sh.tmpl + golden files, chainsaw assertions,
  UAT recipe snapshots regenerated against the new component graph
- docs/user (component-catalog, container-images, api-reference,
  cli-reference) refreshed; container-images.md regenerated via
  `make bom-docs`
- demos/cuj2-* + demos/query.md + demos/images/meta.md updated so
  kubectl commands and architecture diagrams match the new namespace
  and component names

Conformance-equivalence: AICR's `ai_inference` requirement is data-plane
agnostic — it asserts a GatewayClass is Accepted, a Gateway is
Programmed, and the InferencePool CRDs exist. Swapping to agentgateway
preserves all five evidence checks; only the names in the captured
output change.

Historical evidence snapshots under docs/conformance/cncf/v1.35/nim-eks/,
demos/examples/CUJ2-Test-Report.md, design doc 005, and CHANGELOG
entries are intentionally left referencing kgateway as frozen records.

Closes follow-up #1 from issue NVIDIA#698.

Validation:
- make qualify (Go unit tests, golangci-lint, yamllint, 20/20 chainsaw
  including cli-bundle-agentgateway-templates, vulnerability scan,
  license headers): all green
- Bundle generation verified end-to-end (recipe -> bundle -> deploy.sh
  layout under /tmp); rendered post chart confirmed to emit
  AgentgatewayParameters + Gateway with class agentgateway in namespace
  agentgateway-system
- validators/conformance package tests pass (22.6s)
@yuanchen8911 yuanchen8911 force-pushed the chore/kgateway-v2.2-agentgateway-migration branch from 9cf5226 to 97adcd8 Compare May 14, 2026 00:39
coderabbitai[bot]

This comment was marked as resolved.

@mchmarny mchmarny enabled auto-merge (squash) May 14, 2026 12:13
@mchmarny mchmarny merged commit f548633 into NVIDIA:main May 14, 2026
91 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
…-crds comment

The slinky-slurm-operator-crds values file's comment references
`components/kgateway-crds/values.yaml` as an example of the
"AICR umbrella convention for CRD-only components." That path no
longer exists — the kgateway component was renamed to agentgateway
in NVIDIA#871, and the file is now at `components/agentgateway-crds/
values.yaml`.

One-line comment fix; no behavior change.

Surfaced during a post-NVIDIA#871 audit for residual kgateway references in
active code paths.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
…-crds comment

The slinky-slurm-operator-crds values file's comment references
`components/kgateway-crds/values.yaml` as an example of the
"AICR umbrella convention for CRD-only components." That path no
longer exists — the kgateway component was renamed to agentgateway
in NVIDIA#871, and the file is now at `components/agentgateway-crds/
values.yaml`.

One-line comment fix; no behavior change.

Surfaced during a post-NVIDIA#871 audit for residual kgateway references in
active code paths.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
…-crds comment

The slinky-slurm-operator-crds values file's comment references
`components/kgateway-crds/values.yaml` as an example of the
"AICR umbrella convention for CRD-only components." That path no
longer exists — the kgateway component was renamed to agentgateway
in NVIDIA#871, and the file is now at `components/agentgateway-crds/
values.yaml`.

One-line comment fix; no behavior change.

Surfaced during a post-NVIDIA#871 audit for residual kgateway references in
active code paths.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants