From 179a3462ca23db92487a619097c3f28c60009307 Mon Sep 17 00:00:00 2001
From: Carlos Eduardo Arango Gutierrez
Date: Fri, 15 May 2026 15:05:30 +0200
Subject: [PATCH] docs: post-#866 slurm platform-enum drift + nodeSelector catalog note
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Follow-up to PR #866 (Slinky slurm-operator). Backfills platform-enum drift
on surfaces the original PR did not catch, and documents the chart v1.1.0
nodeSelector silent-ignore limitation on the user-facing component-catalog
row.

PR #896 (yuanchen8911) already fixed docs/README.md and
docs/contributor/validations.md as part of a broader docs audit. PR #893
removed the site/docs/ vitepress mirror entirely. Remaining surfaces this PR
addresses:

- docs/contributor/data.md:130: platform row in criteria table
- docs/user/component-catalog.md:36: slinky-slurm-operator row, appended
  **Known limitation:** clause documenting chart v1.1.0 silent-ignore of
  operator.nodeSelector/webhook.nodeSelector (the same limitation is already
  inline in recipes/registry.yaml; this surfaces it on the rendered user
  catalog)
- pkg/api/doc.go:72: REST API godoc was missing slurm entirely
- pkg/recipe/doc.go:32: struct shape comment reordered alphabetically to
  match GetCriteriaPlatformTypes()
- pkg/recipe/doc.go:93-98: CriteriaPlatform* constants reordered
  alphabetically
- pkg/recipe/criteria.go:246: Platform-field godoc updated
- .claude/skills/analyzing-snapshots/SKILL.md:278: internal AICR
  snapshot-analysis skill criteria table

All now list 'dynamo, kubeflow, nim, slurm' alphabetically, matching
pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum).

New guard test
==============

pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter asserts
that the CriteriaPlatform* constants and GetCriteriaPlatformTypes() stay in
sync. It mechanically catches the exact class of drift this commit fixes if
a future platform value is added to one but not the other.

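The guard test described above compares the two lists index by index, so a
failure reports a length mismatch or per-index differences rather than naming
the offending value. As a minimal sketch (not the code in this patch), a
set-based comparison can name the missing or extra values directly.
PlatformType, the Platform* constants, getPlatformTypes, declaredPlatforms,
and drift are hypothetical stand-ins for illustration; the real identifiers
live in pkg/recipe.

```go
package main

import (
	"fmt"
	"sort"
)

// PlatformType and the constants below are hypothetical stand-ins for the
// CriteriaPlatform* constants in pkg/recipe; they are not the real definitions.
type PlatformType string

const (
	PlatformDynamo   PlatformType = "dynamo"
	PlatformKubeflow PlatformType = "kubeflow"
	PlatformNIM      PlatformType = "nim"
	PlatformSlurm    PlatformType = "slurm"
)

// getPlatformTypes mimics a getter that returns the enum values in
// alphabetical order (assumed behavior, based on the commit message).
func getPlatformTypes() []string {
	return []string{"dynamo", "kubeflow", "nim", "slurm"}
}

// declaredPlatforms lists the constants by hand, as a guard test would.
func declaredPlatforms() []string {
	return []string{
		string(PlatformDynamo),
		string(PlatformKubeflow),
		string(PlatformNIM),
		string(PlatformSlurm),
	}
}

// drift returns the values that appear only in declared (missing from got)
// and the values that appear only in got (extra), each sorted.
func drift(declared, got []string) (missing, extra []string) {
	inGot := make(map[string]bool, len(got))
	for _, g := range got {
		inGot[g] = true
	}
	inDeclared := make(map[string]bool, len(declared))
	for _, d := range declared {
		inDeclared[d] = true
		if !inGot[d] {
			missing = append(missing, d)
		}
	}
	for _, g := range got {
		if !inDeclared[g] {
			extra = append(extra, g)
		}
	}
	sort.Strings(missing)
	sort.Strings(extra)
	return missing, extra
}

func main() {
	missing, extra := drift(declaredPlatforms(), getPlatformTypes())
	fmt.Printf("missing=%v extra=%v\n", missing, extra)
}
```

With the two lists in sync this prints `missing=[] extra=[]`. In a test, a
non-empty result from either slice would be reported as a failure that names
the drifted values.
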
Addressed reviews
=================

@mchmarny (#866): Doc-audit gap. cli-reference and api-reference were fixed
at merge time and #896 picked up README/validations; this PR catches the
remaining surfaces (contributor/data, the three Go godoc files, and the
internal skill table).

@coderabbitai (#866): NodeSelector limitation note on the
slinky-slurm-operator catalog row.

Internal PE + QA + DA panel review on draft #884: extended audit to
pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering),
.claude/skills/analyzing-snapshots/SKILL.md, plus the guard test in
pkg/recipe/doc_test.go.

Signed-off-by: Carlos Eduardo Arango Gutierrez
---
 .claude/skills/analyzing-snapshots/SKILL.md |  2 +-
 docs/contributor/data.md                    |  2 +-
 docs/user/component-catalog.md              |  2 +-
 pkg/api/doc.go                              |  2 +-
 pkg/recipe/criteria.go                      |  2 +-
 pkg/recipe/doc.go                           |  4 +-
 pkg/recipe/doc_test.go                      | 45 +++++++++++++++++++++
 7 files changed, 52 insertions(+), 7 deletions(-)
 create mode 100644 pkg/recipe/doc_test.go

diff --git a/.claude/skills/analyzing-snapshots/SKILL.md b/.claude/skills/analyzing-snapshots/SKILL.md
index 0ede6e80b..fff37d7d5 100644
--- a/.claude/skills/analyzing-snapshots/SKILL.md
+++ b/.claude/skills/analyzing-snapshots/SKILL.md
@@ -275,4 +275,4 @@ aicr recipe \
 | accelerator | GPU.smi.gpu.model | h100, gb200, b200, a100, l40, rtx-pro-6000 |
 | os | OS.release.ID | ubuntu, rhel, cos, amazonlinux |
 | intent | User-specified | training, inference |
-| platform | User-specified | kubeflow, dynamo, nim |
+| platform | User-specified | dynamo, kubeflow, nim, slurm |
diff --git a/docs/contributor/data.md b/docs/contributor/data.md
index 3d38999a9..1744af1ea 100644
--- a/docs/contributor/data.md
+++ b/docs/contributor/data.md
@@ -127,7 +127,7 @@ Criteria define when a recipe matches a user query:
 | `accelerator` | String | GPU hardware type | `h100`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` |
 | `os` | String | Operating system | `ubuntu`, `rhel`, `cos`, `amazonlinux` |
 | `intent` | String | Workload purpose | `training`, `inference` |
-| `platform` | String | Platform/framework type | `kubeflow` |
+| `platform` | String | Platform/framework type | `dynamo`, `kubeflow`, `nim`, `slurm` |
 | `nodes` | Integer | Node count (0 = any) | `8`, `16` |
 
 **All fields are optional.** Unpopulated fields act as wildcards (match any value).
diff --git a/docs/user/component-catalog.md b/docs/user/component-catalog.md
index 56a790c20..7256c5108 100644
--- a/docs/user/component-catalog.md
+++ b/docs/user/component-catalog.md
@@ -33,7 +33,7 @@ The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/
 | **kueue** | Kubernetes-native job queuing system. Manages quotas and admits jobs for batch and AI workloads. | [Kueue](https://github.com/kubernetes-sigs/kueue) |
 | **kubeflow-trainer** | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | [Kubeflow Trainer](https://github.com/kubeflow/trainer) |
 | **slinky-slurm-operator-crds** | Custom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the `slinky.slurm.net` CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
-| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs. Cluster-instance CRs (Controller, NodeSet, LoginSet, ...) are user-authored — AICR ships only the operator, mirroring how dynamo-platform and kubeflow-trainer ship operator-only. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
+| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs. Cluster-instance CRs (Controller, NodeSet, LoginSet, ...) are user-authored — AICR ships only the operator, mirroring how dynamo-platform and kubeflow-trainer ship operator-only. **Known limitation:** chart v1.1.0 silently ignores `operator.nodeSelector` and `webhook.nodeSelector` (current chart behavior, not a planned feature); tracking [SlinkyProject/slurm-operator#187](https://github.com/SlinkyProject/slurm-operator/pull/187) for the upstream fix. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
 
 ## How Components Are Selected
 
diff --git a/pkg/api/doc.go b/pkg/api/doc.go
index b2bd601bf..f6eae30ea 100644
--- a/pkg/api/doc.go
+++ b/pkg/api/doc.go
@@ -69,7 +69,7 @@
 //   - gpu: Alias for accelerator (back-compat)
 //   - intent: Workload intent (training, inference, any)
 //   - os: Operating system (ubuntu, rhel, cos, amazonlinux, talos, any)
-//   - platform: Platform/framework (dynamo, kubeflow, nim, any)
+//   - platform: Platform/framework (dynamo, kubeflow, nim, slurm, any)
 //   - nodes: Number of GPU nodes (0 = any/unspecified)
 //
 // # Request Body (POST /v1/recipe)
diff --git a/pkg/recipe/criteria.go b/pkg/recipe/criteria.go
index 0cb4c2255..5311e9c87 100644
--- a/pkg/recipe/criteria.go
+++ b/pkg/recipe/criteria.go
@@ -243,7 +243,7 @@ type Criteria struct {
 	// OS is the worker node operating system type.
 	OS CriteriaOSType `json:"os,omitempty" yaml:"os,omitempty"`
 
-	// Platform is the platform/framework type (kubeflow).
+	// Platform is the platform/framework type (dynamo, kubeflow, nim, slurm).
 	Platform CriteriaPlatformType `json:"platform,omitempty" yaml:"platform,omitempty"`
 
 	// Nodes is the number of worker nodes (0 means any/unspecified).
diff --git a/pkg/recipe/doc.go b/pkg/recipe/doc.go
index 533fe63c5..37cdf373d 100644
--- a/pkg/recipe/doc.go
+++ b/pkg/recipe/doc.go
@@ -29,7 +29,7 @@
 //	    Accelerator CriteriaAcceleratorType // h100, gb200, b200, a100, l40, rtx-pro-6000, any
 //	    Intent CriteriaIntentType // training, inference, any
 //	    OS CriteriaOSType // ubuntu, rhel, cos, amazonlinux, talos, any
-//	    Platform CriteriaPlatformType // kubeflow, dynamo, nim, slurm, any
+//	    Platform CriteriaPlatformType // dynamo, kubeflow, nim, slurm, any
 //	    Nodes int // node count (0 = any)
 //	}
 //
@@ -91,8 +91,8 @@
 //   - CriteriaOSAny: Any OS (wildcard)
 //
 // Platform types for workload frameworks:
-//   - CriteriaPlatformKubeflow: Kubeflow
 //   - CriteriaPlatformDynamo: NVIDIA Dynamo
+//   - CriteriaPlatformKubeflow: Kubeflow
 //   - CriteriaPlatformNIM: NVIDIA NIM
 //   - CriteriaPlatformSlurm: SchedMD Slinky Slurm
 //   - CriteriaPlatformAny: Any platform (wildcard)
diff --git a/pkg/recipe/doc_test.go b/pkg/recipe/doc_test.go
new file mode 100644
index 000000000..405515e6b
--- /dev/null
+++ b/pkg/recipe/doc_test.go
@@ -0,0 +1,45 @@
+// Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package recipe
+
+import (
+	"sort"
+	"testing"
+)
+
+// TestCriteriaPlatformConstantsMatchGetter guards against drift between the
+// CriteriaPlatform* constants and the slice returned by
+// GetCriteriaPlatformTypes(). Adding a new constant without registering it in
+// the getter (or vice versa) is exactly the class of bug that left earlier
+// platform-enum doc surfaces stale before this test existed.
+func TestCriteriaPlatformConstantsMatchGetter(t *testing.T) {
+	declared := []string{
+		string(CriteriaPlatformDynamo),
+		string(CriteriaPlatformKubeflow),
+		string(CriteriaPlatformNIM),
+		string(CriteriaPlatformSlurm),
+	}
+	sort.Strings(declared)
+
+	got := GetCriteriaPlatformTypes()
+	if len(got) != len(declared) {
+		t.Fatalf("len(GetCriteriaPlatformTypes())=%d, declared constants=%d", len(got), len(declared))
+	}
+	for i, want := range declared {
+		if got[i] != want {
+			t.Errorf("GetCriteriaPlatformTypes()[%d] = %q, want %q (declared)", i, got[i], want)
+		}
+	}
+}