docs: post-#866 slurm platform-enum drift + nodeSelector catalog note#884
docs: post-#866 slurm platform-enum drift + nodeSelector catalog note#884ArangoGutierrez wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR documents the expanded support for workload platform/framework types in the Criteria field. Changes include: (1) updating field comments in Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
5a66d80 to
b1ef88a
Compare
…g note Two follow-ups to PR NVIDIA#866 (Slinky slurm-operator), bundled into one post-merge cleanup PR. Platform-enum drift backfill — surfaces NVIDIA#866 did not catch ============================================================ PR NVIDIA#866 added `--platform slurm` to the OpenAPI spec, the Go enum (GetCriteriaPlatformTypes), and docs/user/{cli,api}-reference.md + component-catalog.md. Per the project's enum-audit rule (CLAUDE.md "Documentation updates"), every surface that enumerates platform values should be updated. The following surfaces still showed `kubeflow` only (pre-existing drift predating slurm — `dynamo` and `nim` were already missing): - docs/README.md:11 — glossary Criteria row - docs/contributor/data.md:130 — platform row in criteria table - docs/contributor/validations.md:99 — platform bullet - site/docs/getting-started/index.md:19 — glossary mirror - pkg/recipe/criteria.go:246 — Platform field godoc - pkg/recipe/doc.go:32 — Criteria struct shape comment (was non-alphabetical "kubeflow, dynamo, nim, slurm, any") - pkg/recipe/doc.go:93-98 — CriteriaPlatform* constant listing (was non-alphabetical order) - pkg/api/doc.go:72 — REST API doc was missing slurm entirely (only one of these with factual content drift; rest are ordering) - .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR snapshot-analysis skill criteria table All now list `dynamo, kubeflow, nim, slurm` in alphabetical order matching pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum). Per-file delimiter convention preserved: slash-separated in glossary-style parens, comma-separated in tables/godoc. New guard test ============== pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes() stay in sync. Catches the exact class of drift this commit fixes if a future platform value is added to one but not the other. Catalog node-selector limitation note ===================================== Appends a `**Known limitation:**` clause to the slinky-slurm-operator row in docs/user/component-catalog.md, documenting that chart v1.1.0 silently ignores `operator.nodeSelector` and `webhook.nodeSelector` and linking SlinkyProject/slurm-operator#187 for the upstream fix. The same limitation is already in-line in recipes/registry.yaml; the catalog note surfaces it to readers of the rendered user docs. Originally planned but no longer needed ======================================= While auditing, I had also planned to add Mark's two "optional nits" from NVIDIA#866 — a chart-default verification pointer at defaultVersion in recipes/registry.yaml, and a vacuous-pass guard comment in recipes/checks/slinky-slurm-operator/health-check.yaml. Both turned out to already be in the merged version (Mark applied them himself before merge — see recipes/registry.yaml:550-555 and recipes/checks/slinky-slurm-operator/health-check.yaml:64-69). The .lychee.toml header change is the project's license-header check auto-applying the standard Apache 2.0 copyright block. Addresses ========= Mark (@mchmarny) review feedback on NVIDIA#866: - Doc-audit gap across user docs (cli-reference + api-reference landed at merge time; this PR catches the remaining surfaces) CodeRabbit review feedback on NVIDIA#866: - NodeSelector limitation note on slinky-slurm-operator catalog row Internal panel review (PE + QA + DA) on draft NVIDIA#884: - Extended audit to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering), .claude/skills/analyzing-snapshots/SKILL.md - Added guard test against future enum/getter drift - Squashed to single commit; rebased onto upstream/main HEAD Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
|
Promoted draft scope after an internal Principal Engineer + QA Engineer + Devil's Advocate panel review. Panel findings (all addressed in the squashed commit):
Items intentionally not addressed (rationale below):
|
…g note Follow-up to PR NVIDIA#866 (Slinky slurm-operator). Backfills platform-enum drift on surfaces the original PR did not catch, plus surfaces the chart v1.1.0 nodeSelector silent-ignore limitation on the user-facing component-catalog row. PR NVIDIA#896 (yuanchen8911) already fixed docs/README.md and docs/contributor/validations.md as part of a broader docs audit. PR NVIDIA#893 removed the site/docs/ vitepress mirror entirely. Remaining surfaces this PR addresses: - docs/contributor/data.md:130 — platform row in criteria table - docs/user/component-catalog.md:36 — slinky-slurm-operator row, appended **Known limitation:** clause documenting chart v1.1.0 silent-ignore of operator.nodeSelector/webhook.nodeSelector (the same limitation is already inline in recipes/registry.yaml; this surfaces it on the rendered user catalog) - pkg/api/doc.go:72 — REST API godoc was missing slurm entirely - pkg/recipe/doc.go:32 — struct shape comment reordered to alphabetical to match GetCriteriaPlatformTypes() - pkg/recipe/doc.go:93-98 — CriteriaPlatform* constants reordered alphabetical - pkg/recipe/criteria.go:246 — Platform-field godoc updated - .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR snapshot-analysis skill criteria table All now list 'dynamo, kubeflow, nim, slurm' alphabetically matching pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum). New guard test ============== pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes() stay in sync. Mechanically catches the exact class of drift this commit fixes if a future platform value is added to one but not the other. Addressed reviews ================= @mchmarny (NVIDIA#866) — Doc-audit gap. cli-reference and api-reference were fixed at merge time and NVIDIA#896 picked up README/validations; this PR catches the remaining surfaces (contributor/data, the three Go godoc files, and the internal skill table). @coderabbitai (NVIDIA#866) — NodeSelector limitation note on the slinky-slurm-operator catalog row. Internal PE + QA + DA panel review on draft NVIDIA#884 — extended audit to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering), .claude/skills/analyzing-snapshots/SKILL.md, plus the guard test in pkg/recipe/doc_test.go. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
b1ef88a to
179a346
Compare
|
🌿 Preview your docs: https://nvidia-preview-chore-post-866-slurm-followups.docs.buildwithfern.com/aicr |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/recipe/doc_test.go`:
- Around line 27-45: The test TestCriteriaPlatformConstantsMatchGetter is a
single-case check and must be converted into a table-driven test with multiple
cases; replace the current body with a slice of test cases (structs) each
containing a name and expected []string (e.g., a case for the declared
constants, a case for an empty/modified set if relevant), iterate over cases
using t.Run, call GetCriteriaPlatformTypes() inside each subtest, compare
lengths and elements and fail via t.Fatalf/t.Errorf when mismatches occur;
reference the existing constants CriteriaPlatformDynamo,
CriteriaPlatformKubeflow, CriteriaPlatformNIM, CriteriaPlatformSlurm and the
getter GetCriteriaPlatformTypes() and keep the same comparison logic but
executed per test case within TestCriteriaPlatformConstantsMatchGetter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: aecd6f0b-a2e5-4ae6-9425-f05789b772f3
📒 Files selected for processing (7)
.claude/skills/analyzing-snapshots/SKILL.mddocs/contributor/data.mddocs/user/component-catalog.mdpkg/api/doc.gopkg/recipe/criteria.gopkg/recipe/doc.gopkg/recipe/doc_test.go
| func TestCriteriaPlatformConstantsMatchGetter(t *testing.T) { | ||
| declared := []string{ | ||
| string(CriteriaPlatformDynamo), | ||
| string(CriteriaPlatformKubeflow), | ||
| string(CriteriaPlatformNIM), | ||
| string(CriteriaPlatformSlurm), | ||
| } | ||
| sort.Strings(declared) | ||
|
|
||
| got := GetCriteriaPlatformTypes() | ||
| if len(got) != len(declared) { | ||
| t.Fatalf("len(GetCriteriaPlatformTypes())=%d, declared constants=%d", len(got), len(declared)) | ||
| } | ||
| for i, want := range declared { | ||
| if got[i] != want { | ||
| t.Errorf("GetCriteriaPlatformTypes()[%d] = %q, want %q (declared)", i, got[i], want) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Convert this to a table-driven test with multiple cases.
The guard is useful, but this file’s guideline requires table-driven tests with multiple test cases.
♻️ Proposed refactor
func TestCriteriaPlatformConstantsMatchGetter(t *testing.T) {
declared := []string{
string(CriteriaPlatformDynamo),
string(CriteriaPlatformKubeflow),
string(CriteriaPlatformNIM),
string(CriteriaPlatformSlurm),
}
sort.Strings(declared)
- got := GetCriteriaPlatformTypes()
- if len(got) != len(declared) {
- t.Fatalf("len(GetCriteriaPlatformTypes())=%d, declared constants=%d", len(got), len(declared))
- }
- for i, want := range declared {
- if got[i] != want {
- t.Errorf("GetCriteriaPlatformTypes()[%d] = %q, want %q (declared)", i, got[i], want)
- }
- }
+ got := GetCriteriaPlatformTypes()
+ tests := []struct {
+ name string
+ want []string
+ }{
+ {
+ name: "getter matches declared constants",
+ want: declared,
+ },
+ {
+ name: "getter output is alphabetically sorted",
+ want: func() []string {
+ sorted := append([]string(nil), got...)
+ sort.Strings(sorted)
+ return sorted
+ }(),
+ },
+ }
+
+ for _, tc := range tests {
+ t.Run(tc.name, func(t *testing.T) {
+ if len(got) != len(tc.want) {
+ t.Fatalf("len(GetCriteriaPlatformTypes())=%d, want=%d", len(got), len(tc.want))
+ }
+ for i, want := range tc.want {
+ if got[i] != want {
+ t.Errorf("GetCriteriaPlatformTypes()[%d] = %q, want %q", i, got[i], want)
+ }
+ }
+ })
+ }
}As per coding guidelines: "**/*_test.go: Always use table-driven tests with multiple test cases organized in a slice of structs".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/recipe/doc_test.go` around lines 27 - 45, The test
TestCriteriaPlatformConstantsMatchGetter is a single-case check and must be
converted into a table-driven test with multiple cases; replace the current body
with a slice of test cases (structs) each containing a name and expected
[]string (e.g., a case for the declared constants, a case for an empty/modified
set if relevant), iterate over cases using t.Run, call
GetCriteriaPlatformTypes() inside each subtest, compare lengths and elements and
fail via t.Fatalf/t.Errorf when mismatches occur; reference the existing
constants CriteriaPlatformDynamo, CriteriaPlatformKubeflow, CriteriaPlatformNIM,
CriteriaPlatformSlurm and the getter GetCriteriaPlatformTypes() and keep the
same comparison logic but executed per test case within
TestCriteriaPlatformConstantsMatchGetter.
There was a problem hiding this comment.
Pushing back on this one. Project CLAUDE.md says "Table-driven tests (required for multiple cases)" — emphasis on multiple. This test has one invariant: the CriteriaPlatform* constants match what GetCriteriaPlatformTypes() returns. The constants and the getter are package-level singletons, so the "empty/modified set" case in your suggestion would test the framework, not the invariant under guard.
Converting would add ~20 lines of boilerplate for zero additional behavior coverage. Happy to revisit if a reviewer disagrees.
There was a problem hiding this comment.
@ArangoGutierrez, you're right — the guideline says "required for multiple cases", and this test has exactly one meaningful invariant. My proposed second case ("getter output is alphabetically sorted") was a tautology: it sorted got against got, exercising nothing real. The pushback is well-founded and the current implementation is appropriate for a single-invariant guard test. Withdrawing the suggestion.
(ᵔᴥᵔ)
✏️ Learnings added
Learnt from: ArangoGutierrez
Repo: NVIDIA/aicr PR: 884
File: pkg/recipe/doc_test.go:27-45
Timestamp: 2026-05-15T14:08:57.038Z
Learning: In NVIDIA/aicr, CLAUDE.md's table-driven test requirement applies only when there are genuinely multiple test cases ("Table-driven tests (required for multiple cases)"). Single-invariant guard tests — such as `TestCriteriaPlatformConstantsMatchGetter` in `pkg/recipe/doc_test.go`, which checks that `CriteriaPlatform*` constants match `GetCriteriaPlatformTypes()` — do not need to be converted to table-driven form. Do not flag such tests for table-driven refactoring when there is only one real invariant to validate.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: njhensley
Repo: NVIDIA/aicr PR: 595
File: pkg/recipe/loader.go:30-35
Timestamp: 2026-04-16T18:20:41.689Z
Learning: Do not flag uses of `context.WithTimeout(context.Background(), timeout)` when they match the repo’s documented “GOOD” pattern (per `CLAUDE.md`). In particular, treat it as intentional/acceptable in `pkg/serializer` implementations such as `HTTPReader.Download` and `fromConfigMapWithKubeconfig`. Do not require threading `context.Context` through `serializer.FromFileWithKubeconfig` (or other cross-package signature changes) as a guideline violation; if callers need external cancellation propagation, they should use the existing `DownloadWithContext` API rather than changing wide call chains and tests.
Learnt from: njhensley
Repo: NVIDIA/aicr PR: 595
File: pkg/recipe/loader.go:56-60
Timestamp: 2026-04-16T18:20:53.204Z
Learning: In this repository’s Go code, if an error was already created/wrapped with `pkg/errors` (via `aicrerrors.New(...)` or `aicrerrors.WrapWithContext(...)`) and includes the correct structured error code (e.g., `ErrCodeInvalidRequest`, `ErrCodeTimeout`, `ErrCodeInternal`), then returning it should use the bare form `return nil, err` (or propagate the error as-is). Do not require “missing context” when the source already has proper codes; do not double-wrap it. Avoid changing the error by using `errors.Wrapf` (which can override the structured code) or `fmt.Errorf("%w", err)` in these cases, which is documented as an anti-pattern in `CLAUDE.md`.
Learnt from: mchmarny
Repo: NVIDIA/aicr PR: 728
File: pkg/recipe/criteria.go:801-820
Timestamp: 2026-05-01T23:59:30.754Z
Learning: In NVIDIA/aicr, use the idiomatic helper `errors.PropagateOrWrap(err, errCode, message)` from `pkg/errors` when handling errors in Go code. This helper should be preferred over manual `errors.As` + conditional return logic: it propagates `err` as-is if it already carries a `*StructuredError` code, and otherwise wraps `err` using the provided fallback `errCode` and `message`. Apply this pattern especially when calling functions from `pkg/serializer`, `pkg/recipe`, or other packages that may already return coded structured errors.
Learnt from: mchmarny
Repo: NVIDIA/aicr PR: 728
File: pkg/k8s/agent/job_watch_test.go:140-150
Timestamp: 2026-05-01T23:59:44.270Z
Learning: In Go unit tests, do not require or suggest adding `return` statements immediately after `t.Fatal()` / `t.Fatalf()` calls. These functions internally terminate the current test goroutine via `runtime.Goexit()`, so any explicit `return` afterward is redundant/unreachable. Review should focus on the logic before `t.Fatal*`; the absence of a trailing `return` should not be treated as a problem.
Learnt from: mchmarny
Repo: NVIDIA/aicr PR: 728
File: pkg/server/middleware_test.go:400-402
Timestamp: 2026-05-01T23:59:43.871Z
Learning: In Go test files, don’t require a `return` statement immediately after `t.Fatal(...)`/`t.Fatalf(...)` calls. `t.Fatal` calls `runtime.Goexit()`, so any following `return` is unreachable/redundant (though deferred functions will still run). If the lint/static checks already pass without the explicit `return`, avoid flagging the missing `return` in future reviews (e.g., in this repo’s Go tests).
Learnt from: njhensley
Repo: NVIDIA/aicr PR: 844
File: pkg/fingerprint/from_measurements.go:200-220
Timestamp: 2026-05-11T21:09:20.802Z
Learning: In this repo’s Go code, do not flag or request changes for `int64`→`int` casts when the value originates from a Go `int` and remains in-process (e.g., it round-trips via `measurement.Reading.Any()` but never crosses a system boundary like serialization/network/DB). Only validate/guard numeric ranges at system boundaries; adding `math.MaxInt32/MinInt32` clamping for such internal-only casts is an anti-pattern (per CLAUDE.md), since the scenario is guaranteed by internal/framework behavior and 64-bit Linux targets keep `int` effectively 64 bits.
Learnt from: kaponco
Repo: NVIDIA/aicr PR: 857
File: pkg/recipe/metadata.go:567-597
Timestamp: 2026-05-13T06:00:04.625Z
Learning: In NVIDIA/aicr’s overlay model, supported behavior does NOT include converting a ComponentTypeDirect component into ComponentTypeHelm or ComponentTypeKustomize (and not the reverse in those directions) via overlays. In `mergeComponentRef`, switch cases that appear to support Helm↔Direct or Kustomize↔Direct type changes (e.g., clearing `SourceFile`/`Olm`) are defensive dead code. Reviewers should not flag the absence of these transitions as a missing validation or bug, and should treat Direct→Helm/Kustomize overrides as outside the supported overlay model.
faganihajizada
left a comment
There was a problem hiding this comment.
Looks good! thanks!
|
@mchmarny PTAL |
Summary
Single-commit follow-up to PR #866 (Slinky slurm-operator). Backfills
platformenum drift on 7 surfaces (1 contributor doc, 3 Go godoc, 1 user-catalog row, 1 internal AICR snapshot-analysis skill, plus the new guard test), and surfaces the chart v1.1.0nodeSelectorsilent-ignore limitation on the user-facing component-catalog row.Motivation / Context
PR #866 added
--platform slurmand updated the OpenAPI spec,pkg/recipe/criteria.go(theGetCriteriaPlatformTypesreturn value),docs/user/cli-reference.md,docs/user/api-reference.md, anddocs/user/component-catalog.md. Per the project's enum-audit rule (CLAUDE.md "Documentation updates"), every surface that enumerates platform values should also have been updated.Most surfaces still showed only
kubeflow(pre-existing drift predating slurm —dynamoandnimwere already missing). PR #896 (@yuanchen8911, merged 2026-05-14) fixeddocs/README.mdanddocs/contributor/validations.mdas part of a broader docs audit. PR #893 removed thesite/docs/vitepress mirror entirely. This PR catches the remaining surfaces:docs/contributor/data.md:130kubeflowdynamo, kubeflow, nim, slurmpkg/recipe/criteria.go:246(godoc)(kubeflow)(dynamo, kubeflow, nim, slurm)pkg/recipe/doc.go:32(struct shape comment)kubeflow, dynamo, nim, slurm, any(non-alphabetical)dynamo, kubeflow, nim, slurm, anypkg/recipe/doc.go:93-98(constant listing)Kubeflow, Dynamo, NIM, Slurm, Any(non-alphabetical)pkg/api/doc.go:72(REST API godoc)(dynamo, kubeflow, nim, any)(missing slurm)(dynamo, kubeflow, nim, slurm, any).claude/skills/analyzing-snapshots/SKILL.md:278(internal skill table)kubeflow, dynamo, nimdynamo, kubeflow, nim, slurmOf these,
pkg/api/doc.go:72was the only one with factual content drift (missing slurm entirely); the rest are ordering or pre-existing-without-slurm drift.Also appends a
**Known limitation:**clause to theslinky-slurm-operatorrow indocs/user/component-catalog.mddocumenting that chart v1.1.0 silently ignoresoperator.nodeSelectorandwebhook.nodeSelector(per @coderabbitai review on #866). The same limitation is already inline inrecipes/registry.yaml; this surfaces it on the rendered user catalog.Adds
pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter— a guard test that asserts theCriteriaPlatform*constants andGetCriteriaPlatformTypes()stay in sync. Mechanically catches the exact class of drift this PR fixes if a future platform value is added to one but not the other.Fixes: N/A
Related: #866, #896, SlinkyProject/slurm-operator#187
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server) — godoc comment only, no behavior changepkg/recipe) — godoc comment + guard test, no behavior changepkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/).claude/skills/Implementation Notes
Single squashed commit on top of current
upstream/main. Each edit is one line apart from the guard test (45 lines, license header included).Originally planned but no longer needed
recipes/registry.yaml, vacuous-pass guard comment inrecipes/checks/slinky-slurm-operator/health-check.yaml— were already in main when this PR opened (Mark applied them himself before merging Add Slinky slurm-operator as platform-slurm #866).docs/README.mdanddocs/contributor/validations.md— fixed by Yuan Chen's PR docs: fix audit findings across docs, demos, examples #896 while this PR was in draft.site/docs/getting-started/index.md— entiresite/docs/tree removed by PR chore(docs): remove vitepress site in favor of fern docs #893 (vitepress → fern migration).Internal review panel
This PR went through a Principal Engineer + QA Engineer + Devil's Advocate review panel after the first push as draft. Panel surfaced the three Go godoc surfaces (
pkg/api/doc.go,pkg/recipe/doc.gox2) and the internal skill file (.claude/skills/analyzing-snapshots/SKILL.md) that the original audit missed, plus recommended the guard test inpkg/recipe/doc_test.go. All findings addressed.Testing
make qualifyfull run on a prior iteration of this branch showed all stages PASS except for three flaky failures inpkg/bundler/deployer/helm(TestUndeployScript_Preflight*) that fail under parallel load withsignal: killedand pass cleanly in isolation (14s). These tests spawnbashsubprocesses that exechelm/kubectl; the OS kills them under CPU pressure during full-suite parallel test execution. Unrelated to this PR (different package, different files). Expecting CI server-side to pass with more headroom.Coverage for
pkg/recipe: 90.2% — new guard test adds coverage to the platform-type constants without removing any existing coverage.Risk Assessment
git revert.Rollout notes: N/A — additive doc clarifications and a guard test.
Checklist
make testwith-race) — pkg/recipe new test passes; pre-existing flakes in helm-deployer unrelatedgolangci-lint run -c .golangci.yaml)pkg/recipe/doc_test.go)git commit -S)