Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .claude/skills/analyzing-snapshots/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -275,4 +275,4 @@ aicr recipe \
| accelerator | GPU.smi.gpu.model | h100, gb200, b200, a100, l40, rtx-pro-6000 |
| os | OS.release.ID | ubuntu, rhel, cos, amazonlinux |
| intent | User-specified | training, inference |
| platform | User-specified | kubeflow, dynamo, nim |
| platform | User-specified | dynamo, kubeflow, nim, slurm |
2 changes: 1 addition & 1 deletion docs/contributor/data.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,7 +127,7 @@ Criteria define when a recipe matches a user query:
| `accelerator` | String | GPU hardware type | `h100`, `gb200`, `b200`, `a100`, `l40`, `rtx-pro-6000` |
| `os` | String | Operating system | `ubuntu`, `rhel`, `cos`, `amazonlinux` |
| `intent` | String | Workload purpose | `training`, `inference` |
| `platform` | String | Platform/framework type | `kubeflow` |
| `platform` | String | Platform/framework type | `dynamo`, `kubeflow`, `nim`, `slurm` |
| `nodes` | Integer | Node count (0 = any) | `8`, `16` |

**All fields are optional.** Unpopulated fields act as wildcards (match any value).
Expand Down
2 changes: 1 addition & 1 deletion docs/user/component-catalog.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/
| **kueue** | Kubernetes-native job queuing system. Manages quotas and admits jobs for batch and AI workloads. | [Kueue](https://github.com/kubernetes-sigs/kueue) |
| **kubeflow-trainer** | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | [Kubeflow Trainer](https://github.com/kubeflow/trainer) |
| **slinky-slurm-operator-crds** | Custom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the `slinky.slurm.net` CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs. Cluster-instance CRs (Controller, NodeSet, LoginSet, ...) are user-authored — AICR ships only the operator, mirroring how dynamo-platform and kubeflow-trainer ship operator-only. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs. Cluster-instance CRs (Controller, NodeSet, LoginSet, ...) are user-authored — AICR ships only the operator, mirroring how dynamo-platform and kubeflow-trainer ship operator-only. **Known limitation:** chart v1.1.0 silently ignores `operator.nodeSelector` and `webhook.nodeSelector` (current chart behavior, not a planned feature); tracking [SlinkyProject/slurm-operator#187](https://github.com/SlinkyProject/slurm-operator/pull/187) for the upstream fix. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |

## How Components Are Selected

Expand Down
2 changes: 1 addition & 1 deletion pkg/api/doc.go
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@
// - gpu: Alias for accelerator (back-compat)
// - intent: Workload intent (training, inference, any)
// - os: Operating system (ubuntu, rhel, cos, amazonlinux, talos, any)
// - platform: Platform/framework (dynamo, kubeflow, nim, any)
// - platform: Platform/framework (dynamo, kubeflow, nim, slurm, any)
// - nodes: Number of GPU nodes (0 = any/unspecified)
//
// # Request Body (POST /v1/recipe)
Expand Down
2 changes: 1 addition & 1 deletion pkg/recipe/criteria.go
Original file line number Diff line number Diff line change
Expand Up @@ -243,7 +243,7 @@ type Criteria struct {
// OS is the worker node operating system type.
OS CriteriaOSType `json:"os,omitempty" yaml:"os,omitempty"`

// Platform is the platform/framework type (kubeflow).
// Platform is the platform/framework type (dynamo, kubeflow, nim, slurm).
Platform CriteriaPlatformType `json:"platform,omitempty" yaml:"platform,omitempty"`

// Nodes is the number of worker nodes (0 means any/unspecified).
Expand Down
4 changes: 2 additions & 2 deletions pkg/recipe/doc.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@
// Accelerator CriteriaAcceleratorType // h100, gb200, b200, a100, l40, rtx-pro-6000, any
// Intent CriteriaIntentType // training, inference, any
// OS CriteriaOSType // ubuntu, rhel, cos, amazonlinux, talos, any
// Platform CriteriaPlatformType // kubeflow, dynamo, nim, slurm, any
// Platform CriteriaPlatformType // dynamo, kubeflow, nim, slurm, any
// Nodes int // node count (0 = any)
// }
//
Expand Down Expand Up @@ -91,8 +91,8 @@
// - CriteriaOSAny: Any OS (wildcard)
//
// Platform types for workload frameworks:
// - CriteriaPlatformKubeflow: Kubeflow
// - CriteriaPlatformDynamo: NVIDIA Dynamo
// - CriteriaPlatformKubeflow: Kubeflow
// - CriteriaPlatformNIM: NVIDIA NIM
// - CriteriaPlatformSlurm: SchedMD Slinky Slurm
// - CriteriaPlatformAny: Any platform (wildcard)
Expand Down
45 changes: 45 additions & 0 deletions pkg/recipe/doc_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
// Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package recipe

import (
"sort"
"testing"
)

// TestCriteriaPlatformConstantsMatchGetter guards against drift between the
// CriteriaPlatform* constants and the slice returned by
// GetCriteriaPlatformTypes(). Adding a new constant without registering it in
// the getter (or vice versa) is exactly the class of bug that left earlier
// platform-enum doc surfaces stale before this test existed.
func TestCriteriaPlatformConstantsMatchGetter(t *testing.T) {
declared := []string{
string(CriteriaPlatformDynamo),
string(CriteriaPlatformKubeflow),
string(CriteriaPlatformNIM),
string(CriteriaPlatformSlurm),
}
sort.Strings(declared)

got := GetCriteriaPlatformTypes()
if len(got) != len(declared) {
t.Fatalf("len(GetCriteriaPlatformTypes())=%d, declared constants=%d", len(got), len(declared))
}
for i, want := range declared {
if got[i] != want {
t.Errorf("GetCriteriaPlatformTypes()[%d] = %q, want %q (declared)", i, got[i], want)
}
}
}
Comment on lines +27 to +45
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Convert this to a table-driven test with multiple cases.

The guard is useful, but this file’s guideline requires table-driven tests with multiple test cases.

♻️ Proposed refactor
 func TestCriteriaPlatformConstantsMatchGetter(t *testing.T) {
 	declared := []string{
 		string(CriteriaPlatformDynamo),
 		string(CriteriaPlatformKubeflow),
 		string(CriteriaPlatformNIM),
 		string(CriteriaPlatformSlurm),
 	}
 	sort.Strings(declared)
 
-	got := GetCriteriaPlatformTypes()
-	if len(got) != len(declared) {
-		t.Fatalf("len(GetCriteriaPlatformTypes())=%d, declared constants=%d", len(got), len(declared))
-	}
-	for i, want := range declared {
-		if got[i] != want {
-			t.Errorf("GetCriteriaPlatformTypes()[%d] = %q, want %q (declared)", i, got[i], want)
-		}
-	}
+	got := GetCriteriaPlatformTypes()
+	tests := []struct {
+		name string
+		want []string
+	}{
+		{
+			name: "getter matches declared constants",
+			want: declared,
+		},
+		{
+			name: "getter output is alphabetically sorted",
+			want: func() []string {
+				sorted := append([]string(nil), got...)
+				sort.Strings(sorted)
+				return sorted
+			}(),
+		},
+	}
+
+	for _, tc := range tests {
+		t.Run(tc.name, func(t *testing.T) {
+			if len(got) != len(tc.want) {
+				t.Fatalf("len(GetCriteriaPlatformTypes())=%d, want=%d", len(got), len(tc.want))
+			}
+			for i, want := range tc.want {
+				if got[i] != want {
+					t.Errorf("GetCriteriaPlatformTypes()[%d] = %q, want %q", i, got[i], want)
+				}
+			}
+		})
+	}
 }

As per coding guidelines: "**/*_test.go: Always use table-driven tests with multiple test cases organized in a slice of structs".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/recipe/doc_test.go` around lines 27 - 45, The test
TestCriteriaPlatformConstantsMatchGetter is a single-case check and must be
converted into a table-driven test with multiple cases; replace the current body
with a slice of test cases (structs) each containing a name and expected
[]string (e.g., a case for the declared constants, a case for an empty/modified
set if relevant), iterate over cases using t.Run, call
GetCriteriaPlatformTypes() inside each subtest, compare lengths and elements and
fail via t.Fatalf/t.Errorf when mismatches occur; reference the existing
constants CriteriaPlatformDynamo, CriteriaPlatformKubeflow, CriteriaPlatformNIM,
CriteriaPlatformSlurm and the getter GetCriteriaPlatformTypes() and keep the
same comparison logic but executed per test case within
TestCriteriaPlatformConstantsMatchGetter.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushing back on this one. Project CLAUDE.md says "Table-driven tests (required for multiple cases)" — emphasis on multiple. This test has one invariant: the CriteriaPlatform* constants match what GetCriteriaPlatformTypes() returns. The constants and the getter are package-level singletons, so the "empty/modified set" case in your suggestion would test the framework, not the invariant under guard.

Converting would add ~20 lines of boilerplate for zero additional behavior coverage. Happy to revisit if a reviewer disagrees.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArangoGutierrez, you're right — the guideline says "required for multiple cases", and this test has exactly one meaningful invariant. My proposed second case ("getter output is alphabetically sorted") was a tautology: it sorted got against got, exercising nothing real. The pushback is well-founded and the current implementation is appropriate for a single-invariant guard test. Withdrawing the suggestion.

(ᵔᴥᵔ)


✏️ Learnings added
Learnt from: ArangoGutierrez
Repo: NVIDIA/aicr PR: 884
File: pkg/recipe/doc_test.go:27-45
Timestamp: 2026-05-15T14:08:57.038Z
Learning: In NVIDIA/aicr, CLAUDE.md's table-driven test requirement applies only when there are genuinely multiple test cases ("Table-driven tests (required for multiple cases)"). Single-invariant guard tests — such as `TestCriteriaPlatformConstantsMatchGetter` in `pkg/recipe/doc_test.go`, which checks that `CriteriaPlatform*` constants match `GetCriteriaPlatformTypes()` — do not need to be converted to table-driven form. Do not flag such tests for table-driven refactoring when there is only one real invariant to validate.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: njhensley
Repo: NVIDIA/aicr PR: 595
File: pkg/recipe/loader.go:30-35
Timestamp: 2026-04-16T18:20:41.689Z
Learning: Do not flag uses of `context.WithTimeout(context.Background(), timeout)` when they match the repo’s documented “GOOD” pattern (per `CLAUDE.md`). In particular, treat it as intentional/acceptable in `pkg/serializer` implementations such as `HTTPReader.Download` and `fromConfigMapWithKubeconfig`. Do not require threading `context.Context` through `serializer.FromFileWithKubeconfig` (or other cross-package signature changes) as a guideline violation; if callers need external cancellation propagation, they should use the existing `DownloadWithContext` API rather than changing wide call chains and tests.

Learnt from: njhensley
Repo: NVIDIA/aicr PR: 595
File: pkg/recipe/loader.go:56-60
Timestamp: 2026-04-16T18:20:53.204Z
Learning: In this repository’s Go code, if an error was already created/wrapped with `pkg/errors` (via `aicrerrors.New(...)` or `aicrerrors.WrapWithContext(...)`) and includes the correct structured error code (e.g., `ErrCodeInvalidRequest`, `ErrCodeTimeout`, `ErrCodeInternal`), then returning it should use the bare form `return nil, err` (or propagate the error as-is). Do not require “missing context” when the source already has proper codes; do not double-wrap it. Avoid changing the error by using `errors.Wrapf` (which can override the structured code) or `fmt.Errorf("%w", err)` in these cases, which is documented as an anti-pattern in `CLAUDE.md`.

Learnt from: mchmarny
Repo: NVIDIA/aicr PR: 728
File: pkg/recipe/criteria.go:801-820
Timestamp: 2026-05-01T23:59:30.754Z
Learning: In NVIDIA/aicr, use the idiomatic helper `errors.PropagateOrWrap(err, errCode, message)` from `pkg/errors` when handling errors in Go code. This helper should be preferred over manual `errors.As` + conditional return logic: it propagates `err` as-is if it already carries a `*StructuredError` code, and otherwise wraps `err` using the provided fallback `errCode` and `message`. Apply this pattern especially when calling functions from `pkg/serializer`, `pkg/recipe`, or other packages that may already return coded structured errors.

Learnt from: mchmarny
Repo: NVIDIA/aicr PR: 728
File: pkg/k8s/agent/job_watch_test.go:140-150
Timestamp: 2026-05-01T23:59:44.270Z
Learning: In Go unit tests, do not require or suggest adding `return` statements immediately after `t.Fatal()` / `t.Fatalf()` calls. These functions internally terminate the current test goroutine via `runtime.Goexit()`, so any explicit `return` afterward is redundant/unreachable. Review should focus on the logic before `t.Fatal*`; the absence of a trailing `return` should not be treated as a problem.

Learnt from: mchmarny
Repo: NVIDIA/aicr PR: 728
File: pkg/server/middleware_test.go:400-402
Timestamp: 2026-05-01T23:59:43.871Z
Learning: In Go test files, don’t require a `return` statement immediately after `t.Fatal(...)`/`t.Fatalf(...)` calls. `t.Fatal` calls `runtime.Goexit()`, so any following `return` is unreachable/redundant (though deferred functions will still run). If the lint/static checks already pass without the explicit `return`, avoid flagging the missing `return` in future reviews (e.g., in this repo’s Go tests).

Learnt from: njhensley
Repo: NVIDIA/aicr PR: 844
File: pkg/fingerprint/from_measurements.go:200-220
Timestamp: 2026-05-11T21:09:20.802Z
Learning: In this repo’s Go code, do not flag or request changes for `int64`→`int` casts when the value originates from a Go `int` and remains in-process (e.g., it round-trips via `measurement.Reading.Any()` but never crosses a system boundary like serialization/network/DB). Only validate/guard numeric ranges at system boundaries; adding `math.MaxInt32/MinInt32` clamping for such internal-only casts is an anti-pattern (per CLAUDE.md), since the scenario is guaranteed by internal/framework behavior and 64-bit Linux targets keep `int` effectively 64 bits.

Learnt from: kaponco
Repo: NVIDIA/aicr PR: 857
File: pkg/recipe/metadata.go:567-597
Timestamp: 2026-05-13T06:00:04.625Z
Learning: In NVIDIA/aicr’s overlay model, supported behavior does NOT include converting a ComponentTypeDirect component into ComponentTypeHelm or ComponentTypeKustomize (and not the reverse in those directions) via overlays. In `mergeComponentRef`, switch cases that appear to support Helm↔Direct or Kustomize↔Direct type changes (e.g., clearing `SourceFile`/`Olm`) are defensive dead code. Reviewers should not flag the absence of these transitions as a missing validation or bug, and should treat Direct→Helm/Kustomize overrides as outside the supported overlay model.

Loading