Context
We are integrating k8s-launch-kit into NVIDIA/aicr as a Go library — see tracking epic NVIDIA/aicr#827 and Phase 1 NVIDIA/aicr#828.
AICR's snapshot pipeline has a strict Collector contract: read-only, bounded, no cluster mutation, no required RBAC beyond list/get on standard objects. Today's l8k discovery flow does not fit this contract.
Problem
(*NetworkOperatorPlugin).DiscoverClusterConfig (pkg/networkoperatorplugin/discovery.go:99-260) is a mutating operation:
- Patches/creates
NicClusterPolicy to stand up nic-configuration-daemon.
- Execs into those daemon pods to read
/sys/module, nvidia-smi, etc.
- Reads
NicDevice CRs.
- Patches node labels (
nvidia.kubernetes-launch-kit.machine=…).
- Defer-cleanup of the patch.
Even ignoring the cluster-state changes, downstream collectors that import this code would inherit pod-exec RBAC and a hard dependency on network-operator being installed first.
The good news: discovery.go:112-126 already contains a reuseExisting short-circuit that does exactly the right thing when nic-configuration-daemon is already running — it skips the CR patch and proceeds straight to reading the daemons.
Request
Expose a stable, read-only entry point, e.g.:
// Package: github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin
// DiscoverReadOnly inspects an existing nic-configuration-daemon DaemonSet
// (and the NicDevice CRs it has populated) and returns the discovered
// cluster topology. It does NOT create or modify any cluster resources.
//
// Returns ErrNotInstalled if network-operator's nic-configuration-daemon
// is not running in the cluster.
func DiscoverReadOnly(ctx context.Context, c client.Client) (*config.LaunchKubernetesConfig, error)
Behavior:
- Use the existing
reuseExisting branch when daemons are present.
- Return a sentinel
ErrNotInstalled (or similar) when not.
- Never patch
NicClusterPolicy. Never patch node labels. Never create CRs.
- Pod-exec is acceptable (it's read-only against a pre-existing daemon), but document the RBAC requirement clearly.
Why this helps l8k
- Clean library contract for downstream tools beyond AICR.
- Decouples "discovery" from "deployment of the nic-configuration-daemon" — a useful separation in its own right.
- Existing CLI flow can call
DiscoverReadOnly first, fall back to today's mutating flow if it returns ErrNotInstalled.
Open questions
- Naming / package placement:
pkg/networkoperatorplugin.DiscoverReadOnly keeps it close to the existing code. A leaf package (pkg/discovery/readonly) with thinner deps would be even better for downstream library users — willing to discuss.
- Module dep cost for library consumers: importing
pkg/networkoperatorplugin today pulls Mellanox/network-operator, Mellanox/nic-configuration-operator, controller-runtime, etc. If the read-only path could live in a leaf package without the Mellanox CRD type imports, that would meaningfully reduce vendor cost downstream.
- Versioning: any plan for a
v1.0.0 tag? Library consumers would prefer to pin by semver rather than by SHA.
Happy to contribute the implementation if the API shape is acceptable.
Context
We are integrating k8s-launch-kit into NVIDIA/aicr as a Go library — see tracking epic NVIDIA/aicr#827 and Phase 1 NVIDIA/aicr#828.
AICR's snapshot pipeline has a strict
Collectorcontract: read-only, bounded, no cluster mutation, no required RBAC beyond list/get on standard objects. Today's l8k discovery flow does not fit this contract.Problem
(*NetworkOperatorPlugin).DiscoverClusterConfig(pkg/networkoperatorplugin/discovery.go:99-260) is a mutating operation:NicClusterPolicyto stand upnic-configuration-daemon./sys/module,nvidia-smi, etc.NicDeviceCRs.nvidia.kubernetes-launch-kit.machine=…).Even ignoring the cluster-state changes, downstream collectors that import this code would inherit pod-exec RBAC and a hard dependency on network-operator being installed first.
The good news:
discovery.go:112-126already contains areuseExistingshort-circuit that does exactly the right thing whennic-configuration-daemonis already running — it skips the CR patch and proceeds straight to reading the daemons.Request
Expose a stable, read-only entry point, e.g.:
Behavior:
reuseExistingbranch when daemons are present.ErrNotInstalled(or similar) when not.NicClusterPolicy. Never patch node labels. Never create CRs.Why this helps l8k
DiscoverReadOnlyfirst, fall back to today's mutating flow if it returnsErrNotInstalled.Open questions
pkg/networkoperatorplugin.DiscoverReadOnlykeeps it close to the existing code. A leaf package (pkg/discovery/readonly) with thinner deps would be even better for downstream library users — willing to discuss.pkg/networkoperatorplugintoday pullsMellanox/network-operator,Mellanox/nic-configuration-operator, controller-runtime, etc. If the read-only path could live in a leaf package without the Mellanox CRD type imports, that would meaningfully reduce vendor cost downstream.v1.0.0tag? Library consumers would prefer to pin by semver rather than by SHA.Happy to contribute the implementation if the API shape is acceptable.