Fix ResourceFlavor immutability caused by TAS opt-in change#436
Open
d728b90 to
31534ce
Compare
The commit c23eccf moved kaiwo/worker=true outside the if/else in ConvertKaiwoToKueueResourceFlavor, causing spec drift on existing ResourceFlavors that had topologyName set. Since Kueue makes ResourceFlavor specs immutable once topologyName is set, the controller entered an infinite update-rejection loop. The flavor was also stuck deleting due to Kueue's resource-in-use finalizer.

Fixes:

- Restore kaiwo/worker label to only apply to flavors without topology, preventing spec drift on immutable ResourceFlavors
- Keep topologyName on auto-generated ResourceFlavors (enables TAS capability); TAS is opt-in at the workload level via preferredTopologyLabel/requiredTopologyLabel annotations
- Add replaceResourceFlavor helper in syncResourceFlavors that attempts in-place update, falls back to delete-and-recreate only on immutability errors (Invalid/Forbidden), skips terminating objects, and accepts eventual convergence across reconciliation cycles
- Reorder SyncKueueResources to sync topologies before resource flavors
- Update stale comments on PreferredTopologyLabel and DefaultTopologyName
- Add chainsaw tests for TAS flavor labels, topology/non-topology flavor behavior, mutable flavor updates, and topology migration path
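For illustration, the restored label placement might look roughly like the sketch below. It is not the code from this PR: `flavorSpec`, `buildFlavorSpec`, and the pool and topology names are hypothetical, and it models only the two label keys named in the commit message. Any other labels a flavor may carry are left out.

```go
package main

import "fmt"

// Label keys named in the commit message.
const (
	workerLabel   = "kaiwo/worker"
	nodepoolLabel = "kaiwo/nodepool"
)

// flavorSpec is a hypothetical stand-in for a Kueue ResourceFlavor spec.
// Only the two fields relevant to this fix are modeled.
type flavorSpec struct {
	NodeLabels   map[string]string
	TopologyName string // empty means the flavor has no topology
}

// buildFlavorSpec sketches the restored if/else: flavors with a topology get
// only the nodepool label, flavors without one get kaiwo/worker=true.
// The function shape and the nodepool value are assumptions.
func buildFlavorSpec(nodepool, topologyName string) flavorSpec {
	labels := map[string]string{}
	if topologyName != "" {
		// Topology-enabled specs are immutable in Kueue, so their label
		// set must stay stable across controller versions.
		labels[nodepoolLabel] = nodepool
	} else {
		labels[workerLabel] = "true"
	}
	return flavorSpec{NodeLabels: labels, TopologyName: topologyName}
}

func main() {
	withTopology := buildFlavorSpec("gpu-pool", "default-topology")
	withoutTopology := buildFlavorSpec("gpu-pool", "")
	fmt.Println(withTopology.NodeLabels)
	fmt.Println(withoutTopology.NodeLabels)
}
```

Keeping the worker label inside the else branch means the desired spec of a topology-enabled flavor never differs from the stored one, so the controller has no reason to attempt an update that Kueue would reject.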
31534ce to
ff9a259
Compare
Regenerate CRD manifests and reference docs to reflect the updated Go comments for defaultTopologyName and preferredTopologyLabel. Add TAS documentation to the admin configuration guide (topologies, topologyName, immutability warning, two-layer opt-in model) and the scientist scheduling guide (field semantics, examples, prerequisites).
williamanzen
previously approved these changes
Mar 24, 2026
Remove Kueue's resource-in-use finalizer before deleting an immutable ResourceFlavor during replace. Without this, the finalizer blocks deletion indefinitely when a ClusterQueue still references the flavor, preventing the replacement from ever being created. Also add retry/wait loops in the auto-generated-flavors-topology-and-labels Chainsaw test to handle timing differences between controller reconciliation and test assertions.
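A minimal sketch of the finalizer-stripping step, assuming the finalizer string is `kueue.x-k8s.io/resource-in-use` (the name I believe Kueue uses; verify against the Kueue API constants). `removeFinalizer` is a hypothetical helper, not the function in this PR. In a real controller the filtered list would be written back to the object before the delete call, and controller-runtime's `controllerutil.RemoveFinalizer` covers the same need.

```go
package main

import "fmt"

// resourceInUseFinalizer is assumed to be the name of Kueue's
// resource-in-use finalizer. Treat the exact string as an assumption.
const resourceInUseFinalizer = "kueue.x-k8s.io/resource-in-use"

// removeFinalizer returns a copy of finalizers without name, and reports
// whether anything was removed. The input slice is left unmodified.
func removeFinalizer(finalizers []string, name string) ([]string, bool) {
	kept := make([]string, 0, len(finalizers))
	removed := false
	for _, f := range finalizers {
		if f == name {
			removed = true
			continue
		}
		kept = append(kept, f)
	}
	return kept, removed
}

func main() {
	// "example.com/other" is a made-up second finalizer for the demo.
	current := []string{resourceInUseFinalizer, "example.com/other"}
	kept, removed := removeFinalizer(current, resourceInUseFinalizer)
	fmt.Println(kept, removed)
}
```

Only the one named finalizer is dropped. Finalizers owned by other controllers stay in place, so their cleanup logic still runs before the object disappears.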
Stabilize ResourceFlavor reconciliation for TAS opt-in model

Background

Prior to PR #432, Topology Aware Scheduling (TAS) was active by default for all workloads. Every auto-generated ResourceFlavor referenced a topology, and every pod automatically received a `podset-preferred-topology` annotation.

PR #432 changed TAS to an opt-in model: workloads must explicitly set `preferredTopologyLabel` or `requiredTopologyLabel` to activate topology-aware scheduling. The pod annotation default was correctly removed, but the ResourceFlavor conversion and reconciliation logic needed further adjustments to fully support this model.

Changes

1. Correct `kaiwo/worker` label placement (`ConvertKaiwoToKueueResourceFlavor`)

The `kaiwo/worker=true` label is now added only to ResourceFlavors that do not have a `topologyName` set. Flavors with `topologyName` use `kaiwo/nodepool` exclusively. This avoids spec drift on topology-enabled flavors, whose specs Kueue treats as immutable.

2. Restore topology reference on auto-generated flavors (`CreateDefaultResourceFlavors`)

Auto-generated ResourceFlavors retain their `topologyName` reference to the default topology. Kueue requires a flavor to reference a topology for TAS to function when a workload opts in. The opt-in gate sits at the workload level (via annotations), not at the flavor level.

3. Handle immutable ResourceFlavor updates (`syncResourceFlavors` / `replaceResourceFlavor`)

Added a `replaceResourceFlavor` helper that:

- Attempts an in-place update first
- On `Invalid` or `Forbidden` errors (indicating immutability), falls back to delete-and-recreate
- Skips objects that are already terminating
- Accepts eventual convergence when the `resource-in-use` finalizer delays deletion, allowing convergence on subsequent reconciliation cycles

4. Sync topologies before flavors (`SyncKueueResources`)

Reordered resource synchronization so `Topology` objects are created before `ResourceFlavor` objects that may reference them.

5. Update documentation

- Updated stale comments on `PreferredTopologyLabel` and `DefaultTopologyName` that still described the old default-on behavior

Chainsaw tests

Added `test/chainsaw/tests/standard/kaiwoqueueconfigs/tas-opt-in/` with four test cases:

- Auto-generated flavors have `topologyName` set and do not carry `kaiwo/worker` in nodeLabels
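
The update-then-replace flow of `replaceResourceFlavor` described under change 3 can be sketched as below. This is an assumption-laden model, not the PR's implementation: `flavor`, `store`, `fakeStore`, the sentinel errors, and the returned action strings are all hypothetical, and real code would classify Kubernetes API status errors and use a real client.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for API server status errors.
var (
	errInvalid       = errors.New("invalid")
	errForbidden     = errors.New("forbidden")
	errAlreadyExists = errors.New("already exists")
)

// flavor is a simplified stand-in for a ResourceFlavor.
type flavor struct {
	Name        string
	Spec        string
	Terminating bool // models a non-zero deletionTimestamp
}

// store is a hypothetical subset of the client operations the helper needs.
type store interface {
	Update(f flavor) error
	Delete(name string) error
	Create(f flavor) error
}

// replaceFlavor tries an in-place update and recreates the object only when
// the update is rejected as immutable. It returns the action taken.
func replaceFlavor(s store, existing, desired flavor) (string, error) {
	if existing.Terminating {
		return "skipped", nil // deletion already in progress, retry later
	}
	err := s.Update(desired)
	if err == nil {
		return "updated", nil
	}
	if !errors.Is(err, errInvalid) && !errors.Is(err, errForbidden) {
		return "", err // not an immutability rejection, surface it
	}
	if err := s.Delete(existing.Name); err != nil {
		return "", err
	}
	if err := s.Create(desired); err != nil {
		if errors.Is(err, errAlreadyExists) {
			// The old object is still being finalized. A later
			// reconciliation cycle creates the replacement.
			return "pending", nil
		}
		return "", err
	}
	return "recreated", nil
}

// fakeStore records calls and simulates the two interesting failure modes.
type fakeStore struct {
	immutable bool // Update is rejected with errInvalid
	lingering bool // deleted object lingers, so Create fails
	calls     []string
}

func (f *fakeStore) Update(_ flavor) error {
	f.calls = append(f.calls, "update")
	if f.immutable {
		return errInvalid
	}
	return nil
}

func (f *fakeStore) Delete(_ string) error {
	f.calls = append(f.calls, "delete")
	return nil
}

func (f *fakeStore) Create(_ flavor) error {
	f.calls = append(f.calls, "create")
	if f.lingering {
		return errAlreadyExists
	}
	return nil
}

func main() {
	old := flavor{Name: "gpu", Spec: "v1"}
	want := flavor{Name: "gpu", Spec: "v2"}
	for _, s := range []*fakeStore{{}, {immutable: true}, {immutable: true, lingering: true}} {
		action, err := replaceFlavor(s, old, want)
		fmt.Println(action, err, s.calls)
	}
}
```

Restricting the fallback to immutability rejections keeps transient failures (timeouts, conflicts) from triggering a destructive delete, and treating a lingering object as "pending" rather than an error lets the next reconciliation finish the replacement.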