Fix/kaiwoqueueconfig failed on missing clusterqueue #437
…ssing ClusterQueue
Adds a Chainsaw test that exercises the bug where syncLocalQueues treats a
missing ClusterQueue (still present in the spec) as a fatal error, causing
KaiwoQueueConfig to transiently enter FAILED status during project
creation/deletion.
The test:
1. Applies a KaiwoQueueConfig with a namespace-scoped ClusterQueue and
confirms the controller reaches READY with both child objects present.
2. Deletes the ClusterQueue directly three times in a loop (each iteration
resets the race window between syncClusterQueues recreating the object
and syncLocalQueues reading it back from the cache), triggering a
reconcile each time and asserting the status never becomes FAILED.
3. Asserts that the ClusterQueue and LocalQueue are eventually recreated
without any external trigger (requires fix #2 RequeueAfter or fix #3
Owns watch to also be present).
…e deletion
Root causes fixed:
- syncLocalQueues: IsNotFound on ClusterQueue Get was treated as fatal.
The CQ may not yet be visible in the informer cache immediately after
syncClusterQueues creates it in the same reconcile (cache lag), or it may
have been externally deleted while still in the spec (expected transient
state). Now logs at Info level and continues without marking success=false.
- Stale LocalQueue delete loop: when a CQ was skipped via continue, its
LocalQueues were absent from the expected-map and subsequently treated as
stale. Their cascade-delete (CQ deletion removes owned LQs) returned
IsNotFound which was treated as fatal. Now guarded with !IsNotFound.
- All other sync delete loops (ResourceFlavor, ClusterQueue, Topology,
WorkloadPriorityClass): IsNotFound on Delete was treated as fatal. The
object is already gone — the desired state is achieved. All guarded.
- syncClusterQueues delete loop: contained an accidental duplicate r.Delete
call introduced by a shadowed err variable in the if-condition. Removed.
- syncTopologies: added a defensive guard that skips spec entries with an
empty topology name (logs an advisory) rather than letting the Kueue API
server reject the Create and fail the entire sync. Root cause of the
empty name is addressed in a follow-up commit.
- SetupWithManager: added Owns(&ClusterQueue{}) so that ClusterQueue
create/delete events trigger a reconcile for the owning KaiwoQueueConfig,
ensuring LocalQueues are created promptly after their CQ is (re)created.
- Reconcile: return RequeueAfter:30s when status is FAILED so the controller
self-heals even if no other event triggers a reconcile.
Test: extended Chainsaw regression test now runs three delete+recover cycles
and uses a continuous kubectl --watch stream to catch any FAILED transition,
however brief.
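The "catch any FAILED transition, however brief" check reduces to scanning every line of a `kubectl get --watch` stream for the FAILED status. A stdlib-only sketch of that scan (the line format shown in `main` is an assumption for illustration, not the actual `kubectl` output):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scanForFailed consumes a watch stream line by line and reports the
// first line containing a FAILED status, if any. Because the watch
// emits a line per status change, even a brief transition is caught.
func scanForFailed(stream string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		if strings.Contains(sc.Text(), "FAILED") {
			return sc.Text(), true
		}
	}
	return "", false
}

func main() {
	// Hypothetical stream: name followed by status, one line per event.
	watch := "kaiwo READY\nkaiwo FAILED\nkaiwo READY\n"
	line, seen := scanForFailed(watch)
	fmt.Println(seen, line)
}
```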
…ILED
Root cause: CreateDefaultTopology read config.DefaultTopologyName from the
KaiwoConfig spec without a fallback. On clusters where KaiwoConfig was
created before the +kubebuilder:default="default-topology" marker was present
(or where the CRD default was not applied to existing objects), the field is
an empty string. This produced a kaiwo.Topology with metadata.name="" which
was written into KaiwoQueueConfig.spec.topologies by EnsureKaiwoQueueConfig.
Every subsequent reconcile then tried to Create a Kueue Topology with no
name, the API server rejected it, and the status was permanently FAILED.
Changes:
- CreateDefaultTopology (internal/controller/utils/kueue.go): fall back to
common.DefaultTopologyName ("default-topology") when
config.DefaultTopologyName is empty.
- sanitizeTopologies (internal/controller/kaiwoqueueconfig_controller.go):
filter out entries with empty names so that EnsureKaiwoQueueConfig's merge
path never re-introduces them once removed.
- Reconcile (internal/controller/kaiwoqueueconfig_controller.go): on the
non-dynamic path, after fetching an existing KaiwoQueueConfig, compare the
sanitized topology list against the spec and patch the spec if any
empty-named entries are found. This self-heals upgraded clusters the first
time the new controller reconciles, without requiring a manual
delete/recreate of the KaiwoQueueConfig.
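The two changes compose: the fallback stops new empty names from being produced, and the sanitize pass removes any that already leaked into the spec. A stdlib-only sketch of both, assuming the constant value "default-topology" from the kubebuilder marker (function names here are illustrative; the real code lives in `CreateDefaultTopology` and `sanitizeTopologies` and operates on topology objects, not bare strings):

```go
package main

import "fmt"

// defaultTopologyName mirrors common.DefaultTopologyName.
const defaultTopologyName = "default-topology"

// effectiveTopologyName sketches the CreateDefaultTopology fallback:
// an empty configured name yields the package default instead of
// producing a Topology with metadata.name: "".
func effectiveTopologyName(configured string) string {
	if configured == "" {
		return defaultTopologyName
	}
	return configured
}

// sanitizeTopologies sketches the filter that drops empty-named
// entries so the merge path never re-introduces them once removed.
func sanitizeTopologies(names []string) []string {
	var out []string
	for _, n := range names {
		if n != "" {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	fmt.Println(effectiveTopologyName(""))
	fmt.Println(sanitizeTopologies([]string{"", "default-topology"}))
}
```

On an upgraded cluster the sanitized list differs from the stored spec exactly when an empty-named entry is present, which is what triggers the self-healing patch described above.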
… events
The reconciler returned ctrl.Result{}, nil even when status was FAILED,
so the controller went silent after the second consecutive failure and
never retried. Add RequeueAfter:30s as a fallback.
Additionally, watch Namespace create/delete events so the controller
reacts immediately when a referenced namespace appears or disappears,
rather than waiting for the RequeueAfter window.
Includes Chainsaw tests for both recovery paths: removing the namespace
from the spec, and recreating the deleted namespace.
…ec-update

The JSON patch used a hardcoded index 0 for clusterQueues, which only works when cq-ns-del-spec is the sole entry. On CI, the controller auto-generates a kaiwo clusterQueue from node pools at index 0, pushing cq-ns-del-spec to index 1. The patch silently modified the wrong entry.
Limit ClusterQueue-owned reconcile triggers to create/delete and generation-changing updates so status churn doesn't drive expensive full-sync loops. Also tighten the clusterqueue-deletion chainsaw assertions around readiness and object existence.
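Filtering out status-only updates works because status writes do not bump `metadata.generation`, while spec changes do; controller-runtime ships this as `predicate.GenerationChangedPredicate`. A stdlib-only sketch of the underlying check (the `obj` type is an illustrative stand-in for object metadata):

```go
package main

import "fmt"

// obj stands in for an object's metadata; the real filter compares
// metav1.ObjectMeta generations via predicate.GenerationChangedPredicate.
type obj struct{ generation int64 }

// generationChanged mirrors the predicate: status-only updates leave
// metadata.generation untouched and are filtered out, so status churn
// on owned ClusterQueues no longer drives full-sync reconciles.
func generationChanged(oldObj, newObj obj) bool {
	return oldObj.generation != newObj.generation
}

func main() {
	fmt.Println(generationChanged(obj{1}, obj{1})) // status churn: false
	fmt.Println(generationChanged(obj{1}, obj{2})) // spec change: true
}
```

Create and delete events carry no old/new pair to compare, so they always pass through and still trigger a reconcile, which is exactly the behavior the fix needs.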
Bug Fixes:
`KaiwoQueueConfig`: Transient FAILED Status on ClusterQueue Deletion

Background

`KaiwoQueueConfig` was observed entering a `FAILED` status transiently during normal project lifecycle operations (project creation, deletion, or any external removal of a `ClusterQueue` that is still present in the spec). The status would recover on the next reconcile, but the window was long enough to be visible and could cause downstream issues.

Root Causes

1. `syncLocalQueues`: `IsNotFound` on ClusterQueue Get treated as fatal

When `syncLocalQueues` iterated over the ClusterQueues in the spec, it called `r.Get()` on each one before managing its `LocalQueue` objects. If the ClusterQueue was not yet visible in the informer cache (either due to cache lag after `syncClusterQueues` had just created it in the same reconcile, or because it had been externally deleted while still referenced by the spec), `r.Get()` returned `IsNotFound`. This was treated as a hard failure: `success = false`, causing `SyncKueueResources` to return an error and the overall status to flip to `FAILED`. Yet this is the expected initial state during a reconcile that just created the ClusterQueue: it is inherently transient and should be handled gracefully.

File: internal/controller/kaiwoqueueconfig_controller.go

2. Stale LocalQueue cleanup: `IsNotFound` on Delete treated as fatal

`syncLocalQueues` maintains a map of "expected" LocalQueues as it iterates. When a ClusterQueue is skipped (via `continue`, either because it was not found or for another reason), its associated LocalQueues are never added to the expected map. The subsequent stale-LocalQueue cleanup loop then considers those LocalQueues stale and attempts to delete them. However, LocalQueues are owned by their ClusterQueue and are cascade-deleted when the ClusterQueue is removed. The delete therefore returns `IsNotFound`, which was treated as a hard failure: `success = false`. This compounded root cause #1: even after fixing the Get failure, the cleanup loop would independently set `success = false` for every delete on an already-gone LocalQueue.

File: internal/controller/kaiwoqueueconfig_controller.go

3. All other sync-function delete loops: `IsNotFound` on Delete treated as fatal

The same defensive gap existed in the delete loops of `syncResourceFlavors`, `syncClusterQueues` (which also had an accidental duplicate `r.Delete` call: a second delete was being issued in the `if` condition of the same loop body), `syncTopologies`, and `syncWorkloadPriorityClasses`. Any of these could be triggered if an object was garbage-collected or externally removed between the `List` call and the `Delete` call in the same reconcile.

File: internal/controller/kaiwoqueueconfig_controller.go

4. `syncTopologies`: empty-named Topology causes `Create` to fail

`CreateDefaultTopology` populates `spec.topologies` using `config.DefaultTopologyName` from the `KaiwoConfig` spec. When `KaiwoConfig` is auto-created with an empty spec (the `defaultTopologyName` field is `omitempty`), this produces a `Topology` entry with `metadata.name: ""`. The Kueue API server rejects the subsequent `Create` call with a validation error, setting `success = false` on every reconcile and keeping the `KaiwoQueueConfig` permanently in `FAILED`. A separate fix for the root cause in `CreateDefaultTopology` (falling back to a constant when the configured name is empty) is tracked for a follow-up PR. This PR adds a defensive guard in `syncTopologies` that skips any topology entry with an empty name and logs a clear advisory message, preventing the API-server rejection from causing a permanent `FAILED` state.

File: internal/controller/kaiwoqueueconfig_controller.go

5. Missing `Owns(&ClusterQueue{})` watch

The controller's `SetupWithManager` did not watch `ClusterQueue` objects owned by `KaiwoQueueConfig`. This meant that when `syncClusterQueues` re-created a deleted ClusterQueue, no new reconcile was enqueued. The LocalQueues for that ClusterQueue would therefore only be created on the next reconcile triggered by some other event (e.g. a Node change), which could be arbitrarily delayed.

File: internal/controller/kaiwoqueueconfig_controller.go

6. FAILED status not requeued

When `SyncKueueResources` returned an error, the `Reconcile` function set the status to `FAILED` but returned `ctrl.Result{}, nil`, meaning the controller would only retry if another event happened to trigger a reconcile. Combined with the missing `Owns` watch, this could leave the controller stuck in `FAILED` indefinitely.

File: internal/controller/kaiwoqueueconfig_controller.go

Fixes

- `syncLocalQueues`, ClusterQueue `Get`: on `IsNotFound`, log at `Info` level ("ClusterQueue not yet available, deferring LocalQueue sync") and `continue`; only non-`IsNotFound` errors set `success = false`
- `syncLocalQueues`, stale LocalQueue delete loop: added `!errors.IsNotFound(err)` guard so cascade-deleted LocalQueues do not cause a failure
- `syncResourceFlavors` delete loop: added `!errors.IsNotFound(err)` guard
- `syncClusterQueues` delete loop: removed the duplicate `r.Delete` call; added `!errors.IsNotFound(err)` guard
- `syncTopologies` delete loop: added `!errors.IsNotFound(err)` guard
- `syncTopologies` create loop: skip entries with `kueueTopology.Name == ""` and emit an advisory log message
- `syncWorkloadPriorityClasses` delete loop: added `!errors.IsNotFound(err)` guard
- `SetupWithManager`: added `Owns(&kueuev1beta1.ClusterQueue{})` so ClusterQueue create/delete events trigger a reconcile for the owning `KaiwoQueueConfig`
- `Reconcile`, final return: when `queueConfig.Status.Status == FAILED`, return `ctrl.Result{RequeueAfter: 30 * time.Second}` instead of `ctrl.Result{}, nil`

Testing

A new Chainsaw regression test is added at `test/chainsaw/tests/standard/kaiwoqueueconfigs/clusterqueue-deletion/`. It:

- applies a `KaiwoQueueConfig` spec containing a namespaced `ClusterQueue` (fizz) and waits for a stable `READY` state;
- deletes the `ClusterQueue`, triggers a reconcile, and continuously watches the `KaiwoQueueConfig` status, asserting that no `FAILED` transition is ever observed;
- verifies that the `ClusterQueue` and its `LocalQueue` are autonomously recreated within 90 s (exercising the `Owns` watch and `RequeueAfter` recovery path).

The test was verified to fail against the unfixed controller and pass against the fixed controller.