
Fix/kaiwoqueueconfig failed on missing clusterqueue#437

salexo wants to merge 6 commits into main from
fix/kaiwoqueueconfig-failed-on-missing-clusterqueue
Conversation

salexo (Collaborator) commented Apr 1, 2026

Bug Fixes: KaiwoQueueConfig Transient FAILED Status on ClusterQueue Deletion

Background

KaiwoQueueConfig was observed entering a FAILED status transiently during normal project lifecycle operations (project creation, deletion, or any external removal of a ClusterQueue that is still present in the spec). The status would recover on the next reconcile, but the window was long enough to be visible and could cause downstream issues.


Root Causes

1. syncLocalQueues: IsNotFound on ClusterQueue Get treated as fatal

When syncLocalQueues iterated over the ClusterQueues in the spec, it called r.Get() on each one before managing its LocalQueue objects. If the ClusterQueue was not yet visible in the informer cache (either due to cache lag after syncClusterQueues had just created it in the same reconcile, or because it had been externally deleted while still referenced by the spec), r.Get() returned IsNotFound. This was treated as a hard failure: success = false, causing SyncKueueResources to return an error and the overall status to flip to FAILED.

This is the expected initial state during a reconcile that just created the ClusterQueue — it is inherently transient and should be handled gracefully.

File: internal/controller/kaiwoqueueconfig_controller.go

2. Stale LocalQueue cleanup: IsNotFound on Delete treated as fatal

syncLocalQueues maintains a map of "expected" LocalQueues as it iterates. When a ClusterQueue is skipped (via continue — either because it was not found, or for another reason), its associated LocalQueues are never added to the expected map. The subsequent stale-LocalQueue cleanup loop then considers those LocalQueues as stale and attempts to delete them. However, LocalQueues are owned by their ClusterQueue and are cascade-deleted when the ClusterQueue is removed. The delete therefore returns IsNotFound, which was treated as a hard failure: success = false.

This compounded root cause #1: even after fixing the Get failure, the cleanup loop would independently set success = false for every delete on an already-gone LocalQueue.

File: internal/controller/kaiwoqueueconfig_controller.go

3. All other sync-function delete loops: IsNotFound on Delete treated as fatal

The same defensive gap existed in the delete loops of:

  • syncResourceFlavors
  • syncClusterQueues (which also had an accidental duplicate r.Delete call — a second delete was being issued in the if condition of the same loop body)
  • syncTopologies
  • syncWorkloadPriorityClasses

Any of these could be triggered if an object was garbage-collected or externally removed between the List call and the Delete call in the same reconcile.

File: internal/controller/kaiwoqueueconfig_controller.go

4. syncTopologies: empty-named Topology causes Create to fail

CreateDefaultTopology populates spec.topologies using config.DefaultTopologyName from the KaiwoConfig spec. When KaiwoConfig is auto-created with an empty spec (the defaultTopologyName field is omitempty), this produces a Topology entry with metadata.name: "". The Kueue API server rejects the subsequent Create call with a validation error, setting success = false on every reconcile and keeping the KaiwoQueueConfig permanently in FAILED.

A separate fix for the root cause in CreateDefaultTopology (falling back to a constant when the configured name is empty) is tracked for a follow-up PR. This PR adds a defensive guard in syncTopologies that skips any topology entry with an empty name and logs a clear advisory message, preventing the API-server rejection from causing a permanent FAILED state.

File: internal/controller/kaiwoqueueconfig_controller.go

5. Missing Owns(&ClusterQueue{}) watch

The controller's SetupWithManager did not watch ClusterQueue objects owned by KaiwoQueueConfig. This meant that when syncClusterQueues re-created a deleted ClusterQueue, no new reconcile was enqueued. The LocalQueues for that ClusterQueue would therefore only be created on the next reconcile triggered by some other event (e.g. a Node change), which could be arbitrarily delayed.

File: internal/controller/kaiwoqueueconfig_controller.go

6. FAILED status not requeued

When SyncKueueResources returned an error, the Reconcile function set the status to FAILED but returned ctrl.Result{}, nil — meaning the controller would only retry if another event happened to trigger a reconcile. Combined with the missing Owns watch, this could leave the controller stuck in FAILED indefinitely.

File: internal/controller/kaiwoqueueconfig_controller.go


Fixes

| # | Location | Change |
|---|----------|--------|
| 1 | `syncLocalQueues` — ClusterQueue Get | IsNotFound → log at Info level ("ClusterQueue not yet available, deferring LocalQueue sync") and continue; only non-IsNotFound errors set `success = false` |
| 2 | `syncLocalQueues` — stale LocalQueue delete loop | Added `!errors.IsNotFound(err)` guard so cascade-deleted LocalQueues do not cause a failure |
| 3a | `syncResourceFlavors` delete loop | Added `!errors.IsNotFound(err)` guard |
| 3b | `syncClusterQueues` delete loop | Removed duplicate `r.Delete` call; added `!errors.IsNotFound(err)` guard |
| 3c | `syncTopologies` delete loop | Added `!errors.IsNotFound(err)` guard |
| 3d | `syncTopologies` create loop | Added empty-name guard: skip entries where `kueueTopology.Name == ""` with an advisory log message |
| 3e | `syncWorkloadPriorityClasses` delete loop | Added `!errors.IsNotFound(err)` guard |
| 4 | `SetupWithManager` | Added `Owns(&kueuev1beta1.ClusterQueue{})` so ClusterQueue create/delete events trigger a reconcile for the owning KaiwoQueueConfig |
| 5 | `Reconcile` — final return | When `queueConfig.Status.Status == FAILED`, return `ctrl.Result{RequeueAfter: 30 * time.Second}` instead of `ctrl.Result{}, nil` |

Testing

A new Chainsaw regression test is added at test/chainsaw/tests/standard/kaiwoqueueconfigs/clusterqueue-deletion/. It:

  1. Applies a KaiwoQueueConfig spec containing a namespaced ClusterQueue (fizz) and waits for a stable READY state.
  2. Runs three delete-and-recover cycles: directly deletes the ClusterQueue, triggers a reconcile, and continuously watches the KaiwoQueueConfig status — asserting that no FAILED transition is ever observed.
  3. After the final deletion, asserts that both the ClusterQueue and its LocalQueue are autonomously recreated within 90 s (exercising the Owns watch and RequeueAfter recovery path).

The test was verified to fail against the unfixed controller and pass against the fixed controller.

salexo added 3 commits April 1, 2026 09:15
…ssing ClusterQueue

Adds a Chainsaw test that exercises the bug where syncLocalQueues treats a
missing ClusterQueue (still present in the spec) as a fatal error, causing
KaiwoQueueConfig to transiently enter FAILED status during project
creation/deletion.

The test:
  1. Applies a KaiwoQueueConfig with a namespace-scoped ClusterQueue and
     confirms the controller reaches READY with both child objects present.
  2. Deletes the ClusterQueue directly three times in a loop (each iteration
     resets the race window between syncClusterQueues recreating the object
     and syncLocalQueues reading it back from the cache), triggering a
     reconcile each time and asserting the status never becomes FAILED.
  3. Asserts that the ClusterQueue and LocalQueue are eventually recreated
     without any external trigger (requires fix #2 RequeueAfter or fix #3
     Owns watch to also be present).
…e deletion

Root causes fixed:
- syncLocalQueues: IsNotFound on ClusterQueue Get was treated as fatal.
  The CQ may not yet be visible in the informer cache immediately after
  syncClusterQueues creates it in the same reconcile (cache lag), or it may
  have been externally deleted while still in the spec (expected transient
  state).  Now logs at Info level and continues without marking success=false.
- Stale LocalQueue delete loop: when a CQ was skipped via continue, its
  LocalQueues were absent from the expected-map and subsequently treated as
  stale.  Their cascade-delete (CQ deletion removes owned LQs) returned
  IsNotFound which was treated as fatal.  Now guarded with !IsNotFound.
- All other sync delete loops (ResourceFlavor, ClusterQueue, Topology,
  WorkloadPriorityClass): IsNotFound on Delete was treated as fatal.  The
  object is already gone — the desired state is achieved.  All guarded.
- syncClusterQueues delete loop: contained an accidental duplicate r.Delete
  call introduced by a shadowed err variable in the if-condition.  Removed.
- syncTopologies: added a defensive guard that skips spec entries with an
  empty topology name (logs an advisory) rather than letting the Kueue API
  server reject the Create and fail the entire sync.  Root cause of the
  empty name is addressed in a follow-up commit.
- SetupWithManager: added Owns(&ClusterQueue{}) so that ClusterQueue
  create/delete events trigger a reconcile for the owning KaiwoQueueConfig,
  ensuring LocalQueues are created promptly after their CQ is (re)created.
- Reconcile: return RequeueAfter:30s when status is FAILED so the controller
  self-heals even if no other event triggers a reconcile.

Test: extended Chainsaw regression test now runs three delete+recover cycles
and uses a continuous kubectl --watch stream to catch any FAILED transition,
however brief.
…ILED

Root cause: CreateDefaultTopology read config.DefaultTopologyName from the
KaiwoConfig spec without a fallback.  On clusters where KaiwoConfig was
created before the +kubebuilder:default="default-topology" marker was present
(or where the CRD default was not applied to existing objects), the field is
an empty string.  This produced a kaiwo.Topology with metadata.name="" which
was written into KaiwoQueueConfig.spec.topologies by EnsureKaiwoQueueConfig.
Every subsequent reconcile then tried to Create a Kueue Topology with no
name, the API server rejected it, and the status was permanently FAILED.

Changes:
- CreateDefaultTopology (internal/controller/utils/kueue.go): fall back to
  common.DefaultTopologyName ("default-topology") when
  config.DefaultTopologyName is empty.
- sanitizeTopologies (internal/controller/kaiwoqueueconfig_controller.go):
  filter out entries with empty names so that EnsureKaiwoQueueConfig's merge
  path never re-introduces them once removed.
- Reconcile (internal/controller/kaiwoqueueconfig_controller.go): on the
  non-dynamic path, after fetching an existing KaiwoQueueConfig, compare the
  sanitized topology list against the spec and patch the spec if any
  empty-named entries are found.  This self-heals upgraded clusters the first
  time the new controller reconciles, without requiring a manual
  delete/recreate of the KaiwoQueueConfig.
salexo requested review from AVSuni and bjorn-amd April 2, 2026 06:19
… events

The reconciler returned ctrl.Result{}, nil even when status was FAILED,
so the controller went silent after the second consecutive failure and
never retried. Add RequeueAfter:30s as a fallback.

Additionally, watch Namespace create/delete events so the controller
reacts immediately when a referenced namespace appears or disappears,
rather than waiting for the RequeueAfter window.

Includes Chainsaw tests for both recovery paths: removing the namespace
from the spec, and recreating the deleted namespace.
salexo force-pushed the fix/kaiwoqueueconfig-failed-on-missing-clusterqueue branch from d7865fc to 33ae9dd on April 8, 2026 10:17
salexo added 2 commits April 8, 2026 11:46
…ec-update

The JSON patch used a hardcoded index 0 for clusterQueues, which only
works when cq-ns-del-spec is the sole entry. On CI, the controller
auto-generates a kaiwo clusterQueue from node pools at index 0, pushing
cq-ns-del-spec to index 1. The patch silently modified the wrong entry.

Limit ClusterQueue-owned reconcile triggers to create/delete and generation-changing updates so status churn doesn't drive expensive full-sync loops. Also tighten the clusterqueue-deletion Chainsaw assertions around readiness and object existence.