fix(e2e): unblock MicroShift SCC diagnostics + bump bootstrap timeout by Defilan · Pull Request #466 · defilantech/LLMKube

Defilan · 2026-05-13T19:44:02Z

What

The OpenShift SCC admission e2e Context has been flaking on multiple PRs (most recently #464) with no diagnostic signal: it times out at e2e_test.go:2092 waiting for the InferenceService Deployment to be created, but the failure dump shows kubectl logs -n llmkube-system (empty pod name) instead of controller logs.

This PR:

Resolves the controller pod lazily in AfterEach by label selector when controllerPodName is empty (so focused CI runs still produce logs).
Splits the single 120s Eventually into Model→Ready and Deployment→exists checkpoints so the next failure tells us which step stalled.
Bumps the OpenShift Eventually budget to 5 minutes (matches the workflow's helm install --wait budget; MicroShift-on-MINC is materially slower than kind).
Captures per-test namespace state on failure (Model/InferenceService/PVC/Pod listing + describe + events). The outer AfterEach only sees llmkube-system, so the namespace where the SCC story actually plays out was invisible.

Why

The MicroShift / MINC lane was failing closed: red CI signal with no actionable evidence. Either we ship without OpenShift coverage at all, or we improve the diagnostic posture so the next failure is debuggable. This is the latter.

The job is still continue-on-error: true while it stabilizes, so this is not a behavior change. It is an observability fix. No controller / production code is touched.

Fixes ongoing MicroShift e2e flake observed across recent PRs (#464 most recently).

How

controllerPodName resolution: when empty, do kubectl get pods -n llmkube-system -l control-plane=controller-manager -o jsonpath={.items[0].metadata.name} and use the result. Adds ~one kubectl call to the failure path only.
Per-test diagnostic dump: a defer inside the It block runs after the Eventually fails but before the spec returns. Dumps get/describe on Model, InferenceService, PVC, pods, and events in the test namespace.
Timeout bump: introduces a local openshiftEventuallyTimeout = 5 * time.Minute so the three Eventuallys in this Context share one knob. Was 2 minutes each, now 5 minutes each.
Intermediate Model→Ready check: HTTP-source Models reach Ready quickly (the controller's reconcileRuntimeResolvedSource marks Ready immediately and defers actual fetch to the InferenceService init container). If this Eventually fails, the bug is in Model reconcile; if Model reaches Ready but Deployment doesn't appear, the bug is in ensureModelCachePVC or reconcileDeployment.

Checklist

Tests added/updated (the test itself is what's changing)
make test passes locally (e2e package builds with -tags=e2e; unit-test path not exercised by this change)
make lint passes locally
Commit messages follow conventional commits
All commits are signed off (git commit -s) per DCO
Documentation updated (not user-facing; test-only change)

PR runs against the MicroShift / OpenShift SCC admission lane were failing with no observable signal: the Eventually at e2e_test.go:2092 timed out after 120s waiting for the InferenceService Deployment to appear, and the failure dump showed `kubectl logs <empty> -n llmkube-system` because controllerPodName was never populated for focus runs. So we had a flake with no diagnostics. Two fixes: 1. AfterEach now resolves the controller pod lazily by label selector when the variable is empty. The Manager Context's BeforeAll is the only path that previously populated it; -ginkgo.focus on the OpenShift SCC Context skipped that path entirely. 2. The OpenShift SCC test now: - Splits the single 120s Eventually into Model→Ready and Deployment→exists steps so a future failure tells us which stage stalled (HTTP-source Model goes Ready quickly; the Deployment then waits on ensureModelCachePVC + reconcileDeployment). - Bumps the Eventually budget to 5 minutes, matching the helm install --wait timeout the workflow uses elsewhere. MicroShift on MINC has materially slower first-reconcile than kind; 120s was always tight. - Captures per-test namespace state on failure (Model / InferenceService / PVC / Pod listing + describe + events), which the outer AfterEach can't see because it only looks at llmkube-system. This does not change any controller behavior or production code path; it changes the test's diagnostic posture so the next failure (if any) ships with actionable evidence instead of an empty log line. Signed-off-by: Christopher Maher <chris@mahercode.io>

codecov · 2026-05-13T19:47:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

The MicroShift / MINC e2e lane has been flaking on the OpenShift SCC admission test with a confusing error that PR defilantech#466's improved diagnostics finally surfaced: deployments.apps "scc-test-inference" is forbidden: cannot set blockOwnerDeletion in this case because cannot find RESTMapping for APIVersion inference.llmkube.dev/v1alpha1 Kind InferenceService: no matches for kind "InferenceService" in version "inference.llmkube.dev/v1alpha1" Trace: controller-runtime's controllerutil.SetControllerReference sets BlockOwnerDeletion=true by default. The kube-apiserver GarbageCollector admission plugin validates that flag by RESTMapping the owner Kind to check the caller's `update` permission on the owner's `finalizers` subresource. On kind the API server's discovery cache is warm by the time the controller starts; on MicroShift-in- MINC, the in-container apiserver populates discovery lazily on first request, and the controller's first reconcile races it and loses. We do not need BlockOwnerDeletion. Cascading delete still cleans up operator-managed Deployments/Services/ConfigMaps/HPAs when their parent ModelRouter or InferenceService is deleted; the "block" semantics only matter for finalizer-based cleanup workflows LLMKube does not use. Trade: cleaner bootstrap for slightly looser cascade ordering guarantees we weren't relying on. Fix: new setControllerReferenceUnblocked helper that wraps controllerutil.SetControllerReference and clears BlockOwnerDeletion on the owner ref it just installed. Applied to all 9 call sites in the controller package (router deployment / service / configmap, both InferenceService and ModelRouter reconcilers, HPA reconciler). Coverage: TestSetControllerReferenceUnblocked verifies the resulting owner ref has Controller=true and BlockOwnerDeletion=false. Existing tests in router_builders_test.go and modelrouter_controller_test.go continue to pass — the change is in the metadata, not the spec. Side benefit: removes the controllerutil import from five files that no longer need it. Signed-off-by: Christopher Maher <chris@mahercode.io>

Defilan merged commit 0c793b7 into defilantech:main May 13, 2026
8 checks passed

github-actions Bot mentioned this pull request May 13, 2026

chore: release 0.7.8 #443

Merged

Defilan mentioned this pull request May 14, 2026

fix: preserve external annotations on reconciler Deployment updates #468

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): unblock MicroShift SCC diagnostics + bump bootstrap timeout#466

fix(e2e): unblock MicroShift SCC diagnostics + bump bootstrap timeout#466
Defilan merged 1 commit into
defilantech:mainfrom
Defilan:fix/microshift-e2e-diagnostics

Defilan commented May 13, 2026

Uh oh!

codecov Bot commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Defilan commented May 13, 2026

What

Why

How

Checklist

Uh oh!

codecov Bot commented May 13, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant