Skip to content

fix(e2e): unblock MicroShift SCC diagnostics + bump bootstrap timeout#466

Merged
Defilan merged 1 commit into
defilantech:mainfrom
Defilan:fix/microshift-e2e-diagnostics
May 13, 2026
Merged

fix(e2e): unblock MicroShift SCC diagnostics + bump bootstrap timeout#466
Defilan merged 1 commit into
defilantech:mainfrom
Defilan:fix/microshift-e2e-diagnostics

Conversation

@Defilan
Copy link
Copy Markdown
Member

@Defilan Defilan commented May 13, 2026

What

The OpenShift SCC admission e2e Context has been flaking on multiple PRs (most recently #464) with no diagnostic signal: it times out at e2e_test.go:2092 waiting for the InferenceService Deployment to be created, but the failure dump shows kubectl logs -n llmkube-system (empty pod name) instead of controller logs.

This PR:

  • Resolves the controller pod lazily in AfterEach by label selector when controllerPodName is empty (so focused CI runs still produce logs).
  • Splits the single 120s Eventually into Model→Ready and Deployment→exists checkpoints so the next failure tells us which step stalled.
  • Bumps the OpenShift Eventually budget to 5 minutes (matches the workflow's helm install --wait budget; MicroShift-on-MINC is materially slower than kind).
  • Captures per-test namespace state on failure (Model/InferenceService/PVC/Pod listing + describe + events). The outer AfterEach only sees llmkube-system, so the namespace where the SCC story actually plays out was invisible.

Why

The MicroShift / MINC lane was failing closed: red CI signal with no actionable evidence. Either we ship without OpenShift coverage at all, or we improve the diagnostic posture so the next failure is debuggable. This is the latter.

The job is still continue-on-error: true while it stabilizes, so this is not a behavior change. It is an observability fix. No controller / production code is touched.

Fixes ongoing MicroShift e2e flake observed across recent PRs (#464 most recently).

How

  • controllerPodName resolution: when empty, do kubectl get pods -n llmkube-system -l control-plane=controller-manager -o jsonpath={.items[0].metadata.name} and use the result. Adds ~one kubectl call to the failure path only.
  • Per-test diagnostic dump: a defer inside the It block runs after the Eventually fails but before the spec returns. Dumps get/describe on Model, InferenceService, PVC, pods, and events in the test namespace.
  • Timeout bump: introduces a local openshiftEventuallyTimeout = 5 * time.Minute so the three Eventuallys in this Context share one knob. Was 2 minutes each, now 5 minutes each.
  • Intermediate Model→Ready check: HTTP-source Models reach Ready quickly (the controller's reconcileRuntimeResolvedSource marks Ready immediately and defers actual fetch to the InferenceService init container). If this Eventually fails, the bug is in Model reconcile; if Model reaches Ready but Deployment doesn't appear, the bug is in ensureModelCachePVC or reconcileDeployment.

Checklist

  • Tests added/updated (the test itself is what's changing)
  • make test passes locally (e2e package builds with -tags=e2e; unit-test path not exercised by this change)
  • make lint passes locally
  • Commit messages follow conventional commits
  • All commits are signed off (git commit -s) per DCO
  • Documentation updated (not user-facing; test-only change)

PR runs against the MicroShift / OpenShift SCC admission lane were
failing with no observable signal: the Eventually at
e2e_test.go:2092 timed out after 120s waiting for the InferenceService
Deployment to appear, and the failure dump showed `kubectl logs
<empty> -n llmkube-system` because controllerPodName was never
populated for focus runs. So we had a flake with no diagnostics.

Two fixes:

1. AfterEach now resolves the controller pod lazily by label selector
   when the variable is empty. The Manager Context's BeforeAll is the
   only path that previously populated it; -ginkgo.focus on the
   OpenShift SCC Context skipped that path entirely.

2. The OpenShift SCC test now:
   - Splits the single 120s Eventually into Model→Ready and
     Deployment→exists steps so a future failure tells us which
     stage stalled (HTTP-source Model goes Ready quickly; the
     Deployment then waits on ensureModelCachePVC + reconcileDeployment).
   - Bumps the Eventually budget to 5 minutes, matching the
     helm install --wait timeout the workflow uses elsewhere.
     MicroShift on MINC has materially slower first-reconcile than
     kind; 120s was always tight.
   - Captures per-test namespace state on failure (Model /
     InferenceService / PVC / Pod listing + describe + events), which
     the outer AfterEach can't see because it only looks at
     llmkube-system.

This does not change any controller behavior or production code path;
it changes the test's diagnostic posture so the next failure (if any)
ships with actionable evidence instead of an empty log line.

Signed-off-by: Christopher Maher <chris@mahercode.io>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@Defilan Defilan merged commit 0c793b7 into defilantech:main May 13, 2026
8 checks passed
@github-actions github-actions Bot mentioned this pull request May 13, 2026
Defilan added a commit to Defilan/LLMKube that referenced this pull request May 14, 2026
The MicroShift / MINC e2e lane has been flaking on the OpenShift SCC
admission test with a confusing error that PR defilantech#466's improved
diagnostics finally surfaced:

  deployments.apps "scc-test-inference" is forbidden:
    cannot set blockOwnerDeletion in this case because cannot find
    RESTMapping for APIVersion inference.llmkube.dev/v1alpha1
    Kind InferenceService:
    no matches for kind "InferenceService" in version
    "inference.llmkube.dev/v1alpha1"

Trace: controller-runtime's controllerutil.SetControllerReference sets
BlockOwnerDeletion=true by default. The kube-apiserver
GarbageCollector admission plugin validates that flag by RESTMapping
the owner Kind to check the caller's `update` permission on the
owner's `finalizers` subresource. On kind the API server's discovery
cache is warm by the time the controller starts; on MicroShift-in-
MINC, the in-container apiserver populates discovery lazily on first
request, and the controller's first reconcile races it and loses.

We do not need BlockOwnerDeletion. Cascading delete still cleans up
operator-managed Deployments/Services/ConfigMaps/HPAs when their
parent ModelRouter or InferenceService is deleted; the "block"
semantics only matter for finalizer-based cleanup workflows LLMKube
does not use. Trade: cleaner bootstrap for slightly looser cascade
ordering guarantees we weren't relying on.

Fix: new setControllerReferenceUnblocked helper that wraps
controllerutil.SetControllerReference and clears BlockOwnerDeletion
on the owner ref it just installed. Applied to all 9 call sites in
the controller package (router deployment / service / configmap, both
InferenceService and ModelRouter reconcilers, HPA reconciler).

Coverage: TestSetControllerReferenceUnblocked verifies the resulting
owner ref has Controller=true and BlockOwnerDeletion=false. Existing
tests in router_builders_test.go and modelrouter_controller_test.go
continue to pass — the change is in the metadata, not the spec.

Side benefit: removes the controllerutil import from five files that
no longer need it.

Signed-off-by: Christopher Maher <chris@mahercode.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant