kind-traderx cluster unstable: control plane components in crash loop #9

@monadic

Problem

The kind-traderx cluster has degraded into an unstable state with control plane components failing:

kube-controller-manager-traderx-control-plane   0/1   Error              184 restarts
kube-scheduler-traderx-control-plane            0/1   CrashLoopBackOff   171 restarts

Root Cause

Restart storm: multiple workloads are restarting constantly and overwhelming the control plane:

Component                  Restarts   Namespace
ingress-nginx-controller   62         ingress-nginx
web-gui                    30         traderx-dev
database                   14         traderx-dev
salmon-cub-worker          13         confighub
reference-data             10         traderx-dev
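
A restart table like the one above can be rebuilt from `kubectl get pods -A --no-headers`. This is a minimal sketch, not the command used to produce the numbers in this issue: the helper name `restarts_over` and the threshold are illustrative, and it assumes the default column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# restarts_over MIN
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# "RESTARTS NAMESPACE POD" for every pod with at least MIN restarts,
# highest restart count first. (Hypothetical helper, not part of kubectl.)
restarts_over() {
  # Column 5 is RESTARTS; "+ 0" forces a numeric comparison.
  awk -v min="$1" '$5 + 0 >= min { print $5, $1, $2 }' | sort -rn
}

# Usage against the cluster (the timeout is an assumption, because the
# API server is slow to respond):
#   kubectl get pods -A --no-headers --request-timeout=30s | restarts_over 10
```

The same filter could feed an alert by checking whether the output is non-empty.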

Container stats show severe resource pressure:

  • CPU: 625% (extremely overloaded)
  • Block I/O: 64.9TB read / 1.8TB written (massive thrashing)

The scheduler and controller-manager are timing out on leader election because the API server can't respond in time:

E1214 10:52:59.031223 leaderelection.go:441] Failed to update lock: context deadline exceeded
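
To gauge how often leader election is failing, the lock-update errors can be counted in a component's log. This is a sketch only: `count_lock_failures` is a hypothetical helper, and the `kubectl logs` call in the usage comment assumes the API server still answers within the timeout.

```shell
# count_lock_failures
# Reads a log stream on stdin and prints how many lines report a failed
# leader-election lock update. (Hypothetical helper.)
count_lock_failures() {
  # grep -c prints 0 and exits non-zero when nothing matches, so mask the
  # exit status to keep the function usable under `set -e`.
  grep -c 'leaderelection.go.*Failed to update lock' || true
}

# Usage (logs from the previous, crashed container instance):
#   kubectl logs -n kube-system kube-scheduler-traderx-control-plane \
#     --previous --request-timeout=30s | count_lock_failures
```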

Impact

  • New pods cannot be scheduled
  • Deployments stay in "Progressing" forever
  • Control plane recovery is unlikely without intervention

Recommended Fix

Option A: Delete and recreate (recommended)

kind delete cluster --name traderx
kind create cluster --name traderx

Option B: Reduce load first

# Delete problematic namespaces
kubectl delete namespace traderx-dev --wait=false
kubectl delete namespace devops-apps --wait=false

# Wait for control plane to recover
sleep 60
kubectl get pods -n kube-system
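
A fixed `sleep 60` may be too short or too long. One alternative is to poll until a check succeeds or a deadline passes. The helper below is a sketch: `wait_until` is a hypothetical name, and the `tier=control-plane` label in the usage comment assumes the kubeadm-style static pod labels that kind nodes normally carry.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds (returns 0)
# or roughly TIMEOUT_SECONDS have elapsed (returns 1).
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Usage: wait up to 5 minutes for the control plane pods to become Ready.
#   wait_until 300 10 kubectl wait --for=condition=Ready pod \
#     -l tier=control-plane -n kube-system --timeout=5s
```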

Prevention

  1. Add resource limits to all deployments
  2. Add pod disruption budgets
  3. Monitor restart counts and alert on thresholds
  4. Consider multi-node kind cluster for better isolation
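
For item 4, a multi-node cluster is described by a kind config file. The sketch below writes a config with one control-plane node and two workers. The node counts, the file name, and the helper name `write_kind_config` are assumptions, not values taken from this issue.

```shell
# write_kind_config PATH
# Writes a kind cluster config with one control-plane node and two
# workers to PATH. (Hypothetical helper; node counts are illustrative.)
write_kind_config() {
  cat > "$1" <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF
}

# Usage:
#   write_kind_config kind-traderx.yaml
#   kind create cluster --name traderx --config kind-traderx.yaml
```

With workers present, application pods are scheduled onto the worker nodes by default, so a restart storm in a workload namespace competes less directly with the control plane for CPU and I/O.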

Discovered during first-try evaluation. Cluster has been in this state for ~44 hours.
