95 changes: 95 additions & 0 deletions tekton/cronjobs/dogfooding/cluster-health-monitor/README.md
@@ -0,0 +1,95 @@
# Cluster Health Monitor

A Tekton Task that monitors the health of CronJobs, Jobs, PipelineRuns, and
TaskRuns in the dogfooding cluster. It runs as a CronJob that directly creates
a TaskRun (no EventListener needed), making it independent of the trigger
infrastructure it monitors.

## What it checks

### CronJob/Job checks (`check-cronjobs.sh`)

- **Stuck active jobs**: CronJobs with active jobs that never completed —
blocks all future runs under `concurrencyPolicy=Forbid`
- **Stale CronJobs**: CronJobs that haven't succeeded in a configurable
threshold (default: 48h) — catches silent failures
- **Image pull failures**: Pods stuck in `ImagePullBackOff`/`ErrImagePull`
- **Failed jobs**: Jobs with `status.failed > 0`

Suspended CronJobs are skipped, since suspension is a deliberate operator choice.
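The staleness check reduces to epoch arithmetic on `.status.lastSuccessfulTime`. A minimal standalone sketch (the timestamp below is a hard-coded example, not cluster data, and GNU `date -d` is assumed; the real script also handles busybox `date`):

```shell
# Sketch of the staleness rule: a CronJob counts as stale when its last
# success is older than the threshold (48h by default).
# LAST_SUCCESS is an example value standing in for .status.lastSuccessfulTime.
STALE_HOURS=48
NOW=$(date -u +%s)
LAST_SUCCESS="2024-01-01T06:00:00Z"
LAST_EPOCH=$(date -u -d "${LAST_SUCCESS}" +%s)
AGE_HOURS=$(( (NOW - LAST_EPOCH) / 3600 ))
if [ "${AGE_HOURS}" -gt "${STALE_HOURS}" ]; then
  echo "stale: last success ${AGE_HOURS}h ago"
fi
```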

### PipelineRun checks (`check-runs.sh`)

Filters results so that infrastructure problems stand out from ordinary test flakiness:

- **Infrastructure failures**: Always flagged regardless of rate —
`PipelineRunTimeout`, `TaskRunImagePullFailed`, `CouldntGetTask`, etc.
These indicate platform problems, not user code issues.
- **Consistently failing**: Pipelines where **all** of the last N runs
failed (default: N=3). Skips pipelines with mixed success/failure.
- **Regressions**: Subset of consistently failing pipelines that
previously had successes — flagged separately as regressions.
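The "all of the last N runs failed" rule can be sketched with `awk` over per-pipeline results. The here-doc below is toy data standing in for what the script derives from `kubectl` (one `<pipeline> <status>` pair per line, newest first); the real implementation may differ in detail:

```shell
# Toy sketch of the consistently-failing rule: flag a pipeline only when all
# of its last N runs failed; mixed success/failure is left alone.
N=3
RESULT=$(awk -v n="$N" '
  # Only consider the first n (newest) runs seen for each pipeline.
  seen[$1] < n { seen[$1]++; if ($2 == "Failed") failed[$1]++ }
  END { for (p in seen) if (seen[p] == n && failed[p] == n) print p }
' <<'EOF'
nightly-build Failed
nightly-build Failed
nightly-build Failed
ci-images Failed
ci-images Succeeded
ci-images Failed
EOF
)
echo "consistently failing: ${RESULT}"
```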

## How it alerts

When issues are detected, the report step creates a GitHub issue in
`tektoncd/plumbing` with structured labels (`area/infra`, `kind/monitoring`).
The reporter deduplicates: no new issue is created while a previous one is still open.

## Architecture

```
CronJob (daily at 06:00 UTC)
└── creates Job
└── creates TaskRun (via kubectl)
└── runs cluster-health-monitor Task
├── step 1: clone plumbing repo
├── step 2: check-cronjobs.sh (kubectl)
├── step 3: check-runs.sh (kubectl)
└── step 4: report.sh (creates GitHub issue)
```

The CronJob creates the TaskRun directly using `kubectl`, avoiding dependency
on EventListeners/TriggerTemplates. This means the monitor works even when the
trigger infrastructure is broken.

The Task clones the plumbing repository and runs the scripts from
`tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/`. This keeps the
logic in real shell scripts that maintainers can run and test locally.

## Running locally

All scripts live in [`scripts/`](scripts/) and can be run with `kubectl`
access to the cluster:

```bash
export KUBECONFIG=~/.kube/config.tekton-oracle

# Create a report directory
mkdir -p /tmp/health-report

# Run the checks
./scripts/check-cronjobs.sh /tmp/health-report # default: 48h stale threshold
./scripts/check-runs.sh /tmp/health-report # default: 3-run window

# View the report (skip report.sh to avoid creating a GitHub issue)
cat /tmp/health-report/health-report.md
cat /tmp/health-report/status
```
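The two check scripts share a simple status contract: each writes `healthy` or `unhealthy` to `<report-dir>/status`, which later steps can gate on. A minimal sketch of that gate (the directory path here is an example):

```shell
# Sketch of gating on the shared status file. Either check script overwrites
# <report-dir>/status with "unhealthy" as soon as it finds an issue.
REPORT_DIR=/tmp/health-report-demo
mkdir -p "${REPORT_DIR}"
echo "unhealthy" > "${REPORT_DIR}/status"   # simulate a failed check

if [ "$(cat "${REPORT_DIR}/status")" = "unhealthy" ]; then
  echo "issues detected - a real run would now execute report.sh"
fi
```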

### Script options

```bash
# Custom stale threshold (hours)
./scripts/check-cronjobs.sh /tmp/report 72

# Custom namespaces and window size
./scripts/check-runs.sh /tmp/report "default,tekton-nightly" 10
```

## RBAC

The `tekton-monitor` ServiceAccount needs:
- `get`, `list` on `cronjobs`, `jobs`, `pods` in `default` namespace
- `get`, `list` on `pipelineruns`, `taskruns` across monitored namespaces
- `create` on `taskruns` in `default` namespace (for the CronJob to create the TaskRun)
64 changes: 64 additions & 0 deletions tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml
@@ -0,0 +1,64 @@
# Copyright 2026 The Tekton Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: batch/v1
kind: CronJob
metadata:
name: cluster-health-monitor
namespace: default
spec:
schedule: "0 6 * * *" # Daily at 06:00 UTC
concurrencyPolicy: Replace
failedJobsHistoryLimit: 3
successfulJobsHistoryLimit: 3
jobTemplate:
metadata:
annotations:
managed-by: Tekton
spec:
activeDeadlineSeconds: 600 # 10 minute timeout to avoid stuck jobs
template:
spec:
serviceAccountName: tekton-monitor
containers:
- name: create-taskrun
# Fully-qualified image to avoid CRI-O short name issues
image: ghcr.io/tektoncd/plumbing/kubectl
command:
- /bin/sh
args:
- -ce
- |
cat <<'EOF' | kubectl create -f -
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
generateName: cluster-health-monitor-
namespace: default
labels:
app: cluster-health-monitor
spec:
serviceAccountName: tekton-monitor
taskRef:
name: cluster-health-monitor
params:
- name: namespaces
value: "default,tekton-ci,tekton-nightly,bastion-p,bastion-z"
workspaces:
- name: source
emptyDir: {}
- name: report
emptyDir: {}
EOF
echo "TaskRun created successfully"
restartPolicy: Never
@@ -0,0 +1,10 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

commonAnnotations:
managed-by: Tekton

resources:
- rbac.yaml
- task.yaml
- cronjob.yaml
55 changes: 55 additions & 0 deletions tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml
@@ -0,0 +1,55 @@
# Copyright 2026 The Tekton Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: v1
kind: ServiceAccount
metadata:
name: tekton-monitor
namespace: default
---
# ClusterRole for reading cluster health data
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: tekton-monitor
rules:
# Read CronJobs and Jobs
- apiGroups: ["batch"]
resources: ["cronjobs", "jobs"]
verbs: ["get", "list"]
# Read Pods (for status/events)
- apiGroups: [""]
resources: ["pods", "events"]
verbs: ["get", "list"]
# Read PipelineRuns and TaskRuns
- apiGroups: ["tekton.dev"]
resources: ["pipelineruns", "taskruns"]
verbs: ["get", "list"]
# Create TaskRuns (for the CronJob to create the monitor TaskRun)
- apiGroups: ["tekton.dev"]
resources: ["taskruns"]
verbs: ["create"]
---
# Bind across all monitored namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: tekton-monitor
subjects:
- kind: ServiceAccount
name: tekton-monitor
namespace: default
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: tekton-monitor
@@ -0,0 +1,151 @@
#!/bin/sh
# Check CronJob and Job health in the cluster.
#
# Detects:
# - CronJobs with stuck active jobs (blocking concurrencyPolicy=Forbid)
# - CronJobs that haven't succeeded in a long time
# - Pods with ImagePullBackOff errors
# - Failed Jobs
#
# Usage:
# ./check-cronjobs.sh <report-dir> [stale-threshold-hours]
#
# stale-threshold-hours: flag CronJobs with no success in this many hours (default: 48)
#
# Outputs:
# <report-dir>/health-report.md — markdown report (created/appended)
# <report-dir>/status — "healthy" or "unhealthy"
#
# Requirements: kubectl
set -e

REPORT_DIR="${1:?Usage: $0 <report-dir> [stale-threshold-hours]}"
STALE_HOURS="${2:-48}"

REPORT_FILE="${REPORT_DIR}/health-report.md"
STATUS_FILE="${REPORT_DIR}/status"

# Initialize report if it doesn't exist
if [ ! -f "${REPORT_FILE}" ]; then
echo "# Cluster Health Report" > "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"
echo "**Generated:** $(date -u '+%Y-%m-%d %H:%M:%S UTC')" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"
echo "healthy" > "${STATUS_FILE}"
fi

# Parse ISO 8601 timestamp to epoch seconds.
# Compatible with GNU date and busybox/alpine date.
parse_ts() {
date -u -d "$1" +%s 2>/dev/null ||
date -u -D "%Y-%m-%dT%H:%M:%SZ" -d "$1" +%s 2>/dev/null ||
echo "0"
}

NOW=$(date +%s)
STALE_THRESHOLD=$((STALE_HOURS * 3600))

# =========================================================
# 1. CronJobs with stuck active jobs
# =========================================================
echo "## CronJob Health" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"

CRONJOB_ISSUES=""

STUCK_CJS=""
for cj in $(kubectl get cronjobs -n default -o jsonpath='{range .items[?(@.status.active)]}{.metadata.name}{"\n"}{end}' 2>/dev/null); do
POLICY=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.spec.concurrencyPolicy}' 2>/dev/null)
LAST_SUCCESS=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.status.lastSuccessfulTime}' 2>/dev/null)
LAST_SCHEDULE=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.status.lastScheduleTime}' 2>/dev/null)
STUCK_CJS="${STUCK_CJS}- \`${cj}\` (policy: ${POLICY}, last success: ${LAST_SUCCESS:-never}, last scheduled: ${LAST_SCHEDULE:-never})\n"
echo "unhealthy" > "${STATUS_FILE}"
done

if [ -n "${STUCK_CJS}" ]; then
CRONJOB_ISSUES="${CRONJOB_ISSUES}### CronJobs with Active (Stuck) Jobs\n\n"
CRONJOB_ISSUES="${CRONJOB_ISSUES}These CronJobs have active jobs that haven't completed. If \`concurrencyPolicy=Forbid\`, no new runs can be scheduled.\n\n"
CRONJOB_ISSUES="${CRONJOB_ISSUES}${STUCK_CJS}\n"
fi

# =========================================================
# 2. CronJobs that haven't succeeded in a long time
# =========================================================
STALE_CJS=""
# Get all non-suspended CronJobs with their lastSuccessfulTime and creation time
CRONJOBS_DATA=$(kubectl get cronjobs -n default -o jsonpath='{range .items[?(@.spec.suspend!=true)]}{.metadata.name}{"\t"}{.status.lastSuccessfulTime}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}' 2>/dev/null)

echo "${CRONJOBS_DATA}" | while IFS="$(printf '\t')" read -r name last_success created; do
[ -z "${name}" ] && continue

if [ -z "${last_success}" ]; then
# Never succeeded — only flag if the CronJob is older than the threshold
CREATED_EPOCH=$(parse_ts "${created}")
AGE=$((NOW - CREATED_EPOCH))
if [ "${AGE}" -gt "${STALE_THRESHOLD}" ]; then
printf -- "- \`%s\` — **never succeeded** (created: %s)\n" "${name}" "${created}" >> "${REPORT_DIR}/stale.tmp"
echo "unhealthy" > "${STATUS_FILE}"
fi
else
LAST_EPOCH=$(parse_ts "${last_success}")
AGE=$((NOW - LAST_EPOCH))
if [ "${AGE}" -gt "${STALE_THRESHOLD}" ]; then
HOURS_AGO=$((AGE / 3600))
printf -- "- \`%s\` — last success **%dh ago** (%s)\n" "${name}" "${HOURS_AGO}" "${last_success}" >> "${REPORT_DIR}/stale.tmp"
echo "unhealthy" > "${STATUS_FILE}"
fi
fi
done

if [ -f "${REPORT_DIR}/stale.tmp" ]; then
CRONJOB_ISSUES="${CRONJOB_ISSUES}### CronJobs Without Recent Success (>${STALE_HOURS}h)\n\n"
CRONJOB_ISSUES="${CRONJOB_ISSUES}$(cat "${REPORT_DIR}/stale.tmp")\n\n"
rm -f "${REPORT_DIR}/stale.tmp"
fi

if [ -z "${CRONJOB_ISSUES}" ]; then
echo "✅ All CronJobs healthy" >> "${REPORT_FILE}"
else
printf "%b" "${CRONJOB_ISSUES}" >> "${REPORT_FILE}"
fi
echo "" >> "${REPORT_FILE}"

# =========================================================
# 3. Pods with image pull failures
# =========================================================
echo "## Job Health" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"

JOB_ISSUES=""

PULL_FAILURES=$(kubectl get pods -n default \
--field-selector=status.phase!=Succeeded,status.phase!=Running \
-o custom-columns='POD:.metadata.name,STATUS:.status.containerStatuses[0].state.waiting.reason,IMAGE:.spec.containers[0].image' \
--no-headers 2>/dev/null | grep -i "ImagePull\|ErrImage" || true)
if [ -n "${PULL_FAILURES}" ]; then
JOB_ISSUES="${JOB_ISSUES}### Pods with Image Pull Failures\n\n"
JOB_ISSUES="${JOB_ISSUES}\`\`\`\n${PULL_FAILURES}\n\`\`\`\n\n"
echo "unhealthy" > "${STATUS_FILE}"
fi

# =========================================================
# 4. Failed jobs
# =========================================================
FAILED_JOBS=$(kubectl get jobs -n default \
-o custom-columns='NAME:.metadata.name,FAILED:.status.failed,CREATED:.metadata.creationTimestamp' \
--no-headers 2>/dev/null | awk '$2 ~ /^[0-9]+$/ && $2 > 0 {print}' || true)
if [ -n "${FAILED_JOBS}" ]; then
JOB_ISSUES="${JOB_ISSUES}### Failed Jobs\n\n"
JOB_ISSUES="${JOB_ISSUES}\`\`\`\n${FAILED_JOBS}\n\`\`\`\n\n"
echo "unhealthy" > "${STATUS_FILE}"
fi

if [ -z "${JOB_ISSUES}" ]; then
echo "✅ All Jobs healthy" >> "${REPORT_FILE}"
else
printf "%b" "${JOB_ISSUES}" >> "${REPORT_FILE}"
fi
echo "" >> "${REPORT_FILE}"

echo "=== CronJob/Job check complete ==="
cat "${REPORT_FILE}"