95 changes: 95 additions & 0 deletions tekton/cronjobs/dogfooding/cluster-health-monitor/README.md
@@ -0,0 +1,95 @@
# Cluster Health Monitor

A Tekton Task that monitors the health of CronJobs, Jobs, PipelineRuns, and
TaskRuns in the dogfooding cluster. It runs as a CronJob that directly creates
a TaskRun (no EventListener needed), making it independent of the trigger
infrastructure it monitors.

## What it checks

### CronJob/Job checks (`check-cronjobs.sh`)

- **Stuck active jobs**: CronJobs with active jobs that never completed —
blocks all future runs under `concurrencyPolicy=Forbid`
- **Stale CronJobs**: CronJobs that haven't succeeded in a configurable
threshold (default: 48h) — catches silent failures
- **Image pull failures**: Pods stuck in `ImagePullBackOff`/`ErrImagePull`
- **Failed jobs**: Jobs with `status.failed > 0`

Suspended CronJobs are skipped, since suspension is a deliberate operator choice.
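The staleness check reduces to epoch arithmetic on `.status.lastSuccessfulTime`. A minimal standalone sketch (the timestamp below is a hard-coded example, not cluster data, and GNU `date -d` is assumed; the real script also handles busybox `date`):

```shell
# Sketch of the staleness rule: a CronJob counts as stale when its last
# success is older than the threshold (48h by default).
# LAST_SUCCESS is an example value standing in for .status.lastSuccessfulTime.
STALE_HOURS=48
NOW=$(date -u +%s)
LAST_SUCCESS="2024-01-01T06:00:00Z"
LAST_EPOCH=$(date -u -d "${LAST_SUCCESS}" +%s)
AGE_HOURS=$(( (NOW - LAST_EPOCH) / 3600 ))
if [ "${AGE_HOURS}" -gt "${STALE_HOURS}" ]; then
  echo "stale: last success ${AGE_HOURS}h ago"
fi
```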

### PipelineRun checks (`check-runs.sh`)

Filters results so that infrastructure problems stand out from ordinary test flakiness:

- **Infrastructure failures**: Always flagged regardless of rate —
`PipelineRunTimeout`, `TaskRunImagePullFailed`, `CouldntGetTask`, etc.
These indicate platform problems, not user code issues.
- **Consistently failing**: Pipelines where **all** of the last N runs
failed (default: N=3). Skips pipelines with mixed success/failure.
- **Regressions**: Subset of consistently failing pipelines that
previously had successes — flagged separately as regressions.
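The "all of the last N runs failed" rule can be sketched with `awk` over per-pipeline results. The here-doc below is toy data standing in for what the script derives from `kubectl` (one `<pipeline> <status>` pair per line, newest first); the real implementation may differ in detail:

```shell
# Toy sketch of the consistently-failing rule: flag a pipeline only when all
# of its last N runs failed; mixed success/failure is left alone.
N=3
RESULT=$(awk -v n="$N" '
  # Only consider the first n (newest) runs seen for each pipeline.
  seen[$1] < n { seen[$1]++; if ($2 == "Failed") failed[$1]++ }
  END { for (p in seen) if (seen[p] == n && failed[p] == n) print p }
' <<'EOF'
nightly-build Failed
nightly-build Failed
nightly-build Failed
ci-images Failed
ci-images Succeeded
ci-images Failed
EOF
)
echo "consistently failing: ${RESULT}"
```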

## How it alerts

When issues are detected, the report step creates a GitHub issue in
`tektoncd/plumbing` with structured labels (`area/infra`, `kind/monitoring`).
The reporter deduplicates: no new issue is created while a previous one is still open.

## Architecture

```
CronJob (daily at 06:00 UTC)
└── creates Job
└── creates TaskRun (via kubectl)
└── runs cluster-health-monitor Task
├── step 1: clone plumbing repo
├── step 2: check-cronjobs.sh (kubectl)
├── step 3: check-runs.sh (kubectl)
└── step 4: report.sh (creates GitHub issue)
```

The CronJob creates the TaskRun directly using `kubectl`, avoiding dependency
on EventListeners/TriggerTemplates. This means the monitor works even when the
trigger infrastructure is broken.

The Task clones the plumbing repository and runs the scripts from
`tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/`. This keeps the
logic in real shell scripts that maintainers can run and test locally.

## Running locally

All scripts live in [`scripts/`](scripts/) and can be run with `kubectl`
access to the cluster:

```bash
export KUBECONFIG=~/.kube/config.tekton-oracle

# Create a report directory
mkdir -p /tmp/health-report

# Run the checks
./scripts/check-cronjobs.sh /tmp/health-report # default: 48h stale threshold
./scripts/check-runs.sh /tmp/health-report # default: 3-run window

# View the report (skip report.sh to avoid creating a GitHub issue)
cat /tmp/health-report/health-report.md
cat /tmp/health-report/status
```
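The two check scripts share a simple status contract: each writes `healthy` or `unhealthy` to `<report-dir>/status`, which later steps can gate on. A minimal sketch of that gate (the directory path here is an example):

```shell
# Sketch of gating on the shared status file. Either check script overwrites
# <report-dir>/status with "unhealthy" as soon as it finds an issue.
REPORT_DIR=/tmp/health-report-demo
mkdir -p "${REPORT_DIR}"
echo "unhealthy" > "${REPORT_DIR}/status"   # simulate a failed check

if [ "$(cat "${REPORT_DIR}/status")" = "unhealthy" ]; then
  echo "issues detected - a real run would now execute report.sh"
fi
```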

### Script options

```bash
# Custom stale threshold (hours)
./scripts/check-cronjobs.sh /tmp/report 72

# Custom namespaces and window size
./scripts/check-runs.sh /tmp/report "default,tekton-nightly" 10
```

## RBAC

The `tekton-monitor` ServiceAccount needs:
- `get`, `list` on `cronjobs`, `jobs`, `pods` in `default` namespace
- `get`, `list` on `pipelineruns`, `taskruns` across monitored namespaces
- `create` on `taskruns` in `default` namespace (for the CronJob to create the TaskRun)
64 changes: 64 additions & 0 deletions tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml
@@ -0,0 +1,64 @@
# Copyright 2026 The Tekton Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: batch/v1
kind: CronJob
metadata:
name: cluster-health-monitor
namespace: default
spec:
schedule: "0 6 * * *" # Daily at 06:00 UTC
concurrencyPolicy: Replace
failedJobsHistoryLimit: 3
successfulJobsHistoryLimit: 3
jobTemplate:
metadata:
annotations:
managed-by: Tekton
spec:
activeDeadlineSeconds: 600 # 10 minute timeout to avoid stuck jobs
template:
spec:
serviceAccountName: tekton-monitor
containers:
- name: create-taskrun
# Fully-qualified image to avoid CRI-O short name issues
image: ghcr.io/tektoncd/plumbing/kubectl
command:
- /bin/sh
args:
- -ce
- |
cat <<'EOF' | kubectl create -f -
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
generateName: cluster-health-monitor-
namespace: default
labels:
app: cluster-health-monitor
spec:
serviceAccountName: tekton-monitor
taskRef:
name: cluster-health-monitor
params:
- name: namespaces
value: "default,tekton-ci,tekton-nightly,bastion-p,bastion-z"
workspaces:
- name: source
emptyDir: {}
- name: report
emptyDir: {}
EOF
echo "TaskRun created successfully"
restartPolicy: Never
@@ -0,0 +1,10 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

commonAnnotations:
managed-by: Tekton

resources:
- rbac.yaml
- task.yaml
- cronjob.yaml
55 changes: 55 additions & 0 deletions tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml
@@ -0,0 +1,55 @@
# Copyright 2026 The Tekton Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: v1
kind: ServiceAccount
metadata:
name: tekton-monitor
namespace: default
---
# ClusterRole for reading cluster health data
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: tekton-monitor
rules:
# Read CronJobs and Jobs
- apiGroups: ["batch"]
resources: ["cronjobs", "jobs"]
verbs: ["get", "list"]
# Read Pods (for status/events)
- apiGroups: [""]
resources: ["pods", "events"]
verbs: ["get", "list"]
# Read PipelineRuns and TaskRuns
- apiGroups: ["tekton.dev"]
resources: ["pipelineruns", "taskruns"]
verbs: ["get", "list"]
# Create TaskRuns (for the CronJob to create the monitor TaskRun)
- apiGroups: ["tekton.dev"]
resources: ["taskruns"]
verbs: ["create"]
---
# Bind across all monitored namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: tekton-monitor
subjects:
- kind: ServiceAccount
name: tekton-monitor
namespace: default
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: tekton-monitor
@@ -0,0 +1,151 @@
#!/bin/sh
# Check CronJob and Job health in the cluster.
#
# Detects:
# - CronJobs with stuck active jobs (blocking concurrencyPolicy=Forbid)
# - CronJobs that haven't succeeded in a long time
# - Pods with ImagePullBackOff errors
# - Failed Jobs
#
# Usage:
# ./check-cronjobs.sh <report-dir> [stale-threshold-hours]
#
# stale-threshold-hours: flag CronJobs with no success in this many hours (default: 48)
#
# Outputs:
# <report-dir>/health-report.md — markdown report (created/appended)
# <report-dir>/status — "healthy" or "unhealthy"
#
# Requirements: kubectl
set -e

REPORT_DIR="${1:?Usage: $0 <report-dir> [stale-threshold-hours]}"
STALE_HOURS="${2:-48}"

REPORT_FILE="${REPORT_DIR}/health-report.md"
STATUS_FILE="${REPORT_DIR}/status"

# Initialize report if it doesn't exist
if [ ! -f "${REPORT_FILE}" ]; then
echo "# Cluster Health Report" > "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"
echo "**Generated:** $(date -u '+%Y-%m-%d %H:%M:%S UTC')" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"
echo "healthy" > "${STATUS_FILE}"
fi

# Parse ISO 8601 timestamp to epoch seconds.
# Compatible with GNU date and busybox/alpine date.
parse_ts() {
date -u -d "$1" +%s 2>/dev/null ||
date -u -D "%Y-%m-%dT%H:%M:%SZ" -d "$1" +%s 2>/dev/null ||
echo "0"
}

NOW=$(date +%s)
STALE_THRESHOLD=$((STALE_HOURS * 3600))

# =========================================================
# 1. CronJobs with stuck active jobs
# =========================================================
echo "## CronJob Health" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"

CRONJOB_ISSUES=""

STUCK_CJS=""
for cj in $(kubectl get cronjobs -n default -o jsonpath='{range .items[?(@.status.active)]}{.metadata.name}{"\n"}{end}' 2>/dev/null); do
POLICY=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.spec.concurrencyPolicy}' 2>/dev/null)
LAST_SUCCESS=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.status.lastSuccessfulTime}' 2>/dev/null)
LAST_SCHEDULE=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.status.lastScheduleTime}' 2>/dev/null)
STUCK_CJS="${STUCK_CJS}- \`${cj}\` (policy: ${POLICY}, last success: ${LAST_SUCCESS:-never}, last scheduled: ${LAST_SCHEDULE:-never})\n"
echo "unhealthy" > "${STATUS_FILE}"
done

if [ -n "${STUCK_CJS}" ]; then
CRONJOB_ISSUES="${CRONJOB_ISSUES}### CronJobs with Active (Stuck) Jobs\n\n"
CRONJOB_ISSUES="${CRONJOB_ISSUES}These CronJobs have active jobs that haven't completed. If \`concurrencyPolicy=Forbid\`, no new runs can be scheduled.\n\n"
CRONJOB_ISSUES="${CRONJOB_ISSUES}${STUCK_CJS}\n"
fi

# =========================================================
# 2. CronJobs that haven't succeeded in a long time
# =========================================================
STALE_CJS=""
# Get all non-suspended CronJobs with their lastSuccessfulTime and creation time
CRONJOBS_DATA=$(kubectl get cronjobs -n default -o jsonpath='{range .items[?(@.spec.suspend!=true)]}{.metadata.name}{"\t"}{.status.lastSuccessfulTime}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}' 2>/dev/null)

echo "${CRONJOBS_DATA}" | while IFS="$(printf '\t')" read -r name last_success created; do
[ -z "${name}" ] && continue

if [ -z "${last_success}" ]; then
# Never succeeded — only flag if the CronJob is older than the threshold
CREATED_EPOCH=$(parse_ts "${created}")
AGE=$((NOW - CREATED_EPOCH))
if [ "${AGE}" -gt "${STALE_THRESHOLD}" ]; then
printf -- "- \`%s\` — **never succeeded** (created: %s)\n" "${name}" "${created}" >> "${REPORT_DIR}/stale.tmp"
echo "unhealthy" > "${STATUS_FILE}"
fi
else
LAST_EPOCH=$(parse_ts "${last_success}")
AGE=$((NOW - LAST_EPOCH))
if [ "${AGE}" -gt "${STALE_THRESHOLD}" ]; then
HOURS_AGO=$((AGE / 3600))
printf -- "- \`%s\` — last success **%dh ago** (%s)\n" "${name}" "${HOURS_AGO}" "${last_success}" >> "${REPORT_DIR}/stale.tmp"
echo "unhealthy" > "${STATUS_FILE}"
fi
fi
done

if [ -f "${REPORT_DIR}/stale.tmp" ]; then
CRONJOB_ISSUES="${CRONJOB_ISSUES}### CronJobs Without Recent Success (>${STALE_HOURS}h)\n\n"
CRONJOB_ISSUES="${CRONJOB_ISSUES}$(cat "${REPORT_DIR}/stale.tmp")\n\n"
rm -f "${REPORT_DIR}/stale.tmp"
fi

if [ -z "${CRONJOB_ISSUES}" ]; then
echo "✅ All CronJobs healthy" >> "${REPORT_FILE}"
else
printf "%b" "${CRONJOB_ISSUES}" >> "${REPORT_FILE}"
fi
echo "" >> "${REPORT_FILE}"

# =========================================================
# 3. Pods with image pull failures
# =========================================================
echo "## Job Health" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"

JOB_ISSUES=""

PULL_FAILURES=$(kubectl get pods -n default \
--field-selector=status.phase!=Succeeded,status.phase!=Running \
-o custom-columns='POD:.metadata.name,STATUS:.status.containerStatuses[0].state.waiting.reason,IMAGE:.spec.containers[0].image' \
--no-headers 2>/dev/null | grep -i "ImagePull\|ErrImage" || true)
if [ -n "${PULL_FAILURES}" ]; then
JOB_ISSUES="${JOB_ISSUES}### Pods with Image Pull Failures\n\n"
JOB_ISSUES="${JOB_ISSUES}\`\`\`\n${PULL_FAILURES}\n\`\`\`\n\n"
echo "unhealthy" > "${STATUS_FILE}"
fi

# =========================================================
# 4. Failed jobs
# =========================================================
FAILED_JOBS=$(kubectl get jobs -n default \
-o custom-columns='NAME:.metadata.name,FAILED:.status.failed,CREATED:.metadata.creationTimestamp' \
--no-headers 2>/dev/null | awk '$2 ~ /^[0-9]+$/ && $2 > 0 {print}' || true)
if [ -n "${FAILED_JOBS}" ]; then
JOB_ISSUES="${JOB_ISSUES}### Failed Jobs\n\n"
JOB_ISSUES="${JOB_ISSUES}\`\`\`\n${FAILED_JOBS}\n\`\`\`\n\n"
echo "unhealthy" > "${STATUS_FILE}"
fi

if [ -z "${JOB_ISSUES}" ]; then
echo "✅ All Jobs healthy" >> "${REPORT_FILE}"
else
printf "%b" "${JOB_ISSUES}" >> "${REPORT_FILE}"
fi
echo "" >> "${REPORT_FILE}"

echo "=== CronJob/Job check complete ==="
cat "${REPORT_FILE}"