From 445f26991d0114cce0d081c90e14f99e95eebb9d Mon Sep 17 00:00:00 2001 From: Vincent Demeester Date: Tue, 24 Feb 2026 11:42:05 +0100 Subject: [PATCH] Add cluster-health-monitor Task and CronJob for dogfooding cluster Add a lightweight monitoring solution for the Tekton dogfooding cluster that checks CronJob, Job, and PipelineRun health daily and creates GitHub issues when problems are detected. The Task clones plumbing and runs standalone scripts from scripts/, making it easy for maintainers to run checks locally with just kubectl access to the cluster. CronJob/Job checks (check-cronjobs.sh): - Stuck active jobs blocking concurrencyPolicy=Forbid - CronJobs that haven't succeeded in a configurable threshold - Pods with ImagePullBackOff errors - Failed jobs PipelineRun checks (check-runs.sh): - Infrastructure failures always flagged (timeouts, image pull, etc) - Consistently failing pipelines (all last N runs failed) - Regressions detected (was succeeding, now all-failing) - Flaky pipelines with mixed results are skipped (not actionable) Issue management (report.sh): - Creates a single issue per incident - Adds comment updates on subsequent runs - Auto-closes with resolved comment when cluster recovers Key design decisions: - CronJob creates TaskRun directly via kubectl (no EventListener), independent of the trigger infrastructure it monitors - concurrencyPolicy: Replace + activeDeadlineSeconds: 600 so the monitor itself can never get stuck - ghcr.io/ images only to avoid CRI-O short name issues Relates to #3119 --- .../cluster-health-monitor/README.md | 95 +++++++++++ .../cluster-health-monitor/cronjob.yaml | 64 ++++++++ .../cluster-health-monitor/kustomization.yaml | 10 ++ .../cluster-health-monitor/rbac.yaml | 55 +++++++ .../scripts/check-cronjobs.sh | 151 +++++++++++++++++ .../scripts/check-runs.sh | 151 +++++++++++++++++ .../cluster-health-monitor/scripts/report.sh | 152 ++++++++++++++++++ .../cluster-health-monitor/task.yaml | 79 +++++++++ 8 files changed, 757 insertions(+) create mode 100644 tekton/cronjobs/dogfooding/cluster-health-monitor/README.md create mode 100644 tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml create mode 100644 tekton/cronjobs/dogfooding/cluster-health-monitor/kustomization.yaml create mode 100644 tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml create mode 100755 tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh create mode 100755 tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh create mode 100755 tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh create mode 100644 tekton/cronjobs/dogfooding/cluster-health-monitor/task.yaml diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/README.md b/tekton/cronjobs/dogfooding/cluster-health-monitor/README.md new file mode 100644 index 000000000..40de1d060 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/README.md @@ -0,0 +1,95 @@ +# Cluster Health Monitor + +A Tekton Task that monitors the health of CronJobs, Jobs, PipelineRuns, and +TaskRuns in the dogfooding cluster. It runs as a CronJob that directly creates +a TaskRun (no EventListener needed), making it independent of the trigger +infrastructure it monitors. 
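The pattern is small enough to show inline. A condensed sketch of what the
CronJob container runs (the full version, including the `namespaces` param,
lives in `cronjob.yaml`):

```bash
# Pipe a TaskRun manifest straight into kubectl; no EventListener or
# TriggerTemplate involved.
cat <<'EOF' | kubectl create -f -
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  generateName: cluster-health-monitor-
  namespace: default
spec:
  serviceAccountName: tekton-monitor
  taskRef:
    name: cluster-health-monitor
  workspaces:
    - name: source
      emptyDir: {}
    - name: report
      emptyDir: {}
EOF
```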
## What it checks

### CronJob/Job checks (`check-cronjobs.sh`)

- **Stuck active jobs**: CronJobs with active jobs that never completed —
  blocks all future runs under `concurrencyPolicy=Forbid`
- **Stale CronJobs**: CronJobs that haven't succeeded within a configurable
  threshold (default: 48h) — catches silent failures
- **Image pull failures**: Pods stuck in `ImagePullBackOff`/`ErrImagePull`
- **Failed jobs**: Jobs with `status.failed > 0`

Suspended CronJobs are skipped, since suspension is deliberate.

### PipelineRun checks (`check-runs.sh`)

Applies a few filters to avoid noise from flaky tests:

- **Infrastructure failures**: Always flagged regardless of failure rate —
  `PipelineRunTimeout`, `TaskRunImagePullFailed`, `CouldntGetTask`, etc.
  These indicate platform problems, not user code issues.
- **Consistently failing**: Pipelines where **all** of the last N runs
  failed (default: N=3). Pipelines with mixed success/failure are skipped.
- **Regressions**: Consistently failing pipelines that previously had
  successes — flagged separately to highlight recent breakage.

## How it alerts

When problems are detected, the report step creates a GitHub issue in
`tektoncd/plumbing` with structured labels (`area/infra`, `kind/monitoring`).
Issues are deduplicated: while one is already open, subsequent runs add a
comment with the latest report instead of filing a new issue, and the issue
is closed automatically once the cluster reports healthy again.

## Architecture

```
CronJob (daily at 06:00 UTC)
  └── creates Job
        └── creates TaskRun (via kubectl)
              └── runs cluster-health-monitor Task
                    ├── step 1: clone plumbing repo
                    ├── step 2: check-cronjobs.sh (kubectl)
                    ├── step 3: check-runs.sh (kubectl)
                    └── step 4: report.sh (creates GitHub issue)
```

The CronJob creates the TaskRun directly using `kubectl`, avoiding any
dependency on EventListeners/TriggerTemplates. This means the monitor keeps
working even when the trigger infrastructure is broken.

The Task clones the plumbing repository and runs the scripts from
`tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/`. This keeps the
logic in real shell scripts that maintainers can run and test locally.
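## Triggering manually

To force a run outside the daily schedule, one option is to create a Job
directly from the CronJob; it runs the same pod the schedule would, which in
turn creates the TaskRun:

```bash
kubectl create job manual-health-check \
  --from=cronjob/cluster-health-monitor -n default
```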
+ +## Running locally + +All scripts live in [`scripts/`](scripts/) and can be run with `kubectl` +access to the cluster: + +```bash +export KUBECONFIG=~/.kube/config.tekton-oracle + +# Create a report directory +mkdir -p /tmp/health-report + +# Run the checks +./scripts/check-cronjobs.sh /tmp/health-report # default: 48h stale threshold +./scripts/check-runs.sh /tmp/health-report # default: 3-run window + +# View the report (skip report.sh to avoid creating a GitHub issue) +cat /tmp/health-report/health-report.md +cat /tmp/health-report/status +``` + +### Script options + +```bash +# Custom stale threshold (hours) +./scripts/check-cronjobs.sh /tmp/report 72 + +# Custom namespaces and window size +./scripts/check-runs.sh /tmp/report "default,tekton-nightly" 10 +``` + +## RBAC + +The `tekton-monitor` ServiceAccount needs: +- `get`, `list` on `cronjobs`, `jobs`, `pods` in `default` namespace +- `get`, `list` on `pipelineruns`, `taskruns` across monitored namespaces +- `create` on `taskruns` in `default` namespace (for the CronJob to create the TaskRun) diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml b/tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml new file mode 100644 index 000000000..a5c022049 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml @@ -0,0 +1,64 @@ +# Copyright 2026 The Tekton Authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
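#
# This CronJob deliberately bypasses EventListeners/TriggerTemplates: the
# container below creates the monitor TaskRun with plain kubectl, so the
# monitor keeps working even when the trigger infrastructure it watches is
# broken. concurrencyPolicy: Replace plus activeDeadlineSeconds ensure the
# monitor itself can never get stuck.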
+apiVersion: batch/v1 +kind: CronJob +metadata: + name: cluster-health-monitor + namespace: default +spec: + schedule: "0 6 * * *" # Daily at 06:00 UTC + concurrencyPolicy: Replace + failedJobsHistoryLimit: 3 + successfulJobsHistoryLimit: 3 + jobTemplate: + metadata: + annotations: + managed-by: Tekton + spec: + activeDeadlineSeconds: 600 # 10 minute timeout to avoid stuck jobs + template: + spec: + serviceAccountName: tekton-monitor + containers: + - name: create-taskrun + # Fully-qualified image to avoid CRI-O short name issues + image: ghcr.io/tektoncd/plumbing/kubectl + command: + - /bin/sh + args: + - -ce + - | + cat <<'EOF' | kubectl create -f - + apiVersion: tekton.dev/v1 + kind: TaskRun + metadata: + generateName: cluster-health-monitor- + namespace: default + labels: + app: cluster-health-monitor + spec: + serviceAccountName: tekton-monitor + taskRef: + name: cluster-health-monitor + params: + - name: namespaces + value: "default,tekton-ci,tekton-nightly,bastion-p,bastion-z" + workspaces: + - name: source + emptyDir: {} + - name: report + emptyDir: {} + EOF + echo "TaskRun created successfully" + restartPolicy: Never diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/kustomization.yaml b/tekton/cronjobs/dogfooding/cluster-health-monitor/kustomization.yaml new file mode 100644 index 000000000..7e95611e4 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/kustomization.yaml @@ -0,0 +1,10 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +commonAnnotations: + managed-by: Tekton + +resources: + - rbac.yaml + - task.yaml + - cronjob.yaml diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml b/tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml new file mode 100644 index 000000000..5bdacce13 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml @@ -0,0 +1,55 @@ +# Copyright 2026 The Tekton Authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
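#
# RBAC for the monitor: read-only access to the workloads it inspects, plus
# a single write verb (create on TaskRuns) so the CronJob wrapper can start
# the monitor Task.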
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tekton-monitor
  namespace: default
---
# ClusterRole for reading cluster health data
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tekton-monitor
rules:
  # Read CronJobs and Jobs
  - apiGroups: ["batch"]
    resources: ["cronjobs", "jobs"]
    verbs: ["get", "list"]
  # Read Pods (for status/events)
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list"]
  # Read PipelineRuns and TaskRuns
  - apiGroups: ["tekton.dev"]
    resources: ["pipelineruns", "taskruns"]
    verbs: ["get", "list"]
  # Create TaskRuns (for the CronJob to create the monitor TaskRun)
  - apiGroups: ["tekton.dev"]
    resources: ["taskruns"]
    verbs: ["create"]
---
# Bind across all monitored namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tekton-monitor
subjects:
  - kind: ServiceAccount
    name: tekton-monitor
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tekton-monitor
diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh
new file mode 100755
index 000000000..478005588
--- /dev/null
+++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh
@@ -0,0 +1,151 @@
#!/bin/sh
# Check CronJob and Job health in the cluster.
#
# Detects:
# - CronJobs with stuck active jobs (blocking concurrencyPolicy=Forbid)
# - CronJobs that haven't succeeded in a long time
# - Pods with ImagePullBackOff errors
# - Failed Jobs
#
# Usage:
# ./check-cronjobs.sh <report-dir> [stale-threshold-hours]
#
# stale-threshold-hours: flag CronJobs with no success in this many hours (default: 48)
#
# Outputs:
# <report-dir>/health-report.md — markdown report (created/appended)
# <report-dir>/status — "healthy" or "unhealthy"
#
# Requirements: kubectl
set -e

REPORT_DIR="${1:?Usage: $0 <report-dir> [stale-threshold-hours]}"
STALE_HOURS="${2:-48}"

REPORT_FILE="${REPORT_DIR}/health-report.md"
STATUS_FILE="${REPORT_DIR}/status"

# Initialize report if it doesn't exist
if [ ! -f "${REPORT_FILE}" ]; then
  echo "# Cluster Health Report" > "${REPORT_FILE}"
  echo "" >> "${REPORT_FILE}"
  echo "**Generated:** $(date -u '+%Y-%m-%d %H:%M:%S UTC')" >> "${REPORT_FILE}"
  echo "" >> "${REPORT_FILE}"
  echo "healthy" > "${STATUS_FILE}"
fi

# Parse ISO 8601 timestamp to epoch seconds.
# Compatible with GNU date and busybox/alpine date.
parse_ts() {
  date -u -d "$1" +%s 2>/dev/null ||
    date -u -D "%Y-%m-%dT%H:%M:%SZ" -d "$1" +%s 2>/dev/null ||
    echo "0"
}

NOW=$(date +%s)
STALE_THRESHOLD=$((STALE_HOURS * 3600))

# =========================================================
# 1. 
CronJobs with stuck active jobs +# ========================================================= +echo "## CronJob Health" >> "${REPORT_FILE}" +echo "" >> "${REPORT_FILE}" + +CRONJOB_ISSUES="" + +STUCK_CJS="" +for cj in $(kubectl get cronjobs -n default -o jsonpath='{range .items[?(@.status.active)]}{.metadata.name}{"\n"}{end}' 2>/dev/null); do + POLICY=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.spec.concurrencyPolicy}' 2>/dev/null) + LAST_SUCCESS=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.status.lastSuccessfulTime}' 2>/dev/null) + LAST_SCHEDULE=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.status.lastScheduleTime}' 2>/dev/null) + STUCK_CJS="${STUCK_CJS}- \`${cj}\` (policy: ${POLICY}, last success: ${LAST_SUCCESS:-never}, last scheduled: ${LAST_SCHEDULE:-never})\n" + echo "unhealthy" > "${STATUS_FILE}" +done + +if [ -n "${STUCK_CJS}" ]; then + CRONJOB_ISSUES="${CRONJOB_ISSUES}### CronJobs with Active (Stuck) Jobs\n\n" + CRONJOB_ISSUES="${CRONJOB_ISSUES}These CronJobs have active jobs that haven't completed. If \`concurrencyPolicy=Forbid\`, no new runs can be scheduled.\n\n" + CRONJOB_ISSUES="${CRONJOB_ISSUES}${STUCK_CJS}\n" +fi + +# ========================================================= +# 2. CronJobs that haven't succeeded in a long time +# ========================================================= +STALE_CJS="" +# Get all non-suspended CronJobs with their lastSuccessfulTime and creation time +CRONJOBS_DATA=$(kubectl get cronjobs -n default -o jsonpath='{range .items[?(@.spec.suspend!=true)]}{.metadata.name}{"\t"}{.status.lastSuccessfulTime}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}' 2>/dev/null) + +echo "${CRONJOBS_DATA}" | while IFS="$(printf '\t')" read -r name last_success created; do + [ -z "${name}" ] && continue + + if [ -z "${last_success}" ]; then + # Never succeeded — only flag if the CronJob is older than the threshold + CREATED_EPOCH=$(parse_ts "${created}") + AGE=$((NOW - CREATED_EPOCH)) + if [ "${AGE}" -gt "${STALE_THRESHOLD}" ]; then + printf -- "- \`%s\` — **never succeeded** (created: %s)\n" "${name}" "${created}" >> "${REPORT_DIR}/stale.tmp" + echo "unhealthy" > "${STATUS_FILE}" + fi + else + LAST_EPOCH=$(parse_ts "${last_success}") + AGE=$((NOW - LAST_EPOCH)) + if [ "${AGE}" -gt "${STALE_THRESHOLD}" ]; then + HOURS_AGO=$((AGE / 3600)) + printf -- "- \`%s\` — last success **%dh ago** (%s)\n" "${name}" "${HOURS_AGO}" "${last_success}" >> "${REPORT_DIR}/stale.tmp" + echo "unhealthy" > "${STATUS_FILE}" + fi + fi +done + +if [ -f "${REPORT_DIR}/stale.tmp" ]; then + CRONJOB_ISSUES="${CRONJOB_ISSUES}### CronJobs Without Recent Success (>${STALE_HOURS}h)\n\n" + CRONJOB_ISSUES="${CRONJOB_ISSUES}$(cat "${REPORT_DIR}/stale.tmp")\n\n" + rm -f "${REPORT_DIR}/stale.tmp" +fi + +if [ -z "${CRONJOB_ISSUES}" ]; then + echo "✅ All CronJobs healthy" >> "${REPORT_FILE}" +else + printf "%b" "${CRONJOB_ISSUES}" >> "${REPORT_FILE}" +fi +echo "" >> "${REPORT_FILE}" + +# ========================================================= +# 3. 
Pods with image pull failures
# =========================================================
echo "## Job Health" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"

JOB_ISSUES=""

PULL_FAILURES=$(kubectl get pods -n default \
  --field-selector=status.phase!=Succeeded,status.phase!=Running \
  -o custom-columns='POD:.metadata.name,STATUS:.status.containerStatuses[0].state.waiting.reason,IMAGE:.spec.containers[0].image' \
  --no-headers 2>/dev/null | grep -i "ImagePull\|ErrImage" || true)
if [ -n "${PULL_FAILURES}" ]; then
  JOB_ISSUES="${JOB_ISSUES}### Pods with Image Pull Failures\n\n"
  JOB_ISSUES="${JOB_ISSUES}\`\`\`\n${PULL_FAILURES}\n\`\`\`\n\n"
  echo "unhealthy" > "${STATUS_FILE}"
fi

# =========================================================
# 4. Failed jobs
# =========================================================
FAILED_JOBS=$(kubectl get jobs -n default \
  -o custom-columns='NAME:.metadata.name,FAILED:.status.failed,CREATED:.metadata.creationTimestamp' \
  --no-headers 2>/dev/null | awk '$2 ~ /^[0-9]+$/ && $2 > 0 {print}' || true)
if [ -n "${FAILED_JOBS}" ]; then
  JOB_ISSUES="${JOB_ISSUES}### Failed Jobs\n\n"
  JOB_ISSUES="${JOB_ISSUES}\`\`\`\n${FAILED_JOBS}\n\`\`\`\n\n"
  echo "unhealthy" > "${STATUS_FILE}"
fi

if [ -z "${JOB_ISSUES}" ]; then
  echo "✅ All Jobs healthy" >> "${REPORT_FILE}"
else
  printf "%b" "${JOB_ISSUES}" >> "${REPORT_FILE}"
fi
echo "" >> "${REPORT_FILE}"

echo "=== CronJob/Job check complete ==="
cat "${REPORT_FILE}"
diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh
new file mode 100755
index 000000000..53e463183
--- /dev/null
+++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh
@@ -0,0 +1,151 @@
#!/bin/sh
# Check PipelineRun health across namespaces.
#
# Uses smart filtering to avoid noise from flaky tests:
# 1. Consistently failing: all last N runs of a pipeline failed
# 2. Infrastructure failures: ImagePullBackOff, timeouts, etc — always flagged
# 3. Regressions: pipeline was succeeding but is now all-failing
#
# Pipelines with mixed success/failure (flaky) are NOT flagged.
#
# Usage:
# ./check-runs.sh <report-dir> [namespaces] [window-size]
#
# namespaces: comma-separated (default: default,tekton-ci,tekton-nightly,bastion-p,bastion-z)
# window-size: number of recent runs to consider (default: 3)
#
# Outputs:
# <report-dir>/health-report.md — markdown report (appended)
# <report-dir>/status — "healthy" or "unhealthy"
#
# Requirements: kubectl
set -e

REPORT_DIR="${1:?Usage: $0 <report-dir> [namespaces] [window-size]}"
NAMESPACES="${2:-default,tekton-ci,tekton-nightly,bastion-p,bastion-z}"
WINDOW="${3:-3}"

REPORT_FILE="${REPORT_DIR}/health-report.md"
STATUS_FILE="${REPORT_DIR}/status"

# Infrastructure failure reasons that always indicate a platform problem,
# regardless of failure rate. These are never caused by user code.
INFRA_REASONS="TaskRunImagePullFailed|ImagePullBackOff|PipelineRunTimeout|TaskRunTimeout|CouldntGetTask|CouldntGetPipeline|CreateContainerConfigError|ExceededResourceQuota|ExceededNodeResources"

echo "## PipelineRun Health" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"

HAS_ISSUES=false

for ns in $(echo "${NAMESPACES}" | tr ',' ' '); do
  # Get all PipelineRuns: pipeline_name, failure_reason, succeeded (True/False/Unknown)
  # Sorted by creation time (oldest first), so `tail` gives us the most recent. 
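  # For illustration, rows look like this (hypothetical pipeline names),
  # oldest first:
  #   ci-images-build    Succeeded            True
  #   nightly-release    PipelineRunTimeout   False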
+ ALL_RUNS=$(kubectl get pipelineruns -n "${ns}" \ + --sort-by=.metadata.creationTimestamp \ + -o custom-columns='PIPELINE:.metadata.labels.tekton\.dev/pipeline,REASON:.status.conditions[0].reason,STATUS:.status.conditions[0].status' \ + --no-headers 2>/dev/null || true) + + [ -z "${ALL_RUNS}" ] && continue + + # Get unique pipeline names + PIPELINES=$(echo "${ALL_RUNS}" | awk '{print $1}' | sort -u) + + NS_INFRA="" + NS_CONSISTENT="" + NS_REGRESSION="" + + for pipeline in ${PIPELINES}; do + [ "${pipeline}" = "" ] && continue + + # All runs for this pipeline + P_ALL=$(echo "${ALL_RUNS}" | awk -v p="${pipeline}" '$1 == p') + # Last N runs (the window we evaluate) + P_RECENT=$(echo "${P_ALL}" | tail -n "${WINDOW}") + + TOTAL=$(echo "${P_RECENT}" | wc -l | tr -d ' ') + FAILED=$(echo "${P_RECENT}" | awk '$3 != "True"' | wc -l | tr -d ' ') + + # --- Check 1: Infrastructure failures (always flag) --- + INFRA_HITS=$(echo "${P_RECENT}" | grep -cE "${INFRA_REASONS}" || true) + if [ "${INFRA_HITS}" -gt 0 ]; then + INFRA_DETAILS=$(echo "${P_RECENT}" | grep -E "${INFRA_REASONS}" | awk '{print $2}' | sort | uniq -c | sort -rn | awk '{printf "%s (x%s), ", $2, $1}' | sed 's/, $//') + NS_INFRA="${NS_INFRA}- \`${pipeline}\` — ${INFRA_DETAILS}\n" + fi + + # --- Check 2: Consistently failing (all N runs failed) --- + if [ "${FAILED}" -eq "${TOTAL}" ] && [ "${TOTAL}" -ge "${WINDOW}" ]; then + # Get the most common failure reason + TOP_REASON=$(echo "${P_RECENT}" | awk '{print $2}' | sort | uniq -c | sort -rn | head -1 | awk '{print $2}') + + # Skip if already flagged as infra + if ! echo "${NS_INFRA}" | grep -q "\`${pipeline}\`"; then + # --- Check 3: Is this a regression? --- + # Look at runs before the window for any successes + P_OLDER=$(echo "${P_ALL}" | head -n -"${WINDOW}") + HAD_SUCCESS=false + if [ -n "${P_OLDER}" ]; then + OLD_SUCCESS=$(echo "${P_OLDER}" | awk '$3 == "True"' | wc -l | tr -d ' ') + [ "${OLD_SUCCESS}" -gt 0 ] && HAD_SUCCESS=true + fi + + if [ "${HAD_SUCCESS}" = "true" ]; then + NS_REGRESSION="${NS_REGRESSION}- \`${pipeline}\` — last ${TOTAL} runs failed (${TOP_REASON}), was previously succeeding\n" + else + NS_CONSISTENT="${NS_CONSISTENT}- \`${pipeline}\` — last ${TOTAL} runs failed (${TOP_REASON})\n" + fi + fi + fi + done + + # Write findings for this namespace + NS_HAS_ISSUES=false + + if [ -n "${NS_INFRA}" ]; then + echo "### Infrastructure Failures in \`${ns}\`" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + printf "%b" "${NS_INFRA}" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + NS_HAS_ISSUES=true + fi + + if [ -n "${NS_REGRESSION}" ]; then + echo "### Regressions in \`${ns}\`" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + echo "Pipelines that were succeeding but are now consistently failing:" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + printf "%b" "${NS_REGRESSION}" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + NS_HAS_ISSUES=true + fi + + if [ -n "${NS_CONSISTENT}" ]; then + echo "### Consistently Failing in \`${ns}\`" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + printf "%b" "${NS_CONSISTENT}" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + NS_HAS_ISSUES=true + fi + + if [ "${NS_HAS_ISSUES}" = "true" ]; then + HAS_ISSUES=true + echo "unhealthy" > "${STATUS_FILE}" + fi +done + +if [ "${HAS_ISSUES}" = "false" ]; then + echo "✅ No actionable PipelineRun failures" >> "${REPORT_FILE}" +fi +echo "" >> "${REPORT_FILE}" + +# ========================================================= +# Summary +# 
=========================================================
STATUS=$(cat "${STATUS_FILE}")
echo "---" >> "${REPORT_FILE}"
echo "" >> "${REPORT_FILE}"
echo "**Overall Status:** ${STATUS}" >> "${REPORT_FILE}"

echo "=== Full report ==="
cat "${REPORT_FILE}"
echo ""
echo "Status: ${STATUS}"
diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh
new file mode 100755
index 000000000..3777658fa
--- /dev/null
+++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh
@@ -0,0 +1,152 @@
#!/bin/sh
# Report cluster health status via GitHub issues.
#
# Behavior:
# - Unhealthy + no open issue → create a new issue
# - Unhealthy + open issue → add comment with latest report
# - Healthy + open issue → comment "resolved" and close it
# - Healthy + no open issue → nothing to do
#
# Usage:
# ./report.sh <report-dir>
#
# Environment:
# GITHUB_TOKEN — GitHub API token with issue creation permissions
#
# Requirements: wget (curl is preferred for closing issues when available)
set -e

REPORT_DIR="${1:?Usage: $0 <report-dir>}"
REPORT_FILE="${REPORT_DIR}/health-report.md"
STATUS_FILE="${REPORT_DIR}/status"
STATUS=$(cat "${STATUS_FILE}")

: "${GITHUB_TOKEN:?GITHUB_TOKEN must be set}"

REPO="tektoncd/plumbing"
API="https://api.github.com/repos/${REPO}"
LABELS="area/infra,kind/monitoring"
AUTH_HEADER="Authorization: token ${GITHUB_TOKEN}"
ACCEPT_HEADER="Accept: application/vnd.github.v3+json"

# Find existing open monitoring issue (returns issue number or empty)
find_open_issue() {
  wget -q -O- \
    --header="${AUTH_HEADER}" \
    --header="${ACCEPT_HEADER}" \
    "${API}/issues?labels=${LABELS}&state=open&per_page=1" \
    | sed -n 's/.*"number": *\([0-9]*\).*/\1/p' | head -1
}

# Encode text as a JSON string value (with surrounding quotes).
# Reads from stdin, writes to stdout.
json_encode() {
  sed 's/\\/\\\\/g; s/"/\\"/g; s/\t/\\t/g' | awk '{printf "%s\\n", $0}' | sed 's/^/"/; s/$/"/'
}

# Add a comment to an issue
add_comment() {
  issue_number="$1"
  body_file="$2"

  PAYLOAD_FILE="/tmp/comment-payload.json"
  printf '{"body": ' > "${PAYLOAD_FILE}"
  json_encode < "${body_file}" >> "${PAYLOAD_FILE}"
  printf '}' >> "${PAYLOAD_FILE}"

  wget -q -O- --post-file="${PAYLOAD_FILE}" \
    --header="${AUTH_HEADER}" \
    --header="${ACCEPT_HEADER}" \
    --header="Content-Type: application/json" \
    "${API}/issues/${issue_number}/comments" > /dev/null
}

# Close an issue via PATCH. Tries curl first (supports PATCH natively),
# falls back to wget --method=PATCH (busybox wget may not support it).
close_issue() {
  issue_number="$1"

  if command -v curl > /dev/null 2>&1; then
    curl -s -X PATCH \
      -H "${AUTH_HEADER}" \
      -H "${ACCEPT_HEADER}" \
      -H "Content-Type: application/json" \
      -d '{"state": "closed", "state_reason": "completed"}' \
      "${API}/issues/${issue_number}" > /dev/null
  else
    PAYLOAD_FILE="/tmp/close-payload.json"
    printf '{"state": "closed", "state_reason": "completed"}' > "${PAYLOAD_FILE}"
    wget -q -O- --post-file="${PAYLOAD_FILE}" \
      --header="${AUTH_HEADER}" \
      --header="${ACCEPT_HEADER}" \
      --header="Content-Type: application/json" \
      --method=PATCH \
      "${API}/issues/${issue_number}" > /dev/null 2>&1 || {
      echo "⚠️ Could not auto-close issue #${issue_number}"
      echo "   Please close it manually."
+ } + fi +} + +EXISTING_ISSUE=$(find_open_issue) + +TIMESTAMP=$(date -u '+%Y-%m-%d %H:%M UTC') +TMPDIR="/tmp/health-monitor" +mkdir -p "${TMPDIR}" + +if [ "${STATUS}" = "healthy" ]; then + if [ -n "${EXISTING_ISSUE}" ]; then + echo "✅ Cluster healthy — closing issue #${EXISTING_ISSUE}" + cat > "${TMPDIR}/resolved.md" < "${TMPDIR}/update.md" < "${TMPDIR}/body.md" < "${PAYLOAD_FILE}" + json_encode < "${TMPDIR}/body.md" >> "${PAYLOAD_FILE}" + printf ', "labels": ["area/infra", "kind/monitoring"]}' >> "${PAYLOAD_FILE}" + + RESPONSE=$(wget -q -O- --post-file="${PAYLOAD_FILE}" \ + --header="${AUTH_HEADER}" \ + --header="${ACCEPT_HEADER}" \ + --header="Content-Type: application/json" \ + "${API}/issues") + + ISSUE_NUM=$(echo "${RESPONSE}" | sed -n 's/.*"number": *\([0-9]*\).*/\1/p' | head -1) + echo "✅ Created issue #${ISSUE_NUM}" +fi diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/task.yaml b/tekton/cronjobs/dogfooding/cluster-health-monitor/task.yaml new file mode 100644 index 000000000..0c41a7246 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/task.yaml @@ -0,0 +1,79 @@ +# Copyright 2026 The Tekton Authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +apiVersion: tekton.dev/v1 +kind: Task +metadata: + name: cluster-health-monitor + namespace: default +spec: + description: > + Monitors the health of CronJobs, Jobs, PipelineRuns, and TaskRuns + in the dogfooding cluster and reports issues. Clones plumbing to + run scripts from the repo, making them easy to maintain and test + locally. 
+ params: + - name: repository + description: The plumbing repository URL + type: string + default: "https://github.com/tektoncd/plumbing.git" + - name: revision + description: Git revision to use for the scripts + type: string + default: "main" + - name: namespaces + description: Comma-separated list of namespaces to monitor for PipelineRun/TaskRun failures + type: string + default: "default,tekton-ci,tekton-nightly,bastion-p,bastion-z" + workspaces: + - name: source + description: Workspace for the cloned plumbing repository + - name: report + description: Workspace to share the health report between steps + steps: + - name: clone + image: ghcr.io/tektoncd/plumbing/alpine-git-nonroot + script: | + #!/bin/sh + set -ex + git clone --depth 1 --branch "$(params.revision)" \ + "$(params.repository)" "$(workspaces.source.path)/plumbing" + echo "Cloned $(params.repository) @ $(params.revision)" + ls "$(workspaces.source.path)/plumbing/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/" + + - name: check-cronjobs-and-jobs + image: ghcr.io/tektoncd/plumbing/kubectl + script: | + #!/bin/sh + exec "$(workspaces.source.path)/plumbing/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh" \ + "$(workspaces.report.path)" + + - name: check-pipelineruns-taskruns + image: ghcr.io/tektoncd/plumbing/kubectl + script: | + #!/bin/sh + exec "$(workspaces.source.path)/plumbing/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh" \ + "$(workspaces.report.path)" "$(params.namespaces)" + + - name: report + image: ghcr.io/tektoncd/plumbing/kubectl + env: + - name: GITHUB_TOKEN + valueFrom: + secretKeyRef: + name: bot-token-github + key: bot-token + script: | + #!/bin/sh + exec "$(workspaces.source.path)/plumbing/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh" \ + "$(workspaces.report.path)"
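# To check on past monitor runs, one option (assuming kubectl access) is to
# list the TaskRuns by the label the CronJob sets on them:
#
#   kubectl get taskruns -n default -l app=cluster-health-monitor \
#     --sort-by=.metadata.creationTimestamp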