diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/README.md b/tekton/cronjobs/dogfooding/cluster-health-monitor/README.md
new file mode 100644
index 000000000..40de1d060
--- /dev/null
+++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/README.md
@@ -0,0 +1,95 @@
+# Cluster Health Monitor
+
+A Tekton Task that monitors the health of CronJobs, Jobs, PipelineRuns, and
+TaskRuns in the dogfooding cluster. It runs as a CronJob that directly creates
+a TaskRun (no EventListener needed), making it independent of the trigger
+infrastructure it monitors.
+
+## What it checks
+
+### CronJob/Job checks (`check-cronjobs.sh`)
+
+- **Stuck active jobs**: CronJobs with active jobs that never completed —
+  under `concurrencyPolicy=Forbid` this blocks all future runs
+- **Stale CronJobs**: CronJobs that haven't succeeded within a configurable
+  threshold (default: 48h) — catches silent failures
+- **Image pull failures**: Pods stuck in `ImagePullBackOff`/`ErrImagePull`
+- **Failed jobs**: Jobs with `status.failed > 0`
+
+Suspended CronJobs are skipped: suspending a CronJob is a deliberate action,
+not a failure.
+
+### PipelineRun checks (`check-runs.sh`)
+
+Uses smart filtering to avoid noise from flaky tests:
+
+- **Infrastructure failures**: Always flagged regardless of rate —
+  `PipelineRunTimeout`, `TaskRunImagePullFailed`, `CouldntGetTask`, etc.
+  These indicate platform problems, not user code issues.
+- **Consistently failing**: Pipelines where **all** of the last N runs
+  failed (default: N=3). Skips pipelines with mixed success/failure.
+- **Regressions**: Subset of consistently failing pipelines that
+  previously had successes — flagged separately as regressions.
+
+## How it alerts
+
+When issues are detected, the report step creates a GitHub issue in
+`tektoncd/plumbing` with structured labels (`area/infra`, `kind/monitoring`).
+Reports are deduplicated: if a monitoring issue is already open, the latest
+report is added as a comment instead of opening a new one.
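
For reference, the issue-number parsing that this dedup check relies on can be
exercised offline. The canned API response below is hypothetical; the real
call goes through `wget` in `report.sh`:

```bash
# Hypothetical canned GitHub API response (array of open issues).
RESPONSE='[{"number": 4821, "title": "Cluster health: issues detected", "state": "open"}]'

# Same sed expression report.sh uses to extract the first issue number
# (the kubectl image ships no jq, so parsing is plain sed).
ISSUE=$(printf '%s' "${RESPONSE}" | sed -n 's/.*"number": *\([0-9]*\).*/\1/p' | head -1)
echo "${ISSUE}"    # prints 4821
```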
+ +## Architecture + +``` +CronJob (daily at 06:00 UTC) + └── creates Job + └── creates TaskRun (via kubectl) + └── runs cluster-health-monitor Task + ├── step 1: clone plumbing repo + ├── step 2: check-cronjobs.sh (kubectl) + ├── step 3: check-runs.sh (kubectl) + └── step 4: report.sh (creates GitHub issue) +``` + +The CronJob creates the TaskRun directly using `kubectl`, avoiding dependency +on EventListeners/TriggerTemplates. This means the monitor works even when the +trigger infrastructure is broken. + +The Task clones the plumbing repository and runs the scripts from +`tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/`. This keeps the +logic in real shell scripts that maintainers can run and test locally. + +## Running locally + +All scripts live in [`scripts/`](scripts/) and can be run with `kubectl` +access to the cluster: + +```bash +export KUBECONFIG=~/.kube/config.tekton-oracle + +# Create a report directory +mkdir -p /tmp/health-report + +# Run the checks +./scripts/check-cronjobs.sh /tmp/health-report # default: 48h stale threshold +./scripts/check-runs.sh /tmp/health-report # default: 3-run window + +# View the report (skip report.sh to avoid creating a GitHub issue) +cat /tmp/health-report/health-report.md +cat /tmp/health-report/status +``` + +### Script options + +```bash +# Custom stale threshold (hours) +./scripts/check-cronjobs.sh /tmp/report 72 + +# Custom namespaces and window size +./scripts/check-runs.sh /tmp/report "default,tekton-nightly" 10 +``` + +## RBAC + +The `tekton-monitor` ServiceAccount needs: +- `get`, `list` on `cronjobs`, `jobs`, `pods` in `default` namespace +- `get`, `list` on `pipelineruns`, `taskruns` across monitored namespaces +- `create` on `taskruns` in `default` namespace (for the CronJob to create the TaskRun) diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml b/tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml new file mode 100644 index 000000000..a5c022049 --- /dev/null 
+++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/cronjob.yaml @@ -0,0 +1,64 @@ +# Copyright 2026 The Tekton Authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +apiVersion: batch/v1 +kind: CronJob +metadata: + name: cluster-health-monitor + namespace: default +spec: + schedule: "0 6 * * *" # Daily at 06:00 UTC + concurrencyPolicy: Replace + failedJobsHistoryLimit: 3 + successfulJobsHistoryLimit: 3 + jobTemplate: + metadata: + annotations: + managed-by: Tekton + spec: + activeDeadlineSeconds: 600 # 10 minute timeout to avoid stuck jobs + template: + spec: + serviceAccountName: tekton-monitor + containers: + - name: create-taskrun + # Fully-qualified image to avoid CRI-O short name issues + image: ghcr.io/tektoncd/plumbing/kubectl + command: + - /bin/sh + args: + - -ce + - | + cat <<'EOF' | kubectl create -f - + apiVersion: tekton.dev/v1 + kind: TaskRun + metadata: + generateName: cluster-health-monitor- + namespace: default + labels: + app: cluster-health-monitor + spec: + serviceAccountName: tekton-monitor + taskRef: + name: cluster-health-monitor + params: + - name: namespaces + value: "default,tekton-ci,tekton-nightly,bastion-p,bastion-z" + workspaces: + - name: source + emptyDir: {} + - name: report + emptyDir: {} + EOF + echo "TaskRun created successfully" + restartPolicy: Never diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/kustomization.yaml b/tekton/cronjobs/dogfooding/cluster-health-monitor/kustomization.yaml new file mode 100644 
index 000000000..7e95611e4 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/kustomization.yaml @@ -0,0 +1,10 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +commonAnnotations: + managed-by: Tekton + +resources: + - rbac.yaml + - task.yaml + - cronjob.yaml diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml b/tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml new file mode 100644 index 000000000..5bdacce13 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/rbac.yaml @@ -0,0 +1,55 @@ +# Copyright 2026 The Tekton Authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+apiVersion: v1 +kind: ServiceAccount +metadata: + name: tekton-monitor + namespace: default +--- +# ClusterRole for reading cluster health data +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: tekton-monitor +rules: + # Read CronJobs and Jobs + - apiGroups: ["batch"] + resources: ["cronjobs", "jobs"] + verbs: ["get", "list"] + # Read Pods (for status/events) + - apiGroups: [""] + resources: ["pods", "events"] + verbs: ["get", "list"] + # Read PipelineRuns and TaskRuns + - apiGroups: ["tekton.dev"] + resources: ["pipelineruns", "taskruns"] + verbs: ["get", "list"] + # Create TaskRuns (for the CronJob to create the monitor TaskRun) + - apiGroups: ["tekton.dev"] + resources: ["taskruns"] + verbs: ["create"] +--- +# Bind across all monitored namespaces +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: tekton-monitor +subjects: + - kind: ServiceAccount + name: tekton-monitor + namespace: default +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: tekton-monitor diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh new file mode 100755 index 000000000..478005588 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh @@ -0,0 +1,151 @@ +#!/bin/sh +# Check CronJob and Job health in the cluster. 
+#
+# Detects:
+# - CronJobs with stuck active jobs (blocking concurrencyPolicy=Forbid)
+# - CronJobs that haven't succeeded in a long time
+# - Pods with ImagePullBackOff errors
+# - Failed Jobs
+#
+# Usage:
+#   ./check-cronjobs.sh <report-dir> [stale-threshold-hours]
+#
+#   stale-threshold-hours: flag CronJobs with no success in this many hours (default: 48)
+#
+# Outputs:
+#   <report-dir>/health-report.md — markdown report (created/appended)
+#   <report-dir>/status — "healthy" or "unhealthy"
+#
+# Requirements: kubectl
+set -e

+REPORT_DIR="${1:?Usage: $0 <report-dir> [stale-threshold-hours]}"
+STALE_HOURS="${2:-48}"
+
+REPORT_FILE="${REPORT_DIR}/health-report.md"
+STATUS_FILE="${REPORT_DIR}/status"
+
+# Initialize report if it doesn't exist
+if [ ! -f "${REPORT_FILE}" ]; then
+  echo "# Cluster Health Report" > "${REPORT_FILE}"
+  echo "" >> "${REPORT_FILE}"
+  echo "**Generated:** $(date -u '+%Y-%m-%d %H:%M:%S UTC')" >> "${REPORT_FILE}"
+  echo "" >> "${REPORT_FILE}"
+  echo "healthy" > "${STATUS_FILE}"
+fi
+
+# Parse ISO 8601 timestamp to epoch seconds.
+# Compatible with GNU date and busybox/alpine date.
+parse_ts() {
+  date -u -d "$1" +%s 2>/dev/null ||
+    date -u -D "%Y-%m-%dT%H:%M:%SZ" -d "$1" +%s 2>/dev/null ||
+    echo "0"
+}
+
+NOW=$(date +%s)
+STALE_THRESHOLD=$((STALE_HOURS * 3600))
+
+# =========================================================
+# 1. 
CronJobs with stuck active jobs +# ========================================================= +echo "## CronJob Health" >> "${REPORT_FILE}" +echo "" >> "${REPORT_FILE}" + +CRONJOB_ISSUES="" + +STUCK_CJS="" +for cj in $(kubectl get cronjobs -n default -o jsonpath='{range .items[?(@.status.active)]}{.metadata.name}{"\n"}{end}' 2>/dev/null); do + POLICY=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.spec.concurrencyPolicy}' 2>/dev/null) + LAST_SUCCESS=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.status.lastSuccessfulTime}' 2>/dev/null) + LAST_SCHEDULE=$(kubectl get cronjob "${cj}" -n default -o jsonpath='{.status.lastScheduleTime}' 2>/dev/null) + STUCK_CJS="${STUCK_CJS}- \`${cj}\` (policy: ${POLICY}, last success: ${LAST_SUCCESS:-never}, last scheduled: ${LAST_SCHEDULE:-never})\n" + echo "unhealthy" > "${STATUS_FILE}" +done + +if [ -n "${STUCK_CJS}" ]; then + CRONJOB_ISSUES="${CRONJOB_ISSUES}### CronJobs with Active (Stuck) Jobs\n\n" + CRONJOB_ISSUES="${CRONJOB_ISSUES}These CronJobs have active jobs that haven't completed. If \`concurrencyPolicy=Forbid\`, no new runs can be scheduled.\n\n" + CRONJOB_ISSUES="${CRONJOB_ISSUES}${STUCK_CJS}\n" +fi + +# ========================================================= +# 2. 
CronJobs that haven't succeeded in a long time +# ========================================================= +STALE_CJS="" +# Get all non-suspended CronJobs with their lastSuccessfulTime and creation time +CRONJOBS_DATA=$(kubectl get cronjobs -n default -o jsonpath='{range .items[?(@.spec.suspend!=true)]}{.metadata.name}{"\t"}{.status.lastSuccessfulTime}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}' 2>/dev/null) + +echo "${CRONJOBS_DATA}" | while IFS="$(printf '\t')" read -r name last_success created; do + [ -z "${name}" ] && continue + + if [ -z "${last_success}" ]; then + # Never succeeded — only flag if the CronJob is older than the threshold + CREATED_EPOCH=$(parse_ts "${created}") + AGE=$((NOW - CREATED_EPOCH)) + if [ "${AGE}" -gt "${STALE_THRESHOLD}" ]; then + printf -- "- \`%s\` — **never succeeded** (created: %s)\n" "${name}" "${created}" >> "${REPORT_DIR}/stale.tmp" + echo "unhealthy" > "${STATUS_FILE}" + fi + else + LAST_EPOCH=$(parse_ts "${last_success}") + AGE=$((NOW - LAST_EPOCH)) + if [ "${AGE}" -gt "${STALE_THRESHOLD}" ]; then + HOURS_AGO=$((AGE / 3600)) + printf -- "- \`%s\` — last success **%dh ago** (%s)\n" "${name}" "${HOURS_AGO}" "${last_success}" >> "${REPORT_DIR}/stale.tmp" + echo "unhealthy" > "${STATUS_FILE}" + fi + fi +done + +if [ -f "${REPORT_DIR}/stale.tmp" ]; then + CRONJOB_ISSUES="${CRONJOB_ISSUES}### CronJobs Without Recent Success (>${STALE_HOURS}h)\n\n" + CRONJOB_ISSUES="${CRONJOB_ISSUES}$(cat "${REPORT_DIR}/stale.tmp")\n\n" + rm -f "${REPORT_DIR}/stale.tmp" +fi + +if [ -z "${CRONJOB_ISSUES}" ]; then + echo "✅ All CronJobs healthy" >> "${REPORT_FILE}" +else + printf "%b" "${CRONJOB_ISSUES}" >> "${REPORT_FILE}" +fi +echo "" >> "${REPORT_FILE}" + +# ========================================================= +# 3. 
Pods with image pull failures +# ========================================================= +echo "## Job Health" >> "${REPORT_FILE}" +echo "" >> "${REPORT_FILE}" + +JOB_ISSUES="" + +PULL_FAILURES=$(kubectl get pods -n default \ + --field-selector=status.phase!=Succeeded,status.phase!=Running \ + -o custom-columns='POD:.metadata.name,STATUS:.status.containerStatuses[0].state.waiting.reason,IMAGE:.spec.containers[0].image' \ + --no-headers 2>/dev/null | grep -i "ImagePull\|ErrImage" || true) +if [ -n "${PULL_FAILURES}" ]; then + JOB_ISSUES="${JOB_ISSUES}### Pods with Image Pull Failures\n\n" + JOB_ISSUES="${JOB_ISSUES}\`\`\`\n${PULL_FAILURES}\n\`\`\`\n\n" + echo "unhealthy" > "${STATUS_FILE}" +fi + +# ========================================================= +# 4. Failed jobs +# ========================================================= +FAILED_JOBS=$(kubectl get jobs -n default \ + -o custom-columns='NAME:.metadata.name,FAILED:.status.failed,CREATED:.metadata.creationTimestamp' \ + --no-headers 2>/dev/null | awk '$2 ~ /^[0-9]+$/ && $2 > 0 {print}' || true) +if [ -n "${FAILED_JOBS}" ]; then + JOB_ISSUES="${JOB_ISSUES}### Failed Jobs\n\n" + JOB_ISSUES="${JOB_ISSUES}\`\`\`\n${FAILED_JOBS}\n\`\`\`\n\n" + echo "unhealthy" > "${STATUS_FILE}" +fi + +if [ -z "${JOB_ISSUES}" ]; then + echo "✅ All Jobs healthy" >> "${REPORT_FILE}" +else + printf "%b" "${JOB_ISSUES}" >> "${REPORT_FILE}" +fi +echo "" >> "${REPORT_FILE}" + +echo "=== CronJob/Job check complete ===" +cat "${REPORT_FILE}" diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh new file mode 100755 index 000000000..53e463183 --- /dev/null +++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh @@ -0,0 +1,151 @@ +#!/bin/sh +# Check PipelineRun health across namespaces. +# +# Uses smart filtering to avoid noise from flaky tests: +# 1. 
Consistently failing: all last N runs of a pipeline failed
+# 2. Infrastructure failures: ImagePullBackOff, timeouts, etc — always flagged
+# 3. Regressions: pipeline was succeeding but is now all-failing
+#
+# Pipelines with mixed success/failure (flaky) are NOT flagged.
+#
+# Usage:
+#   ./check-runs.sh <report-dir> [namespaces] [window-size]
+#
+#   namespaces: comma-separated (default: default,tekton-ci,tekton-nightly,bastion-p,bastion-z)
+#   window-size: number of recent runs to consider (default: 3)
+#
+# Outputs:
+#   <report-dir>/health-report.md — markdown report (appended)
+#   <report-dir>/status — "healthy" or "unhealthy"
+#
+# Requirements: kubectl
+set -e
+
+REPORT_DIR="${1:?Usage: $0 <report-dir> [namespaces] [window-size]}"
+NAMESPACES="${2:-default,tekton-ci,tekton-nightly,bastion-p,bastion-z}"
+WINDOW="${3:-3}"
+
+REPORT_FILE="${REPORT_DIR}/health-report.md"
+STATUS_FILE="${REPORT_DIR}/status"
+
+# Infrastructure failure reasons that always indicate a platform problem,
+# regardless of failure rate. These are never caused by user code.
+INFRA_REASONS="TaskRunImagePullFailed|ImagePullBackOff|PipelineRunTimeout|TaskRunTimeout|CouldntGetTask|CouldntGetPipeline|CreateContainerConfigError|ExceededResourceQuota|ExceededNodeResources"
+
+echo "## PipelineRun Health" >> "${REPORT_FILE}"
+echo "" >> "${REPORT_FILE}"
+
+HAS_ISSUES=false
+
+for ns in $(echo "${NAMESPACES}" | tr ',' ' '); do
+  # Get all PipelineRuns: pipeline_name, failure_reason, succeeded (True/False/Unknown)
+  # Sorted by creation time (oldest first), so `tail` gives us the most recent.
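+  # Illustrative (hypothetical) output rows: pipeline label, first condition
+  # reason, first condition status. For example:
+  #   build-and-push   Succeeded            True
+  #   build-and-push   PipelineRunTimeout   False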
+  ALL_RUNS=$(kubectl get pipelineruns -n "${ns}" \
+    --sort-by=.metadata.creationTimestamp \
+    -o custom-columns='PIPELINE:.metadata.labels.tekton\.dev/pipeline,REASON:.status.conditions[0].reason,STATUS:.status.conditions[0].status' \
+    --no-headers 2>/dev/null || true)
+
+  [ -z "${ALL_RUNS}" ] && continue
+
+  # Get unique pipeline names
+  PIPELINES=$(echo "${ALL_RUNS}" | awk '{print $1}' | sort -u)
+
+  NS_INFRA=""
+  NS_CONSISTENT=""
+  NS_REGRESSION=""
+
+  for pipeline in ${PIPELINES}; do
+    # Skip runs with no pipeline label (kubectl prints "<none>" for them)
+    case "${pipeline}" in ""|"<none>") continue ;; esac
+
+    # All runs for this pipeline
+    P_ALL=$(echo "${ALL_RUNS}" | awk -v p="${pipeline}" '$1 == p')
+    # Last N runs (the window we evaluate)
+    P_RECENT=$(echo "${P_ALL}" | tail -n "${WINDOW}")
+
+    TOTAL=$(echo "${P_RECENT}" | wc -l | tr -d ' ')
+    FAILED=$(echo "${P_RECENT}" | awk '$3 != "True"' | wc -l | tr -d ' ')
+
+    # --- Check 1: Infrastructure failures (always flag) ---
+    INFRA_HITS=$(echo "${P_RECENT}" | grep -cE "${INFRA_REASONS}" || true)
+    if [ "${INFRA_HITS}" -gt 0 ]; then
+      INFRA_DETAILS=$(echo "${P_RECENT}" | grep -E "${INFRA_REASONS}" | awk '{print $2}' | sort | uniq -c | sort -rn | awk '{printf "%s (x%s), ", $2, $1}' | sed 's/, $//')
+      NS_INFRA="${NS_INFRA}- \`${pipeline}\` — ${INFRA_DETAILS}\n"
+    fi
+
+    # --- Check 2: Consistently failing (all N runs failed) ---
+    if [ "${FAILED}" -eq "${TOTAL}" ] && [ "${TOTAL}" -ge "${WINDOW}" ]; then
+      # Get the most common failure reason
+      TOP_REASON=$(echo "${P_RECENT}" | awk '{print $2}' | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')
+
+      # Skip if already flagged as infra
+      if ! echo "${NS_INFRA}" | grep -q "\`${pipeline}\`"; then
+        # --- Check 3: Is this a regression? 
--- + # Look at runs before the window for any successes + P_OLDER=$(echo "${P_ALL}" | head -n -"${WINDOW}") + HAD_SUCCESS=false + if [ -n "${P_OLDER}" ]; then + OLD_SUCCESS=$(echo "${P_OLDER}" | awk '$3 == "True"' | wc -l | tr -d ' ') + [ "${OLD_SUCCESS}" -gt 0 ] && HAD_SUCCESS=true + fi + + if [ "${HAD_SUCCESS}" = "true" ]; then + NS_REGRESSION="${NS_REGRESSION}- \`${pipeline}\` — last ${TOTAL} runs failed (${TOP_REASON}), was previously succeeding\n" + else + NS_CONSISTENT="${NS_CONSISTENT}- \`${pipeline}\` — last ${TOTAL} runs failed (${TOP_REASON})\n" + fi + fi + fi + done + + # Write findings for this namespace + NS_HAS_ISSUES=false + + if [ -n "${NS_INFRA}" ]; then + echo "### Infrastructure Failures in \`${ns}\`" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + printf "%b" "${NS_INFRA}" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + NS_HAS_ISSUES=true + fi + + if [ -n "${NS_REGRESSION}" ]; then + echo "### Regressions in \`${ns}\`" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + echo "Pipelines that were succeeding but are now consistently failing:" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + printf "%b" "${NS_REGRESSION}" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + NS_HAS_ISSUES=true + fi + + if [ -n "${NS_CONSISTENT}" ]; then + echo "### Consistently Failing in \`${ns}\`" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + printf "%b" "${NS_CONSISTENT}" >> "${REPORT_FILE}" + echo "" >> "${REPORT_FILE}" + NS_HAS_ISSUES=true + fi + + if [ "${NS_HAS_ISSUES}" = "true" ]; then + HAS_ISSUES=true + echo "unhealthy" > "${STATUS_FILE}" + fi +done + +if [ "${HAS_ISSUES}" = "false" ]; then + echo "✅ No actionable PipelineRun failures" >> "${REPORT_FILE}" +fi +echo "" >> "${REPORT_FILE}" + +# ========================================================= +# Summary +# ========================================================= +STATUS=$(cat "${STATUS_FILE}") +echo "---" >> "${REPORT_FILE}" +echo "" >> "${REPORT_FILE}" +echo "**Overall Status:** 
${STATUS}" >> "${REPORT_FILE}"
+
+echo "=== Full report ==="
+cat "${REPORT_FILE}"
+echo ""
+echo "Status: ${STATUS}"
diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh
new file mode 100755
index 000000000..3777658fa
--- /dev/null
+++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh
@@ -0,0 +1,152 @@
+#!/bin/sh
+# Report cluster health status via GitHub issues.
+#
+# Behavior:
+#   - Unhealthy + no open issue → create a new issue
+#   - Unhealthy + open issue → add comment with latest report
+#   - Healthy + open issue → comment "resolved" and close it
+#   - Healthy + no open issue → nothing to do
+#
+# Usage:
+#   ./report.sh <report-dir>
+#
+# Environment:
+#   GITHUB_TOKEN — GitHub API token with issue creation permissions
+#
+# Requirements: wget
+set -e
+
+REPORT_DIR="${1:?Usage: $0 <report-dir>}"
+REPORT_FILE="${REPORT_DIR}/health-report.md"
+STATUS_FILE="${REPORT_DIR}/status"
+STATUS=$(cat "${STATUS_FILE}")
+
+: "${GITHUB_TOKEN:?GITHUB_TOKEN must be set}"
+
+REPO="tektoncd/plumbing"
+API="https://api.github.com/repos/${REPO}"
+LABELS="area/infra,kind/monitoring"
+AUTH_HEADER="Authorization: token ${GITHUB_TOKEN}"
+ACCEPT_HEADER="Accept: application/vnd.github.v3+json"
+
+# Find existing open monitoring issue (returns issue number or empty)
+find_open_issue() {
+  wget -q -O- \
+    --header="${AUTH_HEADER}" \
+    --header="${ACCEPT_HEADER}" \
+    "${API}/issues?labels=${LABELS}&state=open&per_page=1" \
+    | sed -n 's/.*"number": *\([0-9]*\).*/\1/p' | head -1
+}
+
+# Encode text as a JSON string value (with surrounding quotes).
+# Reads from stdin, writes to stdout. 
+json_encode() { + sed 's/\\/\\\\/g; s/"/\\"/g; s/\t/\\t/g' | awk '{printf "%s\\n", $0}' | sed 's/^/"/; s/$/"/' +} + +# Add a comment to an issue +add_comment() { + issue_number="$1" + body_file="$2" + + PAYLOAD_FILE="/tmp/comment-payload.json" + printf '{"body": ' > "${PAYLOAD_FILE}" + json_encode < "${body_file}" >> "${PAYLOAD_FILE}" + printf '}' >> "${PAYLOAD_FILE}" + + wget -q -O- --post-file="${PAYLOAD_FILE}" \ + --header="${AUTH_HEADER}" \ + --header="${ACCEPT_HEADER}" \ + --header="Content-Type: application/json" \ + "${API}/issues/${issue_number}/comments" > /dev/null +} + +# Close an issue via PATCH. Tries curl first (supports PATCH natively), +# falls back to wget --method=PATCH (busybox wget may not support it). +close_issue() { + issue_number="$1" + + if command -v curl > /dev/null 2>&1; then + curl -s -X PATCH \ + -H "${AUTH_HEADER}" \ + -H "${ACCEPT_HEADER}" \ + -H "Content-Type: application/json" \ + -d '{"state": "closed", "state_reason": "completed"}' \ + "${API}/issues/${issue_number}" > /dev/null + else + PAYLOAD_FILE="/tmp/close-payload.json" + printf '{"state": "closed", "state_reason": "completed"}' > "${PAYLOAD_FILE}" + wget -q -O- --post-file="${PAYLOAD_FILE}" \ + --header="${AUTH_HEADER}" \ + --header="${ACCEPT_HEADER}" \ + --header="Content-Type: application/json" \ + --method=PATCH \ + "${API}/issues/${issue_number}" > /dev/null 2>&1 || { + echo "⚠️ Could not auto-close issue #${issue_number}" + echo " Please close it manually." 
+    }
+  fi
+}
+
+EXISTING_ISSUE=$(find_open_issue)
+
+TIMESTAMP=$(date -u '+%Y-%m-%d %H:%M UTC')
+TMPDIR="/tmp/health-monitor"
+mkdir -p "${TMPDIR}"
+
+if [ "${STATUS}" = "healthy" ]; then
+  if [ -n "${EXISTING_ISSUE}" ]; then
+    echo "✅ Cluster healthy — closing issue #${EXISTING_ISSUE}"
+    cat > "${TMPDIR}/resolved.md" <<EOF
+✅ All checks passed as of ${TIMESTAMP}. Closing this issue.
+EOF
+    add_comment "${EXISTING_ISSUE}" "${TMPDIR}/resolved.md"
+    close_issue "${EXISTING_ISSUE}"
+  else
+    echo "✅ Cluster healthy, no open issue: nothing to do"
+  fi
+else
+  if [ -n "${EXISTING_ISSUE}" ]; then
+    echo "⚠️ Cluster unhealthy: commenting on issue #${EXISTING_ISSUE}"
+    cat > "${TMPDIR}/update.md" <<EOF
+Latest report (${TIMESTAMP}):
+
+$(cat "${REPORT_FILE}")
+EOF
+    add_comment "${EXISTING_ISSUE}" "${TMPDIR}/update.md"
+  else
+    echo "⚠️ Cluster unhealthy: creating a new issue"
+    cat > "${TMPDIR}/body.md" <<EOF
+Automated report from the cluster health monitor (${TIMESTAMP}).
+
+$(cat "${REPORT_FILE}")
+EOF
+    PAYLOAD_FILE="/tmp/issue-payload.json"
+    printf '{"title": "Dogfooding cluster health issues detected", "body": ' > "${PAYLOAD_FILE}"
+    json_encode < "${TMPDIR}/body.md" >> "${PAYLOAD_FILE}"
+    printf ', "labels": ["area/infra", "kind/monitoring"]}' >> "${PAYLOAD_FILE}"
+
+    RESPONSE=$(wget -q -O- --post-file="${PAYLOAD_FILE}" \
+      --header="${AUTH_HEADER}" \
+      --header="${ACCEPT_HEADER}" \
+      --header="Content-Type: application/json" \
+      "${API}/issues")
+
+    ISSUE_NUM=$(echo "${RESPONSE}" | sed -n 's/.*"number": *\([0-9]*\).*/\1/p' | head -1)
+    echo "✅ Created issue #${ISSUE_NUM}"
+  fi
+fi
diff --git a/tekton/cronjobs/dogfooding/cluster-health-monitor/task.yaml b/tekton/cronjobs/dogfooding/cluster-health-monitor/task.yaml
new file mode 100644
index 000000000..0c41a7246
--- /dev/null
+++ b/tekton/cronjobs/dogfooding/cluster-health-monitor/task.yaml
@@ -0,0 +1,79 @@
+# Copyright 2026 The Tekton Authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+apiVersion: tekton.dev/v1
+kind: Task
+metadata:
+  name: cluster-health-monitor
+  namespace: default
+spec:
+  description: >
+    Monitors the health of CronJobs, Jobs, PipelineRuns, and TaskRuns
+    in the dogfooding cluster and reports issues. 
Clones plumbing to + run scripts from the repo, making them easy to maintain and test + locally. + params: + - name: repository + description: The plumbing repository URL + type: string + default: "https://github.com/tektoncd/plumbing.git" + - name: revision + description: Git revision to use for the scripts + type: string + default: "main" + - name: namespaces + description: Comma-separated list of namespaces to monitor for PipelineRun/TaskRun failures + type: string + default: "default,tekton-ci,tekton-nightly,bastion-p,bastion-z" + workspaces: + - name: source + description: Workspace for the cloned plumbing repository + - name: report + description: Workspace to share the health report between steps + steps: + - name: clone + image: ghcr.io/tektoncd/plumbing/alpine-git-nonroot + script: | + #!/bin/sh + set -ex + git clone --depth 1 --branch "$(params.revision)" \ + "$(params.repository)" "$(workspaces.source.path)/plumbing" + echo "Cloned $(params.repository) @ $(params.revision)" + ls "$(workspaces.source.path)/plumbing/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/" + + - name: check-cronjobs-and-jobs + image: ghcr.io/tektoncd/plumbing/kubectl + script: | + #!/bin/sh + exec "$(workspaces.source.path)/plumbing/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-cronjobs.sh" \ + "$(workspaces.report.path)" + + - name: check-pipelineruns-taskruns + image: ghcr.io/tektoncd/plumbing/kubectl + script: | + #!/bin/sh + exec "$(workspaces.source.path)/plumbing/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/check-runs.sh" \ + "$(workspaces.report.path)" "$(params.namespaces)" + + - name: report + image: ghcr.io/tektoncd/plumbing/kubectl + env: + - name: GITHUB_TOKEN + valueFrom: + secretKeyRef: + name: bot-token-github + key: bot-token + script: | + #!/bin/sh + exec "$(workspaces.source.path)/plumbing/tekton/cronjobs/dogfooding/cluster-health-monitor/scripts/report.sh" \ + "$(workspaces.report.path)"
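
The `json_encode` helper in `report.sh` can be sanity-checked offline, with no
GitHub API access; a minimal round-trip, assuming a POSIX shell with GNU or
busybox sed/awk:

```shell
# Mirror of report.sh's json_encode: escape backslashes, quotes, and tabs,
# join lines with a literal \n sequence, and wrap the result in double quotes.
json_encode() {
  sed 's/\\/\\\\/g; s/"/\\"/g; s/\t/\\t/g' | awk '{printf "%s\\n", $0}' | sed 's/^/"/; s/$/"/'
}

printf 'line one\nsay "hi"\n' | json_encode
# → "line one\nsay \"hi\"\n"
```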