Skip to content

ci: add kind runtime smoke test for charts#26

Merged
nobodyiam merged 4 commits into
mainfrom
codex/runtime-smoke-kind
Feb 20, 2026
Merged

ci: add kind runtime smoke test for charts#26
nobodyiam merged 4 commits into
mainfrom
codex/runtime-smoke-kind

Conversation

@nobodyiam

@nobodyiam nobodyiam commented Feb 20, 2026

Copy link
Copy Markdown
Member

What Changed

  • Added a new runtime-smoke job to .github/workflows/helm-validate.yml.
  • The new job runs after static chart validation and provisions a real Kubernetes runtime in CI via kind.
  • Runtime flow in CI:
    • create a kind cluster
    • install MySQL with Helm
    • initialize ApolloConfigDB and ApolloPortalDB using Apollo SQL scripts from the matching appVersion tag
    • install apollo-service-dev and apollo-portal
    • wait for deployments to become ready
    • run /health probes for Config Service, Admin Service, and Portal
  • Added failure diagnostics (kubectl get all/events and deployment logs) to help debugging when smoke tests fail.

Why

Static checks (helm lint / helm template / kubeconform) validate manifests, but they do not verify runtime behavior.
This smoke test adds real in-cluster validation before release.

Verification

  • Verified workflow YAML parses correctly.

Summary by CodeRabbit

  • Chores
    • Strengthened release validation by adding an automated runtime smoke test to the pipeline. It deploys services into an isolated Kubernetes environment, initializes required data, verifies rollouts, performs HTTP health probes against key services, and captures diagnostics on failures—improving early detection of deployment and runtime issues before production.

provision a kind cluster, install mysql, initialize Apollo config/portal schemas and run apollo-service + apollo-portal health probes in CI
@coderabbitai

coderabbitai Bot commented Feb 20, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@nobodyiam has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 14 minutes and 51 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

Adds a new CI job "Runtime Smoke Test (kind)" to the Helm validation workflow that provisions a Kind cluster, installs MySQL, initializes Apollo databases, deploys Apollo charts via Helm, runs rollout waits and HTTP health probes, and collects diagnostics on failure. (≤50 words)

Changes

Cohort / File(s) Summary
CI Runtime Smoke Test
\.github/workflows/helm-validate.yml
Adds runtime-smoke job: depends on verify; runs on ubuntu-latest with 45m timeout; steps: checkout, Helm setup, create Kind cluster, install MySQL, initialize Apollo DBs from versioned SQL, install Apollo service & portal charts with smoke values, wait for rollouts, run HTTP probes, dump diagnostics on failure.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CI as CI Runner
    participant Kind as Kind Cluster
    participant Helm as Helm
    participant MySQL as MySQL Pod
    participant DBInit as DB Init (SQL)
    participant Apollo as Apollo (Service & Portal)
    participant Probe as HTTP Probes
    CI->>Helm: setup Helm & add repos
    CI->>Kind: create cluster (kind)
    Helm->>Kind: install MySQL chart
    Kind->>MySQL: provision DB pod
    CI->>DBInit: apply versioned SQL -> Kind
    DBInit->>Kind: initialize Apollo databases
    Helm->>Kind: install Apollo charts (smoke values)
    Kind->>Apollo: deploy service & portal pods
    CI->>Kind: wait for rollouts
    CI->>Probe: perform HTTP health probes against endpoints
    Probe-->>CI: success / failure
    alt failure
        CI->>Kind: dump diagnostics (pods, logs, events)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped into the CI lane,
Kind clusters growing in the rain,
MySQL burrowed, databases sown,
Apollo wakes with charts well-known,
Probes blink green — carrots all grown! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'ci: add kind runtime smoke test for charts' clearly and specifically describes the main change: adding a new CI job that implements a runtime smoke test using kind.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch codex/runtime-smoke-kind

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/workflows/helm-validate.yml (2)

167-171: helm/kind-action is two minor releases behind

The current latest release of helm/kind-action is v1.12.0, which updates kind to v0.26.0. The workflow pins v1.10.0.

⬆️ Proposed update
-        uses: helm/kind-action@v1.10.0
+        uses: helm/kind-action@v1.12.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 167 - 171, Update the
GitHub Actions step that uses helm/kind-action to a current released minor
version: locate the step with "uses: helm/kind-action@v1.10.0" and change the
pinned tag to the newer release (e.g., "v1.12.0") so the action (and bundled
kind) is up-to-date; keep the existing inputs (cluster_name: apollo-ci, wait:
120s) unchanged.

267-274: Add MySQL logs to the failure diagnostic dump

Database initialisation is the most likely upstream cause of Apollo pod crash-loops, but MySQL StatefulSet logs are absent from the dump. One extra line covers this:

🔍 Proposed addition
         kubectl -n apollo get all
         kubectl -n apollo get events --sort-by=.metadata.creationTimestamp
+        kubectl -n apollo logs statefulset/mysql --tail=100 || true
         kubectl -n apollo logs deployment/apollo-service-dev-apollo-configservice --tail=200 || true
         kubectl -n apollo logs deployment/apollo-service-dev-apollo-adminservice --tail=200 || true
         kubectl -n apollo logs deployment/apollo-portal --tail=200 || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 267 - 274, In the "Dump
diagnostics on failure" step add MySQL pod logs to the dump so DB init failures
are visible; mirror the existing kubectl logs lines by appending a line such as
"kubectl -n apollo logs statefulset/mysql-0 --tail=200 || true" (or use a label
selector like "kubectl -n apollo logs -l app=mysql --tail=200 || true" if your
MySQL StatefulSet is named differently) alongside the existing deployment logs
for apollo-service-dev-apollo-configservice,
apollo-service-dev-apollo-adminservice, and apollo-portal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/helm-validate.yml:
- Around line 250-255: The "Wait for deployments" step currently uses three
kubectl rollout status calls each with --timeout=20m which can cumulatively
exceed the job-level 45 minute ceiling; change each timeout to a smaller value
(e.g. --timeout=12m) so total worst-case wait is well under 45 minutes. Edit the
step containing the kubectl commands for
deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and deployment/apollo-portal
and replace each --timeout=20m with --timeout=12m (or another value meeting
3×timeout < 45m).

---

Nitpick comments:
In @.github/workflows/helm-validate.yml:
- Around line 167-171: Update the GitHub Actions step that uses helm/kind-action
to a current released minor version: locate the step with "uses:
helm/kind-action@v1.10.0" and change the pinned tag to the newer release (e.g.,
"v1.12.0") so the action (and bundled kind) is up-to-date; keep the existing
inputs (cluster_name: apollo-ci, wait: 120s) unchanged.
- Around line 267-274: In the "Dump diagnostics on failure" step add MySQL pod
logs to the dump so DB init failures are visible; mirror the existing kubectl
logs lines by appending a line such as "kubectl -n apollo logs
statefulset/mysql-0 --tail=200 || true" (or use a label selector like "kubectl
-n apollo logs -l app=mysql --tail=200 || true" if your MySQL StatefulSet is
named differently) alongside the existing deployment logs for
apollo-service-dev-apollo-configservice, apollo-service-dev-apollo-adminservice,
and apollo-portal.

Comment thread .github/workflows/helm-validate.yml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
.github/workflows/helm-validate.yml (2)

268-275: Consider if: '!success()' to also capture job cancellation.

if: failure() does not fire on job-level timeout cancellation. Changing to if: '!success()' (or if: cancelled() || failure()) ensures diagnostics are collected regardless of how the job terminates.

Proposed fix
-        if: failure()
+        if: '!success()'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 268 - 275, The workflow
step "Dump diagnostics on failure" currently uses if: failure(), which won't run
on job cancellations/timeouts; update its conditional to run on any non-success
termination — replace if: failure() with if: "!success()" (or if: cancelled() ||
failure()) so the kubectl diagnostic commands (kubectl -n apollo get
all/events/logs for deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and apollo-portal) execute on
cancellations as well as failures.

258-266: Consider adding --timeout to the probe pods.

The kubectl run commands have no explicit timeout. If a service hangs (accepts connection but never responds), these will block indefinitely until the job-level 45-minute ceiling kills the run — again bypassing the diagnostics step. A --timeout flag on kubectl run or wrapping with timeout 60 would give faster, more debuggable failures.

Proposed fix
-          kubectl -n apollo run curl-probe --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-            curl -fsS http://apollo-service-dev-apollo-configservice:8080/health
-          kubectl -n apollo run curl-probe-admin --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-            curl -fsS http://apollo-service-dev-apollo-adminservice:8090/health
-          kubectl -n apollo run curl-probe-portal --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-            curl -fsS http://apollo-portal:8070/health
+          kubectl -n apollo run curl-probe --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+            curl -fsS --max-time 30 http://apollo-service-dev-apollo-configservice:8080/health
+          kubectl -n apollo run curl-probe-admin --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+            curl -fsS --max-time 30 http://apollo-service-dev-apollo-adminservice:8090/health
+          kubectl -n apollo run curl-probe-portal --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+            curl -fsS --max-time 30 http://apollo-portal:8070/health
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 258 - 266, The kubectl run
probe commands (curl-probe, curl-probe-admin, curl-probe-portal) lack an
explicit timeout and can hang; update each kubectl run invocation to enforce a
timeout (e.g., add the kubectl --timeout flag if supported or prefix the command
with a shell timeout like timeout 60) so the probe fails fast and doesn't block
the job; apply the same change to all three curl-probe commands (the ones
invoking curl -fsS http://apollo-service-dev-apollo-configservice:8080/health,
http://apollo-service-dev-apollo-adminservice:8090/health, and
http://apollo-portal:8070/health).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.github/workflows/helm-validate.yml:
- Around line 251-256: The three sequential kubectl rollout status calls
(deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, deployment/apollo-portal)
each use --timeout=20m which can cumulatively exceed the GitHub Actions
45-minute job limit; reduce each timeout to something like --timeout=12m (or
another value that keeps the total well under 45m) so the step worst-case
doesn't trigger job cancellation, and/or change the diagnostics step condition
(the step after these rollouts) to use if: always() to ensure diagnostics run on
cancellation; update the timeout values or the diagnostics step condition
accordingly.

---

Nitpick comments:
In @.github/workflows/helm-validate.yml:
- Around line 268-275: The workflow step "Dump diagnostics on failure" currently
uses if: failure(), which won't run on job cancellations/timeouts; update its
conditional to run on any non-success termination — replace if: failure() with
if: "!success()" (or if: cancelled() || failure()) so the kubectl diagnostic
commands (kubectl -n apollo get all/events/logs for
deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and apollo-portal) execute on
cancellations as well as failures.
- Around line 258-266: The kubectl run probe commands (curl-probe,
curl-probe-admin, curl-probe-portal) lack an explicit timeout and can hang;
update each kubectl run invocation to enforce a timeout (e.g., add the kubectl
--timeout flag if supported or prefix the command with a shell timeout like
timeout 60) so the probe fails fast and doesn't block the job; apply the same
change to all three curl-probe commands (the ones invoking curl -fsS
http://apollo-service-dev-apollo-configservice:8080/health,
http://apollo-service-dev-apollo-adminservice:8090/health, and
http://apollo-portal:8070/health).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
.github/workflows/helm-validate.yml (3)

182-182: Root password test is hardcoded in three independent locations.

The MySQL install (line 182), apollo-service values (line 211), and apollo-portal values (line 234) all independently specify the password test. If one is changed without the others, the smoke test breaks silently (services fail to connect to MySQL, but the error manifests as a rollout timeout rather than a clear password-mismatch error).

Consider extracting it into a single step-level environment variable:

♻️ Suggested refactor

Define once at the job level:

env:
  MYSQL_ROOT_PASSWORD_CI: test

Then reference ${{ env.MYSQL_ROOT_PASSWORD_CI }} in the MySQL install --set auth.rootPassword=... and use envsubst or sed in the heredocs to substitute it into the values files.

Also applies to: 206-211, 229-234

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml at line 182, Extract the hardcoded MySQL
root password into a single job-level environment variable (e.g.,
MYSQL_ROOT_PASSWORD_CI) in the workflow and replace all literal occurrences of
"test" used for MySQL credentials (the `--set auth.rootPassword=test` in the
MySQL install and the password fields in the apollo-service and apollo-portal
values heredocs) to reference that variable; update the helm-validate job to
define env: MYSQL_ROOT_PASSWORD_CI: test and change the three places to
interpolate `${{ env.MYSQL_ROOT_PASSWORD_CI }}` or perform runtime substitution
inside heredocs using envsubst/sed so all three consumers (MySQL install,
apollo-service values, apollo-portal values) use the single source of truth.

173-185: MySQL installation looks solid for CI.

Disabling persistence and using an explicit root password for an ephemeral CI cluster is the right call. One minor note: image.repository=bitnamilegacy/mysql pins to the legacy image — worth a comment in the workflow to explain why this choice was made (presumably the current bitnami/mysql image dropped support for older MySQL versions or architectures), so future maintainers don't try to "upgrade" it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 173 - 185, Add an inline
comment above the helm install step explaining why image.repository is set to
"bitnamilegacy/mysql" (e.g., compatibility with older MySQL versions or specific
architectures that the current bitnami/mysql image no longer supports) so future
maintainers understand the intentional pin; reference the helm install block
(helm install mysql bitnami/mysql) and the image.repository setting to make the
rationale explicit and include a note that this is intentional for CI ephemeral
clusters.

269-276: Good diagnostics step — consider adding describe for pending pods.

The get all, get events, and logs commands cover the most common failure modes. For cases where pods are stuck in Pending/CrashLoopBackOff, kubectl describe pod output is often more informative than logs alone.

💡 Optional addition
          kubectl -n apollo get all
          kubectl -n apollo get events --sort-by=.metadata.creationTimestamp
+         kubectl -n apollo describe pods || true
          kubectl -n apollo logs deployment/apollo-service-dev-apollo-configservice --tail=200 || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 269 - 276, Add a kubectl
describe step to the existing "Dump diagnostics on failure" block so pod
conditions and event/reason details are included; e.g., run kubectl -n apollo
describe pod for the relevant pods (or for all pods, or specifically those in
Pending/CrashLoopBackOff) in addition to the current kubectl -n apollo get all /
get events / logs commands so the describe output augments logs and events for
debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.github/workflows/helm-validate.yml:
- Around line 252-257: Change the three kubectl rollout status commands
(deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and deployment/apollo-portal)
to use a shorter timeout (e.g., --timeout=12m) instead of --timeout=20m so the
cumulative worst-case wait stays below the job-level 45-minute ceiling and
leaves headroom for the diagnostics step to run; update the three lines that
call kubectl -n apollo rollout status to use 12m (or any value ≤12m)
consistently.

---

Nitpick comments:
In @.github/workflows/helm-validate.yml:
- Line 182: Extract the hardcoded MySQL root password into a single job-level
environment variable (e.g., MYSQL_ROOT_PASSWORD_CI) in the workflow and replace
all literal occurrences of "test" used for MySQL credentials (the `--set
auth.rootPassword=test` in the MySQL install and the password fields in the
apollo-service and apollo-portal values heredocs) to reference that variable;
update the helm-validate job to define env: MYSQL_ROOT_PASSWORD_CI: test and
change the three places to interpolate `${{ env.MYSQL_ROOT_PASSWORD_CI }}` or
perform runtime substitution inside heredocs using envsubst/sed so all three
consumers (MySQL install, apollo-service values, apollo-portal values) use the
single source of truth.
- Around line 173-185: Add an inline comment above the helm install step
explaining why image.repository is set to "bitnamilegacy/mysql" (e.g.,
compatibility with older MySQL versions or specific architectures that the
current bitnami/mysql image no longer supports) so future maintainers understand
the intentional pin; reference the helm install block (helm install mysql
bitnami/mysql) and the image.repository setting to make the rationale explicit
and include a note that this is intentional for CI ephemeral clusters.
- Around line 269-276: Add a kubectl describe step to the existing "Dump
diagnostics on failure" block so pod conditions and event/reason details are
included; e.g., run kubectl -n apollo describe pod for the relevant pods (or for
all pods, or specifically those in Pending/CrashLoopBackOff) in addition to the
current kubectl -n apollo get all / get events / logs commands so the describe
output augments logs and events for debugging.

@nobodyiam nobodyiam merged commit dcb144d into main Feb 20, 2026
4 checks passed
@nobodyiam nobodyiam deleted the codex/runtime-smoke-kind branch February 20, 2026 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant