ci: add kind runtime smoke test for charts by nobodyiam · Pull Request #26 · apolloconfig/apollo-helm-chart

nobodyiam · 2026-02-20T02:29:49Z

What Changed

Added a new runtime-smoke job to .github/workflows/helm-validate.yml.
The new job runs after static chart validation and provisions a real Kubernetes runtime in CI via kind.
Runtime flow in CI:
- create a kind cluster
- install MySQL with Helm
- initialize ApolloConfigDB and ApolloPortalDB using Apollo SQL scripts from the matching appVersion tag
- install apollo-service-dev and apollo-portal
- wait for deployments to become ready
- run /health probes for Config Service, Admin Service, and Portal
Added failure diagnostics (kubectl get all/events and deployment logs) to help debugging when smoke tests fail.
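
The runtime flow above could be sketched as a standalone script, for example to reproduce the job locally. This is a minimal sketch, not the workflow's actual contents: the `run` helper, the `DRY_RUN` switch, the chart paths (`./apollo-service`, `./apollo-portal`), the values file names, and the 12m and 30s time limits are assumptions for illustration. The cluster name, namespace, deployment names, ports, and curl image are taken from the review comments on this PR.

```shell
#!/usr/bin/env bash
set -eu

# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

smoke_flow() {
  # 1. Create the kind cluster.
  run kind create cluster --name apollo-ci --wait 120s

  # 2. Install MySQL with Helm (assumes the bitnami repo was added beforehand).
  run helm install mysql bitnami/mysql --namespace apollo --create-namespace \
    --set auth.rootPassword=test --wait

  # 3. Initialise ApolloConfigDB and ApolloPortalDB here by feeding the Apollo SQL
  #    scripts for the matching appVersion tag to MySQL (omitted: the script
  #    locations are not shown in this PR).

  # 4. Install the Apollo charts (chart paths and values files are assumptions).
  run helm install apollo-service-dev ./apollo-service --namespace apollo -f service-values.yaml
  run helm install apollo-portal ./apollo-portal --namespace apollo -f portal-values.yaml

  # 5. Wait for each deployment to become ready.
  for d in apollo-service-dev-apollo-configservice \
           apollo-service-dev-apollo-adminservice \
           apollo-portal; do
    run kubectl -n apollo rollout status "deployment/$d" --timeout=12m
  done

  # 6. Probe /health from inside the cluster with one-shot curl pods.
  run kubectl -n apollo run curl-probe --rm -i --restart=Never \
    --image=curlimages/curl:8.7.1 -- \
    curl -fsS --max-time 30 http://apollo-service-dev-apollo-configservice:8080/health
  run kubectl -n apollo run curl-probe-admin --rm -i --restart=Never \
    --image=curlimages/curl:8.7.1 -- \
    curl -fsS --max-time 30 http://apollo-service-dev-apollo-adminservice:8090/health
  run kubectl -n apollo run curl-probe-portal --rm -i --restart=Never \
    --image=curlimages/curl:8.7.1 -- \
    curl -fsS --max-time 30 http://apollo-portal:8070/health
}
```

Calling `smoke_flow` with the default `DRY_RUN=1` only lists the commands, which is a cheap way to review the sequence before running it for real with `DRY_RUN=0`.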

Why

Static checks (helm lint / helm template / kubeconform) validate manifests, but they do not verify runtime behavior.
This smoke test adds real in-cluster validation before release.

Verification

Verified workflow YAML parses correctly.

Summary by CodeRabbit

Chores
- Strengthened release validation by adding an automated runtime smoke test to the pipeline. It deploys services into an isolated Kubernetes environment, initializes required data, verifies rollouts, performs HTTP health probes against key services, and captures diagnostics on failures—improving early detection of deployment and runtime issues before production.

provision a kind cluster, install mysql, initialize Apollo config/portal schemas and run apollo-service + apollo-portal health probes in CI

coderabbitai · 2026-02-20T02:30:09Z

Warning

Rate limit exceeded

@nobodyiam has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 14 minutes and 51 seconds before requesting another review.

Walkthrough

Adds a new CI job "Runtime Smoke Test (kind)" to the Helm validation workflow that provisions a Kind cluster, installs MySQL, initializes Apollo databases, deploys Apollo charts via Helm, runs rollout waits and HTTP health probes, and collects diagnostics on failure.

Changes

Cohort / File(s)	Summary
CI Runtime Smoke Test `.github/workflows/helm-validate.yml`	Adds `runtime-smoke` job: depends on `verify`; runs on `ubuntu-latest` with 45m timeout; steps: checkout, Helm setup, create Kind cluster, install MySQL, initialize Apollo DBs from versioned SQL, install Apollo service & portal charts with smoke values, wait for rollouts, run HTTP probes, dump diagnostics on failure.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CI as CI Runner
    participant Kind as Kind Cluster
    participant Helm as Helm
    participant MySQL as MySQL Pod
    participant DBInit as DB Init (SQL)
    participant Apollo as Apollo (Service & Portal)
    participant Probe as HTTP Probes
    CI->>Helm: setup Helm & add repos
    CI->>Kind: create cluster (kind)
    Helm->>Kind: install MySQL chart
    Kind->>MySQL: provision DB pod
    CI->>DBInit: apply versioned SQL -> Kind
    DBInit->>Kind: initialize Apollo databases
    Helm->>Kind: install Apollo charts (smoke values)
    Kind->>Apollo: deploy service & portal pods
    CI->>Kind: wait for rollouts
    CI->>Probe: perform HTTP health probes against endpoints
    Probe-->>CI: success / failure
    alt failure
        CI->>Kind: dump diagnostics (pods, logs, events)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped into the CI lane,
Kind clusters growing in the rain,
MySQL burrowed, databases sown,
Apollo wakes with charts well-known,
Probes blink green — carrots all grown! 🥕

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'ci: add kind runtime smoke test for charts' clearly and specifically describes the main change: adding a new CI job that implements a runtime smoke test using kind.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

.github/workflows/helm-validate.yml (2)

167-171: helm/kind-action is two minor releases behind

The current latest release of helm/kind-action is v1.12.0, which updates kind to v0.26.0. The workflow pins v1.10.0.

⬆️ Proposed update

-        uses: helm/kind-action@v1.10.0
+        uses: helm/kind-action@v1.12.0

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 167 - 171, Update the
GitHub Actions step that uses helm/kind-action to a current released minor
version: locate the step with "uses: helm/kind-action@v1.10.0" and change the
pinned tag to the newer release (e.g., "v1.12.0") so the action (and bundled
kind) is up-to-date; keep the existing inputs (cluster_name: apollo-ci, wait:
120s) unchanged.

267-274: Add MySQL logs to the failure diagnostic dump

Database initialisation is the most likely upstream cause of Apollo pod crash-loops, but MySQL StatefulSet logs are absent from the dump. One extra line covers this:

🔍 Proposed addition

         kubectl -n apollo get all
         kubectl -n apollo get events --sort-by=.metadata.creationTimestamp
+        kubectl -n apollo logs statefulset/mysql --tail=100 || true
         kubectl -n apollo logs deployment/apollo-service-dev-apollo-configservice --tail=200 || true
         kubectl -n apollo logs deployment/apollo-service-dev-apollo-adminservice --tail=200 || true
         kubectl -n apollo logs deployment/apollo-portal --tail=200 || true

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 267 - 274, In the "Dump
diagnostics on failure" step add MySQL pod logs to the dump so DB init failures
are visible; mirror the existing kubectl logs lines by appending a line such as
"kubectl -n apollo logs statefulset/mysql --tail=200 || true" (or use a label
selector like "kubectl -n apollo logs -l app.kubernetes.io/name=mysql --tail=200
|| true" if your MySQL StatefulSet is named differently) alongside the existing
deployment logs for apollo-service-dev-apollo-configservice,
apollo-service-dev-apollo-adminservice, and apollo-portal.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/helm-validate.yml:
- Around line 250-255: The "Wait for deployments" step currently uses three
kubectl rollout status calls each with --timeout=20m which can cumulatively
exceed the job-level 45 minute ceiling; change each timeout to a smaller value
(e.g. --timeout=12m) so total worst-case wait is well under 45 minutes. Edit the
step containing the kubectl commands for
deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and deployment/apollo-portal
and replace each --timeout=20m with --timeout=12m (or another value meeting
3×timeout < 45m).
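
The budget behind this comment is plain arithmetic, sketched below. The helper name `worst_case_minutes` is made up for illustration; the 45-minute job limit and the 20m and 12m candidates come from the comment itself.

```shell
# worst_case_minutes COUNT PER_STEP_MINUTES
# Total minutes if every one of COUNT sequential waits runs to its full timeout.
worst_case_minutes() {
  echo $(( $1 * $2 ))
}

JOB_LIMIT_MINUTES=45

for per_step in 20 12; do
  total=$(worst_case_minutes 3 "$per_step")
  if [ "$total" -lt "$JOB_LIMIT_MINUTES" ]; then
    echo "--timeout=${per_step}m: worst case ${total}m, $((JOB_LIMIT_MINUTES - total))m headroom"
  else
    echo "--timeout=${per_step}m: worst case ${total}m exceeds the ${JOB_LIMIT_MINUTES}m job limit"
  fi
done
```

With three sequential waits, 20m each can add up to 60m, while 12m each caps the step at 36m and leaves 9m for the remaining steps, including the diagnostics dump.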

coderabbitai

🧹 Nitpick comments (2)

.github/workflows/helm-validate.yml (2)

268-275: Consider if: '!success()' to also capture job cancellation.

if: failure() does not fire on job-level timeout cancellation. Changing to if: '!success()' (or if: cancelled() || failure()) ensures diagnostics are collected regardless of how the job terminates.

Proposed fix

-        if: failure()
+        if: '!success()'

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 268 - 275, The workflow
step "Dump diagnostics on failure" currently uses if: failure(), which won't run
on job cancellations/timeouts; update its conditional to run on any non-success
termination — replace if: failure() with if: "!success()" (or if: cancelled() ||
failure()) so the kubectl diagnostic commands (kubectl -n apollo get
all/events/logs for deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and apollo-portal) execute on
cancellations as well as failures.

258-266: Consider adding a timeout to the probe commands.

The kubectl run commands have no explicit timeout. If a service hangs (accepts the connection but never responds), they block until the job-level 45-minute ceiling kills the run, again bypassing the diagnostics step. Passing --max-time to curl, as in the fix below, or wrapping each command with timeout 60 gives faster, more debuggable failures. Note that kubectl run has no --timeout flag; its --pod-running-timeout only bounds how long it waits for the pod to start.

Proposed fix

-          kubectl -n apollo run curl-probe --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-            curl -fsS http://apollo-service-dev-apollo-configservice:8080/health
-          kubectl -n apollo run curl-probe-admin --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-            curl -fsS http://apollo-service-dev-apollo-adminservice:8090/health
-          kubectl -n apollo run curl-probe-portal --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-            curl -fsS http://apollo-portal:8070/health
+          kubectl -n apollo run curl-probe --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+            curl -fsS --max-time 30 http://apollo-service-dev-apollo-configservice:8080/health
+          kubectl -n apollo run curl-probe-admin --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+            curl -fsS --max-time 30 http://apollo-service-dev-apollo-adminservice:8090/health
+          kubectl -n apollo run curl-probe-portal --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+            curl -fsS --max-time 30 http://apollo-portal:8070/health

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 258 - 266, The kubectl run
probe commands (curl-probe, curl-probe-admin, curl-probe-portal) lack an
explicit timeout and can hang; bound each one (e.g., pass --max-time 30 to curl,
or prefix the kubectl command with a shell timeout like timeout 60) so the probe
fails fast and doesn't block the job; apply the same change to all three
curl-probe commands (the ones invoking curl -fsS
http://apollo-service-dev-apollo-configservice:8080/health,
http://apollo-service-dev-apollo-adminservice:8090/health, and
http://apollo-portal:8070/health).
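
One way to bound a probe at two levels is sketched here. The `probe` function, the overridable `KUBECTL` variable, and the 60s/30s values are assumptions for illustration; `timeout` is the GNU coreutils command and `--max-time` is curl's own limit on the whole transfer.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# probe NAME URL
# Runs a one-shot curl pod named NAME against URL in the apollo namespace.
probe() {
  # Outer bound: coreutils `timeout` stops kubectl itself after 60 seconds.
  # Inner bound: curl gives up after 30 seconds, even if the server accepted
  # the connection and then stalled.
  timeout 60 "$KUBECTL" run "$1" -n apollo --rm -i --restart=Never \
    --image=curlimages/curl:8.7.1 -- \
    curl -fsS --max-time 30 "$2"
}
```

A call such as `probe curl-probe-portal http://apollo-portal:8070/health` then fails within about a minute instead of waiting for the job-level ceiling. When the outer limit is hit, `timeout` exits with status 124, which makes a hung probe easy to tell apart from an HTTP error reported by curl.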

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.github/workflows/helm-validate.yml:
- Around line 251-256: The three sequential kubectl rollout status calls
(deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, deployment/apollo-portal)
each use --timeout=20m which can cumulatively exceed the GitHub Actions
45-minute job limit; reduce each timeout to something like --timeout=12m (or
another value that keeps the total well under 45m) so the step worst-case
doesn't trigger job cancellation, and/or change the diagnostics step condition
(the step after these rollouts) to use if: always() to ensure diagnostics run on
cancellation; update the timeout values or the diagnostics step condition
accordingly.

coderabbitai

🧹 Nitpick comments (3)

.github/workflows/helm-validate.yml (3)
182-182: Root password "test" is hardcoded in three independent locations.

The MySQL install (line 182), apollo-service values (line 211), and apollo-portal values (line 234) all independently specify the password "test". If one is changed without the others, the smoke test breaks silently (services fail to connect to MySQL, but the error manifests as a rollout timeout rather than a clear password-mismatch error).

Consider extracting it into a single job-level environment variable:
♻️ Suggested refactor

Define once at the job level:
env:
  MYSQL_ROOT_PASSWORD_CI: test
Then reference ${{ env.MYSQL_ROOT_PASSWORD_CI }} in the MySQL install --set auth.rootPassword=... and use envsubst or sed in the heredocs to substitute it into the values files.
Also applies to: 206-211, 229-234
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml at line 182, Extract the hardcoded MySQL
root password into a single job-level environment variable (e.g.,
MYSQL_ROOT_PASSWORD_CI) in the workflow and replace all literal occurrences of
"test" used for MySQL credentials (the `--set auth.rootPassword=test` in the
MySQL install and the password fields in the apollo-service and apollo-portal
values heredocs) to reference that variable; update the runtime-smoke job in the
helm-validate workflow to define env: MYSQL_ROOT_PASSWORD_CI: test and change
the three places to interpolate `${{ env.MYSQL_ROOT_PASSWORD_CI }}` or perform
runtime substitution inside heredocs using envsubst/sed so all three consumers
(MySQL install, apollo-service values, apollo-portal values) use the single
source of truth.
173-185: MySQL installation looks solid for CI.

Disabling persistence and using an explicit root password for an ephemeral CI cluster is the right call. One minor note: image.repository=bitnamilegacy/mysql pins to the legacy image — worth a comment in the workflow to explain why this choice was made (presumably the current bitnami/mysql image dropped support for older MySQL versions or architectures), so future maintainers don't try to "upgrade" it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 173 - 185, Add an inline
comment above the helm install step explaining why image.repository is set to
"bitnamilegacy/mysql" (e.g., compatibility with older MySQL versions or specific
architectures that the current bitnami/mysql image no longer supports) so future
maintainers understand the intentional pin; reference the helm install block
(helm install mysql bitnami/mysql) and the image.repository setting to make the
rationale explicit and include a note that this is intentional for CI ephemeral
clusters.
269-276: Good diagnostics step — consider adding describe for pending pods.

The get all, get events, and logs commands cover the most common failure modes. For cases where pods are stuck in Pending/CrashLoopBackOff, kubectl describe pod output is often more informative than logs alone.
💡 Optional addition
          kubectl -n apollo get all
          kubectl -n apollo get events --sort-by=.metadata.creationTimestamp
+         kubectl -n apollo describe pods || true
          kubectl -n apollo logs deployment/apollo-service-dev-apollo-configservice --tail=200 || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/helm-validate.yml around lines 269 - 276, Add a kubectl
describe step to the existing "Dump diagnostics on failure" block so pod
conditions and event/reason details are included; e.g., run kubectl -n apollo
describe pod for the relevant pods (or for all pods, or specifically those in
Pending/CrashLoopBackOff) in addition to the current kubectl -n apollo get all /
get events / logs commands so the describe output augments logs and events for
debugging.
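
Taken together with the earlier MySQL-logs suggestion, the diagnostics step might look like the sketch below. The `dump_diagnostics` function and the overridable `KUBECTL` variable are assumptions for illustration; the resource names come from the review comments, and `statefulset/mysql` assumes the MySQL release is named mysql.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands without a cluster.
KUBECTL="${KUBECTL:-kubectl}"
NS=apollo

dump_diagnostics() {
  # `|| true` keeps one failing command from hiding the output of the rest.
  "$KUBECTL" get all -n "$NS" || true
  "$KUBECTL" get events -n "$NS" --sort-by=.metadata.creationTimestamp || true
  # describe shows scheduling and image-pull reasons that logs cannot.
  "$KUBECTL" describe pods -n "$NS" || true
  for target in statefulset/mysql \
                deployment/apollo-service-dev-apollo-configservice \
                deployment/apollo-service-dev-apollo-adminservice \
                deployment/apollo-portal; do
    "$KUBECTL" logs "$target" -n "$NS" --tail=200 || true
  done
}
```

Keeping the targets in one loop means a new workload only needs one extra entry to be covered by the dump.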

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.github/workflows/helm-validate.yml:
- Around line 252-257: Change the three kubectl rollout status commands
(deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and deployment/apollo-portal)
to use a shorter timeout (e.g., --timeout=12m) instead of --timeout=20m so the
cumulative worst-case wait stays below the job-level 45-minute ceiling and
leaves headroom for the diagnostics step to run; update the three lines that
call kubectl -n apollo rollout status to use 12m (or any value ≤12m)
consistently.

ci: add kind runtime smoke test for charts

4a3c9e5

provision a kind cluster, install mysql, initialize Apollo config/portal schemas and run apollo-service + apollo-portal health probes in CI

coderabbitai Bot reviewed Feb 20, 2026

Comment thread .github/workflows/helm-validate.yml Outdated

ci: use bitnamilegacy mysql image in runtime smoke test

e56f51d

coderabbitai Bot reviewed Feb 20, 2026

ci: read mysql root password from secret for db init

3c1d70f

coderabbitai Bot reviewed Feb 20, 2026

ci: shorten rollout wait timeout in runtime smoke test

cc298b9

nobodyiam merged commit dcb144d into main Feb 20, 2026
4 checks passed

nobodyiam deleted the codex/runtime-smoke-kind branch February 20, 2026 08:41

nobodyiam merged 4 commits into main from codex/runtime-smoke-kind
