ci: add kind runtime smoke test for charts #26
provision a kind cluster, install mysql, initialize Apollo config/portal schemas and run apollo-service + apollo-portal health probes in CI
Walkthrough
Adds a new CI job "Runtime Smoke Test (kind)" to the Helm validation workflow that provisions a Kind cluster, installs MySQL, initializes Apollo databases, deploys Apollo charts via Helm, runs rollout waits and HTTP health probes, and collects diagnostics on failure.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant CI as CI Runner
participant Kind as Kind Cluster
participant Helm as Helm
participant MySQL as MySQL Pod
participant DBInit as DB Init (SQL)
participant Apollo as Apollo (Service & Portal)
participant Probe as HTTP Probes
CI->>Helm: setup Helm & add repos
CI->>Kind: create cluster (kind)
Helm->>Kind: install MySQL chart
Kind->>MySQL: provision DB pod
CI->>DBInit: apply versioned SQL -> Kind
DBInit->>Kind: initialize Apollo databases
Helm->>Kind: install Apollo charts (smoke values)
Kind->>Apollo: deploy service & portal pods
CI->>Kind: wait for rollouts
CI->>Probe: perform HTTP health probes against endpoints
Probe-->>CI: success / failure
alt failure
CI->>Kind: dump diagnostics (pods, logs, events)
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🧹 Nitpick comments (2)
.github/workflows/helm-validate.yml (2)
167-171: helm/kind-action is two minor releases behind
The current latest release of helm/kind-action is v1.12.0, which updates kind to v0.26.0. The workflow pins v1.10.0.
⬆️ Proposed update
- uses: helm/kind-action@v1.10.0
+ uses: helm/kind-action@v1.12.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/helm-validate.yml around lines 167 - 171, Update the GitHub Actions step that uses helm/kind-action to a current released minor version: locate the step with "uses: helm/kind-action@v1.10.0" and change the pinned tag to the newer release (e.g., "v1.12.0") so the action (and bundled kind) is up-to-date; keep the existing inputs (cluster_name: apollo-ci, wait: 120s) unchanged.
267-274: Add MySQL logs to the failure diagnostic dump
Database initialisation is the most likely upstream cause of Apollo pod crash-loops, but MySQL StatefulSet logs are absent from the dump. One extra line covers this:
🔍 Proposed addition
  kubectl -n apollo get all
  kubectl -n apollo get events --sort-by=.metadata.creationTimestamp
+ kubectl -n apollo logs statefulset/mysql --tail=100 || true
  kubectl -n apollo logs deployment/apollo-service-dev-apollo-configservice --tail=200 || true
  kubectl -n apollo logs deployment/apollo-service-dev-apollo-adminservice --tail=200 || true
  kubectl -n apollo logs deployment/apollo-portal --tail=200 || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/helm-validate.yml around lines 267 - 274, In the "Dump diagnostics on failure" step add MySQL pod logs to the dump so DB init failures are visible; mirror the existing kubectl logs lines by appending a line such as "kubectl -n apollo logs statefulset/mysql --tail=200 || true" (or use a label selector like "kubectl -n apollo logs -l app=mysql --tail=200 || true" if your MySQL StatefulSet is named differently) alongside the existing deployment logs for apollo-service-dev-apollo-configservice, apollo-service-dev-apollo-adminservice, and apollo-portal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/helm-validate.yml:
- Around line 250-255: The "Wait for deployments" step currently uses three
kubectl rollout status calls each with --timeout=20m which can cumulatively
exceed the job-level 45 minute ceiling; change each timeout to a smaller value
(e.g. --timeout=12m) so total worst-case wait is well under 45 minutes. Edit the
step containing the kubectl commands for
deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and deployment/apollo-portal
and replace each --timeout=20m with --timeout=12m (or another value meeting
3×timeout < 45m).
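The budget arithmetic behind this suggestion can be sketched in shell. This is a minimal sketch under stated assumptions, not the workflow's actual step: the helper names `per_rollout_timeout_minutes` and `wait_for_rollouts` and the idea of reserving a fixed number of minutes for setup and diagnostics are hypothetical; the namespace and deployment names come from the review comment.

```shell
# Hypothetical helper: split the remaining job budget evenly across rollouts.
# Arguments: job ceiling (minutes), minutes reserved for other steps, rollout count.
per_rollout_timeout_minutes() {
  local ceiling_min="$1"
  local reserved_min="$2"
  local rollouts="$3"
  echo $(( (ceiling_min - reserved_min) / rollouts ))
}

# Hypothetical wrapper: wait for each deployment with the same bounded timeout.
# First argument is the per-rollout timeout in minutes, the rest are deployment names.
wait_for_rollouts() {
  local timeout_min="$1"
  shift
  local deployment
  for deployment in "$@"; do
    kubectl -n apollo rollout status "deployment/${deployment}" \
      --timeout="${timeout_min}m" || return 1
  done
}
```

With a 45-minute ceiling and an assumed 9 minutes reserved, `per_rollout_timeout_minutes 45 9 3` yields 12, so the worst case is 3 × 12 = 36 minutes, whereas the current 3 × 20 = 60 minutes exceeds the ceiling. A call such as `wait_for_rollouts 12 apollo-service-dev-apollo-configservice apollo-service-dev-apollo-adminservice apollo-portal` would then replace the three separate commands.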
🧹 Nitpick comments (2)
.github/workflows/helm-validate.yml (2)
268-275: Consider if: '!success()' to also capture job cancellation.
if: failure() does not fire on job-level timeout cancellation. Changing to if: '!success()' (or if: cancelled() || failure()) ensures diagnostics are collected regardless of how the job terminates.
Proposed fix
- if: failure()
+ if: '!success()'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/helm-validate.yml around lines 268 - 275, The workflow step "Dump diagnostics on failure" currently uses if: failure(), which won't run on job cancellations/timeouts; update its conditional to run on any non-success termination — replace if: failure() with if: "!success()" (or if: cancelled() || failure()) so the kubectl diagnostic commands (kubectl -n apollo get all/events/logs for deployment/apollo-service-dev-apollo-configservice, deployment/apollo-service-dev-apollo-adminservice, and apollo-portal) execute on cancellations as well as failures.
258-266: Consider adding --timeout to the probe pods.
The kubectl run commands have no explicit timeout. If a service hangs (accepts the connection but never responds), these will block indefinitely until the job-level 45-minute ceiling kills the run, again bypassing the diagnostics step. A --timeout flag on kubectl run or wrapping with timeout 60 would give faster, more debuggable failures. The proposed fix below bounds each request with curl's --max-time instead.
Proposed fix
- kubectl -n apollo run curl-probe --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-   curl -fsS http://apollo-service-dev-apollo-configservice:8080/health
- kubectl -n apollo run curl-probe-admin --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-   curl -fsS http://apollo-service-dev-apollo-adminservice:8090/health
- kubectl -n apollo run curl-probe-portal --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
-   curl -fsS http://apollo-portal:8070/health
+ kubectl -n apollo run curl-probe --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+   curl -fsS --max-time 30 http://apollo-service-dev-apollo-configservice:8080/health
+ kubectl -n apollo run curl-probe-admin --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+   curl -fsS --max-time 30 http://apollo-service-dev-apollo-adminservice:8090/health
+ kubectl -n apollo run curl-probe-portal --rm -i --restart=Never --image=curlimages/curl:8.7.1 -- \
+   curl -fsS --max-time 30 http://apollo-portal:8070/health
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/helm-validate.yml around lines 258 - 266, The kubectl run probe commands (curl-probe, curl-probe-admin, curl-probe-portal) lack an explicit timeout and can hang; update each kubectl run invocation to enforce a timeout (e.g., add the kubectl --timeout flag if supported or prefix the command with a shell timeout like timeout 60) so the probe fails fast and doesn't block the job; apply the same change to all three curl-probe commands (the ones invoking curl -fsS http://apollo-service-dev-apollo-configservice:8080/health, http://apollo-service-dev-apollo-adminservice:8090/health, and http://apollo-portal:8070/health).
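The two timeout layers mentioned in this comment can be combined in one helper. The following is a rough sketch, assuming GNU coreutils `timeout` is available on the runner; the function name `probe_health`, the `KUBECTL` override variable, and the 120-second outer limit are illustrative choices, not part of the workflow.

```shell
# KUBECTL can be overridden (for example with a stub) to exercise the helper
# without a cluster. This variable is an assumption made for the sketch.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: run a one-off curl pod against a health endpoint.
# curl's --max-time bounds the HTTP request itself; the outer `timeout`
# bounds the whole kubectl invocation, including image pull and scheduling.
probe_health() {
  local pod_name="$1"
  local url="$2"
  timeout 120 "$KUBECTL" run "$pod_name" -n apollo --rm -i --restart=Never \
    --image=curlimages/curl:8.7.1 -- \
    curl -fsS --max-time 30 "$url"
}
```

A call such as `probe_health curl-probe http://apollo-service-dev-apollo-configservice:8080/health` would stand in for one of the three probe commands. If the outer `timeout` fires, the probe pod may be left behind, so a cleanup step could still be needed.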
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/helm-validate.yml:
- Around line 251-256: The three sequential kubectl rollout status calls
(deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, deployment/apollo-portal)
each use --timeout=20m which can cumulatively exceed the GitHub Actions
45-minute job limit; reduce each timeout to something like --timeout=12m (or
another value that keeps the total well under 45m) so the step worst-case
doesn't trigger job cancellation, and/or change the diagnostics step condition
(the step after these rollouts) to use if: always() to ensure diagnostics run on
cancellation; update the timeout values or the diagnostics step condition
accordingly.
🧹 Nitpick comments (3)
.github/workflows/helm-validate.yml (3)
182-182: Root password test is hardcoded in three independent locations.
The MySQL install (line 182), apollo-service values (line 211), and apollo-portal values (line 234) all independently specify the password test. If one is changed without the others, the smoke test breaks silently (services fail to connect to MySQL, but the error manifests as a rollout timeout rather than a clear password-mismatch error).
Consider extracting it into a single job-level environment variable:
♻️ Suggested refactor
Define once at the job level:
env:
  MYSQL_ROOT_PASSWORD_CI: test
Then reference ${{ env.MYSQL_ROOT_PASSWORD_CI }} in the MySQL install --set auth.rootPassword=... and use envsubst or sed in the heredocs to substitute it into the values files.
Also applies to: 206-211, 229-234
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/helm-validate.yml at line 182, Extract the hardcoded MySQL root password into a single job-level environment variable (e.g., MYSQL_ROOT_PASSWORD_CI) in the workflow and replace all literal occurrences of "test" used for MySQL credentials (the `--set auth.rootPassword=test` in the MySQL install and the password fields in the apollo-service and apollo-portal values heredocs) to reference that variable; update the helm-validate job to define env: MYSQL_ROOT_PASSWORD_CI: test and change the three places to interpolate `${{ env.MYSQL_ROOT_PASSWORD_CI }}` or perform runtime substitution inside heredocs using envsubst/sed so all three consumers (MySQL install, apollo-service values, apollo-portal values) use the single source of truth.
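One way to get runtime substitution inside the heredocs is to leave the heredoc delimiter unquoted, so the shell expands the variable itself and neither envsubst nor sed is needed. The sketch below is illustrative only: the helper name `render_db_values`, the values layout (`host`, `userName`, `password` under a top-level key), and the MySQL host name are assumptions and should be checked against the charts' actual values files.

```shell
# Single source of truth; in the workflow this would come from the job-level env.
MYSQL_ROOT_PASSWORD_CI="${MYSQL_ROOT_PASSWORD_CI:-test}"

# Hypothetical helper: print smoke-test database values for one chart.
# The unquoted EOF delimiter lets the shell expand ${MYSQL_ROOT_PASSWORD_CI}.
# Argument: top-level values key (assumed to be configdb or portaldb).
render_db_values() {
  local db_key="$1"
  cat <<EOF
${db_key}:
  host: mysql.apollo.svc.cluster.local
  userName: root
  password: ${MYSQL_ROOT_PASSWORD_CI}
EOF
}
```

The output could be redirected into the values files passed to `helm install`, and the same variable passed to the MySQL install as `--set auth.rootPassword="$MYSQL_ROOT_PASSWORD_CI"`, so all three consumers read one value.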
173-185: MySQL installation looks solid for CI.
Disabling persistence and using an explicit root password for an ephemeral CI cluster is the right call. One minor note: image.repository=bitnamilegacy/mysql pins to the legacy image; it is worth a comment in the workflow to explain why this choice was made (presumably the current bitnami/mysql image dropped support for older MySQL versions or architectures), so future maintainers don't try to "upgrade" it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/helm-validate.yml around lines 173 - 185, Add an inline comment above the helm install step explaining why image.repository is set to "bitnamilegacy/mysql" (e.g., compatibility with older MySQL versions or specific architectures that the current bitnami/mysql image no longer supports) so future maintainers understand the intentional pin; reference the helm install block (helm install mysql bitnami/mysql) and the image.repository setting to make the rationale explicit and include a note that this is intentional for CI ephemeral clusters.
269-276: Good diagnostics step; consider adding describe for pending pods.
The get all, get events, and logs commands cover the most common failure modes. For cases where pods are stuck in Pending/CrashLoopBackOff, kubectl describe pod output is often more informative than logs alone.
💡 Optional addition
  kubectl -n apollo get all
  kubectl -n apollo get events --sort-by=.metadata.creationTimestamp
+ kubectl -n apollo describe pods || true
  kubectl -n apollo logs deployment/apollo-service-dev-apollo-configservice --tail=200 || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/helm-validate.yml around lines 269 - 276, Add a kubectl describe step to the existing "Dump diagnostics on failure" block so pod conditions and event/reason details are included; e.g., run kubectl -n apollo describe pod for the relevant pods (or for all pods, or specifically those in Pending/CrashLoopBackOff) in addition to the current kubectl -n apollo get all / get events / logs commands so the describe output augments logs and events for debugging.
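Taken together, the diagnostics suggestions in this thread (MySQL logs, describe output, and the existing get/logs commands) could be collected in one function. This is a sketch rather than the workflow's real step: the function name `dump_diagnostics` and the `KUBECTL` override are hypothetical, and the StatefulSet name `mysql` is assumed from the Helm release name.

```shell
# KUBECTL can be overridden (for example with a stub) to exercise the helper
# without a cluster. This variable is an assumption made for the sketch.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: collect the diagnostics discussed in the review.
# Each command ends with `|| true` so one failing call does not stop
# the remaining output from being collected.
dump_diagnostics() {
  local ns="${1:-apollo}"
  local deployment
  "$KUBECTL" get all -n "$ns" || true
  "$KUBECTL" get events -n "$ns" --sort-by=.metadata.creationTimestamp || true
  "$KUBECTL" describe pods -n "$ns" || true
  # StatefulSet name "mysql" is assumed from the Helm release name.
  "$KUBECTL" logs statefulset/mysql -n "$ns" --tail=200 || true
  for deployment in apollo-service-dev-apollo-configservice \
                    apollo-service-dev-apollo-adminservice \
                    apollo-portal; do
    "$KUBECTL" logs "deployment/${deployment}" -n "$ns" --tail=200 || true
  done
}
```

Keeping the commands in one function makes it easier to call the same dump from a step guarded by a non-success condition.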
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/helm-validate.yml:
- Around line 252-257: Change the three kubectl rollout status commands
(deployment/apollo-service-dev-apollo-configservice,
deployment/apollo-service-dev-apollo-adminservice, and deployment/apollo-portal)
to use a shorter timeout (e.g., --timeout=12m) instead of --timeout=20m so the
cumulative worst-case wait stays below the job-level 45-minute ceiling and
leaves headroom for the diagnostics step to run; update the three lines that
call kubectl -n apollo rollout status to use 12m (or any value ≤12m)
consistently.
What Changed
- Add a runtime-smoke job to .github/workflows/helm-validate.yml.
- Provision a cluster with kind.
- Install MySQL in the kind cluster.
- Initialize ApolloConfigDB and ApolloPortalDB using Apollo SQL scripts from the matching appVersion tag.
- Install apollo-service-dev and apollo-portal.
- Run /health probes for Config Service, Admin Service, and Portal.
- Add failure diagnostics (kubectl get all/events and deployment logs) to help debugging when smoke tests fail.
Why
Static checks (helm lint / helm template / kubeconform) validate manifests, but they do not verify runtime behavior.
This smoke test adds real in-cluster validation before release.
Verification