-
Notifications
You must be signed in to change notification settings - Fork 0
CI Hardening Phase 3: Deep analysis — post #699/#700/#701 #122
Description
CI Hardening Phase 3: Deep Analysis — Post bradygaster#699/bradygaster#700/bradygaster#701
Author: Flight (Lead)
Date: 2026-03-30
Scope: Every workflow, action, test, and operational pattern in bradygaster/squad CI
Refs: #121 (original 19-item analysis), bradygaster#699 (workarounds), bradygaster#700 (Phase 1), bradygaster#701 (Phase 2)
What's Been Done (Phases 1-2)
Phase 1 — PR bradygaster#700 (MERGED) ✅
5 quick wins implemented:
- Retry logic on npm install — 3-attempt wrapper on all npm install/ci calls
- Job timeout tuning — timeout-minutes on every job (3-15 min)
- npm cache optimization — cache npm on setup-node where missing
- Conditional docs quality — path filter skips docs-quality on code-only PRs
- Publish secret validation — fail-fast check for NPM_TOKEN
Phase 2 — PR bradygaster#701 (OPEN, pending merge) 🔄
5 short-term items + security hardening:
6. Lockfile stability — drift detection in preflight (prevents v0.9.1-style incidents)
7. Composite action — DRY setup-node + npm install wrapper (.github/actions/setup-squad-node/)
8. Ralph cron audit — documented "no cron" as intentional design decision
9. API rate limit monitoring — warns <100, fails <10 remaining (state-only logging for security)
10. npm registry health check — npm ping + namespace check before publish
Security fixes: shell injection prevention in composite action, lockfile integrity hash validation, rate limit count redaction from public logs.
PR bradygaster#699 (MERGED) ✅
Combined CI workarounds for @github/copilot-sdk ESM bug (vscode-jsonrpc/node module resolution).
Current CI Health Snapshot (2026-03-30)
| Metric | Value |
|---|---|
| Workflows | 15 active |
| Tests | 5,038+ passing across 196 test files |
| CI duration | ~4-5 min (main), ~2-3 min (publish) |
| Success rate (last 50) | Squad CI: 15/22 success, 4 action_required, 3 failure |
| Actions minutes | |
| Platforms tested | Ubuntu only |
| Node versions | 22 (main), 20 (insider) |
Phase 3: Deep Analysis — 30 Items Across 6 Categories
Category A: Reliability & Resilience
A1. Concurrency controls missing on most workflows
- Problem: Only
squad-docs.ymlhas concurrency settings. All other workflows can run multiple instances simultaneously — if 3 PRs push in quick succession, 3squad-ciruns compete for resources. - Impact: HIGH — resource waste, potential race conditions on label/triage workflows
- Effort: LOW (YAML-only)
- Fix: Add
concurrency: { group: '${{ github.workflow }}-${{ github.ref }}', cancel-in-progress: true }tosquad-ci.yml,squad-heartbeat.yml,squad-triage.yml,squad-label-enforce.yml,squad-issue-assign.yml - Priority: P1
- Owner: Booster
A2. Promote workflow uses dangerous merge strategy
- Problem:
squad-promote.ymluses--no-commit --no-ff -X theirsto merge dev→preview. This auto-resolves ALL conflicts by taking dev's version without human review. - Impact: HIGH — bad code can silently override preview/main
- Effort: MEDIUM
- Fix: Switch to
--no-ffwithout-X theirs. Fail on conflicts and require manual resolution. Add dry-run diff output showing exactly what would change. - Priority: P1
- Owner: Procedures + Flight
A3. No rollback mechanism for docs deployments
- Problem:
squad-docs.ymldeploys directly to GitHub Pages with no build output validation and no rollback. Broken docs go live instantly. - Impact: MEDIUM — user-facing docs broken until next push to main
- Effort: MEDIUM
- Fix: Add post-deploy smoke check (HTTP status on key pages). Store previous deployment artifact for rollback. Consider deployment environments with approval gates.
- Priority: P2
- Owner: PAO + Booster
A4. Insider publish creates inconsistent state on partial failure
- Problem:
squad-insider-publish.ymlpublishes SDK and CLI sequentially. If CLI publish fails after SDK succeeds, npm has mismatched SDK/CLI versions on theinsidertag. - Impact: MEDIUM — insider users get broken dependency graph
- Effort: MEDIUM
- Fix: Add post-publish verification step. If CLI fails, emit clear error with manual recovery instructions. Consider publish-both-or-neither pattern with npm unpublish fallback (risky but atomic).
- Priority: P2
- Owner: Surgeon + Booster
A5. Release/publish workflow coordination gap
- Problem:
squad-release.yml(push to main) andsquad-npm-publish.yml(release event) can trigger near-simultaneously. Release creates tag + GitHub Release, which triggers npm publish. No explicit handoff or dependency check. - Impact: MEDIUM — timing-sensitive; if release workflow slow, npm publish may start before tag exists
- Effort: LOW
- Fix: Add
workflow_runtrigger onsquad-npm-publish.ymlthat waits forsquad-release.ymlcompletion instead of relying on release event timing. - Priority: P2
- Owner: Booster
A6. Tag deduplication gaps in release workflows
- Problem: Both
squad-release.ymlandsquad-insider-release.ymlcreate tags. If a tag already exists,git pushfails.squad-release.ymlchecks for existing tags, butsquad-insider-release.ymldoes not. - Impact: LOW — insider releases would fail on re-push; not catastrophic
- Effort: LOW
- Fix: Add tag existence check to
squad-insider-release.ymlmatching the pattern insquad-release.yml. - Priority: P3
- Owner: Booster
Category B: Testing Depth
B1. No cross-platform CI (Windows/macOS)
- Problem: All CI runs on
ubuntu-latestonly. SDK/CLI published on npm for all platforms but never tested on Windows or macOS. Known platform-specific skips exist in tests (storage-provider, resolution, doctor). - Impact: HIGH — Windows CLI bugs ship undetected
- Effort: MEDIUM (2-3 hours)
- Fix: Create weekly
squad-cross-platform.ymlwith matrix:{os: [ubuntu-latest, windows-latest, macos-latest], node: [20, 22]}. Gate behindworkflow_dispatch+ weekly cron. Opt-in via repo variable. - Priority: P1
- Owner: Booster + FIDO
- Refs: Flaky tests under full-suite load + Docker-dependent tests need skip guard bradygaster/squad#582 (flaky tests under load), fix(test): add Docker skip guards + stabilize flaky tests under load bradygaster/squad#677 (Docker skip guards)
B2. Docker-dependent tests need skip guards
- Problem: Some tests require Docker (e.g., Aspire integration). These fail in environments without Docker, creating false failures.
- Impact: MEDIUM — blocks cross-platform CI (B1) since Windows/macOS runners may not have Docker
- Effort: LOW
- Fix: Already tracked as fix(test): add Docker skip guards + stabilize flaky tests under load bradygaster/squad#677. Add
describe.skipIf(!hasDocker)guards. Must ship before B1. - Priority: P1 (prerequisite for B1)
- Owner: FIDO
- Refs: fix(test): add Docker skip guards + stabilize flaky tests under load bradygaster/squad#677
B3. Flaky test quarantine system
- Problem: No mechanism to detect, track, or quarantine flaky tests. 3/22 recent Squad CI failures — unclear if flakes or real issues. Issue Flaky tests under full-suite load + Docker-dependent tests need skip guard bradygaster/squad#582 documents flaky tests under full-suite load.
- Impact: HIGH — flakes erode trust in CI, developers ignore failures
- Effort: HIGH (4-6 hours)
- Fix: Add vitest
--reporter=jsonoutput + post-test analysis script. Track per-test pass/fail across last N runs. Quarantine tests with >20% but <100% failure rate. Post flake report to PR comment. - Priority: P2
- Owner: FIDO + Booster
- Refs: Flaky tests under full-suite load + Docker-dependent tests need skip guard bradygaster/squad#582
B4. Acceptance tests (Gherkin BDD) not visible as separate CI gate
- Problem: 8 Gherkin feature files exist in
test/acceptance/with full step definitions and terminal harness. But they run as part of the monolithicnpm test— no visibility into acceptance test health vs unit test health. - Impact: LOW — informational gap, not a failure gap
- Effort: LOW
- Fix: Add separate vitest workspace config for acceptance tests. Report as distinct CI step in
squad-ci.yml. - Priority: P3
- Owner: FIDO
B5. Sample tests only run when SDK source changes
- Problem:
samples-buildjob insquad-ci.ymlis gated onpackages/squad-sdk/src/changes. If a sample's own code breaks (e.g., bad import), CI won't catch it unless SDK is also changed. - Impact: MEDIUM — sample rot goes undetected
- Effort: LOW
- Fix: Expand path filter to include
samples/**in trigger condition. Also add weekly cron run of samples-build. - Priority: P2
- Owner: Booster
B6. No test sharding for faster execution
- Problem: All 196 test files run in a single job (~4-5 min). As test count grows, this becomes a bottleneck.
- Impact: LOW (currently fast enough)
- Effort: MEDIUM
- Fix: Use vitest
--shard=1/Nflag with matrix strategy when test count exceeds threshold (~300 files). Not needed now. - Priority: P3 (future)
- Owner: Booster
B7. E2E CLI smoke test expansion
- Problem:
squad-npm-publish.ymlhas CLI packaging smoke test, but it only validatesnpm pack. No test of actual global install + command execution. - Impact: MEDIUM — global install breakage ships undetected
- Effort: LOW (1-2 hours)
- Fix: Extend smoke test to:
npm install -g @bradygaster/squad-cli,squad --version,squad init --help,squad status. - Priority: P2
- Owner: Booster + EECOM
Category C: Security & Supply Chain
C1. Third-party actions not pinned to SHAs
- Problem: All first-party actions (
actions/checkout,actions/setup-node) use@v4tags. Third-party actions insquad-docs-links.ymluse tags:lycheeverse/lychee-action@v2,peter-evans/create-issue-from-file@v5. Tags can be force-pushed by upstream maintainers. - Impact: HIGH — supply chain attack vector
- Effort: LOW
- Fix: Pin ALL actions (first-party and third-party) to full SHA commits. Use
npx pin-github-actionor Dependabot's action pinning feature. Example:actions/checkout@b4ffde65f...instead of@v4. - Priority: P1
- Owner: Booster
C2. No dependency review on PRs
- Problem: No Dependabot, Renovate, or
dependency-review-actionconfigured. New dependencies added via PR are not automatically audited for vulnerabilities or license issues. - Impact: HIGH — malicious/vulnerable transitive deps go unreviewed
- Effort: LOW
- Fix: Add
.github/dependabot.ymlfor automated PR creation on dependency updates. Addactions/dependency-review-action@v4to PR CI pipeline. Enable GitHub's built-in dependency graph. - Priority: P1
- Owner: Booster + FIDO
C3. No SAST/security scanning in CI
- Problem: No CodeQL, Snyk, SonarQube, or
npm auditin CI. TypeScript strict mode and ESLint are the only static analysis. - Impact: HIGH — security regressions, known CVEs in deps go undetected
- Effort: MEDIUM (3-4 hours)
- Fix: Add
npm audit --audit-level=moderateas CI gate. Add CodeQL analysis workflow for JavaScript/TypeScript. Considergithub/codeql-action@v3with auto-build. - Priority: P1
- Owner: FIDO + Booster
C4. No CODEOWNERS file
- Problem: No
.github/CODEOWNERS. PRs don't auto-assign reviewers based on file paths. Anyone can modify workflow files without required review from CI owners. - Impact: MEDIUM — workflow changes not automatically routed to right reviewer
- Effort: LOW (15 min)
- Fix: Create
.github/CODEOWNERSwith:.github/workflows/* @bradygaster,packages/squad-sdk/* @eecom-owner,packages/squad-cli/* @eecom-owner,docs/* @pao-owner. - Priority: P2
- Owner: Flight + Procedures
C5. Team.md parsing is a fragile injection surface
- Problem: 6 workflows parse
.squad/team.mdor.ai-team/team.mdusing string splitting on|. The parsed values are used in GitHub API calls (label names, comments, assignments). A malicious team.md edit could inject unexpected values. - Impact: MEDIUM — limited to repos with write access, but violates defense-in-depth
- Effort: MEDIUM
- Fix: Extract team.md parsing into a single composite action with input validation (A7 recommendation). Sanitize all parsed values before use in API calls. Add regex validation for member names.
- Priority: P2
- Owner: Booster + RETRO
C6. No license compliance scanning
- Problem: No automated check for license compatibility of npm dependencies. SDK/CLI are published packages — transitive dependency licenses matter.
- Impact: MEDIUM — legal compliance risk for downstream consumers
- Effort: LOW
- Fix: Add
license-checkerorlicenseecheck to CI. Maintain allowlist of approved licenses (MIT, Apache-2.0, ISC, BSD-*). - Priority: P3
- Owner: Procedures
Category D: Performance & Cost
D1. No build artifact caching between jobs
- Problem: Multiple jobs in
squad-ci.ymlrunnpm ciandnpm run buildindependently. Thetest,exports-map-check,export-smoke-test, andsamples-buildjobs all rebuild from scratch. - Impact: MEDIUM — ~2 min wasted per run on redundant builds
- Effort: MEDIUM
- Fix: Use
actions/cacheoractions/upload-artifact+actions/download-artifactto share build output between jobs. Build once in abuildjob, download in dependent jobs. - Priority: P2
- Owner: Booster
D2. Playwright browser install is unconditional
- Problem:
squad-ci.ymlalways installs Playwright browsers (~30s) even if no Playwright tests would run. - Impact: LOW — 30 seconds per run
- Effort: LOW
- Fix: Already identified in original analysis (item 2.4). Cache Playwright browsers. Skip install if only doc changes.
- Priority: P3
- Owner: Booster
D3. action_required conclusions need investigation
- Problem: 4 out of 22 recent Squad CI runs show
action_requiredconclusion. These are likely first-time fork contributors needing workflow approval. But the pattern creates noise in CI history. - Impact: LOW — informational
- Effort: LOW
- Fix: Document that
action_requiredis expected for fork PRs. Consider auto-approve for known contributors. Add to CI runbook. - Priority: P3
- Owner: Procedures
D4. Expand Node version matrix for insider builds
- Problem:
squad-insider-publish.ymluses a matrix strategy but only tests Node 20. Main CI only tests Node 22. Node 18 (LTS until April 2025) is untested. - Impact: MEDIUM — Node version incompatibilities ship undetected
- Effort: LOW
- Fix: Expand insider matrix to
[20, 22]. Add Node 22 validation to insider builds (currently only Node 20). Align withenginesfield (>=22.5.0). - Priority: P2
- Owner: Booster
Category E: Developer Experience
E1. No .nvmrc or .node-version file
- Problem:
package.jsonspecifiesengines.node: ">=22.5.0"but there's no.nvmrcor.node-versionfile. Contributors must readpackage.jsonto know which Node version to use. - Impact: LOW — friction for new contributors
- Effort: LOW (1 min)
- Fix: Create
.nvmrcwith22or.node-versionwith22.5.0. - Priority: P3
- Owner: Booster
E2. No local CI simulation (act config)
- Problem: Workflows have excellent inline documentation for running individual gates locally, but there's no way to run the full CI pipeline locally. No
actconfig, no Makefile. - Impact: MEDIUM — contributors push to test, wasting CI minutes and time
- Effort: MEDIUM (2-3 hours)
- Fix: Add
Makefileorscripts/ci-local.shthat runs all CI gates in sequence: build, test, lint, changelog check, exports check, samples build. Optional: add.actrcfor GitHub Actions local runner. - Priority: P2
- Owner: Booster + EECOM
E3. CI failure messages could be more actionable
- Problem: Some CI gates output generic failure messages. For example, the large deletion guard just says "PR deletes more than 50 files" without showing which files. The exports map check doesn't show which export is missing.
- Impact: LOW — developers can dig into logs, but friction is real
- Effort: LOW
- Fix: Improve error messages in each gate to include: what failed, what the expected state is, and how to fix it. Use
::error file=...annotations for file-specific failures. - Priority: P3
- Owner: Booster
E4. Status check names not consistently prefixed
- Problem: CI jobs have names like
test,docs-quality,changelog-gate— but in the GitHub Checks UI, they appear under the workflow name "Squad CI". The individual job names aren't immediately recognizable in the PR status list. - Impact: LOW — cosmetic
- Effort: LOW
- Fix: Prefix job names consistently: "Squad CI / Build & Test", "Squad CI / Changelog Gate", etc. This makes the Checks list scannable.
- Priority: P3
- Owner: Booster
Category F: Observability & Monitoring
F1. No CI metrics collection
- Problem: No tracking of CI pass rate, duration trends, or flake rate over time. Can't answer "is CI getting slower?" or "what's our flake rate this month?"
- Impact: MEDIUM — can't detect gradual degradation
- Effort: MEDIUM (3-4 hours)
- Fix: Add post-run step that writes metrics to a JSON artifact. Create weekly cron workflow that aggregates metrics via
gh run listAPI and posts summary to a discussion or issue. - Priority: P2
- Owner: Booster + Flight
F2. No alerts for CI degradation
- Problem: If CI starts failing consistently, no one is notified. The team discovers failures when they check GitHub.
- Impact: MEDIUM — delayed response to CI breakage
- Effort: LOW
- Fix: Add a daily/weekly cron workflow that checks recent CI pass rate via
gh run list. If pass rate drops below 80%, create an issue or send notification. Could integrate with existing Ralph heartbeat. - Priority: P2
- Owner: Booster + Ralph
F3. No scheduled link checking
- Problem:
squad-docs-links.ymlis manual-only. External links rot silently. - Impact: LOW — docs quality degrades over time
- Effort: LOW (5 min)
- Fix: Add weekly cron trigger to
squad-docs-links.yml:schedule: [{cron: '0 9 * * 1'}]. - Priority: P3
- Owner: Booster
F4. No build time regression detection
- Problem: No tracking of build/test duration. If a PR adds 60 seconds to test time, no one notices until it accumulates.
- Impact: LOW (CI is fast now)
- Effort: MEDIUM
- Fix: Record build/test duration in CI artifacts. Add threshold alert if total CI time exceeds 8 minutes (currently ~4-5 min). Post warning to PR if duration increases >20%.
- Priority: P3 (future)
- Owner: Booster
Summary Table
| # | Category | Item | Impact | Effort | Priority | Owner |
|---|---|---|---|---|---|---|
| A1 | Reliability | Concurrency controls | HIGH | LOW | P1 | Booster |
| A2 | Reliability | Promote merge strategy fix | HIGH | MEDIUM | P1 | Procedures + Flight |
| A3 | Reliability | Docs deployment rollback | MEDIUM | MEDIUM | P2 | PAO + Booster |
| A4 | Reliability | Insider publish atomicity | MEDIUM | MEDIUM | P2 | Surgeon + Booster |
| A5 | Reliability | Release/publish coordination | MEDIUM | LOW | P2 | Booster |
| A6 | Reliability | Insider tag deduplication | LOW | LOW | P3 | Booster |
| B1 | Testing | Cross-platform CI | HIGH | MEDIUM | P1 | Booster + FIDO |
| B2 | Testing | Docker skip guards | MEDIUM | LOW | P1 | FIDO |
| B3 | Testing | Flaky test quarantine | HIGH | HIGH | P2 | FIDO + Booster |
| B4 | Testing | Acceptance test visibility | LOW | LOW | P3 | FIDO |
| B5 | Testing | Sample test trigger expansion | MEDIUM | LOW | P2 | Booster |
| B6 | Testing | Test sharding | LOW | MEDIUM | P3 | Booster |
| B7 | Testing | E2E CLI smoke expansion | MEDIUM | LOW | P2 | Booster + EECOM |
| C1 | Security | SHA-pin all actions | HIGH | LOW | P1 | Booster |
| C2 | Security | Dependency review on PRs | HIGH | LOW | P1 | Booster + FIDO |
| C3 | Security | SAST/security scanning | HIGH | MEDIUM | P1 | FIDO + Booster |
| C4 | Security | CODEOWNERS file | MEDIUM | LOW | P2 | Flight + Procedures |
| C5 | Security | Team.md parsing hardening | MEDIUM | MEDIUM | P2 | Booster + RETRO |
| C6 | Security | License compliance | MEDIUM | LOW | P3 | Procedures |
| D1 | Performance | Build artifact caching | MEDIUM | MEDIUM | P2 | Booster |
| D2 | Performance | Conditional Playwright install | LOW | LOW | P3 | Booster |
| D3 | Performance | Document action_required | LOW | LOW | P3 | Procedures |
| D4 | Performance | Node version matrix expansion | MEDIUM | LOW | P2 | Booster |
| E1 | DX | .nvmrc file | LOW | LOW | P3 | Booster |
| E2 | DX | Local CI simulation | MEDIUM | MEDIUM | P2 | Booster + EECOM |
| E3 | DX | Actionable failure messages | LOW | LOW | P3 | Booster |
| E4 | DX | Status check name consistency | LOW | LOW | P3 | Booster |
| F1 | Observability | CI metrics collection | MEDIUM | MEDIUM | P2 | Booster + Flight |
| F2 | Observability | CI degradation alerts | MEDIUM | LOW | P2 | Booster + Ralph |
| F3 | Observability | Scheduled link checking | LOW | LOW | P3 | Booster |
| F4 | Observability | Build time regression detection | LOW | MEDIUM | P3 | Booster |
Recommended Implementation Order
Wave 1: Security + Reliability Foundation (P1, ~6-8 hours)
Ship these first — they prevent real incidents:
- C1 SHA-pin all actions (30 min) — supply chain protection
- C2 Dependency review on PRs (1 hour) — add dependabot.yml + dependency-review-action
- C3 npm audit + CodeQL (3-4 hours) — security scanning baseline
- A1 Concurrency controls (30 min) — prevent race conditions
- A2 Fix promote merge strategy (1-2 hours) — prevent silent conflict resolution
- B2 Docker skip guards (prerequisite for B1) — already tracked as fix(test): add Docker skip guards + stabilize flaky tests under load bradygaster/squad#677
Wave 2: Testing Coverage + DX (P1-P2, ~8-10 hours)
Ship after Wave 1 — expands what CI catches:
- B1 Cross-platform CI (2-3 hours) — requires B2 first
- B5 Sample test trigger expansion (30 min)
- B7 E2E CLI smoke expansion (1-2 hours)
- E2 Local CI simulation (2-3 hours)
- C4 CODEOWNERS file (15 min)
Wave 3: Observability + Polish (P2-P3, ~6-8 hours)
Ship when bandwidth allows:
- F1 CI metrics collection (3-4 hours)
- F2 CI degradation alerts (1 hour)
- B3 Flaky test quarantine (4-6 hours)
- D1 Build artifact caching (2 hours)
- A3 Docs deployment rollback (2 hours)
- A5 Release/publish coordination (1 hour)
Backlog (P3, defer until needed):
18-30. All P3 items — implement when relevant or as quick wins during other work.
Agent Routing Summary
| Agent | Items | Rationale |
|---|---|---|
| Booster | A1, A3-A6, B1, B5-B7, C1-C2, D1-D4, E1-E4, F1-F4 | CI/DevOps engineer — owns workflow files |
| FIDO | B1-B4, C2-C3 | Quality/Reliability — owns test infrastructure |
| Procedures | A2, C4, C6, D3 | Governance + process documentation |
| Flight | A2, C4, F1 | Architecture decisions, CODEOWNERS routing |
| RETRO | C5 | Security review of parsing surfaces |
| PAO | A3 | Docs deployment ownership |
| Surgeon | A4 | Publish pipeline ownership |
| EECOM | B7, E2 | SDK/CLI integration testing |
| Ralph | F2 | Heartbeat integration for CI alerts |
Notes
- feat(ci): CI hardening phase 2 — items 6-10 + security hardening bradygaster/squad#701 is still open — once merged, Phase 2 items (6-10) are complete. This issue covers everything beyond that.
- Issue Flaky tests under full-suite load + Docker-dependent tests need skip guard bradygaster/squad#582 (flaky tests under load) and fix(test): add Docker skip guards + stabilize flaky tests under load bradygaster/squad#677 (Docker skip guards) are prerequisites for several items above. Prioritize accordingly.
- Cost is not a concern — 145 min/week is well within GitHub's free tier. Cross-platform CI (Wave 2) will add ~50-100 min/week, still manageable.
- The original 19-item analysis lives at
docs/proposals/ci-hardening-opportunities.mdin the bradygaster/squad repo. This Phase 3 goes deeper with 30 new items. - All items marked "Booster" — this is the CI/DevOps agent role. If no Booster exists, route to whoever owns
.github/workflows/.
Filed by Flight. Dina: route from Wave 1 down. Each wave is independently shippable.