CI Hardening Phase 3: Deep analysis — post #699/#700/#701 #122

@diberry

CI Hardening Phase 3: Deep Analysis — Post bradygaster#699/bradygaster#700/bradygaster#701

Author: Flight (Lead)
Date: 2026-03-30
Scope: Every workflow, action, test, and operational pattern in bradygaster/squad CI
Refs: #121 (original 19-item analysis), bradygaster#699 (workarounds), bradygaster#700 (Phase 1), bradygaster#701 (Phase 2)


What's Been Done (Phases 1-2)

Phase 1 — PR bradygaster#700 (MERGED) ✅

5 quick wins implemented:

  1. Retry logic on npm install — 3-attempt wrapper on all npm install/ci calls
  2. Job timeout tuning — timeout-minutes on every job (3-15 min)
  3. npm cache optimization — cache npm on setup-node where missing
  4. Conditional docs quality — path filter skips docs-quality on code-only PRs
  5. Publish secret validation — fail-fast check for NPM_TOKEN

Phase 2 — PR bradygaster#701 (OPEN, pending merge) 🔄

5 short-term items + security hardening:
6. Lockfile stability — drift detection in preflight (prevents v0.9.1-style incidents)
7. Composite action — DRY setup-node + npm install wrapper (.github/actions/setup-squad-node/)
8. Ralph cron audit — documented "no cron" as intentional design decision
9. API rate limit monitoring — warns <100, fails <10 remaining (state-only logging for security)
10. npm registry health check: npm ping + namespace check before publish

Security fixes: shell injection prevention in composite action, lockfile integrity hash validation, rate limit count redaction from public logs.

PR bradygaster#699 (MERGED) ✅

Combined CI workarounds for @github/copilot-sdk ESM bug (vscode-jsonrpc/node module resolution).


Current CI Health Snapshot (2026-03-30)

Metric | Value
Workflows | 15 active
Tests | 5,038+ passing across 196 test files
CI duration | ~4-5 min (main), ~2-3 min (publish)
Success rate (last 50) | Squad CI: 15/22 success, 4 action_required, 3 failure
Actions minutes | 145 min/week ($0.35/month)
Platforms tested | Ubuntu only
Node versions | 22 (main), 20 (insider)

Phase 3 Deep Analysis: 31 Items Across 6 Categories

Category A: Reliability & Resilience

A1. Concurrency controls missing on most workflows

  • Problem: Only squad-docs.yml has concurrency settings. All other workflows can run multiple instances simultaneously: if 3 pushes land on the same PR branch in quick succession, 3 squad-ci runs compete for resources.
  • Impact: HIGH — resource waste, potential race conditions on label/triage workflows
  • Effort: LOW (YAML-only)
  • Fix: Add concurrency: { group: '${{ github.workflow }}-${{ github.ref }}', cancel-in-progress: true } to squad-ci.yml, squad-heartbeat.yml, squad-triage.yml, squad-label-enforce.yml, squad-issue-assign.yml
  • Priority: P1
  • Owner: Booster
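
A minimal sketch of the concurrency block. The first fragment is the fix as written above and suits push/pull_request workflows. The second is a suggested variant, not something from the current workflows: on `issues` events `github.ref` resolves to the default branch, so a workflow+ref group would put every issue in the same group. It assumes squad-triage.yml (and similar) run on `issues` events.

```yaml
# Fragment for squad-ci.yml (push / pull_request triggers):
# a newer run on the same ref cancels the older one.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
---
# Fragment for an issue-triggered workflow (assumption: triggered by `issues` events).
# Group per issue and let the in-progress run finish instead of cancelling it.
concurrency:
  group: ${{ github.workflow }}-${{ github.event.issue.number || github.ref }}
  cancel-in-progress: false
```

Worth verifying per workflow which key is appropriate before rolling this out to all five files.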

A2. Promote workflow uses dangerous merge strategy

  • Problem: squad-promote.yml uses --no-commit --no-ff -X theirs to merge dev→preview. This auto-resolves ALL conflicts by taking dev's version without human review.
  • Impact: HIGH — bad code can silently override preview/main
  • Effort: MEDIUM
  • Fix: Switch to --no-ff without -X theirs. Fail on conflicts and require manual resolution. Add dry-run diff output showing exactly what would change.
  • Priority: P1
  • Owner: Procedures + Flight
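
A rough sketch of what the promote step could look like without -X theirs. The step name, commit message, and error text are illustrative. It assumes the job checked out with fetch-depth: 0 and that git user.name / user.email are already configured earlier in the job.

```yaml
# Sketch of a promote step for squad-promote.yml (not the current implementation).
- name: Merge dev into preview (fail on conflict)
  run: |
    git fetch origin dev preview
    git checkout -B preview origin/preview
    # Dry run: list what dev would bring into preview.
    git diff --stat HEAD...origin/dev
    # No "-X theirs": any conflict makes the merge exit non-zero.
    if ! git merge --no-ff -m "Promote dev to preview" origin/dev; then
      echo "::error::dev conflicts with preview; resolve manually and re-run."
      git merge --abort
      exit 1
    fi
```

The diff output lands in the job log, so a reviewer can see what would change even when the merge is clean.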

A3. No rollback mechanism for docs deployments

  • Problem: squad-docs.yml deploys directly to GitHub Pages with no build output validation and no rollback. Broken docs go live instantly.
  • Impact: MEDIUM — user-facing docs broken until next push to main
  • Effort: MEDIUM
  • Fix: Add post-deploy smoke check (HTTP status on key pages). Store previous deployment artifact for rollback. Consider deployment environments with approval gates.
  • Priority: P2
  • Owner: PAO + Booster

A4. Insider publish creates inconsistent state on partial failure

  • Problem: squad-insider-publish.yml publishes SDK and CLI sequentially. If CLI publish fails after SDK succeeds, npm has mismatched SDK/CLI versions on the insider tag.
  • Impact: MEDIUM — insider users get broken dependency graph
  • Effort: MEDIUM
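  • Fix: Add post-publish verification step. If CLI fails, emit clear error with manual recovery instructions. Consider publish-both-or-neither pattern with npm unpublish fallback (risky, and not truly atomic: npm restricts unpublish, and consumers can install during the window between the two publishes).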
  • Priority: P2
  • Owner: Surgeon + Booster

A5. Release/publish workflow coordination gap

  • Problem: squad-release.yml (push to main) and squad-npm-publish.yml (release event) can trigger near-simultaneously. Release creates tag + GitHub Release, which triggers npm publish. No explicit handoff or dependency check.
  • Impact: MEDIUM (timing-sensitive; if the release workflow is slow, npm publish may start before the tag exists)
  • Effort: LOW
  • Fix: Add workflow_run trigger on squad-npm-publish.yml that waits for squad-release.yml completion instead of relying on release event timing.
  • Priority: P2
  • Owner: Booster
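
A sketch of the workflow_run handoff. The workflow name "Squad Release" is an assumption and must match the `name:` field of squad-release.yml; the job body is a placeholder.

```yaml
# Fragment for squad-npm-publish.yml.
on:
  workflow_run:
    workflows: ["Squad Release"]   # assumed name of squad-release.yml
    types: [completed]

jobs:
  publish:
    # workflow_run also fires when the upstream run failed, so gate on its conclusion.
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - run: echo "Release workflow finished successfully; publish steps go here."
```

Note that workflow_run only triggers from a workflow file that exists on the default branch.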

A6. Tag deduplication gaps in release workflows

  • Problem: Both squad-release.yml and squad-insider-release.yml create tags. If a tag already exists, git push fails. squad-release.yml checks for existing tags, but squad-insider-release.yml does not.
  • Impact: LOW — insider releases would fail on re-push; not catastrophic
  • Effort: LOW
  • Fix: Add tag existence check to squad-insider-release.yml matching the pattern in squad-release.yml.
  • Priority: P3
  • Owner: Booster
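
One possible shape for the tag existence check. This is not copied from squad-release.yml, whose actual pattern is not shown in this issue; the `steps.version.outputs.tag` reference is hypothetical.

```yaml
# Sketch for squad-insider-release.yml.
- name: Check whether the tag already exists
  id: tag_check
  env:
    TAG: ${{ steps.version.outputs.tag }}   # hypothetical step output
  run: |
    # --exit-code makes ls-remote return non-zero when no matching ref is found.
    if git ls-remote --exit-code --tags origin "refs/tags/${TAG}" >/dev/null; then
      echo "exists=true" >> "$GITHUB_OUTPUT"
    else
      echo "exists=false" >> "$GITHUB_OUTPUT"
    fi

- name: Create and push tag
  if: steps.tag_check.outputs.exists == 'false'
  env:
    TAG: ${{ steps.version.outputs.tag }}   # hypothetical step output
  run: |
    git tag "${TAG}"
    git push origin "${TAG}"
```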

Category B: Testing Depth

B1. No cross-platform CI (Windows/macOS)

B2. Docker-dependent tests need skip guards

B3. Flaky test quarantine system

B4. Acceptance tests (Gherkin BDD) not visible as separate CI gate

  • Problem: 8 Gherkin feature files exist in test/acceptance/ with full step definitions and terminal harness. But they run as part of the monolithic npm test — no visibility into acceptance test health vs unit test health.
  • Impact: LOW — informational gap, not a failure gap
  • Effort: LOW
  • Fix: Add separate vitest workspace config for acceptance tests. Report as distinct CI step in squad-ci.yml.
  • Priority: P3
  • Owner: FIDO

B5. Sample tests only run when SDK source changes

  • Problem: samples-build job in squad-ci.yml is gated on packages/squad-sdk/src/ changes. If a sample's own code breaks (e.g., bad import), CI won't catch it unless SDK is also changed.
  • Impact: MEDIUM — sample rot goes undetected
  • Effort: LOW
  • Fix: Expand path filter to include samples/** in trigger condition. Also add weekly cron run of samples-build.
  • Priority: P2
  • Owner: Booster

B6. No test sharding for faster execution

  • Problem: All 196 test files run in a single job (~4-5 min). As test count grows, this becomes a bottleneck.
  • Impact: LOW (currently fast enough)
  • Effort: MEDIUM
  • Fix: Use vitest --shard=1/N flag with matrix strategy when test count exceeds threshold (~300 files). Not needed now.
  • Priority: P3 (future)
  • Owner: Booster
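
For when the threshold is reached, a sketch of the shard matrix. The shard count of 3 is arbitrary, and the sketch assumes tests can run straight after npm ci (add a build step if they need build output).

```yaml
# Sketch of a sharded test job for squad-ci.yml.
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2, 3]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      # Each matrix job runs one third of the test files.
      - run: npx vitest run --shard=${{ matrix.shard }}/3
```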

B7. E2E CLI smoke test expansion

  • Problem: squad-npm-publish.yml has CLI packaging smoke test, but it only validates npm pack. No test of actual global install + command execution.
  • Impact: MEDIUM — global install breakage ships undetected
  • Effort: LOW (1-2 hours)
  • Fix: Extend smoke test to: npm install -g @bradygaster/squad-cli, squad --version, squad init --help, squad status.
  • Priority: P2
  • Owner: Booster + EECOM

Category C: Security & Supply Chain

C1. Third-party actions not pinned to SHAs

  • Problem: All first-party actions (actions/checkout, actions/setup-node) use @v4 tags. Third-party actions in squad-docs-links.yml use tags: lycheeverse/lychee-action@v2, peter-evans/create-issue-from-file@v5. Tags can be force-pushed by upstream maintainers.
  • Impact: HIGH — supply chain attack vector
  • Effort: LOW
  • Fix: Pin ALL actions (first-party and third-party) to full SHA commits. Use npx pin-github-action to do the initial pinning, and Dependabot's github-actions ecosystem to keep the pinned SHAs up to date. Example: actions/checkout@b4ffde65f... instead of @v4.
  • Priority: P1
  • Owner: Booster

C2. No dependency review on PRs

  • Problem: No Dependabot, Renovate, or dependency-review-action configured. New dependencies added via PR are not automatically audited for vulnerabilities or license issues.
  • Impact: HIGH — malicious/vulnerable transitive deps go unreviewed
  • Effort: LOW
  • Fix: Add .github/dependabot.yml for automated PR creation on dependency updates. Add actions/dependency-review-action@v4 to PR CI pipeline. Enable GitHub's built-in dependency graph.
  • Priority: P1
  • Owner: Booster + FIDO
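
A minimal sketch of both files. The weekly interval, the workflow file name, and the `moderate` severity threshold are assumptions to tune, not decisions made in this issue.

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: npm
    directory: "/"
    schedule:
      interval: weekly
  - package-ecosystem: github-actions
    directory: "/"
    schedule:
      interval: weekly
---
# .github/workflows/dependency-review.yml (hypothetical file name)
name: Dependency Review
on: pull_request
permissions:
  contents: read
jobs:
  dependency-review:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/dependency-review-action@v4
        with:
          fail-on-severity: moderate
```

The github-actions entry also helps with C1 once actions are SHA-pinned.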

C3. No SAST/security scanning in CI

  • Problem: No CodeQL, Snyk, SonarQube, or npm audit in CI. TypeScript strict mode and ESLint are the only static analysis.
  • Impact: HIGH — security regressions, known CVEs in deps go undetected
  • Effort: MEDIUM (3-4 hours)
  • Fix: Add npm audit --audit-level=moderate as CI gate. Add CodeQL analysis workflow for JavaScript/TypeScript. Consider github/codeql-action@v3 with auto-build.
  • Priority: P1
  • Owner: FIDO + Booster
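
A sketch of a scanning baseline, assuming a new standalone workflow (name and triggers are illustrative). JavaScript/TypeScript is analyzed from source, so the sketch has no build step.

```yaml
# Hypothetical .github/workflows/security-scan.yml
name: Security Scan
on:
  pull_request:
  push:
    branches: [main]
permissions:
  contents: read
  security-events: write   # needed to upload CodeQL results
jobs:
  npm-audit:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      # Fails the job on moderate or higher advisories.
      - run: npm audit --audit-level=moderate
  codeql:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: github/codeql-action/init@v3
        with:
          languages: javascript-typescript
      - uses: github/codeql-action/analyze@v3
```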

C4. No CODEOWNERS file

  • Problem: No .github/CODEOWNERS. PRs don't auto-assign reviewers based on file paths. Anyone can modify workflow files without required review from CI owners.
  • Impact: MEDIUM — workflow changes not automatically routed to right reviewer
  • Effort: LOW (15 min)
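  • Fix: Create .github/CODEOWNERS with: .github/workflows/ @bradygaster, packages/squad-sdk/ @eecom-owner, packages/squad-cli/ @eecom-owner, docs/ @pao-owner. Use trailing-slash directory patterns rather than dir/*, which matches only files directly inside the directory and not nested ones.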
  • Priority: P2
  • Owner: Flight + Procedures

C5. Team.md parsing is a fragile injection surface

  • Problem: 6 workflows parse .squad/team.md or .ai-team/team.md using string splitting on |. The parsed values are used in GitHub API calls (label names, comments, assignments). A malicious team.md edit could inject unexpected values.
  • Impact: MEDIUM — limited to repos with write access, but violates defense-in-depth
  • Effort: MEDIUM
  • Fix: Extract team.md parsing into a single composite action with input validation (A7 recommendation). Sanitize all parsed values before use in API calls. Add regex validation for member names.
  • Priority: P2
  • Owner: Booster + RETRO

C6. No license compliance scanning

  • Problem: No automated check for license compatibility of npm dependencies. SDK/CLI are published packages — transitive dependency licenses matter.
  • Impact: MEDIUM — legal compliance risk for downstream consumers
  • Effort: LOW
  • Fix: Add license-checker or licensee check to CI. Maintain allowlist of approved licenses (MIT, Apache-2.0, ISC, BSD-*).
  • Priority: P3
  • Owner: Procedures

Category D: Performance & Cost

D1. No build artifact caching between jobs

  • Problem: Multiple jobs in squad-ci.yml run npm ci and npm run build independently. The test, exports-map-check, export-smoke-test, and samples-build jobs all rebuild from scratch.
  • Impact: MEDIUM — ~2 min wasted per run on redundant builds
  • Effort: MEDIUM
  • Fix: Use actions/cache or actions/upload-artifact + actions/download-artifact to share build output between jobs. Build once in a build job, download in dependent jobs.
  • Priority: P2
  • Owner: Booster
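
A sketch of the build-once pattern. The output path packages/*/dist and the artifact name are assumptions about the repo layout; only one dependent job is shown.

```yaml
# Sketch for squad-ci.yml: build in one job, reuse the output downstream.
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: packages/*/dist   # assumed build output location
          retention-days: 1
  test:
    needs: build
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      # Paths after the wildcard are preserved, so this restores packages/<name>/dist.
      - uses: actions/download-artifact@v4
        with:
          name: build-output
          path: packages
      - run: npm test
```

Dependent jobs still run npm ci for node_modules; only the build step is saved.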

D2. Playwright browser install is unconditional

  • Problem: squad-ci.yml always installs Playwright browsers (~30s) even if no Playwright tests would run.
  • Impact: LOW — 30 seconds per run
  • Effort: LOW
  • Fix: Already identified in original analysis (item 2.4). Cache Playwright browsers. Skip install if only doc changes.
  • Priority: P3
  • Owner: Booster

D3. action_required conclusions need investigation

  • Problem: 4 out of 22 recent Squad CI runs show action_required conclusion. These are likely first-time fork contributors needing workflow approval. But the pattern creates noise in CI history.
  • Impact: LOW — informational
  • Effort: LOW
  • Fix: Document that action_required is expected for fork PRs. Consider auto-approve for known contributors. Add to CI runbook.
  • Priority: P3
  • Owner: Procedures

D4. Expand Node version matrix for insider builds

  • Problem: squad-insider-publish.yml uses a matrix strategy but only tests Node 20. Main CI only tests Node 22. Node 18 (end-of-life since April 2025) is untested.
  • Impact: MEDIUM — Node version incompatibilities ship undetected
  • Effort: LOW
  • Fix: Expand insider matrix to [20, 22] so insider builds are also validated on Node 22 (currently only Node 20). Align with engines field (>=22.5.0); note that Node 20 is below that floor.
  • Priority: P2
  • Owner: Booster
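
A sketch of the expanded matrix. The job name and the steps after setup are placeholders, not the current contents of squad-insider-publish.yml.

```yaml
# Sketch for the insider validation job.
jobs:
  validate:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    strategy:
      fail-fast: false   # report both Node versions even if one fails
      matrix:
        node-version: [20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: npm
      - run: npm ci
      - run: npm test
```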

Category E: Developer Experience

E1. No .nvmrc or .node-version file

  • Problem: package.json specifies engines.node: ">=22.5.0" but there's no .nvmrc or .node-version file. Contributors must read package.json to know which Node version to use.
  • Impact: LOW — friction for new contributors
  • Effort: LOW (1 min)
  • Fix: Create .nvmrc with 22 or .node-version with 22.5.0.
  • Priority: P3
  • Owner: Booster

E2. No local CI simulation (act config)

  • Problem: Workflows have excellent inline documentation for running individual gates locally, but there's no way to run the full CI pipeline locally. No act config, no Makefile.
  • Impact: MEDIUM — contributors push to test, wasting CI minutes and time
  • Effort: MEDIUM (2-3 hours)
  • Fix: Add Makefile or scripts/ci-local.sh that runs all CI gates in sequence: build, test, lint, changelog check, exports check, samples build. Optional: add .actrc for GitHub Actions local runner.
  • Priority: P2
  • Owner: Booster + EECOM

E3. CI failure messages could be more actionable

  • Problem: Some CI gates output generic failure messages. For example, the large deletion guard just says "PR deletes more than 50 files" without showing which files. The exports map check doesn't show which export is missing.
  • Impact: LOW — developers can dig into logs, but friction is real
  • Effort: LOW
  • Fix: Improve error messages in each gate to include: what failed, what the expected state is, and how to fix it. Use ::error file=... annotations for file-specific failures.
  • Priority: P3
  • Owner: Booster
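
As an illustration, a more actionable version of the large deletion guard might look like the sketch below. The real guard's implementation is not shown in this issue, so the diff command and message wording are assumptions. It assumes a pull_request trigger and a checkout with fetch-depth: 0.

```yaml
# Sketch only: not the current guard.
- name: Large deletion guard
  env:
    BASE_REF: ${{ github.base_ref }}
  run: |
    DELETED="$(git diff --name-only --diff-filter=D "origin/${BASE_REF}...HEAD")"
    # grep -c exits non-zero on zero matches, hence the "|| true".
    COUNT="$(printf '%s\n' "$DELETED" | grep -c . || true)"
    if [ "$COUNT" -gt 50 ]; then
      # State what failed, what was expected, and how to fix it.
      echo "::error::PR deletes ${COUNT} files (limit: 50). Split the PR or restore files deleted by mistake."
      echo "Deleted files:"
      printf '%s\n' "$DELETED"
      exit 1
    fi
```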

E4. Status check names not consistently prefixed

  • Problem: CI jobs have names like test, docs-quality, changelog-gate — but in the GitHub Checks UI, they appear under the workflow name "Squad CI". The individual job names aren't immediately recognizable in the PR status list.
  • Impact: LOW — cosmetic
  • Effort: LOW
  • Fix: Prefix job names consistently: "Squad CI / Build & Test", "Squad CI / Changelog Gate", etc. This makes the Checks list scannable.
  • Priority: P3
  • Owner: Booster

Category F: Observability & Monitoring

F1. No CI metrics collection

  • Problem: No tracking of CI pass rate, duration trends, or flake rate over time. Can't answer "is CI getting slower?" or "what's our flake rate this month?"
  • Impact: MEDIUM — can't detect gradual degradation
  • Effort: MEDIUM (3-4 hours)
  • Fix: Add post-run step that writes metrics to a JSON artifact. Create weekly cron workflow that aggregates metrics via gh run list API and posts summary to a discussion or issue.
  • Priority: P2
  • Owner: Booster + Flight

F2. No alerts for CI degradation

  • Problem: If CI starts failing consistently, no one is notified. The team discovers failures when they check GitHub.
  • Impact: MEDIUM — delayed response to CI breakage
  • Effort: LOW
  • Fix: Add a daily/weekly cron workflow that checks recent CI pass rate via gh run list. If pass rate drops below 80%, create an issue or send notification. Could integrate with existing Ralph heartbeat.
  • Priority: P2
  • Owner: Booster + Ralph
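
A sketch of the pass-rate check. The workflow name, schedule, and issue text are illustrative. It counts only success and failure conclusions over the last 50 completed runs, which is a design choice: action_required runs (see D3) are left out of the denominator.

```yaml
# Hypothetical .github/workflows/ci-health.yml
name: CI Health Check
on:
  schedule:
    - cron: '0 8 * * *'
  workflow_dispatch:
permissions:
  actions: read
  issues: write
jobs:
  pass-rate:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    env:
      GH_TOKEN: ${{ github.token }}
      GH_REPO: ${{ github.repository }}
    steps:
      - name: Compute pass rate and alert below 80%
        run: |
          RATE="$(gh run list --workflow squad-ci.yml --status completed --limit 50 \
            --json conclusion \
            --jq '[.[] | select(.conclusion == "success" or .conclusion == "failure")]
                  | if length == 0 then 100
                    else ([.[] | select(.conclusion == "success")] | length) * 100 / length | floor
                    end')"
          echo "Pass rate: ${RATE}%"
          if [ "$RATE" -lt 80 ]; then
            gh issue create \
              --title "CI pass rate dropped to ${RATE}%" \
              --body "Squad CI pass rate over recent completed runs is below the 80% threshold."
          fi
```

A real version would likely deduplicate against an already-open alert issue before creating a new one.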

F3. No scheduled link checking

  • Problem: squad-docs-links.yml is manual-only. External links rot silently.
  • Impact: LOW — docs quality degrades over time
  • Effort: LOW (5 min)
  • Fix: Add weekly cron trigger to squad-docs-links.yml: schedule: [{cron: '0 9 * * 1'}].
  • Priority: P3
  • Owner: Booster

F4. No build time regression detection

  • Problem: No tracking of build/test duration. If a PR adds 60 seconds to test time, no one notices until it accumulates.
  • Impact: LOW (CI is fast now)
  • Effort: MEDIUM
  • Fix: Record build/test duration in CI artifacts. Add threshold alert if total CI time exceeds 8 minutes (currently ~4-5 min). Post warning to PR if duration increases >20%.
  • Priority: P3 (future)
  • Owner: Booster

Summary Table

# | Category | Item | Impact | Effort | Priority | Owner
A1 | Reliability | Concurrency controls | HIGH | LOW | P1 | Booster
A2 | Reliability | Promote merge strategy fix | HIGH | MEDIUM | P1 | Procedures + Flight
A3 | Reliability | Docs deployment rollback | MEDIUM | MEDIUM | P2 | PAO + Booster
A4 | Reliability | Insider publish atomicity | MEDIUM | MEDIUM | P2 | Surgeon + Booster
A5 | Reliability | Release/publish coordination | MEDIUM | LOW | P2 | Booster
A6 | Reliability | Insider tag deduplication | LOW | LOW | P3 | Booster
B1 | Testing | Cross-platform CI | HIGH | MEDIUM | P1 | Booster + FIDO
B2 | Testing | Docker skip guards | MEDIUM | LOW | P1 | FIDO
B3 | Testing | Flaky test quarantine | HIGH | HIGH | P2 | FIDO + Booster
B4 | Testing | Acceptance test visibility | LOW | LOW | P3 | FIDO
B5 | Testing | Sample test trigger expansion | MEDIUM | LOW | P2 | Booster
B6 | Testing | Test sharding | LOW | MEDIUM | P3 | Booster
B7 | Testing | E2E CLI smoke expansion | MEDIUM | LOW | P2 | Booster + EECOM
C1 | Security | SHA-pin all actions | HIGH | LOW | P1 | Booster
C2 | Security | Dependency review on PRs | HIGH | LOW | P1 | Booster + FIDO
C3 | Security | SAST/security scanning | HIGH | MEDIUM | P1 | FIDO + Booster
C4 | Security | CODEOWNERS file | MEDIUM | LOW | P2 | Flight + Procedures
C5 | Security | Team.md parsing hardening | MEDIUM | MEDIUM | P2 | Booster + RETRO
C6 | Security | License compliance | MEDIUM | LOW | P3 | Procedures
D1 | Performance | Build artifact caching | MEDIUM | MEDIUM | P2 | Booster
D2 | Performance | Conditional Playwright install | LOW | LOW | P3 | Booster
D3 | Performance | Document action_required | LOW | LOW | P3 | Procedures
D4 | Performance | Node version matrix expansion | MEDIUM | LOW | P2 | Booster
E1 | DX | .nvmrc file | LOW | LOW | P3 | Booster
E2 | DX | Local CI simulation | MEDIUM | MEDIUM | P2 | Booster + EECOM
E3 | DX | Actionable failure messages | LOW | LOW | P3 | Booster
E4 | DX | Status check name consistency | LOW | LOW | P3 | Booster
F1 | Observability | CI metrics collection | MEDIUM | MEDIUM | P2 | Booster + Flight
F2 | Observability | CI degradation alerts | MEDIUM | LOW | P2 | Booster + Ralph
F3 | Observability | Scheduled link checking | LOW | LOW | P3 | Booster
F4 | Observability | Build time regression detection | LOW | MEDIUM | P3 | Booster

Recommended Implementation Order

Wave 1: Security + Reliability Foundation (P1, ~6-8 hours)

Ship these first — they prevent real incidents:

  1. C1 SHA-pin all actions (30 min) — supply chain protection
  2. C2 Dependency review on PRs (1 hour) — add dependabot.yml + dependency-review-action
  3. C3 npm audit + CodeQL (3-4 hours) — security scanning baseline
  4. A1 Concurrency controls (30 min) — prevent race conditions
  5. A2 Fix promote merge strategy (1-2 hours) — prevent silent conflict resolution
  6. B2 Docker skip guards (prerequisite for B1) — already tracked as fix(test): add Docker skip guards + stabilize flaky tests under load bradygaster/squad#677

Wave 2: Testing Coverage + DX (P1-P2, ~8-10 hours)

Ship after Wave 1 — expands what CI catches:

  7. B1 Cross-platform CI (2-3 hours), requires B2 first
  8. B5 Sample test trigger expansion (30 min)
  9. B7 E2E CLI smoke expansion (1-2 hours)
  10. E2 Local CI simulation (2-3 hours)
  11. C4 CODEOWNERS file (15 min)

Wave 3: Observability + Polish (P2, ~13-16 hours)

Ship when bandwidth allows:

  12. F1 CI metrics collection (3-4 hours)
  13. F2 CI degradation alerts (1 hour)
  14. B3 Flaky test quarantine (4-6 hours)
  15. D1 Build artifact caching (2 hours)
  16. A3 Docs deployment rollback (2 hours)
  17. A5 Release/publish coordination (1 hour)

Backlog (P3 plus unscheduled P2, defer until needed):

18-31. All P3 items, plus the P2 items not assigned to a wave above (A4, C5, D4). Implement when relevant or as quick wins during other work.


Agent Routing Summary

Agent | Items | Rationale
Booster | A1, A3-A6, B1, B3, B5-B7, C1-C3, C5, D1-D2, D4, E1-E4, F1-F4 | CI/DevOps engineer; owns workflow files
FIDO | B1-B4, C2-C3 | Quality/Reliability; owns test infrastructure
Procedures | A2, C4, C6, D3 | Governance + process documentation
Flight | A2, C4, F1 | Architecture decisions, CODEOWNERS routing
RETRO | C5 | Security review of parsing surfaces
PAO | A3 | Docs deployment ownership
Surgeon | A4 | Publish pipeline ownership
EECOM | B7, E2 | SDK/CLI integration testing
Ralph | F2 | Heartbeat integration for CI alerts

Notes


Filed by Flight. Dina: route from Wave 1 down. Each wave is independently shippable.


Labels

go:needs-research (Needs investigation), squad (Squad triage inbox; Lead will assign to a member), squad:archive (Resolved by upstream or no longer applicable), squad:fido (Assigned to FIDO, Quality Owner)
