[CI] Split spirv-ci into Build + parallel Test jobs #194
Merged
Today the SPIRV CI is one mega-job (Linux Build & Test) that builds
the LLVM/translator/Comgr stack and runs all lit suites sequentially.
As we add more test categories (rocm-examples, hip-tests per the SPIRV
Automated Testing Status confluence page), one mega-check is too coarse:
a failing suite hides the others, suites can't be re-run selectively, and
the required check is all-or-nothing.
Split into 4 jobs:
Linux Build (required)
├─► Linux Test - SPIRV translator lit (informational, baseline-diff)
├─► Linux Test - LLVM SPIRV codegen (informational)
└─► Linux Test - Comgr (informational)
Build uploads `build/`, `build-comgr/`, `build-device-libs/` (after
strip --strip-unneeded) as a single GHA artifact. Test jobs do a fresh
checkout of source trees + download the artifact. Source isn't shipped
in the artifact (cleaner: artifact = build outputs, checkout = source).
Translator-lit baseline-diff (PR head vs amd-staging) preserved as-is —
runs in the translator-lit test job: download artifact, run lit at PR
head, swap translator src to amd-staging tip, cmake-reconfigure +
ninja-incremental + lit-rerun, post sticky comment with new/fixed/
pre-existing partition. Translator checkout in this job uses
fetch-depth: 0 so the baseline `git checkout amd-staging` works.
Tests stay informational (per design discussion): promote to required
individually as each suite stabilizes. Only Linux Build is required.
Pragmatic deviations from TheRock conventions:
- GHA artifacts (not S3) for transport — no OIDC/IAM access here yet.
Will swap to S3 patterns (post_build_upload.py / fetch_artifacts.py)
when this workflow moves into TheRock. Job graph + naming match
TheRock so the swap is mechanical.
- "Linux" prefix on job names — TheRock uses workflow-file-as-platform
convention, but we keep the prefix consistent with the prior rename
and our future Windows expansion plans.
ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build & Test" → "Linux Build".
Without it the old rule will dangle (no job named "Linux Build & Test"
exists anymore) and PRs would block on a permanently-pending placeholder.
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 9, 2026
Symmetric to the translator-side change in ROCm/SPIRV-LLVM-Translator#194.
Splits the single Linux Build & Test job into 4:
Linux Build (required)
├─► Linux Test - SPIRV translator lit (informational, baseline-diff)
├─► Linux Test - LLVM SPIRV codegen (informational)
└─► Linux Test - Comgr (informational)
Build uploads `build/`, `build-comgr/`, `build-device-libs/` as a single GHA artifact (`linux-build-tree`) after `strip --strip-unneeded`. Test jobs do a fresh source checkout + download the artifact.
Difference from the translator copy:
- PR head IS llvm-project (default checkout, no `repository:`); the translator is overlaid at amd-staging tip under llvm/projects/.
- Translator-lit baseline-diff swaps llvm-project (not the translator) via `git fetch origin amd-staging && git checkout FETCH_HEAD` from cwd root. The translator overlay at llvm/projects/SPIRV-LLVM-Translator is untracked from llvm-project's tree, so the swap doesn't touch it.
- cmake source dir is `llvm` (not `llvm-project/llvm`).
ACTION REQUIRED on merge: same as #194, update the amd-staging-psdb ruleset's required-check context from "Linux Build & Test" → "Linux Build".
🔴 New failures (0), likely caused by this PR
(none)
🟢 Fixed by this PR (0), failing on baseline, passing here
(none)
Mirrors TheRock's Multi-Arch CI shape: a top-level dispatcher with
platform-variant jobs that call reusable per-platform workflows. The
Linux variant is byte-isolated so adding Windows later is just a new
file + dispatcher entry.
Files:
- spirv-ci.yml — top-level dispatcher (~25 lines), byte-identical
across both this repo and ROCm/llvm-project. Triggers on
pull_request + workflow_dispatch. Sole job (linux_release, name
Linux::release) calls spirv-ci-linux.yml.
- spirv-ci-linux.yml — workflow_call. Holds the build job + 3 test
jobs (factored from the prior single spirv-ci.yml).
Rendered check_run names in the PR rollup (workflow_call composes the
slash hierarchy):
- SPIRV Compiler CI / Linux::release / Build
- SPIRV Compiler CI / Linux::release / Test SPIRV translator lit
- SPIRV Compiler CI / Linux::release / Test LLVM SPIRV codegen
- SPIRV Compiler CI / Linux::release / Test Comgr
Convention alignment with TheRock (verified against
.github/workflows/multi_arch_ci.yml, multi_arch_ci_linux.yml and the
rest of the workflow set):
- Top-level workflow name Title Case, no branch suffix
- Reusable workflow has its own descriptive name
- snake_case job IDs, separate display name field
- No literal slashes in job names — slashes come from workflow_call
- concurrency: only on dispatcher, not on workflow_call
- secrets: inherit at the dispatcher → variant call
- permissions read-only at workflow level; pull-requests: write
escalated only on the translator-lit job that posts the comment
- workflow_dispatch alongside pull_request — Multi-Arch's check
names don't get the "(pull_request)" suffix even with both
triggers, so the disambiguation bug we hit before is sidestepped
by the workflow_call structure
- Container image pinned with @sha256:
ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build" → "Linux::release / Build".
Without it the dangling rule blocks PRs on a permanently-pending
placeholder.
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 9, 2026
Mirrors TheRock's Multi-Arch CI shape: top-level dispatcher with
platform-variant jobs that call reusable per-platform workflows. Adding
Windows later = drop in spirv-ci-windows.yml + a windows_release job.
Files:
- spirv-ci.yml — top-level dispatcher (~25 lines), byte-identical
to the SPIRV-LLVM-Translator copy. Triggers on pull_request +
workflow_dispatch. Sole job (linux_release, name Linux::release)
calls spirv-ci-linux.yml.
- spirv-ci-linux.yml — workflow_call. Holds the build job + 3 test
jobs (factored from the prior single spirv-ci.yml). Differs from
the translator-side copy only in checkout blocks (which repo is
PR head vs which is pinned to amd-staging tip; same divergence as
the prior single-file structure).
Rendered check_run names in the PR rollup (workflow_call composes the
slash hierarchy):
- SPIRV Compiler CI / Linux::release / Build
- SPIRV Compiler CI / Linux::release / Test SPIRV translator lit
- SPIRV Compiler CI / Linux::release / Test LLVM SPIRV codegen
- SPIRV Compiler CI / Linux::release / Test Comgr
Convention alignment: Title Case workflow name with no branch suffix,
snake_case job IDs + display name override, no literal slashes in job
names, concurrency only on dispatcher, secrets: inherit at the
dispatcher, permissions read-only at workflow level with
pull-requests: write escalated only on the translator-lit job, both
pull_request and workflow_dispatch triggers (workflow_call sidesteps
the (pull_request) check-name suffix bug we hit before), pinned
container image.
Companion change: ROCm/SPIRV-LLVM-Translator#194.
ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build" → "Linux::release / Build".
…ure)
The previous commit moved pull-requests: write to a job-level permission on the test_translator_lit job inside spirv-ci-linux.yml. Per GHA rules, a called workflow's GITHUB_TOKEN permissions are capped by the caller's permissions; if the caller's workflow-level grant is `contents: read` only, the called workflow can't add pull-requests: write, so validation fails at startup and no jobs run.
Symptom: PR #194 ran with conclusion=startup_failure, no check_runs created, empty PR rollup.
Fix: grant pull-requests: write at the dispatcher's workflow-level permissions. The called workflow's job-level grant on test_translator_lit still narrows the actual usage to that single job; this just lifts the caller-side cap.
Multi-Arch CI gets away with `contents: read` only because none of its jobs post comments. Ours does, so this grant is required.
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 9, 2026
…ure)
Same fix as ROCm/SPIRV-LLVM-Translator#194 follow-up. The previous commit moved pull-requests: write to job-level inside spirv-ci-linux.yml, but per GHA rules a called workflow can't exceed the caller's permission cap. The dispatcher's `contents: read` only cap caused startup_failure (no jobs run, empty rollup).
Lift pull-requests: write to the dispatcher's workflow-level permissions. The called workflow's job-level grant on test_translator_lit still scopes the actual usage to that single job.
The previous commit's first run hit two distinct failures rooted in
actions/upload-artifact@v4 defaults:
1. Comgr test job: "/bin/sh: clang-23: Permission denied" — v4
strips executable bits on upload, so binaries come back
non-executable.
2. Codegen test job: cmake reconfigure (triggered when ninja sees
the freshly-checked-out source as newer than the build tree)
failed inside FetchContent's SPIRV-Headers update step with
"fatal: not a git repository: '.git'" — v4 also excludes
hidden files (the .git dir under build/_deps/) by default.
Translator-lit job appeared green but actually hit the same first-lit
breakage; continue-on-error: true masked it.
Fix: tar the build trees ourselves before upload, untar after
download. Tar preserves both file modes and hidden files in one shot,
which is simpler than chmodding +x and toggling include-hidden-files
separately.
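The tar round-trip described above can be checked locally with a small sketch. Everything here is illustrative: the directory layout, the `fake-clang` stand-in, and the archive name are made up and are not taken from the workflow. It only shows that a plain `tar -cf` / `tar -xf` cycle keeps both the executable bit and a hidden `.git` directory.

```shell
#!/bin/sh
# Sketch: round-trip a build tree through tar and check that the executable
# bit and a hidden directory both survive. All paths and names are hypothetical.
set -eu

work=$(mktemp -d)
mkdir -p "$work/src/build/bin" "$work/src/build/_deps/headers-src/.git" "$work/dst"

# Stand-in for a compiler binary, marked executable.
printf '#!/bin/sh\necho ok\n' > "$work/src/build/bin/fake-clang"
chmod +x "$work/src/build/bin/fake-clang"
# Stand-in for the hidden .git metadata that a later git update step would need.
echo 'ref: refs/heads/main' > "$work/src/build/_deps/headers-src/.git/HEAD"

# "Upload" side: pack the tree into a single file.
tar -C "$work/src" -cf "$work/build-tree.tar" build
# "Download" side: unpack into a fresh directory.
tar -C "$work/dst" -xf "$work/build-tree.tar"

exec_ok=no
hidden_ok=no
if [ -x "$work/dst/build/bin/fake-clang" ]; then exec_ok=yes; fi
if [ -f "$work/dst/build/_deps/headers-src/.git/HEAD" ]; then hidden_ok=yes; fi
echo "executable bit preserved: $exec_ok"
echo "hidden .git preserved: $hidden_ok"

rm -rf "$work"
```

Because the artifact action then only ever sees one regular file, its handling of file modes and hidden paths no longer matters for the build tree.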
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 10, 2026
Same fix as ROCm/SPIRV-LLVM-Translator#194 follow-up. actions/upload-artifact@v4 strips executable bits and excludes hidden files by default. The first symptom: clang-23 came back non-executable in the Comgr test job ("Permission denied"). The second: cmake reconfigure on the test side failed inside FetchContent's SPIRV-Headers git update because the .git dir got dropped on upload. Tar the build trees before upload and untar after download — preserves both modes and hidden files in one shot.
Two wall-time bugs found while diagnosing PR #194's slow codegen test:
1. Untar with `tar -xf` restored mtimes from when the build job produced them. Test-job source checkouts run just before, so freshly-checked-out source files appeared newer than (older) build outputs from the tar, and ninja cascade-rebuilt instead of running the requested test target. Switch to `tar -xmf` (skip mtime restore) so build outputs are newer. Observed ~5-10 min wasted on codegen test job.
2. Translator-lit job's translator checkout used `fetch-depth: 0` (full history) only to enable the later `git checkout amd-staging` for the baseline swap. But `git fetch --depth=1 origin amd-staging` followed by `git checkout FETCH_HEAD` works on a shallow clone too. Switch to `fetch-depth: 1`. ~30-60s save per run.
Drive-by: trim verbose comments on the tar/untar steps.
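The mtime effect in item 1 can be reproduced in isolation. This is a minimal sketch assuming GNU `touch` (for `-d`) and a tar that supports `-m`; the file names and timestamps are invented for the demo. It extracts the same archive twice and compares each extracted file against a "source" file stamped one minute in the past.

```shell
#!/bin/sh
# Sketch: compare `tar -xf` (restores archived mtimes) with `tar -xmf`
# (stamps files with the extraction time). Names and dates are hypothetical.
set -eu

work=$(mktemp -d)
mkdir -p "$work/build-job/build" "$work/plain" "$work/touched"

# Build output produced "a while ago" in the build job.
echo obj > "$work/build-job/build/out.o"
touch -d '2020-01-01 00:00:00' "$work/build-job/build/out.o"
tar -C "$work/build-job" -cf "$work/tree.tar" build

# Source file checked out in the test job shortly before the untar.
echo src > "$work/input.c"
touch -d '1 minute ago' "$work/input.c"

tar -C "$work/plain" -xf "$work/tree.tar"     # keeps the 2020 mtime
tar -C "$work/touched" -xmf "$work/tree.tar"  # -m: mtime becomes "now"

# find -newer prints the path only if it is newer than the reference file.
plain_newer=$(find "$work/plain/build/out.o" -newer "$work/input.c" | wc -l | tr -d ' ')
touched_newer=$(find "$work/touched/build/out.o" -newer "$work/input.c" | wc -l | tr -d ' ')
echo "-xf  output newer than source: $plain_newer"
echo "-xmf output newer than source: $touched_newer"

rm -rf "$work"
```

A build tool that compares timestamps would treat the `-xf` output as stale relative to the source and the `-xmf` output as up to date, which is the behavior the fix relies on.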
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 10, 2026
Same two wall-time bugs as ROCm/SPIRV-LLVM-Translator#194 follow-up:
1. `tar -xf` restored mtimes from when the build job produced files, making source (freshly checked out in the test job) appear newer than build outputs and triggering ninja cascade-rebuild. Switch to `tar -xmf` so build outputs are newer.
2. Translator-lit job's llvm-project checkout used `fetch-depth: 0` only to enable the baseline-swap `git checkout amd-staging`. The swap step explicitly does `git fetch --depth=1` + `git checkout FETCH_HEAD` which works on a shallow clone. Switch to `fetch-depth: 1`.
Drive-by: trim verbose comments on the tar/untar steps.
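The shallow baseline swap in item 2 can be sketched against a throwaway local repository. The repository layout, the `pr-head` branch name, and the `version.txt` file are assumptions made for the demo; only the `git fetch --depth=1 origin amd-staging` + `git checkout FETCH_HEAD` pair comes from the commit message.

```shell
#!/bin/sh
# Sketch: baseline swap on a depth-1 clone. The upstream repo, branch
# "pr-head", and version.txt are hypothetical stand-ins.
set -eu

work=$(mktemp -d)
# Wrapper so commits work without any global git identity configured.
g() { git -c user.name=demo -c user.email=demo@example.invalid -c advice.detachedHead=false "$@"; }

# Upstream: amd-staging has two commits, pr-head adds one on top.
g init -q "$work/upstream"
cd "$work/upstream"
echo old > version.txt
g add version.txt
g commit -q -m "old"
echo baseline > version.txt
g commit -q -am "baseline"
g branch amd-staging
g checkout -q -b pr-head
echo pr > version.txt
g commit -q -am "pr change"

# Shallow clone of the PR branch only (the analogue of fetch-depth: 1).
g clone -q --depth=1 --branch pr-head "file://$work/upstream" "$work/clone"
cd "$work/clone"
head_version=$(cat version.txt)

# Baseline swap: fetch only the tip of amd-staging, check it out detached.
g fetch -q --depth=1 origin amd-staging
g checkout -q FETCH_HEAD
baseline_version=$(cat version.txt)
commits_visible=$(g rev-list --count HEAD)

echo "PR head: $head_version, baseline: $baseline_version, history depth: $commits_visible"

cd /
rm -rf "$work"
```

The checkout lands on the baseline content even though only one commit of `amd-staging` history is present locally, so full history is not needed for the swap.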
The PR-head and baseline lit steps stay continue-on-error so pre-existing amd-staging breakage doesn't block. The partition comment already surfaces 🔴 New / 🟢 Fixed / ⚠️ Pre-existing buckets.
Add a final gate step that exits non-zero when newFails > 0 (PR head FAILs not present in baseline). Real PR-introduced regressions now turn the check red instead of just dropping into a comment.
Degrades gracefully: if either capture file is missing (e.g., one of the lit runs failed to produce output) the gate emits a warning and passes rather than blocking on incomplete data.
Now that the partition logic has been validated end-to-end on PRs #190/#192 (🔴 New) and #193 (🟢 Fixed), this is a safe step up from informational to blocking.
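One way such a gate could look is sketched below. This is not the workflow's actual step: the capture-file format (one failing test name per line), the file names, and the `gate` function are assumptions for illustration. Only the `newFails > 0` condition and the warn-and-pass behavior on missing files come from the description above.

```shell
#!/bin/sh
# Sketch: partition failures into new / fixed / pre-existing and gate on new.
# Capture files and the gate() helper are hypothetical.
set -u

work=$(mktemp -d)

# Hypothetical capture files: one failing test name per line.
printf '%s\n' 'test/a.ll' 'test/b.ll' 'test/c.ll' > "$work/head_fails.txt"
printf '%s\n' 'test/b.ll' 'test/d.ll' > "$work/base_fails.txt"

# gate HEAD_FILE BASE_FILE
# Returns 1 when the PR head has failures absent from the baseline, else 0.
gate() {
    if [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "warning: capture file missing, gate passes on incomplete data" >&2
        return 0
    fi
    # comm needs sorted input; -u also drops duplicate test names.
    sort -u "$1" > "$work/head.sorted"
    sort -u "$2" > "$work/base.sorted"
    newFails=$(comm -23 "$work/head.sorted" "$work/base.sorted" | wc -l | tr -d ' ')
    fixed=$(comm -13 "$work/head.sorted" "$work/base.sorted" | wc -l | tr -d ' ')
    preexisting=$(comm -12 "$work/head.sorted" "$work/base.sorted" | wc -l | tr -d ' ')
    echo "new=$newFails fixed=$fixed pre-existing=$preexisting"
    [ "$newFails" -eq 0 ]
}

regress_status=0
gate "$work/head_fails.txt" "$work/base_fails.txt" || regress_status=$?
regress_summary="new=$newFails fixed=$fixed pre-existing=$preexisting"

clean_status=0
gate "$work/base_fails.txt" "$work/base_fails.txt" || clean_status=$?

missing_status=0
gate "$work/head_fails.txt" "$work/does-not-exist.txt" || missing_status=$?

echo "regression run: $regress_status, clean run: $clean_status, missing file: $missing_status"
rm -rf "$work"
```

In a CI step the function's return value would become the step's exit code, so only the first case (failures present on the PR head but not on the baseline) would turn the check red.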
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 10, 2026
Same gate as ROCm/SPIRV-LLVM-Translator#194 follow-up. The lit steps stay continue-on-error so pre-existing amd-staging breakage doesn't block, but a final step exits non-zero when newFails > 0 (PR head FAILs not present in baseline). Real PR-introduced regressions turn the check red instead of just landing in the partition comment.
Summary
Restructures SPIRV CI from one mega-job (`Linux Build & Test`) into a 2-file `workflow_call` chain modeled on TheRock's Multi-Arch CI shape. PR rollup now shows 4 separately-runnable checks.
Build uploads the build trees as a tarred GHA artifact; the 3 test jobs run in parallel after `needs: build`. Adding Windows later is just `spirv-ci-windows.yml` + a `Windows::release` job in the dispatcher.
Translator-lit baseline-diff
Preserved end-to-end: PR-head lit → swap translator to `amd-staging` tip → reconfigure + incremental rebuild + lit again → diff. Sticky comment partitions failures into 🔴 New (PR-introduced) / 🟢 Fixed / ⚠️ Pre-existing. The check fails on 🔴 New so real regressions block, while pre-existing Khronos drift doesn't.
Companion change: ROCm/llvm-project#2451.