[CI] Prototype SPIRV-focused CI workflow #2451

Open
lamb-j wants to merge 13 commits into amd-staging from users/lambj/spirv-ci-prototype

@lamb-j lamb-j commented May 8, 2026

Summary

Introduces SPIRV-focused PR CI for ROCm/llvm-project amd-staging. Builds LLVM/Clang/translator/Comgr in one job, runs SPIRV-relevant test suites in parallel test jobs that consume a GHA-artifact build tree. Catches breakage in the compiler / SPIRV translator that would fail downstream Comgr testing, without paying the cost of a full TheRock build.

PR rollup shows 4 separately-runnable checks:

SPIRV Compiler CI / Linux::release / Build
SPIRV Compiler CI / Linux::release / Test SPIRV translator lit
SPIRV Compiler CI / Linux::release / Test LLVM SPIRV codegen
SPIRV Compiler CI / Linux::release / Test Comgr

Structure

2-file workflow_call chain modeled on TheRock's Multi-Arch CI shape:

  • spirv-ci.yml — top-level dispatcher (~25 lines), pull_request + workflow_dispatch
  • spirv-ci-linux.yml — Linux variant: build job + 3 parallel test jobs

Adding Windows later = drop in spirv-ci-windows.yml + a Windows::release job.

runs-on: azure-linux-scale-rocm with the manylinux container TheRock uses. permissions: contents: read only at top level (no PR comment posted on llvm-project PRs — the failing check is enough signal).

Translator-lit baseline-diff

Translator-lit job runs lit against PR head, swaps llvm-project to amd-staging tip (the translator overlay at llvm/projects/SPIRV-LLVM-Translator/ is untracked from llvm-project's tree, so the swap leaves it alone), reconfigures + incrementally rebuilds + reruns lit. Diffs the two FAIL lists into 🔴 New (PR-introduced) / 🟢 Fixed / ⚠️ Pre-existing buckets. The check fails on 🔴 New so real PR-introduced regressions block; pre-existing Khronos drift doesn't.

Companion change in the translator copy of this workflow (which also posts a sticky PR comment with the partition data): ROCm/SPIRV-LLVM-Translator#194.

Mirrors the workflow already running on ROCm/SPIRV-LLVM-Translator,
with the roles swapped: this side checks out llvm-project at PR head
and pulls SPIRV-LLVM-Translator at amd-staging tip. Same build chain
(LLVM + Clang + LLD + amd-llvm-spirv + device-libs + Comgr standalone)
and same three lit/gtest suites:

  - check-amd-llvm-spirv     non-blocking + sticky PR comment
                             (upstream Khronos churn ~1 fail/wk)
  - check-llvm-codegen-spirv blocking
  - check-comgr              blocking (lit + gtest + ctest layers)

Catches breakage in compiler / SPIRV translator that would fail
downstream Comgr testing, without paying the cost of a full TheRock
build. Plan: once stable, promote to a TheRock stage-based workflow.

github-actions Bot commented May 8, 2026

⚠️ 21 pre-existing translator lit failures on baseline; not caused by this PR (see run).

🔴 New failures (0) — likely caused by this PR

(none)

🟢 Fixed by this PR (0) — failing on baseline, passing here

(none)

⚠️ Pre-existing on `amd-staging` (21)
FAIL: LLVM_SPIRV :: constant/local-float-point-constants.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_float8/conversions_matrix.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_float8/conversions_scalar_vector.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_shader_atomic_float_/atomicrmw_fsub_half.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_float4/conversions_packed.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_float4/conversions_scalar_vector.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_fp_conversions/spv_intel_fp_conversions.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_int4/conversions_packed.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_sigmoid/sigmoid_f16.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_bfloat16/cooperative_matrix_bfloat16.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_cooperative_matrix/conversion_instructions.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_subgroup_rotate/SPV_KHR_subgroup_rotate.cl
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_uniform_group_instructions/group-instructions.ll
FAIL: LLVM_SPIRV :: transcoding/float16.ll
FAIL: LLVM_SPIRV :: transcoding/image_signedness_spv_ir.ll
FAIL: LLVM_SPIRV :: transcoding/OpImageSampleExplicitLod_arg.cl
FAIL: LLVM_SPIRV :: transcoding/spec_const.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_clustered_reduce.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_non_uniform_arithmetic.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_shuffle.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_shuffle_relative.ll

The check-amd-llvm-spirv step is non-blocking because upstream Khronos
churn breaks ~1 lit test per week. Today the sticky comment just lists
N failing tests and tells the reviewer to compare against amd-staging
manually. It also misses the inverse signal: a Khronos-upstream merge
PR often *fixes* tests that are red on amd-staging.

Run check-amd-llvm-spirv twice: once with the PR-head llvm-project,
then again after swapping llvm-project to amd-staging tip. Diff the two
FAIL lists in the github-script step and partition into:

  - new       (in PR, not in baseline)  -- likely caused by this PR
  - fixed     (in baseline, not in PR)  -- resolved by this PR
  - common    (in both)                 -- pre-existing breakage
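
The three buckets above are plain set differences over the two FAIL lists. The PR computes them in a github-script step; as a rough shell equivalent (the `partition_fails` name and the output file names are hypothetical, not taken from the workflow), the same partition could be sketched with `comm`:

```shell
# Sketch only: partition_fails and the *.txt output names are hypothetical.
# Inputs are files of "FAIL: ..." lines, one per failing lit test.
partition_fails() {
  pr_list=$1      # FAILs from the PR-head run
  base_list=$2    # FAILs from the amd-staging baseline run
  out_dir=$3      # where new.txt / fixed.txt / common.txt are written

  # comm(1) requires sorted input; sort -u also drops duplicate lines.
  LC_ALL=C sort -u "$pr_list"   > "$out_dir/pr.sorted"
  LC_ALL=C sort -u "$base_list" > "$out_dir/base.sorted"

  # -23: only in PR; -13: only in baseline; -12: in both.
  LC_ALL=C comm -23 "$out_dir/pr.sorted" "$out_dir/base.sorted" > "$out_dir/new.txt"
  LC_ALL=C comm -13 "$out_dir/pr.sorted" "$out_dir/base.sorted" > "$out_dir/fixed.txt"
  LC_ALL=C comm -12 "$out_dir/pr.sorted" "$out_dir/base.sorted" > "$out_dir/common.txt"
}
```

Forcing `LC_ALL=C` on both `sort` and `comm` keeps the two tools agreeing on ordering regardless of the runner's locale.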

Headline picks the dominant bucket. Sticky comment marker unchanged
(<!-- spirv-ci:translator-lit -->), so existing comments update in place.

The translator subdir at llvm/projects/SPIRV-LLVM-Translator/ is
untracked from llvm-project's tree, so the baseline `git checkout` of
amd-staging leaves the overlay alone. Build dir is reused; CMake
reconfigure + ninja incremental keep the second pass bounded by the PR
diff size — cheap on small PRs, larger on upstream-merge PRs.

Falls back to the legacy single-list shape if the baseline run didn't
produce a result.

Companion change in the ROCm/SPIRV-LLVM-Translator copy of this
workflow: ROCm/SPIRV-LLVM-Translator#188.

Fixup to the previous commit. The baseline lit step combined ninja and
the grep capture into one shell script. Under bash -e + set -o pipefail,
the non-zero ninja exit (lit failures) aborted the script before the
grep ran, so build/spirv-fails-baseline.txt was never written and the
comment script fell through to "baseline comparison unavailable".

Split the capture into its own step, mirroring the working PR-head
pattern. Same fix applied in companion translator PR
ROCm/SPIRV-LLVM-Translator#189 (observed on
ROCm/SPIRV-LLVM-Translator#187 after #188 landed).
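
The PR fixes this by splitting the capture into its own step. For readers who want to see the failure mode in isolation, here is a minimal sketch of the same idea inside a single script: record the test runner's exit status instead of letting `bash -e` abort on it, then run the grep unconditionally. The function name `run_lit_and_capture` is hypothetical.

```shell
# Sketch only: run_lit_and_capture is a hypothetical helper name.
# Usage: run_lit_and_capture LOG FAILS_FILE COMMAND [ARGS...]
run_lit_and_capture() {
  log=$1
  fails=$2
  shift 2
  lit_rc=0
  # `|| lit_rc=$?` records a non-zero exit without tripping `set -e`.
  "$@" > "$log" 2>&1 || lit_rc=$?
  # grep exits 1 when nothing matches; `|| true` keeps an empty FAIL list
  # from aborting the script. The FAIL file is written either way.
  grep '^FAIL: ' "$log" > "$fails" || true
  return "$lit_rc"
}
```

A caller that wants to keep going after lit failures would invoke it as `run_lit_and_capture lit.log fails.txt ninja -C build check-amd-llvm-spirv || lit_rc=$?`.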

When a workflow has multiple on: triggers (pull_request + workflow_dispatch),
GitHub disambiguates the emitted check context with a trailing event suffix
— actual context becomes "SPIRV CI - amd-staging / Build & Test
(pull_request)" instead of the bare "SPIRV CI - amd-staging / Build & Test"
the required-check rule expects. Required check stays "Pending — Required"
forever and blocks non-admin merges on amd-staging.

Drop workflow_dispatch — never used in practice, pull_request's
synchronize/reopened types already cover the retriggers we'd want.

Same fix in companion translator PR ROCm/SPIRV-LLVM-Translator#189.

Generic name was at risk of colliding with other workflows. Adds
"Linux" platform qualifier so future "Windows Build & Test" etc. can
slot in without further renaming. Doesn't list components (LLVM,
Comgr, translator) since the workflow will expand.

**ACTION REQUIRED on merge:**
The amd-staging ruleset's required-check context is currently
"SPIRV CI - amd-staging / Build & Test" — never actually matched (the
matcher uses bare check_run.name, see ROCm/SPIRV-LLVM-Translator#191).
Update to "Linux Build & Test" so a single ruleset edit covers both the
rename and the long-standing matcher bug.

Companion change in the translator copy: ROCm/SPIRV-LLVM-Translator#191.

lamb-j added 8 commits May 9, 2026 09:34
Symmetric to the translator-side change in
ROCm/SPIRV-LLVM-Translator#194. Splits the single Linux Build & Test
job into 4:

  Linux Build (required)
   ├─► Linux Test - SPIRV translator lit  (informational, baseline-diff)
   ├─► Linux Test - LLVM SPIRV codegen    (informational)
   └─► Linux Test - Comgr                  (informational)

Build uploads `build/`, `build-comgr/`, `build-device-libs/` as a single
GHA artifact (`linux-build-tree`) after `strip --strip-unneeded`. Test
jobs do a fresh source checkout + download the artifact.

Difference from the translator copy:
  - PR head IS llvm-project (default checkout, no `repository:`); the
    translator is overlaid at amd-staging tip under llvm/projects/.
  - Translator-lit baseline-diff swaps llvm-project (not the translator)
    via `git fetch origin amd-staging && git checkout FETCH_HEAD` from
    cwd root. The translator overlay at llvm/projects/SPIRV-LLVM-Translator
    is untracked from llvm-project's tree, so the swap doesn't touch it.
  - cmake source dir is `llvm` (not `llvm-project/llvm`).
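
The baseline swap described in the list above relies on `git checkout` leaving untracked paths alone. A minimal sketch of that step follows, assuming the remote is named `origin` as in the commit text; the `swap_to_baseline` wrapper itself is a hypothetical name, not something from the workflow.

```shell
# Sketch only: swap_to_baseline is a hypothetical wrapper around the two
# git commands quoted in the commit message.
swap_to_baseline() {
  repo=$1      # llvm-project checkout (cwd root in the workflow)
  remote=$2    # e.g. origin
  branch=$3    # e.g. amd-staging
  # A depth-1 fetch of the branch tip is enough; no full history needed.
  git -C "$repo" fetch --quiet --depth=1 "$remote" "$branch" &&
    # Detaches HEAD at the fetched tip. Untracked directories, such as an
    # overlay under llvm/projects/, are not touched by the checkout.
    git -C "$repo" checkout --quiet FETCH_HEAD
}
```

The checkout would refuse to proceed only if an untracked file collided with a path tracked on the baseline, which is not the case for an overlay directory that the tree does not track.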

ACTION REQUIRED on merge: same as #194 — update the amd-staging-psdb
ruleset's required-check context from "Linux Build & Test" → "Linux Build".
Mirrors TheRock's Multi-Arch CI shape: top-level dispatcher with
platform-variant jobs that call reusable per-platform workflows. Adding
Windows later = drop in spirv-ci-windows.yml + a windows_release job.

Files:
  - spirv-ci.yml — top-level dispatcher (~25 lines), byte-identical
    to the SPIRV-LLVM-Translator copy. Triggers on pull_request +
    workflow_dispatch. Sole job (linux_release, name Linux::release)
    calls spirv-ci-linux.yml.
  - spirv-ci-linux.yml — workflow_call. Holds the build job + 3 test
    jobs (factored from the prior single spirv-ci.yml). Differs from
    the translator-side copy only in checkout blocks (which repo is
    PR head vs which is pinned to amd-staging tip; same divergence as
    the prior single-file structure).

Rendered check_run names in the PR rollup (workflow_call composes the
slash hierarchy):
  - SPIRV Compiler CI / Linux::release / Build
  - SPIRV Compiler CI / Linux::release / Test SPIRV translator lit
  - SPIRV Compiler CI / Linux::release / Test LLVM SPIRV codegen
  - SPIRV Compiler CI / Linux::release / Test Comgr

Convention alignment: Title Case workflow name with no branch suffix,
snake_case job IDs + display name override, no literal slashes in job
names, concurrency only on dispatcher, secrets: inherit at the
dispatcher, permissions read-only at workflow level with
pull-requests: write escalated only on the translator-lit job, both
pull_request and workflow_dispatch triggers (workflow_call sidesteps
the (pull_request) check-name suffix bug we hit before), pinned
container image.

Companion change: ROCm/SPIRV-LLVM-Translator#194.

ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build" → "Linux::release / Build".
…ure)

Same fix as ROCm/SPIRV-LLVM-Translator#194 follow-up. The previous
commit moved pull-requests: write to job-level inside spirv-ci-linux.yml,
but per GHA rules a called workflow can't exceed the caller's
permission cap. The dispatcher's cap of `contents: read` alone caused a
startup_failure (no jobs run, empty rollup).

Lift pull-requests: write to the dispatcher's workflow-level
permissions. The called workflow's job-level grant on
test_translator_lit still scopes the actual usage to that single job.
Same fix as ROCm/SPIRV-LLVM-Translator#194 follow-up.

actions/upload-artifact@v4 strips executable bits and excludes hidden
files by default. The first symptom: clang-23 came back non-executable
in the Comgr test job ("Permission denied"). The second: cmake
reconfigure on the test side failed inside FetchContent's SPIRV-Headers
git update because the .git dir got dropped on upload.

Tar the build trees before upload and untar after download — preserves
both modes and hidden files in one shot.
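
As an illustration of the tar round trip, the sketch below packs named directories into one archive and unpacks them elsewhere; file modes and dot-directories travel inside the archive, so the artifact upload only ever sees a single regular file. The function names and argument layout are hypothetical, not the workflow's actual steps.

```shell
# Sketch only: pack_build_trees / unpack_build_trees are hypothetical names.
# Usage: pack_build_trees ARCHIVE ROOT DIR [DIR...]
pack_build_trees() {
  archive=$1
  root=$2
  shift 2
  # -C makes the stored paths relative to ROOT (e.g. build/, build-comgr/).
  tar -cf "$archive" -C "$root" "$@"
}

# Usage: unpack_build_trees ARCHIVE DEST
unpack_build_trees() {
  archive=$1
  dest=$2
  mkdir -p "$dest"
  # Executable bits and hidden entries such as .git come back as archived.
  tar -xf "$archive" -C "$dest"
}
```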
Same two wall-time bugs as ROCm/SPIRV-LLVM-Translator#194 follow-up:

1. `tar -xf` restored mtimes from when the build job produced the files,
   so the sources (freshly checked out in the test job) looked newer
   than the build outputs and ninja cascade-rebuilt. Switch to
   `tar -xmf`, which stamps extracted files with the extraction time,
   so the build outputs are newer than the sources.

2. Translator-lit job's llvm-project checkout used `fetch-depth: 0`
   only to enable the baseline-swap `git checkout amd-staging`. The
   swap step explicitly does `git fetch --depth=1` + `git checkout
   FETCH_HEAD` which works on a shallow clone. Switch to
   `fetch-depth: 1`.
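
To make the mtime point in item 1 concrete, here is a small sketch of an extraction helper using `-m`. The `unpack_with_fresh_mtimes` name is hypothetical; the `tar -xmf` invocation is the one named in the commit.

```shell
# Sketch only: unpack_with_fresh_mtimes is a hypothetical name.
# Usage: unpack_with_fresh_mtimes ARCHIVE DEST
unpack_with_fresh_mtimes() {
  archive=$1
  dest=$2
  mkdir -p "$dest"
  # -m: do not restore archived mtimes. Extracted files are stamped with
  # the extraction time, so they end up newer than an earlier checkout.
  tar -xmf "$archive" -C "$dest"
}
```

With plain `tar -xf`, a build output archived before the test job's checkout would keep its older timestamp and look stale to ninja.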

Drive-by: trim verbose comments on the tar/untar steps.
Same gate as ROCm/SPIRV-LLVM-Translator#194 follow-up. The lit steps
stay continue-on-error so pre-existing amd-staging breakage doesn't
block, but a final step exits non-zero when newFails > 0 (PR head
FAILs not present in baseline). Real PR-introduced regressions turn
the check red instead of just landing in the partition comment.
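
A gate of this shape can be sketched in a few lines of shell. This is an illustration under assumptions, not the workflow's actual step: `gate_on_new_fails` is a hypothetical name, and both list files are assumed to exist with one `FAIL: ...` line per failing test.

```shell
# Sketch only: gate_on_new_fails is a hypothetical name.
# Usage: gate_on_new_fails PR_FAILS BASELINE_FAILS
gate_on_new_fails() {
  pr_list=$1      # e.g. build/spirv-fails-pr.txt
  base_list=$2    # e.g. build/spirv-fails-baseline.txt
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file:
  # prints PR-head FAIL lines that do not appear verbatim in the baseline.
  # grep exits 1 when it prints nothing, hence `|| true`.
  new_fails=$(grep -Fxv -f "$base_list" "$pr_list" || true)
  if [ -n "$new_fails" ]; then
    printf 'New translator lit failures (not on baseline):\n%s\n' "$new_fails" >&2
    return 1
  fi
  return 0
}
```

Ending the step with `gate_on_new_fails ... || exit 1` turns the check red only for failures absent from the baseline; pre-existing ones pass through.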
On llvm-project PRs the failing-check signal from the translator-lit
gate step is enough — a sticky partition comment would be noise.
Removed:

  - The actions/github-script step that posted the partition comment
    (~65 lines of inline JS)
  - pull-requests: write permissions at both the workflow_call job
    level (was scoping the comment) AND the dispatcher workflow level
    (was the caller-cap that allowed it)

Net: llvm-project's SPIRV CI now runs with permissions: contents: read
only — matches TheRock's tightest-perms convention.

The partition logic still computes new/fixed/pre-existing internally
(the gate step uses spirv-fails-pr.txt + spirv-fails-baseline.txt to
decide whether to fail). Just no comment artifact.

Translator-side workflow keeps the comment — translator-author PRs
benefit from the inline context.
Same fix as ROCm/SPIRV-LLVM-Translator#197. tar -m sets per-file
mtimes from sequential extraction order; build.ninja ends up older
than CMakeCache.txt, triggering ninja's regen rule and cascade
rebuild. Touch build.ninja explicitly to make it the newest file in
the tree.
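
The touch itself is a one-liner; the sketch below only wraps it so the ordering requirement is explicit. The `freshen_ninja_manifest` name is hypothetical, and the build directory path is whatever the test job extracted into.

```shell
# Sketch only: freshen_ninja_manifest is a hypothetical name.
# Usage: freshen_ninja_manifest BUILD_DIR
freshen_ninja_manifest() {
  build_dir=$1
  # Run after extraction: stamps build.ninja with the current time so it is
  # not older than CMakeCache.txt or any other file unpacked before it.
  touch "$build_dir/build.ninja"
}
```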