[CI] Split spirv-ci into Build + parallel Test jobs#194

Merged
lamb-j merged 6 commits into amd-staging from users/lambj/spirv-ci-multi-job
May 10, 2026

Conversation

@lamb-j
Collaborator

@lamb-j lamb-j commented May 9, 2026

Summary

Restructures SPIRV CI from one mega-job (Linux Build & Test) into a 2-file workflow_call chain modeled on TheRock's Multi-Arch CI shape. PR rollup now shows 4 separately-runnable checks:

SPIRV Compiler CI / Linux::release / Build
SPIRV Compiler CI / Linux::release / Test SPIRV translator lit
SPIRV Compiler CI / Linux::release / Test LLVM SPIRV codegen
SPIRV Compiler CI / Linux::release / Test Comgr

Build uploads the build trees as a tarred GHA artifact; the 3 test jobs run in parallel after needs: build. Adding Windows later is just spirv-ci-windows.yml + a Windows::release job in the dispatcher.

Translator-lit baseline-diff

Preserved end-to-end: PR-head lit → swap translator to amd-staging tip → reconfigure + incremental rebuild + lit again → diff. Sticky comment partitions failures into 🔴 New (PR-introduced) / 🟢 Fixed / ⚠️ Pre-existing. The check fails on 🔴 New so real regressions block, while pre-existing Khronos drift doesn't.

Companion change: ROCm/llvm-project#2451.

Today the SPIRV CI is one mega-job (Linux Build & Test) that builds
the LLVM/translator/Comgr stack and runs all lit suites sequentially.
As we add more test categories (rocm-examples, hip-tests per the SPIRV
Automated Testing Status confluence page), one mega-check is too coarse:
a failing suite hides the others, suites can't be re-run selectively,
and the required check is all-or-nothing.

Split into 4 jobs:

  Linux Build (required)
   ├─► Linux Test - SPIRV translator lit  (informational, baseline-diff)
   ├─► Linux Test - LLVM SPIRV codegen    (informational)
   └─► Linux Test - Comgr                  (informational)

Build uploads `build/`, `build-comgr/`, `build-device-libs/` (after
strip --strip-unneeded) as a single GHA artifact. Test jobs do a fresh
checkout of source trees + download the artifact. Source isn't shipped
in the artifact (cleaner: artifact = build outputs, checkout = source).

Translator-lit baseline-diff (PR head vs amd-staging) preserved as-is —
runs in the translator-lit test job: download artifact, run lit at PR
head, swap translator src to amd-staging tip, cmake-reconfigure +
ninja-incremental + lit-rerun, post sticky comment with new/fixed/
pre-existing partition. Translator checkout in this job uses
fetch-depth: 0 so the baseline `git checkout amd-staging` works.
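
The new/fixed/pre-existing partition described above amounts to a three-way set comparison between two lists of FAIL lines. A minimal shell sketch, assuming each lit run's failures were captured one per line; the helper name `partition_fails` and the output file names are hypothetical and not taken from the workflow:

```shell
set -eu

# Hypothetical helper: partition lit failures into new / fixed / pre-existing.
#   $1 = file of FAIL lines captured at PR head
#   $2 = file of FAIL lines captured on the amd-staging baseline
#   $3 = output directory (receives new.txt, fixed.txt, preexisting.txt)
partition_fails() {
  mkdir -p "$3"
  # comm requires both inputs sorted identically; LC_ALL=C keeps that stable.
  LC_ALL=C sort -u "$1" > "$3/head.sorted"
  LC_ALL=C sort -u "$2" > "$3/base.sorted"
  # Lines only at PR head: introduced by the PR.
  LC_ALL=C comm -23 "$3/head.sorted" "$3/base.sorted" > "$3/new.txt"
  # Lines only on the baseline: fixed by the PR.
  LC_ALL=C comm -13 "$3/head.sorted" "$3/base.sorted" > "$3/fixed.txt"
  # Lines on both sides: pre-existing on the baseline.
  LC_ALL=C comm -12 "$3/head.sorted" "$3/base.sorted" > "$3/preexisting.txt"
}
```

Calling `partition_fails head-fails.txt base-fails.txt out` would leave the three buckets in `out/`, ready to be rendered into the sticky comment.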

Tests stay informational (per design discussion): promote to required
individually as each suite stabilizes. Only Linux Build is required.

Pragmatic deviations from TheRock conventions:
  - GHA artifacts (not S3) for transport — no OIDC/IAM access here yet.
    Will swap to S3 patterns (post_build_upload.py / fetch_artifacts.py)
    when this workflow moves into TheRock. Job graph + naming match
    TheRock so the swap is mechanical.
  - "Linux" prefix on job names — TheRock uses workflow-file-as-platform
    convention, but we keep the prefix consistent with the prior rename
    and our future Windows expansion plans.

ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build & Test" → "Linux Build".
Without it the old rule will dangle (no job named "Linux Build & Test"
exists anymore) and PRs will block on a permanently-pending placeholder.
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 9, 2026
Symmetric to the translator-side change in
ROCm/SPIRV-LLVM-Translator#194. Splits the single Linux Build & Test
job into 4:

  Linux Build (required)
   ├─► Linux Test - SPIRV translator lit  (informational, baseline-diff)
   ├─► Linux Test - LLVM SPIRV codegen    (informational)
   └─► Linux Test - Comgr                  (informational)

Build uploads `build/`, `build-comgr/`, `build-device-libs/` as a single
GHA artifact (`linux-build-tree`) after `strip --strip-unneeded`. Test
jobs do a fresh source checkout + download the artifact.

Difference from the translator copy:
  - PR head IS llvm-project (default checkout, no `repository:`); the
    translator is overlaid at amd-staging tip under llvm/projects/.
  - Translator-lit baseline-diff swaps llvm-project (not the translator)
    via `git fetch origin amd-staging && git checkout FETCH_HEAD` from
    cwd root. The translator overlay at llvm/projects/SPIRV-LLVM-Translator
    is untracked in llvm-project's tree, so the swap doesn't touch it.
  - cmake source dir is `llvm` (not `llvm-project/llvm`).

ACTION REQUIRED on merge: same as #194 — update the amd-staging-psdb
ruleset's required-check context from "Linux Build & Test" → "Linux Build".
@github-actions
Contributor

github-actions Bot commented May 9, 2026

⚠️ 21 pre-existing translator lit failures on baseline; not caused by this PR (see run).

🔴 New failures (0) — likely caused by this PR

(none)

🟢 Fixed by this PR (0) — failing on baseline, passing here

(none)

⚠️ Pre-existing on `amd-staging` (21)
FAIL: LLVM_SPIRV :: constant/local-float-point-constants.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_float8/conversions_matrix.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_float8/conversions_scalar_vector.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_shader_atomic_float_/atomicrmw_fsub_half.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_float4/conversions_packed.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_float4/conversions_scalar_vector.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_fp_conversions/spv_intel_fp_conversions.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_int4/conversions_packed.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_sigmoid/sigmoid_f16.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_bfloat16/cooperative_matrix_bfloat16.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_cooperative_matrix/conversion_instructions.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_subgroup_rotate/SPV_KHR_subgroup_rotate.cl
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_uniform_group_instructions/group-instructions.ll
FAIL: LLVM_SPIRV :: transcoding/float16.ll
FAIL: LLVM_SPIRV :: transcoding/image_signedness_spv_ir.ll
FAIL: LLVM_SPIRV :: transcoding/OpImageSampleExplicitLod_arg.cl
FAIL: LLVM_SPIRV :: transcoding/spec_const.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_clustered_reduce.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_non_uniform_arithmetic.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_shuffle.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_shuffle_relative.ll

Mirrors TheRock's Multi-Arch CI shape: a top-level dispatcher with
platform-variant jobs that call reusable per-platform workflows. The
Linux variant lives in its own file, so adding Windows later is just a
new file + dispatcher entry.

Files:
  - spirv-ci.yml — top-level dispatcher (~25 lines), byte-identical
    across both this repo and ROCm/llvm-project. Triggers on
    pull_request + workflow_dispatch. Sole job (linux_release, name
    Linux::release) calls spirv-ci-linux.yml.
  - spirv-ci-linux.yml — workflow_call. Holds the build job + 3 test
    jobs (factored from the prior single spirv-ci.yml).

Rendered check_run names in the PR rollup (workflow_call composes the
slash hierarchy):
  - SPIRV Compiler CI / Linux::release / Build
  - SPIRV Compiler CI / Linux::release / Test SPIRV translator lit
  - SPIRV Compiler CI / Linux::release / Test LLVM SPIRV codegen
  - SPIRV Compiler CI / Linux::release / Test Comgr

Convention alignment with TheRock (verified against
.github/workflows/multi_arch_ci.yml, multi_arch_ci_linux.yml and the
rest of the workflow set):
  - Top-level workflow name Title Case, no branch suffix
  - Reusable workflow has its own descriptive name
  - snake_case job IDs, separate display name field
  - No literal slashes in job names — slashes come from workflow_call
  - concurrency: only on dispatcher, not on workflow_call
  - secrets: inherit at the dispatcher → variant call
  - permissions read-only at workflow level; pull-requests: write
    escalated only on the translator-lit job that posts the comment
  - workflow_dispatch alongside pull_request — Multi-Arch's check
    names don't get the "(pull_request)" suffix even with both
    triggers, so the disambiguation bug we hit before is sidestepped
    by the workflow_call structure
  - Container image pinned with @sha256:

ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build" → "Linux::release / Build".
Without it the dangling rule blocks PRs on a permanently-pending
placeholder.
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 9, 2026
Mirrors TheRock's Multi-Arch CI shape: top-level dispatcher with
platform-variant jobs that call reusable per-platform workflows. Adding
Windows later = drop in spirv-ci-windows.yml + a windows_release job.

Files:
  - spirv-ci.yml — top-level dispatcher (~25 lines), byte-identical
    to the SPIRV-LLVM-Translator copy. Triggers on pull_request +
    workflow_dispatch. Sole job (linux_release, name Linux::release)
    calls spirv-ci-linux.yml.
  - spirv-ci-linux.yml — workflow_call. Holds the build job + 3 test
    jobs (factored from the prior single spirv-ci.yml). Differs from
    the translator-side copy only in checkout blocks (which repo is
    PR head vs which is pinned to amd-staging tip; same divergence as
    the prior single-file structure).

Rendered check_run names in the PR rollup (workflow_call composes the
slash hierarchy):
  - SPIRV Compiler CI / Linux::release / Build
  - SPIRV Compiler CI / Linux::release / Test SPIRV translator lit
  - SPIRV Compiler CI / Linux::release / Test LLVM SPIRV codegen
  - SPIRV Compiler CI / Linux::release / Test Comgr

Convention alignment: Title Case workflow name with no branch suffix,
snake_case job IDs + display name override, no literal slashes in job
names, concurrency only on dispatcher, secrets: inherit at the
dispatcher, permissions read-only at workflow level with
pull-requests: write escalated only on the translator-lit job, both
pull_request and workflow_dispatch triggers (workflow_call sidesteps
the (pull_request) check-name suffix bug we hit before), pinned
container image.

Companion change: ROCm/SPIRV-LLVM-Translator#194.

ACTION REQUIRED on merge: update the amd-staging-psdb ruleset's
required-check context from "Linux Build" → "Linux::release / Build".
…ure)

The previous commit moved pull-requests: write to a job-level permission
on the test_translator_lit job inside spirv-ci-linux.yml. Per GHA rules,
a called workflow's GITHUB_TOKEN permissions are capped by the caller's
permissions; if the caller's workflow-level grant is `contents: read`
only, the called workflow can't add pull-requests: write — validation
fails at startup, no jobs run.

Symptom: PR #194 ran with conclusion=startup_failure, no check_runs
created, empty PR rollup.

Fix: grant pull-requests: write at the dispatcher's workflow-level
permissions. The called workflow's job-level grant on
test_translator_lit still narrows the actual usage to that single job;
this just lifts the caller-side cap.

Multi-Arch CI gets away with `contents: read` only because none of its
jobs post comments. Ours does, so this grant is required.
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 9, 2026
…ure)

Same fix as ROCm/SPIRV-LLVM-Translator#194 follow-up. The previous
commit moved pull-requests: write to job-level inside spirv-ci-linux.yml,
but per GHA rules a called workflow can't exceed the caller's
permission cap. The dispatcher's `contents: read`-only cap caused
startup_failure (no jobs run, empty rollup).

Lift pull-requests: write to the dispatcher's workflow-level
permissions. The called workflow's job-level grant on
test_translator_lit still scopes the actual usage to that single job.

The previous commit's first run hit two distinct failures rooted in
actions/upload-artifact@v4 defaults:

  1. Comgr test job: "/bin/sh: clang-23: Permission denied" — v4
     strips executable bits on upload, so binaries come back
     non-executable.

  2. Codegen test job: cmake reconfigure (triggered when ninja sees
     the freshly-checked-out source as newer than the build tree)
     failed inside FetchContent's SPIRV-Headers update step with
     "fatal: not a git repository: '.git'" — v4 also excludes
     hidden files (the .git dir under build/_deps/) by default.

Translator-lit job appeared green but actually hit the same first-lit
breakage; continue-on-error: true masked it.

Fix: tar the build trees ourselves before upload, untar after
download. Tar preserves both file modes and hidden files in one shot,
which is simpler than chmodding +x and toggling include-hidden-files
separately.
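
The tar round trip can be sketched as follows. This is an illustration under assumptions rather than the workflow's actual steps: the helper names `pack_build_trees` and `unpack_build_trees` and the archive name are hypothetical.

```shell
set -eu

# Hypothetical helper: bundle build trees into one tarball before upload.
#   $1 = archive path, $2 = directory containing the trees, rest = tree names
pack_build_trees() {
  archive=$1
  parent=$2
  shift 2
  # tar records file modes and includes dot-directories such as .git.
  tar -cf "$archive" -C "$parent" "$@"
}

# Hypothetical helper: restore the trees after download.
#   $1 = archive path, $2 = destination directory
unpack_build_trees() {
  mkdir -p "$2"
  tar -xf "$1" -C "$2"
}
```

For example, `pack_build_trees trees.tar "$PWD" build build-comgr build-device-libs` on the build side and `unpack_build_trees trees.tar "$PWD"` on the test side would bring back executables with their +x bit and the `.git` directories under `build/_deps/`.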
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 10, 2026
Same fix as ROCm/SPIRV-LLVM-Translator#194 follow-up.

actions/upload-artifact@v4 strips executable bits and excludes hidden
files by default. The first symptom: clang-23 came back non-executable
in the Comgr test job ("Permission denied"). The second: cmake
reconfigure on the test side failed inside FetchContent's SPIRV-Headers
git update because the .git dir got dropped on upload.

Tar the build trees before upload and untar after download — preserves
both modes and hidden files in one shot.

Two wall-time bugs found while diagnosing PR #194's slow codegen test:

1. Untar with `tar -xf` restored mtimes from when the build job
   produced them. Test-job source checkouts run just before, so
   freshly-checked-out source files appeared newer than (older) build
   outputs from the tar, and ninja cascade-rebuilt instead of running
   the requested test target. Switch to `tar -xmf` (skip mtime
   restore) so build outputs are newer. Observed ~5-10 min wasted on
   codegen test job.

2. Translator-lit job's translator checkout used `fetch-depth: 0` (full
   history) only to enable the later `git checkout amd-staging` for the
   baseline swap. But `git fetch --depth=1 origin amd-staging` followed
   by `git checkout FETCH_HEAD` works on a shallow clone too. Switch to
   `fetch-depth: 1`. ~30-60s save per run.
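
The mtime fix in item 1 can be reproduced locally. A minimal sketch, assuming a tarball of build outputs that are older than the source checkout; the helper name `unpack_fresh` is hypothetical:

```shell
set -eu

# Hypothetical helper: unpack build outputs without restoring archived mtimes.
#   $1 = archive path, $2 = destination directory
unpack_fresh() {
  mkdir -p "$2"
  # -m: each extracted file gets the extraction time as its mtime, so it ends
  # up newer than any source file that was checked out earlier in the job.
  tar -xmf "$1" -C "$2"
}
```

With plain `tar -xf` the extracted outputs keep the build job's timestamps and look stale next to a fresh checkout, which is the condition that made ninja rebuild.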

Drive-by: trim verbose comments on the tar/untar steps.
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 10, 2026
Same two wall-time bugs as ROCm/SPIRV-LLVM-Translator#194 follow-up:

1. `tar -xf` restored mtimes from when the build job produced files,
   making source (freshly checked out in the test job) appear newer
   than build outputs and triggering ninja cascade-rebuild. Switch to
   `tar -xmf` so build outputs are newer.

2. Translator-lit job's llvm-project checkout used `fetch-depth: 0`
   only to enable the baseline-swap `git checkout amd-staging`. The
   swap step explicitly does `git fetch --depth=1` + `git checkout
   FETCH_HEAD` which works on a shallow clone. Switch to
   `fetch-depth: 1`.
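
The baseline swap on a shallow clone uses the two git commands quoted above. A minimal sketch; the wrapper name `swap_to_baseline` and its arguments are hypothetical, and the remote is assumed to be called `origin`:

```shell
set -eu

# Hypothetical helper: move a (possibly shallow) checkout to the tip of a
# baseline branch without fetching full history.
#   $1 = work tree path, $2 = baseline branch name (e.g. amd-staging)
swap_to_baseline() {
  # Fetch only the tip commit of the baseline branch into FETCH_HEAD.
  git -C "$1" fetch --quiet --depth=1 origin "$2"
  # Check out that commit directly; HEAD becomes detached at the baseline tip.
  git -C "$1" checkout --quiet FETCH_HEAD
}
```

Because the fetch names the branch explicitly, it does not matter that the original `fetch-depth: 1` checkout never downloaded a local `amd-staging` ref.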

Drive-by: trim verbose comments on the tar/untar steps.

The PR-head and baseline lit steps stay continue-on-error so pre-
existing amd-staging breakage doesn't block. The partition comment
already surfaces 🔴 New / 🟢 Fixed / ⚠️ Pre-existing buckets.

Add a final gate step that exits non-zero when newFails > 0 (PR head
FAILs not present in baseline). Real PR-introduced regressions now
turn the check red instead of just dropping into a comment.

Degrades gracefully — if either capture file is missing (e.g., one of
the lit runs failed to produce output) the gate emits a warning and
passes rather than blocking on incomplete data.
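
The gate logic can be sketched in shell as below. This is an assumption about its shape, not the step's actual code: the function name `gate_on_new_fails` and the capture-file arguments are hypothetical, while `::warning::` and `::error::` are standard GitHub Actions workflow commands.

```shell
set -eu

# Hypothetical gate: returns 1 when PR head has FAIL lines that the baseline
# lacks, 0 otherwise. A missing capture file only produces a warning.
#   $1 = file of FAIL lines at PR head, $2 = file of FAIL lines on baseline
gate_on_new_fails() {
  if [ ! -f "$1" ] || [ ! -f "$2" ]; then
    echo "::warning::lit capture file missing; skipping new-failure gate"
    return 0
  fi
  gate_tmp=$(mktemp -d)
  LC_ALL=C sort -u "$1" > "$gate_tmp/head"
  LC_ALL=C sort -u "$2" > "$gate_tmp/base"
  # Count lines present at PR head but absent from the baseline.
  new_count=$(LC_ALL=C comm -23 "$gate_tmp/head" "$gate_tmp/base" | wc -l | tr -d ' ')
  rm -rf "$gate_tmp"
  if [ "$new_count" -gt 0 ]; then
    echo "::error::$new_count new lit failure(s) at PR head"
    return 1
  fi
  return 0
}
```

Used as the last step of the job, a non-zero return turns the check red only for PR-introduced failures; failures shared with the baseline never affect the result.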

Now that the partition logic has been validated end-to-end on PRs
#190/#192 (🔴 New) and #193 (🟢 Fixed), this is a safe step up from
informational to blocking.
lamb-j added a commit to ROCm/llvm-project that referenced this pull request May 10, 2026
Same gate as ROCm/SPIRV-LLVM-Translator#194 follow-up. The lit steps
stay continue-on-error so pre-existing amd-staging breakage doesn't
block, but a final step exits non-zero when newFails > 0 (PR head
FAILs not present in baseline). Real PR-introduced regressions turn
the check red instead of just landing in the partition comment.
@lamb-j lamb-j marked this pull request as ready for review May 10, 2026 19:53
@lamb-j lamb-j requested a review from kirthana14m May 10, 2026 19:53
@lamb-j lamb-j merged commit a689573 into amd-staging May 10, 2026
4 checks passed

1 participant