Skip to content

[ci]: Auto-apply ci:run-all-archs label to submodule bump PRs#5745

Draft
amd-chiranjeevi wants to merge 10 commits into
mainfrom
users/amd-chiranjeevi/modify_presubmit_coverage
Draft

[ci]: Auto-apply ci:run-all-archs label to submodule bump PRs#5745
amd-chiranjeevi wants to merge 10 commits into
mainfrom
users/amd-chiranjeevi/modify_presubmit_coverage

Conversation

@amd-chiranjeevi

@amd-chiranjeevi amd-chiranjeevi commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Motivation

Submodule bump PRs (rocm-systems, rocm-libraries) currently only build the default presubmit families (gfx94x, gfx110x, gfx1151, gfx120x), missing build coverage for older and nightly-only families. Instead of permanently adding families to the presubmit matrix (which adds CI load to every PR), automatically apply ci:run-all-archs when bump_automation.py creates a bump PR. This gives bump PRs:

  • Builds across all known families (gfx900, gfx901x, gfx103x, gfx950, etc.)
  • Families with nightly_check_only_for_family or no hardware remain build-only

Test Plan

  • Verify the next bump PR created by bump_automation.py has the
    ci:run-all-archs label applied automatically.
  • Check the CI run on that bump PR to confirm all families appear
    in the job matrix.

Test Result

To be updated

Submission Checklist

@amd-chiranjeevi amd-chiranjeevi changed the title adding gfx900, gfx906, gfx103x, gfx101x to presubmit matrix [ci]: Add gfx900, gfx101x, gfx103x to presubmit build matrix Jun 10, 2026
@amd-chiranjeevi amd-chiranjeevi changed the title [ci]: Add gfx900, gfx101x, gfx103x to presubmit build matrix [ci]: Add gfx900, gfx101x, gfx103x to presubmit build matrix and extend PyTorch multi-version builds Jun 10, 2026
Comment on lines +276 to +279
strategy:
fail-fast: false
matrix:
pytorch_git_ref: ["release/2.10", "release/2.11", "release/2.12"]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep PRs focused on one topic at a time please. I'm already making this change in #5729 (as announced on #5714 (comment))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Scott for the heads up. I will revert pytorch changes in this PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I mainly want us to avoid duplicating work. Each expansion of the CI matrix has the potential to add considerable load to our build and test systems, as well as increase the risk of flakes, so I want us to roll out these changes carefully.

In the case of the GPU targets/families, we probably want to have submodule update PRs run more jobs than regular PRs. We can currently achieve that with the ci:run-all-archs label https://github.com/ROCm/TheRock/blob/main/docs/development/ci_behavior_manipulation.md or we can add some dedicated code to the configure script for that. We could even have https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/bump_automation.py automatically apply that label to the bump PRs that it creates 🤔

BTW since I created the #5729 from a fork, it's affected by #5737 and the workflow jobs failed. I could recreate from the shared repository but we need to get the CI system back to working on all kinds of PRs. I also decided on that PR to start with just expanding the Linux matrix, leaving Windows alone, as I think/hope that should get us sufficient coverage with minimal extra configuration. Like with the GPU targets/families though, what we should really do is dynamically choose which jobs to run (more on submodule update PRs, less on routine build system changes, opt-in when a developer wants extra testing and sees something the CI system doesn't know to look for on its own yet, etc.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could even have https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/bump_automation.py automatically apply that label to the bump PRs that it creates 🤔

Thanks Scott, this approach looks perfect, I'll proceed with this.

@amd-chiranjeevi amd-chiranjeevi Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now I am looking for what ci:run-all-archs label enable. as per current infra availability, only Linux gf94x, windows gfx110x tests are being tested on the prs.

chiranjeevipattigidi added 2 commits June 10, 2026 16:41
@amd-chiranjeevi amd-chiranjeevi changed the title [ci]: Add gfx900, gfx101x, gfx103x to presubmit build matrix and extend PyTorch multi-version builds [ci]: Add gfx900, gfx101x, gfx103x to presubmit build matrix Jun 10, 2026
@amd-chiranjeevi amd-chiranjeevi changed the title [ci]: Add gfx900, gfx101x, gfx103x to presubmit build matrix [ci]: Auto-apply ci:run-all-archs label to submodule bump PRs Jun 11, 2026
@amd-chiranjeevi amd-chiranjeevi changed the title [ci]: Auto-apply ci:run-all-archs label to submodule bump PRs [ci]: Auto-apply ci:run-all-archs label to submodule bump PRs Jun 11, 2026
"fetch-gfx-targets": ["gfx950"],
"build_variants": ["release", "asan", "tsan"],
# Build-only on PRs (including ci:run-all-archs); tests run on push and nightly
"postsubmit_check_only_for_family": True,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the right change?

presubmit = PR
postsubmit = on push

wouldn't the label ci:run-all-archs cover this and we don't need to add this scenario?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my biggest worry here, is for submodule bumps, we need to make sure specific archs only runs sanity checks (and not full tests)

Comment on lines +944 to +953
# If postsubmit_check_only_for_family is set, skip tests on PRs only
if platform_info.get("postsubmit_check_only_for_family", False) and not (
is_push or is_schedule
):
test_runs_on = ""
print(
f" {family_name}: postsubmit_check_only_for_family flag set, "
f"disabling test runner for pull request runs"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we would still want gfx950 to run all tests since we have enough machines, but the machines that we do not have enough of, we should only run sanity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: TODO

Development

Successfully merging this pull request may close these issues.

3 participants