Expand Linux PyTorch presubmit matrix to include release/2.11 and release/2.12 #5729

Draft
ScottTodd wants to merge 2 commits into ROCm:main from ScottTodd:torch-presubmit-matrix

@ScottTodd

Motivation

We recently had a build break that only affected torch versions release/2.11+:

Expanding build coverage on presubmit to catch these issues earlier is part of:

Technical Details

Important

This is expected to add roughly 40-60 minutes of build runner usage per PyTorch version to each CI workflow run. These builds can run in parallel with the GPU test jobs and the existing PyTorch build job, so total time to signal should not change, although the load on our CPU runner pools will increase.

Migrating the build runners to AWS (higher core count, closer to the AWS sccache) will also improve build times.

We can choose to conditionally enable these additional jobs to limit how many workflow runs include them. References:

I'm limiting this to Linux for now. We could also expand coverage on Windows, though those builds are slower.

Test Plan

Watch CI on this PR (expected to fail until #5714 is resolved)

Submission Checklist

Run the Linux multi-arch PyTorch presubmit build across release/2.10, release/2.11, and release/2.12 while keeping the presubmit Python version pinned to 3.12.
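
As a rough illustration of the expansion described above, the sketch below builds one matrix entry per PyTorch release branch while holding the Python version fixed. The key names `pytorch_git_ref` and `python_version` and the helper `build_presubmit_matrix` are hypothetical and are not taken from `multi_arch_ci_linux.yml`; only the three branch names and the 3.12 pin come from the commit message.

```python
import json

# Branch names and the Python pin come from the commit message above.
PYTORCH_REFS = ["release/2.10", "release/2.11", "release/2.12"]
PRESUBMIT_PYTHON_VERSION = "3.12"


def build_presubmit_matrix(refs=PYTORCH_REFS, python_version=PRESUBMIT_PYTHON_VERSION):
    """Return a GitHub Actions style matrix with one entry per PyTorch ref.

    The key names are hypothetical; the real workflow may use different ones.
    """
    return {
        "include": [
            {"pytorch_git_ref": ref, "python_version": python_version}
            for ref in refs
        ]
    }


if __name__ == "__main__":
    # A workflow could consume this JSON through fromJSON() in a matrix expression.
    print(json.dumps(build_presubmit_matrix()))
```

Each extra entry corresponds to one additional build job, which is where the per-version runner cost mentioned in the PR description comes from.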

Tested with: pre-commit run --files .github/workflows/multi_arch_ci_linux.yml

Co-authored-by: OpenAI Codex <codex@openai.com>
subodh-dubey-amd added a commit that referenced this pull request Jun 14, 2026
## Summary

Fixes #5737 (supersedes draft #5738). PyTorch CI jobs fail at the
sccache steps on PRs from forks:

```
User: arn:aws:iam::692859939525:user/therock-external-upload is not authorized to perform:
sts:TagSession on resource: arn:aws:iam::324352301041:role/therock-ci
... AccessDenied ... s3:GetObject on therock-pytorch-sccache-ci/.../.sccache_check
```

### Root cause

sccache uses an S3 bucket reached via an OIDC-assumed IAM role
(`therock-ci`). Fork PRs don't get OIDC tokens, so the role can't be
assumed. The gate `github.repository_owner == 'ROCm'` is **always true**
on fork PRs (it reflects the base repo's owner, not the fork's), and it
only guarded the credentials step. `Verify sccache` and the build still
passed `--use-sccache`, so sccache started and failed on S3 access.
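
To make the distinction concrete, here is a minimal sketch of telling a fork PR apart from an in-org PR using the `pull_request` event payload. It is not the implementation of `_is_current_run_pr_from_fork()` in `s3_buckets.py`, which is not shown here; the function name `is_fork_pull_request` and the fork name `someuser/TheRock` are made up for illustration. It assumes the standard payload fields `pull_request.head.repo.full_name` and `pull_request.base.repo.full_name`.

```python
import json
import os


def is_fork_pull_request(event: dict) -> bool:
    """Return True when the PR's head repo differs from its base repo.

    Non-PR events (push, schedule, ...) have no "pull_request" key and
    are treated as not-a-fork.
    """
    pr = event.get("pull_request")
    if not pr:
        return False
    # head.repo can be null if the fork was deleted; treat that as a fork.
    head_repo = (pr.get("head") or {}).get("repo") or {}
    base_repo = (pr.get("base") or {}).get("repo") or {}
    return head_repo.get("full_name") != base_repo.get("full_name")


def load_event_from_env() -> dict:
    """Load the event payload that GitHub Actions writes to GITHUB_EVENT_PATH."""
    path = os.environ.get("GITHUB_EVENT_PATH")
    if not path:
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Illustrative payloads (trimmed to the fields used above).
fork_event = {
    "pull_request": {
        "head": {"repo": {"full_name": "someuser/TheRock"}},
        "base": {"repo": {"full_name": "ROCm/TheRock"}},
    }
}
in_org_event = {
    "pull_request": {
        "head": {"repo": {"full_name": "ROCm/TheRock"}},
        "base": {"repo": {"full_name": "ROCm/TheRock"}},
    }
}
```

In both payloads the base repo belongs to `ROCm`, which is why an owner check on the base repository cannot separate the two cases.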

### Fix

Compute the effective cache type **once**, in
[build_tools/github_actions/compute_pytorch_cache_type.py](https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/compute_pytorch_cache_type.py)
(reusing `_is_current_run_pr_from_fork()` from `s3_buckets.py`),
downgrading `sccache` -> `none` on fork PRs / non-ROCm repos. Every
sccache step in the two `*_pytorch_wheels_ci.yml` workflows now keys off
that single `steps.cache.outputs.cache_type` instead of repeating fork
checks. `ccache`/`none` pass through unchanged; in-org runs keep sccache
and its hard-fail semantics.
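
The downgrade rule described above can be sketched as a small pure function plus a helper that publishes the result as a step output. This is an illustration under stated assumptions, not the contents of `compute_pytorch_cache_type.py`: the names `compute_effective_cache_type` and `write_step_output` are hypothetical, and the fork and owner signals are passed in as plain arguments rather than read from the real helpers.

```python
import os


def compute_effective_cache_type(requested: str, *, repo_owner: str, is_fork_pr: bool) -> str:
    """Downgrade sccache to none when the S3-backed cache is unreachable.

    Only "sccache" is affected; "ccache" and "none" pass through unchanged.
    """
    if requested == "sccache" and (is_fork_pr or repo_owner != "ROCm"):
        return "none"
    return requested


def write_step_output(name: str, value: str, path=None) -> None:
    """Append name=value to the file GitHub Actions exposes as GITHUB_OUTPUT."""
    path = path or os.environ.get("GITHUB_OUTPUT")
    if not path:
        return  # Not running under GitHub Actions; nothing to publish.
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{name}={value}\n")


if __name__ == "__main__":
    cache_type = compute_effective_cache_type(
        "sccache", repo_owner="ROCm", is_fork_pr=True
    )
    # Later steps could then read steps.<id>.outputs.cache_type.
    write_step_output("cache_type", cache_type)
    print(cache_type)
```

Keeping the decision in one function means every later step checks a single value instead of re-deriving the fork condition in YAML.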

This follows the review direction on #5738 ("compute earlier… this all
needs to go through scripts, not yml code") and limits changes to the
`_ci.yml` workflows (the release / `multi_arch_build_*` workflows are
never fork-triggered).

## Files

- `build_tools/github_actions/compute_pytorch_cache_type.py` (new)
- `build_tools/github_actions/tests/compute_pytorch_cache_type_test.py`
(new, 8 cases)
- `build_portable_linux_pytorch_wheels_ci.yml`,
`build_windows_pytorch_wheels_ci.yml` — add "Determine cache type" step;
gate all sccache steps on its output.

## Test plan

- [x] Unit tests (fork / in-org / non-PR / non-ROCm / none / ccache).
- [x] Real-script test against fork-shaped event payloads:
fork+sccache->none, in-org+sccache->sccache, fork+ccache->ccache.
- [ ] In-org CI run on this PR: sccache still active + cache hits (no
regression).
- [ ] Fork PR (#5729) after merge: sccache skipped, build succeeds with
cache_type=none.