Disable PyTorch sccache on fork PRs via a computed cache_type#5816
Conversation
PyTorch sccache uses an S3 bucket reached through an OIDC-assumed IAM role (therock-ci). Pull requests from forks (and runs in non-ROCm repos) cannot assume that role, so the build failed: the credentials step was gated on `github.repository_owner == 'ROCm'` (always true on fork PRs) and the verify / build steps still passed `--use-sccache`, so sccache started and failed on `s3:GetObject` (AccessDenied). Instead of repeating fork checks across several `if:` conditions, compute the effective cache type once in build_tools/github_actions/compute_pytorch_cache_type.py (reusing _is_current_run_pr_from_fork from s3_buckets.py), downgrading `sccache` -> `none` on fork PRs / non-ROCm repos, and gate every sccache step in the two `*_pytorch_wheels_ci.yml` workflows on that single output. ccache and none pass through unchanged; in-org runs keep sccache (and its hard-fail semantics) intact. Scope is the CI workflows only -- the release / multi_arch_build workflows are never triggered by fork PRs. Fixes #5737. Co-authored-by: Cursor <cursoragent@cursor.com>
ScottTodd
left a comment
There was a problem hiding this comment.
Thanks. I think the longer term fix here will be to move the pytorch builds to AWS runners that have base credentials for writing to a dev cache regardless of whether the triggering workflow run is from a fork PR or not. Do we have that worked into the design @amd-shiraz ? We should have caches set up in that way for pytorch (e.g. sccache here), ROCm (e.g. ccache today), and other projects like JAX too.
|
|
||
| sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) | ||
|
|
||
| from _therock_utils.s3_buckets import _is_current_run_pr_from_fork |
There was a problem hiding this comment.
underscore-prefixed functions like _is_current_run_pr_from_fork are considered private implementation details of the current file and should not typically be imported from other files. I'm willing to let this one slide though.
There was a problem hiding this comment.
Cleaned up in #5848 — added a public is_current_run_pr_from_fork() wrapper and repointed the import. Thanks!
|
Verified on #5729 (a real fork PR) after merging main — Multi-Arch CI run 27493147226. The fork path now behaves correctly, e.g. the Linux release/2.10 Build PyTorch job:
All 4 PyTorch build cells passed:
|
#5848) Follow-up to #5816, addressing the review nit at #5816 (comment) (importing the private `_is_current_run_pr_from_fork` across files). Adds a public `is_current_run_pr_from_fork()` wrapper in `s3_buckets.py` and repoints `compute_pytorch_cache_type.py` (and its test mocks) to it. No behavior change. ## Test plan - [x] `compute_pytorch_cache_type_test.py` (8 cases) pass - [x] Real fork-event smoke test: fork+sccache -> none ## Submission Checklist Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
#5848) Follow-up to #5816, addressing the review nit at #5816 (comment) (importing the private `_is_current_run_pr_from_fork` across files). Adds a public `is_current_run_pr_from_fork()` wrapper in `s3_buckets.py` and repoints `compute_pytorch_cache_type.py` (and its test mocks) to it. No behavior change. ## Test plan - [x] `compute_pytorch_cache_type_test.py` (8 cases) pass - [x] Real fork-event smoke test: fork+sccache -> none ## Submission Checklist Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Summary
Fixes #5737 (supersedes the Draft #5738). PyTorch CI jobs fail on PRs from forks at the sccache steps:
Root cause
sccache uses an S3 bucket reached via an OIDC-assumed IAM role (
therock-ci). Fork PRs don't get OIDC tokens, so the role can't be assumed. The gategithub.repository_owner == 'ROCm'is always true on fork PRs (it's the base repo), and it only guarded the credentials step —Verify sccacheand the build still passed--use-sccache, so sccache started and failed on S3 access.Fix
Compute the effective cache type once, in build_tools/github_actions/compute_pytorch_cache_type.py (reusing
_is_current_run_pr_from_fork()froms3_buckets.py), downgradingsccache->noneon fork PRs / non-ROCm repos. Every sccache step in the two*_pytorch_wheels_ci.ymlworkflows now keys off that singlesteps.cache.outputs.cache_typeinstead of repeating fork checks.ccache/nonepass through unchanged; in-org runs keep sccache and its hard-fail semantics.This follows the review direction on #5738 ("compute earlier… this all needs to go through scripts, not yml code") and limits changes to the
_ci.ymlworkflows (the release /multi_arch_build_*workflows are never fork-triggered).Files
build_tools/github_actions/compute_pytorch_cache_type.py(new)build_tools/github_actions/tests/compute_pytorch_cache_type_test.py(new, 8 cases)build_portable_linux_pytorch_wheels_ci.yml,build_windows_pytorch_wheels_ci.yml— add "Determine cache type" step; gate all sccache steps on its output.Test plan