Disable pytorch sccache for PRs from forks#5738

Closed
ScottTodd wants to merge 1 commit into
ROCm:mainfrom
ScottTodd:pytorch-sccache-s3-auth-patch
Conversation

@ScottTodd

Member

Motivation

Tentative fix for #5737, to unblock #5729.

Workflow runs for PRs from forks outside the ROCm organization are not authorized to assume the therock-ci AWS IAM role, so PyTorch builds fail on fork PRs with errors like:

It looks like you might be trying to authenticate with OIDC. Did you mean to set the `id-token` permission? If you are not trying to authenticate with OIDC and the action is working successfully, you can ignore this message.
Assuming role with user credentials
Retry AssumeRole: attempt 1 of 12 failed: Could not assume role with user credentials: User: arn:aws:iam::692859939525:user/therock-external-upload is not authorized to perform: sts:TagSession on resource: arn:aws:iam::324352301041:role/therock-ci. Retrying after 43ms.
Assuming role with user credentials
Retry AssumeRole: attempt 2 of 12 failed: Could not assume role with user credentials: User: arn:aws:iam::692859939525:user/therock-external-upload is not authorized to perform: sts:TagSession on resource: arn:aws:iam::324352301041:role/therock-ci. Retrying after 79ms.
Assuming role with user credentials
Retry AssumeRole: attempt 3 of 12 failed: Could not assume role with user credentials: User: arn:aws:iam::692859939525:user/therock-external-upload is not authorized to perform: sts:TagSession on resource: arn:aws:iam::324352301041:role/therock-ci. Retrying after 110ms.
Assuming role with user credentials
Retry AssumeRole: attempt 4 of 12 failed: Could not assume role with user credentials: User: arn:aws:iam::692859939525:user/therock-external-upload is not authorized to perform: sts:TagSession on resource: arn:aws:iam::324352301041:role/therock-ci. Retrying after 10ms.
Assuming role with user credentials
Retry AssumeRole: attempt 5 of 12 failed: Could not assume role with user credentials: User: arn:aws:iam::692859939525:user/therock-external-upload is not authorized to perform: sts:TagSession on resource: arn:aws:iam::324352301041:role/therock-ci. Retrying after 270ms.

Technical Details

It looks like github.repository_owner == 'ROCm' is true for PRs from forks, since it refers to the base repository rather than the fork. I think github.event.pull_request.head.repo.owner.login == 'ROCm' should be a more precise check.
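
To see why the original gate lets fork PRs through, the two conditions can be modeled in plain Python. This is only an illustrative sketch: `old_gate` and `new_gate` are hypothetical helper names, `some-contributor` is a made-up fork owner, and the dictionaries mimic just the `pull_request` payload fields that the expressions reference.

```python
def old_gate(cache_type: str, repository_owner: str) -> bool:
    # github.repository_owner is the owner of the repository the workflow
    # runs in (the PR's base repository), not the owner of the PR's branch.
    return cache_type == "sccache" and repository_owner == "ROCm"


def new_gate(cache_type: str, repository_owner: str,
             event_name: str, event: dict) -> bool:
    if not old_gate(cache_type, repository_owner):
        return False
    if event_name != "pull_request":
        return True  # non-PR events have no head repository to compare
    head_repo = (event.get("pull_request") or {}).get("head", {}).get("repo") or {}
    return head_repo.get("owner", {}).get("login") == "ROCm"


# Minimal, hand-written payloads (assumed shape of the pull_request event).
fork_pr = {"pull_request": {"head": {"repo": {"owner": {"login": "some-contributor"}}}}}
in_org_pr = {"pull_request": {"head": {"repo": {"owner": {"login": "ROCm"}}}}}

print(old_gate("sccache", "ROCm"))                             # True, even for a fork PR
print(new_gate("sccache", "ROCm", "pull_request", fork_pr))    # False
print(new_gate("sccache", "ROCm", "pull_request", in_org_pr))  # True
```

The old gate never looks at the head repository, so it cannot tell a fork PR from an in-org PR.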

Longer term, we should give these runners base credentials that can access some cache, similar to how artifacts can be uploaded to therock-ci-artifacts-external (see https://github.com/ROCm/TheRock/blob/main/docs/development/s3_buckets.md#ci-buckets).

Test Plan

Watch to see if workflow runs on this PR successfully skip sccache setup and build pytorch.

Submission Checklist

Comment on lines 163 to +164
  - name: Configure AWS Credentials for sccache
-   if: ${{ inputs.cache_type == 'sccache' && github.repository_owner == 'ROCm' }}
+   if: ${{ inputs.cache_type == 'sccache' && github.repository_owner == 'ROCm' && (github.event_name != 'pull_request' || github.event.pull_request.head.repo.owner.login == 'ROCm') }}

Member Author

I guess we can limit these pull_request changes to just the _ci.yml workflows.

This isn't working, though. Other steps still fail: https://github.com/ROCm/TheRock/actions/runs/27240729872/job/80461488687?pr=5738#step:14:73

sccache: Starting the server...
sccache: error: Server startup failed: cache storage failed to read: PermissionDenied (permanent) at read => S3Error { code: "AccessDenied", message: "User: arn:aws:iam::692859939525:user/therock-external-upload is not authorized to perform: s3:GetObject on resource: \"arn:aws:s3:::therock-pytorch-sccache-ci/linux/multi-arch-release/.sccache_check\" because no identity-based policy allows the s3:GetObject action", resource: "", request_id: "RQ3NQ6XYPNWWZQAA" }

We could compute earlier in the workflow whether or not to enable sccache, based on inputs.cache_type and other values. Really, this all needs to go through scripts rather than having so many options handled in yml code (different cache types that may or may not work under different conditions).

@ScottTodd

Member Author

Closing in favor of #5816

@ScottTodd ScottTodd closed this Jun 12, 2026
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Jun 12, 2026
subodh-dubey-amd added a commit that referenced this pull request Jun 14, 2026
## Summary

Fixes #5737 (supersedes the Draft #5738). PyTorch CI jobs fail on PRs
from forks at the sccache steps:

```
User: arn:aws:iam::692859939525:user/therock-external-upload is not authorized to perform:
sts:TagSession on resource: arn:aws:iam::324352301041:role/therock-ci
... AccessDenied ... s3:GetObject on therock-pytorch-sccache-ci/.../.sccache_check
```

### Root cause

sccache uses an S3 bucket reached via an OIDC-assumed IAM role
(`therock-ci`). Fork PRs don't get OIDC tokens, so the role can't be
assumed. The gate `github.repository_owner == 'ROCm'` is **always true**
on fork PRs (it's the base repo), and it only guarded the credentials
step — `Verify sccache` and the build still passed `--use-sccache`, so
sccache started and failed on S3 access.

### Fix

Compute the effective cache type **once**, in
[build_tools/github_actions/compute_pytorch_cache_type.py](https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/compute_pytorch_cache_type.py)
(reusing `_is_current_run_pr_from_fork()` from `s3_buckets.py`),
downgrading `sccache` -> `none` on fork PRs / non-ROCm repos. Every
sccache step in the two `*_pytorch_wheels_ci.yml` workflows now keys off
that single `steps.cache.outputs.cache_type` instead of repeating fork
checks. `ccache`/`none` pass through unchanged; in-org runs keep sccache
and its hard-fail semantics.

This follows the review direction on #5738 ("compute earlier… this all
needs to go through scripts, not yml code") and limits changes to the
`_ci.yml` workflows (the release / `multi_arch_build_*` workflows are
never fork-triggered).
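
A minimal sketch of this "compute once" idea follows. It is not the contents of `compute_pytorch_cache_type.py`: the function names and the fork test are assumptions for illustration (the real script reuses `_is_current_run_pr_from_fork()` from `s3_buckets.py`, whose implementation is not shown here). The only GitHub Actions convention relied on is that appending `name=value` to the file named by `GITHUB_OUTPUT` sets a step output.

```python
import os

ORG = "ROCm"


def is_pr_from_fork(event_name: str, event: dict) -> bool:
    """Hypothetical stand-in for the fork check; assumes a pull_request payload."""
    if event_name != "pull_request":
        return False
    head_repo = (event.get("pull_request") or {}).get("head", {}).get("repo") or {}
    return head_repo.get("owner", {}).get("login") != ORG


def effective_cache_type(requested: str, repository_owner: str,
                         event_name: str, event: dict) -> str:
    # Only sccache depends on the S3-backed role; ccache and none pass through.
    if requested != "sccache":
        return requested
    if repository_owner != ORG or is_pr_from_fork(event_name, event):
        return "none"
    return "sccache"


def write_step_output(name: str, value: str) -> None:
    # GitHub Actions reads step outputs from the file named by GITHUB_OUTPUT.
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path:
        with open(output_path, "a", encoding="utf-8") as f:
            f.write(f"{name}={value}\n")
    else:
        print(f"{name}={value}")  # local run: just show the value
```

With the decision made in one place, each later step only needs to compare the step output against `sccache`, instead of re-deriving fork status in every `if:` expression.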

## Files

- `build_tools/github_actions/compute_pytorch_cache_type.py` (new)
- `build_tools/github_actions/tests/compute_pytorch_cache_type_test.py`
(new, 8 cases)
- `build_portable_linux_pytorch_wheels_ci.yml`,
`build_windows_pytorch_wheels_ci.yml` — add "Determine cache type" step;
gate all sccache steps on its output.

## Test plan

- [x] Unit tests (fork / in-org / non-PR / non-ROCm / none / ccache).
- [x] Real-script test against fork-shaped event payloads:
fork+sccache->none, in-org+sccache->sccache, fork+ccache->ccache.
- [ ] In-org CI run on this PR: sccache still active + cache hits (no
regression).
- [ ] Fork PR (#5729) after merge: sccache skipped, build succeeds with
cache_type=none.
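
The "fork-shaped event payload" check in the test plan could look roughly like the sketch below. It writes a minimal payload to a temporary file and passes its location as `GITHUB_EVENT_PATH`, which is how a runner exposes the event JSON to scripts. `resolve_from_env`, `run_case`, the `REQUESTED_CACHE_TYPE` variable, and the `some-contributor` owner are all invented for illustration and are not the real script's interface.

```python
import json
import os
import tempfile


def resolve_from_env(env: dict) -> str:
    """Hypothetical: downgrade sccache to none for fork PRs, driven by env vars."""
    requested = env.get("REQUESTED_CACHE_TYPE", "none")  # invented variable name
    if requested != "sccache":
        return requested
    if env.get("GITHUB_REPOSITORY_OWNER") != "ROCm":
        return "none"
    if env.get("GITHUB_EVENT_NAME") == "pull_request":
        with open(env["GITHUB_EVENT_PATH"], encoding="utf-8") as f:
            event = json.load(f)
        head_owner = event["pull_request"]["head"]["repo"]["owner"]["login"]
        if head_owner != "ROCm":
            return "none"
    return "sccache"


def run_case(requested: str, head_owner: str) -> str:
    # Build a minimal pull_request payload with only the fields read above.
    payload = {"pull_request": {"head": {"repo": {"owner": {"login": head_owner}}}}}
    with tempfile.TemporaryDirectory() as tmp:
        event_path = os.path.join(tmp, "event.json")
        with open(event_path, "w", encoding="utf-8") as f:
            json.dump(payload, f)
        env = {
            "REQUESTED_CACHE_TYPE": requested,
            "GITHUB_REPOSITORY_OWNER": "ROCm",
            "GITHUB_EVENT_NAME": "pull_request",
            "GITHUB_EVENT_PATH": event_path,
        }
        return resolve_from_env(env)


for requested, head_owner in [
    ("sccache", "some-contributor"),
    ("sccache", "ROCm"),
    ("ccache", "some-contributor"),
]:
    print(f"{requested} from {head_owner} -> {run_case(requested, head_owner)}")
```

Passing the environment as a plain dictionary keeps the cases independent of the machine's real environment variables.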