ci: add nightly wheel build workflow #289
Merged
Made-with: Cursor
Self-hosted runners retain files in runner.temp across runs, causing pip install to fail when it encounters incompatible wheel files from previous runs.
Files in runner.temp are created by Docker as root, so the runner process cannot rm them directly. Use docker containers to clean up, and also clean dist/wheel dirs in the Cleanup steps to prevent future accumulation.
- Create run-intranode-tests composite action with all Python test steps
  (EP, IO, IR, CCL/shmem, collectives, async bench)
- Create run-internode-tests composite action with all internode tests
  (IO write/read, EP bench/stress, async kernel)
- Refactor ci.yml to call composite actions instead of inline steps
- Refactor nightly.yml:
  - Parameterize BASE_IMAGE per Python version (py3.10 uses Ubuntu 22.04,
    py3.12 uses Ubuntu 24.04), eliminating deadsnakes PPA hack
  - Replace inline tests with run-intranode-tests composite action
  - Add test-wheel-internode job on both MI355X_AINIC and MI300X_BNXT
  - deploy-nightly now gates on both test-wheel and test-wheel-internode
…python
Composite actions fold all test steps into one UI entry, making it hard to identify which test failed. Revert to inline steps for full visibility.
Changes kept from previous commits:
- Parameterized BASE_IMAGE per Python version (py3.10 uses Ubuntu 22.04 native image, py3.12 uses Ubuntu 24.04), eliminating deadsnakes PPA
- Docker-based stale dist/wheel cleanup for root-owned files
- Nightly test-wheel now includes all CI intranode Python tests (async_ll IBGDA, bench fp8_blockwise, CCL collectives)
New:
- Add test-wheel-internode job (MI355X_AINIC + MI300X_BNXT)
- Fix node2 rsync: use /tmp/nightly-wheel as remote wheel path instead of runner.temp, which does not exist on the remote node
- deploy-nightly gates on both test-wheel and test-wheel-internode
- Restore ci.yml to main (no changes needed in this PR)
The internode test job was only testing with py3.12. Add py3.10 with its matching base image, giving 4 matrix entries (2 platforms x 2 python).
test-wheel now runs 4 variants (MI355X_AINIC + MI300X_BNXT) x (py3.10 + py3.12) with per-platform RDMA config and socket_ifname, matching the CI intranode-test coverage.
Change pyproject.toml from static version to dynamic (setuptools_scm),
so SETUPTOOLS_SCM_PRETEND_VERSION actually controls the wheel version.
Nightly wheels now use PEP 440 dev format derived from the latest tag:
{base}.dev{date}+{sha} e.g. 1.1.1.dev20260518+abc1234
This makes nightly wheels clearly distinguishable from release builds,
and pip treats .dev as pre-release (won't be installed by default).
Also update the deploy step's prune logic and index.html generation
to parse the new version format.
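The version format above can be sketched as a small helper. This is an illustration only: the function name, the optional leading "v" on tags, and the 7-character short SHA are assumptions, and the workflow itself may compute the string differently (for example in shell before exporting SETUPTOOLS_SCM_PRETEND_VERSION).

```python
import re
from datetime import date


def nightly_version(latest_tag: str, build_date: date, commit_sha: str) -> str:
    """Build a PEP 440 dev version of the form {base}.dev{date}+{sha}.

    Hypothetical helper: assumes the base is the latest tag as-is,
    minus an optional leading 'v'.
    """
    base = latest_tag[1:] if latest_tag.startswith("v") else latest_tag
    if not re.fullmatch(r"\d+(\.\d+)*", base):
        raise ValueError(f"unexpected tag format: {latest_tag!r}")
    # Date is rendered as YYYYMMDD; the local segment carries a short SHA.
    return f"{base}.dev{build_date:%Y%m%d}+{commit_sha[:7]}"
```

Because the result contains a .dev segment, pip classifies it as a pre-release and skips it unless --pre or an exact version pin is given.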
Nightly wheels use .dev pre-release versions, so pip needs --pre to install them.
- Add fetch-depth: 0 and fetch-tags: true to build-wheel checkout so git describe can find version tags for the nightly dev version string.
- Remove MI300X_BNXT py3.10 from intranode and internode test matrices. MI300X_BNXT hosts run Ubuntu 24.04 (GLIBC 2.38) but the py3.10 base image is Ubuntu 22.04 (GLIBC 2.35). The bnxt OOT RDMA libraries mounted from the host require GLIBC 2.38, causing import failures. MI355X_AINIC is unaffected (host is Ubuntu 22.04, GLIBC matches).
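The resulting test matrix can be pictured as a platform x Python product with one excluded pair. A minimal sketch, assuming the exclusion is expressed as an explicit list (the helper name and data layout are illustrative, not taken from nightly.yml):

```python
from itertools import product

PLATFORMS = ("MI355X_AINIC", "MI300X_BNXT")
PYTHONS = ("3.10", "3.12")

# The one combination dropped in this commit: the py3.10 image is
# Ubuntu 22.04, which is older than the MI300X_BNXT host libraries expect.
EXCLUDE = {("MI300X_BNXT", "3.10")}


def build_matrix(platforms, pythons, exclude):
    """Return (platform, python) pairs, skipping excluded combinations."""
    return [
        (platform, python)
        for platform, python in product(platforms, pythons)
        if (platform, python) not in exclude
    ]
```

With the values above this leaves three entries: both Python versions on MI355X_AINIC and py3.12 only on MI300X_BNXT.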
Restructure gh-pages/nightly/ from flat to date-based layout:
  nightly/latest/ - always has the most recent wheels (pip target)
  nightly/2026-05-18/ - wheels from that date
  nightly/2026-05-17/ - ...
Prune logic now removes entire date directories older than 30 days.
index.html groups wheels by date with per-date tables.
Install command:
  pip install --pre --find-links .../nightly/latest/ amd_mori
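The date-directory pruning can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, and treating a directory exactly 30 days old as still retained is a guess at the boundary, not something the commit states.

```python
from datetime import date, timedelta


def dirs_to_prune(dir_names, today, keep_days=30):
    """Return names of YYYY-MM-DD directories older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in dir_names:
        try:
            built = date.fromisoformat(name)
        except ValueError:
            # Not a date directory (e.g. 'latest' or 'index.html'): never pruned.
            continue
        if built < cutoff:
            stale.append(name)
    return sorted(stale)
```

Skipping every name that does not parse as a date keeps latest/ and the top-level index.html out of the prune set by construction.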
Restructure the nightly workflow into 4 stages:
1. build-wheel: build py3.10 + py3.12 wheels (unchanged)
2. deploy-staging: push wheels to gh-pages/nightly/{date}/ but NOT
latest/; outputs wheel_url for test jobs
3. test-wheel + test-wheel-internode: install from gh-pages via
pip --find-links (validates the real user install path); includes
retry loop for CDN propagation delay
4. promote-latest: only after all tests pass, copy tested wheels
to gh-pages/nightly/latest/
This ensures latest/ only ever contains tested wheels, while the
date directories have all builds for traceability.
PR triggers skip deploy-staging (if != pull_request), so the test
and promote jobs are also skipped — PR runs only validate wheel build.
Internode tests no longer need wheel rsync to node2; both nodes
pip install directly from gh-pages.
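The retry loop for CDN propagation might look like the sketch below. The function name, the attempt count, and the delay are placeholders, and the check callable stands in for whatever probe the workflow uses (for example an HTTP request for the wheel URL).

```python
import time
from typing import Callable


def wait_until_available(
    check: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping between failures."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            # No sleep after the final attempt; there is nothing left to wait for.
            sleep(delay_s)
    return False
```

Passing sleep as a parameter keeps the loop testable without real waiting; a caller would fail the job when the function returns False.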
Avoid CI workflow competing for runners while validating nightly.yml. Will re-enable before merging.
When py3.10 and py3.12 matrix jobs share a runner, building both as rocm/mori:ci causes the second build to overwrite the first. Use rocm/mori:ci-py3.10 and rocm/mori:ci-py3.12 instead. Applies to build-wheel, test-wheel, and test-wheel-internode jobs.
The sed pattern was too greedy, matching '-cp312' as part of the version string. Use '-cp[0-9]*-cp' as delimiter to correctly extract the version. Also show full dev version instead of just base version.
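The same extraction expressed in Python, as a sketch. The regex mirrors the '-cp[0-9]*-cp' delimiter described above; the function name and the sample filename (including its platform tag) are hypothetical.

```python
import re

# Lazy match on the version so it stops at the first '-cpNNN-cp' tag pair
# instead of swallowing '-cp312' into the version string.
_WHEEL_VERSION = re.compile(r"^(?P<name>[^-]+)-(?P<version>.+?)-cp[0-9]*-cp")


def wheel_version(filename: str) -> str:
    """Extract the full version (including .dev and +sha) from a wheel filename."""
    match = _WHEEL_VERSION.match(filename)
    if match is None:
        raise ValueError(f"not a CPython wheel filename: {filename!r}")
    return match.group("version")
```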
pip --find-links needs an HTML page with <a href="*.whl"> links to discover wheels. Without index.html in date subdirectories, pip falls back to PyPI and installs the wrong package.
- Generate index.html in each date directory during deploy-staging
- Generate index.html in latest/ directory during promote-latest
- Use --no-index instead of --pre to prevent any PyPI fallback
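A minimal sketch of generating such a page, assuming one plain anchor per wheel is enough for pip to discover it. The function name and page layout are illustrative; the real index.html in this PR also carries tables and commit information.

```python
from html import escape
from urllib.parse import quote


def render_wheel_index(wheel_names, title="nightly wheels"):
    """Render a bare HTML page with one <a href> per .whl file."""
    rows = []
    for name in sorted(wheel_names):
        if not name.endswith(".whl"):
            continue  # only wheels are listed
        # Percent-encode the href ('+' in a local version becomes %2B);
        # the visible link text keeps the original filename.
        rows.append(f'    <a href="{quote(name)}">{escape(name)}</a><br>')
    body = "\n".join(rows)
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n  <head><title>{escape(title)}</title></head>\n"
        f"  <body>\n{body}\n  </body>\n</html>\n"
    )
```

The returned string would be written as index.html next to the wheels in each date directory and in latest/.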
- deploy-staging extracts version from wheel filename and outputs it
- test-wheel and test-wheel-internode use amd_mori==$VERSION to ensure pip installs the exact wheel from this build, not an older one
- index.html now shows "Latest tested version" with commit and link to latest/ directory, plus an "All builds" section header
setuptools_scm generates python/mori/_version.py at build time but __init__.py never imported it, so mori.__version__ was not accessible. Also restore version print in nightly install verification step.
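One way a package can surface the generated version is sketched below. Inside python/mori/__init__.py this would normally be a plain relative import; the standalone helper here (hypothetical name and fallback string) only shows the idea of reading the `version` attribute that setuptools_scm writes and tolerating a missing _version.py in an unbuilt source tree.

```python
import importlib


def load_version(version_module: str, fallback: str = "0.0.0+unknown") -> str:
    """Return the `version` attribute of a generated _version module.

    Falls back to a placeholder when the module has not been generated,
    e.g. when running from a source checkout that was never built.
    """
    try:
        module = importlib.import_module(version_module)
    except ImportError:
        return fallback
    return getattr(module, "version", fallback)
```

A package would then assign the result to __version__, for example `__version__ = load_version("mori._version")`.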
…1:00
- Increase gh-pages propagation retry from 5 to 15 (max ~3.5 min wait)
- Add --force-reinstall to user install instructions so nightly always overwrites any existing release version
- Latest section on index.html now shows a full table (same format as date sections) instead of just a text line
- promote-latest regenerates index.html so the Latest section reflects the just-promoted wheels immediately (not delayed by one run)
- Change cron schedule from 00:00 to 01:00 Beijing time (UTC 17:00)
- Push scope in promote-latest changed from nightly/latest/ to nightly/ to include the regenerated index.html
- ci.yml: re-enable pull_request trigger (was disabled during nightly validation)
- nightly.yml: remove pull_request trigger (only schedule + manual dispatch needed in production)
Summary
Add a nightly wheel build workflow that builds, tests, and publishes pre-built wheels to GitHub Pages.
Workflow stages:
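- build-wheel: build py3.10 + py3.12 wheels
- deploy-staging: push wheels to gh-pages/nightly/{date}/
- test-wheel + test-wheel-internode: install from gh-pages and run tests
- promote-latest: after all tests pass, copy tested wheels to gh-pages/nightly/latest/
Install nightly: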
Other changes:
- pyproject.toml: switch to dynamic version via setuptools_scm (nightly uses PEP 440 dev format: 1.1.1.dev20260518+abc1234)
- python/mori/__init__.py: expose __version__
- ci.yml: print version in install verification step