
ci: add nightly wheel build workflow #289

Merged
jhchouuu merged 25 commits into main from jiahzhou/nightly-ci
May 18, 2026


jhchouuu (Collaborator) commented Apr 22, 2026

Summary

Add a nightly wheel build workflow that builds, tests, and publishes pre-built wheels to GitHub Pages.

Workflow stages:

  1. build-wheel — Build py3.10 + py3.12 wheels on MI355X-AINIC-TW
  2. deploy-staging — Push wheels to gh-pages/nightly/{date}/
  3. test-wheel + test-wheel-internode — Install from gh-pages and run full test suite (3 intranode + 3 internode variants)
  4. promote-latest — Copy tested wheels to gh-pages/nightly/latest/

Install nightly:

pip install --no-index --force-reinstall --find-links https://rocm.github.io/mori/nightly/latest/ amd_mori

Other changes:

  • pyproject.toml: switch to dynamic version via setuptools_scm (nightly uses PEP 440 dev format: 1.1.1.dev20260518+abc1234)
  • python/mori/__init__.py: expose __version__
  • ci.yml: print version in install verification step

jhchouuu force-pushed the jiahzhou/nightly-ci branch from c6461f0 to a6cccc3 on May 18, 2026 04:48
Self-hosted runners retain files in runner.temp across runs, causing
pip install to fail when it encounters incompatible wheel files from
previous runs.
jhchouuu force-pushed the jiahzhou/nightly-ci branch from a6cccc3 to ab52ff0 on May 18, 2026 04:49
jhchouuu added 21 commits May 18, 2026 04:52
Files in runner.temp are created by Docker as root, so the runner
process cannot rm them directly. Use docker containers to clean up,
and also clean dist/wheel dirs in the Cleanup steps to prevent
future accumulation.

- Create run-intranode-tests composite action with all Python test steps
  (EP, IO, IR, CCL/shmem, collectives, async bench)
- Create run-internode-tests composite action with all internode tests
  (IO write/read, EP bench/stress, async kernel)
- Refactor ci.yml to call composite actions instead of inline steps
- Refactor nightly.yml:
  - Parameterize BASE_IMAGE per Python version (py3.10 uses Ubuntu 22.04,
    py3.12 uses Ubuntu 24.04), eliminating deadsnakes PPA hack
  - Replace inline tests with run-intranode-tests composite action
  - Add test-wheel-internode job on both MI355X_AINIC and MI300X_BNXT
  - deploy-nightly now gates on both test-wheel and test-wheel-internode

…python

Composite actions fold all test steps into one UI entry, making it hard
to identify which test failed.  Revert to inline steps for full visibility.

Changes kept from previous commits:
- Parameterized BASE_IMAGE per Python version (py3.10 uses Ubuntu 22.04
  native image, py3.12 uses Ubuntu 24.04), eliminating deadsnakes PPA
- Docker-based stale dist/wheel cleanup for root-owned files
- Nightly test-wheel now includes all CI intranode Python tests
  (async_ll IBGDA, bench fp8_blockwise, CCL collectives)

New:
- Add test-wheel-internode job (MI355X_AINIC + MI300X_BNXT)
- Fix node2 rsync: use /tmp/nightly-wheel as remote wheel path instead
  of runner.temp which does not exist on the remote node
- deploy-nightly gates on both test-wheel and test-wheel-internode
- Restore ci.yml to main (no changes needed in this PR)

The internode test job was only testing with py3.12. Add py3.10 with
its matching base image, giving 4 matrix entries (2 platforms x 2 python).

test-wheel now runs 4 variants (MI355X_AINIC + MI300X_BNXT) x
(py3.10 + py3.12) with per-platform RDMA config and socket_ifname,
matching the CI intranode-test coverage.

Change pyproject.toml from static version to dynamic (setuptools_scm),
so SETUPTOOLS_SCM_PRETEND_VERSION actually controls the wheel version.

Nightly wheels now use PEP 440 dev format derived from the latest tag:
  {base}.dev{date}+{sha}  e.g. 1.1.1.dev20260518+abc1234

This makes nightly wheels clearly distinguishable from release builds,
and pip treats .dev as pre-release (won't be installed by default).
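The version format above can be sketched in a few lines. This is only an illustration: `nightly_version` and `is_nightly` are hypothetical helper names, the workflow's real logic is not shown in this PR text, and the regular expression covers this one shape rather than all of PEP 440.

```python
import re
from datetime import date


def nightly_version(base: str, build_date: date, sha: str) -> str:
    """Return '{base}.dev{YYYYMMDD}+{sha}', the nightly format described above."""
    return f"{base}.dev{build_date:%Y%m%d}+{sha}"


# Simplified pattern for this one shape only; not a full PEP 440 parser.
_NIGHTLY_RE = re.compile(
    r"^(?P<base>\d+(?:\.\d+)*)\.dev(?P<date>\d{8})\+(?P<sha>[0-9a-z]+)$"
)


def is_nightly(version: str) -> bool:
    """True if the string has the base.devDATE+sha shape used for nightlies."""
    return _NIGHTLY_RE.match(version) is not None


version = nightly_version("1.1.1", date(2026, 5, 18), "abc1234")
print(version)
```

The `.dev` segment is what marks the build as a pre-release, and the `+sha` suffix is a local version label that identifies the commit.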

Also update the deploy step's prune logic and index.html generation
to parse the new version format.

Nightly wheels use .dev pre-release versions, so pip needs --pre to
install them.

- Add fetch-depth: 0 and fetch-tags: true to build-wheel checkout so
  git describe can find version tags for the nightly dev version string.

- Remove MI300X_BNXT py3.10 from intranode and internode test matrices.
  MI300X_BNXT hosts run Ubuntu 24.04 (GLIBC 2.38) but the py3.10 base
  image is Ubuntu 22.04 (GLIBC 2.35). The bnxt OOT RDMA libraries
  mounted from the host require GLIBC 2.38, causing import failures.
  MI355X_AINIC is unaffected (host is Ubuntu 22.04, GLIBC matches).
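The resulting test matrix can be written out as a small sketch. The platform and Python names come from the commits above; the `build_matrix` helper and the way the exclusion is expressed are illustrative assumptions, not the workflow's actual YAML.

```python
from itertools import product

PLATFORMS = ["MI355X_AINIC", "MI300X_BNXT"]
PYTHONS = ["3.10", "3.12"]

# Dropped because the py3.10 base image (Ubuntu 22.04) is older than the
# MI300X_BNXT host whose RDMA libraries are mounted into the container.
EXCLUDED = {("MI300X_BNXT", "3.10")}


def build_matrix():
    """Return every (platform, python) pair except the excluded ones."""
    return [
        (platform, py)
        for platform, py in product(PLATFORMS, PYTHONS)
        if (platform, py) not in EXCLUDED
    ]


matrix = build_matrix()
for platform, py in matrix:
    print(platform, py)
```

Two platforms times two Python versions gives four entries, and the single exclusion leaves the three variants per test job mentioned in the summary.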
Restructure gh-pages/nightly/ from flat to date-based layout:
  nightly/latest/     - always has the most recent wheels (pip target)
  nightly/2026-05-18/ - wheels from that date
  nightly/2026-05-17/ - ...

Prune logic now removes entire date directories older than 30 days.
index.html groups wheels by date with per-date tables.
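A minimal sketch of the date-directory pruning, assuming directory names are ISO dates and that a directory exactly 30 days old is kept. `dirs_to_prune` is a hypothetical helper; the workflow's real prune step is not shown here.

```python
from datetime import date, timedelta


def dirs_to_prune(names, today, keep_days=30):
    """Return the date-named directories strictly older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in names:
        try:
            built = date.fromisoformat(name)
        except ValueError:
            continue  # "latest", "index.html", etc. are not date directories
        if built < cutoff:
            stale.append(name)
    return sorted(stale)


entries = ["latest", "index.html", "2026-05-18", "2026-04-18", "2026-04-17", "2026-03-01"]
stale = dirs_to_prune(entries, today=date(2026, 5, 18))
print(stale)
```

Skipping names that do not parse as dates keeps `latest/` and the top-level index out of the prune set.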

Install command: pip install --pre --find-links .../nightly/latest/ amd_mori

Restructure the nightly workflow into 4 stages:

1. build-wheel: build py3.10 + py3.12 wheels (unchanged)
2. deploy-staging: push wheels to gh-pages/nightly/{date}/ but NOT
   latest/; outputs wheel_url for test jobs
3. test-wheel + test-wheel-internode: install from gh-pages via
   pip --find-links (validates the real user install path); includes
   retry loop for CDN propagation delay
4. promote-latest: only after all tests pass, copy tested wheels
   to gh-pages/nightly/latest/

This ensures latest/ only ever contains tested wheels, while the
date directories have all builds for traceability.
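The gating idea, that latest/ is only touched after tests pass, can be sketched on a local directory tree. `promote_latest` is a hypothetical helper, and replacing the whole latest/ directory on each promotion is an assumption; the real job operates on the gh-pages branch.

```python
import shutil
import tempfile
from pathlib import Path


def promote_latest(nightly_root: Path, build_date: str, tests_passed: bool) -> list:
    """Copy wheels from nightly/{date}/ to nightly/latest/ only if tests passed."""
    if not tests_passed:
        return []  # leave latest/ untouched
    src = nightly_root / build_date
    dst = nightly_root / "latest"
    if dst.exists():
        shutil.rmtree(dst)  # assumption: drop wheels from the previous promotion
    dst.mkdir(parents=True)
    promoted = []
    for wheel in sorted(src.glob("*.whl")):
        shutil.copy2(wheel, dst / wheel.name)
        promoted.append(wheel.name)
    return promoted


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "nightly"
    (root / "2026-05-18").mkdir(parents=True)
    (root / "2026-05-18" / "demo-0.0.0-py3-none-any.whl").write_bytes(b"")
    skipped = promote_latest(root, "2026-05-18", tests_passed=False)
    latest_exists_after_failure = (root / "latest").exists()
    promoted = promote_latest(root, "2026-05-18", tests_passed=True)
    print(skipped, latest_exists_after_failure, promoted)
```

The date directory keeps the build either way, which matches the traceability goal stated above.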

PR triggers skip deploy-staging (if != pull_request), so the test
and promote jobs are also skipped — PR runs only validate wheel build.

Internode tests no longer need wheel rsync to node2; both nodes
pip install directly from gh-pages.

Avoid CI workflow competing for runners while validating nightly.yml.
Will re-enable before merging.

When py3.10 and py3.12 matrix jobs share a runner, building both as
rocm/mori:ci causes the second build to overwrite the first. Use
rocm/mori:ci-py3.10 and rocm/mori:ci-py3.12 instead.

Applies to build-wheel, test-wheel, and test-wheel-internode jobs.

The sed pattern was too greedy, matching '-cp312' as part of the
version string. Use '-cp[0-9]*-cp' as delimiter to correctly extract
the version. Also show full dev version instead of just base version.
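The same extraction can be sketched with a Python regular expression instead of sed. The example filename, including its platform tag, is an assumption for illustration; the pattern uses `[0-9]+` where the commit's sed delimiter uses `[0-9]*`.

```python
import re

# The version sits between the distribution name and the first '-cpNNN-cp'
# tag pair; the lazy '.+?' stops at the earliest such delimiter.
_WHEEL_VERSION_RE = re.compile(r"^[^-]+-(?P<version>.+?)-cp[0-9]+-cp")


def wheel_version(filename):
    """Return the version part of a CPython wheel filename, or None."""
    match = _WHEEL_VERSION_RE.match(filename)
    return match.group("version") if match else None


# Hypothetical filename; the platform tag is assumed.
name = "amd_mori-1.1.1.dev20260518+abc1234-cp312-cp312-linux_x86_64.whl"
print(wheel_version(name))
```

Anchoring on the `-cpNNN-cp` pair keeps the Python tag out of the captured version, which is the failure the commit describes.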
pip --find-links needs an HTML page with <a href="*.whl"> links to
discover wheels. Without index.html in date subdirectories, pip falls
back to PyPI and installs the wrong package.

- Generate index.html in each date directory during deploy-staging
- Generate index.html in latest/ directory during promote-latest
- Use --no-index instead of --pre to prevent any PyPI fallback
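A minimal sketch of the kind of page `--find-links` can read: a plain HTML document with one anchor per wheel. `find_links_page` is a hypothetical helper and the markup is deliberately bare; the real index.html also carries tables and version details that are not reproduced here.

```python
from html import escape
from urllib.parse import quote


def find_links_page(title, wheel_names):
    """Build a bare HTML page with one <a href="...whl"> link per wheel."""
    rows = [
        # quote() percent-encodes characters such as '+' in the href;
        # the visible link text keeps the original filename.
        f'<a href="{quote(name)}">{escape(name)}</a><br>'
        for name in sorted(wheel_names)
    ]
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head>\n<body>\n"
        + "\n".join(rows)
        + "\n</body></html>\n"
    )


# Hypothetical wheel filename; the platform tag is assumed.
page = find_links_page(
    "nightly/2026-05-18",
    ["amd_mori-1.1.1.dev20260518+abc1234-cp312-cp312-linux_x86_64.whl"],
)
print(page)
```

Links are relative, so the same page works in any date directory or in latest/ as long as the wheels sit next to it.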
- deploy-staging extracts version from wheel filename and outputs it
- test-wheel and test-wheel-internode use amd_mori==$VERSION to ensure
  pip installs the exact wheel from this build, not an older one
- index.html now shows "Latest tested version" with commit and link
  to latest/ directory, plus an "All builds" section header
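The pinned install used by the test jobs can be sketched as an argument list. `pip_install_args` is a hypothetical helper, and the date-directory URL is an assumption derived from the latest/ URL in the summary and the date-based layout; the exact command in nightly.yml is not shown here.

```python
def pip_install_args(find_links_url, version, package="amd_mori"):
    """Build a pip command that can only install the exact wheel of this build."""
    return [
        "pip", "install",
        "--no-index",                    # never fall back to PyPI
        "--find-links", find_links_url,  # page listing this build's wheels
        f"{package}=={version}",         # exact pin, including the +sha label
    ]


args = pip_install_args(
    "https://rocm.github.io/mori/nightly/2026-05-18/",  # assumed URL shape
    "1.1.1.dev20260518+abc1234",
)
print(" ".join(args))
```

Pinning with `==` means an older wheel in the same directory cannot satisfy the install.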
setuptools_scm generates python/mori/_version.py at build time but
__init__.py never imported it, so mori.__version__ was not accessible.

Also restore version print in nightly install verification step.
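The fix amounts to reading the generated `_version` module and falling back when it is absent. Inside a package's `__init__.py` this is usually written as a relative import such as `from ._version import version as __version__`; the helper form below, with the hypothetical name `read_scm_version` and an assumed fallback string, is used only so the sketch runs standalone.

```python
import importlib


def read_scm_version(package, fallback="0+unknown"):
    """Return `version` from {package}._version, or a fallback if missing."""
    try:
        module = importlib.import_module(f"{package}._version")
    except ImportError:
        # No generated file, e.g. a source checkout that was never built.
        return fallback
    return getattr(module, "version", fallback)


# A package name that does not exist, to show the fallback path.
print(read_scm_version("no_such_package_for_this_sketch"))
```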
…1:00

- Increase gh-pages propagation retry from 5 to 15 (max ~3.5 min wait)
- Add --force-reinstall to user install instructions so nightly always
  overwrites any existing release version
- Latest section on index.html now shows a full table (same format as
  date sections) instead of just a text line
- promote-latest regenerates index.html so the Latest section reflects
  the just-promoted wheels immediately (not delayed by one run)
- Change cron schedule from 00:00 to 01:00 Beijing time (UTC 17:00)
- Push scope in promote-latest changed from nightly/latest/ to nightly/
  to include the regenerated index.html
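The propagation retry can be sketched as a bounded polling loop. The attempt count of 15 comes from the list above; the 15-second delay is an assumption chosen because 14 waits of 15 s add up to 210 s, which matches the "~3.5 min" figure. `wait_until_available` and the injected `probe` callable are hypothetical names.

```python
import time


def wait_until_available(probe, attempts=15, delay_s=15.0, sleep=time.sleep):
    """Call probe() up to `attempts` times.

    Returns the 1-based attempt number that succeeded, or 0 if none did.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # no wait after the final attempt
    return 0


# Demo with a fake probe that succeeds on the third call; sleeps are
# recorded instead of actually waiting.
calls = {"n": 0}


def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3


waits = []
result = wait_until_available(fake_probe, sleep=waits.append)

# Worst case: the page never appears.
worst_waits = []
worst = wait_until_available(lambda: False, sleep=worst_waits.append)
print(result, waits, worst, sum(worst_waits))
```

In a real job the probe would check that the wheel URL on gh-pages responds; injecting it keeps the loop testable without network access.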
- ci.yml: re-enable pull_request trigger (was disabled during nightly
  validation)
- nightly.yml: remove pull_request trigger (only schedule + manual
  dispatch needed in production)
jhchouuu merged commit 4cc7e8d into main on May 18, 2026
18 of 19 checks passed
jhchouuu deleted the jiahzhou/nightly-ci branch on May 18, 2026 14:22