RELOPS-2372: bump NVIDIA A10 GRID driver to 573.96 (vGPU 18.x)#1218

Merged
jwmossmoz merged 2 commits into master from RELOPS-2372 on May 20, 2026

@jwmossmoz
Contributor

Summary

  • Bumps gpu_a10 in data/os/Windows.yaml from GRID 553.62 (vGPU 17.x) to 573.96 (vGPU 18.x) ahead of Azure's 2026-06-15 deadline for NVadsA10_v5 VMs (Service Health tracking ID 0YSB-WGZ). After that date, Azure begins rolling out the vGPU 20.x (R595.x) host driver, which is incompatible with anything older than 18.x.
  • 573.96 is the current Azure-redistributed GRID 18.6 build for Windows 11 25H2. The installer no longer carries Server 2019 in its filename (vGPU 18.x dropped 2019 support); the 25H2 pool does not run Server 2019, so this is a no-op for us.
  • Strengthens the kitchen serverspec for win116425h2azure to assert the downloaded installer is > 100 MB. The previous check only confirmed the file existed, which would still pass if the blob mirror served a placeholder or 0-byte file.

Blockers before merge

  • The 573.96_grid_win10_win11_server2022_server2025_dch_64bit_international_azure_swl.exe installer must be uploaded to the windows.ext_pkg_src Azure blob mirror first. Source: Azure N-series Windows driver setup (direct link).
  • Without the binary in place, puppet's file { $driver_exe: source => ... } will fail during converge.

Test plan

  • Upload 573.96 installer to the Azure blob backing windows.ext_pkg_src
  • CI: kitchen-windows.yml converge + verify on win11-64-25h2
  • Build win11-64-25h2-gpu worker image with the new driver
  • Validate the new image on the alpha pool: confirm nvidia-smi returns version 573.96 after first boot
  • Promote to production before 2026-06-15
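The nvidia-smi check in the test plan could be scripted roughly as follows. This is a minimal sketch, not part of this PR: the helper names (`parse_driver_versions`, `driver_ok?`, `query_driver_versions`) are hypothetical, and it assumes `nvidia-smi` is on the PATH of the booted worker and reports the driver version in the same `573.96` form used above.

```ruby
require 'open3'

# Driver version this PR pins for the A10 pool.
EXPECTED_DRIVER = '573.96'

# Split nvidia-smi's csv,noheader output into one version string per GPU.
def parse_driver_versions(output)
  output.lines.map(&:strip).reject(&:empty?)
end

# True only if at least one GPU is reported and every GPU runs the expected driver.
def driver_ok?(output, expected = EXPECTED_DRIVER)
  versions = parse_driver_versions(output)
  !versions.empty? && versions.all? { |v| v == expected }
end

# Query the local GPU(s). Only meaningful on a host with the driver installed.
def query_driver_versions
  out, status = Open3.capture2('nvidia-smi',
                               '--query-gpu=driver_version',
                               '--format=csv,noheader')
  raise 'nvidia-smi exited with an error' unless status.success?
  out
end

# On a worker: puts driver_ok?(query_driver_versions) ? 'driver OK' : 'driver mismatch'
```

Keeping the parsing separate from the `nvidia-smi` call means the comparison logic can be exercised on a machine without a GPU.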

Out of scope

  • The gpu.* entry (538.15) used by win11-64-24h2-gpu via win116424h2azure. Need to confirm whether that pool runs on A10 hardware. If it does, it needs the same upgrade. Tracked on RELOPS-2372.

jwmossmoz added 2 commits May 18, 2026 09:31
Azure is retiring vGPU 17.x for NVadsA10_v5 VMs on 2026-06-15 ahead of
the vGPU 20.x (R595.x) host driver rollout (Service Health tracking ID
0YSB-WGZ). 553.62 sits in the 17.x branch and must move to 18.x before
the deadline to avoid worker interruptions when Azure ships the host
update.

573.96 is the current Azure-redistributed GRID 18.6 build for Windows
11 25H2. It drops Server 2019 from the installer's OS list (vGPU 18.x
does not support 2019), which is fine for the 25H2 pool.

Also raised the kitchen serverspec check for the downloaded installer
to assert a non-trivial size (>100 MB), so a placeholder or truncated
file on the blob mirror is caught early. Full install verification
still has to happen at worker-image-validation time because the GRID
installer needs a reboot to complete.

Open follow-ups tracked on RELOPS-2372:
- Upload 573.96 .exe to the windows.ext_pkg_src Azure blob before merge
- Confirm whether win11-64-24h2-gpu also runs on A10 hardware; if so,
  the gpu.* entry (538.15) needs the same upgrade

Serverspec's winrm backend raises NotImplementedError on file.size, so
the size check has to go through Get-Item .Length instead.
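
For illustration, the Get-Item workaround could look something like the sketch below. It is not the code from this PR: `size_command`, `plausible_installer?`, and the example path are hypothetical, and treating 100 MB as 100 * 1024 * 1024 bytes is an assumption.

```ruby
# Assumed threshold: 100 MB expressed in bytes (binary megabytes).
MIN_INSTALLER_BYTES = 100 * 1024 * 1024

# Build the PowerShell expression that prints a file's size in bytes.
# Single quotes in the path are doubled, per PowerShell quoting rules.
def size_command(path)
  escaped = path.gsub("'", "''")
  "(Get-Item -LiteralPath '#{escaped}').Length"
end

# Interpret the command's stdout: true only if it parses as an integer
# strictly greater than the threshold. Empty or garbled output is a failure.
def plausible_installer?(stdout, min_bytes = MIN_INSTALLER_BYTES)
  Integer(stdout.strip, 10) > min_bytes
rescue ArgumentError
  false
end

# In a serverspec file this might be wired up as (hypothetical path):
#   describe command(size_command('C:\Windows\Temp\grid_driver.exe')) do
#     its(:stdout) { should satisfy { |out| plausible_installer?(out) } }
#   end
```

Routing the check through a `command` resource avoids `file.size` altogether, and a 0-byte or placeholder file fails the threshold rather than passing a bare existence check.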
@jwmossmoz
Contributor Author

Integration test results — win11-64-25h2-alpha

Workflow run: https://github.com/mozilla-platform-ops/worker-images/actions/runs/26103935100
Taskcluster task group: https://firefox-ci-tc.services.mozilla.com/tasks/groups/RojgUc3pSgOjButsvjjqYg

Result: 52/54 passed, 2 failed

Both failures are pre-existing and unrelated to this change:

| Task | Failure |
| --- | --- |
| gecko-test-windows11-64-25h2-asan/opt-mochitest-browser-media | ASAN access-violation crash, 0 test failures; intermittent ASAN crash unrelated to the GPU driver |
| gecko-test-windows11-64-25h2/debug-xpcshell | testAddTaskSkipAll in xpcshell-selftest; intermittent harness selftest failure, not image-related |

@jwmossmoz jwmossmoz requested review from aerickson and markcor May 20, 2026 14:29
Member

@aerickson left a comment

lgtm

@jwmossmoz jwmossmoz merged commit 757b34d into master May 20, 2026
15 checks passed

2 participants