stop downloading open driver for AL2 4.14 and AL2 5.10 GPU AMIs in China regions #635

Merged: singholt merged 1 commit into aws:main from singholt:main on Feb 26, 2026
@singholt singholt commented Feb 26, 2026

Summary

Back in 2023, we changed our AL2 GPU scripts to download the open driver from NVIDIA's repo, store the static tarball on the AMI, and provide a convenience script for users who wish to use this open driver: #163. Specifically, this helps customers who want to use EFA.

This PR removes that workaround for AL2 GPU AMIs in China regions only. Our ECS AMI builds for release 20260225 are failing when they try to download the open driver from NVIDIA's repos, due to a cross-partition network timeout.

Here's why this is safe to do now:

  • There have been no active ECS users of P4/P5 instance types in China in the last 30 days.
  • The convenience script is not documented in our ECS docs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html.
  • To minimize customer disruption, I am removing the workaround only for China regions' AMI builds. We already do not provide this workaround in air-gapped regions. In about 3-6 months, all ECS AL2 AMIs will reach EOL anyway.
  • I launched a P4 instance with the default driver installation and did not explicitly run the workaround script. nvidia-smi succeeded and detected the GPU, indicating that the problem from 2023 no longer applies.

Note: Without the workaround script, EFA does not work. Customers in China who need EFA would have to create the workaround script and tarball themselves.

Implementation details

  • Updated the workaround condition to exclude China regions, in addition to air-gapped regions.
  • Since this script existed in China until AMI version 20260223, I created a new stub script that echoes a clear error and exits with exit code 1. If a customer was running this script via userdata, they will now see a meaningful error message in their cloud-init logs instead of a vague "no such file" error.
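
For readers who want a picture of the two changes above, here is a minimal sketch. It is not the code from this PR: the function names `open_driver_workaround_enabled` and `write_stub_script`, the `us-iso*` pattern used for air-gapped regions, and the error message wording are all assumptions made for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name is an assumption, not from this PR).
# Returns 0 when the open driver workaround should be set up for the
# given region, and 1 when it should be skipped.
open_driver_workaround_enabled() {
    case "$1" in
        cn-*)    return 1 ;;  # China regions: skip the download
        us-iso*) return 1 ;;  # assumed pattern for air-gapped regions
        *)       return 0 ;;
    esac
}

# Hypothetical helper: writes a stub script to the given path. The stub
# prints an error to stderr and exits with code 1, so a userdata caller
# gets a readable message in cloud-init logs. The message text is
# illustrative only.
write_stub_script() {
    cat > "$1" <<'EOF'
#!/usr/bin/env bash
echo "ERROR: the open driver workaround is not available on this AMI." >&2
exit 1
EOF
    chmod +x "$1"
}
```

A build script could then call `open_driver_workaround_enabled "$REGION"` and fall back to `write_stub_script` with the convenience script's path when it returns non-zero.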

Testing

Getting China AWS account credentials and building an AMI locally was challenging, due to some errors unrelated to this change. Hence, I built an AMI in the us-west-2 region (REGION=us-west-2 make al2kernel5dot10gpu) with a modified version of this PR, such that all code paths are hit for us-west-2 instead of cn. I verified the absence of open driver tarballs and that only the stub script exists.

This change will also unblock the release, and will be tested via that workflow.
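
The manual verification described above (no open driver tarballs, only the stub script present) could be scripted roughly as follows. This is a sketch under assumptions: the function name `verify_stub_only` is hypothetical, the two paths are supplied by the caller because the real on-AMI locations are not stated here, and a "tarball" is taken to be any file matching `*.tar*`.

```shell
#!/usr/bin/env bash

# Hypothetical check (not part of this PR).
#   $1: directory where open driver tarballs would have been stored
#   $2: path of the convenience script, expected to be the stub
# Returns 0 only if no tarball is found and the script exits with code 1.
# Caution: this executes the script, so run it only on an image where
# the stub is expected.
verify_stub_only() {
    driver_dir="$1"
    stub_script="$2"

    # Any regular file matching *.tar* counts as a leftover tarball.
    if [ -n "$(find "$driver_dir" -type f -name '*.tar*' 2>/dev/null)" ]; then
        echo "unexpected open driver tarball found in $driver_dir" >&2
        return 1
    fi

    if [ ! -f "$stub_script" ]; then
        echo "stub script missing: $stub_script" >&2
        return 1
    fi

    # The stub is expected to fail with exit code 1.
    rc=0
    sh "$stub_script" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -ne 1 ]; then
        echo "script exited with $rc, expected 1" >&2
        return 1
    fi
    return 0
}
```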

Description for the changelog

enhancement: stop downloading open driver for AL2 4.14 and AL2 5.10 GPU AMIs in China regions

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@singholt singholt requested a review from a team as a code owner February 26, 2026 00:28
@singholt singholt force-pushed the main branch 3 times, most recently from db3e797 to 6f939dd February 26, 2026 00:43
Comment thread scripts/enable-ecs-agent-gpu-support.sh
@singholt singholt merged commit 2ea61b1 into aws:main Feb 26, 2026
2 checks passed
@singholt singholt mentioned this pull request Feb 26, 2026