Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script #1374

Draft
cjac wants to merge 10 commits into GoogleCloudDataproc:main from LLC-Technologies-Collier:gpu-202601

@cjac cjac commented Jan 23, 2026

GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability

This large update significantly improves the install_gpu_driver.sh script and its accompanying documentation, focusing on robust support for complex environments involving Secure Boot and HTTP/S proxies, and increasing overall reliability and maintainability.

1. gpu/README.md:

  • Comprehensive Documentation for Secure Boot & Proxy:
    • Added a major section: "Building Custom Images with Secure Boot and Proxy Support". This details the end-to-end process using the GoogleCloudDataproc/custom-images repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
    • Added a major section: "Launching a Cluster with the Secure Boot Custom Image". This explains how to use the custom-built images to launch Dataproc clusters with --shielded-secure-boot. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the GoogleCloudDataproc/cloud-dataproc repository for VPC, subnet, and proxy configuration.
    • Includes essential verification steps for checking driver status, module signatures, and system logs on the cluster nodes.
  • Enhanced Proxy Metadata: Clarified and expanded descriptions for proxy-related metadata: http-proxy, https-proxy, proxy-uri, no-proxy, and http-proxy-pem-uri.
  • New Section: "Enhanced Proxy Support": Explicitly outlines the script's capabilities in proxied environments, including custom CA certificate handling, automatic tool configuration (curl, apt, dnf, gpg, Java), and bypass mechanisms.
  • Troubleshooting: Added specific points for debugging network and proxy issues.

2. gpu/install_gpu_driver.sh:

  • Robust Proxy Handling (set_proxy):
    • Completely revamped to handle http-proxy, https-proxy, and proxy-uri metadata, determining the correct proxy values for HTTP and HTTPS.
    • Dynamically sets HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables.
    • Updates /etc/environment with the current proxy settings.
    • Conditionally configures gcloud proxy settings only if the gcloud SDK version is 547.0.0 or greater.
    • Performs TCP and HTTP(S) connection tests to the proxy to validate setup.
    • Configures apt and dnf to use the proxy.
    • Ensures dirmngr or gnupg2-smime is installed and configures dirmngr.conf to use the HTTP proxy.
    • Installs custom proxy CA certificates from http-proxy-pem-uri into system, Java, and Conda trust stores. Switches to HTTPS for proxy communications when a CA cert is provided.
    • Includes comprehensive verification steps for the proxy and certificate setup.
  • Reliable GPG Key Importing (import_gpg_keys):
    • Introduced a new function import_gpg_keys to handle GPG key fetching and importing in a proxy-aware manner using curl over HTTPS, replacing direct gpg --recv-keys calls to keyservers.
    • This function supports fetching keys by URL or Key ID and is used throughout the script for repository setup (NVIDIA Container Toolkit, CUDA, Bigtop, Adoptium, Docker, Google Cloud, CRAN-R, MySQL).
  • Conda/Mamba Environment (install_pytorch):
    • Refined package list: numba, pytorch, tensorflow[and-cuda], rapids, pyspark, and cuda-version<=${CUDA_VERSION}. Explicit CUDA runtime (e.g., cudart_spec) is no longer added, allowing the solver more flexibility.
    • Uses Mamba preferentially, with a Conda fallback.
    • Implements cache/environment clearing logic based on install_gpu_driver-main and pytorch sentinels to allow forced refreshes.
    • Improved error handling for environment creation, with specific messages for Mamba failures in proxied environments.
  • NVIDIA Driver Handling:
    • set_driver_version: Uses curl -I for a more lightweight HEAD request to check URL validity.
    • build_driver_from_github: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.
    • execute_github_driver_build: Refactored to accept tarball paths. popd removed to balance pushd in caller. Removed a debug echo of the sign-file exit code.
    • Added make -j$(nproc) to modules_install for parallelization.
    • Post-install verification loop checks modinfo for signer: to confirm modules are signed.
  • Lifecycle Improvements:
    • prepare_to_install: Moved curl_retry_args definition earlier.
    • install_nvidia_gpu_driver: Checks if nvidia module loads at the start and marks incomplete if not.
    • main: Added mark_complete install_gpu_driver-main at the end.
    • configure_dkms_certs: Always fetches keys from secret manager if PSN is set to ensure modulus_md5sum is available.
    • install_gpu_agent: Checks if METADATA_HTTP_PROXY_PEM_URI is non-empty before using it.
  • Secure Boot Check: Issues a warning instead of exiting when Secure Boot is enabled on Debian-based Dataproc images older than 2.2.
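
The proxy resolution described for set_proxy can be pictured with a minimal sketch. The function name `resolve_proxies` and the precedence order are assumptions made for illustration; the PR description does not spell out how the three metadata values are ranked.

```bash
# Hypothetical helper: choose HTTP and HTTPS proxy values from three metadata inputs.
# Assumed precedence (not confirmed by the PR): the scheme-specific key wins,
# then proxy-uri, and HTTPS finally falls back to the HTTP value.
resolve_proxies() {
  local http_val="$1" https_val="$2" uri_val="$3"
  local http_out="${http_val:-${uri_val}}"
  local https_out="${https_val:-${uri_val:-${http_val}}}"
  # Emit shell-style assignments; no output means no proxy is configured.
  [[ -n "${http_out}"  ]] && echo "HTTP_PROXY=${http_out}"
  [[ -n "${https_out}" ]] && echo "HTTPS_PROXY=${https_out}"
  return 0
}
```

Calling `resolve_proxies "10.0.0.1:3128" "" ""` would yield the same value for both variables under these assumptions, while three empty arguments produce no output at all.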

@cjac cjac self-assigned this Jan 23, 2026
@gemini-code-assist

Summary of Changes

Hello @cjac, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the GPU driver installation script's reliability and adaptability, particularly in complex network environments requiring HTTP/HTTPS proxies and for systems utilizing Secure Boot. The changes focus on making the installation process more robust, configurable, and resilient to common issues like network restrictions and module signing requirements, while also refining the Conda environment setup and updating documentation.

Highlights

  • Enhanced Proxy Configuration: Introduced support for https-proxy and proxy-uri metadata, refined NO_PROXY handling with intelligent defaults, and enabled custom CA certificate installation into system, Java, and Conda trust stores. Proxy connections now default to HTTPS when a custom CA is provided, and the script includes pre-flight tests for proxy connectivity and reachability.
  • Robust GPG Key Import: A new import_gpg_keys function centralizes GPG key fetching from URLs or keyservers, ensuring it respects proxy settings and custom CA certificates, replacing previous direct curl | gpg --import calls.
  • Secure Boot Signing Refinements: The configure_dkms_certs function now consistently fetches keys from Secret Manager if private_secret_name is set, ensuring modulus_md5sum is available. Kernel module signing is integrated into the build process, with checks to verify modules are signed and loadable after installation.
  • Resilient Driver Installation: The install_nvidia_gpu_driver function now includes an initial check for the nvidia module's loadability, triggering a re-installation attempt if it fails. curl commands for downloads now incorporate retry flags and honor proxy settings.
  • Conda Environment Adjustments: The PyTorch Conda environment package list was streamlined by removing TensorFlow. Specific workarounds were added for Debian 10, including using conda instead of mamba and disabling SSL verification.
  • Documentation Updates: The gpu/README.md file has been updated to reflect the new proxy metadata (https-proxy, proxy-uri, no-proxy), detail the enhanced proxy support, and add proxy-related troubleshooting guidance.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly enhances the GPU driver installation script by introducing robust proxy handling, improving Secure Boot integration, and refining Conda environment setup. Key improvements include flexible proxy configuration with support for HTTPS proxies and custom CA certificates, a new import_gpg_keys function for reliable GPG key fetching, and more thorough verification steps for signed kernel modules under Secure Boot. The documentation has also been updated to reflect these new features and provide better troubleshooting guidance. Overall, these changes make the script more resilient and configurable for diverse network environments and security requirements.

Comment on lines 3271 to 3275
if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then
  if [[ -z "${trusted_pem_path:-}" ]]; then
    echo "WARNING: METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined." >&2
  else
    curl_retry_args+=(--cacert "${trusted_pem_path}")
  fi

high

The warning METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined indicates a potential issue. trusted_pem_path is only set within set_proxy if both a proxy (http-proxy/https-proxy) and a PEM URI are provided. If http-proxy-pem-uri is provided but no http-proxy or https-proxy is set, set_proxy returns early, leaving trusted_pem_path undefined. This could lead to GPG key imports failing to use the custom CA, even if the PEM URI is present.
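
One way to picture the guard under discussion is a small helper that appends `--cacert` only when a CA bundle is actually available, independent of whether a proxy was configured. This is a sketch, not the script's code: the helper name `add_cacert_arg` and the initial contents of `curl_retry_args` are assumptions.

```bash
# Hypothetical helper: append --cacert only when a CA bundle path is known and readable.
# Illustrates the guard discussed in the review; it is not the script's actual code.
add_cacert_arg() {
  local pem_path="${1:-}"
  if [[ -n "${pem_path}" && -r "${pem_path}" ]]; then
    curl_retry_args+=(--cacert "${pem_path}")
    return 0
  fi
  echo "WARNING: no readable CA bundle at '${pem_path}'; using the system trust store." >&2
  return 1
}

# Assumed baseline flags; the real array is defined in prepare_to_install.
curl_retry_args=(--fail --silent --show-error --location --retry 3)
```

A caller could then run `add_cacert_arg "${trusted_pem_path:-}" || true` before any download, so a missing path degrades to a warning rather than an unbound-variable failure.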

Comment on lines 2755 to 2768
pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy"
cat > "${pkg_proxy_conf_file}" <<EOF
Acquire::http::Proxy "http://${METADATA_HTTP_PROXY}";
Acquire::https::Proxy "http://${METADATA_HTTP_PROXY}";
EOF
echo "Acquire::http::Proxy \"http://${effective_proxy}\";" > "${pkg_proxy_conf_file}"
echo "Acquire::https::Proxy \"http://${effective_proxy}\";" >> "${pkg_proxy_conf_file}"
echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"
elif is_rocky ; then
pkg_proxy_conf_file="/etc/dnf/dnf.conf"

touch "${pkg_proxy_conf_file}"

if grep -q "^proxy=" "${pkg_proxy_conf_file}"; then
sed -i.bak "s@^proxy=.*@proxy=${HTTP_PROXY}@" "${pkg_proxy_conf_file}"
elif grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
sed -i.bak "/^\[main\]/a proxy=${HTTP_PROXY}" "${pkg_proxy_conf_file}"
sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}"
if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
sed -i.bak "/^\\\[main\\\\]/a proxy=http://${effective_proxy}" "${pkg_proxy_conf_file}"
else
local TMP_FILE=$(mktemp)
printf "[main]\nproxy=%s\n" "${HTTP_PROXY}" > "${TMP_FILE}"

cat "${TMP_FILE}" "${pkg_proxy_conf_file}" > "${pkg_proxy_conf_file}".new
mv "${pkg_proxy_conf_file}".new "${pkg_proxy_conf_file}"
echo -e "[main]\nproxy=http://${effective_proxy}" >> "${pkg_proxy_conf_file}"
fi
echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}"
fi

high

The apt and dnf proxy configurations (Acquire::http::Proxy "http://${effective_proxy}"; and proxy=http://${effective_proxy}) use an http:// prefix. If effective_proxy is derived solely from https_proxy_val (meaning only an HTTPS proxy was specified), this could lead to apt/dnf attempting to connect to an HTTPS proxy using an HTTP scheme. While a later sed command attempts to correct this if http-proxy-pem-uri is set, it might be incorrect if http-proxy-pem-uri is not provided.

Suggested change
echo "Acquire::http::Proxy \"${HTTP_PROXY}\";" > "${pkg_proxy_conf_file}"
echo "Acquire::https::Proxy \"${HTTPS_PROXY}\";" >> "${pkg_proxy_conf_file}"
echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"

Contributor Author

This was resolved by using a variable for the proxy scheme.
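
A minimal sketch of what a variable scheme could look like for the apt configuration follows. Both function names are hypothetical, and the rule for choosing the scheme is taken from the PR description (HTTPS to the proxy only when a custom CA certificate is supplied) rather than from the actual diff.

```bash
# Hypothetical helper: render apt proxy settings with the scheme as a parameter
# instead of a hard-coded "http://" prefix.
render_apt_proxy_conf() {
  local scheme="$1" host_port="$2"
  printf 'Acquire::http::Proxy "%s://%s";\n'  "${scheme}" "${host_port}"
  printf 'Acquire::https::Proxy "%s://%s";\n' "${scheme}" "${host_port}"
}

# Assumed rule, based on the PR description: talk HTTPS to the proxy only
# when a custom CA certificate URI was provided.
proxy_scheme_for() {
  local pem_uri="${1:-}"
  if [[ -n "${pem_uri}" ]]; then echo "https"; else echo "http"; fi
}
```

The output of `render_apt_proxy_conf "$(proxy_scheme_for "${pem_uri}")" "${effective_proxy}"` could be redirected into the apt configuration file, which keeps the scheme decision in one place.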

cjac commented Jan 24, 2026

/gcbrun

This commit significantly enhances the robustness and configurability of the GPU driver installation script, particularly for environments with HTTP/HTTPS proxies and those using Secure Boot.

**Key Changes:**

*   **Enhanced Proxy Configuration (`set_proxy`):**
    *   Added support for `https-proxy` and `proxy-uri` metadata, providing more flexibility in proxy setups.
    *   Improved `NO_PROXY` handling with sensible defaults (including Google APIs) and user-configurable additions.
    *   Integrated support for custom proxy CA certificates via `http-proxy-pem-uri`, including installation into system, Java, and Conda trust stores.
    *   Connections to the proxy now use HTTPS when a custom CA is provided.
    *   Added proxy connection and reachability tests to fail fast on misconfiguration.
    *   Ensures `curl`, `apt`, `dnf`, `gpg`, and Java all respect the proxy settings.

*   **Robust GPG Key Import (`import_gpg_keys`):**
    *   Introduced a new function to reliably import GPG keys from URLs or keyservers, fully respecting the configured proxy and custom CA settings.
    *   This replaces direct `curl | gpg --import` calls, making key fetching more resilient in restricted network environments.

*   **Secure Boot Signing Refinements:**
    *   The `configure_dkms_certs` function now always fetches keys from Secret Manager if `private_secret_name` is set, ensuring `modulus_md5sum` is available for GCS cache paths.
    *   Kernel module signing is now more clearly integrated into the build process.
    *   Improved checks to ensure modules are actually signed and loadable after installation when Secure Boot is active.

*   **Resilient Driver Installation:**
    *   The script now checks if the `nvidia` module can be loaded at the beginning of `install_nvidia_gpu_driver` and will re-attempt installation if it fails.
    *   `curl` calls for downloading drivers and other artifacts now use retry flags and honor proxy settings.

*   **Conda Environment for PyTorch:**
    *   Adjusted package list for Conda environment, removing TensorFlow to streamline.
    *   Added specific workarounds for Debian 10, using `conda` instead of `mamba`.

*   **Documentation Updates (`gpu/README.md`):**
    *   Added details on the new proxy metadata: `https-proxy`, `proxy-uri`, `no-proxy`.
    *   Created a new section "Enhanced Proxy Support" explaining the features.
    *   Updated `http-proxy-pem-uri` description.
    *   Added proxy considerations to the "Troubleshooting" section.

These changes aim to make the GPU initialization action more reliable across a wider range of network environments and improve the Secure Boot workflow.
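
The `NO_PROXY` handling mentioned in this commit message can be illustrated with a short sketch that merges a default list with user-supplied entries. The helper name `build_no_proxy` and the specific default entries are assumptions; the commit only states that the defaults include Google APIs.

```bash
# Hypothetical helper: merge default NO_PROXY entries with user-supplied ones.
# The default list below is an assumption chosen for illustration.
build_no_proxy() {
  local user_list="${1:-}"
  local defaults="localhost,127.0.0.1,metadata.google.internal,169.254.169.254,.googleapis.com"
  # Split on commas, drop empty entries and duplicates, keep first-seen order.
  printf '%s' "${defaults},${user_list}" \
    | tr ',' '\n' \
    | awk 'NF && !seen[$0]++' \
    | paste -sd, -
}
```

Deduplicating while preserving order keeps the resulting value stable across reruns, which matters if it is also written to a file such as /etc/environment.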

cjac commented Jan 24, 2026

/gcbrun

This commit refines the `import_gpg_keys` function and a few other areas:

- Moves the definition of `curl_retry_args` earlier in `prepare_to_install` to ensure it's available for `set_proxy`.
- Removes local `tmpdir` and `curl_retry_args` definitions within `import_gpg_keys`, using the global settings instead.
- Eliminates manual proxy and certificate configuration for `curl` within `import_gpg_keys`. The function now relies on the environment variables (`HTTP_PROXY`, `HTTPS_PROXY`, `REQUESTS_CA_BUNDLE`) properly set by the `set_proxy` function.
- Restores `tensorflow[and-cuda]` to the default Conda package list in `install_pytorch`.

These changes simplify the script, reduce redundancy, and ensure consistent use of proxy settings and curl arguments.
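
To make the environment-driven approach concrete, here is a sketch of fetching a key by URL or key ID over HTTPS. The function names and the default keyserver host are assumptions, and this is not the PR's `import_gpg_keys` implementation. It relies on curl reading `HTTPS_PROXY` from the environment, so no per-call proxy flags appear.

```bash
# Hypothetical helper: turn a key reference (URL or key ID) into an HTTPS URL.
# The default keyserver host is an assumption chosen for illustration.
gpg_key_fetch_url() {
  local ref="$1"
  local keyserver="${2:-keyserver.ubuntu.com}"
  if [[ "${ref}" =~ ^https?:// ]]; then
    echo "${ref}"
  else
    # Strip an optional 0x prefix, then use the HKP lookup endpoint over HTTPS.
    echo "https://${keyserver}/pks/lookup?op=get&search=0x${ref#0x}"
  fi
}

# Sketch of the fetch itself: curl picks up HTTPS_PROXY from the environment,
# and gpg converts the ASCII-armored key into a binary keyring file.
fetch_gpg_key() {
  local ref="$1" dest="$2"
  curl --fail --silent --show-error --location "$(gpg_key_fetch_url "${ref}")" \
    | gpg --dearmor --yes --output "${dest}"
}
```

Routing key-ID lookups through an HTTPS URL is what lets the same code path work behind an HTTP(S) proxy, where the plain keyserver protocol is often blocked.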

cjac commented Jan 27, 2026

I've made substantial changes. Please perform another review, discarding previous review comments. Let's start from the top once again.

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly refactors the GPU driver installation script, enhancing proxy handling, improving Secure Boot integration, and refining Conda environment setup. The documentation has been updated to reflect these changes, providing clearer guidance on proxy configuration and custom image building with Secure Boot. The introduction of the import_gpg_keys function centralizes GPG key management, making the script more robust in various network environments. Overall, these changes contribute to a more reliable and configurable GPU initialization process.

building_file=""
rm "${local_tarball}"
make clean
popd

high

The execute_github_driver_build function includes a popd command. However, this function is called from build_driver_from_github, which also has a popd at its end (line 1331). This results in an extra popd call, which could lead to a directory stack underflow or unexpected behavior if the script's current directory is not what's expected. Ensure that pushd and popd calls are balanced.
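
The balance the review asks for can be shown with two stand-in functions in which only the caller touches the directory stack. The names below are hypothetical simplifications of `build_driver_from_github` and `execute_github_driver_build`, not the script's code.

```bash
# Hypothetical stand-ins for the two functions named in the review.
# The callee changes no directories; the caller owns the pushd/popd pair.
execute_build_step() {
  # Runs inside whatever directory the caller has already entered.
  pwd > /dev/null
}

build_from_source() {
  local build_dir="$1"
  pushd "${build_dir}" > /dev/null || return 1
  execute_build_step
  local rc=$?
  # Exactly one popd for the one pushd above, regardless of the step's result.
  popd > /dev/null || return 1
  return "${rc}"
}
```

Keeping both halves of the pair in the same function makes the invariant easy to check: the working directory and the size of `DIRSTACK` are the same before and after the call.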

@cjac cjac force-pushed the gpu-202601 branch 3 times, most recently from fef918e to 25df48a Compare January 28, 2026 00:54
…d Reliability

This update significantly improves the `install_gpu_driver.sh` script and its accompanying documentation, focusing on robust support for complex environments involving Secure Boot and HTTP/S proxies, and increasing overall reliability and maintainability.

**1. `gpu/README.md`:**

*   **Comprehensive Documentation for Secure Boot & Proxy:**
    *   Added a major section: "Building Custom Images with Secure Boot and Proxy Support". This details the end-to-end process using the `GoogleCloudDataproc/custom-images` repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
    *   Added a major section: "Launching a Cluster with the Secure Boot Custom Image". This explains how to use the custom-built images to launch Dataproc clusters with `--shielded-secure-boot`. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the `GoogleCloudDataproc/cloud-dataproc` repository for VPC, subnet, and proxy configuration.
    *   Includes essential verification steps for checking driver status, module signatures, and system logs on the cluster nodes.

**2. `gpu/install_gpu_driver.sh`:**

*   **Conda/Mamba Environment (`install_pytorch`):**
    *   The package list for the Conda environment now omits the explicit CUDA runtime specification, allowing the solver more flexibility based on other dependencies and the `cuda-version` constraint.
    *   Mamba is now used preferentially for faster environment creation, with a fallback to Conda.
    *   Implements a cache/environment clearing logic: If the main driver installation is marked complete (`install_gpu_driver-main` sentinel exists) but the PyTorch environment setup is not (`pytorch` sentinel missing), it purges the GCS cache and local Conda environment to ensure a clean rebuild.
    *   Enhanced error handling for Conda/Mamba environment creation.
*   **NVIDIA Driver Handling:**
    *   `set_driver_version`: Uses `curl -I` for lightweight URL HEAD requests.
    *   `build_driver_from_github`: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.
    *   `execute_github_driver_build`: Refactored to accept tarball paths as arguments. `popd` removed to balance `pushd` in caller. Removed a debug echo of the `sign-file` exit code.
    *   Added `make -j$(nproc)` to `modules_install` for parallelization.
    *   Post-install verification loop checks `modinfo` for `signer:` to confirm modules are signed.
*   **Secure Boot Check:** Changed the script to issue a warning rather than exit if Secure Boot is enabled on Debian-based Dataproc images at version 2.1 or earlier.
*   **Completion Sentinel:** Added `mark_complete install_gpu_driver-main` at the end of the `main` function.
*   **Proxy Configuration (`set_proxy`):** Conditionally adds `gcloud config set proxy/...` commands if the gcloud SDK version is >= 547.0.0. Corrected `sed` command for DNF proxy configuration.
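
The version gate for the gcloud proxy settings could be sketched as below. The helper names are hypothetical, the SDK version is passed in as an argument because the PR does not show how the script reads it, and the function only prints the `gcloud config set proxy/...` commands instead of running them.

```bash
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  [[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" == "$2" ]]
}

# Sketch of the gate: print the proxy commands only for SDK >= 547.0.0.
maybe_configure_gcloud_proxy() {
  local sdk_version="$1" proxy_host="$2" proxy_port="$3"
  if version_ge "${sdk_version}" "547.0.0"; then
    echo "gcloud config set proxy/type http"
    echo "gcloud config set proxy/address ${proxy_host}"
    echo "gcloud config set proxy/port ${proxy_port}"
  fi
}
```

Version sort is used instead of a plain string comparison so that, for example, 1000.0.0 is treated as newer than 547.0.0.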

cjac commented Jan 28, 2026

I've made substantial changes in response to your previous review. Please perform another review.

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant enhancements to the GPU installation script, primarily focusing on robust proxy handling and support for Secure Boot. The documentation in gpu/README.md has been extensively updated with detailed instructions for these complex scenarios, which is a great improvement. The script gpu/install_gpu_driver.sh has been refactored to centralize GPG key imports, improve caching logic for driver and source builds, and add comprehensive proxy configuration, including custom CA certificate handling.

My review focuses on ensuring the new logic is robust and consistent. I've identified a minor documentation inconsistency and a couple of potential issues in the script related to command execution order and error handling. Overall, this is a very strong refactoring that greatly improves the reliability and maintainability of the script.

This commit incorporates several enhancements and fixes into the `install_gpu_driver.sh` script, particularly improving the Conda environment setup, caching mechanisms, robustness in different environments, and documentation for Secure Boot and Proxy scenarios.

**Script Changes (`install_gpu_driver.sh`):**

1.  **Conda/Mamba Environment (`install_pytorch`):**
    *   Integrates Mamba for faster package installation, with fallback to Conda.
    *   The `conda_pkg_list` uses a flexible specification including `numba`, `pytorch`, `tensorflow[and-cuda]`, `rapids`, `pyspark`, and `cuda-version<=${CUDA_VERSION}` to guide the solver, balancing `rapids` compatibility with the desired CUDA version.
    *   Implements logic to force-clean and rebuild the Conda environment if the main driver installation has run, ensuring a fresh state by removing GCS cache and local environment when the `pytorch` sentinel is missing but `install_gpu_driver-main` is present.
    *   Enhances error handling during environment creation, providing specific guidance for Mamba multi-download failures in proxied environments.
    *   Cleans up the local Conda pack tarball after uploading to GCS.

2.  **Source Tarball Caching (`build_driver_from_github`):**
    *   Adds GCS caching for the NVIDIA open kernel module source code downloaded from GitHub, reducing external fetches. Checks local, GCS, then GitHub.

3.  **Driver Download (`set_driver_version`):**
    *   Improves NVIDIA driver URL validation using a more robust `curl` and `grep` check for HTTP 200 status.
    *   Simplifies `curl` download commands, relying on globally set options.

4.  **Bug Fixes & Refinements:**
    *   Fixes module signature check in `build_driver_from_github` by correctly extracting the module name before calling `modinfo`.
    *   Consolidates `curl` retry arguments (`curl_retry_args`) and removes redundant proxy settings in functions, relying on `set_proxy`.
    *   Adds a main completion sentinel (`install_gpu_driver-main`) at the end of the `main` function.
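
The sentinel-based rebuild rule from item 1 can be sketched as follows. Only the sentinel names `install_gpu_driver-main` and `pytorch` and the function name `mark_complete` come from the commit message; the sentinel directory, the helper bodies, and `should_purge_conda_env` are assumptions.

```bash
# Hypothetical sentinel helpers; the directory and function bodies are
# assumptions, only the sentinel names come from the description above.
SENTINEL_DIR="${SENTINEL_DIR:-/tmp/gpu-install-sentinels}"

mark_complete() { mkdir -p "${SENTINEL_DIR}" && touch "${SENTINEL_DIR}/$1"; }
is_complete()   { [[ -f "${SENTINEL_DIR}/$1" ]]; }

# Purge the cached Conda environment only when the main install finished
# on an earlier run but the PyTorch environment setup did not.
should_purge_conda_env() {
  is_complete "install_gpu_driver-main" && ! is_complete "pytorch"
}
```

Under this reading, a first run never purges anything, and a rerun purges only if the previous run completed the driver installation without finishing the PyTorch step.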

**Documentation Changes (`gpu/README.md`):**

*   **Secure Boot & Proxy:** Added a detailed section on "Building Custom Images with Secure Boot and Proxy Support".
    *   Provides a step-by-step guide on using the `GoogleCloudDataproc/custom-images` repository to build custom images with signed NVIDIA drivers, including configuration for proxy environments.
    *   Includes instructions for setting up the build environment (`env.json`, keys in Secret Manager).
    *   Details how to launch a cluster using the custom image, including setup for private networks with Secure Web Proxy using scripts from the `GoogleCloudDataproc/cloud-dataproc` repo.
    *   Adds verification steps for the driver and signature on the created cluster.
*   Clarified that the init action applies to master (-m) and worker (-w) nodes.

These changes collectively improve the robustness and reliability of the GPU initialization script and provide crucial documentation for complex setups involving Secure Boot and network proxies. Successfully tested on Dataproc 2.2-debian12.
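
The module signature check mentioned in the bug-fix notes can be approximated with a filter over `modinfo` output. This is a sketch under assumptions: the helper names are hypothetical, and the script's own loop may select modules differently.

```bash
# Hypothetical helper: report whether modinfo output (read from stdin)
# carries a non-empty "signer:" field.
has_signer() {
  grep -Eq '^signer:[[:space:]]+[^[:space:]]'
}

# Sketch of a verification loop: print every module file that lacks a signer.
unsigned_modules() {
  local ko
  for ko in "$@"; do
    modinfo "${ko}" 2>/dev/null | has_signer || echo "${ko}"
  done
}
```

An empty result from `unsigned_modules` over the installed NVIDIA module files would suggest that every module carries a signature; the location of those files depends on the kernel and distribution.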

cjac commented Jan 28, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is a significant and valuable pull request that greatly enhances the GPU installation script's reliability, especially in complex environments with proxies and Secure Boot. The refactoring for proxy handling is comprehensive, and the new GPG key import mechanism is a major improvement for robustness. The documentation updates in the README are thorough and will be very helpful for users. The changes are well-executed, but I've identified one high-severity issue related to directory stack management (pushd/popd) in the new execute_github_driver_build function that should be addressed to prevent potential issues with script execution flow.

cjac commented Jan 28, 2026

/gcbrun

- Add GCS caching for CUDA runfiles in `set_cuda_runfile_url`. The script now checks gs://[PKG_BUCKET]/nvidia/ for the runfile before downloading from NVIDIA, and uploads after a fresh download.
- Define `tmpdir` in the deferred GPU config script (`/usr/local/sbin/apply-dataproc-gpu-config.sh`) to prevent unbound variable errors.
- Minor adjustments to set_proxy to ensure METADATA_HTTP_PROXY_PEM_URI is read.
- Make disk zeroing errors non-fatal during custom image builds.
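
The check-cache, download, then upload flow described for the CUDA runfile can be sketched as a cache-through fetch. The function names are hypothetical and the three step functions are deliberately replaceable so the flow can be exercised without network access; the defaults use `gcloud storage cp` and `curl`, which may differ from what the script actually calls.

```bash
# Hypothetical cache-through download. The step functions below are defaults
# that a caller may redefine; the bucket layout is an assumption.
cache_get()  { gcloud storage cp "$1" "$2"; }
cache_put()  { gcloud storage cp "$1" "$2"; }
origin_get() { curl --fail --silent --show-error --location --output "$2" "$1"; }

fetch_with_cache() {
  local origin_url="$1" cache_uri="$2" dest="$3"
  # 1. Reuse a local copy if one is already present.
  [[ -s "${dest}" ]] && return 0
  # 2. Try the GCS cache.
  if cache_get "${cache_uri}" "${dest}" 2>/dev/null; then return 0; fi
  # 3. Fall back to the origin, then populate the cache for later runs.
  origin_get "${origin_url}" "${dest}" || return 1
  cache_put "${dest}" "${cache_uri}" || echo "WARNING: cache upload failed" >&2
}
```

Treating a failed upload as a warning keeps the installation going when the node can read from the origin but lacks write access to the bucket.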

cjac commented Jan 29, 2026

/gcbrun

cjac commented Jan 29, 2026

/gcbrun

cjac commented Jan 29, 2026

/gcbrun

cjac commented Jan 29, 2026

/gcbrun

cjac commented Jan 29, 2026

/gcbrun

cjac commented Jan 29, 2026

/gcbrun

cjac commented Jan 29, 2026

I had to disable Rocky 8 tests entirely. None of them would work. I am not committed to Rocky on 2.0 and 2.1 for this release. Further work would require a customer issue filed to support the platform.

note to self: this commit is 739fb91

cjac commented Feb 3, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is a substantial and valuable pull request that significantly refactors the GPU installation script to improve proxy handling, Secure Boot support, and overall reliability. The extensive documentation updates in the README are particularly helpful. My review has identified a critical bug in a curl command that could cause downloads to fail, a potential robustness issue in the exit handler, a minor documentation inaccuracy, and a concern about the reduced build timeout in the Cloud Build configuration. Addressing these points will further strengthen this excellent set of improvements.

This commit introduces Cloud Build testing for Dataproc 2.3 image versions (Debian 12, Rocky 9, Ubuntu 22) to ensure compatibility with the initialization actions.

The GPU README.md has been updated to more accurately reflect tested and supported OS versions for various CUDA releases, including Secure Boot compatibility notes.

Minor fixes to curl argument handling within the `install_gpu_driver.sh` script are also included to ensure robust downloads.

Specific changes:
- cloudbuild/cloudbuild.yaml: Added parallel test jobs for 2.3 images. Restored longer timeout.
- gpu/README.md: Updated OS/CUDA compatibility table.
- gpu/install_gpu_driver.sh: Adjusted curl command invocations in set_cuda_runfile_url.

cjac commented Feb 3, 2026

/gcbrun

12.4 | 12.4.1 | 550.135 | 9.1.0.70 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
12.6 | 12.6.3 | 550.142 | 9.6.0.74 | 2.23.4 | 2.1 (Ubuntu 20.04, Rocky 8); Dataproc 2.2+
11.8 | 11.8.0 | 525.147.05 | 9.5.1.17 | 2.21.5 | 2.0, 2.1 (Debian 10/11, Ubuntu); 2.2 (Debian 11, Ubuntu 22.04)
12.0 | 12.0.1 | 525.147.05 | 8.8.1.3 | 2.16.5 | 2.0, 2.1 (Debian 10/11, Ubuntu); 2.2 (Debian 11/12, Rocky 9, Ubuntu 22.04)

Contributor

Maybe Debian 11 needs to be removed from here. There has never been a 2.2-debian11 image in GA releases; it existed only in preview, before Debian 12 was available.

Contributor Author

Ah, my apologies. I'm sure it was a typo. I think that should read Debian 12; thank you for catching that!
