-
Notifications
You must be signed in to change notification settings - Fork 5
job-exporter docker image clean #140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
… and simplify the build steps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the Docker build process for the job-exporter component by introducing multi-stage builds, streamlining dependency installation, and updating tools. The changes aim to reduce both build time and final image size.
Changes:
- Introduced multi-stage Docker build with separate builder and runtime stages to minimize final image size
- Replaced custom-built AMD RDC and DCGM installations with pre-packaged versions from official repositories
- Updated nerdctl from version 2.1.3 to 2.2.1
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| job-exporter.common.dockerfile | Complete restructure with multi-stage build, optimized layer caching, and streamlined dependency installation |
| moneo-gpu-exporter_entrypoint.sh | Removed DCGM update script call since DCGM 4 is now directly installed |
| update-dcgm.py | Removed as DCGM version management is now handled during Docker build |
| nvidia.sh | Removed as DCGM installation is now handled directly in Dockerfile |
| common.sh | Removed as common dependencies are now installed directly in Dockerfile |
| amd.sh | Removed as RDC installation is now handled via package manager instead of building from source |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| mkdir -p /tmp/nerdctl; \ | ||
| tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \ | ||
| mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \ | ||
| rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nerdctl binary is downloaded from GitHub without checksum verification. This could pose a security risk if the download is compromised or if the download fails partially. Consider adding checksum verification using sha256sum or downloading and verifying the checksum file provided by the nerdctl releases. For example, the nerdctl releases include SHA256SUMS files that can be used to verify the integrity of the downloaded archive.
| mkdir -p /tmp/nerdctl; \ | |
| tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \ | |
| mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \ | |
| rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz | |
| wget -O /tmp/nerdctl-SHA256SUMS \ | |
| https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/SHA256SUMS; \ | |
| grep " nerdctl-${NERDCTL_VERSION}-linux-${TARGETARCH}.tar.gz$" /tmp/nerdctl-SHA256SUMS | sha256sum -c -; \ | |
| mkdir -p /tmp/nerdctl; \ | |
| tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \ | |
| mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \ | |
| rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz /tmp/nerdctl-SHA256SUMS |
| datacenter-gpu-manager-4-core=${DCGM_TARGET_VERSION} \ | ||
| datacenter-gpu-manager-4-proprietary-cuda12=${DCGM_TARGET_VERSION}; \ | ||
| apt-get clean; \ | ||
| rm -rf /var/lib/apt/lists/* |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DCGM Python bindings path in nvidia_exporter.py is incompatible with DCGM 4. The new Dockerfile installs datacenter-gpu-manager-4 which places Python bindings at /usr/share/datacenter-gpu-manager-4/bindings/python3, but nvidia_exporter.py still references /usr/local/dcgm/bindings/python3. The update-dcgm.py script that previously handled this path migration has been removed, which will cause the NVIDIA exporter to fail with ImportError when trying to import dcgm_fields and DcgmReader. The sys.path.append line needs to be updated to match the new DCGM 4 installation path.
| rm -rf /var/lib/apt/lists/* | |
| rm -rf /var/lib/apt/lists/* | |
| ENV PYTHONPATH=/usr/share/datacenter-gpu-manager-4/bindings/python3:${PYTHONPATH} |
| COPY requirements.txt /job_exporter/requirements.txt | ||
|
|
||
| RUN apt update && apt upgrade -y && apt-get clean && rm -rf /var/lib/apt/lists/* | ||
| RUN python3 -m pip install --no-cache-dir -U pip && \ |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is trailing whitespace at the end of this line after the backslash continuation character. This should be removed for consistency and to avoid potential issues with shell script parsing.
| RUN python3 -m pip install --no-cache-dir -U pip && \ | |
| RUN python3 -m pip install --no-cache-dir -U pip && \ |
| -r /w/requirements.txt \ | ||
| prometheus_client psutil filelock | ||
|
|
||
|
|
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package prometheus_client is being built twice as a wheel - once from requirements.txt (which pins it to version 0.20.0) on line 42, and again without version specification on line 43. This may cause version conflicts or unnecessary duplication. Similarly, the same packages (prometheus_client, psutil, filelock) are being installed again in lines 125-127 after already being installed from requirements.txt on line 124. Consider consolidating these package specifications into requirements.txt to ensure consistent versioning and avoid redundant installations.
| -r /w/requirements.txt \ | |
| prometheus_client psutil filelock | |
| -r /w/requirements.txt |
| RUN python3 -m pip install --no-cache-dir -U pip && \ | ||
| python3 -m pip install --no-cache-dir \ | ||
| --no-index --find-links=/wheels \ | ||
| -r /job_exporter/requirements.txt && \ | ||
| python3 -m pip install --no-cache-dir \ | ||
| --no-index --find-links=/wheels \ | ||
| prometheus_client psutil filelock && \ |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The packages prometheus_client, psutil, and filelock are being installed redundantly. They were already installed from requirements.txt on line 124 (prometheus_client is pinned to 0.20.0 in requirements.txt). Installing them again without version specifications may cause version conflicts or simply waste build time. Consider removing these redundant installations or ensuring all package versions are consistently managed through requirements.txt.
| RUN python3 -m pip install --no-cache-dir -U pip && \ | |
| python3 -m pip install --no-cache-dir \ | |
| --no-index --find-links=/wheels \ | |
| -r /job_exporter/requirements.txt && \ | |
| python3 -m pip install --no-cache-dir \ | |
| --no-index --find-links=/wheels \ | |
| prometheus_client psutil filelock && \ | |
| RUN python3 -m pip install --no-cache-dir -U pip && \ | |
| python3 -m pip install --no-cache-dir \ | |
| --no-index --find-links=/wheels \ | |
| -r /job_exporter/requirements.txt && \ |
| curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -; \ | ||
| echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \ | ||
| > /etc/apt/sources.list.d/rocm.list; \ | ||
| echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \ |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command apt-key add is deprecated and will be removed in a future Ubuntu release. Consider using the recommended approach of placing the GPG key in /etc/apt/trusted.gpg.d/ or /usr/share/keyrings/ and referencing it with the signed-by option in the sources.list entry. For example: curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor -o /usr/share/keyrings/rocm-archive-keyring.gpg and then use [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] in the deb line.
| curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -; \ | |
| echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \ | |
| > /etc/apt/sources.list.d/rocm.list; \ | |
| echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \ | |
| curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor -o /usr/share/keyrings/rocm-archive-keyring.gpg; \ | |
| echo "deb [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \ | |
| > /etc/apt/sources.list.d/rocm.list; \ | |
| echo "deb [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \ |
| RUN set -eux; \ | ||
| if [ "$TARGETARCH" = "amd64" ]; then \ | ||
| printf "Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600" \ | ||
| > /etc/apt/preferences.d/rocm-pin-600; \ | ||
| curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -; \ | ||
| echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \ | ||
| > /etc/apt/sources.list.d/rocm.list; \ | ||
| echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \ | ||
| > /etc/apt/sources.list.d/amdgpu.list; \ | ||
| apt-get update; \ | ||
| DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends rdc; \ | ||
| rm -rf /var/lib/apt/lists/*; \ | ||
| fi |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AMD RDC installation has been significantly simplified from building from source (with a specific cherry-picked commit 660c5afaf49630781c1059ba6d30bae21743c32f from amd-staging branch) to installing the prebuilt rdc package. This changes the RDC version and removes the custom patches that were previously applied. Verify that the packaged RDC version includes the necessary functionality from the cherry-picked commit, or that the commit is no longer needed for the target ROCm version 6.2.2. This change could impact AMD GPU monitoring functionality if the custom patches were critical.
This PR is used to clean the docker build process for job-exporter to reduce the build time as well as the docker image size, as well as the tool update.