Skip to content

Conversation

@hippogr
Copy link
Contributor

@hippogr hippogr commented Jan 20, 2026

This PR is used to clean the docker build process for job-exporter to reduce the build time as well as the docker image size, as well as the tool update.

Copilot AI review requested due to automatic review settings January 20, 2026 08:34
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the Docker build process for the job-exporter component by introducing multi-stage builds, streamlining dependency installation, and updating tools. The changes aim to reduce both build time and final image size.

Changes:

  • Introduced multi-stage Docker build with separate builder and runtime stages to minimize final image size
  • Replaced custom-built AMD RDC and DCGM installations with pre-packaged versions from official repositories
  • Updated nerdctl from version 2.1.3 to 2.2.1

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
job-exporter.common.dockerfile Complete restructure with multi-stage build, optimized layer caching, and streamlined dependency installation
moneo-gpu-exporter_entrypoint.sh Removed DCGM update script call since DCGM 4 is now directly installed
update-dcgm.py Removed as DCGM version management is now handled during Docker build
nvidia.sh Removed as DCGM installation is now handled directly in Dockerfile
common.sh Removed as common dependencies are now installed directly in Dockerfile
amd.sh Removed as RDC installation is now handled via package manager instead of building from source

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +109 to +112
mkdir -p /tmp/nerdctl; \
tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \
mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \
rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nerdctl binary is downloaded from GitHub without checksum verification. This could pose a security risk if the download is compromised or if the download fails partially. Consider adding checksum verification using sha256sum or downloading and verifying the checksum file provided by the nerdctl releases. For example, the nerdctl releases include SHA256SUMS files that can be used to verify the integrity of the downloaded archive.

Suggested change
mkdir -p /tmp/nerdctl; \
tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \
mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \
rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz
wget -O /tmp/nerdctl-SHA256SUMS \
https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/SHA256SUMS; \
grep " nerdctl-${NERDCTL_VERSION}-linux-${TARGETARCH}.tar.gz$" /tmp/nerdctl-SHA256SUMS | sha256sum -c -; \
mkdir -p /tmp/nerdctl; \
tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \
mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \
rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz /tmp/nerdctl-SHA256SUMS

Copilot uses AI. Check for mistakes.
datacenter-gpu-manager-4-core=${DCGM_TARGET_VERSION} \
datacenter-gpu-manager-4-proprietary-cuda12=${DCGM_TARGET_VERSION}; \
apt-get clean; \
rm -rf /var/lib/apt/lists/*
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DCGM Python bindings path in nvidia_exporter.py is incompatible with DCGM 4. The new Dockerfile installs datacenter-gpu-manager-4 which places Python bindings at /usr/share/datacenter-gpu-manager-4/bindings/python3, but nvidia_exporter.py still references /usr/local/dcgm/bindings/python3. The update-dcgm.py script that previously handled this path migration has been removed, which will cause the NVIDIA exporter to fail with ImportError when trying to import dcgm_fields and DcgmReader. The sys.path.append line needs to be updated to match the new DCGM 4 installation path.

Suggested change
rm -rf /var/lib/apt/lists/*
rm -rf /var/lib/apt/lists/*
ENV PYTHONPATH=/usr/share/datacenter-gpu-manager-4/bindings/python3:${PYTHONPATH}

Copilot uses AI. Check for mistakes.
COPY requirements.txt /job_exporter/requirements.txt

RUN apt update && apt upgrade -y && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN python3 -m pip install --no-cache-dir -U pip && \
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is trailing whitespace at the end of this line after the backslash continuation character. This should be removed for consistency and to avoid potential issues with shell script parsing.

Suggested change
RUN python3 -m pip install --no-cache-dir -U pip && \
RUN python3 -m pip install --no-cache-dir -U pip && \

Copilot uses AI. Check for mistakes.
Comment on lines +42 to +45
-r /w/requirements.txt \
prometheus_client psutil filelock


Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package prometheus_client is being built twice as a wheel - once from requirements.txt (which pins it to version 0.20.0) on line 42, and again without version specification on line 43. This may cause version conflicts or unnecessary duplication. Similarly, the same packages (prometheus_client, psutil, filelock) are being installed again in lines 125-127 after already being installed from requirements.txt on line 124. Consider consolidating these package specifications into requirements.txt to ensure consistent versioning and avoid redundant installations.

Suggested change
-r /w/requirements.txt \
prometheus_client psutil filelock
-r /w/requirements.txt

Copilot uses AI. Check for mistakes.
Comment on lines +121 to +127
RUN python3 -m pip install --no-cache-dir -U pip && \
python3 -m pip install --no-cache-dir \
--no-index --find-links=/wheels \
-r /job_exporter/requirements.txt && \
python3 -m pip install --no-cache-dir \
--no-index --find-links=/wheels \
prometheus_client psutil filelock && \
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The packages prometheus_client, psutil, and filelock are being installed redundantly. They were already installed from requirements.txt on line 124 (prometheus_client is pinned to 0.20.0 in requirements.txt). Installing them again without version specifications may cause version conflicts or simply waste build time. Consider removing these redundant installations or ensuring all package versions are consistently managed through requirements.txt.

Suggested change
RUN python3 -m pip install --no-cache-dir -U pip && \
python3 -m pip install --no-cache-dir \
--no-index --find-links=/wheels \
-r /job_exporter/requirements.txt && \
python3 -m pip install --no-cache-dir \
--no-index --find-links=/wheels \
prometheus_client psutil filelock && \
RUN python3 -m pip install --no-cache-dir -U pip && \
python3 -m pip install --no-cache-dir \
--no-index --find-links=/wheels \
-r /job_exporter/requirements.txt && \

Copilot uses AI. Check for mistakes.
Comment on lines +80 to +83
curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -; \
echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \
> /etc/apt/sources.list.d/rocm.list; \
echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command apt-key add is deprecated and will be removed in a future Ubuntu release. Consider using the recommended approach of placing the GPG key in /etc/apt/trusted.gpg.d/ or /usr/share/keyrings/ and referencing it with the signed-by option in the sources.list entry. For example: curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor -o /usr/share/keyrings/rocm-archive-keyring.gpg and then use [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] in the deb line.

Suggested change
curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -; \
echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \
> /etc/apt/sources.list.d/rocm.list; \
echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \
curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor -o /usr/share/keyrings/rocm-archive-keyring.gpg; \
echo "deb [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \
> /etc/apt/sources.list.d/rocm.list; \
echo "deb [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \

Copilot uses AI. Check for mistakes.
Comment on lines +76 to +88
RUN set -eux; \
if [ "$TARGETARCH" = "amd64" ]; then \
printf "Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600" \
> /etc/apt/preferences.d/rocm-pin-600; \
curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -; \
echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \
> /etc/apt/sources.list.d/rocm.list; \
echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \
> /etc/apt/sources.list.d/amdgpu.list; \
apt-get update; \
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends rdc; \
rm -rf /var/lib/apt/lists/*; \
fi
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AMD RDC installation has been significantly simplified from building from source (with a specific cherry-picked commit 660c5afaf49630781c1059ba6d30bae21743c32f from amd-staging branch) to installing the prebuilt rdc package. This changes the RDC version and removes the custom patches that were previously applied. Verify that the packaged RDC version includes the necessary functionality from the cherry-picked commit, or that the commit is no longer needed for the target ROCm version 6.2.2. This change could impact AMD GPU monitoring functionality if the custom patches were critical.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants