@hippogr hippogr commented Jan 15, 2026

Add the node labels vendor=nvidia and vendor=amd to nodes with the corresponding GPU type, and add a matching nodeSelector when deploying nvidia-device-plugin and amd-device-plugin, so that the device plugin containers no longer complain about missing GPUs on nodes that do not have them.

Copilot AI left a comment

Pull request overview

This PR adds GPU vendor tagging functionality to differentiate between NVIDIA and AMD GPU nodes, and implements nodeSelector constraints in device plugin deployments to prevent errors when devices are not present.

Changes:

  • Added vendor=nvidia and vendor=amd node labels based on GPU type detection
  • Added nodeSelector to nvidia-device-plugin and amd-device-plugin DaemonSet deployments
  • Implemented vendor labeling logic for worker nodes including fallback values for unknown GPU types and CPU-only nodes
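The labeling rules summarized above can be sketched as a small shell function. This is only an illustration of the mapping, not the code in the PR (which does the same branching in Jinja2 at template render time); the function name `vendor_for_device` and the convention of passing an empty string for a SKU without a computing device are assumptions made for this example.

```shell
# Hypothetical helper: map a computing-device type to the vendor label value.
# An empty argument stands in for "the machine SKU has no computing-device entry".
vendor_for_device() {
  case "$1" in
    nvidia.com/gpu) echo nvidia ;;
    amd.com/gpu)    echo amd ;;
    "")             echo cpu ;;      # CPU-only node
    *)              echo unknown ;;  # device type the template does not recognize
  esac
}

# Print the label each kind of node would receive.
for device in nvidia.com/gpu amd.com/gpu some.vendor/npu ""; do
  echo "device='$device' -> vendor=$(vendor_for_device "$device")"
done
```

The `unknown` and `cpu` fallbacks mean every worker node gets some `vendor` label, so a nodeSelector on `vendor: nvidia` or `vendor: amd` never matches a node by accident.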

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/device-plugin/deploy/start.sh.template | Adds nodeSelector with vendor labels to NVIDIA and AMD GPU device plugin deployments |
| src/cluster-configuration/deploy/start.sh.template | Implements node labeling logic to assign vendor tags based on computing device type |
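For readers unfamiliar with patching a manifest in a pipeline, here is a minimal sketch of inserting a nodeSelector ahead of the `tolerations:` key. It uses awk instead of the sed expressions in the PR so that the inserted lines reuse the indentation of the `tolerations:` line; the function name `add_node_selector` and the sample manifest fragment are assumptions made for this example.

```shell
# Hypothetical helper: read a manifest on stdin and insert a nodeSelector
# block immediately before each "tolerations:" line, at the same indentation.
add_node_selector() {
  awk -v vendor="$1" '
    /^[ \t]*tolerations:/ {
      indent = $0
      sub(/[^ \t].*$/, "", indent)      # keep only the leading whitespace
      print indent "nodeSelector:"
      print indent "  vendor: " vendor
    }
    { print }                           # always emit the original line
  '
}

# Illustrative fragment of a pod spec (not taken from the real manifests).
manifest='    spec:
      tolerations:
      - key: CriticalAddonsOnly'

patched=$(printf '%s\n' "$manifest" | add_node_selector nvidia)
printf '%s\n' "$patched"
```

Because nodeSelector and tolerations are sibling fields of the pod spec, copying the indentation of the `tolerations:` line keeps the resulting YAML valid regardless of how deeply the spec is nested.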

s/(^[[:space:]]*allowPrivilegeEscalation: false.*)\n([[:space:]]*privileged: false)/\1\n\2/
}';
}' \
| sed '/^[[:space:]]*tolerations:/i\ nodeSelector:\n vendor: nvidia';

Copilot AI Jan 15, 2026

The sed command uses || (logical OR operator) on line 39, which should be | (pipe operator) to chain the sed commands. The || operator will only execute the second sed if the first one fails, which is not the intended behavior.
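The difference between the two operators can be seen in a small, self-contained sketch. The input string and substitutions below are made up for illustration and are unrelated to the manifests in this PR.

```shell
# With |, the second sed always runs and reads the first sed's output.
piped=$(printf 'foo bar\n' | sed 's/foo/FOO/' | sed 's/bar/BAR/')

# With ||, the second sed runs only if the first one exits non-zero.
# The first sed succeeds here, so the second substitution never happens.
ored=$(printf 'foo bar\n' | sed 's/foo/FOO/' || sed 's/bar/BAR/')

echo "$piped"   # FOO BAR
echo "$ored"    # FOO bar
```

A `||` in place of `|` therefore does not fail loudly: the script keeps running and silently drops every edit after the first command.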

Comment on lines +54 to +59
| sed 's|rocm/k8s-device-plugin|{{ cluster_cfg['cluster']['docker-registry']['prefix'] }}k8s-rocm-device-plugin:{{ cluster_cfg['cluster']['docker-registry']['tag'] }}|' \
| sed -E '/^[[:space:]]*nodeSelector:[[:space:]]*$/{
n
s/^([[:space:]]*)(.*)$/\1vendor: amd\
\1\2/
}';

Copilot AI Jan 15, 2026

The sed command uses || (logical OR operator) on line 54, which should be | (pipe operator) to chain the sed commands. The || operator will only execute the second sed if the first one fails, which is not the intended behavior.

Comment on lines +55 to +59
| sed -E '/^[[:space:]]*nodeSelector:[[:space:]]*$/{
n
s/^([[:space:]]*)(.*)$/\1vendor: amd\
\1\2/
}';

Copilot AI Jan 15, 2026

Missing closing brace for the opening brace on line 53. The AMD device plugin section starts with { but the sed command chain ends with a semicolon without a corresponding } and the cat command structure that exists in the NVIDIA section.
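One way to catch an unbalanced brace group before deployment is to run the shell's parse-only mode over the rendered script. The sketch below assumes the Jinja2 template has already been rendered to a plain shell file (the raw template itself is not valid shell); the helper name `check_script_syntax` and the two throwaway files are hypothetical.

```shell
# Hypothetical helper: parse a script without executing it.
# Returns 0 if the syntax is valid, non-zero otherwise.
check_script_syntax() {
  sh -n "$1" 2>/dev/null
}

tmpdir=$(mktemp -d)

# A brace group that is never closed, similar to the issue described above.
printf '{\n  echo hello\n' > "$tmpdir/unbalanced.sh"
# The same group with its closing brace.
printf '{\n  echo hello\n}\n' > "$tmpdir/balanced.sh"

if check_script_syntax "$tmpdir/unbalanced.sh"; then unbalanced_result=ok; else unbalanced_result=error; fi
if check_script_syntax "$tmpdir/balanced.sh"; then balanced_result=ok; else balanced_result=error; fi

echo "unbalanced.sh: $unbalanced_result"
echo "balanced.sh: $balanced_result"

rm -rf "$tmpdir"
```

The `-n` option only checks syntax, so it would flag a missing `}` but not a logic error such as `||` used where `|` was intended.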

Comment on lines +51 to 64
{%- set machine_type = cluster_cfg['layout']['machine-list'][host]['machine-type'] %}
{%- if machine_type in cluster_cfg['layout']['machine-sku'] and 'computing-device' in cluster_cfg['layout']['machine-sku'][machine_type] %}
{%- set device_type = cluster_cfg['layout']['machine-sku'][machine_type]['computing-device']['type'] %}
{%- if device_type == 'nvidia.com/gpu' %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=nvidia || exit $?
{%- elif device_type == 'amd.com/gpu' %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=amd || exit $?
{%- else %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=unknown || exit $?
{%- endif %}
{%- else %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=cpu || exit $?
{%- endif %}
{%- else %}

Copilot AI Jan 15, 2026

The indentation of the Jinja2 template blocks is inconsistent. Lines 51-63 use excessive leading spaces (8 spaces) compared to the surrounding code which uses 4 spaces for the conditional blocks. This makes the code harder to read and maintain.

Suggested change
{%- set machine_type = cluster_cfg['layout']['machine-list'][host]['machine-type'] %}
{%- if machine_type in cluster_cfg['layout']['machine-sku'] and 'computing-device' in cluster_cfg['layout']['machine-sku'][machine_type] %}
{%- set device_type = cluster_cfg['layout']['machine-sku'][machine_type]['computing-device']['type'] %}
{%- if device_type == 'nvidia.com/gpu' %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=nvidia || exit $?
{%- elif device_type == 'amd.com/gpu' %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=amd || exit $?
{%- else %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=unknown || exit $?
{%- endif %}
{%- else %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=cpu || exit $?
{%- endif %}
{%- else %}
