Add tag for different type of GPUs #138
base: dev
dfa6099 to 1b5d18f
Pull request overview
This PR adds GPU vendor tagging functionality to differentiate between NVIDIA and AMD GPU nodes, and implements nodeSelector constraints in device plugin deployments to prevent errors when devices are not present.
Changes:
- Added `vendor=nvidia` and `vendor=amd` node labels based on GPU type detection
- Added nodeSelector to nvidia-device-plugin and amd-device-plugin DaemonSet deployments
- Implemented vendor labeling logic for worker nodes including fallback values for unknown GPU types and CPU-only nodes
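
The labeling rule in the list above can be sketched as a small shell function. This is an illustration only, not the PR's template code: the function name `vendor_for_device` and the node name `worker-0` are hypothetical, and the sketch assumes that an empty device type stands for a machine SKU with no `computing-device` entry. The device type strings and label values are the ones that appear in the PR.

```shell
# Hypothetical helper: map a computing-device type to a vendor label value.
# Assumption: an empty argument means the SKU has no computing-device entry.
vendor_for_device() {
  case "$1" in
    nvidia.com/gpu) echo nvidia ;;
    amd.com/gpu)    echo amd ;;
    "")             echo cpu ;;
    *)              echo unknown ;;
  esac
}

# Print (rather than run) the labeling command for a placeholder node name.
echo "kubectl label --overwrite=true nodes worker-0 vendor=$(vendor_for_device amd.com/gpu)"
```

Keeping the fallback values (`unknown`, `cpu`) explicit means every worker node carries a `vendor` label, so a nodeSelector on `vendor` never has to deal with a missing label.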
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/device-plugin/deploy/start.sh.template | Adds nodeSelector with vendor labels to NVIDIA and AMD GPU device plugin deployments |
| src/cluster-configuration/deploy/start.sh.template | Implements node labeling logic to assign vendor tags based on computing device type |
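
As a rough sketch of the first change, the snippet below inserts a `nodeSelector` block just before a `tolerations:` line at the same indentation. It uses `awk` instead of the PR's `sed` commands to avoid differences between sed implementations, and the three-line manifest is an invented stand-in for the real device plugin manifest.

```shell
# Invented stand-in for a fragment of a DaemonSet manifest.
manifest='spec:
  tolerations:
  - operator: Exists'

# Before each "tolerations:" line, emit a nodeSelector block that reuses
# the leading whitespace of that line.
patched=$(printf '%s\n' "$manifest" | awk '
  /^[ \t]*tolerations:/ {
    indent = $0
    sub(/tolerations:.*/, "", indent)   # keep only the leading whitespace
    print indent "nodeSelector:"
    print indent "  vendor: nvidia"
  }
  { print }
')

printf '%s\n' "$patched"
```

With this selector in place, the DaemonSet is scheduled only onto nodes labeled `vendor=nvidia`; an AMD variant would differ only in the label value.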
s/(^[[:space:]]*allowPrivilegeEscalation: false.*)\n([[:space:]]*privileged: false)/\1\n\2/
}';
}' \
| sed '/^[[:space:]]*tolerations:/i\ nodeSelector:\n vendor: nvidia';
Copilot
AI
Jan 15, 2026
The sed command uses || (logical OR operator) on line 39, which should be | (pipe operator) to chain the sed commands. The || operator will only execute the second sed if the first one fails, which is not the intended behavior.
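
The difference between the two operators can be seen in a minimal, self-contained sketch; the input line `image: old` is made up for illustration and is not taken from the PR.

```shell
# '|' connects stdout of the left command to stdin of the right command.
piped=$(printf 'image: old\n' | sed 's/old/new/')

# '||' runs the right command only if the left command fails.
# printf succeeds here, so sed never runs and the text is left unchanged.
ored=$(printf 'image: old\n' || sed 's/old/new/')

echo "$piped"
echo "$ored"
```

The first line printed is `image: new` and the second is `image: old`, which shows that a chain written with `||` silently skips the later substitutions.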
| sed 's|rocm/k8s-device-plugin|{{ cluster_cfg['cluster']['docker-registry']['prefix'] }}k8s-rocm-device-plugin:{{ cluster_cfg['cluster']['docker-registry']['tag'] }}|' \
| sed -E '/^[[:space:]]*nodeSelector:[[:space:]]*$/{
n
s/^([[:space:]]*)(.*)$/\1vendor: amd\
\1\2/
}';
Copilot
AI
Jan 15, 2026
The sed command uses || (logical OR operator) on line 54, which should be | (pipe operator) to chain the sed commands. The || operator will only execute the second sed if the first one fails, which is not the intended behavior.
| sed -E '/^[[:space:]]*nodeSelector:[[:space:]]*$/{
n
s/^([[:space:]]*)(.*)$/\1vendor: amd\
\1\2/
}';
Copilot
AI
Jan 15, 2026
Missing closing brace for the opening brace on line 53. The AMD device plugin section starts with { but the sed command chain ends with a semicolon without a corresponding } and the cat command structure that exists in the NVIDIA section.
{%- set machine_type = cluster_cfg['layout']['machine-list'][host]['machine-type'] %}
{%- if machine_type in cluster_cfg['layout']['machine-sku'] and 'computing-device' in cluster_cfg['layout']['machine-sku'][machine_type] %}
{%- set device_type = cluster_cfg['layout']['machine-sku'][machine_type]['computing-device']['type'] %}
{%- if device_type == 'nvidia.com/gpu' %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=nvidia || exit $?
{%- elif device_type == 'amd.com/gpu' %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=amd || exit $?
{%- else %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=unknown || exit $?
{%- endif %}
{%- else %}
echo kubectl label --overwrite=true nodes {{ cluster_cfg['layout']['machine-list'][host]['hostname'] }} vendor=cpu || exit $?
{%- endif %}
{%- else %}
Copilot
AI
Jan 15, 2026
The indentation of the Jinja2 template blocks is inconsistent. Lines 51-63 use excessive leading spaces (8 spaces) compared to the surrounding code which uses 4 spaces for the conditional blocks. This makes the code harder to read and maintain.
Add the node labels vendor=nvidia and vendor=amd to nodes according to their GPU type, and add a nodeSelector when deploying nvidia-device-plugin and amd-device-plugin so that the device plugin containers stop complaining on nodes that lack the corresponding GPUs.