Problem Description
When madengine is run on a machine where the user is not in the render and video groups, it miscounts the number of GPUs because of the warning lines that amd-smi prints:
vpolamre@useocpm2m-097-123:~/CMajor-RL/rl/runs/tasks$ amd-smi list --csv | tail -n +3
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
gpu,gpu_bdf,gpu_uuid,kfd_id,node_id,partition_id
0,N/A,N/A,32700,2,0
1,N/A,N/A,3884,3,0
2,N/A,N/A,29122,4,0
3,N/A,N/A,35464,5,0
4,N/A,N/A,46166,6,0
5,N/A,N/A,64654,7,0
6,N/A,N/A,4769,8,0
7,N/A,N/A,6315,9,0
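A possible way to make the count robust to these warnings is sketched below. This is not madengine's actual code: `count_gpus` is a hypothetical helper, and it assumes the CSV header always starts with `gpu,` as in the output above. It filters by content, so it does not matter whether the warnings arrive on stdout or stderr.

```python
import csv
import io


def count_gpus(amd_smi_csv_output: str) -> int:
    """Count device rows in `amd-smi list --csv` output, ignoring warning lines.

    Hypothetical helper for illustration; assumes the header starts with "gpu,".
    """
    lines = amd_smi_csv_output.splitlines()
    # Locate the CSV header; anything before it (warnings, blanks) is discarded.
    header_idx = next(
        (i for i, line in enumerate(lines) if line.strip().startswith("gpu,")),
        None,
    )
    if header_idx is None:
        raise ValueError("no CSV header found in amd-smi output")
    reader = csv.DictReader(io.StringIO("\n".join(lines[header_idx:])))
    # Count only rows whose `gpu` column is an integer index, so a warning
    # that shows up after the header is skipped as well.
    return sum(1 for row in reader if (row.get("gpu") or "").strip().isdigit())


SAMPLE = """\
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
gpu,gpu_bdf,gpu_uuid,kfd_id,node_id,partition_id
0,N/A,N/A,32700,2,0
1,N/A,N/A,3884,3,0
"""

print(count_gpus(SAMPLE))  # prints 2
```

Counting raw output lines instead would include every warning line, which would be consistent with the inflated GPU count in the log under Additional Information.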
Operating System
Ubuntu 22.04.5 LTS
CPU
NA
GPU
MI300X
ROCm Version
rocm-7.0.2
ROCm Component
amdsmi
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
` useocpm2m-097-076
MACHINE NAME is useocpm2m-097-076
ℹ️ Inherited 2 environment variables from shell for Docker
ROCm container ROCM_PATH from image OCI config (ci-sglang_sglang-perf_pyt_sglang.ubuntu.amd): /opt/rocm
MAD_DATA_PROVIDER::huggingface: reordered list of data provider types to: {} ...
MAD_DATA_PROVIDER::huggingface: not found.
MAD_DATA_PROVIDER::huggingface: searched for previously. Reusing ...
pre encap post scripts: {'pre_scripts': [{'path': 'scripts/common/pre_scripts/run_rocenv_tool.sh', 'args': 'sglang_sglang-perf_env'}], 'encapsulate_script': '', 'post_scripts': []}
NGPUS requested is ALL (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16).
NGPUS requested is 17 out of 17
⠧ Building and running models...❌ Failed to run sglang/sglang-perf: list index out of range
Created performance CSV file: perf.csv
hostname
useocpm2m-097-076 `