[Issue]: Wrong GPU count when the User is not in render, video groups #123

@itej89

Problem Description

When madengine is run on a machine where the user is not in the render and video groups, it miscounts the number of GPUs because of the warnings that amd-smi prints.
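A pre-flight check could surface this condition before any GPU counting happens. The following is a minimal Python sketch, not madengine code: `missing_groups` is a hypothetical helper, and the group names are taken from the amd-smi warning text. It assumes a Linux host where the standard `grp` module is available.

```python
import grp
import os

# Group names taken from the amd-smi warning; adjust if your setup differs.
REQUIRED_GROUPS = ("render", "video")


def missing_groups(required=REQUIRED_GROUPS, member_gids=None):
    """Return the names in `required` that the current process does not belong to.

    `member_gids` can be passed explicitly (useful for testing); by default it
    is the supplementary group list plus the effective GID of this process.
    """
    if member_gids is None:
        member_gids = set(os.getgroups()) | {os.getegid()}

    missing = []
    for name in required:
        try:
            gid = grp.getgrnam(name).gr_gid
        except KeyError:
            # The group does not exist on this host, so the user cannot be in it.
            missing.append(name)
            continue
        if gid not in member_gids:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_groups()
    if absent:
        print("User is not in: " + ", ".join(absent))
```

Reporting the missing groups up front would make the root cause visible, rather than leaving it to show up later as an inflated GPU count.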

vpolamre@useocpm2m-097-123:~/CMajor-RL/rl/runs/tasks$ amd-smi list --csv | tail -n +3
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
gpu,gpu_bdf,gpu_uuid,kfd_id,node_id,partition_id
0,N/A,N/A,32700,2,0
1,N/A,N/A,3884,3,0
2,N/A,N/A,29122,4,0
3,N/A,N/A,35464,5,0
4,N/A,N/A,46166,6,0
5,N/A,N/A,64654,7,0
6,N/A,N/A,4769,8,0
7,N/A,N/A,6315,9,0
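
One way to make the count robust to these warnings is to locate the CSV header and keep only rows whose `gpu` column is numeric, instead of counting lines. The following is a minimal Python sketch, not madengine's actual parsing code: `parse_gpu_ids` is a hypothetical helper, and the sketch assumes the warnings arrive on the same stream as the CSV, as the output above suggests.

```python
import csv
import io
from typing import List

WARNING = (
    "WARNING: User is missing the following required groups: "
    "render, video. Please add user to these groups."
)

# Sample modeled on the `amd-smi list --csv` output shown in this issue.
SAMPLE = "\n".join([
    WARNING,
    WARNING,
    "gpu,gpu_bdf,gpu_uuid,kfd_id,node_id,partition_id",
    "0,N/A,N/A,32700,2,0",
    "1,N/A,N/A,3884,3,0",
    "2,N/A,N/A,29122,4,0",
    "3,N/A,N/A,35464,5,0",
    "4,N/A,N/A,46166,6,0",
    "5,N/A,N/A,64654,7,0",
    "6,N/A,N/A,4769,8,0",
    "7,N/A,N/A,6315,9,0",
    "",
])


def parse_gpu_ids(raw: str) -> List[int]:
    """Extract GPU indices from CSV text that may be mixed with warning lines."""
    lines = raw.splitlines()

    # Skip everything before the CSV header (warnings, blank lines).
    header_idx = None
    for i, line in enumerate(lines):
        if line.strip().startswith("gpu,"):
            header_idx = i
            break
    if header_idx is None:
        return []

    reader = csv.DictReader(io.StringIO("\n".join(lines[header_idx:])))
    gpu_ids = []
    for row in reader:
        value = (row.get("gpu") or "").strip()
        # Warning lines that appear after the header do not start with a
        # numeric field, so they are dropped here.
        if value.isdigit():
            gpu_ids.append(int(value))
    return gpu_ids


if __name__ == "__main__":
    ids = parse_gpu_ids(SAMPLE)
    print(len(ids), ids)
```

With the output above, this approach yields eight GPU indices, whereas a count based on raw line numbers would also include the warning lines.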

Operating System

Ubuntu 22.04.5 LTS

CPU

NA

GPU

MI300X

ROCm Version

rocm-7.0.2

ROCm Component

amdsmi

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

` useocpm2m-097-076
MACHINE NAME is useocpm2m-097-076
ℹ️ Inherited 2 environment variables from shell for Docker
ROCm container ROCM_PATH from image OCI config (ci-sglang_sglang-perf_pyt_sglang.ubuntu.amd): /opt/rocm
MAD_DATA_PROVIDER::huggingface: reordered list of data provider types to: {} ...
MAD_DATA_PROVIDER::huggingface: not found.
MAD_DATA_PROVIDER::huggingface: searched for previously. Reusing ...
pre encap post scripts: {'pre_scripts': [{'path': 'scripts/common/pre_scripts/run_rocenv_tool.sh', 'args': 'sglang_sglang-perf_env'}], 'encapsulate_script': '', 'post_scripts': []}
NGPUS requested is ALL (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16).
NGPUS requested is 17 out of 17
⠧ Building and running models...❌ Failed to run sglang/sglang-perf: list index out of range
Created performance CSV file: perf.csv

hostname
useocpm2m-097-076 `
