Slurm CLI tools

A collection of CLI tools for various tasks and queries on Slurm clusters.

Installation

The CLI tools are Python scripts that rely only on the Python standard library and run with Python 3.10 or newer. They should work on most Slurm clusters, but they depend on certain Slurm configuration settings; check the notes in each script for more information.
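
To check whether the default Python on a cluster is new enough, one quick test (assuming python3 is on your PATH):

$ python3 -c 'import sys; print(sys.version_info >= (3, 10))'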

To install, simply clone the repo:

git clone --depth 1 https://github.com/uschpc/slurm-tools.git

The scripts are downloaded with execute permissions. If desired, move them to another directory, such as a directory on your PATH. If needed, load a compatible version of Python or change the shebang line in the scripts to point to a compatible Python executable.
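
For example, one possible setup, assuming ~/bin is on your PATH and the cluster provides Python through an environment module system (the module name below is hypothetical):

$ mkdir -p ~/bin
$ cp slurm-tools/{myaccount,noderes,jobqueue,jobhist,jobinfo,jobeff} ~/bin/
$ module load python/3.11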

Usage

Each script contains help and usage information that can be viewed with the -h/--help flag (e.g., jobinfo -h).

Each script is described below, along with example output.

myaccount

View account information for a user.

$ myaccount
-----------------------------------------------------------------
Cluster accounts
-----------------------------------------------------------------
User       Account         Cluster    Default         QOS
---------- --------------- ---------- --------------- -----------
ttrojan    ttrojan_123     discovery  ttrojan_123     normal
ttrojan    ttrojan_125     discovery  ttrojan_123     normal

-----------------------------------------------------------------
Cluster account service units (SUs)
-----------------------------------------------------------------
Account           Limit           Usage           Remaining
----------------- --------------- --------------- ---------------
ttrojan_123       12000000        422699          11577301
ttrojan_125       n/a             839856          n/a

-----------------------------------------------------------------
Allowed cluster partitions
-----------------------------------------------------------------
Partition      Allowed accounts
-------------- --------------------------------------------------
trojan         ttrojan_125
shared         ALL
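
Similar association data can be pulled directly from Slurm's accounting database with sacctmgr. A rough equivalent query for comparison (not necessarily how myaccount is implemented; the fields are standard sacctmgr format options):

$ sacctmgr show associations user=$USER format=User,Account,Cluster,QOS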

noderes

View available resources (free or configured) on nodes.

$ noderes -p largemem
-------------------------------------------------------------------
Node      Partition     State         CPU      GPU Free   Free Free
                                    Model    Model CPUs Memory GPUs
------ ------------ --------- ----------- -------- ---- ------ ----
a01-10     largemem     mixed   epyc-7513       --   22    78G   --
a02-10     largemem allocated   epyc-7513       --    0    54G   --
a03-10     largemem     mixed   epyc-7513       --   12    50G   --
a04-10     largemem  reserved   epyc-7513       --   62    38G   --
b17-13     largemem allocated   epyc-9354       --    0     0G   --
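
Comparable per-node data can be queried directly with sinfo; for example (standard sinfo output fields; noderes's exact query may differ):

$ sinfo -p largemem -N -O NodeList,Partition,StateLong,CPUsState,FreeMem,Gres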

jobqueue

View job queue information.

$ jobqueue -p largemem
-----------------------------------------------------------------------------------------
Job ID             User     Job Name  Partition    State     Elapsed     Nodelist(Reason)
------------ ---------- ------------ ---------- -------- ----------- --------------------
7453378            jesp Run_do52bran   largemem  PENDING        0:00 (QOSMaxMemoryPerUser
7453379            jesp Run_do6b.job   largemem  PENDING        0:00 (QOSMaxMemoryPerUser
7473836         ttrojan  ood/jupyter   largemem  RUNNING     2:42:28               a04-10
7449562            snet run_study_4x   largemem  RUNNING  2-23:17:10               a02-10
7453377            jesp Run_do51bran   largemem  RUNNING  2-17:02:07               a01-10
7470944          huy435    rfmixchr1   largemem  RUNNING    21:18:51        a02-10,a04-10
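
The underlying queue data comes from squeue; a roughly equivalent raw query for comparison (standard squeue format specifiers; not necessarily the exact one jobqueue runs):

$ squeue -p largemem -o "%.12i %.10u %.12j %.10P %.8T %.11M %R"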

jobhist

View compact history of jobs.

$ jobhist -p largemem
----------------------------------------------------------------------------------------------------
Job ID         Startdate       User     Job Name  Partition      State     Elapsed Nodes CPUs Memory
------------- ---------- ---------- ------------ ---------- ---------- ----------- ----- ---- ------
14690860      2024-07-29    ttrojan       sim.sl   largemem    RUNNING  3-08:19:19     1   32   128G 
14734145      2024-07-31       jesp         sfla   largemem    RUNNING  2-21:46:24     1   64   998G 
14738354      2024-07-31       snet  interactive   largemem  COMPLETED    06:56:19     1   16   400G 
14741823      2024-07-31     huy435   model_fit1   largemem  COMPLETED    07:04:19     1   64   248G 
14741846      2024-07-31     huy435   model_fit2   largemem  COMPLETED    08:10:59     1   64   248G 
14741918      2024-08-01       snet   feature.sl   largemem     FAILED    00:02:16     1    8   300G 
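
Job history like this comes from Slurm's accounting records; a rough sacct equivalent for comparison (standard sacct fields; jobhist's exact query may differ):

$ sacct -X --partition=largemem --format=JobID,Start,User,JobName,Partition,State,Elapsed,NNodes,AllocCPUS,ReqMem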

jobinfo

View detailed job information.

$ jobinfo 483699
Job ID               | 483699
Job name             | simdebug.sl
User                 | ttrojan
Account              | ttrojan_123
Working directory    | /project/ttrojan_123/sim
Cluster              | discovery
Partition            | main
State                | COMPLETED
Exit code            | 0:0
Nodes                | 2
Tasks                | 32
CPUs                 | 32
Memory               | 120G
GPUs                 | 0
Nodelist             | e05-[42,76]
Submit time          | 2024-01-26T14:56:23
Start time           | 2024-01-26T14:56:24
End time             | 2024-01-26T14:57:32
Wait time            | 00:00:01
Reserved walltime    | 00:10:00
Elapsed walltime     | 00:01:08
Time efficiency      | 11.33%
Elapsed CPU walltime | 00:36:16
Used CPU time        | 00:35:18.342
CPU efficiency       | 97.37%
User CPU time pct    | 96.35%
System CPU time pct  |  3.65%
Max memory used      | 64.74G (estimate)
Memory efficiency    | 53.95%
Max disk read        | 14.42M (estimate)
Max disk write       | 1.04M (estimate)
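
Most of these fields are derived from Slurm's accounting records, with the efficiency values computed from them. A raw view of the same job for comparison (standard sacct fields; jobinfo's exact query may differ):

$ sacct -j 483699 --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS,MaxDiskRead,MaxDiskWrite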

jobeff

View job efficiency information.

$ jobeff -p largemem
------------------------------------------------------------------
Job ID             State       CPU Efficiency    Memory Efficiency
------------- ---------- -------------------- --------------------
1131130        CANCELLED  42.02% [||||      ]  50.03% [|||||     ]
1140921        COMPLETED   9.78% [|         ]   4.15% [          ]
1140925          RUNNING                                          
1189016           FAILED  87.30% [||||||||| ]  85.03% [||||||||| ]
1201010       OUT_OF_MEM  86.08% [||||||||| ]  99.69% [||||||||||]
1201035          TIMEOUT  23.00% [||        ]  73.35% [|||||||   ]
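
The efficiency values presumably follow the usual definitions, as in Slurm's seff utility: CPU efficiency is used CPU time (TotalCPU) divided by allocated CPU walltime (Elapsed × AllocCPUS), and memory efficiency is peak resident memory (MaxRSS) divided by requested memory (ReqMem). The raw inputs can be inspected with sacct (job ID taken from the table above):

$ sacct -j 1189016 --format=JobID,Elapsed,AllocCPUS,TotalCPU,ReqMem,MaxRSS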

License

0BSD
