A collection of CLI tools for various tasks and queries on Slurm clusters.
The CLI tools are Python scripts that rely only on the Python standard library and require Python 3.10 or later. The scripts should run on most Slurm clusters, but they depend on certain Slurm configuration settings; check the notes in each script for more information.
To install, simply clone the repo:
git clone --depth 1 https://github.com/uschpc/slurm-tools.git
The scripts are downloaded with execute permissions. If desired, move them to another directory, such as a directory on PATH. If needed, load a compatible version of Python or change the shebang line in each script to point to a compatible Python executable.
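For example, assuming the repo was cloned into the current directory, that ~/bin is a directory on PATH, and that a suitable interpreter lives at /path/to/python3 (all illustrative paths), the setup might look like this:

# Copy a script (here jobinfo) to a directory on PATH; repeat for the other scripts
mkdir -p ~/bin
cp slurm-tools/jobinfo ~/bin/

# Optionally replace the shebang on line 1 with a specific Python 3.10+ interpreter
sed -i '1s|.*|#!/path/to/python3|' ~/bin/jobinfo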
Each script contains help and usage information that can be viewed with the -h/--help flag (e.g., jobinfo -h).
Each script is described below with example output.
View account information for a user.
$ myaccount
-----------------------------------------------------------------
Cluster accounts
-----------------------------------------------------------------
User       Account         Cluster    Default         QOS
---------- --------------- ---------- --------------- -----------
ttrojan    ttrojan_123     discovery  ttrojan_123     normal
ttrojan    ttrojan_125     discovery  ttrojan_123     normal
-----------------------------------------------------------------
Cluster account service units (SUs)
-----------------------------------------------------------------
Account                     Limit           Usage       Remaining
----------------- --------------- --------------- ---------------
ttrojan_123              12000000          422699        11577301
ttrojan_125                   n/a          839856             n/a
-----------------------------------------------------------------
Allowed cluster partitions
-----------------------------------------------------------------
Partition      Allowed accounts
-------------- --------------------------------------------------
trojan         ttrojan_125
shared         ALL
View available resources (free or configured) on nodes.
$ noderes -p largemem
-------------------------------------------------------------------
Node   Partition    State     CPU         GPU      Free   Free Free
                              Model       Model    CPUs Memory GPUs
------ ------------ --------- ----------- -------- ---- ------ ----
a01-10 largemem     mixed     epyc-7513   --         22    78G   --
a02-10 largemem     allocated epyc-7513   --          0    54G   --
a03-10 largemem     mixed     epyc-7513   --         12    50G   --
a04-10 largemem     reserved  epyc-7513   --         62    38G   --
b17-13 largemem     allocated epyc-9354   --          0     0G   --
View job queue information.
$ jobqueue -p largemem
-----------------------------------------------------------------------------------------
Job ID       User       Job Name     Partition  State        Elapsed Nodelist(Reason)
------------ ---------- ------------ ---------- -------- ----------- --------------------
7453378      jesp       Run_do52bran largemem   PENDING         0:00 (QOSMaxMemoryPerUser
7453379      jesp       Run_do6b.job largemem   PENDING         0:00 (QOSMaxMemoryPerUser
7473836      ttrojan    ood/jupyter  largemem   RUNNING      2:42:28 a04-10
7449562      snet       run_study_4x largemem   RUNNING   2-23:17:10 a02-10
7453377      jesp       Run_do51bran largemem   RUNNING   2-17:02:07 a01-10
7470944      huy435     rfmixchr1    largemem   RUNNING     21:18:51 a02-10,a04-10
View compact history of jobs.
$ jobhist -p largemem
----------------------------------------------------------------------------------------------------
Job ID        Startdate  User       Job Name     Partition  State          Elapsed Nodes CPUs Memory
------------- ---------- ---------- ------------ ---------- ---------- ----------- ----- ---- ------
14690860      2024-07-29 ttrojan    sim.sl       largemem   RUNNING     3-08:19:19     1   32   128G
14734145      2024-07-31 jesp       sfla         largemem   RUNNING     2-21:46:24     1   64   998G
14738354      2024-07-31 snet       interactive  largemem   COMPLETED     06:56:19     1   16   400G
14741823      2024-07-31 huy435     model_fit1   largemem   COMPLETED     07:04:19     1   64   248G
14741846      2024-07-31 huy435     model_fit2   largemem   COMPLETED     08:10:59     1   64   248G
14741918      2024-08-01 snet       feature.sl   largemem   FAILED        00:02:16     1    8   300G
View detailed job information.
$ jobinfo 483699
Job ID               | 483699
Job name             | simdebug.sl
User                 | ttrojan
Account              | ttrojan_123
Working directory    | /project/ttrojan_123/sim
Cluster              | discovery
Partition            | main
State                | COMPLETED
Exit code            | 0:0
Nodes                | 2
Tasks                | 32
CPUs                 | 32
Memory               | 120G
GPUs                 | 0
Nodelist             | e05-[42,76]
Submit time          | 2024-01-26T14:56:23
Start time           | 2024-01-26T14:56:24
End time             | 2024-01-26T14:57:32
Wait time            | 00:00:01
Reserved walltime    | 00:10:00
Elapsed walltime     | 00:01:08
Time efficiency      | 11.33%
Elapsed CPU walltime | 00:36:16
Used CPU time        | 00:35:18.342
CPU efficiency       | 97.37%
User CPU time pct    | 96.35%
System CPU time pct  | 3.65%
Max memory used      | 64.74G (estimate)
Memory efficiency    | 53.95%
Max disk read        | 14.42M (estimate)
Max disk write       | 1.04M (estimate)
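For reference, the efficiency fields in this example follow from the other fields (a rough sketch of the arithmetic; the tool's exact rounding may differ slightly):

Time efficiency      = elapsed / reserved walltime = 00:01:08 / 00:10:00 = 68 s / 600 s = 11.33%
Elapsed CPU walltime = 32 CPUs x 00:01:08 elapsed = 00:36:16
CPU efficiency       = used CPU time / elapsed CPU walltime = 00:35:18 / 00:36:16 = roughly 97%
Memory efficiency    = max memory used / requested memory = 64.74G / 120G = 53.95%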
View job efficiency information.
$ jobeff -p largemem
------------------------------------------------------------------
Job ID        State      CPU Efficiency       Memory Efficiency
------------- ---------- -------------------- --------------------
1131130       CANCELLED   42.02% [||||      ]  50.03% [|||||     ]
1140921       COMPLETED    9.78% [|         ]   4.15% [          ]
1140925       RUNNING
1189016       FAILED      87.30% [||||||||| ]  85.03% [||||||||| ]
1201010       OUT_OF_MEM  86.08% [||||||||| ]  99.69% [||||||||||]
1201035       TIMEOUT     23.00% [||        ]  73.35% [|||||||   ]
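In this display, each | in a bar corresponds to roughly 10 percentage points of efficiency (for example, 42.02% CPU efficiency is drawn with four bars). Jobs that are still running, such as 1140925 above, show no efficiency values, presumably because their accounting data is not yet complete.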