Description
I'm attempting to run some UW2 models on UQ's Bunya HPC. I am using the ompi version of UW with Apptainer. Bunya uses a Slurm scheduler and it's my first time using Slurm, so maybe I'm making some rookie mistakes. The model runs without any errors and I get some output, but the overall CPU utilization is 25%.
I normally run these models on Gadi and have not noticed any issues, so I suspect it has something to do with how I've set things up on Bunya. The model is definitely running slower than it does on Gadi with the same number of CPUs: far fewer outputs are generated in 12 hours. Is there a way to check the CPU efficiency on Gadi? If so, I can compare (my rough guess at how to do that is sketched after the job statistics below). Here are the summary statistics for the job:
================================================================================
Slurm Job Statistics
================================================================================
Job ID: 17583095
NetID/Account: yousephibrahim/a_ibrahim
Job Name: 800C_nomelt
State: TIMEOUT
Nodes: 1
CPU Cores: 96
CPU Memory: 50GB (520.8MB per CPU-core)
QOS/Partition: normal/general
Cluster: bunya
Start Time: Wed Oct 15, 2025 at 4:47 PM
Run Time: 12:30:13
Time Limit: 12:00:00
Overall Utilization
================================================================================
CPU utilization [|||||||||||| 25%]
CPU memory usage [|||||||||||||||||||||||||||||||| 65%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
bun019.hpc.net.uq.edu.au: 12-09:38:24/50-00:20:48 (efficiency=24.8%)
CPU memory usage per node - used/allocated
bun019.hpc.net.uq.edu.au: 32.7GB/50.0GB (348.3MB/533.3MB per core of 96)
Notes
================================================================================
* The overall CPU utilization of this job is 25%. This value is low compared
to the target range of 80% and above. Please investigate the reason for
the low efficiency. For instance, have you conducted a scaling analysis?
For more info:
https://github.com/UQ-RCC/hpc-docs/blob/main/guides/Bunya-User-Guide.md
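For the Gadi comparison, this is roughly how I'd estimate CPU efficiency there, assuming PBS Pro's qstat exposes the usual resources_used fields (I haven't verified the exact field names on Gadi, so please correct me if this is wrong):

# Query a finished (-x) job's accounting fields on a PBS Pro system such as Gadi
qstat -fx <jobid> | grep -E "resources_used\.(cput|walltime|ncpus)"
# CPU efficiency is then roughly cput / (walltime * ncpus)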
This is the SLURM script I use:
#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --mem=50G
#SBATCH --job-name=800C_nomelt
#SBATCH --time=12:00:00
#SBATCH --qos=normal
#SBATCH --partition=general
#SBATCH --account=a_ibrahim
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.err
module load openmpi/4.1.4
export singularityDir=/home/yousephibrahim/Underworld
export containerImage=$singularityDir/UNDERWORLD_ompi.sif
SCRIPT="800C.py"
# execute
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK apptainer exec $containerImage python3 $SCRIPT
#======START=====
echo "The current job ID is $SLURM_JOB_ID"
echo "Running on $SLURM_JOB_NUM_NODES nodes"
echo "Using $SLURM_NTASKS_PER_NODE tasks per node"
echo "A total of $SLURM_NTASKS tasks is used"
echo "Node list:"
sacct --format=JobID,NodeList%100 -j $SLURM_JOB_ID
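If it helps with diagnosing, this is roughly how I'd compare what Slurm actually allocated with what the tasks consumed once the job ends (the format fields below are standard sacct options, but I haven't confirmed that Bunya's accounting records all of them):

# Allocated CPUs vs. CPU time actually used by the job and its steps
sacct -j <jobid> --format=JobID,AllocCPUS,NTasks,Elapsed,TotalCPU,MaxRSS
# Rough efficiency: TotalCPU / (AllocCPUS * Elapsed)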
Thanks for your help!!