CPU underutilization (25%) on Bunya HPC #737

@JoeIbrahim

I'm attempting to run some UW2 models on UQ's Bunya HPC, using the ompi build of UW with Apptainer. Bunya uses the Slurm scheduler, and this is my first time using Slurm, so I may be making some rookie mistakes. The model runs without errors and produces output, but the overall CPU utilization is only 25%.

I normally run these models on Gadi and have not noticed any issues there, so I suspect it has something to do with how I've set things up on Bunya. The model definitely runs slower than it does on Gadi with the same number of CPUs: far fewer outputs are generated in 12 hours. Is there a way to check the CPU efficiency on Gadi? If so, I can compare; I've put a sketch of what I'd run after the statistics below. Here are the summary statistics for the job:

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 17583095
  NetID/Account: yousephibrahim/a_ibrahim
       Job Name: 800C_nomelt
          State: TIMEOUT
          Nodes: 1
      CPU Cores: 96
     CPU Memory: 50GB (520.8MB per CPU-core)
  QOS/Partition: normal/general
        Cluster: bunya
     Start Time: Wed Oct 15, 2025 at 4:47 PM
       Run Time: 12:30:13
     Time Limit: 12:00:00
                              Overall Utilization
================================================================================
  CPU utilization  [||||||||||||                                   25%]
  CPU memory usage [||||||||||||||||||||||||||||||||               65%]
                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      bun019.hpc.net.uq.edu.au: 12-09:38:24/50-00:20:48 (efficiency=24.8%)
  CPU memory usage per node - used/allocated
      bun019.hpc.net.uq.edu.au: 32.7GB/50.0GB (348.3MB/533.3MB per core of 96)
                                     Notes
================================================================================
  * The overall CPU utilization of this job is 25%. This value is low compared
    to the target range of 80% and above. Please investigate the reason for
    the low efficiency. For instance, have you conducted a scaling analysis?
    For more info:
      https://github.com/UQ-RCC/hpc-docs/blob/main/guides/Bunya-User-Guide.md
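
For reference, the same efficiency figure can be derived on any Slurm cluster from the accounting records as TotalCPU / (Elapsed * AllocCPUS); Gadi runs PBS, where (as far as I recall) the footer of the job's .o file reports "CPU Time Used", "NCPUs Used" and "Walltime Used", which give the same ratio. A minimal sketch of what I'd run (seff is a contrib tool and may not be installed on every cluster):

# Slurm: raw inputs for efficiency = TotalCPU / (Elapsed * AllocCPUS)
sacct -j 17583095 --format=JobID,AllocCPUS,Elapsed,TotalCPU
# Same summary in one line, if the seff contrib tool is installed
seff 17583095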

This is the Slurm script I use:

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --mem=50G
#SBATCH --job-name=800C_nomelt
#SBATCH --time=12:00:00
#SBATCH --qos=normal
#SBATCH --partition=general
#SBATCH --account=a_ibrahim
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.err

module load openmpi/4.1.4

export singularityDir=/home/yousephibrahim/Underworld
export containerImage=$singularityDir/UNDERWORLD_ompi.sif
SCRIPT="800C.py"

#======START=====
echo "The current job ID is $SLURM_JOB_ID"
echo "Running on $SLURM_JOB_NUM_NODES nodes"
echo "Using $SLURM_TASKS_PER_NODE tasks per node"
echo "A total of $SLURM_NTASKS tasks is used"
echo "Node list:"
sacct --format=JobID,NodeList%100 -j $SLURM_JOB_ID

# execute
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK apptainer exec $containerImage python3 $SCRIPT
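
One more check I can run through the same srun/Apptainer pipeline, in case it helps the diagnosis (a sketch only; it assumes mpi4py is importable inside the UW container, which I haven't verified). If every task prints "rank 0 of 1", the container's MPI isn't being wired up by Slurm's launcher and each task is running the whole model serially:

# Do the tasks form one 48-rank MPI job, or 48 independent serial runs?
srun -n $SLURM_NTASKS apptainer exec $containerImage \
    python3 -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(f'rank {c.Get_rank()} of {c.Get_size()}')"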

Thanks for your help!!
