Chapter 5,7-9: Update CPU bindings when using torchrun #95

@marlon-tobaben

Description

The CPU bindings themselves are clearly sensible, but it would be good to check whether it is smart to set them twice: once with srun in the runscript and once inside the Python script. This was mentioned by @mitjasai in #89.

  • in the runscript
    CPU_BIND_MASKS="0x00fe000000000000,0xfe00000000000000,0x0000000000fe0000,0x00000000fe000000,0x00000000000000fe,0x000000000000fe00,0x000000fe00000000,0x0000fe0000000000"
    export SINGULARITYENV_PREPEND_PATH=/user-software/bin # gives access to packages inside the container
    srun --cpu-bind=v,mask_cpu=$CPU_BIND_MASKS singularity run -B ../resources/ai-guide-env.sqsh:/user-software:image-src=/ \
    $SIF bash -c 'export RANK=$SLURM_PROCID; export LOCAL_RANK=$SLURM_LOCALID; python ds_visiontransformer.py --deepspeed --deepspeed_config ds_config.json'
  • in the python script
    def set_cpu_affinity(local_rank):
        LUMI_GPU_CPU_map = {
            # A mapping from GCD to the closest CPU cores in a LUMI-G node
            # Note that CPU cores 0, 8, 16, 24, 32, 40, 48, 56 are reserved for the
            # system and not available for the user
            # See https://docs.lumi-supercomputer.eu/hardware/lumig/
            0: [49, 50, 51, 52, 53, 54, 55],
            1: [57, 58, 59, 60, 61, 62, 63],
            2: [17, 18, 19, 20, 21, 22, 23],
            3: [25, 26, 27, 28, 29, 30, 31],
            4: [1, 2, 3, 4, 5, 6, 7],
            5: [9, 10, 11, 12, 13, 14, 15],
            6: [33, 34, 35, 36, 37, 38, 39],
            7: [41, 42, 43, 44, 45, 46, 47],
        }
        cpu_list = LUMI_GPU_CPU_map[local_rank]
        print(f"Rank {rank} (local {local_rank}) binding to cpus: {cpu_list}")
        psutil.Process().cpu_affinity(cpu_list)
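
To see whether the two snippets above actually request the same binding, the srun masks can be decoded into core lists and compared with the Python map. The following is a minimal sketch, not code from the repository: `mask_to_cores` is a hypothetical helper, and it assumes that srun hands out the masks in order of local task ID, so that the n-th mask belongs to local rank n.

```python
def mask_to_cores(mask_hex):
    """Return the sorted CPU core indices whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Masks copied from the runscript.
CPU_BIND_MASKS = (
    "0x00fe000000000000,0xfe00000000000000,0x0000000000fe0000,"
    "0x00000000fe000000,0x00000000000000fe,0x000000000000fe00,"
    "0x000000fe00000000,0x0000fe0000000000"
)

# Same content as LUMI_GPU_CPU_map in the Python script, written with ranges.
LUMI_GPU_CPU_MAP = {
    0: list(range(49, 56)),
    1: list(range(57, 64)),
    2: list(range(17, 24)),
    3: list(range(25, 32)),
    4: list(range(1, 8)),
    5: list(range(9, 16)),
    6: list(range(33, 40)),
    7: list(range(41, 48)),
}

# Assumption: the n-th mask is applied to the task with local ID n.
decoded = {
    local_rank: mask_to_cores(mask)
    for local_rank, mask in enumerate(CPU_BIND_MASKS.split(","))
}

for local_rank, cores in decoded.items():
    status = "match" if cores == LUMI_GPU_CPU_MAP[local_rank] else "MISMATCH"
    print(f"local rank {local_rank}: {cores} -> {status}")
```

Under that assumption, every mask decodes to exactly the core list the Python script uses for the same local rank, so the second binding repeats the first rather than changing it.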

We think setting them twice does not hurt, but it might not be pedagogically sensible to claim that the PyTorch script is portable when it is run with srun.
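
One possible way to keep the in-script binding without overriding what the launcher already did is to bind only when the process is still allowed on cores outside the wanted list. This is just a sketch for discussion: `needs_rebinding` and `maybe_set_cpu_affinity` are hypothetical names that are not part of the guide, and `os.sched_getaffinity` / `os.sched_setaffinity` are only available on Linux.

```python
import os


def needs_rebinding(current_cpus, wanted_cpus):
    """Return True if the process may run on cores outside the wanted list."""
    return not set(current_cpus) <= set(wanted_cpus)


def maybe_set_cpu_affinity(wanted_cpus):
    """Bind the current process to wanted_cpus unless it is already restricted.

    Returns True if the affinity was changed, False if it was left alone.
    """
    current = os.sched_getaffinity(0)  # Linux-only
    if needs_rebinding(current, wanted_cpus):
        os.sched_setaffinity(0, wanted_cpus)  # Linux-only
        return True
    return False
```

With the srun masks in place the current affinity would already be a subset of the wanted cores, so the function would leave it untouched; without them (for example under torchrun) it would apply the binding itself.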
