Chapter 5,7-9: Update CPU bindings when using torchrun

The CPU bindings themselves are clearly sensible, but it would be good to check whether it is wise to set them twice when launching with srun (mentioned by @mitjasai in #89):

- in the runscript
https://github.com/Lumi-supercomputer/LUMI-AI-Guide/blob/6ca88fcf2367bcb9f7e797e12d50742816ef8583/5-multi-gpu-and-node/run_ds_srun.sh#L35-L40
- in the python script
https://github.com/Lumi-supercomputer/LUMI-AI-Guide/blob/6ca88fcf2367bcb9f7e797e12d50742816ef8583/5-multi-gpu-and-node/ds_visiontransformer.py#L18-L35

We think setting them twice does no harm, but it might not be pedagogically sensible to claim that the PyTorch script is portable when used with srun.

	CPU_BIND_MASKS="0x00fe000000000000,0xfe00000000000000,0x0000000000fe0000,0x00000000fe000000,0x00000000000000fe,0x000000000000fe00,0x000000fe00000000,0x0000fe0000000000"

	export SINGULARITYENV_PREPEND_PATH=/user-software/bin # gives access to packages inside the container

	srun --cpu-bind=v,mask_cpu=$CPU_BIND_MASKS singularity run -B ../resources/ai-guide-env.sqsh:/user-software:image-src=/ \
	$SIF bash -c 'export RANK=$SLURM_PROCID; export LOCAL_RANK=$SLURM_LOCALID; python ds_visiontransformer.py --deepspeed --deepspeed_config ds_config.json'
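To make the comparison between the two bindings concrete, the masks passed to `--cpu-bind=mask_cpu` can be decoded into core lists. This is only a small sketch: the helper name `mask_to_cpus` is made up for illustration, and it assumes the usual Slurm behaviour that the n-th mask applies to the n-th task on the node (local rank n).

```python
# Masks copied from the runscript (run_ds_srun.sh).
CPU_BIND_MASKS = (
    "0x00fe000000000000,0xfe00000000000000,0x0000000000fe0000,0x00000000fe000000,"
    "0x00000000000000fe,0x000000000000fe00,0x000000fe00000000,0x0000fe0000000000"
)


def mask_to_cpus(mask):
    """Return the CPU core indices whose bits are set in a hexadecimal mask.

    Hypothetical helper, not part of the guide's code.
    """
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Assumption: the n-th mask is applied to local rank n.
srun_bindings = {
    local_rank: mask_to_cpus(mask)
    for local_rank, mask in enumerate(CPU_BIND_MASKS.split(","))
}

for local_rank, cpus in srun_bindings.items():
    print(f"local rank {local_rank}: cores {cpus}")
```

Decoded this way, each mask selects seven cores and skips the first core of its group of eight, so the result can be compared entry by entry with the mapping in the Python script.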

	def set_cpu_affinity(local_rank):
	    LUMI_GPU_CPU_map = {
	        # A mapping from GCD to the closest CPU cores in a LUMI-G node
	        # Note that CPU cores 0, 8, 16, 24, 32, 40, 48, 56 are reserved for the
	        # system and not available for the user
	        # See https://docs.lumi-supercomputer.eu/hardware/lumig/
	        0: [49, 50, 51, 52, 53, 54, 55],
	        1: [57, 58, 59, 60, 61, 62, 63],
	        2: [17, 18, 19, 20, 21, 22, 23],
	        3: [25, 26, 27, 28, 29, 30, 31],
	        4: [1, 2, 3, 4, 5, 6, 7],
	        5: [9, 10, 11, 12, 13, 14, 15],
	        6: [33, 34, 35, 36, 37, 38, 39],
	        7: [41, 42, 43, 44, 45, 46, 47],
	    }
	    cpu_list = LUMI_GPU_CPU_map[local_rank]
	    print(f"Rank {rank} (local {local_rank}) binding to cpus: {cpu_list}")
	    psutil.Process().cpu_affinity(cpu_list)
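One possible way to avoid binding twice, sketched here only as a suggestion: let the Python script set the affinity only when the launcher has not already restricted the process to the target cores. With srun and `--cpu-bind` the call would then be a no-op, while with torchrun (where the workers are not bound individually) it would still take effect. The names `plan_affinity` and `set_cpu_affinity_once` are hypothetical, and the sketch uses the standard-library `os.sched_getaffinity` / `os.sched_setaffinity` (Linux only) in place of `psutil`.

```python
import os


def plan_affinity(current_cpus, target_cpus):
    """Return the set of cores to bind to, or None if no change is needed.

    No change is needed when the process is already restricted to a subset
    of the target cores, for example by srun --cpu-bind.
    """
    current, target = set(current_cpus), set(target_cpus)
    if current <= target:
        return None
    return target


def set_cpu_affinity_once(local_rank, gpu_cpu_map):
    """Bind the current process unless a launcher has already done so.

    gpu_cpu_map is assumed to be a dict like LUMI_GPU_CPU_map above.
    Returns True if the affinity was changed, False otherwise.
    """
    target = plan_affinity(os.sched_getaffinity(0), gpu_cpu_map[local_rank])
    if target is None:
        print(f"Local rank {local_rank}: already bound, leaving affinity unchanged")
        return False
    os.sched_setaffinity(0, target)
    print(f"Local rank {local_rank}: binding to cpus {sorted(target)}")
    return True
```

Whether this is preferable to simply removing one of the two bindings is a separate question; the sketch only shows that the script can detect an existing binding instead of overriding it.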

Chapter 5,7-9: Update CPU bindings when using torchrun #95
