Tool for setting up singularity overlays with miniconda on NYU Torch - official NYU Torch docs
...because nobody likes doing it (until now)
Running singuconda will give you a magic little ./sing 🧚🏾♀️ command in your current directory that:
- autodetects GPUs and will automagically add the --nv flag
- remembers the path to your overlay and sif images so all you have to do is ./sing
- automatically sources your env file for you (the one from the tutorial)
- also creates a singrw script that mounts the overlay in read-write mode (./sing mounts with read-only so you can have multiple scripts using it)
- has full support for both interactive shells (./sing) and scripts (./sing <<< "type -P python") which will run the command and exit. This is what's used in sbatch files!
- accepts additional arguments so you can do ./sing -o /scratch/work/public/ml-datasets/coco/coco-2017.sqf:ro to mount additional overlays (for example)
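To make the list above a bit more concrete, here is a rough sketch of how a wrapper like ./sing could assemble its singularity command. This is not the script that singuconda generates: the function names, the NVIDIA_SMI override, and the example file names are all hypothetical, and the sketch only prints the command instead of running it.

```shell
# Hypothetical sketch of a ./sing-style wrapper (NOT the generated script).
# It prints the singularity command it would run instead of executing it.

# Assume a GPU is present when nvidia-smi is on the PATH.
# NVIDIA_SMI is a made-up override so the check can be tried on any machine.
has_gpu() {
    command -v "${NVIDIA_SMI:-nvidia-smi}" >/dev/null 2>&1
}

# usage: build_sing_cmd OVERLAY SIF [extra singularity args...]
build_sing_cmd() {
    overlay="$1"
    sif="$2"
    shift 2
    cmd="singularity exec"
    # add --nv only when a GPU is detected
    if has_gpu; then
        cmd="$cmd --nv"
    fi
    # mount the overlay read-only so several jobs can share it
    cmd="$cmd --overlay ${overlay}:ro"
    # pass any extra arguments (e.g. -o dataset.sqf:ro) straight through
    for arg in "$@"; do
        cmd="$cmd $arg"
    done
    echo "$cmd $sif /bin/bash"
}
```

For example, `build_sing_cmd my.ext3 image.sif -o data.sqf:ro` prints the full command line, with --nv included only on a machine where nvidia-smi is found.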
The ~/singuconda script itself:
- has autocomplete for all of the overlays and sif files
- automatically installs miniconda and lets you optionally pick a python version
ssh torch # or whatever your environment is
curl -L https://github.com/beasteers/singuconda/raw/main/singuconda --output ~/singuconda
chmod +x ~/singuconda
The singuconda command should always be run from the directory where you want your overlay and sing script to live.
But once they're created, the sing script can be run from anywhere.
Do NOT run ~/singuconda from a login node on the HPC cluster. Start an interactive job before calling the binary to avoid slow installations 🐌.
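If you want a reminder before you forget, a small guard like the one below can tell you whether your shell is inside a Slurm job. It is only a sketch: the function names are hypothetical, the srun resources in the message are placeholders, and it relies on Slurm setting SLURM_JOB_ID inside job allocations.

```shell
# Succeeds when the shell is running inside a Slurm job.
# Slurm sets SLURM_JOB_ID for processes launched inside a job allocation.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

# Hypothetical guard to call before ~/singuconda.
check_before_singuconda() {
    if in_slurm_job; then
        echo "ok: inside job ${SLURM_JOB_ID}"
    else
        # the resources below are placeholders, adjust them to your needs
        echo "stop: start an interactive job first, e.g. srun -c 2 --mem 8GB -t 1:00:00 --pty /bin/bash"
        return 1
    fi
}
```

You could then run `check_before_singuconda && ~/singuconda` so the setup only starts from inside a job.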
# cd to your projects directory
mkdir myproject
cd myproject
# make magic!
~/singuconda
The script will create some helper scripts for you:
- ./sing: run the singularity container in read-only mode - use this to run many containers at once
- ./singrw: run the singularity container in read-write mode - use this to install packages
The commands above start interactive sessions. If you want to run a script or commands in singularity (e.g. in an sbatch file), you can do this:
echo 'python script.py' | ./sing
./sing <<< 'python script.py'
./sing <<EOF
python script.py
EOF
./sing <<< "
python script.py
"
Any arguments you provide will be passed to the singularity command.
# e.g. mount squashfs files
./sing -o path/to/dataset.sqf <<< "
python train.py
"
If you do this while you're inside a git repository, you may want to ignore the generated files.
Here's a list of rules to filter them.
# the overlay file
*.ext3
# singuconda: start scripts
sing
singrw
# the singularity container associated with the overlay
.*.sifpath
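If you set up several projects, a helper like this can add those rules without duplicating them on repeat runs. It is a minimal sketch: the function name add_sing_ignores is hypothetical, and it assumes an existing .gitignore ends with a newline.

```shell
# Append each singuconda-related rule to a .gitignore only if it is missing.
# usage: add_sing_ignores [path/to/.gitignore]
add_sing_ignores() {
    gitignore="${1:-.gitignore}"
    # create the file if it does not exist yet
    touch "$gitignore"
    for rule in '*.ext3' 'sing' 'singrw' '.*.sifpath'; do
        # -x: match the whole line, -F: treat the rule as a literal string
        if ! grep -qxF -e "$rule" "$gitignore"; then
            echo "$rule" >> "$gitignore"
        fi
    done
}
```

Running `add_sing_ignores` from the repository root a second time leaves the file unchanged.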
You can customize behavior using environment variables. Set these in your ~/.bashrc
# in case you prefer to scream: SING_CMD="aagh"
export SING_CMD="sing"
# not everyone is at NYU
export SING_OVERLAY_DIR="/share/apps/overlay-fs-ext3"
export SING_SIF_DIR="/share/apps/images/"
# personal preferences
export SING_DEFAULT_OVERLAY="overlay-5GB-200K.ext3.gz"
export SING_DEFAULT_SIF="cuda12.3.2-cudnn9.0.0-ubuntu-22.04.4.sif"
~/singuconda
# if you have multiple overlays in the same directory
SING_NAME=other ./sing
To delete the sing environment, just do:
rm *.ext3 sing singrw .*.sifpath
singuconda does allow creating multiple overlays in the same directory. When you use singuconda to set up a second overlay in the same directory, it will overwrite the sing command to point to your newer overlay.
If you want to use your first overlay, you can override the overlay using SING_NAME=my-first-sing ./sing
(assuming your overlay is called my-first-sing.ext3).
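The override relies on a shell default expansion of the form SING_NAME="${SING_NAME:-name}", which is the line the rename instructions in this README tell you to edit in ./sing. Here is a tiny sketch of that pattern in isolation; the function resolve_overlay and the name my-second-sing are made up for illustration.

```shell
# Mimics the default-name pattern used in the sing script.
resolve_overlay() {
    # use SING_NAME if it is set and non-empty,
    # otherwise fall back to the name baked into the script
    name="${SING_NAME:-my-second-sing}"
    echo "${name}.ext3"
}
```

Without SING_NAME set this prints my-second-sing.ext3, and with SING_NAME=my-first-sing it prints my-first-sing.ext3, which is why prefixing the variable on the command line is enough to switch overlays.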
Yep that's gonna happen if you do that! But don't fret.
You need to rename the hidden file that contains what SIF file you want to use.
OLD_SING_NAME=my-first-sing
NEW_SING_NAME=better-name
mv ".${OLD_SING_NAME}.sifpath" ".${NEW_SING_NAME}.sifpath"
And then you're going to want to edit ./sing and ./singrw to point to your new overlay name.
Change
SING_NAME="${SING_NAME:-my-first-sing}"
to
SING_NAME="${SING_NAME:-better-name}"

Just run ~/singuconda again! It'll ask you if you want to configure an existing one or create a new one.
Well that's a bummer! But I've done that too. Unfortunately, there's not a super convenient way, but fortunately it's very easy to just start over! (which is what I always do).
If you need to, I suppose you could try creating a new overlay, then mount both overlays and try to
copy between them, but I'm not sure how to mount the second overlay
to a different directory (because afaik right now they'd both mount to /ext3).
./singrw -o my-too-small-overlay.ext3 # uh oh! collision? I should test this lol
This is just a common singularity error that happens because no other processes can be using the overlay while it's in write mode.
FATAL: while loading overlay images: failed to open overlay image ./overlay.ext3: while locking ext3 partition from /scratch/bs3639/ego2023/InstructBLIP_PEFT/blip.ext3: can't open /scratch/bs3639/ego2023/InstructBLIP_PEFT/blip.ext3 for writing, currently in use by another process
So you have to find which one of your processes is still running (background screen, tmux, sbatch, ...) and either wait for it to finish or kill it.
ps -fu $USER | grep tmux
One time, I spent an hour trying to hunt down the process and I swear I couldn't find it, so I just:
# move it out of the way
mv overlay.ext3 overlay1.ext3
# and made a copy
cp overlay1.ext3 overlay.ext3
# now the lock is on overlay1.ext3 :)

It will go through a series of prompts. What happens:
- pick an overlay file
- pick a sif file
- install miniconda, optionally with a specific python version
- adds the startup environment script (/ext3/env)
- menu to install packages in the container
- create shortcut script(s) for running the container
Then you're all done!
You can re-run it if you want to change anything (sif file, python version, installs).
env GOOS=linux GOARCH=amd64 go build .

So we have something to copy and paste from ;)
#!/bin/bash
#SBATCH -c 8
#SBATCH --mem 8GB
#SBATCH --time 8:00:00
#SBATCH --gres gpu:1
#SBATCH --job-name=myjob
#SBATCH --output logs/job.%J.out
#SBATCH --account=<YOUR_PROJECTID>
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<YOUR_USERID>@nyu.edu
../sing << EOF
python blah.py ...
EOF
And for jupyter:
#!/bin/bash
#SBATCH -c 8
#SBATCH --mem 24GB
#SBATCH --time 8:00:00
#SBATCH --gres gpu:1
#SBATCH --account=<YOUR_PROJECTID>
#SBATCH --job-name=jupyter
#SBATCH --output logs/jupyter.out
port=$(shuf -i 10000-65500 -n 1)
echo -e "\nTo access, run one of the following from your local machine:"
echo -e "ssh -L 8888:localhost:$port $USER@login.torch.hpc.nyu.edu"
echo -e "ssh -L 8888:localhost:$port torch"
echo -e "\nThen, from that shell, run:"
echo -e "ssh -L $port:localhost:$port $HOSTNAME"
echo -e "\nIn your local browser, open http://localhost:8888/"
echo -e "If asked for a token, copy the token from the URL in logs/jupyter.out\n"
./singrw << EOF
python -m ipykernel install --name sing --user
jupyter lab --no-browser --port $port
EOF
Remember that you have to open a new ssh session and forward the port. Check logs/jupyter.out for the port number and token.
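To save some squinting at the log, a helper like this can pull the port and token out of logs/jupyter.out. It is a sketch under an assumption: it expects the log to contain a URL of the form http://HOST:PORT/...token=TOKEN, which is how Jupyter usually prints it, but the exact log layout can differ between versions. The function name jupyter_port_token is hypothetical.

```shell
# Print "PORT TOKEN" from the first tokenized URL found in a Jupyter log.
# Assumes a line containing something like http://localhost:PORT/lab?token=TOKEN
# usage: jupyter_port_token [logfile]
jupyter_port_token() {
    log="${1:-logs/jupyter.out}"
    # capture the digits after the host and the alphanumeric token value
    sed -n 's|.*http://[^:/]*:\([0-9][0-9]*\)/[^ ]*token=\([0-9a-zA-Z]*\).*|\1 \2|p' "$log" | head -n 1
}
```

The first field is the port to use in the ssh -L commands printed by the job, and the second is the token to paste into the browser.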
Put your things in your home directory
ln -s /scratch/$USER ~/scratch
ln -s /vast/$USER ~/vast
ln -s /archive/$USER ~/archive
For your ~/.bashrc:
# convenience commands for watching squeue
export SQUEUEFMT='%.18i %.9P %.32j %.8u %.8T %.10M %.9l %.6D %R'
alias msq='squeue --me -o "$SQUEUEFMT"'
alias wsq='watch -n 2 "squeue --me -o \"$SQUEUEFMT\""'
alias wnv='watch -n 0.1 nvidia-smi'
# lets me know when my bashrc is sourced
[[ $- == *i* ]] && echo 'hi bea :)'

If you manage to get this fully working, please post how you did it here! #7
Set up your ssh config on your local computer like this: vim ~/.ssh/config
Host sing
User YOUR-NETID # CHANGE
HostName cs022 # YOU WILL HAVE TO CHANGE ME EVERY TIME YOU SUBMIT A JOB
RemoteCommand /path/to/sing # CHANGE
RequestTTY yes # needed for sing to work
ProxyCommand ssh greene nc %h %p 2> /dev/null
Host greene
HostName greene.hpc.nyu.edu
User YOUR-NETID
ServerAliveInterval 120
ForwardAgent yes # for git push over ssh
# greene changes their signature for some reason?? So you have to do this to avoid errors
Host greene sing
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
You can test this by setting HostName to log-1 and doing ssh sing. If all is successful, you should go straight into singularity (remember no running code on the login node).
In VSCode, open settings (CMD-",") and enable "Remote.SSH: Enable Remote Command".
# I don't need nothin fancy
srun -c 1 -t 8:0:0 --mem 8GB sleep infinity
# OR
# gimme gpu pls
srun -c 12 -t 6:0:0 --mem 64GB --gres gpu:1 sleep infinity
Get the node's name, e.g. cs022
# lets see what node I got
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
43351615 cs bash bs3639 R 1:11:40 1 cs022
On your local computer: vim ~/.ssh/config
Host sing
HostName cs022 # UPDATE
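Since the HostName has to change with every job, you might prefer to script the edit instead of opening vim each time. The function below is a sketch of one way to do that on your local machine; the name set_sing_hostname is hypothetical, it only touches the block that starts with exactly "Host sing", and it drops any trailing comment on the HostName line.

```shell
# Rewrite the HostName inside the "Host sing" block of an ssh config.
# usage: set_sing_hostname NODE [CONFIG]
set_sing_hostname() {
    node="$1"
    config="${2:-$HOME/.ssh/config}"
    tmp="$(mktemp)"
    awk -v node="$node" '
        # track whether we are inside the block that is exactly "Host sing"
        $1 == "Host" { in_block = ($2 == "sing" && NF == 2) }
        in_block && $1 == "HostName" {
            # keep the original indentation, replace only the value
            indent = $0
            sub(/[^ \t].*$/, "", indent)
            print indent "HostName " node
            next
        }
        { print }
    ' "$config" > "$tmp" && cat "$tmp" > "$config"
    rm -f "$tmp"
}
```

After a new job starts, `set_sing_hostname cs022` (with whatever node squeue reports) updates the config in place, and other blocks such as "Host greene" are left alone.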
- CMD-Shift-P to open the command palette.
- Type "Connect to Host"
- Select host "sing"
Or if you're already in a remote window (job died, you submitted another), just run "Reload Window" instead.
You can do ssh sing and end up in a singularity container just fine.
But VSCode uses ssh -T and just times out when connecting.