This is a .md (Markdown) file.
View it in a Markdown viewer (recommended) or as plain text. I recommend using VS Code.
This guide serves as a "wtf is going on" overview: it explains what is being done and why things are done a particular way.
It is meant to be a comprehensive introduction to using NSCC.
For a quick start, refer to these "how to" guides, which are meant to get jobs running (start from the first in the list):
- Python_Singularity_Guide.md (RunPythonSingularity folder) -> Guide to using the TensorFlow library and running Python files
- JupyterNoteBook_Guide.md (RunJupyterSingularity folder) -> Guide to setting up a Jupyter Notebook server
Visual Guide:
These are other useful resources online:
- Official quick start guide by NSCC, link
- NSCC User Guides
- NSCC quick reference sheet
- NSCC FAQ page
- Guide to using dgx GPUs link
NSCC owns the supercomputer, Aspire1, and lets users "borrow" its computing power.
To facilitate this, the job scheduler PBS Pro (Portable Batch System Professional) runs on Aspire1's Linux operating system.
See Portable Batch System.
Users submit "jobs": requests for resources that contain the code/commands they want to run.
These "jobs" use Bash to tell the supercomputer what to do and which files to run.
"Jobs" are submitted on the NSCC login node, where they are queued and then dispatched to an internal compute node for execution.
To connect to the Login node, we join the VPN used by NSCC, Sophos VPN.
Then we have to SSH to the Login node to connect to it.
PBS Pro has special commands to submit and manage jobs, such as qsub, qstat and qdel.
As jobs are Bash scripts, we will need to use Bash to run our code,
i.e. all our code must be runnable from the Bash command line.
We also need all our dependencies to be working on NSCC.
To enable the use of some commands, like anaconda's conda, we can add them as modules using module load.
Installing Python packages is problematic: we don't have root access, so we cannot just pip install <module>.
Since we cannot modify the system directories where packages are normally installed, we have to install them into our local (user) directory instead.
SSH is a network protocol that securely lets a user access the command line (shell) of another computer.
We use it to access the NSCC login terminal.
A popular SSH terminal program is PuTTY; we recommend using it to SSH into the login node.
It can be downloaded here
Bash is a shell: a user interface for accessing an operating system's services, in this case Linux machines.
Basically, Bash is the Windows command prompt, but for Linux (UNIX) machines.
NSCC's nodes run Linux with the Bash shell (and the PBS Pro scheduler on top).
Therefore we will need Bash commands to interact with NSCC.
Common Bash Commands:
- `ls` => List all files + folders in the current directory
- `cd` => Change directory
- `mkdir` => Makes a directory
- `rm` => Removes an item
- `echo` => Prints to console (stdout)
- `env` => List all environment variables
TIP: To get an overview of what a command does, use the man command or the --help flag.
- `man <command>` => opens the manual for the command
- `<command> --help` => brings up the help menu for the command
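As a quick hands-on warm-up, the commands above can be combined into a short session (the folder name `nscc_demo` is just an example):

```shell
mkdir -p nscc_demo                 # make a directory
cd nscc_demo                       # change into it
echo "hello from bash" > hi.txt    # echo prints to stdout; > redirects it into a file
ls                                 # list the directory contents: hi.txt
env | head -n 3                    # peek at the first few environment variables
cd ..
rm -r nscc_demo                    # remove the demo directory again
```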
NSCC uses the SFTP (Secure File Transfer Protocol) protocol to transfer files from the host computer to NSCC.
Recommended software that supports SFTP is WinSCP.
Therefore, we use WinSCP to transfer Bash scripts and other files, such as Python files, to NSCC.
PBS Pro has commands to submit, delete and track jobs.
NSCC has guides on how to use these commands,
as well as a quick reference sheet.
These command flags (directives) are placed at the top of the Bash file, after #!/bin/bash, and always start with #PBS.
```
### The following requests 1 chunk with 5 CPU cores and 1 GPU
#PBS -l select=1:ncpus=5:ngpus=1

### Specify the amount of time required
### Values less than 4 hours go into a higher priority queue
#PBS -l walltime=2:00:00

### Specify the gpu/dgx queue
#PBS -q gpu
```
Some Common Commands:
- `qsub <shell_script>` => Submit a job to the queue
- `qdel <job_id>` => Delete a running or queued job
- `qstat` => Find information about current jobs
- `qstat -f <job_id>` => Full information on a specific job

To view more info about a command, use `<command> --help`. For a list of commands, refer to the [quick reference sheet](https://help.nscc.sg/wp-content/uploads/2016/08/PBS_Professional_Quick_Reference.pdf)
Different queues are used to satisfy the resource requirements of the various workloads that run on NSCC.
If we want to use a dgx GPU vs a normal GPU, we have to send our job to a different queue.
As users, we submit jobs into an external queue depending on our needs.
Examples:
- normal => just use CPUs
- GPU => use normal GPUs
- dgx => use the special, faster dgx GPUs
Specific documentation on queues can be found on pg. 4 of the [quick start guide](https://help.nscc.sg/pbspro-quickstartguide/) under external queues
Usually, when we add libraries for Python, we enter pip install <module> or conda install <module> into the command prompt.
However, if we pip install <module> here, we get a permission error, as we end up trying to overwrite files we don't have permission for.
pip packages are usually installed somewhere under /usr, but we do not have write access to that location on NSCC.
Therefore, we have to use pip install -U -q --user <module>:
- `-U` => upgrade the package if possible
- `-q` => quiet installation, preserve time stamps
- `--user` => install into the user home directory instead of the system directory (installs into site.USER_SITE) documentation
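To see where `--user` installs actually land on a given machine, Python can print the user site directory (a quick check; the exact path depends on the Python version and OS):

```shell
# Print the per-user site-packages directory that `pip install --user` targets:
python3 -c "import site; print(site.USER_SITE)"
# e.g. /home/<user>/.local/lib/python3.X/site-packages
```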
Modules are part of the Modules package in Linux. Modules Documentation
Modules allow for dynamic modification of the user's environment ($PATH, $MANPATH) via modulefiles.
In essence, modules prepare the environment and allow us to use extra commands.
We load the modules we want with the command module load.
Example: module load anaconda to let us use the conda command.
Commonly used commands:
- `module help` => Show the help manual
- `module list` => List currently loaded modules
- `module avail` => Show available modules
- `module add/load [modulefile]` => Add/load a module
- `module remove/unload [modulefile]` => Remove/unload a module
- `module show [modulefile]` => Show what a modulefile does to the environment
- `module whatis [modulefile]` => Query what the module does
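A typical sequence might look like this (a sketch; which modulefiles exist depends on what NSCC has installed):

```
$ module avail            # see what modulefiles are available
$ module load anaconda    # adds the conda command to our environment
$ module list             # confirm the module is loaded
$ module unload anaconda  # remove it again when done
```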
It's recommended to follow the first half of the visual guide. Visual step by step guide
Download the Sophos VPN from the NSCC website.
Download an authenticator app on your phone; Sophos Authenticator is recommended.
Use the app to access the VPN network.
Log in to the NSCC login node using PuTTY.
NSCC's PBSPro Quick Start Guide can also be followed.
Use WinSCP with these settings:
- Hostname: aspire.nscc.sg
- PortNumber: 22
- User name: <your_user_name>
- Password: <leave blank, enter when prompted>
Click and drag to copy files over.
To submit a job:

```
qsub submit.pbs
```

where submit.pbs is:
```
#!/bin/bash
#PBS -q normal
#PBS -l select=1:ncpus=1:mem=100M
#PBS -l walltime=00:10:00
#PBS -N Sleep_Job
#PBS -o ~/outputfiles/Sleep_Job.o
#PBS -e ~/errorfiles/Sleep_Job.e
echo sleep job for 30 seconds
sleep 30
```
It follows the format:

```
#!/bin/bash
[#PBS directives, specifying configuration for the job]
[Rest of the commands to run]
```
To check the status of your submitted jobs, use qstat, which will list your current jobs.
To delete a job, use qdel <job_id>
TensorFlow 2.0 is a tricky module to add.
TensorFlow is GPU dependent, and module load tensorflow only supports TensorFlow 1.4.
Even worse, module load anaconda and module load tensorflow are incompatible and will raise a warning.
Even pip install tensorflow can cause errors.
The only proper way I have found so far to add GPU TensorFlow libraries is by using containers.
Containers allow for OS-level virtualisation.
Basically, a container is like a virtual machine that virtualises processes rather than the whole computer, which makes it much more efficient.
The main benefit of containers is isolation,
allowing us to package an application with all of its dependencies into a standardised unit.
In essence, we can put our setup into the container (such as installed dependencies), and it will work anywhere.
The most popular containerisation software is Docker.
However for High Performance Computing(HPC) in scientific context, Singularity is a popular option.
While both can be used with NSCC, in our guide we use Singularity.
NSCC provides ready-made containers with TensorFlow that work with their GPUs.
So we can:
- "Boot up" a container with tensorflow
- Add other dependencies as needed (pip install)
- Run our code from there.
Singularity Container Syntax: singularity exec --nv <image> /bin/bash << EOF
Example: singularity exec --nv /app/singularity/images/tensorflow/tensorflow_2.3.0_gpu_py3.simg /bin/bash << EOF
- `singularity exec` => execute a command in a singularity image
- `--nv` => specify use of the Nvidia card
- `<image>` => path of the image to use
- `/bin/bash` => run Bash inside the container
- `<< EOF` => here-document: feeds the following terminal input into /bin/bash until it sees an EOF marker
Below << EOF, we enter the commands we want to run inside the container.
The container starts at a location defined in the image. As this image is defined by NSCC, it starts at some weird nameless location.
We will need to cd into the correct directory, like so: cd "$PBS_O_WORKDIR".
$PBS_O_WORKDIR refers to the directory the job was submitted from.
Variables in the Bash shell and in the container are resolved differently when we use << EOF.
In the Bash shell, $variable will be resolved by the host shell before the container sees it.
Writing \$variable instead escapes the $ character, so the variable is not resolved in the host shell, but only when the command runs in the container.
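The difference can be demonstrated with a plain `bash` here-document standing in for the container (the quoting rules are the same; `OUTER` is just an example variable name):

```shell
OUTER="set-on-the-host"
bash << EOF
echo "host-resolved: $OUTER"          # expanded before the inner bash ever runs
echo "inner-resolved: \$BASH_VERSION" # escaped, so the inner bash expands it
EOF
```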
Jupyter Notebook lets us run a webpage on a host server machine, and by accessing the webpage, we can run python code on the machine.
The key point to note is that anyone who can access the webpage can run Python code on the machine:
connect to webpage => run Python code!
Link to documentation
So the idea is to have our job run a Jupyter Notebook Server. Then we connect from our computer to the Jupyter Notebook Server!
There are some caveats and problems to the above plan.
Our jobs run on a compute node, which is not directly exposed to the internet.
It is, however, connected to the login node (the one you connect PuTTY to), which is exposed to the internet.
So... we need to "jump" from the login node to the compute node.
Enter SSH port forwarding/tunneling, which does exactly that.
We basically SSH into the login node, then tell the login node to forward our messages to the compute node.
Syntax:

```
ssh -N -L localhost:<local_port>:<compute_node>-ib0:<notebook_port> <username>@aspire.nscc.sg
```

- `-N` => don't run any command after the SSH connection is established
- `-L` => local port forwarding
- `<local_port>` => the port on our machine the connection starts from
- `<compute_node>-ib0` => the compute node to forward to, via its ib0 (InfiniBand) interface
- `<notebook_port>` => the port the Jupyter Notebook server runs on
- `<username>@aspire.nscc.sg` => where to SSH to, and as which user
We cannot determine beforehand which compute node our job will run on (that I know of).
So we need to query the running job with `qstat -f <job_id>` and read the node name from the exec_host field.
The Jupyter Notebook server needs to verify that whoever is connecting is legitimate,
so we set a password using:
```
from notebook.auth import passwd
passwd()
```
This generates a salted hash of our password, which we copy into our script.
When we enter our password, the Jupyter Notebook server can then verify it against the hash.
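One way to use the hash when starting the server (a sketch; the flags follow the classic Jupyter Notebook server, and the hash below is a placeholder for your own):

```
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 \
    --NotebookApp.password='sha1:<your_salted_hash>'
```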
Why use a Salted Hash?
TL;DR: the password cannot be computed back from the hash, and it won't fall to rainbow table attacks.
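To illustrate the idea only (Jupyter's passwd() uses its own scheme), here is a toy salted hash in shell; the salt would normally be random, but is fixed here so the example is reproducible:

```shell
password="hunter2"   # example password
salt="a1b2c3d4"      # normally random per password; fixed here for the demo
# The stored value records the salt plus a hash of salt+password:
stored=$(printf '%s%s' "$salt" "$password" | sha256sum | cut -d' ' -f1)
echo "sha256:$salt:$stored"
# Verification recomputes the hash from the stored salt and compares:
attempt=$(printf '%s%s' "$salt" "$password" | sha256sum | cut -d' ' -f1)
[ "$attempt" = "$stored" ] && echo "password accepted"
```

Because each password gets its own salt, two users with the same password store different hashes, so a precomputed rainbow table is useless.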
If we want to export shell variables into the Singularity container,
we use SINGULARITYENV_<VAR_NAME>=<VAR_VALUE>.
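For example (a sketch; MODEL_DIR and the image path are hypothetical):

```
# On the host, prefix the variable name with SINGULARITYENV_:
export SINGULARITYENV_MODEL_DIR="$HOME/models"
# Inside the container the prefix is stripped, so $MODEL_DIR is set:
singularity exec --nv <image> /bin/bash -c 'echo "$MODEL_DIR"'
```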