NSCC-Guide

This is a .md (Markdown) file.
View it in a Markdown viewer (recommended) or as plain text. We recommend using VS Code.

This guide was written by Lim Zi Xiong and Bryan Wong Wen Ping from Singapore Polytechnic.


Prelude

This guide serves as a "wtf is going on" explainer: it tries to explain why things are done a certain way and what is actually being done.
It is meant to be comprehensive in showing how to use NSCC.

These guides are "how to" guides meant to get jobs running (start from the first in the list).
For a quick start, you can refer to these guides.

  • Python_Singularity_Guide.md (RunPythonSingularity folder) -> Guide to using the TensorFlow library and running Python files
  • JupyterNoteBook_Guide.md (RunJupyterSingularity folder) -> Guide to setting up a Jupyter Notebook server

Visual Guide:

These are other useful resources online:

So what's actually happening?

NSCC owns the supercomputer, Aspire1, and lets users "borrow" its computing power.
To facilitate this, the nodes run Linux with the PBS Pro job scheduler (see Portable Batch System).
Users submit "jobs": requests for resources that contain the code/commands they want to run.
These "jobs" use Bash to tell the supercomputer what to do and which files to run.

"Jobs" are submitted on the NSCC login node, where they are queued and dispatched to an internal compute node to be processed.
To connect to the login node, we join the VPN used by NSCC, Sophos VPN.
Then we SSH into the login node.

PBS Pro has special commands to submit and manage jobs, such as qsub, qstat, and qdel.
As jobs are Bash scripts, we will need to use Bash to run our code,
i.e. all our code must be runnable from the Bash command line.

We need all our dependencies to be working on NSCC as well.
To enable the use of some commands, such as anaconda, we can add them as modules using module load.

Installing packages is problematic: we don't have root access, so we cannot just pip install <module>.
We cannot modify the system location where packages are installed; we have to install into our local (home) directory instead.

SSH (Secure Shell)

SSH is a network protocol that securely allows a user to access the command line (shell) of another computer.

We use this to access the NSCC login terminal.
A popular SSH client is PuTTY; we recommend using it to SSH into the login node.
It can be downloaded here.

Bash (Unix Shell)

Bash is a shell: a user interface for accessing an operating system's services, in this case Linux machines.
Basically, Bash is the Windows Command Prompt, but for Linux (UNIX) machines.

NSCC's nodes run Linux with PBS Pro and use the Bash shell.
Therefore we will need Bash commands to interact with NSCC.

Common Bash Commands:

  • ls => List all files + folders in current directory
  • cd => Change directory
  • mkdir => Makes a directory
  • rm => Removes an item
  • echo => Prints to console(stdout)
  • env => List all environment variables

TIP: To get an overview of what a command does, use the man command or the --help flag.

  • man <command> => opens the manual page for the command
  • <command> --help => brings up the help text for the command
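Stringing a few of these commands together, a throwaway session might look like this (the directory and file names here are arbitrary):

```shell
mkdir -p demo            # make a directory called demo
cd demo                  # move into it
echo "hello" > note.txt  # print "hello", redirected into a file
ls                       # lists the directory contents: note.txt
cd ..                    # go back up one level
rm -r demo               # remove demo and everything in it
```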

WinSCP, Transferring Files

NSCC supports the SFTP (Secure File Transfer Protocol) protocol for transferring files from your computer to NSCC.
A recommended program that supports SFTP is WinSCP.

Therefore, we use WinSCP to transfer Bash scripts and other files, such as Python files, to NSCC.

PBS Commands

PBS Pro has commands to submit, delete, and track jobs.
NSCC has guides on how to use these commands, as well as a quick reference sheet.

These directives are placed at the top of the Bash file, after the shebang line (e.g. #!/bin/bash), and always start with #PBS.

```bash
### The following requests 1 chunk with 5 CPUs and 1 GPU
#PBS -l select=1:ncpus=5:ngpus=1

### Specify the amount of time required
### Values less than 4 hours go into a higher-priority queue
#PBS -l walltime=2:00:00

### Specify the gpu/dgx queue
#PBS -q gpu
```

Some Common Commands:

  • qsub <shell_script> => Submit a job to queue
  • qdel <job_id> => Delete a running or queued job
  • qstat => Find information about current jobs
  • qstat -f <job_id> => Full information of specific job

To view more info about a command, use `man <command>` or `<command> --help`. For a list of commands, refer to the [quick reference sheet](https://help.nscc.sg/wp-content/uploads/2016/08/PBS_Professional_Quick_Reference.pdf)

PBS Queues

Different queues are used to satisfy the resource requirements of the various workloads that run on NSCC.
If we want to use a dgx GPU vs a normal GPU, we have to send our job to different queues.

As a user, we submit jobs into an external queue depending on our needs.
Examples:

  • normal => use CPUs only
  • gpu => use normal GPUs
  • dgx => use the special, faster DGX GPUs

Specific documentation on queues can be found on pg. 4 of the [quick start guide](https://help.nscc.sg/pbspro-quickstartguide/) under external queues.

Problem with pip install as no root access

Usually, when we add libraries for Python, we enter pip install <module> or conda install <module> into the command prompt.
However, if we pip install <module> on NSCC, we get a permission error, as we end up trying to overwrite files we don't have permission to modify.
pip packages are usually installed under /usr/..., but we do not have write access to that location on NSCC.

Therefore, we have to use pip install -U -q --user <module>:

  • -U => upgrade the package if it is already installed
  • -q => quiet installation (less output)
  • --user => install to the user home directory instead of the system directory (installs into site.USER_SITE) documentation
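For example, installing (or upgrading) a package into the user site and then printing where it went (numpy here is just a placeholder package):

```shell
# Install into the user site; no root access needed
pip install -U -q --user numpy
# Print the user-site directory the package landed in
python3 -c "import site; print(site.USER_SITE)"
```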

Modules

Modules are part of the Environment Modules package on Linux. Modules Documentation
Modules allow for dynamic modification of the user's environment ($PATH, $MANPATH) via modulefiles.
In essence, modules prepare the environment and make commands available to us.

We load the modules we want with the command module load.
Example: module load anaconda lets us use the conda command.

Commonly used commands:

  • module help => Show help manual
  • module list => List currently added modules
  • module avail => Show available modules
  • module add/load [modulefile] => Adds/loads a module
  • module remove/unload [modulefile] => Removes/unloads a module
  • module show [modulefile] => Shows what the modulefile does to the environment
  • module whatis [modulefile] => Queries what the module does

Accessing NSCC Login node

Visual Guide

It's recommended to follow the first half of the visual guide. Visual step-by-step guide

Steps Taken

  1. Download the Sophos VPN client from the NSCC website.
  2. Download an authenticator app on your phone; Sophos Authenticator is recommended.
  3. Use the app to access the VPN network.
  4. Log in to the NSCC login node using PuTTY.


Run a basic Python script on NSCC as a queued job.

Quick Guide

NSCC has made an NSCC PBSPro Quick Start Guide which can be followed.

Transfer files to NSCC

Use WinSCP with these settings:

  • Host name: aspire.nscc.sg
  • Port number: 22
  • User name: <your_user_name>
  • Password: <leave blank, enter when prompted>

Click and drag to copy files over.

Submitting a Job Using a Submission Script

To submit a job:

  • qsub submit.pbs

Where submit.pbs is:

```bash
#!/bin/bash

#PBS -q normal
#PBS -l select=1:ncpus=1:mem=100M
#PBS -l walltime=00:10:00
#PBS -N Sleep_Job
#PBS -o ~/outputfiles/Sleep_Job.o
#PBS -e ~/errorfiles/Sleep_Job.e

echo sleep job for 30 seconds
sleep 30
```

Of the format:

```bash
#!/bin/bash

[#PBS directives, specifying configuration for the job]

[Rest of the commands to run]
```

Checking Status and Other Commands

To check the status of your submitted jobs, qstat will list currently running jobs.

To delete a job, use qdel <job_id>


Running Python with TensorFlow

TensorFlow 2.0 is a tricky library to add.
TensorFlow is GPU dependent, and module load tensorflow only supports TensorFlow 1.4.
Even worse, module load anaconda and module load tensorflow are incompatible and will raise a warning.
Even pip install tensorflow can cause errors.

The only proper way to add GPU TensorFlow libraries we have found so far is by using containers.

Containers

Containers allow for OS-level virtualisation.
Basically, a container is a virtual machine that virtualises processes rather than the whole computer, which makes it much more efficient.

The main benefit of containers is isolation,
allowing us to package an application with all of its dependencies into a standardised unit.
In essence, we can put our setup into the container (such as installed dependencies), and it will work anywhere.

The most popular containerisation software is Docker.
However, for High Performance Computing (HPC) in a scientific context, Singularity is a popular option.
While both can be used with NSCC, in our guide we use Singularity.

NSCC has ready-made containers with TensorFlow that work with their GPUs.
So we can:

  1. "Boot up" a container with TensorFlow
  2. Add other dependencies as needed (pip install)
  3. Run our code from there.
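Putting these three steps into a job script might look like this. This is only a sketch: the image path is the NSCC example used in the Container Syntax section, pandas is a placeholder dependency, and train.py stands in for your own code.

```bash
#!/bin/bash
#PBS -q gpu
#PBS -l select=1:ncpus=5:ngpus=1
#PBS -l walltime=2:00:00

# 1. "Boot up" the TensorFlow container, running Bash inside it
singularity exec --nv /app/singularity/images/tensorflow/tensorflow_2.3.0_gpu_py3.simg /bin/bash << EOF
cd "\$PBS_O_WORKDIR"              # the container starts elsewhere; return to the submit directory
pip install -U -q --user pandas   # 2. add other dependencies as needed
python train.py                   # 3. run our code
EOF
```

This fragment only runs under PBS with Singularity available, so treat it as a template rather than something to execute locally.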

Container Syntax

Singularity container syntax: singularity exec --nv <image> /bin/bash << EOF
Example: singularity exec --nv /app/singularity/images/tensorflow/tensorflow_2.3.0_gpu_py3.simg /bin/bash << EOF

  • singularity exec => runs a command inside a Singularity image
  • --nv => enable Nvidia GPU support inside the container
  • <image> => the image path
  • /bin/bash => the command to run: Bash
  • << EOF => here-document; feeds the following terminal input into /bin/bash until it sees an EOF line

<< EOF and Bash variable resolution

Below << EOF, we enter the commands we want to run inside the container.
The container starts at the location defined in the image. As this image is defined by NSCC, it starts in some unfamiliar default location.
We will need to cd into the correct directory, like so: cd "$PBS_O_WORKDIR".
$PBS_O_WORKDIR refers to the directory the job was submitted from.

Variables in the outer Bash shell and in the container resolve differently when we use << EOF.
In the outer shell, $variable is resolved before the container ever sees it.
Using \$variable instead escapes the $ character, so the variable is not resolved in the outer shell but only when the command runs in the container.
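A minimal demonstration you can run in any Bash shell (the inner bash invocation stands in for the container's shell):

```shell
GREETING="outer"
bash << EOF
GREETING="inner"
echo "unescaped: $GREETING"     # expanded by the outer shell before bash runs -> outer
echo "escaped:   \$GREETING"    # passed through literally, expanded inside -> inner
EOF
```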


Running a Jupyter Notebook Server

Jupyter Notebook lets us run a web page on a host machine; by accessing that web page, we can run Python code on the machine.
The key point to note is that anyone who can access the web page can run Python code on the machine:
connect to the web page => run Python code!
Link to documentation

So the idea is to have our job run a Jupyter Notebook server, then connect from our own computer to that server!

SSH Tunneling

There are some caveats and problems with the above plan.
Our job runs on a compute node, which is not directly exposed to the internet.
It is, however, connected to the login node (the one you connect PuTTY to), which is exposed to the internet.

So... we need to "jump" from the login node to the compute node.
Enter SSH port forwarding/tunneling, which does exactly that.
We basically SSH into the login node, then tell the login node to forward our traffic to the compute node.

Syntax:
ssh -N -L localhost:<local_port>:<node>-ib0:<jupyter_port> <user>@aspire.nscc.sg

  • -N => don't run any command after the SSH connection is established
  • -L => local port forwarding
  • localhost:<local_port> => start the connection from our machine, on this local port
  • <node>-ib0 => when connected, forward to the compute node's ib0 interface
  • <jupyter_port> => on this port (the port the Jupyter Notebook server runs on)
  • <user>@aspire.nscc.sg => where to SSH to, and as which user

We cannot determine beforehand which compute node our job will run on (as far as we know).
So we need to query the running job with qstat -f <job_id> and read the node name from the exec_host field.

Hashing

Jupyter Notebook needs to verify that whoever is connecting is legitimate,
so we set a password using:

```python
from notebook.auth import passwd
passwd()
```

This generates a salted hash from our password, which we copy into our script.
So when we enter our password, the Jupyter Notebook server can verify it.

Why use a salted hash?
TL;DR: the password cannot be reverse-computed from the hash, and it won't fall to rainbow-table attacks.
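As an illustration of the salt-then-hash idea using standard Linux tools (this is not Jupyter's exact scheme, which uses notebook.auth.passwd as shown above, but the principle is the same):

```shell
password="hunter2"                                  # the user's chosen password
salt=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')  # fresh random salt per password
hash=$(printf '%s%s' "$salt" "$password" | sha256sum | cut -d' ' -f1)
echo "stored on server: sha256:$salt:$hash"

# Verification: re-hash the login attempt with the stored salt and compare
attempt="hunter2"
check=$(printf '%s%s' "$salt" "$attempt" | sha256sum | cut -d' ' -f1)
[ "$check" = "$hash" ] && echo "password ok"
```

Because the salt is random, two users with the same password get different hashes, so a precomputed rainbow table is useless.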

Exporting variables

If we want to export shell variables into the Singularity container,
we use SINGULARITYENV_<VAR_NAME>=<VAR_VALUE>

About

📖 Guide to run GPU jobs on NSCC HPC resources using the "PBS Pro" job queue system
