This repository is designed to help people using the Iridis Supercomputer at the University of Southampton. Feel free to share what scripts you are using and what might be useful for other people!
- Useful external resources
- GPU availability status
- Monitor GPU usage
- Create recycle bin on Iridis
- Monitoring slurm job output
- Submitting to multiple partitions at once
- Checking information about a job
- Mounting scratch dir to your own device
- Backing up scratch directory to Onedrive
Princeton University supercomputer
Note: even if the partitions are named differently, most of the examples can be easily adapted to Iridis 5.
There are many examples with links to PyTorch implementations (e.g. distributed training).
Wiki page for Iridis 5: Iridis 5 University of Southampton
The HPC team is very active and willing to help. You can find them here: HPC teams
Slurm documentation page: Slurm wiki
Instead of your boring Mac terminal, you can use Termius. It is really neat! Termius
Check the availability of the GPU nodes (gtx1080, gpu, ecsstaff, ecsstudent).
- Nodes containing Nvidia GTX 1080 Ti and Tesla V100 GPUs are locked when a user is granted access. This means that even if the user uses only 1 out of the 4 available GPUs (e.g. on gtx1080 nodes), the others are not available to any other user.
- Nodes containing Nvidia RTX 8000 GPUs are shared, meaning that if a user is granted access to 1 out of the 2 available GPUs, the other GPU is still accessible to other users (this implies shared CPU and RAM).
- The 'ecsall' partition is a resource scavenger partition (using resources that would normally not be available). Your job could be preempted!
# Run the following script to get the availability of the GPUs
./status.sh
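The contents of status.sh are not included here, but a minimal sketch of what such a script might run, based on standard Slurm commands (the partition names are taken from the example output below):
# node status per partition; %F prints the A/I/O/T (allocated/idle/other/total) counts
sinfo -p gpu,gtx1080,ecsstaff,ecsstudent -o "%P %a %l %F %N"
# GPUs requested by running jobs on a partition (gres per job)
squeue -p ecsstaff -t RUNNING -h -o "%b"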
To make things easier, you can define an alias that runs the status.sh script.
# 1. Place the script in your $HOME folder
mv status.sh $HOME
# 2. Open the ~/.bashrc file and add the following line
vim ~/.bashrc
alias status=". $HOME/status.sh"
# 3. Source the file so the alias takes effect
. ~/.bashrc
Now in your terminal you can run:
status
# example output:
-------------------------NODE STATUS-----------------------
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
gpu up 2-12:00:00 1/7/2/10 indigo[51-60]
gtx1080 up 2-12:00:00 4/4/2/10 pink[51-60]
ecsstaff up 5-00:00:00 3/0/0/3 alpha[51-53]
ecsstudents up 12:00:00 2/1/0/3 alpha[54-56]
Note: allocated/idle/other/total
--------------------------GPU STATUS-----------------------
------------------------------------------
|PARTITION| |USED| |NR GPUS|
------------------------------------------
ecsstudent 9 12
ecsstaff 5 12
Note: gtx1080 and v100 are GPUS locked to users on the node
rtx8000 are not locked to node users
It is important to know whether your GPUs are running at full capacity or there is a CPU (data loading) bottleneck. Use the following command to see the actual GPU usage:
ssh <slurm node> # e.g. indigo51
watch -n 1 nvidia-smi
# example output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:58:00.0 Off | 0 |
| N/A 61C P0 182W / 250W | 15194MiB / 16384MiB | 96% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:D8:00.0 Off | 0 |
| N/A 56C P0 198W / 250W | 13248MiB / 16384MiB | 96% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 237303 C ...nda/envs/prose/bin/python 15190MiB |
| 1 N/A N/A 237304 C ...nda/envs/prose/bin/python 13244MiB |
+-----------------------------------------------------------------------------+
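If GPU utilization is low, the job is likely bottlenecked on the CPU side (e.g. data loading). While on the node, you can also watch your processes' CPU usage, for example with htop (assuming it is installed on the node):
ssh <slurm node> # e.g. indigo51
htop -u $USER    # per-core load plus your processes' CPU and memory usage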
Be careful using the 'rm' command as there is no way of getting those files back. Instead, define a new command that moves unwanted files to a directory on scratch.
# 1. Create the recycle-bin directory on scratch
mkdir -p /scratch/<your user name>/recycle-bin
# 2. Open the ~/.bashrc file and add the following line
vim ~/.bashrc
alias binrm='mv -t /scratch/<your user name>/recycle-bin/'
# 3. Source the file
. ~/.bashrc
Now in your terminal you can run:
binrm <unwanted file>
Get into the habit of using this command from now on. If you make a mistake, the recycle-bin directory will keep your files for a while (before Iridis removes them).
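The recycle bin can grow indefinitely, so it is worth emptying it from time to time. A minimal sketch, assuming a 30-day retention period and the same recycle-bin path as above:
# permanently delete anything in the recycle bin older than 30 days
find /scratch/<your user name>/recycle-bin/ -mindepth 1 -mtime +30 -delete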
If you want to monitor the output of your job in real time, you can run the 'tail' command every second and keep displaying the most recent 50 lines. This can be useful for faster debugging.
# Add the following line in ~/.bashrc
alias analyse="watch -n 1 tail -n 50"
# example use:
analyse slurm-1294659.out
or
tail -f slurm-1294659.out
If you want your job to run as quickly as possible on any partition, regardless of GPU memory, you can submit to multiple partitions at once. The job will run on the first partition that has available resources.
In your slurm script add the following line:
#SBATCH --partition=gtx1080,ecsall,ecsstaff,gpu
Optionally, you can remove 'ecsall' to avoid preemption. A full submission script using this header is sketched below.
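For reference, a minimal submission script using this header might look as follows; the resource requests and job command are placeholders, not Iridis-specific values:
#!/bin/bash
#SBATCH --partition=gtx1080,ecsall,ecsstaff,gpu  # first partition with free resources wins
#SBATCH --gres=gpu:1     # placeholder: one GPU
#SBATCH --time=12:00:00  # placeholder: wall-clock limit
#SBATCH --mem=30G        # placeholder: RAM request

# activate your environment and start your job (placeholders)
python train.py
Submit it with sbatch <script name>.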
If you don't remember what a submitted job was for, run the following:
scontrol show job <jobid>
# example output:
JobId=3107069 JobName=flip-all-gpu
UserId=ii1g17(81851) GroupId=fp(245) MCS_label=N/A
Priority=3504 Nice=0 Account=ecsstaff QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:11:37 TimeLimit=1-06:00:00 TimeMin=N/A
SubmitTime=2023-05-02T10:43:20 EligibleTime=2023-05-02T10:43:20
AccrueTime=2023-05-02T10:43:20
StartTime=2023-05-02T10:43:26 EndTime=2023-05-03T16:43:26 Deadline=N/A
PreemptEligibleTime=2023-05-02T10:43:26 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-02T10:43:26 Scheduler=Main
Partition=ecsall AllocNode:Sid=cyan51:257724
ReqNodeList=(null) ExcNodeList=(null)
NodeList=alpha54
BatchHost=alpha54
NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=30G,node=1,billing=8,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=8 MinMemoryNode=30G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/mainfs/home/ii1g17/protein-embeddings/proemb/iridis-scripts/flip/1080ti.sh --flip_data_path=/scratch/ii1g17/protein-embeddings/data/FLIP --model_path=iridis-scripts/saved_models/multitask/v100-6gpu/3096239/iter_240000_checkpoint.pt --remote=True --split=one_vs_many
WorkDir=/mainfs/home/ii1g17/protein-embeddings/proemb/iridis-scripts/flip
StdErr=/mainfs/home/ii1g17/protein-embeddings/proemb/iridis-scripts/flip/slurm-3107069.out
StdIn=/dev/null
StdOut=/mainfs/home/ii1g17/protein-embeddings/proemb/iridis-scripts/flip/slurm-3107069.out
Power=
TresPerNode=gres:gpu:1
This way you can see the actual command and parameters with which the job was submitted.
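Note that scontrol show job only works while the job is still in Slurm's active records; once a job has aged out, you can query the accounting database instead (assuming accounting is enabled on the cluster):
# summary of a finished job from the Slurm accounting database
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS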
If you need to work with data that resides on the /scratch partition, you can mount the folder on your own device.
On macOS, install: sshfs
In your own device terminal type:
sshfs <username>@iridis5_a.soton.ac.uk:/scratch/<username>/ <path to where you want to mount the folder>
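For example (the local mount-point path is just an assumption):
# create a local mount point and mount the remote scratch directory
mkdir -p ~/iridis-scratch
sshfs <username>@iridis5_a.soton.ac.uk:/scratch/<username>/ ~/iridis-scratch
# unmount when you are done
umount ~/iridis-scratch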
The data in the scratch directory is not backed up, but in most cases you have to use that partition to store parts of your work. You can sync that directory with the OneDrive storage provided by the university (5TB). First you need to mount OneDrive on your device so that it appears as a directory.
In your own device terminal type:
rsync -avz --stats --progress <username>@iridis5_a.soton.ac.uk:/scratch/<username> <the directory of your OneDrive>
This will upload everything to OneDrive. You will have to re-run it to sync new files; you can automate this with a crontab job, as sketched below. If something is deleted from your scratch directory, you can sync it back from the OneDrive copy.
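A minimal crontab sketch, assuming password-less SSH (key-based authentication) so rsync can run unattended, and a OneDrive mount at $HOME/OneDrive (both assumptions):
# open your crontab for editing
crontab -e
# add one line: run the backup every night at 02:00
0 2 * * * rsync -avz <username>@iridis5_a.soton.ac.uk:/scratch/<username> $HOME/OneDrive
To restore files, swap the source and destination of the rsync command.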