Unless you're lucky enough to have a dedicated cluster that is completely under your control, chances are that you will have to use SLURM to timeshare the GPUs with others. And often, if you train at an HPC center, you will still have to use SLURM even when you're given a dedicated partition.
This document will not try to teach you SLURM as there are many manuals out there, but we will cover some specific nuances that are useful to help in the training process.
One of the most important Unix tools is crontab, which is essential for scheduling various jobs. However, it is usually absent from SLURM environments. Therefore one must emulate it. Here is how.
For this presentation we are going to use $WORK/cron/ as the base directory. We also assume that you have an exported environment variable WORK pointing to some location on your filesystem. If you use Bash you can set it up in your ~/.bash_profile; if you use a different shell, use its equivalent startup file.
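As a minimal sketch, assuming Bash, the following line could go into ~/.bash_profile. The location $HOME/work is purely hypothetical; substitute whatever work directory your cluster assigns you.

```shell
# Hypothetical location; replace with the work directory used on your cluster.
export WORK="$HOME/work"
```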
We will use the $WORK/cron/scheduler dir for scheduler jobs, $WORK/cron/cron.daily for daily jobs and $WORK/cron/cron.hourly for hourly jobs:
$ mkdir -p $WORK/cron/scheduler
$ mkdir -p $WORK/cron/cron.daily
$ mkdir -p $WORK/cron/cron.hourly
Now copy the two SLURM scripts, cron-daily.slurm and cron-hourly.slurm, into $WORK/cron/scheduler after editing them to fit your specific environment's account and partition information.
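To show how such a scheduler script can self-perpetuate, here is a minimal sketch of what an hourly scheduler might look like. It is an illustration under stated assumptions, not necessarily identical to the scripts referred to above. The helper names resubmit_self and submit_jobs are hypothetical, and the partition and account values are placeholders. It relies on sbatch's --begin option to delay the next run by one hour.

```shell
#!/bin/bash
#SBATCH --job-name=cron-hourly    # job name
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=0:30:00            # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out        # output file name
#SBATCH --partition=PARTITION     # edit me
#SBATCH --account=GROUP@PARTITION # edit me

# No `set -e` here: one failed submission should not stop the remaining ones.

SCHEDULER_DIR="$WORK/cron/scheduler"
JOBS_DIR="$WORK/cron/cron.hourly"

# Resubmit this script first, so the chain survives problems further down.
resubmit_self() {
    (cd "$SCHEDULER_DIR" && sbatch --begin=now+1hour cron-hourly.slurm)
}

# Submit every *.slurm script found in the given directory.
submit_jobs() {
    local dir="$1" f
    for f in "$dir"/*.slurm; do
        [ -e "$f" ] || continue   # the glob matched nothing
        (cd "$dir" && sbatch "$f")
    done
}

# Only act where SLURM is available.
if command -v sbatch >/dev/null 2>&1; then
    resubmit_self
    submit_jobs "$JOBS_DIR"
fi
```

A daily variant would differ only in the job name, a delay such as --begin=now+24hours, and the directory it scans. Resubmitting before launching the other jobs is a deliberate choice: if anything later in the script fails, the next run is already queued.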
Now you can launch the crontab scheduler jobs:
$ cd $WORK/cron/scheduler
$ sbatch cron-hourly.slurm
$ sbatch cron-daily.slurm
This is it. These jobs will now self-perpetuate, and usually you don't need to think about them again unless there is an event that makes SLURM lose all its jobs.
Now whenever you want some job to run once a day, you simply create a SLURM job and put it into the $WORK/cron/cron.daily dir.
Here is an example job that runs daily to update the mlocate file index:
$ cat $WORK/cron/cron.daily/mlocate-update.slurm
#!/bin/bash
#SBATCH --job-name=mlocate-update # job name
#SBATCH --ntasks=1 # number of MP tasks
#SBATCH --nodes=1
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --time=1:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --partition=PARTITION # edit me
#SBATCH --account=GROUP@PARTITION # edit me
set -e
date
echo "updating mlocate db"
/usr/bin/updatedb -o $WORK/lib/mlocate/work.db -U $WORK --require-visibility 0
This builds an index of the files under $WORK which you can then quickly query with:
/usr/bin/locate -d $WORK/lib/mlocate/work.db pattern
To stop running this job, just move it out of the $WORK/cron/cron.daily dir.
The same principle applies to jobs placed into the $WORK/cron/cron.hourly dir. These are useful for running something every hour.
Please note that this crontab implementation is approximate timing-wise: due to various delays in SLURM scheduling, the jobs will run approximately every hour and every day. You can recode these scripts to ask SLURM to start something at a more precise time if you have to, but most of the time the method just presented works fine.
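If tighter timing matters, one option is to pass sbatch's --begin option an absolute timestamp instead of a relative delay, so that late starts don't accumulate into drift. The sketch below assumes GNU date is available, and the helper name next_hour_begin is hypothetical. SLURM treats the timestamp only as the earliest possible start, so queue delays can still push the actual start later.

```shell
# Print the start of the next full hour as YYYY-MM-DDTHH:00:00 (requires GNU date).
next_hour_begin() {
    date -d '+1 hour' '+%Y-%m-%dT%H:00:00'
}

# Example use inside a scheduler script, instead of a relative delay:
#   sbatch --begin="$(next_hour_begin)" cron-hourly.slurm
```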
Additionally, you can code your own variations to meet specific needs of your project, e.g., every-30min or every-12h jobs.
Finally, since every cron launcher job will leave behind a log file (which is useful if for some reason things don't work), you will want to create a cron job to clean up these logs. Otherwise you may run out of inodes: these log files are tiny, but there could be tens of thousands of them.
You could use something like this in a daily job:
find $WORK/cron -name "*.out" -mtime +7 -exec rm -f {} +
Please note that it's set to only delete files that are older than 7 days, in case you need the latest logs for diagnostics.
The scheduler runs with the Unix permissions of the person who launched the SLURM cron scheduler job, and so do all other SLURM scripts launched by that cron job.
SLURM runs on Unix, but surprisingly its designers haven't adopted the concept of group ownership with regard to SLURM jobs. So if a member of your team started an array of 10 jobs of 20h each and went on vacation, then unless you have sudo access you can't do anything to stop those jobs if something is wrong.
I have yet to find out why this is so, but so far we have been using a kill switch workaround. You have to code it in your framework. For example, see how it was implemented in Megatron-Deepspeed (Meg-DS). The program polls a path on the filesystem that is configured at startup, and if it finds a file there, it exits.
So if we start Meg-DS with --kill-switch-path $WORK/tmp/training17-kill-switch and then at any point we need to kill the SLURM job, we simply do:
touch $WORK/tmp/training17-kill-switch
and the next time the program gets to check for this file it'll detect the event and will exit voluntarily. If you have a job array, well, you will have to wait until each job starts, detects the kill switch and exits.
Of course, don't forget to remove it when you're done stopping the jobs:
rm $WORK/tmp/training17-kill-switch
Now, this doesn't always work. If the job is hanging, it'll never get to the point of checking for the kill switch, and the only solution is to contact the sysadmins to kill the job for you. In simple cases of hanging, PyTorch's distributed setup will typically auto-exit after its preset 30min timeout, but that doesn't always work either.
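The polling pattern behind the kill switch is simple enough to sketch in a few lines of shell. This is only an illustration of the idea, not Meg-DS's actual implementation, which lives inside its own training loop. The function name run_with_kill_switch and the notion of a per-step command are hypothetical.

```shell
# Run up to max_steps iterations of a step command, checking for the kill
# switch file before each one.
# Returns 0 if all steps ran, 1 if the kill switch was found,
# and 2 if a step itself failed.
run_with_kill_switch() {
    local kill_switch_path="$1" max_steps="$2" step=1
    shift 2
    while [ "$step" -le "$max_steps" ]; do
        if [ -e "$kill_switch_path" ]; then
            echo "kill switch detected at $kill_switch_path, exiting before step $step"
            return 1
        fi
        # The remaining arguments form the step command; it receives the
        # step number as its last argument.
        "$@" "$step" || return 2
        step=$((step + 1))
    done
    return 0
}

# Example use, where train_step is a hypothetical stand-in for one unit of work:
#   train_step() { echo "running step $1"; }
#   run_with_kill_switch "$WORK/tmp/training17-kill-switch" 1000 train_step
```

Because the check happens only between steps, a step that hangs never reaches it, which is exactly the limitation described above.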