Submit Batch Jobs¶
Job submission and control using SLURM¶
After logging in, you will be on a login node. These nodes are intended only for setting up and starting your jobs. On a supercomputer, you do not run your program directly. Instead, you write a job script containing the resources you need and the commands to execute, and submit it to a queue via the Slurm workload manager with the sbatch command, for example:
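sbatch <jobscript>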
DelftBlue uses the Slurm workload manager for the submission, control and management of user jobs. Slurm provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. The most frequently used commands with the batch system are the following three:
- sbatch - submit a batch script
- squeue - check the status of jobs on the system
- scancel - cancel a job and delete it from the queue
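For example, to check your own jobs and cancel one of them (the job ID below is a placeholder):
squeue --me
scancel <job_id>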
Note
Try the squeue --me command. If it returns strange-looking errors, try to load the slurm module by hand:
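module load slurm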
More info on the issue: see below.
Furthermore, the list of queues and partitions is available by typing sinfo or scontrol show partition, and past jobs saved in the Slurm database can be inspected with the sacct command: please have a look at man sacct for more information.
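For example, to inspect one of your past jobs (the job ID is a placeholder, and the listed fields are only a small selection of what sacct can report):
sacct -j <job_id> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS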
An appropriate Slurm job submission file for your parallel job is a shell script with a set of directives at the beginning: these directives are issued by starting a line with the string #SBATCH (as a note for PBS batch system users, this is the Slurm equivalent of #PBS). A suitable batch script is then submitted to the batch system using the sbatch command.
A basic Slurm batch script can be written by adding just the --ntasks and --time directives, but extra directives will give you more control over how your job is run.
Note
Slurm manages CPU, GPU, memory, and runtime allocation. This means that Slurm will enforce the amounts that you request. Defaults have been deliberately set low, so make sure you explicitly request reasonable amounts of resources in your submission script!
Specifically, the following parameters must be set:
- Which partition do I need? The following partitions are available:
    - compute or compute-p1 for CPU jobs on Phase 1 compute nodes (48 CPUs and 185 GB of RAM per node)
    - compute-p2 for CPU jobs on Phase 2 compute nodes (64 CPUs and 250 GB of RAM per node)
    - gpu or gpu-v100 for GPU jobs on Phase 1 GPU nodes (4x V100S cards with 32 GB of video RAM each per node)
    - gpu-a100 for GPU jobs on Phase 2 GPU nodes (4x A100 cards with 80 GB of video RAM each per node)
    - gpu-a100-small for small GPU jobs, running no longer than 4 hours, requiring no more than 1 GPU with a maximum of 10 GB of video RAM, and no more than 2 CPU cores
    - memory for CPU jobs with high RAM requirements (more than 250 GB per node)
    - visual for visualization purposes
    For example, to request the compute partition, put the following line in your submission script:
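    #SBATCH --partition=compute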
- How long do I expect my job to run? Here is an example for a job requesting 4 hours:
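    #SBATCH --time=04:00:00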
Note
The maximum running time of a job is 120h for the "research" accounts (48h on GPU-nodes) and 24h for the "education" accounts! This limit is in place to ensure that all users can have a reasonable waiting time in the queue.
Note
The shorter your job is, the higher the chance it has to start quickly! There are special preferential scheduling rules in place for a) jobs shorter than 4 hours, and b) jobs shorter than 24 hours, respectively.
What happens when my job exceeds the allocated time?
The --time=<time> flag sets a limit on the total run time of the job allocation. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL. The delay between these two signals is 30 seconds. If your program needs more time to terminate, or needs extra signalling (some programs won't terminate on the first SIGTERM signal), then you can arrange that through the --signal parameter, e.g.:
#SBATCH --signal=15@60
This sets signal 15 (SIGTERM) to be sent 60 seconds before the end of the allocated job time, which gives your program some extra time to finish more cleanly.
- Number of tasks and number of CPUs per task needed, for example:
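    #SBATCH --ntasks=16
    #SBATCH --cpus-per-task=1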
Important
How do you know which resources to request? The answer to this question will depend on whether your software can run in parallel, and which exact type of parallelization it uses. Please refer to this excellent guide written by the IT Service Office of the University of Bern for some fundamentals. All information there applies equally to DelftBlue.
- Amount of memory per CPU needed (very important to avoid "out-of-memory" errors!), for example:
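    #SBATCH --mem-per-cpu=1G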
- Which account should my CPU hours be allocated to? (More info on accounting and shares.) For example:
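    #SBATCH --account=research-<faculty>-<department>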
Important
srun is the Slurm version of mpirun. srun is the command to use to start a task in a job.
You can find some specific examples of submission scripts below:
Serial (one core) job¶
Although the cluster's main use case scenario is highly parallel jobs, it is possible to also run single-core jobs in the queue. If you need to run a single-core job, please DO NOT request the full node! Instead, request one CPU. The Slurm queuing system is smart, and it will allocate your job in such a way that other users can still use the rest of the node.
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
srun ./executable.x
'MPI only' (OpenMPI) job¶
Using 16 cores for one hour. The parallel job is invoked with srun. Note that there is no need in principle to pass any arguments, like for example -np in the case of a locally executed mpirun.
Important
Check if your application can actually scale properly up to the number of CPUs you are requesting! Requesting fewer CPUs per job has two advantages: 1) your waiting time in the queue will be shorter; 2) other users can share the same node with you, if a sufficient amount of resources is available.
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load openmpi
srun ./executable.x
If the OpenMPI job seems to be "stalling", it might be due to incorrect binding of the processes to physical CPU cores. Please submit a job interactively, and try to monitor your processes with top -u <NetID>. If stalling is suspected, try to resubmit your job with the following srun flag: srun --cpu-bind=none.
Intel MPI job¶
After loading intel/oneapi-all, you get not only the Intel compilers and libraries, but also Intel's own implementation of MPI. The current setup of the intel module requires explicitly telling Intel MPI to use the Slurm PMI library (so that it can correctly bind Intel MPI threads to CPU cores). Make sure you export this library variable in your submission script before invoking your binary with srun:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load intel/oneapi-all
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/current/lib64/libpmi2.so
srun ./executable
OpenMP job¶
Using 1 node, 1 task, 8 threads:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./executable
Hybrid MPI+OpenMP job¶
Using 2 nodes, 4 MPI processes per node (8 MPI processes in total), with each MPI process using 12 OpenMP threads (2 * 4 * 12 = 96 cores, or 2 full compute nodes, in total):
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load openmpi
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./executable
GPU job¶
Note
GPU jobs are limited to a maximum of 48h for the research and project shares, and 24h for the education and innovation shares.
Using one NVIDIA V100S GPU on a node for four hours, 8 threads and no MPI:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load cuda/11.6
srun ./executable
Sometimes, the following MPI error might occur:
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: gpu010
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3409449 on node gpu010 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------
This can be fixed by adding the following srun flag and resubmitting the job: srun --mpi=pmix.
It is also possible to measure the GPU usage of your program by adding the following lines to your sbatch script:
#!/bin/bash
#
#SBATCH --job-name="job_name"
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load cuda/11.6
previous=$(/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/tail -n '+2')
srun ./executable
/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/grep -v -F "$previous"
Running this script will give you the output of your program (./executable
) followed by something like this:
gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]
98 %, 80 %, 26214 MiB, 77332 ms
Note
nvidia-smi can create some very nice reports! Please check https://developer.nvidia.com/nvidia-system-management-interface and click on "nvidia-smi documentation", which brings you to the man page of nvidia-smi.
High memory job¶
There are six nodes available in DelftBlue with 750 GB, and four with 1.5 TB of RAM. To use them, use the 'memory' partition. For example, if you need 250 GB of memory, specify the resource requirement like this:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=memory
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=250G
#SBATCH --account=research-<faculty>-<department>
srun ./executable
Controlling the output file name¶
The output of your script will be put by default into the file slurm-<SLURM_JOB_ID>.out, where <SLURM_JOB_ID> is the Slurm batch job number of your job. The standard error will be put into a file called slurm-<SLURM_JOB_ID>.err; both files will be found in the directory from which you launched the job.
Note that the output file is created when your job starts running, and the output from your job is placed in this file as the job runs, so that you can monitor your job's progress. Therefore, do not delete this file while your job is running, or else you will lose your output. Please keep in mind that Slurm performs file buffering by default when writing to the output files: output is held in a buffer until the buffer is flushed to the output file. If you want the output to be written to file immediately (at the cost of additional CPU, network and storage overhead), you should pass the option --unbuffered to the srun command: the output will then appear in the file as soon as it is produced.
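For example:
srun --unbuffered ./executable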
If you wish to change the default names of the output and error files, you can use the --output and --error directives in the batch script that you submit using the sbatch command. See the example below (the file names are just an illustration; the %j pattern is replaced by the job ID):
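#SBATCH --output=my_job.%j.out
#SBATCH --error=my_job.%j.err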
Exclusive use¶
Please note that nodes are shared with other users by default. Normally, we recommend that you only request the number of CPUs you need and let the Slurm scheduler decide which node to allocate them to. However, sometimes you might want to request a full node exclusively. This can be done with the --exclusive directive, as follows:
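#SBATCH --exclusive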
This will allocate the entire node, including all of its memory, to your job.