Submit Jobs¶
Job submission and control using SLURM¶
After logging in, you will be on a login node. These nodes are intended only for setting up and starting your jobs. On a supercomputer, you do not run your program directly. Instead, you write a job script containing the resources you need and the commands to execute, and submit it to a queue via the Slurm workload manager with the sbatch command.
DelftBlue uses the Slurm workload manager for the submission, control and management of user jobs. Slurm provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. The most frequently used commands with the batch system are the following three:
sbatch - submit a batch script
squeue - check the status of jobs on the system
scancel - cancel a job and delete it from the queue
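As a quick sketch of the round trip with these three commands (myjob.sh and the job ID are placeholders, and a Slurm installation is assumed):

```shell
sbatch myjob.sh        # submit the batch script; Slurm replies "Submitted batch job <jobid>"
squeue --me            # check the status of your own jobs in the queue
scancel <jobid>        # cancel a job, using the ID reported by sbatch or squeue
```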
Note
Try the squeue --me command. If it returns strange-looking errors, try loading the slurm module by hand:

module load slurm

More info on the issue: see the "Error messages and solutions" section below.
Furthermore, the list of queues and partitions is available by typing sinfo or scontrol show partition, and past jobs saved in the Slurm database can be inspected with the sacct command; see man sacct for more information.
An appropriate Slurm job submission file for your parallel job is a shell script with a set of directives at the beginning. These directives are issued by starting a line with the string #SBATCH (as a note for PBS batch system users, this is the Slurm equivalent of #PBS). A suitable batch script is then submitted to the batch system using the sbatch command.

A basic Slurm batch script can be written by just adding the --ntasks and --time directives, but extra directives will give you more control over how your job is run.
Note
Slurm manages CPU, GPU, memory, and runtime allocation. This means that Slurm will enforce the amounts that you request, and defaults have been deliberately set low. So make sure you explicitly request reasonable amounts of resources in your submission script!
Specifically, the following parameters must be set:
- Which partition do I need? The following partitions are available:

  compute or compute-p1 - CPU jobs on Phase 1 compute nodes (48 CPUs and 192 GB of RAM per node)
  compute-p2 - CPU jobs on Phase 2 compute nodes (64 CPUs and 256 GB of RAM per node)
  gpu or gpu-v100 - GPU jobs on Phase 1 GPU nodes (4x V100S cards with 32 GB of video RAM each per node)
  gpu-a100 - GPU jobs on Phase 2 GPU nodes (4x A100 cards with 80 GB of video RAM each per node)
  memory - CPU jobs with high RAM requirements
  trans - data transfer (not fully configured yet)
  visual - visualization purposes

  For example, to request the compute partition, put the following line in your submission script:

  #SBATCH --partition=compute
- How long do I expect my job to run? Here is an example for a job requesting 4 hours:

  #SBATCH --time=04:00:00
Note
The maximum running time of a job is 120h for the "research" accounts and 24h for the "education" accounts! This limit is in place to ensure that all users can have a reasonable waiting time in the queue.
What happens when my job exceeds the allocated time? The --time=<time> flag sets a limit on the total run time of the job allocation. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL, with a delay of 30 seconds between the two signals. If your program needs more time to terminate, or needs extra signalling (some programs won't terminate on the first SIGTERM signal), you can arrange that through the --signal parameter, e.g.:

#SBATCH --signal=15@60

This sets signal 15 (SIGTERM) to be sent 60 seconds before the end of the allocated job time, which gives your program some extra time to finish cleanly.
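Building on --signal, a batch script can also catch the early SIGTERM itself and rescue partial results before SIGKILL arrives. The sketch below is an illustration only (the executable name and rescue step are placeholders); note that the B: prefix is needed so that the signal reaches the batch shell, where the trap is set:

```shell
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --signal=B:TERM@60      # send SIGTERM to the batch shell 60 s before the limit

# On SIGTERM, save what we can before Slurm follows up with SIGKILL.
cleanup() {
    echo "time limit approaching, saving checkpoint" >&2
    # cp -r partial-results "/scratch/${USER}/"   # placeholder rescue step
    exit 0
}
trap cleanup TERM

# Run the payload in the background so the shell stays free to handle the
# signal; wait returns as soon as the trap fires or the program finishes.
srun ./executable.x &
wait $!
```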
- Number of tasks / number of CPUs per task needed, for example:

  #SBATCH --ntasks=16
  #SBATCH --cpus-per-task=1
Important
How do you know which resources to request? The answer to this question will depend on whether your software can run in parallel, and which exact type of parallelization it uses. Please refer to this excellent guide written by the IT Service Office of the University of Bern for some fundamentals. All information there applies equally to DelftBlue.
- Amount of memory per CPU needed (very important to avoid "out-of-memory" errors!), for example:

  #SBATCH --mem-per-cpu=1G
- Which account should my CPU hours be allocated to? (More info on accounting and shares.) For example:

  #SBATCH --account=research-<faculty>-<department>
Important
srun is the Slurm version of mpirun: it is the command to use to start a task in a job.
You can find some specific examples of submission scripts below:
Serial (one core) job¶
Although the cluster's main use case is highly parallel jobs, it is possible to also run single-core jobs in the queue. If you need to run a single-core job, please DO NOT request the full node! Instead, request one CPU. The Slurm queuing system is smart, and it will place your job in such a way that other users can still use the rest of the node.
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
srun ./executable.x
'MPI only' (OpenMPI) job¶
Using 16 cores for one hour. The parallel job is invoked with srun. Note that, in principle, there is no need to pass any arguments such as -np, as with a locally executed mpirun.
Important
Check whether your application can actually scale properly up to the number of CPUs you are requesting! Requesting fewer CPUs per job has two advantages: 1) your waiting time in the queue will be shorter; 2) other users can share the same node with you if sufficient resources are available.
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load openmpi
srun ./executable.x
If the OpenMPI job seems to be "stalling", it might be due to incorrect binding of the processes to physical CPU cores. Please submit a job interactively, and try to monitor your processes with top -u <NetID>. If stalling is suspected, try to resubmit your job with the following srun flag:

srun --cpu-bind=none
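To find out how far your application actually scales, one option is a small scaling sweep: submit the same script with increasing task counts and compare elapsed times afterwards with sacct. A sketch, assuming a jobscript.sh (a placeholder name) that adapts to whatever --ntasks it is given:

```shell
# Submit the same job with 1, 2, 4, 8 and 16 tasks to measure scaling.
for n in 1 2 4 8 16; do
    sbatch --ntasks="$n" --job-name="scale_${n}" jobscript.sh
done

# Once the jobs have finished, compare the elapsed times, e.g.:
# sacct -u $USER --format=JobName,AllocCPUS,Elapsed
```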
Intel MPI job¶
After loading intel/oneapi-all, you not only get the Intel compilers and libraries, but also Intel's own implementation of MPI. The current setup of the intel module requires explicitly telling Intel MPI to use the Slurm PMI library (so that it can correctly bind Intel MPI threads to CPU cores). Make sure you export this library variable in your submission script before invoking your binary with srun:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load intel/oneapi-all
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/current/lib64/libpmi2.so
srun ./executable
OpenMP job¶
Using 1 node, 1 task, 8 threads:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./executable
hybrid MPI+OpenMP job¶
Using 2 nodes, 4 MPI processes per node (8 MPI processes in total), each of the MPI processes is using 12 OpenMP threads per task (2 * 4 * 12 = 96 cores, or 2 full compute nodes, in total):
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load openmpi
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./executable
GPU job¶
Using one NVidia V100S GPU on a node for four hours, 8 threads and no MPI:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load cuda/11.6
srun ./executable
Sometimes, the following MPI error might occur:
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: gpu010
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3409449 on node gpu010 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------
This can be fixed by adding the following srun flag and resubmitting the job: srun --mpi=pmix.
It is also possible to measure the GPU usage of your program by adding the following lines to your sbatch
script:
#!/bin/bash
#
#SBATCH --job-name="job_name"
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load cuda/11.6
previous=$(/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/tail -n '+2')
srun ./executable
/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/grep -v -F "$previous"
Running this script will give you the output of your program (./executable
) followed by something like this:
gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]
0 %, 0 %, 561 MiB, 77332 ms
Note
nvidia-smi can create some very nice reports! Please check https://developer.nvidia.com/nvidia-system-management-interface and click on "nvidia-smi documentation", which brings you to the man page of nvidia-smi.
High memory job¶
There are six nodes available in DelftBlue with 750 GB of RAM, and four with 1.5 TB. To use them, use the 'memory' partition. For example, if you need about 256 GB of memory, specify the resource requirement like this:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=memory
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=250G
#SBATCH --account=research-<faculty>-<department>
srun ./executable
File transfer job¶
Warning
The file transfer nodes are not fully configured yet! The recipe below should work in principle, but might not work just yet!!!
You can schedule jobs on dedicated file transfer nodes before and/or after a compute job is run. For this, use the 'transfer' partition. For example, to transfer the results of job 1234 to the TU Delft Project storage, submit the transfer job with a dependency on job 1234:

sbatch --dependency=afterok:1234 transfer-script.slurm

where the transfer-script.slurm script could look like this (rsync is explained here):
#!/bin/sh
#
#SBATCH --job-name="transfer_results"
#SBATCH --partition=trans
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --account=research-<faculty>-<department>
results=(
'sim-output'
'parameter-files'
'output.txt'
'err.txt'
)
source="/scratch/${USER}/MySimulation"
destination='/tudelft.net/staff-umbrella/MyProject/DelftBlueResults/'
for result in "${results[@]}"
do
rsync -av --no-perms "${source}/${result}" "${destination}"
done
Controlling the output file name¶
The output of your script will by default be put into the file slurm-<SLURM_JOB_ID>.out, where <SLURM_JOB_ID> is the Slurm batch job number of your job. The standard error will be put into a file called slurm-<SLURM_JOB_ID>.err. Both files will be found in the directory from which you launched the job.
Note that the output file is created when your job starts running, and the output from your job is placed in this file as the job runs, so that you can monitor your job's progress. Therefore do not delete this file while your job is running or else you will lose your output. Please keep in mind that Slurm performs file buffering by default when writing to the output files: output is held in a buffer until the buffer is flushed to the output file. If you want the output to be written to file immediately (at the cost of additional CPU, network and storage overhead), you should pass the option --unbuffered
to the srun
command: the output will then appear in the file as soon as it is produced.
If you wish to change the default names of the output and error files, you can use the --output and --error directives in the batch script that you submit using the sbatch command. See the example below, where %j is replaced by the job ID:

#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
Using visualization nodes¶
Follow the instructions in this document on how to use the visualization-nodes.
Interactive use¶
If you need to use a node of DelftBlue interactively, e.g. for running a debugger, you can issue the following command to get a terminal on one of the nodes:

srun <your-sbatch-commands> --pty bash

where <your-sbatch-commands> are the #SBATCH lines from a regular submission script, passed as command-line options.
For example, to run an interactive (multi-threaded) job with eight cores for 30 minutes on a CPU node:
srun --job-name="int_job" --partition=compute --time=00:30:00 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=1GB --pty bash
Accessing allocated nodes¶
A common use case of interactive sessions is logging in on the nodes where your batch job is running, e.g., to check the resource usage using 'top', or attach a debugger. This is possible by adding the --overlap
and --jobid=<JOBID>
flags to the above srun
command, where <JOBID>
can be found in the first column of squeue --me
.
In order to login on a specific node of the allocation, also add the --nodelist=<node>
option.
For example, if my job with JOBID=11111 is running on node cmp111 of DelftBlue, I can connect to this node by issuing the following command:

srun --jobid=11111 --overlap --pty bash

This should bring me to the node cmp111, where my original job is running.
If my job with JOBID=22222 is running on several nodes of DelftBlue, for example cmp111 and cmp112, I can connect to a specific node by issuing the following command:

srun --jobid=22222 --overlap --nodelist=cmp112 --pty bash

This should bring me to the node cmp112, one of the two nodes where my original job is running.
Interactive GPU jobs¶
To run an interactive job on a GPU node:
srun --mpi=pmix --job-name="int_gpu_job" --partition=gpu --time=01:00:00 --ntasks=1 --cpus-per-task=1 --gpus-per-task=1 --mem-per-cpu=4G --account=research-faculty-department --pty /bin/bash -il
Interactive MPI jobs¶
To run an interactive MPI job on 16 cores (possibly scheduled across multiple nodes):
srun --job-name="int_mpi_job" --partition=compute --time=00:30:00 --ntasks=16 --cpus-per-task=1 --mem-per-cpu=1GB --account=research-faculty-department --pty bash -il
Exclusive use¶
Please note that nodes are shared with other users by default. Normally, we recommend that you only request the number of CPUs you need, and let the Slurm scheduler decide which nodes to allocate them on. However, sometimes you might want to request a full node exclusively. This can be done as follows:

#SBATCH --exclusive

This will allocate the entire node, including all of its memory, to your job.
Error messages and solutions¶
Check if the Slurm module is loaded!¶
Slurm is now available after login by default, without loading any additional modules.
How to load slurm module by hand
During the beta phase it is possible that Slurm needs to be explicitly loaded as a module. If you get the following error, it means that Slurm is not loaded (or not enabled by default):
[<netid>@login01 ~]$ squeue --me
squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
If this is the case, make sure that Slurm is loaded (currently, from the vendor-installed software stack /cm/local/modulefiles modules):

module load slurm

Now, it should work:

squeue --me
Slurm out-of-memory (OOM) error¶
You might encounter the following error when submitting jobs via slurm:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1170.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
You need to set the --mem-per-cpu value in the submission script. This value is the amount of memory in MB that slurm allocates per allocated CPU, and it defaults to 1 MB. If your job's memory use exceeds this, the job gets killed with an OOM error message. Set this value to a reasonable amount, i.e. the expected memory use plus a little headroom.
Example: add the following line to the submission script:

#SBATCH --mem-per-cpu=1GB

which allocates 1 GB per CPU.
Job management and statistics¶
When will my job start?¶
You can use the --start flag of the squeue command to see an estimate of when your pending job will start:
[dpalagin@login02 ~]$ squeue --me --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
3162383 compute pending_1 dpalagin PD 2023-04-01T07:03:01 1 (null) (Priority)
3162383 compute pending_2 dpalagin PD 2023-04-01T07:03:01 1 (null) (Priority)
Which jobs did I run recently?¶
You can use the sacct
command to show all job information starting from a specific date:
[NetID@login02 ~]$ sacct -S 2022-12-01 -u $USER
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1853039 vnc-1147 visual innovation 2 FAILED 1:0
1853039.bat+ batch innovation 2 FAILED 1:0
1853040 vnc-1149 visual innovation 2 FAILED 1:0
1853040.bat+ batch innovation 2 FAILED 1:0
1853041 vnc-1150 visual innovation 2 FAILED 1:0
1853041.bat+ batch innovation 2 FAILED 1:0
1853042 vnc-1151 visual innovation 2 TIMEOUT 0:0
1853042.bat+ batch innovation 2 CANCELLED 0:15
1853751 all-OH compute research-+ 48 COMPLETED 0:0
1853751.bat+ batch research-+ 25 COMPLETED 0:0
1853751.0 aims.2107+ research-+ 48 COMPLETED 0:0
1853752 fewer-OH compute research-+ 48 COMPLETED 0:0
1853752.bat+ batch research-+ 12 COMPLETED 0:0
1853752.0 aims.2107+ research-+ 48 COMPLETED 0:0
1853754 no-OH compute research-+ 48 COMPLETED 0:0
1853754.bat+ batch research-+ 28 COMPLETED 0:0
1853754.0 aims.2107+ research-+ 48 COMPLETED 0:0
1856064 water compute research-+ 4 COMPLETED 0:0
1856064.bat+ batch research-+ 4 COMPLETED 0:0
1856064.0 aims.2107+ research-+ 4 COMPLETED 0:0
How do I check information on a specific job?¶
You can use the seff
command to show the job details:
[NetID@login02 ~]$ seff 1853751
Job ID: 1853751
Cluster: delftblue
User/Group: dpalagin/domain users
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 22-15:49:25
CPU Efficiency: 98.96% of 22-21:32:00 core-walltime
Job Wall-clock time: 11:26:55
Memory Utilized: 20.83 GB (estimated maximum)
Memory Efficiency: 21.70% of 96.00 GB (2.00 GB/core)
How do I get detailed information on a specific pending job?¶
You can use the scontrol
command to show the full job details:
[NetID@login02 ~]$ scontrol show jobid -dd 2475625
JobId=2475625 JobName=01_hello
UserId=dpalagin(588559) GroupId=domain users(100513) MCS_label=N/A
Priority=22962179 Nice=0 Account=research-eemcs-diam QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2023-06-27T14:48:42 EligibleTime=2023-06-27T14:48:42
AccrueTime=2023-06-27T14:48:42
StartTime=2023-06-29T06:40:07 EndTime=2023-06-29T06:50:07 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-27T16:04:42 Scheduler=Backfill:*
Partition=compute AllocNode:Sid=login02:878519
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=1G,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/helloworld-test2.sh
WorkDir=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld
StdErr=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
StdIn=/dev/null
StdOut=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
Power=