Submit Jobs¶
Job submission and control using SLURM¶
After logging in, you will be on a login node. These nodes are intended only for setting up and starting your jobs. On a supercomputer, you do not run your program directly. Instead, you write a job script containing the resources you need and the commands to execute, and submit it to a queue via the Slurm workload manager with the sbatch command.
DelftBlue uses the Slurm workload manager for the submission, control and management of user jobs. Slurm provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. The most frequently used commands with the batch system are the following three:
sbatch - submit a batch script
squeue - check the status of jobs on the system
scancel - cancel a job and delete it from the queue
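As a quick sketch of the round trip with these three commands (myjob.sh and the job ID are placeholders, and a Slurm installation is assumed):

```shell
sbatch myjob.sh        # submit the batch script; Slurm replies "Submitted batch job <jobid>"
squeue --me            # check the status of your own jobs in the queue
scancel <jobid>        # cancel a job, using the ID reported by sbatch or squeue
```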
Note
Try the squeue --me command. If it returns strange-looking errors, try loading the slurm module by hand:

module load slurm

More info on the issue: see the "Error messages and solutions" section below.
Furthermore, the list of queues and partitions is available by typing sinfo or scontrol show partition, and past jobs saved in the Slurm database can be inspected with the sacct command; see man sacct for more information.
An appropriate Slurm job submission file for your parallel job is a shell script with a set of directives at the beginning. These directives are issued by starting a line with the string #SBATCH (as a note for PBS batch system users, this is the Slurm equivalent of #PBS). A suitable batch script is then submitted to the batch system using the sbatch command.

A basic Slurm batch script can be written by just adding the --ntasks and --time directives, but extra directives will give you more control over how your job is run.
Note
Slurm manages CPU, GPU, memory, and runtime allocation. This means that Slurm will enforce the amounts that you request, and defaults have been deliberately set low. So make sure you explicitly request reasonable amounts of resources in your submission script!
Specifically, the following parameters must be set:
- Which partition do I need? The following partitions are available:

  compute or compute-p1 - CPU jobs on Phase 1 compute nodes (48 CPUs and 192 GB of RAM per node)
  compute-p2 - CPU jobs on Phase 2 compute nodes (64 CPUs and 256 GB of RAM per node)
  gpu or gpu-v100 - GPU jobs on Phase 1 GPU nodes (4x V100S cards with 32 GB of video RAM each per node)
  gpu-a100 - GPU jobs on Phase 2 GPU nodes (4x A100 cards with 80 GB of video RAM each per node)
  memory - CPU jobs with high RAM requirements
  trans - data transfer (not fully configured yet)
  visual - visualization purposes

  For example, to request the compute partition, put the following line in your submission script:

  #SBATCH --partition=compute
- How long do I expect my job to run? Here is an example for a job requesting 4 hours:

  #SBATCH --time=04:00:00
Note
The maximum running time of a job is 120h for the "research" accounts and 24h for the "education" accounts! This limit is in place to ensure that all users can have a reasonable waiting time in the queue.
What happens when my job exceeds the allocated time? The --time=<time> flag sets a limit on the total run time of the job allocation. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL, with a delay of 30 seconds between the two signals. If your program needs more time to terminate, or needs extra signalling (some programs won't terminate on the first SIGTERM signal), you can arrange that through the --signal parameter, e.g.:

#SBATCH --signal=15@60

This sets signal 15 (SIGTERM) to be sent 60 seconds before the end of the allocated job time, which gives your program some extra time to finish cleanly.
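Building on --signal, a batch script can also catch the early SIGTERM itself and rescue partial results before SIGKILL arrives. The sketch below is an illustration only (the executable name and rescue step are placeholders); note that the B: prefix is needed so that the signal reaches the batch shell, where the trap is set:

```shell
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --signal=B:TERM@60      # send SIGTERM to the batch shell 60 s before the limit

# On SIGTERM, save what we can before Slurm follows up with SIGKILL.
cleanup() {
    echo "time limit approaching, saving checkpoint" >&2
    # cp -r partial-results "/scratch/${USER}/"   # placeholder rescue step
    exit 0
}
trap cleanup TERM

# Run the payload in the background so the shell stays free to handle the
# signal; wait returns as soon as the trap fires or the program finishes.
srun ./executable.x &
wait $!
```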
- Number of tasks / number of CPUs per task needed, for example:

  #SBATCH --ntasks=16
  #SBATCH --cpus-per-task=1
Important
How do you know which resources to request? The answer to this question will depend on whether your software can run in parallel, and which exact type of parallelization it uses. Please refer to this excellent guide written by the IT Service Office of the University of Bern for some fundamentals. All information there applies equally to DelftBlue.
- Amount of memory per CPU needed (very important to avoid "out-of-memory" errors!), for example:

  #SBATCH --mem-per-cpu=1G
- Which account should my CPU hours be allocated to? (More info on accounting and shares.) For example:

  #SBATCH --account=research-<faculty>-<department>
Important
srun is the Slurm version of mpirun: it is the command to use to start a task in a job.
You can find some specific examples of submission scripts below:
Serial (one core) job¶
Although the cluster's main use case is highly parallel jobs, it is possible to also run single-core jobs in the queue. If you need to run a single-core job, please DO NOT request the full node! Instead, request one CPU. The Slurm queuing system is smart, and it will place your job in such a way that other users can still use the rest of the node.
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
srun ./executable.x
'MPI only' (OpenMPI) job¶
Using 16 cores for one hour. The parallel job is invoked with srun. Note that, in principle, there is no need to pass any arguments such as -np, as with a locally executed mpirun.
Important
Check whether your application can actually scale properly up to the number of CPUs you are requesting! Requesting fewer CPUs per job has two advantages: 1) your waiting time in the queue will be shorter; 2) other users can share the same node with you if sufficient resources are available.
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load openmpi
srun ./executable.x
If the OpenMPI job seems to be "stalling", it might be due to incorrect binding of the processes to physical CPU cores. Please submit a job interactively, and try to monitor your processes with top -u <NetID>. If stalling is suspected, try to resubmit your job with the following srun flag:

srun --cpu-bind=none
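To find out how far your application actually scales, one option is a small scaling sweep: submit the same script with increasing task counts and compare elapsed times afterwards with sacct. A sketch, assuming a jobscript.sh (a placeholder name) that adapts to whatever --ntasks it is given:

```shell
# Submit the same job with 1, 2, 4, 8 and 16 tasks to measure scaling.
for n in 1 2 4 8 16; do
    sbatch --ntasks="$n" --job-name="scale_${n}" jobscript.sh
done

# Once the jobs have finished, compare the elapsed times, e.g.:
# sacct -u $USER --format=JobName,AllocCPUS,Elapsed
```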
Intel MPI job¶
After loading intel/oneapi-all, you not only get the Intel compilers and libraries, but also Intel's own implementation of MPI. The current setup of the intel module requires explicitly telling Intel MPI to use the Slurm PMI library (so that it can correctly bind Intel MPI threads to CPU cores). Make sure you export this library variable in your submission script before invoking your binary with srun:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load intel/oneapi-all
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/current/lib64/libpmi2.so
srun ./executable
OpenMP job¶
Using 1 node, 1 task, 8 threads:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./executable
hybrid MPI+OpenMP job¶
Using 2 nodes, 4 MPI processes per node (8 MPI processes in total), each of the MPI processes is using 12 OpenMP threads per task (2 * 4 * 12 = 96 cores, or 2 full compute nodes, in total):
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load openmpi
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./executable
GPU job¶
Using one NVidia V100S GPU on a node for four hours, 8 threads and no MPI:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load cuda/11.6
srun ./executable
Sometimes, the following MPI error might occur:
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: gpu010
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3409449 on node gpu010 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------
This can be fixed by adding the following srun flag and resubmitting the job: srun --mpi=pmix.
It is also possible to measure the GPU usage of your program by adding the following lines to your sbatch
script:
#!/bin/bash
#
#SBATCH --job-name="job_name"
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>
module load 2023r1
module load cuda/11.6
previous=$(/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/tail -n '+2')
srun ./executable
/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/grep -v -F "$previous"
Running this script will give you the output of your program (./executable
) followed by something like this:
gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]
0 %, 0 %, 561 MiB, 77332 ms
Note
nvidia-smi can create some very nice reports! Please check https://developer.nvidia.com/nvidia-system-management-interface and click on "nvidia-smi documentation", which brings you to the man page of nvidia-smi.
High memory job¶
There are six nodes available in DelftBlue with 750 GB of RAM, and four with 1.5 TB. To use them, use the 'memory' partition. For example, if you need about 256 GB of memory, specify the resource requirement like this:
#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=memory
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=250G
#SBATCH --account=research-<faculty>-<department>
srun ./executable
File transfer job¶
Warning
The file transfer nodes are not fully configured yet! The recipe below should work in principle, but might not work just yet!!!
You can schedule jobs on dedicated file transfer nodes before and/or after a compute job is run. For this, use the 'transfer' partition. For example, to transfer the results of job 1234 to the TU Delft Project storage, submit the transfer job with a dependency on job 1234:

sbatch --dependency=afterok:1234 transfer-script.slurm

where the transfer-script.slurm script could look like this (rsync is explained here):
#!/bin/sh
#
#SBATCH --job-name="transfer_results"
#SBATCH --partition=trans
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --account=research-<faculty>-<department>
results=(
'sim-output'
'parameter-files'
'output.txt'
'err.txt'
)
source="/scratch/${USER}/MySimulation"
destination='/tudelft.net/staff-umbrella/MyProject/DelftBlueResults/'
for result in "${results[@]}"
do
rsync -av --no-perms "${source}/${result}" "${destination}"
done
Controlling the output file name¶
The output of your script will by default be put into the file slurm-<SLURM_JOB_ID>.out, where <SLURM_JOB_ID> is the Slurm batch job number of your job. The standard error will be put into a file called slurm-<SLURM_JOB_ID>.err. Both files will be found in the directory from which you launched the job.
Note that the output file is created when your job starts running, and the output from your job is placed in this file as the job runs, so that you can monitor your job's progress. Therefore do not delete this file while your job is running or else you will lose your output. Please keep in mind that Slurm performs file buffering by default when writing to the output files: output is held in a buffer until the buffer is flushed to the output file. If you want the output to be written to file immediately (at the cost of additional CPU, network and storage overhead), you should pass the option --unbuffered
to the srun
command: the output will then appear in the file as soon as it is produced.
If you wish to change the default names of the output and error files, you can use the --output and --error directives in the batch script that you submit using the sbatch command. See the example below, where %j is replaced by the job ID:

#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
Using visualization nodes¶
Follow the instructions in this document on how to use the visualization-nodes.
Interactive use¶
If you need to use a node of DelftBlue interactively, e.g. for running a debugger, you can issue the following command to get a terminal on one of the nodes:

srun <your-sbatch-commands> --pty bash

where <your-sbatch-commands> are the #SBATCH lines from a regular submission script, passed as command-line options.
For example, to run an interactive (multi-threaded) job with eight cores for 30 minutes on a CPU node:
srun --job-name="int_job" --partition=compute --time=00:30:00 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=1GB --pty bash
Accessing allocated nodes¶
A common use case of interactive sessions is logging in on the nodes where your batch job is running, e.g., to check the resource usage using 'top', or attach a debugger. This is possible by adding the --overlap
and --jobid=<JOBID>
flags to the above srun
command, where <JOBID>
can be found in the first column of squeue --me
.
In order to login on a specific node of the allocation, also add the --nodelist=<node>
option.
For example, if my job with JOBID=11111 is running on node cmp111 of DelftBlue, I can connect to this node by issuing the following command:

srun --jobid=11111 --overlap --pty bash

This should bring me to the node cmp111, where my original job is running.
If my job with JOBID=22222 is running on several nodes of DelftBlue, for example cmp111 and cmp112, I can connect to a specific node by issuing the following command:

srun --jobid=22222 --overlap --nodelist=cmp112 --pty bash

This should bring me to the node cmp112, one of the two nodes where my original job is running.
Interactive GPU jobs¶
To run an interactive job on a GPU node:
srun --mpi=pmix --job-name="int_gpu_job" --partition=gpu --time=01:00:00 --ntasks=1 --cpus-per-task=1 --gpus-per-task=1 --mem-per-cpu=4G --account=research-faculty-department --pty /bin/bash -il
Interactive MPI jobs¶
To run an interactive MPI job on 16 cores (possibly scheduled across multiple nodes):
srun --job-name="int_mpi_job" --partition=compute --time=00:30:00 --ntasks=16 --cpus-per-task=1 --mem-per-cpu=1GB --account=research-faculty-department --pty bash -il
Exclusive use¶
Please note that nodes are shared with other users by default. Normally, we recommend that you only request the number of CPUs you need, and let the Slurm scheduler decide which nodes to allocate them on. However, sometimes you might want to request a full node exclusively. This can be done as follows:

#SBATCH --exclusive

This will allocate the entire node, including all of its memory, to your job.
Error messages and solutions¶
Check if the Slurm module is loaded!¶
Slurm is now available after login by default, without loading any additional modules.
How to load slurm module by hand
During the beta phase it is possible that Slurm needs to be explicitly loaded as a module. If you get the following error, it means that Slurm is not loaded (or not enabled by default):
[<netid>@login01 ~]$ squeue --me
squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
If this is the case, make sure that Slurm is loaded (currently, from the vendor-installed software stack /cm/local/modulefiles modules):

module load slurm

Now, it should work:

squeue --me
Slurm out-of-memory (OOM) error¶
You might encounter the following error when submitting jobs via slurm:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1170.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
You need to set the --mem-per-cpu value in the submission script. This value is the amount of memory in MB that slurm allocates per allocated CPU, and it defaults to 1 MB. If your job's memory use exceeds this, the job gets killed with an OOM error message. Set this value to a reasonable amount, i.e. the expected memory use plus a little headroom.
Example: add the following line to the submission script:

#SBATCH --mem-per-cpu=1GB

which allocates 1 GB per CPU.
Job management and statistics¶
When will my job start?¶
You can use the --start flag of the squeue command to see an estimate of when your pending job will start:
[dpalagin@login02 ~]$ squeue --me --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
3162383 compute pending_1 dpalagin PD 2023-04-01T07:03:01 1 (null) (Priority)
3162383 compute pending_2 dpalagin PD 2023-04-01T07:03:01 1 (null) (Priority)
Which jobs did I run recently?¶
You can use the sacct
command to show all job information starting from a specific date:
[NetID@login02 ~]$ sacct -S 2022-12-01 -u $USER
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1853039 vnc-1147 visual innovation 2 FAILED 1:0
1853039.bat+ batch innovation 2 FAILED 1:0
1853040 vnc-1149 visual innovation 2 FAILED 1:0
1853040.bat+ batch innovation 2 FAILED 1:0
1853041 vnc-1150 visual innovation 2 FAILED 1:0
1853041.bat+ batch innovation 2 FAILED 1:0
1853042 vnc-1151 visual innovation 2 TIMEOUT 0:0
1853042.bat+ batch innovation 2 CANCELLED 0:15
1853751 all-OH compute research-+ 48 COMPLETED 0:0
1853751.bat+ batch research-+ 25 COMPLETED 0:0
1853751.0 aims.2107+ research-+ 48 COMPLETED 0:0
1853752 fewer-OH compute research-+ 48 COMPLETED 0:0
1853752.bat+ batch research-+ 12 COMPLETED 0:0
1853752.0 aims.2107+ research-+ 48 COMPLETED 0:0
1853754 no-OH compute research-+ 48 COMPLETED 0:0
1853754.bat+ batch research-+ 28 COMPLETED 0:0
1853754.0 aims.2107+ research-+ 48 COMPLETED 0:0
1856064 water compute research-+ 4 COMPLETED 0:0
1856064.bat+ batch research-+ 4 COMPLETED 0:0
1856064.0 aims.2107+ research-+ 4 COMPLETED 0:0
How do I check information on a specific job?¶
You can use the seff
command to show the job details:
[NetID@login02 ~]$ seff 1853751
Job ID: 1853751
Cluster: delftblue
User/Group: dpalagin/domain users
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 22-15:49:25
CPU Efficiency: 98.96% of 22-21:32:00 core-walltime
Job Wall-clock time: 11:26:55
Memory Utilized: 20.83 GB (estimated maximum)
Memory Efficiency: 21.70% of 96.00 GB (2.00 GB/core)
How do I get detailed information on a specific pending job?¶
You can use the scontrol
command to show the full job details:
[NetID@login02 ~]$ scontrol show jobid -dd 2475625
JobId=2475625 JobName=01_hello
UserId=dpalagin(588559) GroupId=domain users(100513) MCS_label=N/A
Priority=22962179 Nice=0 Account=research-eemcs-diam QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2023-06-27T14:48:42 EligibleTime=2023-06-27T14:48:42
AccrueTime=2023-06-27T14:48:42
StartTime=2023-06-29T06:40:07 EndTime=2023-06-29T06:50:07 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-27T16:04:42 Scheduler=Backfill:*
Partition=compute AllocNode:Sid=login02:878519
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=1G,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/helloworld-test2.sh
WorkDir=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld
StdErr=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
StdIn=/dev/null
StdOut=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
Power=