
Submit Jobs

Job submission and control using SLURM

After logging in, you will be on a login node. These nodes are intended only for setting up and starting your jobs. On a supercomputer, you do not run your program directly. Instead, you write a job script containing the resources you need and the commands to execute, and submit it to a queue via the Slurm workload manager with the sbatch command:

sbatch name-of-your-submission-script.sh

DelftBlue uses the Slurm workload manager for the submission, control and management of user jobs. Slurm provides a rich set of features for organizing your workload and an extensive array of tools for managing your resource usage. The most frequently used commands with the batch system are the following three:

  • sbatch - submit a batch script
  • squeue - check the status of jobs on the system
  • scancel - cancel a job and delete it from the queue
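For example, a typical cycle looks like this (the script name and job ID are just placeholders):

sbatch my-job.sh     # submit the script; Slurm replies with the job ID
squeue --me          # check the state of all your queued and running jobs
scancel 1234567      # cancel the job with the given ID, if needed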

Note

Try the squeue --me command. If it returns strange-looking errors, try loading the slurm module by hand:

module load slurm

More info on the issue: see below.

Furthermore, the list of queues and partitions is available by typing sinfo or scontrol show partition, and past jobs saved in the Slurm database can be inspected with the sacct command; please have a look at man sacct for more information.
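For example (the start date is only an illustration):

sinfo                                   # list partitions and the state of their nodes
scontrol show partition compute         # detailed settings of a single partition
sacct --starttime=2024-01-01 -u $USER   # your past jobs since the given date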

An appropriate Slurm job submission file for your parallel job is a shell script with a set of directives at the beginning: these directives are issued by starting a line with the string #SBATCH (as a note for PBS batch system users, this is the Slurm equivalent of #PBS). A suitable batch script is then submitted to the batch system using the sbatch command.

A basic Slurm batch script can be written by adding just the --ntasks and --time directives, but extra directives will give you more control over how your job is run.
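As a minimal sketch (the executable name is a placeholder), such a basic script could look like this:

#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

srun ./executable.x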

Note

Slurm manages CPU, GPU, memory and runtime allocation. This means that Slurm will enforce the amounts that you request. The defaults have been deliberately set low, so make sure you explicitly request reasonable amounts of resources in your submission script!

Specifically, the following parameters must be set:

  • Which partition do I need? The following partitions are available:

    • compute or compute-p1 for CPU jobs on Phase 1 compute nodes (48 CPUs and 192 GB of RAM per node)
    • compute-p2 for CPU jobs on Phase 2 compute nodes (64 CPUs and 256 GB of RAM per node)
    • gpu or gpu-v100 for GPU jobs on Phase 1 gpu nodes (4x V100S cards with 32 GB of video RAM each per node)
    • gpu-a100 for GPU jobs on Phase 2 gpu nodes (4x A100 cards with 80 GB of video RAM each per node)
    • memory for CPU jobs with high RAM requirements
    • trans for data transfer (not fully configured yet)
    • visual for visualization purposes

    For example, to request the compute partition, put the following line in your submission script:

    #SBATCH --partition=compute
    
  • How long do I expect my job to run? Here is an example for a job requesting 4 hours:

    #SBATCH --time=04:00:00
    

    Note

    The maximum running time of a job is 120h for the "research" accounts and 24h for the "education" accounts! This limit is in place to ensure that all users can have a reasonable waiting time in the queue.

    What happens when my job exceeds the allocated time?

    The --time=<time> flag sets a limit on the total run time of the job allocation. When the time limit is reached, each task in each job step is sent SIGTERM, followed by SIGKILL. The delay between these two signals is 30 seconds.

    If your program needs more time to terminate, or needs extra signalling (some programs won't terminate on the first SIGTERM signal), then you can arrange that through the --signal parameter, e.g.:

    #SBATCH --signal=15@60

    This sets signal 15 (SIGTERM) to be sent 60 seconds before the end of the allocated job time, which gives your program some extra time to finish more cleanly. A sketch of how the batch script itself can react to such a signal is shown after this list.

  • Number of tasks/Number of CPUs per task needed:

    #SBATCH --ntasks=16
    #SBATCH --cpus-per-task=1
    

Important

How do you know which resources to request? The answer to this question will depend on whether your software can run in parallel, and which exact type of parallelization it uses. Please refer to this excellent guide written by the IT Service Office of the University of Bern for some fundamentals. All information there applies equally to DelftBlue.

  • Amount of memory per CPU needed (Very important to avoid "out-of-memory" errors!):

    #SBATCH --mem-per-cpu=1G
    

  • Which account should my CPU hours be allocated to? (More info on accounting and shares.)

    #SBATCH --account=research-<faculty>-<department>
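As mentioned in the --time section above, the batch script itself can react to the time-limit warning signal. Below is a minimal sketch, not an official recipe: the B: prefix asks Slurm to deliver the signal to the batch shell rather than to the job steps, and the trap then runs a cleanup command. The checkpoint path and the cleanup command are hypothetical and depend entirely on your application:

#!/bin/sh
#SBATCH --time=04:00:00
#SBATCH --signal=B:15@60   # deliver SIGTERM to the batch shell 60 s before the time limit

# Hypothetical cleanup: copy intermediate results to a safe location when the warning arrives.
trap 'echo "Time limit approaching, saving results"; cp -r ./checkpoint "$HOME/results/"' TERM

srun ./executable.x &      # run the step in the background ...
wait                       # ... so the shell can run the trap when the signal arrives
wait                       # wait again so the step can still finish (or be killed at the limit)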
    

Important

srun is the Slurm version of mpirun: it is the command to use to start a task in a job.

You can find some specific examples of submission scripts below:

Serial (one core) job

Although the cluster's main use case scenario is highly parallel jobs, it is possible to also run single-core jobs in the queue. If you need to run a single-core job, please DO NOT request the full node! Instead, request one CPU. The Slurm queuing system is smart, and it will allocate your job in such a way that other users can still use the rest of the node.

#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>

srun ./executable.x  

'MPI only' (OpenMPI) job

Using 16 cores for one hour. The parallel job is invoked with srun. Note that, in principle, there is no need to pass any arguments such as -np, as you would for a locally executed mpirun.

Important

Check if your application can actually scale properly up to the number of CPUs you are requesting! Running fewer CPUs per job has two advantages: 1) your waiting time in the queue will be shorter; 2) other users can share the same node with you, if a sufficient amount of resources is available.

#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>

module load 2023r1
module load openmpi

srun ./executable.x  

If the OpenMPI job seems to be "stalling", it might be due to incorrect binding of the processes to physical CPU cores. Please submit a job interactively, and try to monitor your processes with top -u <NetID>. If stalling is suspected, try to resubmit your job with the following srun flag: srun --cpu-bind=none.
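For example, the last line of the OpenMPI script above would then become:

srun --cpu-bind=none ./executable.x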

Intel MPI job

After loading intel/oneapi-all, you get not only the Intel compilers and libraries, but also Intel's own implementation of MPI. The current setup of the intel module requires explicitly telling Intel MPI to use the Slurm PMI library (so that it can correctly bind Intel MPI threads to CPU cores). Make sure you export this library variable in your submission script before invoking your binary with srun:

#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>

module load 2023r1
module load intel/oneapi-all

export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/current/lib64/libpmi2.so

srun ./executable

OpenMP job

Using 1 node, 1 task, 8 threads:

#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./executable

Hybrid MPI+OpenMP job

Using 2 nodes and 4 MPI processes per node (8 MPI processes in total), where each MPI process uses 12 OpenMP threads (2 * 4 * 12 = 96 cores, or 2 full compute nodes, in total):

#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>

module load 2023r1
module load openmpi

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./executable

GPU job

Using one NVIDIA V100S GPU on a node for four hours, with 8 threads and no MPI:

#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>

module load 2023r1
module load cuda/11.6

srun ./executable

Sometimes, the following MPI error might occur:

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu010
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3409449 on node gpu010 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------

This can be fixed by adding the following srun flag and resubmitting the job: srun --mpi=pmix.
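In practice this means changing the srun line of the GPU script above, for example:

srun --mpi=pmix ./executable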

It is also possible to measure the GPU usage of your program by adding the following lines to your sbatch script:

#!/bin/bash
#
#SBATCH --job-name="job_name"
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --account=research-<faculty>-<department>

module load 2023r1 
module load cuda/11.6

previous=$(/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/tail -n '+2')

srun ./executable

/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/grep -v -F "$previous"

Running this script will give you the output of your program (./executable) followed by something like this:

gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]
0 %, 0 %, 561 MiB, 77332 ms

Note

nvidia-smi can create some very nice reports! Please check https://developer.nvidia.com/nvidia-system-management-interface and click on "nvidia-smi documentation", which brings you to the nvidia-smi man page.

High memory job

There are six nodes available in DelftBlue with 750 GB of RAM, and four with 1.5 TB. To use them, use the 'memory' partition. For example, if you need 250 GB of memory, specify the resource requirement like this:

#!/bin/sh
#
#SBATCH --job-name="job_name"
#SBATCH --partition=memory
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=250G
#SBATCH --account=research-<faculty>-<department>

srun ./executable

File transfer job

Warning

The file transfer nodes are not fully configured yet! The recipe below should work in principle, but might not work just yet!!!

You can schedule jobs on dedicated file transfer nodes before and/or after a compute job is run. For this, use the 'trans' partition. For example, to transfer the results of job 1234 to the TU Delft Project storage, submit the transfer job with a dependency on job 1234:

sbatch --dependency=afterok:1234 transfer_script.slurm

where the transfer_script.slurm script could look like this (rsync is explained here):

#!/bin/sh
#
#SBATCH --job-name="transfer_results"
#SBATCH --partition=trans
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --account=research-<faculty>-<department>

results=(
 'sim-output' 
 'parameter-files'
 'output.txt'
 'err.txt'
)
source="/scratch/${USER}/MySimulation"
destination='/tudelft.net/staff-umbrella/MyProject/DelftBlueResults/'

for result in "${results[@]}"
do
  rsync -av --no-perms "${source}/${result}" "${destination}"
done

Controlling the output file name

The output of your script will by default be put into the file slurm-<SLURM_JOB_ID>.out, where <SLURM_JOB_ID> is the Slurm batch job number of your job. The standard error will be put into a file called slurm-<SLURM_JOB_ID>.err: both files will be found in the directory from which you launched the job.

Note that the output file is created when your job starts running, and the output from your job is placed in this file as the job runs, so that you can monitor your job's progress. Therefore, do not delete this file while your job is running, or else you will lose your output. Please keep in mind that Slurm performs file buffering by default when writing to the output files: output is held in a buffer until the buffer is flushed to the output file. If you want the output to be written to the file immediately (at the cost of additional CPU, network and storage overhead), you should pass the option --unbuffered to the srun command: the output will then appear in the file as soon as it is produced.
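For example, the srun line inside your batch script would then read (the executable name is a placeholder):

srun --unbuffered ./executable.x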

If you wish to change the default names of the output and error files, you can use the --output and --error directives in the batch script that you submit using the sbatch command. See the example below:

#SBATCH --output=hello_world_mpi.%j.out
#SBATCH --error=hello_world_mpi.%j.err

Using visualization nodes

Follow the instructions in this document on how to use the visualization nodes.

Interactive use

If you need to use a node of DelftBlue interactively, e.g. for running a debugger, you can issue the following command to get a terminal on one of the nodes:

srun <your-sbatch-commands> --pty bash 

where <your-sbatch-commands> are the #SBATCH lines from a regular submission script.

For example, to run an interactive (multi-threaded) job with eight cores for 30 minutes on a CPU node:

srun --job-name="int_job" --partition=compute --time=00:30:00 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=1GB --pty bash

Accessing allocated nodes

A common use case of interactive sessions is logging in on the nodes where your batch job is running, e.g. to check the resource usage using 'top' or to attach a debugger. This is possible by adding the --overlap and --jobid=<JOBID> flags to the above srun command, where <JOBID> can be found in the first column of squeue --me. In order to log in on a specific node of the allocation, also add the --nodelist=<node> option.

For example, if my job with JOBID=11111 is running on node cmp111 of DelftBlue, I can connect to this node by issuing the following command:

srun --pty --ntasks=1 --time=00:30:00 --overlap --jobid=11111 bash

This should bring me to node cmp111, where my original job is running.

If my job with JOBID=22222 is running on several nodes of DelftBlue, for example cmp111 and cmp112, I can connect to a specific node by issuing the following command:

srun --pty --ntasks=1 --time=00:30:00 --overlap --jobid=22222 --nodelist=cmp112 bash

This should bring me to node cmp112, one of the two nodes where my original job is running.

Interactive GPU jobs

To run an interactive job on a GPU node:

srun --mpi=pmix --job-name="int_gpu_job" --partition=gpu --time=01:00:00 --ntasks=1 --cpus-per-task=1 --gpus-per-task=1 --mem-per-cpu=4G --account=research-faculty-department --pty /bin/bash -il

Interactive MPI jobs

To run an interactive MPI job on 16 cores (possibly scheduled across multiple nodes):

srun --job-name="int_mpi_job" --partition=compute --time=00:30:00 --ntasks=16 --cpus-per-task=1 --mem-per-cpu=1GB --account=research-faculty-department --pty bash -il

and then start your application using

srun --overlap <executable>

Exclusive use

Please note that nodes are shared with other users by default. Normally, we recommend that you only request the number of CPUs you need, and let the Slurm scheduler decide on which node to allocate them. However, sometimes you might want to request a full node exclusively. This can be done as follows:

#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --mem=0

This will allocate the entire node, including all of its memory, to your job.

Error messages and solutions

Check if Slurm module is loaded!

Slurm is now available after login by default, without loading any additional modules.

How to load slurm module by hand

During the beta phase it is possible that Slurm needs to be explicitly loaded as a module.

If you get the following error, it means that Slurm is not loaded (or not enabled by default):

[<netid>@login01 ~]$ squeue --me
squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source

If this is the case, make sure that the Slurm module is loaded (currently from the vendor-installed software stack in /cm/local/modulefiles):

[<netid>@login01 ~]$ module load slurm

Now, it should work:

[<netid>@login01 ~]$ squeue --me
            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              964   compute    2_1c6  <netid>  R    1:37:15      1 mem001

Slurm out-of-memory (OOM) error

You might encounter the following error when submitting jobs via Slurm:

slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1170.0. Some of your processes may have been killed by the cgroup out-of-memory handler.

You need to set the --mem-per-cpu value in the submission script. This value is the amount of memory in MB that Slurm allocates per allocated CPU. It defaults to 1 MB. If your job's memory use exceeds this, the job gets killed with an OOM error message. Set this value to a reasonable amount (i.e. the expected memory use plus a bit of headroom).

Example: add the following line to the submission script:

#SBATCH --mem-per-cpu=1G

This allocates 1 GB per CPU.

Job management and statistics

When will my job start?

You can use the --start flag of the squeue command to see an estimate of when your pending jobs will start:

[dpalagin@login02 ~]$ squeue --me --start
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
           3162383   compute  pending_1 dpalagin PD 2023-04-01T07:03:01      1 (null)               (Priority)
           3162383   compute  pending_2 dpalagin PD 2023-04-01T07:03:01      1 (null)               (Priority)

Which jobs did I run recently?

You can use the sacct command to show all job information starting from a specific date:

[NetID@login02 ~]$ sacct -S 2022-12-01 -u $USER

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1853039        vnc-1147     visual innovation          2     FAILED      1:0
1853039.bat+      batch            innovation          2     FAILED      1:0
1853040        vnc-1149     visual innovation          2     FAILED      1:0
1853040.bat+      batch            innovation          2     FAILED      1:0
1853041        vnc-1150     visual innovation          2     FAILED      1:0
1853041.bat+      batch            innovation          2     FAILED      1:0
1853042        vnc-1151     visual innovation          2    TIMEOUT      0:0
1853042.bat+      batch            innovation          2  CANCELLED     0:15
1853751          all-OH    compute research-+         48  COMPLETED      0:0
1853751.bat+      batch            research-+         25  COMPLETED      0:0
1853751.0    aims.2107+            research-+         48  COMPLETED      0:0
1853752        fewer-OH    compute research-+         48  COMPLETED      0:0
1853752.bat+      batch            research-+         12  COMPLETED      0:0
1853752.0    aims.2107+            research-+         48  COMPLETED      0:0
1853754           no-OH    compute research-+         48  COMPLETED      0:0
1853754.bat+      batch            research-+         28  COMPLETED      0:0
1853754.0    aims.2107+            research-+         48  COMPLETED      0:0
1856064           water    compute research-+          4  COMPLETED      0:0
1856064.bat+      batch            research-+          4  COMPLETED      0:0
1856064.0    aims.2107+            research-+          4  COMPLETED      0:0

How do I check information on a specific job?

You can use the seff command to show the job details:

[NetID@login02 ~]$ seff 1853751

Job ID: 1853751
Cluster: delftblue
User/Group: dpalagin/domain users
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 22-15:49:25
CPU Efficiency: 98.96% of 22-21:32:00 core-walltime
Job Wall-clock time: 11:26:55
Memory Utilized: 20.83 GB (estimated maximum)
Memory Efficiency: 21.70% of 96.00 GB (2.00 GB/core)

How do I get detailed information on a specific pending job?

You can use the scontrol command to show the full job details:

[NetID@login02 ~]$ scontrol show jobid -dd 2475625

JobId=2475625 JobName=01_hello
   UserId=dpalagin(588559) GroupId=domain users(100513) MCS_label=N/A
   Priority=22962179 Nice=0 Account=research-eemcs-diam QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2023-06-27T14:48:42 EligibleTime=2023-06-27T14:48:42
   AccrueTime=2023-06-27T14:48:42
   StartTime=2023-06-29T06:40:07 EndTime=2023-06-29T06:50:07 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-27T16:04:42 Scheduler=Backfill:*
   Partition=compute AllocNode:Sid=login02:878519
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=1G,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/helloworld-test2.sh
   WorkDir=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld
   StdErr=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
   StdIn=/dev/null
   StdOut=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
   Power=