Troubleshooting your Job¶
Error messages and solutions¶
Check if the Slurm module is loaded!¶
Slurm is now available after login by default, without loading any additional modules.
How to load the slurm module by hand
If you get the following error, it means that Slurm is not loaded (or not enabled by default):
[<netid>@login01 ~]$ squeue --me
squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
If this is the case, make sure that the Slurm module is loaded:
[<netid>@login01 ~]$ module load slurm
Now, it should work:
Slurm out-of-memory (OOM) error¶
You might encounter the following error when submitting jobs via Slurm:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1170.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
You need to set the --mem-per-cpu value in the submission script. This value is the amount of memory in MB that Slurm allocates per allocated CPU; it defaults to 1 MB. If your job's memory use exceeds this limit, the job is killed with an OOM error message. Set this value to a reasonable amount, i.e. the expected memory use plus a little headroom.
Example: add the following line to the submission script:
#SBATCH --mem-per-cpu=1G
This allocates 1 GB per CPU.
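For context, a minimal jobscript sketch with this setting in place might look as follows (all job names, partition names, and resource values here are placeholders; adjust them for your own job):

```shell
#!/bin/bash
#SBATCH --job-name=my_job        # placeholder job name
#SBATCH --partition=compute
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G         # 1 GB per allocated CPU; raise this if you hit OOM

srun my_program                  # placeholder executable
```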
Job management and statistics¶
When will my job start?¶
You can use the --start flag of the squeue command to see the current estimated start time for your pending jobs:
[dpalagin@login02 ~]$ squeue --me --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
3162383 compute pending_1 dpalagin PD 2023-04-01T07:03:01 1 (null) (Priority)
3162383 compute pending_2 dpalagin PD 2023-04-01T07:03:01 1 (null) (Priority)
This estimate can change at any time: jobs may start earlier if resources free up, or later if higher-priority jobs are submitted to the queue.
Which jobs did I run recently?¶
You can use the sacct command to show all job information starting from a specific date:
[NetID@login02 ~]$ sacct -S 2024-05-01 -u $USER
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1853039 vnc-1147 visual innovation 2 FAILED 1:0
1853039.bat+ batch innovation 2 FAILED 1:0
1853040 vnc-1149 visual innovation 2 FAILED 1:0
1853040.bat+ batch innovation 2 FAILED 1:0
1853041 vnc-1150 visual innovation 2 FAILED 1:0
1853041.bat+ batch innovation 2 FAILED 1:0
1853042 vnc-1151 visual innovation 2 TIMEOUT 0:0
1853042.bat+ batch innovation 2 CANCELLED 0:15
1853751 all-OH compute research-+ 48 COMPLETED 0:0
1853751.bat+ batch research-+ 25 COMPLETED 0:0
1853751.0 aims.2107+ research-+ 48 COMPLETED 0:0
1853752 fewer-OH compute research-+ 48 COMPLETED 0:0
1853752.bat+ batch research-+ 12 COMPLETED 0:0
1853752.0 aims.2107+ research-+ 48 COMPLETED 0:0
1853754 no-OH compute research-+ 48 COMPLETED 0:0
1853754.bat+ batch research-+ 28 COMPLETED 0:0
1853754.0 aims.2107+ research-+ 48 COMPLETED 0:0
1856064 water compute research-+ 4 COMPLETED 0:0
1856064.bat+ batch research-+ 4 COMPLETED 0:0
1856064.0 aims.2107+ research-+ 4 COMPLETED 0:0
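If the list is long, you can filter it with standard text tools. As a sketch, the following awk one-liner prints the IDs of entries whose State is FAILED; the sample lines below are copied from the output above, and in practice you would pipe the real sacct command into awk instead:

```shell
# Print the JobID (first column) of every line whose State is FAILED.
# Real usage (sketch):  sacct -S 2024-05-01 -u $USER | awk '/ FAILED / { print $1 }'
awk '/ FAILED / { print $1 }' <<'EOF'
1853039      vnc-1147     visual  innovation          2     FAILED      1:0
1853042      vnc-1151    compute  innovation          2    TIMEOUT      0:0
1853751       all-OH     compute  research-+         48  COMPLETED      0:0
EOF
```

This prints only 1853039, since the other two sample lines ended in TIMEOUT and COMPLETED.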
How do I check information on a specific job?¶
You can use the seff command to show the job details:
[NetID@login02 ~]$ seff 1853751
Job ID: 1853751
Cluster: delftblue
User/Group: dpalagin/domain users
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 22-15:49:25
CPU Efficiency: 98.96% of 22-21:32:00 core-walltime
Job Wall-clock time: 11:26:55
Memory Utilized: 20.83 GB (estimated maximum)
Memory Efficiency: 21.70% of 96.00 GB (2.00 GB/core)
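The efficiency percentages are simple ratios (used divided by allocated), which you can verify by hand. A quick check with awk, using the numbers from the seff output above:

```shell
# Verify seff's ratios: CPU efficiency = CPU time used / core-walltime,
# memory efficiency = memory used / memory requested.
awk 'BEGIN {
  used  = 22*86400 + 15*3600 + 49*60 + 25   # CPU Utilized: 22-15:49:25, in seconds
  avail = 22*86400 + 21*3600 + 32*60        # core-walltime: 22-21:32:00, in seconds
  printf "CPU efficiency: %.2f%%\n", 100 * used / avail
  printf "Memory efficiency: %.2f%%\n", 100 * 20.83 / 96.00
}'
```

Both values match the 98.96% and 21.70% reported by seff.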
At this moment, the memory information reported by seff is incorrect when the job uses srun, and in that case it must be ignored. However, any Out-of-Memory (OOM) errors your jobs encounter are real: the job used more memory than requested, so you should either reduce its memory usage or increase the requested memory.
How do I get detailed scheduling information on a job?¶
You can use the scontrol command to show the full job details:
[NetID@login02 ~]$ scontrol show jobid 2475625 --details
JobId=2475625 JobName=01_hello
UserId=dpalagin(588559) GroupId=domain users(100513) MCS_label=N/A
Priority=22962179 Nice=0 Account=research-eemcs-diam QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2023-06-27T14:48:42 EligibleTime=2023-06-27T14:48:42
AccrueTime=2023-06-27T14:48:42
StartTime=2023-06-29T06:40:07 EndTime=2023-06-29T06:50:07 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-27T16:04:42 Scheduler=Backfill:*
Partition=compute AllocNode:Sid=login02:878519
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=1G,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/helloworld-test2.sh
WorkDir=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld
StdErr=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
StdIn=/dev/null
StdOut=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
Power=
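Since scontrol prints space-separated key=value pairs, individual fields are easy to extract with grep. A small sketch (the sample line below is taken from the output above; in practice you would pipe the real scontrol command):

```shell
# Extract a single field from scontrol's key=value output.
# Real usage (sketch):  scontrol show jobid 2475625 --details | grep -o 'StartTime=[^ ]*'
grep -o 'StartTime=[^ ]*' <<'EOF'
StartTime=2023-06-29T06:40:07 EndTime=2023-06-29T06:50:07 Deadline=N/A
EOF
```

This prints StartTime=2023-06-29T06:40:07, i.e. the scheduler's current estimate of when the pending job will start.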