Troubleshooting your Job¶
Error messages and solutions¶
Check if the Slurm module is loaded!¶
Slurm is now available after login by default, without loading any additional modules.
How to load the slurm module by hand
If you get the following error, it means that Slurm is not loaded (or not enabled by default):
[<netid>@login01 ~]$ squeue --me
squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
If this is the case, make sure that the Slurm module is loaded:
[<netid>@login01 ~]$ module load slurm
Now, it should work:
Slurm out-of-memory (OOM) error¶
You might encounter the following error when submitting jobs via Slurm:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1170.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
You need to set the --mem-per-cpu value in the submission script. This value is the amount of memory in MB that Slurm allocates per allocated CPU; it defaults to 1 MB. If your job's memory use exceeds this limit, the job is killed with an OOM error message. Set this value to a reasonable amount, i.e. the expected memory use plus a little headroom.
Example: add the following line to the submission script:
#SBATCH --mem-per-cpu=1G
This allocates 1 GB per CPU.
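For context, a minimal jobscript sketch with this setting in place might look as follows (all job names, partition names, and resource values here are placeholders; adjust them for your own job):

```shell
#!/bin/bash
#SBATCH --job-name=my_job        # placeholder job name
#SBATCH --partition=compute
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G         # 1 GB per allocated CPU; raise this if you hit OOM

srun my_program                  # placeholder executable
```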
Job management and statistics¶
When will my job start?¶
You can use the --start flag of the squeue command to see the current estimated start time for your pending jobs:
[dpalagin@login02 ~]$ squeue --me --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
3162383 compute pending_1 dpalagin PD 2023-04-01T07:03:01 1 (null) (Priority)
3162383 compute pending_2 dpalagin PD 2023-04-01T07:03:01 1 (null) (Priority)
This estimate can change at any time: jobs may start earlier if resources free up, or later if higher-priority jobs are submitted to the queue.
Which jobs did I run recently?¶
You can use the sacct command to show all job information starting from a specific date:
[NetID@login02 ~]$ sacct -S 2024-05-01 -u $USER
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1853039 vnc-1147 visual innovation 2 FAILED 1:0
1853039.bat+ batch innovation 2 FAILED 1:0
1853040 vnc-1149 visual innovation 2 FAILED 1:0
1853040.bat+ batch innovation 2 FAILED 1:0
1853041 vnc-1150 visual innovation 2 FAILED 1:0
1853041.bat+ batch innovation 2 FAILED 1:0
1853042 vnc-1151 visual innovation 2 TIMEOUT 0:0
1853042.bat+ batch innovation 2 CANCELLED 0:15
1853751 all-OH compute research-+ 48 COMPLETED 0:0
1853751.bat+ batch research-+ 25 COMPLETED 0:0
1853751.0 aims.2107+ research-+ 48 COMPLETED 0:0
1853752 fewer-OH compute research-+ 48 COMPLETED 0:0
1853752.bat+ batch research-+ 12 COMPLETED 0:0
1853752.0 aims.2107+ research-+ 48 COMPLETED 0:0
1853754 no-OH compute research-+ 48 COMPLETED 0:0
1853754.bat+ batch research-+ 28 COMPLETED 0:0
1853754.0 aims.2107+ research-+ 48 COMPLETED 0:0
1856064 water compute research-+ 4 COMPLETED 0:0
1856064.bat+ batch research-+ 4 COMPLETED 0:0
1856064.0 aims.2107+ research-+ 4 COMPLETED 0:0
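If the list is long, you can filter it with standard text tools. As a sketch, the following awk one-liner prints the IDs of entries whose State is FAILED; the sample lines below are copied from the output above, and in practice you would pipe the real sacct command into awk instead:

```shell
# Print the JobID (first column) of every line whose State is FAILED.
# Real usage (sketch):  sacct -S 2024-05-01 -u $USER | awk '/ FAILED / { print $1 }'
awk '/ FAILED / { print $1 }' <<'EOF'
1853039      vnc-1147     visual  innovation          2     FAILED      1:0
1853042      vnc-1151    compute  innovation          2    TIMEOUT      0:0
1853751       all-OH     compute  research-+         48  COMPLETED      0:0
EOF
```

This prints only 1853039, since the other two sample lines ended in TIMEOUT and COMPLETED.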
How do I check information on a specific job?¶
You can use the seff command to show the job details:
[NetID@login02 ~]$ seff 1853751
Job ID: 1853751
Cluster: delftblue
User/Group: dpalagin/domain users
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 22-15:49:25
CPU Efficiency: 98.96% of 22-21:32:00 core-walltime
Job Wall-clock time: 11:26:55
Memory Utilized: 20.83 GB (estimated maximum)
Memory Efficiency: 21.70% of 96.00 GB (2.00 GB/core)
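The efficiency percentages are simple ratios (used divided by allocated), which you can verify by hand. A quick check with awk, using the numbers from the seff output above:

```shell
# Verify seff's ratios: CPU efficiency = CPU time used / core-walltime,
# memory efficiency = memory used / memory requested.
awk 'BEGIN {
  used  = 22*86400 + 15*3600 + 49*60 + 25   # CPU Utilized: 22-15:49:25, in seconds
  avail = 22*86400 + 21*3600 + 32*60        # core-walltime: 22-21:32:00, in seconds
  printf "CPU efficiency: %.2f%%\n", 100 * used / avail
  printf "Memory efficiency: %.2f%%\n", 100 * 20.83 / 96.00
}'
```

Both values match the 98.96% and 21.70% reported by seff.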
At this moment, the memory information reported by seff is incorrect when the job uses srun, and in that case it must be ignored. However, any Out-of-Memory (OOM) errors your jobs encounter are real: the job used more memory than requested, so you should either reduce its memory usage or increase the requested memory.
How do I get detailed scheduling information on a job?¶
You can use the scontrol command to show the full job details:
[NetID@login02 ~]$ scontrol show jobid 2475625 --details
JobId=2475625 JobName=01_hello
UserId=dpalagin(588559) GroupId=domain users(100513) MCS_label=N/A
Priority=22962179 Nice=0 Account=research-eemcs-diam QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2023-06-27T14:48:42 EligibleTime=2023-06-27T14:48:42
AccrueTime=2023-06-27T14:48:42
StartTime=2023-06-29T06:40:07 EndTime=2023-06-29T06:50:07 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-27T16:04:42 Scheduler=Backfill:*
Partition=compute AllocNode:Sid=login02:878519
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=1G,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/helloworld-test2.sh
WorkDir=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld
StdErr=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
StdIn=/dev/null
StdOut=/home/dpalagin/DelftBlueWorkshop/Day1/Exercises/01_helloworld/slurm-2475625.out
Power=
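Since scontrol prints space-separated key=value pairs, individual fields are easy to extract with grep. A small sketch (the sample line below is taken from the output above; in practice you would pipe the real scontrol command):

```shell
# Extract a single field from scontrol's key=value output.
# Real usage (sketch):  scontrol show jobid 2475625 --details | grep -o 'StartTime=[^ ]*'
grep -o 'StartTime=[^ ]*' <<'EOF'
StartTime=2023-06-29T06:40:07 EndTime=2023-06-29T06:50:07 Deadline=N/A
EOF
```

This prints StartTime=2023-06-29T06:40:07, i.e. the scheduler's current estimate of when the pending job will start.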