CUDA¶
Available CUDA versions¶
The following CUDA versions are currently available on GPU nodes:
-------------------------------------
cuda:
-------------------------------------
Versions:
cuda/11.1.1
cuda/11.6
cuda/11.7
Using nvcc¶
At the moment, if you would like to use the CUDA compiler nvcc, you have two options:
1. Option 1: Load the cuda module:
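A minimal example, loading the cuda/11.6 version listed above (the exact version you pick may differ):
[<NetID>@login01 ~]$ module load cuda/11.6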
The available nvcc version is then:
[<netid>@login03 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
This has been checked on a gpu node. Do not use the cuda/11.3 module, as it does not yet provide nvcc.
2. Option 2: Load the nvhpc module:
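A minimal example (the exact nvhpc version available on the system may differ):
[<NetID>@login01 ~]$ module load nvhpc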
This will make nvcc available:
[<NetID>@login01 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
Using nvfortran¶
nvfortran is not part of the standard CUDA module and is only available in the nvhpc package. This package can be loaded as follows:
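For example (the version may differ on your system):
[<NetID>@login01 ~]$ module load nvhpc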
This will make nvfortran available:
[<NetID>@login01 ~]$ nvfortran --version
nvfortran 22.3-0 64-bit target on x86-64 Linux -tp skylake-avx512
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Using GPU cards on GPU nodes¶
In slurm, the use of GPU card(s) needs to be specified as an argument to srun and sbatch with --gres=gpu (for one card) or --gres=gpu:2 (for two cards). A simple sbatch script test_gpu.sh could look like this:
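A minimal sketch, assuming one GPU card and the cuda/11.6 module from above; partition and account directives are site-specific and therefore omitted here, so add them as required:
#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --gres=gpu            # request one GPU card (use --gres=gpu:2 for two)
#SBATCH --time=00:05:00
#SBATCH --output=test_gpu.out

module load cuda/11.6         # assumed module; see "Available CUDA versions"
./test_gpu                    # the compiled test program (see below)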
The test_gpu.cu program could be something like this:
#include <stdio.h>

int main()
{
    int GPU_N = 0;

    // Query the number of CUDA-capable devices visible to this job
    cudaError_t err = cudaGetDeviceCount(&GPU_N);

    printf("error: %s; number of gpus: %i\n", cudaGetErrorString(err), GPU_N);
    return 0;
}
which can be compiled on the head node with nvcc:
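For example, with the cuda module loaded as in Option 1 above (the output name test_gpu is arbitrary):
[<NetID>@login01 ~]$ nvcc test_gpu.cu -o test_gpu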
and tested with the sbatch script via the head node:
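For example, using the test_gpu.sh sketch above:
[<NetID>@login01 ~]$ sbatch test_gpu.sh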
Error messages and solutions¶
srun/MPI issues¶
Sometimes, the following MPI error might occur:
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: gpu010
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3409449 on node gpu010 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------
This can be fixed by adding the following srun flag and resubmitting the job: srun --mpi=pmix.
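For example, the MPI launch line in the job script would then look like this (my_mpi_program is a placeholder for your own executable):
srun --mpi=pmix ./my_mpi_program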