CUDA¶
Available CUDA versions¶
In order to see which CUDA modules are currently available, use the command
module avail cuda.
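For example (the list of modules actually shown will depend on the software stack installed at the time):

[<NetID>@login01 ~]$ module avail cuda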
In order to compile code optimized for the V100 GPUs, use compute capability 7.0; for the A100 GPUs, use 8.0
(e.g., the flag -arch=sm_80 for nvcc, or -gpu=cc80 for nvc++, etc.).
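For instance, a sketch of an nvcc invocation targeting the A100s (my_prog.cu is a placeholder file name):

[<NetID>@login01 ~]$ nvcc -arch=sm_80 -o my_prog my_prog.cu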
Using nvcc¶
At the moment, if you would like to use the CUDA compiler nvcc, you have two options:
1. Option 1: Load the cuda module (sketched below).
2. Option 2: Load the nvhpc module (sketched below).
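A sketch of both options (the exact module names or default versions may differ; check module avail first):

[<NetID>@login01 ~]$ module load cuda

or, alternatively:

[<NetID>@login01 ~]$ module load nvhpc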
Either option will make nvcc available:
[<NetID>@login01 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
Using nvfortran¶
nvfortran is not a part of the standard CUDA module, and is only available in the nvhpc package. This package can be loaded as follows:
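A sketch (an explicit version can be appended if several nvhpc modules are installed):

[<NetID>@login01 ~]$ module load nvhpc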
This will make nvfortran available:
[<NetID>@login01 ~]$ nvfortran --version
nvfortran 22.3-0 64-bit target on x86-64 Linux -tp skylake-avx512
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Using GPU cards on GPU nodes¶
In Slurm, the use of GPU card(s) needs to be specified as an argument to srun and sbatch with --gres=gpu (for one card) or --gres=gpu:2 (for two cards). A simple sbatch script test_gpu.sh could look like this:
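A minimal sketch (the job name, time limit, and loaded module are assumptions and should be adapted as needed):

#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --ntasks=1
#SBATCH --gres=gpu            # request one GPU card
#SBATCH --time=00:05:00       # assumed short time limit
module load cuda              # make the CUDA runtime available
./test_gpu                    # run the compiled example program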
The test_gpu.cu program could be something like this:
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    // Query how many GPU devices are visible to this job
    int GPU_N = 0;
    cudaError_t err = cudaGetDeviceCount(&GPU_N);

    // Print both the CUDA error string and the device count
    printf("error: %s; number of gpus: %i\n", cudaGetErrorString(err), GPU_N);
    return 0;
}
which can be compiled on the head-node with nvcc:
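For example (a sketch; a matching -arch flag, e.g. -arch=sm_70 for the V100s, can be added as described above):

[<NetID>@login01 ~]$ nvcc -o test_gpu test_gpu.cu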
and tested with the sbatch-script via the head-node:
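For example, assuming the script above is saved as test_gpu.sh:

[<NetID>@login01 ~]$ sbatch test_gpu.sh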
Error messages and solutions¶
srun/MPI issues¶
Sometimes, the following MPI error might occur:
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: gpu010
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3409449 on node gpu010 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------
This can be fixed by adding the --mpi=pmix flag to the srun command in the job script and resubmitting the job.
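For example (a sketch; my_mpi_app is a placeholder for the actual application):

srun --mpi=pmix ./my_mpi_app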