CUDA¶
Available CUDA versions¶
In order to see which CUDA modules are currently available, use the command
module avail cuda.
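For example (the list of modules actually shown will depend on the software stack installed at the time):

[<NetID>@login01 ~]$ module avail cuda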
In order to compile code optimized for the V100 GPUs, use compute capability 7.0; for the A100 GPUs, use 8.0
(e.g., the flag -arch=sm_80 for nvcc, or -gpu=cc80 for nvc++, etc.).
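For instance, a sketch of an nvcc invocation targeting the A100s (my_prog.cu is a placeholder file name):

[<NetID>@login01 ~]$ nvcc -arch=sm_80 -o my_prog my_prog.cu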
Using nvcc¶
At the moment, if you would like to use the CUDA compiler nvcc, you have two options:
1. Option 1: Load the cuda module (sketched below).
2. Option 2: Load the nvhpc module (sketched below).
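A sketch of both options (the exact module names or default versions may differ; check module avail first):

[<NetID>@login01 ~]$ module load cuda

or, alternatively:

[<NetID>@login01 ~]$ module load nvhpc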
Either option will make nvcc available:
[<NetID>@login01 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
Using nvfortran¶
nvfortran is not a part of the standard CUDA module, and is only available in the nvhpc package. This package can be loaded as follows:
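A sketch (an explicit version can be appended if several nvhpc modules are installed):

[<NetID>@login01 ~]$ module load nvhpc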
This will make nvfortran available:
[<NetID>@login01 ~]$ nvfortran --version
nvfortran 22.3-0 64-bit target on x86-64 Linux -tp skylake-avx512
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Using GPU cards on GPU nodes¶
In Slurm, the use of GPU card(s) needs to be specified as an argument to srun and sbatch with --gres=gpu (for one card) or --gres=gpu:2 (for two cards). A simple sbatch script test_gpu.sh could look like this:
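A minimal sketch (the job name, time limit, and loaded module are assumptions and should be adapted as needed):

#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --ntasks=1
#SBATCH --gres=gpu            # request one GPU card
#SBATCH --time=00:05:00       # assumed short time limit
module load cuda              # make the CUDA runtime available
./test_gpu                    # run the compiled example program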
The test_gpu.cu program could be something like this:
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    // Query how many GPU devices are visible to this job
    int GPU_N = 0;
    cudaError_t err = cudaGetDeviceCount(&GPU_N);

    // Print both the CUDA error string and the device count
    printf("error: %s; number of gpus: %i\n", cudaGetErrorString(err), GPU_N);
    return 0;
}
which can be compiled on the head-node with nvcc:
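For example (a sketch; a matching -arch flag, e.g. -arch=sm_70 for the V100s, can be added as described above):

[<NetID>@login01 ~]$ nvcc -o test_gpu test_gpu.cu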
and tested with the sbatch-script via the head-node:
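For example, assuming the script above is saved as test_gpu.sh:

[<NetID>@login01 ~]$ sbatch test_gpu.sh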
Error messages and solutions¶
srun/MPI issues¶
Sometimes, the following MPI error might occur:
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: gpu010
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3409449 on node gpu010 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------
This can be fixed by adding the --mpi=pmix flag to the srun command in the job script and resubmitting the job.
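For example (a sketch; my_mpi_app is a placeholder for the actual application):

srun --mpi=pmix ./my_mpi_app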