
CUDA

Available CUDA versions

The following CUDA versions are currently available on GPU nodes:

-------------------------------------
  cuda:
-------------------------------------
     Versions:
        cuda/11.1.1
        cuda/11.6
        cuda/11.7
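
The listing above is in the format produced by the module system's search command; you can reproduce it at any time (a sketch, assuming the Lmod module spider command is available on the cluster), since the set of installed versions may change:

$ module spider cuda        # list all installed cuda modules
$ module spider cuda/11.6   # show how to load one specific version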

Using nvcc

At the moment, if you would like to use the CUDA compiler nvcc, you have two options:

1. Load the cuda module:

module load 2022r2
module load cuda/11.6

The available nvcc version is then:

[<netid>@login03 ~]$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

This has been checked on the GPU node. Do not use the cuda/11.3 module, as it does not provide nvcc yet.

2. Load the nvhpc module:

module load 2022r2
module load nvhpc

This will make nvcc available:

[<NetID>@login01 ~]$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
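
Note that the two options provide slightly different nvcc builds (V11.6.124 via the cuda/11.6 module versus V11.6.112 via nvhpc), as the outputs above show. Whichever option you choose, you can inspect what a module will add to your environment before loading it; a quick sketch using the standard module show command:

$ module show cuda/11.6    # shows the PATH and LD_LIBRARY_PATH entries the module prepends
$ module show nvhpc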

Using nvfortran

nvfortran is not part of the standard CUDA module and is only available in the nvhpc package. This package can be loaded as follows:

module load 2022r2
module load nvhpc

This will make nvfortran available:

[<NetID>@login01 ~]$ nvfortran --version

nvfortran 22.3-0 64-bit target on x86-64 Linux -tp skylake-avx512
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
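
As a quick sanity check you can compile a trivial Fortran source on the login node; the file names below are placeholders, and the -cuda flag (which enables CUDA Fortran in recent nvhpc releases) is only needed for sources that use the cudafor module:

$ nvfortran -o hello hello.f90                  # plain Fortran source
$ nvfortran -cuda -o hello_gpu hello_gpu.f90    # source that uses CUDA Fortran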

Using GPU cards on GPU nodes

In Slurm, the use of GPU card(s) needs to be specified as an argument to srun and sbatch with --gres=gpu (for one card) or --gres=gpu:2 (for two cards). A simple sbatch script test_gpu.sh could look like this:

#!/bin/sh
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2

module load 2022r2 cuda/11.6

srun test_gpu
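
Depending on the cluster defaults you may also want to request a time limit, tasks and memory explicitly; the following extra #SBATCH lines are a sketch with placeholder values, to be adjusted to your own job:

#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=1G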

The test_gpu.cu program could be something like this:

#include <stdio.h>

int main()
{
    int GPU_N;
    GPU_N = 0;
    cudaError_t err;
    err = cudaGetDeviceCount(&GPU_N);
    printf("error: %s; number of gpus: %i\n", cudaGetErrorString(err), GPU_N);
}

which can be compiled on the head-node with nvcc:

$ module load 2022r2 cuda/11.6
$ nvcc -o test_gpu test_gpu.cu

and tested by submitting the sbatch script from the head node:

$ sbatch test_gpu.sh
$ cat slurm-<jobid>.out
error: no error; number of gpus: 2
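
If you want a bit more detail than the device count, the same program can be extended to print the model name of each card; a minimal sketch using cudaGetDeviceProperties (the reported names depend on the GPUs installed in the node):

#include <stdio.h>

int main()
{
    int GPU_N = 0;
    cudaError_t err = cudaGetDeviceCount(&GPU_N);
    printf("error: %s; number of gpus: %i\n", cudaGetErrorString(err), GPU_N);

    // print the model name of every device visible to the job
    for (int i = 0; i < GPU_N; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            printf("gpu %i: %s\n", i, prop.name);
    }
    return 0;
}

This version is compiled and submitted in exactly the same way as test_gpu.cu above.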

Error messages and solutions

srun/MPI issues

Sometimes, the following MPI error might occur:

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu010
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3409449 on node gpu010 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------

This can be fixed by adding the following srun flag and resubmitting the job: srun --mpi=pmix.
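
In an sbatch script this means adjusting the srun line; a sketch based on the test_gpu.sh example above (the executable name is a placeholder):

srun --mpi=pmix my_mpi_program   # my_mpi_program stands for your own executable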