TensorFlow

Let's try to run the following minimal example of a TensorFlow job (howmanygpus.py):

howmanygpus.py
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

This can be submitted to the queue with the following submission script:

howmanygpus.slurm
#!/bin/bash -l
#
#SBATCH --job-name="tensorflow/howmanygpus"
#SBATCH --output=howmanygpus.out
#SBATCH --time=00:02:00
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-task=2
#SBATCH --mem-per-cpu=1G
# make sure to add your account!
##SBATCH --account=<what>-<faculty>-<group>

module load 2023r1
module load python
module load openmpi
module load py-tensorflow

srun python howmanygpus.py
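
Submit the job with sbatch howmanygpus.slurm; once it finishes, the output file howmanygpus.out should report the two GPUs requested above.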

Please note that you only need to load the python, openmpi, and py-tensorflow modules. Loading a separate CUDA module is not necessary and can result in a conflict of CUDA libraries. Let the py-tensorflow module do its thing by itself!
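
If you want to double-check which CUDA toolkit your TensorFlow build ships with, TensorFlow (2.3 and newer) exposes its build metadata directly from Python, so no external CUDA module is involved. A minimal sketch (the file name checkbuild.py is just an example; run it the same way as howmanygpus.py above):

checkbuild.py
import tensorflow as tf

# Build metadata bundled with this TensorFlow installation (TF >= 2.3)
info = tf.sysconfig.get_build_info()
print("CUDA build:   ", info.get("is_cuda_build"))
print("CUDA version: ", info.get("cuda_version"))
print("cuDNN version:", info.get("cudnn_version"))

# GPUs actually visible to this process
print("GPUs:", tf.config.list_physical_devices("GPU"))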

Keras classification example

A slightly more advanced example that actually "does something" is given below. In the job script, we clone a benchmark dataset onto the local SSD of a GPU node and then run a classification training (example adapted from the TensorFlow Keras classification tutorial):

classification.py
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from fashion_mnist.utils import mnist_reader

# Find GPUs
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

# Helper libraries
import numpy as np

print(tf.__version__)

# Import the Fashion MNIST dataset
# 60,000 images are used to train the network and 10,000 images to evaluate:

train_images, train_labels = mnist_reader.load_mnist('fashion_mnist/data/fashion', kind='train')
test_images, test_labels = mnist_reader.load_mnist('fashion_mnist/data/fashion', kind='t10k')

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

print('Train images shape is: '+ str(train_images.shape))
print('Length of train labels is: '+ str(len(train_labels)))
print('Test images shape is: '+ str(test_images.shape))
print('Length of test labels is: '+ str(len(test_labels)))

# mnist_reader returns each image as a flat vector of 784 pixels;
# reshape to 28x28 and scale the values to a range of 0 to 1
# before feeding them to the neural network model.
train_images = train_images.reshape(-1, 28, 28) / 255.0
test_images = test_images.reshape(-1, 28, 28) / 255.0


# Set up the layers
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10)
])

# Print Summary
model.summary()

# Compile the model
model.compile(optimizer='adam',
              loss=SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10)

# Evaluate accuracy
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('\nTest accuracy:', test_acc)
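
The trained model outputs raw logits. To turn these into per-image predictions, the TensorFlow tutorial attaches a softmax layer that converts the logits into probabilities. A minimal sketch that could be appended to classification.py:

# Convert the model's logits into class probabilities
probability_model = Sequential([model, tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)

# Highest-probability class for the first test image
print('Predicted:', class_names[np.argmax(predictions[0])],
      '| actual:', class_names[test_labels[0]])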

And here is the corresponding job script:

classification.slurm
#!/bin/bash -l
#
#SBATCH --job-name="tensorflow/classification"
#SBATCH --output=classification.out
#SBATCH --partition=gpu
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=3G
# make sure to add your account!
##SBATCH --account=<what>-<faculty>-<group>

module load 2023r1
module load git
module load python
module load openmpi
module load py-tensorflow

# for fastest possible I/O on a single node, we will work
# on the node-local SSD
RUNDIR=/tmp/${SLURM_JOBID}
mkdir ${RUNDIR}
cp classification.py ${RUNDIR}
cd ${RUNDIR}

# obtain the dataset
git clone https://github.com/zalandoresearch/fashion-mnist fashion_mnist


srun python classification.py

# always clean up after yourself...
rm -rf ${RUNDIR}
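
As before, submit with sbatch classification.slurm; the training progress and the final test accuracy end up in classification.out.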