TensorFlow
Let's try to run the following minimal example of a TensorFlow job (howmanygpus.py):
howmanygpus.py
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
This can be submitted to the queue with the following submission script:
howmanygpus.slurm
#!/bin/bash -l
#
#SBATCH --job-name="tensorflow/howmanygpus"
#SBATCH --output=howmanygpus.out
#SBATCH --time=00:02:00
#SBATCH --partition=gpu-a100-small
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-task=2
#SBATCH --mem-per-cpu=1G
# make sure to add your account!
##SBATCH --account=<what>-<faculty>-<group>
module load 2024r1
module load python
module load openmpi
module load py-tensorflow
srun python howmanygpus.py
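The job can be submitted with sbatch howmanygpus.slurm. Since the script requests --gpus-per-task=2, the output file howmanygpus.out should report both GPUs with a line like:
Num GPUs Available:  2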
Please note that you only have to load the python, openmpi, and py-tensorflow modules. Loading a separate CUDA module is not necessary and can result in a conflict of CUDA libraries. Let the py-tensorflow module do its thing by itself!
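If you want to double-check which CUDA and cuDNN versions the py-tensorflow module itself was built with, TensorFlow can report its own build information. A minimal check, run inside a job or an interactive session (the cuda_version and cudnn_version keys are only present in CUDA-enabled builds):
import tensorflow as tf
# Build metadata of the TensorFlow binary itself, independent of any
# CUDA module loaded in the environment.
info = tf.sysconfig.get_build_info()
print("built with CUDA :", info.get("cuda_version"))
print("built with cuDNN:", info.get("cudnn_version"))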
Keras classification example
A slightly more advanced example that actually "does something" is given below. In the job script, we clone a benchmark data set onto the local SSD of a GPU node, and then perform a classification training (example taken from the TensorFlow Keras tutorial):
classification.py
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from fashion_mnist.utils import mnist_reader
# Find GPUs
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
# Helper libraries
import numpy as np
print(tf.__version__)
# Import the Fashion MNIST dataset
# 60,000 images are used to train the network and 10,000 images to evaluate:
train_images, train_labels = mnist_reader.load_mnist('fashion_mnist/data/fashion', kind='train')
test_images, test_labels = mnist_reader.load_mnist('fashion_mnist/data/fashion', kind='t10k')
# mnist_reader returns the images as flat vectors of length 784;
# reshape them to 28x28 so they match Flatten(input_shape=(28, 28)) below
train_images = train_images.reshape(-1, 28, 28)
test_images = test_images.reshape(-1, 28, 28)
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
print('Train images shape is: '+ str(train_images.shape))
print('Length of train labels is: '+ str(len(train_labels)))
print('Test images shape is: '+ str(test_images.shape))
print('Length of test labels is: '+ str(len(test_labels)))
# Scale these values to a range of 0 to 1 before feeding them to the neural network model.
train_images = train_images / 255.0
test_images = test_images / 255.0
# Set up the layers
model = Sequential([
Flatten(input_shape=(28, 28)),
Dense(128, activation='relu'),
Dense(10)
])
# Print Summary
model.summary()
# Compile the model
model.compile(optimizer='adam',
loss=SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=10)
# Evaluate accuracy
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('\nTest accuracy:', test_acc)
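As in the upstream tutorial, the trained model can be turned into a classifier that outputs per-class probabilities by attaching a Softmax layer. A minimal sketch, reusing the model, test_images, and class_names defined above:
# Wrap the logits model so that it outputs probabilities.
probability_model = Sequential([model, tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)
# Print the predicted class of the first test image.
print('First test image is predicted to be:', class_names[int(np.argmax(predictions[0]))])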
And here is the corresponding job script:
classification.slurm
#!/bin/bash -l
#
#SBATCH --job-name="tensorflow/classification"
#SBATCH --output=classification.out
#SBATCH --partition=gpu-a100-small
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=3G
# make sure to add your account!
##SBATCH --account=<what>-<faculty>-<group>
module load 2024r1
module load git
module load python
module load openmpi
module load py-tensorflow
# for fastest possible I/O on a single node, we will work
# on the node-local SSD
RUNDIR=/tmp/${SLURM_JOBID}
mkdir ${RUNDIR}
cp classification.py ${RUNDIR}
cd ${RUNDIR}
# obtain the dataset
git clone https://github.com/zalandoresearch/fashion-mnist fashion_mnist
srun python classification.py
# always clean up after yourself...
rm -rf ${RUNDIR}
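Note that this final rm -rf deletes everything in ${RUNDIR}, including any files your training run produced there. If you extend classification.py to save the trained model, write it to a location that survives the cleanup, for example the directory the job was submitted from. A minimal sketch (the filename fashion_model.h5 is only an illustration):
import os
# SLURM exports SLURM_SUBMIT_DIR, the directory sbatch was invoked from;
# saving there keeps the model out of the node-local scratch directory.
outdir = os.environ.get('SLURM_SUBMIT_DIR', '.')
model.save(os.path.join(outdir, 'fashion_model.h5'))  # hypothetical filename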