
Ray cluster

How to run a Ray cluster on DelftBlue

1. Install Ray:

Install conda (on DelftBlue, the miniconda3 module can be loaded instead of installing your own), create a conda environment, and install Ray into it with pip.
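A minimal sketch, assuming the 2022r2 and miniconda3 modules that the job script below also loads; the environment name raytest and the Python version are illustrative:

module load 2022r2 miniconda3             # same modules as in the job script below
conda create --name raytest python=3.10   # name must match --conda_env later
conda activate raytest
pip install -U ray                        # install Ray into the environment with pip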

2. Create an executable script conda-run.sh, which will be used later to load the conda environment correctly:

#!/bin/bash
# Activate a conda environment on a node and run a command
# This script takes two positional arguments:
# 1) conda environment name
# 2) command to execute

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

# dirty hack: https://github.com/conda/conda/issues/9392#issuecomment-617345019 
unset CONDA_SHLVL

source "$(conda info --base)""/etc/profile.d/conda.sh"
conda activate "$1"

# Use this to suppress error codes that may interrupt the job or interfere with reporting.
# Do show the user that an error code has been received and is being suppressed.
# see https://stackoverflow.com/questions/11231937/bash-ignoring-error-for-a-particular-command 
eval "${@:2}" || echo "Exit with error code $? (suppressed)"

3. Create another executable script ray-on-conda-on-slurm.sh, to be called later from the SLURM script to launch the Ray cluster:

#!/bin/bash
# This script is meant to be called from a SLURM job file. 
# It launches a ray server on each compute node and starts a python script on the head node.
# It is assumed that a conda environment is required to activate ray.

# The script has five command line options:
# --python_runfile="..." name of the python file to run
# --python_arguments="..." optional arguments to the python script
# --conda_env="..." name of the conda environment (default: base)
# --rundir="..." name of the working directory (default: current)
# --temp_dir="..." location to store ray temporary files (default: /tmp/ray)
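#
# Example invocation (values are illustrative):
# . ray-on-conda-on-slurm.sh --python_runfile="main.py" --conda_env="raytest"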

# Author: Simon Tindemans, s.h.tindemans@tudelft.nl
# Version: 20 April 2022
#
# Ray-on-SLURM instruction and scripts used for inspiration:
# https://docs.ray.io/en/latest/cluster/slurm.html
# https://github.com/NERSC/slurm-ray-cluster
# https://github.com/pengzhenghao/use-ray-with-slurm

# set defaults
python_runfile="MISSING"
python_arguments=""
conda_env="base"
rundir="."
temp_dir="/tmp/ray"

# parse command line parameters
# use solution adapted from https://unix.stackexchange.com/questions/129391/passing-named-arguments-to-shell-scripts 
for argument in "$@"
do
  key=$(echo "${argument}" | cut -f1 -d=)

  key_length=${#key}
  value="${argument:${key_length}+1}"

  declare "${key#--}=${value}"
done
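
# Example: "--conda_env=raytest" declares the shell variable conda_env="raytest";
# each "--key=value" option becomes a shell variable named key.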

# Abort if no main python file name is given
if [[ ${python_runfile} == "MISSING" ]]
then
  echo "Missing python_runfile option. Aborting."
  exit 1
fi

if [[ $RAY_TIMELINE == "1" ]]
then
  echo "RAY PROFILING MODE ENABLED"
fi

# generate password for node-to-node communication
redis_password=$(uuidgen)
export redis_password

# get node names and identify IP addresses
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

node_1=${nodes_array[0]}
ip=$(srun --nodes=1 --ntasks=1 -w "$node_1" hostname --ip-address) # making redis-address

# if we detect a space character in the head node IP, the node has reported
# both an IPv6 and an IPv4 address; select the IPv4 one. This step is optional.
if [[ "$ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<< "$ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    ip=${ADDR[1]}
  else
    ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $ip"
fi

# set up head node
port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

# launch head node, leaving one core unused for the main python script
echo "STARTING HEAD at $node_1"
srun --job-name="ray-head" --nodes=1 --ntasks=1 --cpus-per-task=$((SLURM_CPUS_PER_TASK-1)) -w "$node_1" \
    conda-run.sh "$conda_env" \
    "ray start --head --temp-dir="${temp_dir}" --include-dashboard=false --num-cpus=$((SLURM_CPUS_PER_TASK-1)) --node-ip-address=$ip --port=$port --redis-password=$redis_password --block"  &
sleep 10  # give the head node time to start before adding workers

# launch worker nodes
worker_num=$((SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --job-name="ray-worker" --nodes=1 --ntasks=1 -w "$node_i" \
    conda-run.sh "$conda_env" \
    "ray start --temp-dir="${temp_dir}" --num-cpus=$SLURM_CPUS_PER_TASK --address=$ip_head --redis-password=$redis_password --block" &
  sleep 5
done

# export RAY_ADDRESS, so that ray.init() can be called without an explicit address
RAY_ADDRESS=$ip_head
export RAY_ADDRESS

# launch main program file on a single core. Wait for it to exit
srun --job-name="main" --unbuffered --nodes=1 --ntasks=1 --cpus-per-task=1 -w "$node_1" \
    conda-run.sh "$conda_env" \
    "ray status; cd ${rundir} ; python -u  ${python_runfile}" 

# if RAY_TIMELINE == 1, save the timeline
if [[ $RAY_TIMELINE == "1" ]]
then
  srun --job-name="timeline" --nodes=1 --ntasks=1 --cpus-per-task=1 -w "$node_1" \
      conda-run.sh "$conda_env" \
      "ray timeline"
  cp -n /tmp/ray-timeline-* ${temp_dir}
fi

# stop the ray cluster
srun --job-name="ray-stop" --nodes=1 --ntasks=1 --cpus-per-task=1 -w "$node_1" \
    conda-run.sh "$conda_env" \
    "ray stop"

# wait for everything so we don't cancel head/worker jobs that have not had time to clean up
wait
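
As in step 2, make the script executable:

chmod +x ray-on-conda-on-slurm.sh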

4. Finally, prepare a slurm submission script slurm-submit.sbatch:

#!/bin/bash
# shellcheck disable=SC2206
#SBATCH --job-name=ray-job
#SBATCH --output=ray-job-%j.log
#SBATCH --partition=compute
### Select number of nodes
#SBATCH --nodes=1
### Always run exclusive to avoid issues with conflicting ray servers (needs fixing in the future)
#SBATCH --exclusive
#SBATCH --cpus-per-task=48
#SBATCH --mem-per-cpu=2GB
### Give all resources to a single Ray task, ray can manage the resources internally
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=0
#SBATCH --time=01:00:00

# Load Delft Blue modules here
# NOTE: git and openssh are included to avoid breaking dependencies.
# NOTE: python is not required if it is included in the conda environment.
# If python is needed, load it before miniconda3 to avoid conda activation issues
# (the conda module is not found within internal python calls).
module load 2022r2 miniconda3 git openssh 

# Option: save timeline info for debugging (yes="1", no="0"); see https://docs.ray.io/en/latest/ray-core/troubleshooting.html
declare -x RAY_TIMELINE="0"

# Adapt this to your job requirements
. ray-on-conda-on-slurm.sh \
    --python_runfile="runstorage_ray.py" \
    --python_arguments="" \
    --conda_env="raytest" \
    --rundir="../code" \
    --temp_dir="/scratch/${USER}/ray"
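
Submit the job from the directory containing the scripts; the output file name follows the #SBATCH --output pattern above:

sbatch slurm-submit.sbatch
squeue -u $USER                 # check the job status
tail -f ray-job-<jobid>.log     # <jobid> is the SLURM job ID assigned at submission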

5. In your main python script invoked by the slurm script above, use the following lines:

import ray
ray.init()

Because the SLURM script exports the RAY_ADDRESS environment variable, there is no need to point ray.init() to the head node address explicitly.

Some notes on what this script does:

  • Launch a ray server on each node as a single task with the assigned number of cores.
  • On the head node, reserve 1 core for the main script.
  • Launch a python script on the head node (only!). This script dispatches tasks to all worker processes.
  • Clean up when done. This requires some hacks to suppress error codes, as ray stop does not exit cleanly (error codes 15 or -15).

Warning: additional configuration is needed to run multiple Ray servers on a single node. With the current script, running two jobs of 12 cores each is likely to cause issues, as both may be placed on the same node and their port assignments may clash.