Please read how to run preCICE simulations on distributed systems first. This page covers SLURM-managed clusters. If you want to run on a single machine instead, read how to run preCICE locally in parallel first.
SLURM and MPI
The SLURM workload manager is commonly used on clusters and is responsible for scheduling user-submitted jobs. These jobs describe
- the required resources (compute nodes, time, partitions),
- the environment of the job,
- the actual work as a series of shell commands.
The scheduler then distributes these jobs to maximise certain criteria, such as the utilisation of the entire cluster.
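For orientation, a job script combining these three parts could look like the following minimal sketch; the resource values, the partition and module names, and the solver executable are placeholders that you need to adapt to your cluster:

```bash
#!/bin/bash
# Required resources (placeholder values)
#SBATCH --job-name=coupled-sim
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=24
#SBATCH --time=01:00:00
#SBATCH --partition=standard      # partition names are cluster-specific

# Environment of the job (module names are cluster-specific)
module load mpi

# The actual work
mpirun ./solverA
```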
Once a job runs, SLURM sets environment variables that tell MPI which nodes it may use.
Executing `mpirun ...` in the job script then automatically distributes the job on all available nodes.
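You can inspect these variables from within a running job. The two shown below are standard SLURM output variables and are also used by the hostfile scripts further down this page:

```bash
# Print the node list and task count that SLURM provides to MPI
echo "$SLURM_JOB_NODELIST"                    # compact node list, e.g. node[01-06]
echo "$SLURM_TASKS_PER_NODE"                  # tasks per node
scontrol show hostname "$SLURM_JOB_NODELIST"  # expanded list, one hostname per line
```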
SLURM and partitioned simulations
Running multiple MPI jobs in parallel in a single job script is a very unusual use case, though, and it leads to problems.
Each invocation of `mpirun ...` uses the environment variables set by SLURM to determine which nodes to run on.
As every invocation sees the identical list of nodes, they double-allocate the nodes starting at node 1.
This leads to an increased workload on the double-allocated nodes, while some nodes remain unused (assuming a parallel coupling scheme).
Furthermore, this can lead to problems when building up the communication between participants.
Example of double allocation
```bash
mpirun -n 2 ./A &
mpirun -n 4 ./B
```
Nodes   | 1 | 2 | 3 | 4 | 5 | 6 |
--------|---|---|---|---|---|---|
A ranks | 0 | 1 |   |   |   |   |
B ranks | 0 | 1 | 2 | 3 |   |   |
In this case, nodes 1 and 2 are double-allocated, while nodes 5 and 6 aren’t used at all.
Partitioning available nodes
A viable remedy is to further partition the set of nodes provided by SLURM and to assign these partitions to the individual MPI runs using hostfiles.
Bash version
To generate hostfiles containing a list of all hosts in the notations of the common MPI implementations, use:
```bash
#!/bin/bash
# Remove leftover hostfiles from previous runs
rm -f hosts.intel hosts.ompi hosts.ms
for host in $(scontrol show hostname $SLURM_JOB_NODELIST); do
  # IntelMPI, MPICH, and MVAPICH2 use the colon notation
  echo "$host:$SLURM_TASKS_PER_NODE" >> hosts.intel
  # OpenMPI uses the slots notation
  echo "$host slots=$SLURM_TASKS_PER_NODE" >> hosts.ompi
  # MS-MPI uses a space notation
  echo "$host $SLURM_TASKS_PER_NODE" >> hosts.ms
done
```
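As an example, for an allocation of six nodes where `SLURM_TASKS_PER_NODE` expands to 24, the generated hosts.ompi contains one host per line (the hostnames here are made up):

```text
node01 slots=24
node02 slots=24
node03 slots=24
node04 slots=24
node05 slots=24
node06 slots=24
```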
If you have only 2 participants, you can partition the resulting file using `head` and `tail`:
```bash
# Example: 2 hosts for A, the rest for B
head -n 2 hosts.ompi > hosts.a
tail -n +3 hosts.ompi > hosts.b
```
If you have more participants, you can extract sections with `sed`:
```bash
# Distributing 9 nodes across 3 participants,
# i.e., three nodes per participant.
sed -n "1,3p" hosts.ompi > hosts.a
sed -n "4,6p" hosts.ompi > hosts.b
sed -n "7,9p" hosts.ompi > hosts.c
```
Python version
You can use this Python script to partition the SLURM allocation for a given MPI implementation and number of nodes per participant.
For the above example of running 3 participants with 3 nodes each using OpenMPI, use:
```bash
./slurm-split openmpi 3 3 3
```
This produces the following hostfiles in the current directory:

```text
hostfile.0
hostfile.1
hostfile.2
```
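The exact contents depend on the script, but with the OpenMPI slots notation from above, hostfile.0 would hold the first partition of three nodes, for example:

```text
node01 slots=24
node02 slots=24
node03 slots=24
```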
Running partitioned simulations
Once you have generated the necessary hostfiles, you can invoke `mpirun` multiple times.
MPI will automatically spawn as many processes as there are slots available in the given hostfile.
```bash
# Group runs to prevent a failure from wasting resources
set -m
(
  mpirun -hostfile hosts.a solverA &
  mpirun -hostfile hosts.b solverB &
  mpirun -hostfile hosts.c solverC &
  wait
)
echo "All participants succeeded"
```
If you want to ensure the exact number of ranks spawned, for scalability studies or similar, you can pass the number of ranks to run using the flag `-n`.
```bash
# Each solver runs on 24 ranks even though there may be more slots available
# Group runs to prevent a failure from wasting resources
set -m
(
  mpirun -n 24 -hostfile hosts.a solverA &
  mpirun -n 24 -hostfile hosts.b solverB &
  mpirun -n 24 -hostfile hosts.c solverC &
  wait
)
echo "All participants succeeded"
```