Please read how to run preCICE simulations on distributed systems first. This page covers SLURM-managed clusters. If you want to run on a single machine instead, read how to run preCICE locally in parallel first.
SLURM and MPI
The SLURM workload manager is commonly used on clusters and is responsible for scheduling user-submitted jobs. These jobs describe
- the required resources (compute nodes, time, partitions),
- the environment of the job,
- the actual work as a series of shell commands.
The scheduler then distributes these jobs to maximise certain criteria, such as the utilisation of the entire cluster.
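For orientation, a job script combining these three parts could look like the following minimal sketch; the resource values, the partition and module names, and the solver executable are placeholders that you need to adapt to your cluster:

```bash
#!/bin/bash
# Required resources (placeholder values)
#SBATCH --job-name=coupled-sim
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=24
#SBATCH --time=01:00:00
#SBATCH --partition=standard      # partition names are cluster-specific

# Environment of the job (module names are cluster-specific)
module load mpi

# The actual work
mpirun ./solverA
```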
Once a job runs, SLURM sets environment variables that tell MPI which nodes it may use.
Executing `mpirun ...` in the job script then automatically distributes the job on all available nodes.
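You can inspect these variables from within a running job. The two shown below are standard SLURM output variables and are also used by the hostfile scripts further down this page:

```bash
# Print the node list and task count that SLURM provides to MPI
echo "$SLURM_JOB_NODELIST"                    # compact node list, e.g. node[01-06]
echo "$SLURM_TASKS_PER_NODE"                  # tasks per node
scontrol show hostname "$SLURM_JOB_NODELIST"  # expanded list, one hostname per line
```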
SLURM and partitioned simulations
Running multiple MPI jobs in parallel in a single job script is a very unusual use case, though, and it leads to problems.
Each invocation of `mpirun ...` uses the environment variables set by SLURM to determine which nodes to run on.
As every invocation sees the identical list of nodes, they double-allocate the nodes starting at node 1.
This leads to an increased workload on the double-allocated nodes, while some nodes remain unused (assuming a parallel coupling scheme).
Furthermore, this can lead to problems when building up the communication between participants.
Example of double allocation
```bash
mpirun -n 2 ./A &
mpirun -n 4 ./B
```
Nodes   | 1 | 2 | 3 | 4 | 5 | 6 |
--------|---|---|---|---|---|---|
A ranks | 0 | 1 |   |   |   |   |
B ranks | 0 | 1 | 2 | 3 |   |   |
In this case, nodes 1 and 2 are double-allocated, while nodes 5 and 6 aren’t used at all.
Partitioning available nodes
A viable remedy is to further partition the set of nodes provided by SLURM and to assign these partitions to the individual MPI runs using hostfiles.
Bash version
To generate hostfiles containing a list of all hosts in the notations of the common MPI implementations, use:
```bash
#!/bin/bash
# Remove leftover hostfiles from previous runs
rm -f hosts.intel hosts.ompi hosts.ms
for host in $(scontrol show hostname $SLURM_JOB_NODELIST); do
  # IntelMPI, MPICH, and MVAPICH2 use the colon notation
  echo "$host:$SLURM_TASKS_PER_NODE" >> hosts.intel
  # OpenMPI uses the slots notation
  echo "$host slots=$SLURM_TASKS_PER_NODE" >> hosts.ompi
  # MS-MPI uses a space notation
  echo "$host $SLURM_TASKS_PER_NODE" >> hosts.ms
done
```
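As an example, for an allocation of six nodes where `SLURM_TASKS_PER_NODE` expands to 24, the generated hosts.ompi contains one host per line (the hostnames here are made up):

```text
node01 slots=24
node02 slots=24
node03 slots=24
node04 slots=24
node05 slots=24
node06 slots=24
```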
If you have only 2 participants, you can partition the resulting file using `head` and `tail`:
```bash
# Example: 2 hosts for A, the rest for B
head -n 2 hosts.ompi > hosts.a
tail -n +3 hosts.ompi > hosts.b
```
If you have more participants, you can extract sections with `sed`:
```bash
# Distributing 9 nodes across 3 participants,
# i.e., three nodes per participant.
sed -n "1,3p" hosts.ompi > hosts.a
sed -n "4,6p" hosts.ompi > hosts.b
sed -n "7,9p" hosts.ompi > hosts.c
```
Python version
You can use this Python script to partition the SLURM allocation for a given MPI implementation and number of nodes per participant.
For the above example of running 3 participants with 3 nodes each using OpenMPI, use:
```bash
./slurm-split openmpi 3 3 3
```
This produces the following hostfiles in the current directory:

```text
hostfile.0
hostfile.1
hostfile.2
```
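The exact contents depend on the script, but with the OpenMPI slots notation from above, hostfile.0 would hold the first partition of three nodes, for example:

```text
node01 slots=24
node02 slots=24
node03 slots=24
```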
Running partitioned simulations
Once you have generated the necessary hostfiles, you can invoke `mpirun` multiple times.
MPI will automatically spawn as many processes as there are slots available in the given hostfile.
```bash
# Group runs to prevent a failure from wasting resources
set -m
(
  mpirun -hostfile hosts.a solverA &
  mpirun -hostfile hosts.b solverB &
  mpirun -hostfile hosts.c solverC &
  wait
)
echo "All participants succeeded"
```
If you want to ensure the exact number of ranks spawned, for scalability studies or similar, you can pass the number of ranks to run using the flag `-n`.
```bash
# Each solver runs on 24 ranks even though there may be more slots available
# Group runs to prevent a failure from wasting resources
set -m
(
  mpirun -n 24 -hostfile hosts.a solverA &
  mpirun -n 24 -hostfile hosts.b solverB &
  mpirun -n 24 -hostfile hosts.c solverC &
  wait
)
echo "All participants succeeded"
```