13.4. Running MPI applications
Error
This section represents old content from the <= v4.x FAQ that has not been properly converted to the new-style documentation. The content here was perfunctorily converted to RST, but it still needs to be:
Converted from a question-and-answer style to a regular documentation style (like the rest of these docs).
Removed from this section and folded into other sections in these docs.
To be clear, this section will eventually be deleted; do not write any new content in this section.
13.4.1. What prerequisites are necessary for running an Open MPI job?
In general, Open MPI requires that its executables are in your PATH on every node on which you will run, and if Open MPI was compiled with dynamic libraries (which is the default), the directory where its libraries are located must be in your LD_LIBRARY_PATH on every node.

For example, if Open MPI was installed with a prefix of /opt/openmpi, then the following should be in your PATH and LD_LIBRARY_PATH:
Environment variable    Value to add
PATH                    /opt/openmpi/bin
LD_LIBRARY_PATH         /opt/openmpi/lib
Error
TODO Josh H points out that we might also want to mention OMPIHOME for PRRTE’s .ini file here. Leaving this as a future to-do item, since PRRTE’s .ini file support does not exist yet.
Depending on your environment, you may need to set these values in your shell startup files (e.g., .bashrc, .cshrc, etc.).
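For example, with a Bourne-family shell you might add lines like these to your .bashrc (a sketch; the /opt/openmpi prefix is the example prefix from above, so adjust it to your actual installation):

# Add Open MPI to the search paths (example prefix: /opt/openmpi)
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH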
Note
There are exceptions to this rule — see this FAQ entry for a description of the --prefix option to mpirun.

See this FAQ entry for more details on how to add Open MPI to your PATH and LD_LIBRARY_PATH.
Additionally, Open MPI requires that jobs can be started on remote nodes without any input from the keyboard. For example, if using ssh as the remote agent, you must have your environment set up to allow execution on remote nodes without entering a password or passphrase.
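A quick check (the hostname is illustrative): run a trivial command on a remote node and verify that no password or passphrase prompt appears:

shell$ ssh node1.example.com hostname
node1.example.com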
13.4.2. What ABI guarantees does Open MPI provide?
See this section for a description of Open MPI’s versioning and ABI scheme. The short version is:
Open MPI is source code compatible across all versions. This means that you can compile and link your compliant MPI application against any version of Open MPI that supports the version of the MPI standard to which your application was written.
Open MPI provided forward application binary interface (ABI) compatibility within a major series for MPI applications starting with v1.3.2. Prior to that version, no ABI guarantees were provided.
Open MPI reserves the right to break ABI compatibility at new major release series.
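To see which Open MPI version (and therefore which ABI series) a given installation provides, you can ask mpirun directly:

shell$ mpirun --version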
13.4.3. Do I need a common filesystem on all my nodes?
No, but it certainly makes life easier if you do.
A common environment to run Open MPI is in a “Beowulf”-class or similar cluster (e.g., a bunch of 1U servers in a bunch of racks).
Simply stated, Open MPI can run on a group of servers or workstations connected by a network. As mentioned above, there are several prerequisites, however (for example, you typically must have an account on all the machines, you can ssh between the nodes without using a password, etc.).
Regardless of whether Open MPI is installed on a shared / networked filesystem or independently on each node, it is usually easiest if Open MPI is available in the same filesystem location on every node. For example, if you install Open MPI to /opt/openmpi-$ver_current on one node, ensure that it is available in /opt/openmpi-$ver_current on all nodes.

The where to install FAQ question contains some suggestions on where to install Open MPI.
13.4.4. How do I add Open MPI to my PATH and LD_LIBRARY_PATH?
Open MPI must be able to find its executables in your PATH on every node (if Open MPI was compiled as dynamic libraries, then its library path must appear in LD_LIBRARY_PATH as well). As such, your configuration/initialization files need to add Open MPI to your PATH / LD_LIBRARY_PATH properly.
How to do this may be highly dependent upon your local configuration; you may need to consult with your local system administrator. Some system administrators take care of these details for you, some don’t. YMMV. Some common examples are included below, however.
You must have at least a minimum understanding of how your shell works to get Open MPI in your PATH / LD_LIBRARY_PATH properly. Note that Open MPI must be added to your PATH and LD_LIBRARY_PATH in the following situations:

1. When you login to an interactive shell.

If your interactive login environment is not configured properly, executables like mpicc will not be found, and it is typically obvious what is wrong. The Open MPI executable directory can manually be added to the PATH, or the user’s startup files can be modified such that the Open MPI executables are added to the PATH every login. This latter approach is preferred.

All shells have some kind of script file that is executed at login time to set things like PATH and LD_LIBRARY_PATH and perform other environmental setup tasks. This startup file is the one that needs to be edited to add Open MPI to the PATH and LD_LIBRARY_PATH. Consult the manual page for your shell for specific details (some shells are picky about the permissions of the startup file, for example). The table below lists some common shells and the startup files that they read/execute upon login:

Shell                  Interactive login startup files
bash                   .bash_profile if it exists, or .bash_login if it
                       exists, or .profile if it exists (in that order).
                       Note that some Linux distributions automatically
                       come with .bash_profile scripts for users that
                       automatically execute .bashrc as well. Consult
                       the bash(1) man page for more information.
zsh                    .zshrc followed by .zshenv
sh (or Bash named sh)  .profile
csh                    .cshrc followed by .login
tcsh                   .tcshrc if it exists, .cshrc if it does not,
                       followed by .login

2. When you login to non-interactive shells on remote nodes.

If your non-interactive remote environment is not configured properly, executables like mpirun will not function properly, and it can be somewhat confusing to figure out why.

The startup files in question here are the ones that are automatically executed for a non-interactive login on a remote node (e.g., ssh othernode ps). Note that not all shells support this, and that some shells use different files for this than listed for interactive logins. Some shells will supersede non-interactive login startup files with files for interactive logins; that is, the non-interactive login startup file may automatically invoke the interactive login startup file. The following table lists some common shells and the startup file that is automatically executed, either by Open MPI or by the shell itself:

Shell                  Non-interactive login startup files
bash                   .bashrc if it exists
zsh                    .zshrc followed by .zshenv
sh (or Bash named sh)  This shell does not execute any file
                       automatically, so Open MPI will execute the
                       .profile script before invoking Open MPI
                       executables on remote nodes
csh                    .cshrc
tcsh                   .tcshrc if it exists, .cshrc if it does not
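A common bash pitfall is that many distribution-provided .bashrc files return early for non-interactive shells, skipping everything below the guard. A minimal sketch of a .bashrc that still works for non-interactive logins (the /opt/openmpi prefix is illustrative):

# Set the Open MPI paths *before* any interactive-only guard
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH

# Typical guard found in distribution .bashrc files; everything below
# it is skipped for non-interactive logins (e.g., "ssh node command")
case $- in
    *i*) ;;
    *) return ;;
esac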
13.4.5. What if I can’t modify my PATH and/or LD_LIBRARY_PATH?
There are some situations where you cannot modify the PATH or LD_LIBRARY_PATH — e.g., some ISV applications prefer to hide all parallelism from the user, and therefore do not want to make the user modify their shell startup files. Another case is where you want a single user to be able to launch multiple MPI jobs simultaneously, each with a different MPI implementation. Hence, setting shell startup files to point to one MPI implementation would be problematic.
In such cases, you have two options:
1. Use mpirun’s --prefix command line option (described below).
2. Modify the wrapper compilers to include directives that add run-time search locations for the Open MPI libraries.
mpirun’s --prefix command line option takes as an argument the top-level directory where Open MPI was installed. While relative directory names are possible, they can become ambiguous depending on the job launcher used; using absolute directory names is strongly recommended.
For example, say that Open MPI was installed into /opt/openmpi-$ver_current. You would use the --prefix option thusly:
shell$ mpirun --prefix /opt/openmpi-$ver_current -n 4 a.out
This will prefix the PATH and LD_LIBRARY_PATH on both the local and remote hosts with /opt/openmpi-$ver_current/bin and /opt/openmpi-$ver_current/lib, respectively. This is usually unnecessary when using resource managers to launch jobs (e.g., Slurm, Torque, etc.) because they tend to copy the entire local environment — including the PATH and LD_LIBRARY_PATH — to remote nodes before execution. As such, if PATH and LD_LIBRARY_PATH are set properly on the local node, the resource manager will automatically propagate those values out to remote nodes. The --prefix option is therefore usually most useful in ssh-based environments (or similar).
It is possible to make this the default behavior by passing to configure the flag --enable-mpirun-prefix-by-default. This will make mpirun behave exactly the same as mpirun --prefix $prefix ..., where $prefix is the value given to --prefix in configure.
Finally, note that specifying the absolute pathname to mpirun is equivalent to using the --prefix argument. For example, the following is equivalent to the above command line that uses --prefix:
shell$ /opt/openmpi-$ver_current/bin/mpirun -n 4 a.out
13.4.6. How do I launch Open MPI parallel jobs?
Similar to many MPI implementations, Open MPI provides the commands mpirun and mpiexec to launch MPI jobs. Several of the questions in this FAQ category deal with using these commands.

Note, however, that in Open MPI, mpirun and mpiexec are exactly identical. Specifically, they are symbolic links to a common back-end launcher command.
Note
The name of the back-end launcher command has changed over time (it used to be orterun; it is now prte). This back-end name is largely irrelevant to the user.

The rest of this FAQ usually refers only to mpirun, even though the same discussions also apply to mpiexec (because they are, in fact, the same command).
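You can verify this on your own installation; the exact paths and link targets will vary by version and install location:

shell$ ls -l $(which mpirun) $(which mpiexec)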
13.4.7. How do I run a simple SPMD MPI job?
Open MPI provides both mpirun and mpiexec commands. A simple way to start a single program, multiple data (SPMD) application in parallel is:
shell$ mpirun -n 4 my_parallel_application
This starts a four-process parallel application, running four copies of the executable named my_parallel_application.

The rsh starter component accepts the --hostfile option (and its synonym, the --machinefile option) to indicate on which hosts to start the processes:
shell$ cat my_hostfile
host01.example.com
host02.example.com
shell$ mpirun --hostfile my_hostfile -n 4 my_parallel_application
This command will launch one copy of my_parallel_application on each of host01.example.com and host02.example.com.

More information about the --hostfile option, and hostfiles in general, is available in this FAQ entry.
Note, however, that not all environments require a hostfile. For example, Open MPI will automatically detect when it is running in batch / scheduled environments (such as Slurm, PBS/Torque, SGE, LoadLeveler), and will use host information provided by those systems.
Also note that if using a launcher that requires a hostfile and no hostfile is specified, all processes are launched on the local host.
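As an example of the scheduled-environment case mentioned above, inside a Slurm allocation no hostfile is needed; Open MPI discovers the allocated hosts itself (a sketch; the allocation options are illustrative):

shell$ salloc -N 2
shell$ mpirun my_parallel_application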
13.4.8. How do I run an MPMD MPI job?
Both the mpirun and mpiexec commands support multiple program, multiple data (MPMD) style launches, either from the command line or from a file. For example:
shell$ mpirun -n 2 a.out : -n 2 b.out
This will launch a single parallel application, but the first two processes will be instances of the a.out executable, and the second two processes will be instances of the b.out executable. In MPI terms, this will be a single MPI_COMM_WORLD, but the a.out processes will be ranks 0 and 1 in MPI_COMM_WORLD, while the b.out processes will be ranks 2 and 3 in MPI_COMM_WORLD.
mpirun (and mpiexec) can also accept a parallel application specified in a file instead of on the command line. For example:
shell$ mpirun --app my_appfile
where the file my_appfile contains the following:
# Comments are supported; comments begin with #
# Application context files specify each sub-application in the
# parallel job, one per line. The first sub-application is the 2
# a.out processes:
-n 2 a.out
# The second sub-application is the 2 b.out processes:
-n 2 b.out
This will result in the same behavior as running a.out and b.out from the command line.
Note that mpirun and mpiexec are identical in command-line options and behavior; using the above command lines with mpiexec instead of mpirun will result in the same behavior.
13.4.9. How do I specify the hosts on which my MPI job runs?
There are three general mechanisms:
1. The --hostfile option to mpirun. Use this option to specify a list of hosts on which to run. Note that for compatibility with other MPI implementations, --machinefile is a synonym for --hostfile. See this FAQ entry for more information about the --hostfile option.

2. The --host option to mpirun. This option can be used to specify a list of hosts on which to run on the command line. See this FAQ entry for more information about the --host option.

3. Running in a scheduled environment. If you are running in a scheduled environment (e.g., in a Slurm, Torque, or LSF job), Open MPI will automatically get the list of hosts from the scheduler.
Important
The specification of hosts using any of the above methods has nothing to do with the network interfaces that are used for MPI traffic. The list of hosts is only used to specify on which hosts to launch MPI processes.
13.4.10. How can I diagnose problems when running across multiple hosts?
When you are able to run MPI jobs on a single host, but fail to run them across multiple hosts, try the following:
1. Ensure that your launcher is able to launch across multiple hosts. For example, if you are using ssh, try to ssh to each remote host and ensure that you are not prompted for a password. For example:

shell$ ssh remotehost hostname
remotehost

If you are unable to launch across multiple hosts, check that your SSH keys are set up properly. Or, if you are running in a managed environment, such as in a Slurm, Torque, or other job launcher, check that you have reserved enough hosts, are running in an allocated job, etc.
2. Ensure that your PATH and LD_LIBRARY_PATH are set correctly on each remote host on which you are trying to run. For example, with ssh:

shell$ ssh remotehost env | grep -i path
PATH=...path on the remote host...
LD_LIBRARY_PATH=...LD library path on the remote host...

If your PATH or LD_LIBRARY_PATH are not set properly, see this FAQ entry for the correct values. Keep in mind that it is fine to have multiple Open MPI installations installed on a machine; the first Open MPI installation found by PATH and LD_LIBRARY_PATH is the one that matters.

3. Run a simple, non-MPI job across multiple hosts. This verifies that the Open MPI run-time system is functioning properly across multiple hosts. For example, try running the hostname command:

shell$ mpirun --host remotehost hostname
remotehost
shell$ mpirun --host remotehost,otherhost hostname
remotehost
otherhost

If you are unable to run non-MPI jobs across multiple hosts, check for common problems such as:

- Check your non-interactive shell setup on each remote host to ensure that it is setting up the PATH and LD_LIBRARY_PATH properly.
- Check that Open MPI is finding and launching the correct version of Open MPI on the remote hosts.
- Ensure that you have firewalling disabled between hosts (Open MPI opens random TCP and sometimes random UDP ports between hosts in a single MPI job).
- Try running with the plm_base_verbose MCA parameter at level 10, which will enable extra debugging output to see how Open MPI launches on remote hosts. For example:

shell$ mpirun --mca plm_base_verbose 10 --host remotehost hostname
4. Now run a simple MPI job across multiple hosts that does not involve MPI communications. The hello_c program in the examples directory in the Open MPI distribution is a good choice. This verifies that the MPI subsystem is able to initialize and terminate properly. For example:

shell$ mpirun --host remotehost,otherhost hello_c
Hello, world, I am 0 of 2, (Open MPI v$ver_current, package: Open MPI jsquyres@example.com Distribution, ident: $ver_current, DATE)
Hello, world, I am 1 of 2, (Open MPI v$ver_current, package: Open MPI jsquyres@example.com Distribution, ident: $ver_current, DATE)

If you are unable to run simple, non-communication MPI jobs, this can indicate that your Open MPI installation is unable to initialize properly on remote hosts. Double check your non-interactive login setup on remote hosts.
5. Now run a simple MPI job across multiple hosts that does some simple MPI communications. The ring_c program in the examples directory in the Open MPI distribution is a good choice. This verifies that the MPI subsystem is able to pass MPI traffic across your network. For example:

shell$ mpirun --host remotehost,otherhost ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
If you are unable to run simple MPI jobs across multiple hosts, this may indicate a problem with the network(s) that Open MPI is trying to use for MPI communications. Try limiting the networks that it uses, and/or exploring levels 1 through 3 MCA parameters for the communications module that you are using. For example, if you’re using the TCP BTL, see the output of:

shell$ ompi_info --level 3 --param btl tcp
13.4.11. I get errors about missing libraries. What should I do?
When building Open MPI with compilers whose support libraries are in non-default search path locations, you may see errors about those support libraries when trying to launch MPI applications if the corresponding environment was not set up properly.
For example, you may see warnings similar to the following:
# With the Intel compiler suite
shell$ mpirun -n 1 --host node1.example.com mpi_hello
prted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
...more error messages...
# With the PGI compiler suite
shell$ mpirun -n 1 --host node1.example.com mpi_hello
prted: error while loading shared libraries: libpgcc.so: cannot open shared object file: No such file or directory
...more error messages...
# With the PathScale compiler suite
shell$ mpirun -n 1 --host node1.example.com mpi_hello
prted: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
...more error messages...
Specifically, Open MPI first attempts to launch a “helper” daemon prted on node1.example.com, but it fails because one of prted’s dependent libraries cannot be found. The libraries shown above (libimf.so, libpgcc.so, and libmv.so) are specific to their compiler suites (Intel, PGI, and PathScale, respectively). As such, it is likely that the user did not set up the compiler library environment properly on this node.
Double check that you have setup the appropriate compiler environment on the target node, for both interactive and non-interactive logins.
Note
It is a common error to have the compiler environment set up properly for interactive logins, but not for non-interactive logins.
Here’s an example of a user-compiled MPI application working fine locally, but failing when invoked non-interactively on a remote node:
# Compile a trivial MPI application
head_node$ cd $HOME
head_node$ mpicc mpi_hello.c -o mpi_hello
# Run it locally; it works fine
head_node$ ./mpi_hello
Hello world, I am 0 of 1.
# Run it remotely interactively; it works fine
head_node$ ssh node2.example.com
Welcome to node2.
node2$ ./mpi_hello
Hello world, I am 0 of 1.
node2$ exit
# Run it remotely *NON*-interactively; it fails
head_node$ ssh node2.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
In cases like this, check your shell startup files and verify that the appropriate compiler environment is set up properly for non-interactive logins.
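One sketch of a fix, assuming bash and an Intel compiler installation (the path below is purely illustrative; consult your compiler’s documentation for its real library location or environment setup script):

# In ~/.bashrc, before any interactive-only guard: make the compiler's
# runtime libraries findable by non-interactive logins on this node
export LD_LIBRARY_PATH=/opt/intel/lib/intel64:$LD_LIBRARY_PATH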
13.4.12. Can I run non-MPI programs with mpirun / mpiexec?
Yes.
For example:
shell$ mpirun -n 2 --host a,b uptime
This will launch a copy of the Unix command uptime on the hosts a and b.
Other questions in this FAQ section deal with the specifics of the mpirun command line interface; suffice it to say that it works equally well for MPI and non-MPI applications.
13.4.13. Can I run GUI applications with Open MPI?
Yes, but it will depend on your local setup and may require additional setup.
In short: you will need to have graphics forwarding (e.g., X11 forwarding) enabled from the remote processes to the display where you want output to appear. In a secure environment, you can simply allow all X requests to be shown on the target display and set the DISPLAY environment variable in all MPI processes’ environments to the target display, perhaps something like this:
shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +
shell$ mpirun -n 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out
However, this technique is not generally suitable for unsecure environments (because it allows anyone to read and write to your display). A slightly more secure way is to only allow X connections from the nodes where your application will be running:
shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +compute1 +compute2 +compute3 +compute4
compute1 being added to access control list
compute2 being added to access control list
compute3 being added to access control list
compute4 being added to access control list
shell$ mpirun -n 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out
(assuming that the four nodes you are running on are compute1 through compute4).
Other methods are available, but they involve sophisticated X forwarding through mpirun and are generally more complicated than desirable.
13.4.14. Can I run ncurses-based / curses-based applications with funky input schemes with Open MPI?
Maybe. But probably not.
Open MPI provides fairly sophisticated stdin / stdout / stderr forwarding. However, it does not work well with curses, ncurses, readline, or other sophisticated I/O packages that generally require direct control of the terminal.
Every application and I/O library is different — you should try to see if yours is supported. But chances are that it won’t work.
Sorry. :-(
13.4.15. What other options are available to mpirun?

mpirun supports the --help option, which provides a usage message and a summary of the options that it supports. It should be considered the definitive list of what options are provided.
Several notable options are:
--hostfile: Specify a hostfile for launchers (such as the rsh launcher) that need to be told on which hosts to start parallel applications. Note that for compatibility with other MPI implementations, --machinefile is a synonym for --hostfile.

--host: Specify a host or list of hosts to run on (see this FAQ entry for more details).

-n: Indicate the number of processes to start.

--mca: Set MCA parameters (see how to set MCA params for more details).

--wdir DIRECTORY: Set the working directory of the started applications. If not supplied, the current working directory is assumed (or $HOME, if the current working directory does not exist on all nodes).

-x ENV_VARIABLE_NAME: The name of an environment variable to export to the parallel application. The -x option can be specified multiple times to export multiple environment variables to the parallel application.
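A combined example using several of these options (the hostfile, variable, and program names are illustrative):

shell$ mpirun --hostfile my_hostfile -n 4 --wdir /tmp -x DEBUG_LEVEL=2 my_app

This runs 4 copies of my_app on the hosts listed in my_hostfile, sets each process’s working directory to /tmp, and exports DEBUG_LEVEL to each process’s environment.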
13.4.16. How do I use the --hostfile option to mpirun?
Error
TODO For cross reference, this is the PRRTE man page section about --hostfile: https://github.com/openpmix/prrte/blob/b70a6f1a8d424e396c40c999a656b04e76cc0f91/src/tools/prte/prte-map.1.md?plain=1#L236

A subsequent commit removed this markdown file; the commit message refers to moving the markdown file to another git repo, but I didn’t chase down where it went.
The --hostfile option to mpirun takes a filename that lists hosts on which to launch MPI processes.
Important
The hosts listed in a hostfile have nothing to do with which network interfaces are used for MPI communication. They are only used to specify on which hosts to launch MPI processes.
Hostfiles are simple text files with hosts specified, one per line. Each host can also specify a default and maximum number of slots to be used on that host (i.e., the maximum number of processes that will be launched on that node). Comments are also supported, and blank lines are ignored. For example:
# This is an example hostfile. Comments begin with #.
#
# Since no slots are specified, the number of slots defaults to the
# number of processor cores available on the machine.
foo.example.com
# We want to allow launching a maximum of 2 processes on this host
# (e.g., potentially because it has two processor cores):
bar.example.com slots=2
Slots are discussed in much more detail in this FAQ entry.
Hostfiles work in two different ways:
1. Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a hostfile or a batch scheduler such as Slurm, PBS/Torque, SGE, etc.), the hosts provided by the hostfile must be in the already-provided host list. If the hostfile-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

In this case, hostfiles act like an exclusionary filter — they limit the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.

For example, say that a scheduler job contains hosts node01 through node04. If you run:

shell$ cat my_hosts
node03
shell$ mpirun -n 1 --hostfile my_hosts hostname

This will run a single copy of hostname on the host node03.

However, presuming your job was allocated only to node03 and you run the following:

shell$ cat my_hosts
node17
shell$ mpirun -n 1 --hostfile my_hosts hostname

This is an error (because node17 is not allocated to your job), and mpirun will abort.

Finally, note that in exclusionary mode, processes will only be executed on the hostfile-specified hosts. If this ends up causing an oversubscription situation, mpirun will abort by default.

2. Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --hostfile option will be used as the original and final host list.

In this case, --hostfile acts as an inclusionary agent; all --hostfile-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

shell$ cat my_hosts
node01.example.com slots=1
node02.example.com slots=1
node03.example.com slots=1
shell$ mpirun -n 3 --hostfile my_hosts hostname

This will launch a single copy of hostname on each of the hosts node01.example.com, node02.example.com, and node03.example.com.
Note, too, that --hostfile is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --hostfile can be specified multiple times:
shell$ cat hostfile_1
node01.example.com
shell$ cat hostfile_2
node02.example.com
shell$ mpirun -n 1 --hostfile hostfile_1 hostname : -n 1 --hostfile hostfile_2 uptime
node01.example.com
06:11:45 up 1 day, 2:32, 0 users, load average: 21.65, 20.85, 19.84
Notice that hostname was launched on node01.example.com and uptime was launched on node02.example.com.
13.4.17. How do I use the --host option to mpirun?
The --host option to mpirun takes a comma-delimited list of hosts on which to run. For example:
shell$ mpirun -n 3 --host a,b,c hostname
This will launch one copy of hostname on each of the hosts a, b, and c. Specifically: each host defaults to 1 slot, unless specified by the :N suffix. For example:
shell$ mpirun --host a,b:2,c:3 hostname
This will launch one copy of hostname on a, two copies of hostname on b, and three copies of hostname on c.
Slots are discussed in much more detail in this FAQ entry.
Important
The hosts specified by the --host option have nothing to do with which network interfaces are used for MPI communication. They are only used to specify on which hosts to launch MPI processes.
--host works in two different ways:
1. Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a hostfile or a batch scheduler such as Slurm, PBS/Torque, SGE, etc.), the hosts provided by the --host option must be in the already-provided host list. If the --host-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

In this case, the --host option acts like an exclusionary filter — it limits the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.

For example, say that the hostfile my_hosts contains the hosts node1 through node4. If you run:

shell$ mpirun -n 1 --hostfile my_hosts --host node3 hostname

This will run a single copy of hostname on the host node3. However, if you run:

shell$ mpirun -n 1 --hostfile my_hosts --host node17 hostname

This is an error (because node17 is not listed in my_hosts); mpirun will abort.

Finally, note that in exclusionary mode, processes will only be executed on the --host-specified hosts. If this ends up causing an oversubscription situation, mpirun will abort by default.

2. Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --host option will be used as the original and final host list.

In this case, --host acts as an inclusionary agent; all --host-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

shell$ mpirun -n 3 --host a,b,c hostname

This will launch a single copy of hostname on each of the hosts a, b, and c.
Note, too, that --host is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --host can be specified multiple times:
shell$ mpirun -n 1 --host a hostname : -n 1 --host b uptime
This will launch hostname on host a and uptime on host b.
13.4.18. What are “slots”?
Slots are Open MPI’s representation of how many processes can be launched on a given host.
Open MPI maintains the number of slots for each host in a given parallel job, and — by default — will not let you launch more processes on a host than it has slots.
Important
It is common to set the number of slots on a host to be less than or equal to the number of processor cores on that host.
But it is important to realize that Open MPI’s concept of slots is actually unrelated to the number of physical processor cores on a host.
Specifically: the number of slots on a host can be less than, equal to, or more than the number of processor cores on a host.
If you wish to run more processes on a host than it has slots, see the FAQ entry on oversubscription.
13.4.19. How are the number of slots calculated?
The number of slots on a host depends on a few factors:
1. If the host is specified by a job scheduler (e.g., Slurm, PBS/Torque, etc.), the job scheduler specifies the number of slots for that host.

2. If the host is specified in a hostfile:

   - If the slots parameter is specified, that value is used for the number of slots on that host.
   - Otherwise:
     - If --map-by :HWTCPUS was specified, the number of slots defaults to the number of hardware threads on that host.
     - Otherwise, the number of slots defaults to the number of processor cores on that host.

3. If the host is specified via the --host command line option:

   - If the :N suffix is specified, N is used for the number of slots on that host.
   - Otherwise, the number of slots defaults to 1.
   - If the same host name is specified multiple times, the slots value for that host is increased by N if :N is specified, or increased by 1 if :N is not specified.
Caution
The exact scheme used to determine the number of slots has varied between different major versions of Open MPI. The scheme described above is relevant for Open MPI v5.0.x.
Max slot counts, however, are rarely specified by schedulers. The max slot count for each node will default to “infinite” if it is not provided (meaning that Open MPI will oversubscribe the node if you ask it to — see more on oversubscribing in this FAQ entry).
Error
TODO Ralph: do we still have the concept of “max slots”? Issue is open: https://github.com/openpmix/prrte/issues/770.
Here are some examples, all from unscheduled environments:
1. Use a hostfile and specify the slots parameter:

shell$ cat my-hostfile
node01.example.com slots=4
shell$ mpirun --hostfile my-hostfile hostname
node01
node01
node01
node01

This launched 4 processes because slots=4 was specified in the hostfile.

2. Use a hostfile and do not specify the slots parameter (assume that node01.example.com has 2 processor cores):

shell$ cat my-hostfile
node01.example.com
shell$ mpirun --hostfile my-hostfile hostname
node01
node01

This launched 2 processes because slots was not specified, and node01.example.com has 2 processor cores.

3. Use --host:

shell$ mpirun --host node01.example.com hostname
node01

This launched 1 process because --host with no :N suffix increments the slot count for that host by 1.

4. Use --host with a :N suffix:

shell$ mpirun --host node01.example.com:2 hostname
node01
node01

This launched 2 processes because :2 was specified on the command line.

5. Use --host with a :N suffix, and mention the host multiple times:

shell$ mpirun --host node01.example.com:2,node01.example.com hostname
node01
node01
node01

This launched 3 processes because :2 was specified on the command line, and then node01.example.com was specified an additional time, incrementing the slot count for that host to 3.
13.4.20. How do I control how my processes are scheduled across hosts?
The short version is that if you are not oversubscribing your hosts (i.e., trying to run more processes than slots available on that host), scheduling is pretty simple and occurs either on a by-slot or by-node round robin schedule. If you’re oversubscribing, the issue gets much more complicated — keep reading.
The more complete answer is: Open MPI schedules processes to nodes by asking two questions from each application on the mpirun command line:

1. How many processes should be launched?
2. Where should those processes be launched?

The “how many” question is directly answered with the -n switch to mpirun. If -n is not specified on the mpirun command line, its value is the sum of the slots on all the nodes.
The “where” question is a little more complicated, and depends on three factors:

1. The final node list (e.g., after --hostfile / --host exclusionary or inclusionary processing)
2. The scheduling policy (which applies to all applications in a single job)
3. The default and maximum number of slots on each host
Error
TODO Ralph: do we still have the concept of “max slots”? Issue is open: https://github.com/openpmix/prrte/issues/770.
Open MPI currently supports two scheduling policies: by slot and by node:
1. By slot: This is the default scheduling policy, but it can also be explicitly requested by using either the --map-by slot option to mpirun or by setting the MCA parameter rmaps_default_mapping_policy to the string slot.

In this mode, Open MPI will schedule processes on a node until all of its default slots are exhausted before proceeding to the next node. In MPI terms, this means that Open MPI tries to maximize the number of adjacent ranks in MPI_COMM_WORLD on the same host without oversubscribing that host.

For example:

shell$ cat my-hosts
node0 slots=2 max_slots=20
node1 slots=2 max_slots=20
shell$ mpirun --hostfile my-hosts -n 8 --map-by slot hello | sort
Hello World I am rank 0 of 8 running on node0
Hello World I am rank 1 of 8 running on node0
Hello World I am rank 2 of 8 running on node1
Hello World I am rank 3 of 8 running on node1
Hello World I am rank 4 of 8 running on node0
Hello World I am rank 5 of 8 running on node0
Hello World I am rank 6 of 8 running on node1
Hello World I am rank 7 of 8 running on node1
2. By node: This policy can be requested either by using the --map-by node option to mpirun or by setting the MCA parameter rmaps_default_mapping_policy to the string node.

In this mode, Open MPI will schedule a single process on each node in a round-robin fashion (looping back to the beginning of the node list as necessary) until all processes have been scheduled. Nodes are skipped once their default slot counts are exhausted.

For example:

shell$ cat my-hosts
node0 slots=2 max_slots=20
node1 slots=2 max_slots=20
shell$ mpirun --hostfile my-hosts -n 8 --map-by node hello | sort
Hello World I am rank 0 of 8 running on node0
Hello World I am rank 1 of 8 running on node1
Hello World I am rank 2 of 8 running on node0
Hello World I am rank 3 of 8 running on node1
Hello World I am rank 4 of 8 running on node0
Hello World I am rank 5 of 8 running on node1
Hello World I am rank 6 of 8 running on node0
Hello World I am rank 7 of 8 running on node1
In both policies, if the default slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will trigger an oversubscription condition.
If :OVERSUBSCRIBE is added as a modifier to the --map-by option (e.g., mpirun --map-by node:OVERSUBSCRIBE ... — see this FAQ item for more details), Open MPI will continue to loop through the list of nodes again and try to schedule one more process to each node until all processes are scheduled. Nodes are skipped in this process if their maximum slot count is exhausted. If the maximum slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will abort without launching any processes.
If :OVERSUBSCRIBE is not specified and an oversubscription condition occurs, Open MPI will abort without launching any processes.
13.4.21. Can I oversubscribe nodes (run more processes than processors)?
Yes. But it very much matters how you do it.
Specifically: it is critical that Open MPI knows that you are oversubscribing the node, or severe performance degradation can result.
Important
Here is a good general rule to follow: never specify a number of slots that is more than the available number of processors.
If you want to run 4 processes on a host with 2 processor cores, indicate that you only have 2 slots but want to run 4 processes. For example:
# In a hostfile, the number of slots will default to the number of
# processor cores on the host
shell$ cat my-hostfile
localhost
shell$ mpirun -n 4 --hostfile my-hostfile a.out
Specifically: we strongly suggest that you do NOT have a hostfile that contains slots=4 (because there are only two available processor cores).

That being said, the above command will fail, because you are trying to run 4 processes but there are only 2 slots available. You must specifically tell Open MPI that it is OK to oversubscribe via --map-by :OVERSUBSCRIBE:
shell$ cat my-hostfile
# For the purposes of this example, explicitly tell Open MPI
# that we have 2 slots on the host.
localhost slots=2
shell$ mpirun -n 4 --hostfile my-hostfile --map-by :OVERSUBSCRIBE a.out
The reason you should tell Open MPI whether you’re oversubscribing or not (i.e., never specify a slots value more than the number of processor cores available) is that Open MPI basically runs its message passing progression engine in two modes: aggressive and degraded.
Degraded: When Open MPI thinks that it is in an oversubscribed mode (i.e., more processes are running than there are processor cores available), MPI processes will automatically run in degraded mode and frequently yield the processor to its peers, thereby allowing all processes to make progress.
Note
Be sure to see this FAQ entry that describes how degraded mode affects processor and memory affinity.
Aggressive: When Open MPI thinks that it is in an exactly- or under-subscribed mode (i.e., the number of running processes is equal to or less than the number of available processor cores), MPI processes will automatically run in aggressive mode, meaning that they will never voluntarily give up the processor to other processes. With some network transports, this means that Open MPI will spin in tight loops attempting to make message passing progress, effectively causing other processes to not get any CPU cycles (and therefore never make any progress).
For example, on a node with two processor cores:
shell$ cat my-hostfile
localhost slots=4
shell$ mpirun -n 4 --hostfile my-hostfile a.out
This would cause all 4 MPI processes to run in aggressive mode because Open MPI thinks that there are 4 available processor cores to use. This is actually a lie (there are only 2 processor cores — not 4), and can cause extremely bad performance.
13.4.22. Can I force Aggressive or Degraded performance modes?
Yes.
The MCA parameter mpi_yield_when_idle controls whether an MPI process runs in Aggressive or Degraded performance mode. Setting it to 0 forces Aggressive mode; setting it to 1 forces Degraded mode (see this FAQ entry to see how to set MCA parameters).
Note that this value only affects the behavior of MPI processes when they are blocking in MPI library calls. It does not affect behavior of non-MPI processes, nor does it affect the behavior of a process that is not inside an MPI library call.
Open MPI normally sets this parameter automatically (see this FAQ entry for details). Users are cautioned against setting this parameter unless you are really, absolutely, positively sure of what you are doing.
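For example, to force Degraded mode for a single run, you can set the parameter on the mpirun command line (using the --mca syntax described earlier; a.out is a placeholder for your application):

shell$ mpirun --mca mpi_yield_when_idle 1 -n 4 a.out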
13.4.23. How do I run with the TotalView parallel debugger?
This has changed with different releases of TotalView and Open MPI; it is best to consult TotalView’s documentation for how you should debug Open MPI applications with TotalView.
13.4.24. How do I run with the DDT parallel debugger?
This has changed with different releases of DDT and Open MPI; it is best to consult DDT’s documentation for how you should debug Open MPI applications with DDT.
13.4.25. How do I dynamically load libmpi at runtime?
If you want to load the shared library libmpi explicitly at runtime, either by using dlopen() from C/C++ or something like the ctypes package from Python, some extra care is required. The default configuration of Open MPI uses dlopen() internally to load its support components. These components rely on symbols available in libmpi. In order to make the symbols in libmpi available to the components loaded by Open MPI at runtime, libmpi must be loaded with the RTLD_GLOBAL option.
In C/C++, this option is specified as the second parameter to the POSIX dlopen(3) function.
When using ctypes with Python, this can be done with the second (optional) parameter to CDLL(). For example (shown below on Mac OS X, where Open MPI’s shared library name ends in .dylib; other operating systems use other suffixes, such as .so):
from ctypes import *
mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)
f = pythonapi.Py_GetArgcArgv
argc = c_int()
argv = POINTER(c_char_p)()
f(byref(argc), byref(argv))
mpi.MPI_Init(byref(argc), byref(argv))
# Your MPI program here
mpi.MPI_Finalize()
Other scripting languages should have similar options when dynamically loading shared libraries.
13.4.26. What MPI environment variables exist?
Open MPI provides the following environment variables that will be defined on every MPI process:
OMPI_COMM_WORLD_SIZE: the number of processes in this process’s MPI_COMM_WORLD

OMPI_COMM_WORLD_RANK: the MPI rank of this process in MPI_COMM_WORLD

OMPI_COMM_WORLD_LOCAL_SIZE: the number of ranks from this job that are running on this node

OMPI_COMM_WORLD_LOCAL_RANK: the relative rank of this process on this node within its job. For example, if four processes in a job share a node, they will each be given a local rank ranging from 0 to 3.

OMPI_UNIVERSE_SIZE: the number of process slots allocated to this job. Note that this may be different than the number of processes in the job.

OMPI_COMM_WORLD_NODE_RANK: the relative rank of this process on this node looking across all jobs.
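Since mpirun can also launch non-MPI programs (as noted earlier), you can inspect these variables with a plain shell command; the output order may vary between runs:

shell$ mpirun -n 2 sh -c 'echo rank $OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE'
rank 0 of 2
rank 1 of 2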