13.7. Large Clusters
Error
This section represents old content from the <= v4.x FAQ that has not been properly converted to the new-style documentation. The content here was perfunctorily converted to RST, but it still needs to be:
Converted from a question-and-answer style to a regular documentation style (like the rest of these docs).
Removed from this section and folded into other sections in these docs.
To be clear, this section will eventually be deleted; do not write any new content in this section.
13.7.1. How do I reduce startup time for jobs on large clusters?
There are several ways to reduce the startup time on large clusters. Some of them are described on this page. We continue to work on making startup even faster, especially on the large clusters coming in future years.
Open MPI v5.0.0rc10 is significantly faster and more robust than its predecessors. We recommend that anyone running large jobs and/or on large clusters make the upgrade to the v5.0.x series.
Several major launch time enhancements have been made starting with the v3.0 release. Most of these take place in the background — i.e., there is nothing you (as a user) need do to take advantage of them. However, there are a few that are left as options until we can assess any potential negative impacts on different applications.
Some options are available when launching via mpirun
or when launching using
the native resource manager launcher (e.g., srun
in a Slurm environment).
These are activated by setting the corresponding MCA parameter, and include:
Setting the
pmix_base_async_modex
MCA parameter will eliminate a global out-of-band collective operation duringMPI_INIT
. This operation is performed in order to share endpoint information prior to communication. At scale, this operation can take some time and scales at best logarithmically. Setting the parameter bypasses the operation and causes the system to lookup the endpoint information for a peer only at first message. Thus, instead of collecting endpoint information for all processes, only the endpoint information for those processes this peer communicates with will be retrieved. The parameter is especially effective for applications with sparse communication patterns — i.e., where a process only communicates with a few other peers. Applications that use dense communication patterns (i.e., where a peer communicates directly to all other peers in the job) will probably see a negative impact of this option.Note
This option is only available in PMIx-supporting environments, or when launching via
mpirun
The
async_mpi_init
parameter is automatically set totrue
when thepmix_base_async_modex
parameter has been set, but can also be independently controlled. When set totrue
, this parameter causesMPI_Init
to skip an out-of-band barrier operation at the end of the procedure that is not required whenever direct retrieval of endpoint information is being used.Similarly, the
async_mpi_finalize
parameter skips an out-of-band barrier operation usually performed at the beginning ofMPI_FINALIZE
. Some transports (e.g., theusnic
BTL) require this barrier to ensure that all MPI messages are completed prior to finalizing, while other transports handle this internally and thus do not require the additional barrier. Check with your transport provider to be sure, or you can experiment to determine the proper setting.
13.7.2. Where should I put my libraries: Network vs. local filesystems?
Open MPI itself doesn’t really care where its libraries and plugins are stored. However, where they are stored does have an impact on startup times, particularly for large clusters, which can be mitigated somewhat through use of Open MPI’s configuration options.
Startup times will always be minimized by storing the libraries and plugins local to each node, either on local disk or in ramdisk. The latter is sometimes problematic since the libraries do consume some space, thus potentially reducing memory that would have been available for MPI processes.
There are two main considerations for large clusters that need to place the Open MPI libraries on networked file systems:
While dynamic shared objects (“DSO”) are more flexible, you definitely do not want to use them when the Open MPI libraries will be mounted on a network file system! Doing so will lead to significant network traffic and delayed start times, especially on clusters with a large number of nodes. Instead, be sure to configure your build with
--disable-dlopen
. This will include the DSO’s in the main libraries, resulting in much faster startup times.Many networked file systems use automount for user level directories, as well as for some locally administered system directories. There are many reasons why system administrators may choose to automount such directories. MPI jobs, however, tend to launch very quickly, thereby creating a situation wherein a large number of nodes will nearly simultaneously demand automount of a specific directory. This can overload NFS servers, resulting in delayed response or even failed automount requests.
Note that this applies to both automount of directories containing Open MPI libraries as well as directories containing user applications. Since these are unlikely to be the same location, multiple automount requests from each node are possible, thus increasing the level of traffic.
13.7.4. How do I reduce the time to wireup OMPI’s out-of-band communication system?
Open MPI’s run-time uses an out-of-band (OOB) communication
subsystem to pass messages during the launch, initialization, and
termination stages for the job. These messages allow mpirun
to tell
its daemons what processes to launch, and allow the daemons in turn to
forward stdio to mpirun
, update mpirun
on process status, etc.
The OOB uses TCP sockets for its communication, with each daemon
opening a socket back to mpirun
upon startup. In a large cluster,
this can mean thousands of connections being formed on the node where
mpirun
resides, and requires that mpirun
actually process all
these connection requests. mpirun
defaults to processing
connection requests sequentially — so on large clusters, a
backlog can be created that can cause remote daemons to timeout
waiting for a response.
Fortunately, Open MPI provides an alternative mechanism for processing
connection requests that helps alleviate this problem. Setting the MCA
parameter oob_tcp_listen_mode
to listen_thread
causes
mpirun
to startup a separate thread dedicated to responding to
connection requests. Thus, remote daemons receive a quick response to
their connection request, allowing mpirun
to deal with the message
as soon as possible.
Error
TODO This seems very out of date. We should have content about PMIx instant on.
This parameter can be included in the default MCA parameter file,
placed in the user’s environment, or added to the mpirun
command
line. See this FAQ entry
for more details on how to set MCA parameters.
13.7.5. I know my cluster’s configuration - how can I take advantage of that knowledge?
Clusters rarely change from day-to-day, and large clusters rarely change at all. If you know your cluster’s configuration, there are several steps you can take to both reduce Open MPI’s memory footprint and reduce the launch time of large-scale applications. These steps use a combination of build-time configuration options to eliminate components — thus eliminating their libraries and avoiding unnecessary component open/close operations — as well as run-time MCA parameters to specify what modules to use by default for most users.
One way to save memory is to avoid building components that will
actually never be selected by the system. Unless MCA parameters
specify which components to open, built components are always opened
and tested as to whether or not they should be selected for use. If
you know that a component can build on your system, but due to your
cluster’s configuration will never actually be selected, then it is
best to simply configure OMPI to not build that component by using the
--enable-mca-no-build
configure option.
For example, if you know that your system will only utilize the
ob1
component of the PML framework, then you can no_build
all
the others. This not only reduces memory in the libraries, but also
reduces memory footprint that is consumed by Open MPI opening all the
built components to see which of them can be selected to run.
In some cases, however, a user may optionally choose to use a
component other than the default. For example, you may want to build
all of the PRRTE routed
framework components, even though the vast
majority of users will simply use the default debruijn
component. This means you have to allow the system to build the other
components, even though they may rarely be used.
You can still save launch time and memory, though, by setting the
routed=debruijn
MCA parameter in the default MCA parameter file.
This causes OMPI to not open the other components during startup, but
allows users to override this on their command line or in their
environment so no functionality is lost — you just save some
memory and time.