13.7. Large Clusters

Error

This section represents old content from the <= v4.x FAQ that has not been properly converted to the new-style documentation. The content here was perfunctorily converted to RST, but it still needs to be:

  1. Converted from a question-and-answer style to a regular documentation style (like the rest of these docs).

  2. Removed from this section and folded into other sections in these docs.

To be clear, this section will eventually be deleted; do not write any new content in this section.


13.7.1. How do I reduce startup time for jobs on large clusters?

There are several ways to reduce the startup time on large clusters. Some of them are described on this page. We continue to work on making startup even faster, especially on the large clusters coming in future years.

The Open MPI v5.0.x series is significantly faster and more robust than its predecessors. We recommend that anyone running large jobs and/or on large clusters upgrade to the v5.0.x series.

Several major launch time enhancements have been made starting with the v3.0 release. Most of these take place in the background — i.e., there is nothing you (as a user) need do to take advantage of them. However, there are a few that are left as options until we can assess any potential negative impacts on different applications.

Some options are available when launching via mpirun or when launching via the native resource manager launcher (e.g., srun in a Slurm environment). These are activated by setting the corresponding MCA parameter (see the example after this list), and include:

  • Setting the pmix_base_async_modex MCA parameter will eliminate a global out-of-band collective operation during MPI_INIT. This operation is performed in order to share endpoint information prior to communication. At scale, this operation can take some time and scales at best logarithmically. Setting the parameter bypasses the operation and causes the system to lookup the endpoint information for a peer only at first message. Thus, instead of collecting endpoint information for all processes, only the endpoint information for those processes this peer communicates with will be retrieved. The parameter is especially effective for applications with sparse communication patterns — i.e., where a process only communicates with a few other peers. Applications that use dense communication patterns (i.e., where a peer communicates directly to all other peers in the job) will probably see a negative impact of this option.

    Note

    This option is only available in PMIx-supporting environments, or when launching via mpirun.

  • The async_mpi_init parameter is automatically set to true when the pmix_base_async_modex parameter has been set, but can also be controlled independently. When set to true, this parameter causes MPI_INIT to skip an out-of-band barrier operation at the end of the procedure; that barrier is not required when direct retrieval of endpoint information is being used.

  • Similarly, the async_mpi_finalize parameter skips an out-of-band barrier operation usually performed at the beginning of MPI_FINALIZE. Some transports (e.g., the usnic BTL) require this barrier to ensure that all MPI messages are completed prior to finalizing, while other transports handle this internally and thus do not require the additional barrier. Check with your transport provider to be sure, or you can experiment to determine the proper setting.
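
For example, a job with a sparse communication pattern might enable all three of these options on the mpirun command line. This is a sketch only: the parameter names are those described above (their availability depends on your Open MPI version), the process count and application name are placeholders, and whether skipping the MPI_FINALIZE barrier is safe depends on your transport, as noted above.

  shell$ mpirun --mca pmix_base_async_modex 1 \
                --mca async_mpi_init 1 \
                --mca async_mpi_finalize 1 \
                -n 1024 ./my_sparse_app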


13.7.2. Where should I put my libraries: Network vs. local filesystems?

Open MPI itself doesn’t really care where its libraries and plugins are stored. However, where they are stored does have an impact on startup times, particularly for large clusters, which can be mitigated somewhat through use of Open MPI’s configuration options.

Startup times will always be minimized by storing the libraries and plugins local to each node, either on local disk or in ramdisk. The latter is sometimes problematic since the libraries do consume some space, thus potentially reducing memory that would have been available for MPI processes.

There are two main considerations for large clusters that need to place the Open MPI libraries on networked file systems:

  • While dynamic shared objects (“DSOs”) are more flexible, you definitely do not want to use them when the Open MPI libraries will be mounted on a network file system! Doing so will lead to significant network traffic and delayed start times, especially on clusters with a large number of nodes. Instead, be sure to configure your build with --disable-dlopen (see the example after this list). This builds the component code directly into the main libraries, resulting in much faster startup times.

  • Many networked file systems use automount for user level directories, as well as for some locally administered system directories. There are many reasons why system administrators may choose to automount such directories. MPI jobs, however, tend to launch very quickly, thereby creating a situation wherein a large number of nodes will nearly simultaneously demand automount of a specific directory. This can overload NFS servers, resulting in delayed response or even failed automount requests.

    Note that this applies both to automount of directories containing the Open MPI libraries and to automount of directories containing user applications. Since these are unlikely to be in the same location, multiple automount requests from each node are possible, thus increasing the level of traffic.
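
For example, a build intended for installation on a networked file system might be configured as follows. This is a sketch only; the installation prefix is a placeholder, and you would add whatever other configure options your site normally uses.

  shell$ ./configure --prefix=/opt/openmpi --disable-dlopen
  shell$ make -j 8 all
  shell$ make install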


13.7.3. Static vs. shared libraries?

It is perfectly fine to use either shared or static libraries. Shared libraries will save memory when running multiple processes per node, especially on clusters with high core counts per node, but they can also take longer to launch from networked file systems. A sample static-build configuration is shown after the note below.

Note

Be sure to also see the section above on network vs. local filesystems for suggestions on how to mitigate such problems.
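
If you do choose static libraries, for example to avoid loading many shared objects over a networked file system at launch time, the build can be configured roughly as follows. This is a sketch under typical assumptions; the prefix is a placeholder, and some plugins or transports may impose their own requirements on static builds.

  shell$ ./configure --prefix=/opt/openmpi --enable-static --disable-shared
  shell$ make -j 8 all
  shell$ make install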


13.7.4. How do I reduce the time to wireup OMPI’s out-of-band communication system?

Open MPI’s run-time uses an out-of-band (OOB) communication subsystem to pass messages during the launch, initialization, and termination stages for the job. These messages allow mpirun to tell its daemons what processes to launch, and allow the daemons in turn to forward stdio to mpirun, update mpirun on process status, etc.

The OOB uses TCP sockets for its communication, with each daemon opening a socket back to mpirun upon startup. In a large cluster, this can mean thousands of connections being formed on the node where mpirun resides, and requires that mpirun actually process all of these connection requests. mpirun defaults to processing connection requests sequentially, so on large clusters a backlog can build up that causes remote daemons to time out while waiting for a response.

Fortunately, Open MPI provides an alternative mechanism for processing connection requests that helps alleviate this problem. Setting the MCA parameter oob_tcp_listen_mode to listen_thread causes mpirun to start up a separate thread dedicated to responding to connection requests. Thus, remote daemons receive a quick response to their connection request, allowing mpirun to deal with the message as soon as possible.

Error

TODO This seems very out of date. We should have content about PMIx instant on.

This parameter can be included in the default MCA parameter file, placed in the user’s environment, or added to the mpirun command line, as shown below. See the documentation on setting MCA parameters for more details.
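
As a sketch of those three mechanisms (the file path, process count, and application name are placeholders, and, per the note above, this particular parameter may not exist in current releases):

  # In the default MCA parameter file, e.g. $prefix/etc/openmpi-mca-params.conf
  oob_tcp_listen_mode = listen_thread

  # In the user's environment
  shell$ export OMPI_MCA_oob_tcp_listen_mode=listen_thread

  # On the mpirun command line
  shell$ mpirun --mca oob_tcp_listen_mode listen_thread -n 256 ./my_mpi_app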


13.7.5. I know my cluster’s configuration - how can I take advantage of that knowledge?

Clusters rarely change from day-to-day, and large clusters rarely change at all. If you know your cluster’s configuration, there are several steps you can take to both reduce Open MPI’s memory footprint and reduce the launch time of large-scale applications. These steps use a combination of build-time configuration options to eliminate components — thus eliminating their libraries and avoiding unnecessary component open/close operations — as well as run-time MCA parameters to specify what modules to use by default for most users.

One way to save memory is to avoid building components that will never actually be selected by the system. Unless MCA parameters specify which components to open, every built component is opened and tested as to whether or not it should be selected for use. If you know that a component can build on your system, but due to your cluster’s configuration will never actually be selected, then it is best to simply configure Open MPI not to build that component by using the --enable-mca-no-build configure option.

For example, if you know that your system will only use the ob1 component of the PML framework, then you can exclude all of the others from the build. This not only reduces the size of the libraries, but also reduces the memory footprint consumed when Open MPI opens every built component to see which of them can be selected to run.
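
A minimal sketch of such a configuration, assuming (purely for illustration) that the cm and ucx PML components would never be selected on your cluster and can therefore be excluded:

  shell$ ./configure --prefix=/opt/openmpi --enable-mca-no-build=pml-cm,pml-ucx
  shell$ make -j 8 all
  shell$ make install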

In some cases, however, a user may optionally choose to use a component other than the default. For example, you may want to build all of the PRRTE routed framework components, even though the vast majority of users will simply use the default debruijn component. This means you have to allow the system to build the other components, even though they may rarely be used.

You can still save launch time and memory, though, by setting the routed=debruijn MCA parameter in the default MCA parameter file, as sketched below. This causes Open MPI to skip opening the other components during startup, but allows users to override the default on their command line or in their environment, so no functionality is lost; you just save some memory and time.
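
A sketch of that default plus a user-level override. The parameter file location, the alternative component name, the process count, and the application name are all illustrative; in current releases the routed framework belongs to PRRTE, so the exact file and override mechanism may differ on your installation.

  # System-wide default in the default MCA parameter file
  routed = debruijn

  # A user who needs a different routed component can still override it,
  # e.g. on the mpirun command line:
  shell$ mpirun --mca routed radix -n 64 ./my_mpi_app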