12.3. Using Serial Debuggers to Debug Open MPI Applications

Because the GNU debugger (gdb) is nearly ubiquitous, it is common to use a serial debugger for debugging Open MPI applications. Parallel debuggers are generally better suited to the task, but gdb is free and therefore widely used.

There are two common ways to use serial debuggers.

12.3.1. Attach to Individual Running MPI processes

You can use multiple invocations of a serial debugger to attach individually to every application process, or to a subset of the application processes.

If you are using gdb, log onto each node where you want to debug application processes and invoke gdb once per process, passing the --pid option and the PID of the application process you want to attach to.

You can use a similar technique to attach to application processes in an application that was just invoked by mpirun(1).

An inelegant-but-functional technique commonly used with this method is to insert the following code in your application where you want to suspend the application process until the debugger is attached, such as just after MPI is initialized.

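/* Requires <stdio.h> and <unistd.h> to be included at the top of the file. */
{
    volatile int i = 0;   /* changed from the debugger to leave the loop */
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s ready for attach\n", (int) getpid(), hostname);
    fflush(stdout);
    /* Sleep until a debugger sets i to a nonzero value. */
    while (0 == i)
        sleep(5);
}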

This code prints a line to stdout giving the name of the host where the process is running and the PID to attach to. It then loops on the sleep(3) function indefinitely, waiting for you to attach with a debugger. Calling sleep(3) inside the loop means that the processor won’t be pegged at 100% while waiting for you to attach. The variable i is declared volatile so that the compiler re-reads it on every iteration instead of assuming that it never changes.

Once you attach with a debugger, go up the function stack until you are in this block of code (you’ll likely attach during the sleep(3)), then set the variable i to a nonzero value. With gdb, the syntax is:

(gdb) set var i = 7

Then set a breakpoint after your block of code and continue execution until the breakpoint is hit. Now you have control of your live MPI application and use of the full functionality of the debugger.

You can even add conditionals to only allow this “pause” in the application for specific MPI processes (e.g., MPI_COMM_WORLD rank 0, or whatever process is misbehaving).
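
The conditional pause described above might be sketched as follows. This is a minimal illustration, not code from Open MPI: the function names debug_should_pause and debug_wait_for_attach are hypothetical, as is the convention that a negative target rank means "pause every process". The caller is assumed to have already obtained its own rank, for example with MPI_Comm_rank() on MPI_COMM_WORLD.

```c
#define _POSIX_C_SOURCE 200112L   /* for gethostname() under strict C modes */

#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): returns 1 if the process
 * with rank my_rank should pause, 0 otherwise.  A negative target_rank
 * is taken to mean "pause every process". */
int debug_should_pause(int my_rank, int target_rank)
{
    assert(my_rank >= 0);   /* MPI ranks are never negative */
    return (target_rank < 0 || my_rank == target_rank) ? 1 : 0;
}

/* Hypothetical helper: the same wait loop as shown earlier, but entered
 * only by the selected process.  All other processes return at once. */
void debug_wait_for_attach(int my_rank, int target_rank)
{
    volatile int i = 0;   /* set to nonzero from the debugger to continue */
    char hostname[256];

    if (!debug_should_pause(my_rank, target_rank))
        return;

    if (gethostname(hostname, sizeof(hostname)) != 0)
        snprintf(hostname, sizeof(hostname), "unknown-host");
    hostname[sizeof(hostname) - 1] = '\0';   /* guard against truncation */

    printf("Rank %d: PID %d on %s ready for attach\n",
           my_rank, (int) getpid(), hostname);
    fflush(stdout);
    while (0 == i)
        sleep(5);
}
```

For example, calling debug_wait_for_attach(rank, 0) just after MPI is initialized, where rank was filled in by MPI_Comm_rank(MPI_COMM_WORLD, &rank), would pause only MPI_COMM_WORLD rank 0. Once attached, move up the stack into debug_wait_for_attach and set i to a nonzero value as described above.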

12.3.2. Use mpirun to Launch Separate Instances of Serial Debuggers

This technique launches a separate window for each MPI process in MPI_COMM_WORLD, each one running a serial debugger (such as gdb) that will launch and run your MPI application. Having a separate window for each MPI process can be quite handy for low process-count MPI jobs, but requires a bit of setup and configuration that is outside of Open MPI to work properly. A naive approach would be to assume that the following would immediately work:

shell$ mpirun -n 4 xterm -e gdb my_mpi_application

If running on a personal computer, this will probably work. You can also use tmpi to launch the debuggers in separate tmux panes instead of separate xterm windows, which has the advantage of synchronizing keyboard input between all debugger instances.

Unfortunately, the tmpi and xterm approaches likely won’t work on a computing cluster. Several factors must be considered:

  1. What launcher is Open MPI using? In an ssh-based environment, Open MPI will use ssh by default. But note that Open MPI closes the ssh sessions when the MPI job starts for scalability reasons. This means that the built-in SSH X forwarding tunnels will be shut down before the xterms can be launched. Although it is possible to force Open MPI to keep its SSH connections active (to keep the X tunneling available), we recommend using non-SSH-tunneled X connections, if possible (see below).

  2. In non-ssh environments (such as when using resource managers), the environment of the process invoking mpirun may be copied to all nodes. In this case, the DISPLAY environment variable may not be suitable.

  3. Some operating systems disable the X11 server from listening for remote/network traffic by default. For example, a post on the Open MPI users’ mailing list describes how to enable network access to the X11 server on Fedora Linux.

  4. There may be intermediate firewalls or other network blocks that prevent X traffic from flowing between the hosts where the MPI processes (and xterm) are running and the host connected to the output display.

The easiest way to get remote X applications (such as xterm) to display on your local screen is to forgo the security of SSH-tunneled X forwarding. In a closed environment such as an HPC cluster, this may be an acceptable practice (indeed, you may not even have the option of using SSH X forwarding if SSH logins to cluster nodes are disabled), but check with your security administrator to be sure.

If using non-encrypted X11 forwarding is permissible, we recommend the following:

  1. For each non-local host where you will be running an MPI process, add it to your X server’s permission list with the xhost command. For example:

    shell$ cat my_hostfile
    inky
    blinky
    stinky
    clyde
    shell$ for host in `cat my_hostfile` ; do xhost +$host ; done
    
  2. Use the -x option to mpirun to export an appropriate DISPLAY variable so that the launched X applications know where to send their output. An appropriate value is usually (but not always) the name of the host whose display should show the output, followed by the :0 (or :0.0) suffix. For example:

    shell$ hostname
    arcade.example.com
    shell$ mpirun -n 4 --hostfile my_hostfile \
        -x DISPLAY=arcade.example.com:0 xterm -e gdb my_mpi_application
    

    Warning

    X traffic is fairly “heavy” — if you are operating over a slow network connection, it may take some time before the xterm windows appear on your screen.

  3. If your xterm supports it, the -hold option may be useful. -hold tells xterm to stay open even when the application has completed. This means that if something goes wrong (e.g., gdb fails to execute, or unexpectedly dies, or …), the xterm window will stay open, allowing you to see what happened, instead of closing immediately and losing whatever error message may have been output.

  4. When you have finished, you may wish to disable X11 network permissions from the hosts that you were using. Use xhost again to disable these permissions:

    shell$ for host in `cat my_hostfile` ; do xhost -$host ; done
    

Note

mpirun will not complete until all the xterm instances have exited.