Before we start

Update your repository using git fetch and git pull.

Environment Modules

Often there are situations where it is desirable to have more than one implementation or version of a software package installed on a cluster or computer. The Lmod Environment Modules package facilitates this by giving users a way to switch quickly from one version to another.

To see what modules are available:

module avail
To see what modules are currently loaded:
module list
To use the module openmpi you would load it with the command
module load openmpi
Later, if you wanted to use MPICH, a different implementation of MPI, you could load it with
module load mpich
This will replace the MPI module that is already loaded since only one MPI implementation can be active at a time.

You can type man module or module help for more information. Here is a list of the most common commands:

module avail                      display available modules
module list                       list currently loaded modules
module load <module name>         load the specified module
module unload <module name>       unload the specified module
module help                       display help for the module command
module help <module name>         display help for the specified module
module whatis <module name>       display a one-line description of the specified module

Setting Up Default Modules

Type module list to see what modules are currently loaded. Be sure that gcc/native is loaded; if it is not, then type module load gcc/native; module save.

Modules automatically handle some dependencies

Type module unload openmpi mpich; module load hdf5 to make sure that the HDF5 module is loaded but no MPI module is loaded. Confirm that this is the case by typing module list. The HDF5 module currently loaded is designed for use with single-threaded or multithreaded programs, but not MPI programs.

To confirm this, type module avail and notice that in the section headed by /shared/modulefiles/Compilers/gcc/native you will see (L,D) next to the HDF5 module. The L means the module is currently loaded and the D means it is the default HDF5 module to load if no module version is specified.

Now type module load openmpi and notice that you get a message saying that the HDF5 module has been reloaded. Type module avail and notice that now there is a new section headed by something like /shared/modulefiles/MPI/gcc/native/openmpi/4.0.2 and that it now contains the loaded, default HDF5 module.

You may want to once again save your modules so the OpenMPI module will be loaded by default, but be aware that you must unload this module before compiling a non-MPI program that uses HDF5.

Using MPI on the Minor Prophets

Our MPI implementations provide front-ends for the native compilers that make it easy to build MPI programs. Rather than using gcc or g++ to compile C or C++ programs, you should use mpicc or mpic++ to compile and link your programs.

When developing your program you will probably just want to run it on your workstation. Do this with the mpiexec command. Typical usage to run the program my_prog would be something like

mpiexec -n 4 ./my_prog
This runs the program in parallel using the available cores on the workstation.

There are a couple of ways you can run your program so that it uses more than one workstation. One is to specify which machines to run the application on. This gets cumbersome, however, when a cluster has a large number of machines. Instead, a cluster resource manager is usually used. In addition, it has an overall view of the resources available in a cluster and controls how parallel jobs are run so that programs that need it will have exclusive use of the necessary resources.

The Minor Prophets cluster uses SLURM (Simple Linux Utility for Resource Management) to manage the cluster's resources. To run my_prog on 16 cores one would type

salloc -Q -n 16 mpiexec ./my_prog
If you're using MPICH rather than OpenMPI you can use the salloc command as we did here, but you can also use the slightly simpler command
srun -n 16 ./my_prog
Both of these work when using MPICH on the Minor Prophets workstations, but you should use only the salloc command if using OpenMPI. Regardless of which MPI module is loaded, however, you can use srun to run non-MPI commands on cluster hosts.

Running an MPI job

Exercise:
  1. Change to the cps343-hoe/05-intro-to-mpi directory and use a directory listing to see the files there.

  2. Examine the source code in hello.cc. Notice it does three things common to nearly all MPI programs:

    1. initialize the MPI system with MPI_Init() and two other functions
    2. do something
    3. terminate the MPI system with MPI_Finalize()

    Compile hello.cc and run it:

    smake hello.cc
    salloc -Q -n 4 mpiexec ./hello
    salloc -Q -n 16 mpiexec ./hello
    salloc -Q -n 32 mpiexec ./hello
    
    You should see output that indicates 3, 15, and 31 helper processes are running, in addition to the original master (rank 0) process. Compare the output you see with the source code until you understand how the program works.

  3. Compile the pi_mpi.cc program and run it:

    smake pi_mpi.cc
    salloc -Q -n 4 mpiexec ./pi_mpi
    salloc -Q -n 16 mpiexec ./pi_mpi
    salloc -Q -n 32 mpiexec ./pi_mpi
    
    Run each of the commands several times and notice the variability in performance. In general, however, you should notice pretty close to linear speedup as the number of processes increases. How many processes can you start?

  4. Compile the pass-msg.cc program and run it:

    smake pass-msg.cc
    salloc -Q -n 4 mpiexec ./pass-msg
    salloc -Q -n 16 mpiexec ./pass-msg
    salloc -Q -n 32 mpiexec ./pass-msg
    
    You should see messages that indicate the message starts at the rank 0 process (the root or master process) and is passed from process to process until it is received by the highest rank process.

  5. Finally, you might be interested to note that you can use the salloc/mpiexec combination or the srun command to run non-MPI programs as well. Try

    salloc -Q -n 16 mpiexec hostname
    
    and
    srun -n 16 hostname
    
    Both of these commands run 16 instances of the hostname program on multiple machines in the cluster. This example is actually useful if you want to find out what nodes (computers in a cluster) SLURM is assigning jobs to.

SLURM: Our cluster resource manager

The commands srun and salloc are both part of SLURM (Simple Linux Utility for Resource Management). Jobs are usually run on clusters through a workload manager or resource manager, which is responsible for allocating the cluster's resources to jobs. Submitted jobs are placed on a queue and the workload manager (also called a job scheduler) assigns them to a processor or processors as they become available. At Gordon we use SLURM as the workload manager.

In addition to commands used to submit jobs, it is also easy to check on the status of jobs. The command

sinfo
will list all cluster nodes along with their state, usually one of idle, idle~, mix, or alloc. The state idle means the node is powered on and available and all of its CPU cores are currently free and can be allocated to a job. The state idle~ indicates the node is currently powered off but will be automatically started when it is needed. The states mix and alloc indicate that some or all of the node's cores are allocated to jobs.

The command

squeue
shows a list of jobs currently running on the cluster, including the job name, the user who submitted the job, the job's current running time, and the node(s) the job is running on. The command
sview
opens a window that provides the same information as squeue but with a graphical user interface.

Assignment: Passing messages around a ring

In each of the cases below run your program using different numbers of processes (up to 44).
  1. Create a simple ring-pass program called ring-pass1.cc.

    Using pass-msg.cc as starter code, write an MPI program called ring-pass1.cc that uses MPI_Send() and MPI_Recv() to pass a value around a ring of processes using the algorithm below. The three main variables are

    • message - an integer, initialized to 1000
    • prev - an integer, the rank of the preceding process in the ring
    • next - an integer, the rank of the succeeding process in the ring

        initialize MPI
        initialize message to 1000
        compute prev and next, the ranks of the two neighbors.
        if this is the server process:
            send message to next
            receive message from prev
            print message
        else:
            receive message from prev
            print process rank and received message
            increment message
            send message to next
        end if
        finalize MPI
    

    How does one determine prev and next? Suppose we're thinking about how process rank = i in a collection of N processes will behave. Let's define the ring so that prev = i − 1 and next = i + 1. This is fine for most processes, but will fail if the rank is 0 or N − 1. We can easily handle these special cases using modular arithmetic.

    • Computing next is easy; merely do next = (i + 1) % N. This will be i + 1 in all cases except when i = N − 1, in which case it will be 0.
    • To compute prev we first subtract 1, then add N, and compute the remainder when divided by N: prev = (i − 1 + N) % N. This will evaluate to i − 1 except when i = 0, in which case it becomes N − 1.

    Test your program with different numbers of processes.

  2. Create a second version called ring-pass2.cc that introduces timing.

    Make a copy of your program, calling it ring-pass2.cc, and modify it so that the master (server) task determines the elapsed time it takes to pass the message around the ring. The elapsed time should be displayed by the master process before it terminates. Use the MPI function MPI_Wtime() to get the timing information. Read the manual page (type man mpi_wtime) for more information.

  3. Create a third version called ring-pass3.cc that passes a message around the ring a specified number of times.

    Copy ring-pass2.cc to ring-pass3.cc and modify it so that it optionally reads an integer from the command line that specifies the number of times the message should be passed (and incremented) around the ring. The default value for this optional parameter should be 1. You will run your program with a command like

      $ mpiexec -n 8 ./ring-pass3 5
    
    This should pass the message around the ring until the master process receives it for the fifth time. Note that all processes will have access to the command line parameter and should make use of this to know when to stop. Try various ring sizes as you test your program.

Assignment

Complete all three versions of the ring-pass program and write a short report that presents your results. Submit your report along with your well-commented source code for the third ring-pass program.