Overview and purpose

This exercise is designed to introduce you to the most common computer system environment used in high performance computing: a Unix or Linux (sometimes abbreviated as *nix) operating system and a language like C, C++, or Fortran. There are, of course, other environments, but in the HPC world knowledge of this environment is assumed. We'll be doing this in the context of beginning to explore matrix-matrix multiplication.

Many of you already have some experience with Linux and C++ through previous course work. Our primary development language will be C or C++. For this course we will rarely be explicitly using the object-oriented features of C++, but you are certainly welcome to use them as you design and write your programs.

If you do any work in HPC, from time to time you may find it necessary to read Fortran code, or even call functions written in Fortran. This exercise will introduce you to some of the basic features of Fortran so you'll be able to do this.

We'll be using the Git distributed source management system. This exercise will introduce you to Git and how we'll be using it this semester.

Prior to starting this exercise you should have read Chapters 19–21 of Introduction to High-Performance Scientific Computing.

Assignment

Work through the following steps, taking notes as necessary. Organize and write up your notes, including observed timings and FLOP rates, in a one-to-three page document.

Estimating π: simple serial and parallel programs

  1. Log in to a workstation and start a terminal shell by pressing Ctrl-Alt-T.

  2. In this exercise we'll need to compile a program that uses MPI (Message Passing Interface), an API for cluster computing. To do this you'll need to do the following:

    module load gcc/native
    module load openmpi
    module save
    

    The first two lines set your environment to use the native (system) GCC compiler and a particular MPI implementation. The last line saves this collection of modules so they are loaded automatically each time you log in.
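
    If you want to check which modules are currently active, the module list command will display them:

    module list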

  3. Make a folder to use as a workspace for this course:

    mkdir cps343
    cd cps343
    
  4. This semester we'll use private Git repositories hosted on GitHub. You can clone the hands-on exercises repository with the following command:

    git clone https://github.com/gordon-cs/cps343-hoe
    

    You'll see some output text indicating that the repository is being copied. Now you should have a new subfolder named cps343-hoe containing files for at least the first two hands-on exercises.

  5. Change into the cps343-hoe/00-calculate-pi directory and list the files there:

    cd cps343-hoe/00-calculate-pi
    ls -l
    

    These programs all compute approximations to π using the midpoint integration rule with a rather large number (400,000,000) of subintervals. (There are much more efficient ways to do this; here we're interested in generating a lot of computations with a simple program.)
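
    For orientation, the sketch below shows what a minimal midpoint-rule computation of π looks like. It assumes the classic integrand 4/(1+x^2) on [0,1], whose integral is π; the details (and the timing code) in pi_serial.cc may differ.

    // Hedged sketch: midpoint rule for pi with n subintervals,
    // assuming pi = integral from 0 to 1 of 4/(1+x^2) dx.
    #include <cstdio>

    int main()
    {
        const long n = 400000000L;       // number of subintervals
        const double h = 1.0 / n;        // width of each subinterval
        double sum = 0.0;

        for (long i = 0; i < n; i++)
        {
            double x = (i + 0.5) * h;    // midpoint of subinterval i
            sum += 4.0 / (1.0 + x * x);  // evaluate the integrand there
        }

        std::printf("pi is approximately %.15f\n", h * sum);
        return 0;
    }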

  6. A makefile is a file that describes how to construct one or more targets by specifying their dependencies and the rules used to build the targets from the files they depend on. Looking at the example below you'll see that comments start with a hash mark (#) and continue to the end of the line. Variables are declared and assigned; by convention, variable names are written in uppercase letters. The key structure in a makefile is

    target: dependency-list
    <tab>rule(s)
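
    For example, a hypothetical two-line makefile that builds an executable hello from a source file hello.c (the file names here are invented for illustration) would be

    hello: hello.c
    	gcc -o hello hello.c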

    Examine the file Makefile. Its contents should be similar to the following:

    # Makefile for serial and parallel pi computation examples
    #
    # Jonathan Senning 
    # Department of Mathematics and Computer Science
    # Gordon College, 255 Grapevine Road, Wenham MA 01984-1899
    #
    # This file is released into the public domain.
    
    # Define variables
    CC	= gcc
    CFLAGS	= -Wall -O3 -funroll-loops
    CXX	= g++
    CXXFLAGS= $(CFLAGS)
    MPICC	= mpicc
    MPICXX	= mpic++
    OMP_FLAG= -fopenmp
    
    # List of sources and files to build from them
    SRCS	= pi_serial.cc pi_omp.cc pi_omp_dyn.cc pi_mpi.cc
    BINS	= $(SRCS:.cc=) # replaces .cc extension with empty string
    
    # First (default) target
    all: $(BINS)
    
    # alternate targets used to clean up directory
    # (pi_serial.o is created by PGI compiler)
    clean:
    	$(RM) pi_serial.o wtime.o
    
    clobber:
    	$(MAKE) clean
    	$(RM) $(BINS)
    
    # Explicit dependencies and adjustments to variables
    pi_serial:	pi_serial.cc wtime.o
    
    pi_omp:		CXXFLAGS += $(OMP_FLAG)
    pi_omp:		pi_omp.cc
    
    pi_omp_dyn:	CXXFLAGS += $(OMP_FLAG)
    pi_omp_dyn:	pi_omp_dyn.cc
    
    pi_mpi:		CXX = $(MPICXX)
    pi_mpi:		pi_mpi.cc
    
    wtime.o:	wtime.c wtime.h
    

    This Makefile defines the variables CC, CFLAGS, CXX, and CXXFLAGS, all of which are used by default rules built into make. Other variables are defined as well; you will see them used further down in the file.

    The first target is all, which is a “phony” target (it does not correspond to a file that gets built; it exists only so that several targets can be listed as its dependencies). In this case the variable BINS (short for “binaries”) will contain

    pi_serial pi_omp pi_omp_dyn pi_mpi
    

    so for “all” to be successfully built, each of these executable (binary) files must exist.

    Next comes a pair of phony targets: clean and clobber. Each of these targets is followed by one or more rules that remove certain files (RM is predefined to be rm -f, the “force remove file” command). A common convention is that clean is used to remove object files and other temporary files while clobber is used to remove all files that can be built from the sources.

    The makefile ends with a list of targets. Some modify variables that will impact the building of the corresponding target while others specify dependencies. Predefined build rules are used to construct these targets.

    NOTE: One “gotcha” to be aware of when writing makefiles: the first character in a rule for a target must be a TAB; other types of whitespace will not work.

  7. Make sure we start with a clean slate:

    make clobber
    
  8. Compile the programs by typing

    make
    
  9. Type

    ./pi_serial
    

    to run the sequential version of the program. The output consists of an estimate of π, the time (in wall-clock seconds) it took to compute the estimate, and the computation rate in gigaflops. Use an editor or a viewer (e.g. less or more) to examine the source code. Something like

    atom pi_serial.cc
    

    should work nicely.

  10. Compare the sequential version to the multithreaded OpenMP version pi_omp.cc:

    meld pi_serial.cc pi_omp.cc
    

    There are only three non-comment differences between them:

    1. a changed #include statement
    2. a new #pragma statement
    3. a different timing function
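
    In code, those three changes typically look something like the following sketch (again assuming the 4/(1+x^2) integrand; the exact pragma clauses and variable names in pi_omp.cc may differ):

    // Hedged sketch of an OpenMP version of the midpoint-rule loop.
    #include <cstdio>
    #include <omp.h>                               // (1) the changed #include

    int main()
    {
        const long n = 400000000L;
        const double h = 1.0 / n;
        double sum = 0.0;

        double t0 = omp_get_wtime();               // (3) OpenMP wall-clock timer
        #pragma omp parallel for reduction(+:sum)  // (2) the new #pragma
        for (long i = 0; i < n; i++)
        {
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        double t1 = omp_get_wtime();

        std::printf("pi ~= %.15f, computed in %f seconds\n", h * sum, t1 - t0);
        return 0;
    }

    A program like this must be compiled with the -fopenmp flag, which the Makefile adds for you via OMP_FLAG.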

    Exit meld and run the new version of the program with

    ./pi_omp
    

    and notice how much faster it runs than the previous version. A big improvement from a single compiler directive!

  11. Repeat the last step with pi_omp_dyn.cc. In this program the compiler directives show up in the main program. Both of these programs use OpenMP to automatically create parallel threads that can take advantage of the multiple cores on the workstations.

  12. To run the MPI version of the program on a workstation use the command

    mpiexec -n 4 ./pi_mpi
    

    This tells the MPI system to start 4 processes. This should take about the same amount of time as the OpenMP versions of the program.
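
    For reference, the sketch below shows one common way an MPI program splits this kind of loop across processes and combines the partial sums with MPI_Reduce; the decomposition and variable names actually used in pi_mpi.cc may differ.

    // Hedged sketch of an MPI version of the midpoint-rule computation.
    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id (0..size-1)
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

        const long n = 400000000L;
        const double h = 1.0 / n;
        double partial = 0.0;

        // Each process handles every size-th subinterval, starting at its rank.
        for (long i = rank; i < n; i += size)
        {
            double x = (i + 0.5) * h;
            partial += 4.0 / (1.0 + x * x);
        }

        // Sum the partial results onto process 0, which reports the answer.
        double sum = 0.0;
        MPI_Reduce(&partial, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("pi ~= %.15f\n", h * sum);

        MPI_Finalize();
        return 0;
    }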

  13. Rerun the program with 1 process (i.e. mpiexec -n 1 ./pi_mpi); this should give you about the same time as the serial/sequential version of the program. Next try increasing the number of processes to 8 and then 16; you will need to add --oversubscribe to the command line:

    mpiexec -n 8 --oversubscribe ./pi_mpi
    mpiexec -n 16 --oversubscribe ./pi_mpi
    

    What do you notice? Why do you think that is?

Running the parallel code on multiple workstations

To use more processor cores than are available on a single workstation you'll need to run on additional computers. To do this we'll use a cluster resource management tool called SLURM. Try each of the following commands and discuss with a classmate what conclusions you can draw. (The hostname command displays the name of the host computer on which the command runs):

salloc -Q -n 2 mpiexec hostname
salloc -Q -n 4 mpiexec hostname
salloc -Q -n 8 mpiexec hostname
salloc -Q -n 32 mpiexec hostname

There are other options to salloc that we'll use later. For now, type the following at the command prompt to see how the program runs with up to 32 processes:

for ((n=1;n<=32;n++)); do salloc -Q -n $n mpiexec ./pi_mpi; done

You can make the output a bit more useful using pipelines. The following will report the number of processors used and the GFLOP/s rate:

for ((n=1;n<=32;n++)); do echo "n = $n gives $(salloc -Q -n $n mpiexec ./pi_mpi | awk '{print $10}') GFLOPS"; done

Exercise: Modify the shell code just given so that each line contains only the number of processes and the GFLOP/s rate. This would be useful if you wanted to save the data for graphing or further analysis.

Matrix-Matrix Multiplication

Now we'll turn our attention to matrix-matrix multiplication. Change directory to ../01-matrix-mult and see what files are present:

cd ../01-matrix-mult
ls -l

  1. Examine the C programs matmat_c1.c, matmat_c2.c, and matmat_c3.c and the Fortran programs matmat_f77.f and matmat_f95.f. Feel free to use meld or diff to help spot the differences. All the programs do the same things:

    1. initialize matrices A and B,
    2. compute C1=AB using the ijk-form of matrix-matrix multiplication,
    3. compute C2=AB using the jki-form of matrix-matrix multiplication,
    4. compute C3=AB using the ikj-form of matrix-matrix multiplication, and
    5. check that C1=C2, C1=C3, and C2=C3.

    Spend a few minutes looking at the Fortran programs. Notice that the Fortran 77 code uses numeric statement labels to close its looping constructions while the Fortran 95 example does not; this was a big improvement!
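
    As a point of reference, the ijk form of C=AB is the familiar triple loop sketched below; the array names, element types, and storage layout used in matmat_c1.c may differ (two-dimensional indexing is assumed here).

    // Hedged sketch of the ijk form of matrix-matrix multiplication
    // for n x n matrices addressed with two-dimensional indexing.
    void matmat_ijk(int n, double** A, double** B, double** C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i][k] * B[k][j];  // dot product of row i of A and column j of B
                C[i][j] = sum;
            }
    }

    The jki and ikj forms simply reorder these three loops; the orderings differ in how they step through memory, which is what produces the timing differences you'll be observing.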

  2. Type make to compile the programs and then run them. They all do the same work. For a given program, notice which performs best: the ijk version, the jki version, or the ikj version. Which runs faster, the C code or the Fortran code?

  3. Edit matmat_ijk.c and write the bodies of the four functions that implement the four remaining loop orderings: ikj, jik, kij, and kji forms of the product.

What to turn in

Using your notes, write a one-to-three page report that presents your observed timings and FLOP rates along with your observations about the runs. Include what seems important to you, but be sure to address at least these questions: Which parallel program and which ijk ordering performed the best? Did others observe similar behavior? What conclusions can you draw?

This report is due at the start of the next class period.