Most traditional algorithms have been developed on the assumption that they will be executed by a single process. With the growing availability of multiprocessors, it is important that we be able to modify these algorithms to take advantage of the parallelism available.
Example: sorting. (It was estimated that at one point in time 50% of all CPU power was used for sorting.)
Many algorithms are known that do O(n log n) comparisons; but it can be proven that no algorithm based on binary comparison can do better.
Time behavior better than O(n log n) can be achieved using concurrency, by doing several comparisons at the same time on different processors.
Consider the merge sort. Recall that its basic algorithm is:
1. Split the original file of n records into two subfiles of n/2 records.
2. Sort each subfile (recursively, by the same method).
3. Merge the two sorted subfiles.
For example, with n = 8:

a) original file: 8 3 9 5 2 1 4 7
b) after sorting each half: 3 5 8 9 and 1 2 4 7
c) after merging: 1 2 3 4 5 7 8 9
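To make the three steps concrete, here is a minimal sequential sketch in Python (the function names are our own choices, not from the original notes):

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge_sort(records):
    """Split the file in half, sort each half, then merge."""
    if len(records) <= 1:
        return records
    mid = len(records) // 2
    return merge(merge_sort(records[:mid]), merge_sort(records[mid:]))

print(merge_sort([8, 3, 9, 5, 2, 1, 4, 7]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```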
Time complexity is O(n log n).
Proof: observe that the algorithm is recursive. Initially, we sort one file of n records; but to do this we sort two files of (n/2) records, which requires sorting four files of (n/4) records. This can be pictured as a tree:
```
                    sort n records
                   /              \
          sort n/2                  sort n/2
         /        \                /        \
   sort n/4    sort n/4      sort n/4    sort n/4
```
The following table shows the number of comparisons necessary at each level of the tree. Assume for simplicity that n is a power of 2.
| Level | Number of comparisons | Total |
|---|---|---|
| 1 | n/(n/1) * (n/1 - 1) = 1 * (n - 1) | n - 1 |
| 2 | n/(n/2) * (n/2 - 1) = 2 * (n/2 - 1) | n - 2 |
| 3 | n/(n/4) * (n/4 - 1) = 4 * (n/4 - 1) | n - 4 |
| ... | ... | ... |
| log n - 2 | n/8 * (n/(n/8) - 1) = n/8 * (8 - 1) | n - n/8 |
| log n - 1 | n/4 * (n/(n/4) - 1) = n/4 * (4 - 1) | n - n/4 |
| log n | n/2 * (n/(n/2) - 1) = n/2 * (2 - 1) | n - n/2 |
In this sum there are log n terms, so the total number of comparisons is

(n - 1) + (n - 2) + (n - 4) + ... + (n - n/2) = n log n - (1 + 2 + 4 + ... + n/2) = n log n - (n - 1)

The sum subtracted on the right is only n - 1, which is less than O(n log n), so the entire sum goes as O(n log n).
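A quick numeric check of this formula (our own arithmetic, using the worst-case counts from the table, with n = 8):

```python
from math import log2

n = 8
levels = int(log2(n))                      # log n = 3
# Worst-case comparisons at level i (from the table): n - 2**(i - 1)
per_level = [n - 2 ** (i - 1) for i in range(1, levels + 1)]
print(per_level)                           # [7, 6, 4]
print(sum(per_level))                      # 17
print(n * levels - (n - 1))                # n log n - (n - 1) = 24 - 7 = 17
```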
Now suppose we had unlimited processors. Then at level 2 we could do the two sort-n/2 operations in parallel, at level 3 the four sort-n/4 operations, and so on. If k is a multiplicative constant, then our time would now be:
| Level | Split-merge time |
|---|---|
| level 1 | kn |
| level 2 | kn/2 |
| level 3 | kn/4 |
| ... | ... |
| bottom level | kn/n |

Total = kn(1 + 1/2 + 1/4 + 1/8 + ... + 1/n) = k(n + n/2 + n/4 + ... + 1) = k(2n - 1) = O(n)
Thus, we have transformed a sequential algorithm requiring O(n log n) time into a parallel algorithm requiring O(n) time. Of course, the price tag is that we have moved from using a single processor to n/2 processors.
Suppose - more realistically - that we had m processors (m << n). Then we reach the limit of parallelism at a level where we have to do m sorts of size n/m. Each of these is now done by a standard merge sort on a single processor, so it takes O((n/m) log(n/m)) time. The higher levels take O(n(1 + 1/2 + 1/4 + ... + 1/m)) = O(n) time. The total time is therefore O(n) + O((n/m) log(n/m)). If log(n/m) < m, then the dominant term is O(n), so we can achieve performance close to that above without needing O(n) processors.
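The following Python sketch illustrates this m-processor strategy under our own simplifying assumptions: each of m worker processes sorts one chunk of about n/m records sequentially, and the sorted runs are then merged. It is a sketch of the idea, not a tuned implementation (process start-up and data-copying costs are ignored):

```python
from concurrent.futures import ProcessPoolExecutor
from heapq import merge as merge_runs

def sort_chunk(chunk):
    """The sequential O((n/m) log(n/m)) sort done on one processor."""
    return sorted(chunk)

def parallel_sort(records, m=4):
    """Split into m chunks, sort the chunks in parallel, merge the runs."""
    n = len(records)
    size = max(1, -(-n // m))                  # ceil(n / m)
    chunks = [records[i:i + size] for i in range(0, n, size)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        runs = list(pool.map(sort_chunk, chunks))
    # The higher levels of the tree: merge the m sorted runs.
    return list(merge_runs(*runs))

if __name__ == "__main__":
    print(parallel_sort([8, 3, 9, 5, 2, 1, 4, 7], m=2))
```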
Another example: searching (e.g., processing queries against a database).
If we have to examine all the records in a database (or some fixed fraction of the number of records) then the time is clearly O(n). For large databases, this can be prohibitive, especially for online queries.
However, if we can put multiple processors on the task, the complexity remains O(n), but the constant of proportionality is reduced. For example, a query that requires 1 hour - making it prohibitive for an interactive user - would require an acceptable 6 minutes with 10 processors.
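A sketch of this kind of parallel scan in Python; the record layout and the query predicate here are invented for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

def matches(record):
    """Hypothetical query predicate: balance over 1000."""
    return record["balance"] > 1000

def scan_chunk(chunk):
    """Each processor linearly scans its own n/m records."""
    return [r for r in chunk if matches(r)]

def parallel_query(database, m=4):
    """Total work is still O(n); elapsed time is roughly n/m."""
    n = len(database)
    size = max(1, -(-n // m))                  # ceil(n / m)
    chunks = [database[i:i + size] for i in range(0, n, size)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        parts = pool.map(scan_chunk, chunks)
    return [r for part in parts for r in part]

if __name__ == "__main__":
    db = [{"id": i, "balance": 100 * i} for i in range(50)]
    print(len(parallel_query(db, m=4)))        # 39 records match
```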
One of the most important areas of research in concurrent algorithms has been numerical linear algebra. Many (if not most) scientific problems that can be solved using a computer involve solving linear systems.
Two of the most important operations on vectors of data are the linear combination of two vectors (also called a SAXPY operation due to the name of the FORTRAN subroutine in the LINPACK library that performed this operation) and the inner (or dot) product of two vectors.
The linear combination operation is
```
// do a linear combination (SAXPY)
for i := 1 to n do
    c[i] := a[i] + s * b[i];
```

The loop consists of n iterations. Suppose that we want to perform this operation in parallel on m processors. We can give n/m iterations to each processor, and these operations can be done completely in parallel. (Note, however, that some computers like to have long vectors, so splitting them up for separate processors is not always the best thing to do.)
The linear combination has an operation count of 2n (n multiplications and n additions) but, as in the case of searching, the overall time-complexity on m processors is reduced by a factor of m, to 2n/m.
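A minimal Python sketch of the chunked SAXPY (the chunking scheme and names are ours; a real implementation would more likely use a vector library):

```python
from concurrent.futures import ProcessPoolExecutor

def saxpy_chunk(args):
    """Compute c[i] = a[i] + s * b[i] for one slice of the vectors."""
    a, b, s = args
    return [ai + s * bi for ai, bi in zip(a, b)]

def parallel_saxpy(a, b, s, m=4):
    """Give n/m of the n independent iterations to each processor."""
    n = len(a)
    size = max(1, -(-n // m))                  # ceil(n / m)
    tasks = [(a[i:i + size], b[i:i + size], s) for i in range(0, n, size)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        parts = pool.map(saxpy_chunk, tasks)
    return [x for part in parts for x in part]

if __name__ == "__main__":
    a, b = [1.0] * 8, [2.0] * 8
    print(parallel_saxpy(a, b, 3.0, m=2))      # eight copies of 7.0
```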
The inner product operation is different. It looks like
```
// do an inner (dot) product
sum := 0;
for i := 1 to n do
    sum := sum + a[i] * b[i];
```

It is not quite so simple to parallelize this code. We can easily give n/m of the loop's iterations to each of the m processors, and in this way compute m partial sums. However, we still need to combine these into a single sum. This combining process is called a fan-in and requires log m steps on m processors, so the overall time-complexity on m processors is 2n/m + log m. Notice that as m increases the first term becomes smaller but the second term grows larger.
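A sketch of the partial-sum-plus-fan-in scheme in Python. Here the fan-in rounds are simulated sequentially; on a real machine each round's additions would themselves run in parallel, which is where the log m term comes from:

```python
from concurrent.futures import ProcessPoolExecutor

def partial_dot(args):
    """Each processor forms a partial sum over its n/m iterations."""
    a, b = args
    return sum(x * y for x, y in zip(a, b))

def fan_in(values):
    """Pairwise combine: ceil(log2 m) rounds reduce m sums to one."""
    while len(values) > 1:
        paired = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:                    # odd value carries over
            paired.append(values[-1])
        values = paired
    return values[0]

def parallel_dot(a, b, m=4):
    n = len(a)
    size = max(1, -(-n // m))                  # ceil(n / m)
    tasks = [(a[i:i + size], b[i:i + size]) for i in range(0, n, size)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        partials = list(pool.map(partial_dot, tasks))
    return fan_in(partials)                    # the log m stage

if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [8, 7, 6, 5, 4, 3, 2, 1]
    print(parallel_dot(a, b, m=4))             # 120
```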
While potential parallelism in some algorithms is easily discovered, in others it is not. There is much research to be done in this area.
We can formulate some general rules to help us locate portions of a sequential program that can be done in parallel.
Given: a sequential program S1; S2; S3; S4 ... Sn.
The Sk may be compound statements, procedure calls, or what have you. If the program contains loops, we may wish to unwind them first; e.g.,

for i := 1 to n do Si

becomes S1; S2; ... Sn.
We wish to determine whether any given pair of statements Si and Sj can be done in parallel. It turns out that, for i < j, Si and Sj can be executed in parallel if all three of the following conditions hold (here ^ denotes set intersection and {} the empty set):
1. { variables written by Si } ^ { variables read by Sj } = {}

   Rationale: in the sequential form, Sj sees the values of all variables written by Si; but in the parallel form, Sj might read a value before Si writes it.

2. { variables read by Si } ^ { variables written by Sj } = {}

   Rationale: if Si and Sj are done in parallel, and Sj changes a variable before Si gets to read it, then the result of Si will differ from the sequential case.

3. { variables written by Si } ^ { variables written by Sj } = {}

   Rationale: in the sequential form, the value written by Sj replaces the value written by Si; but in the parallel form, Si might write the variable after Sj does, leaving the wrong final value.
Example:

```
S1: b := d;
S2: a := b + c;
S3: d := e + f;
S4: e := a;
S5: f := b;
S6: g := e + f;
```

| Statement | Read set R(Si) | Write set W(Si) |
|---|---|---|
| S1 | {d} | {b} |
| S2 | {b,c} | {a} |
| S3 | {e,f} | {d} |
| S4 | {a} | {e} |
| S5 | {b} | {f} |
| S6 | {e,f} | {g} |

Applying the conditions to each pair of statements:

S1 must precede S2 since W(S1) ^ R(S2) = {b}
S1 must precede S3 since R(S1) ^ W(S3) = {d}
S1 must precede S5 since W(S1) ^ R(S5) = {b}
S2 must precede S4 since W(S2) ^ R(S4) = {a}
S3 must precede S4 since R(S3) ^ W(S4) = {e}
S3 must precede S5 since R(S3) ^ W(S5) = {f}
S4 must precede S6 since W(S4) ^ R(S6) = {e}
S5 must precede S6 since W(S5) ^ R(S6) = {f}

These constraints yield the following precedence graph (the arc from S1 to S5 is not needed, since S1 precedes S3 and S3 precedes S5):

```
         S1
        /  \
      S2    S3
        \  /  \
         S4    S5
           \   /
            S6
```

S2 can be done in parallel with S3 and also with S5; S4 can be done in parallel with S5.
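These three conditions are easy to mechanize. The sketch below (in Python, with the read/write sets transcribed from the example) prints exactly the precedence constraints found above; note that the test is pairwise, so pairs it leaves unconstrained may still be ordered transitively:

```python
def must_precede(R_i, W_i, R_j, W_j):
    """Si must precede Sj if any of the three intersections is non-empty."""
    return bool(W_i & R_j) or bool(R_i & W_j) or bool(W_i & W_j)

# Read and write sets from the example program above.
R = {1: {"d"}, 2: {"b", "c"}, 3: {"e", "f"}, 4: {"a"}, 5: {"b"}, 6: {"e", "f"}}
W = {1: {"b"}, 2: {"a"}, 3: {"d"}, 4: {"e"}, 5: {"f"}, 6: {"g"}}

for i in R:
    for j in R:
        if i < j and must_precede(R[i], W[i], R[j], W[j]):
            print(f"S{i} must precede S{j}")
# Prints the eight constraints derived above (including the redundant
# S1-before-S5 arc). Unprinted pairs, such as S2 with S3 or S4 with S5,
# are candidates for parallel execution.
```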
These notes were written by Prof. R. Bjork of Gordon College. In February 1998 they were edited, converted into HTML and augmented by J. Senning of Gordon College.