Most traditional algorithms have been developed on the assumption that they will be executed by a single process. With the growing availability of multiprocessors, it is important that we be able to modify these algorithms to take advantage of the parallelism available.
Example: sorting. (It was estimated that at one point in time 50% of all CPU power was used for sorting.)
Many algorithms are known that do O(n log n) comparisons; but it can be proven that no algorithm based on binary comparison can do better.
Time behavior better than O(n log n) can be achieved using concurrency, by doing several comparisons at the same time on different processors.
Consider the merge sort. Recall that its basic algorithm is:
1. Split the original file of n records into two subfiles of n/2 records.
2. Sort each subfile (recursively, by the same method).
3. Merge the two sorted subfiles.
For example, with n = 8:

a) original file: 8 3 9 5 2 1 4 7
b) after sorting each half: 3 5 8 9 and 1 2 4 7
c) after merging: 1 2 3 4 5 7 8 9
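To make the three steps concrete, here is a minimal sequential sketch in Python (the function names are our own choices, not from the original notes):

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge_sort(records):
    """Split the file in half, sort each half, then merge."""
    if len(records) <= 1:
        return records
    mid = len(records) // 2
    return merge(merge_sort(records[:mid]), merge_sort(records[mid:]))

print(merge_sort([8, 3, 9, 5, 2, 1, 4, 7]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```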
Time complexity is O(n log n).
Proof: observe that the algorithm is recursive. Initially, we sort one file of n records; but to do this we sort two files of (n/2) records, which requires sorting four files of (n/4) records. This can be pictured as a tree:
```
                    sort n records
                   /              \
          sort n/2                  sort n/2
         /        \                /        \
   sort n/4    sort n/4      sort n/4    sort n/4
```
The following table shows the number of comparisons necessary at each level of the tree. Assume for simplicity that n is a power of 2.
| Level | Number of comparisons | Total |
|---|---|---|
| 1 | n/(n/1) * (n/1 - 1) = 1 * (n - 1) | n - 1 |
| 2 | n/(n/2) * (n/2 - 1) = 2 * (n/2 - 1) | n - 2 |
| 3 | n/(n/4) * (n/4 - 1) = 4 * (n/4 - 1) | n - 4 |
| ... | ... | ... |
| log n - 2 | n/8 * (n/(n/8) - 1) = n/8 * (8 - 1) | n - n/8 |
| log n - 1 | n/4 * (n/(n/4) - 1) = n/4 * (4 - 1) | n - n/4 |
| log n | n/2 * (n/(n/2) - 1) = n/2 * (2 - 1) | n - n/2 |
In this sum there are log n terms, so the total number of comparisons is

(n - 1) + (n - 2) + (n - 4) + ... + (n - n/2) = n log n - (1 + 2 + 4 + ... + n/2) = n log n - (n - 1)

The sum subtracted on the right is only n - 1, which is less than O(n log n), so the entire sum goes as O(n log n).
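A quick numeric check of this formula (our own arithmetic, using the worst-case counts from the table, with n = 8):

```python
from math import log2

n = 8
levels = int(log2(n))                      # log n = 3
# Worst-case comparisons at level i (from the table): n - 2**(i - 1)
per_level = [n - 2 ** (i - 1) for i in range(1, levels + 1)]
print(per_level)                           # [7, 6, 4]
print(sum(per_level))                      # 17
print(n * levels - (n - 1))                # n log n - (n - 1) = 24 - 7 = 17
```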
Now suppose we had unlimited processors. Then at level 2 we could do the two sort-n/2 operations in parallel, at level 3 the four sort-n/4 operations, and so on. If k is a multiplicative constant, then our time would now be:
| Level | Split-merge time |
|---|---|
| level 1 | kn |
| level 2 | kn/2 |
| level 3 | kn/4 |
| ... | ... |
| bottom level | kn/n |

Total = kn(1 + 1/2 + 1/4 + 1/8 + ... + 1/n) = k(n + n/2 + n/4 + ... + 1) = k(2n - 1) = O(n)
Thus, we have transformed a sequential algorithm requiring O(n log n) time into a parallel algorithm requiring O(n) time. Of course, the price tag is that we have moved from using a single processor to n/2 processors.
Suppose - more realistically - that we had m processors (m << n). Then we reach the limit of parallelism at a level where we have to do m sorts of size n/m. Each of these is now done by a standard merge sort on a single processor, so it takes O((n/m) log(n/m)) time. The higher levels take O(n(1 + 1/2 + 1/4 + ... + 1/m)) = O(n) time. The total time is therefore O(n) + O((n/m) log(n/m)). If log(n/m) < m, then the dominant term is O(n), so we can achieve performance close to that above without needing O(n) processors.
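The following Python sketch illustrates this m-processor strategy under our own simplifying assumptions: each of m worker processes sorts one chunk of about n/m records sequentially, and the sorted runs are then merged. It is a sketch of the idea, not a tuned implementation (process start-up and data-copying costs are ignored):

```python
from concurrent.futures import ProcessPoolExecutor
from heapq import merge as merge_runs

def sort_chunk(chunk):
    """The sequential O((n/m) log(n/m)) sort done on one processor."""
    return sorted(chunk)

def parallel_sort(records, m=4):
    """Split into m chunks, sort the chunks in parallel, merge the runs."""
    n = len(records)
    size = max(1, -(-n // m))                  # ceil(n / m)
    chunks = [records[i:i + size] for i in range(0, n, size)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        runs = list(pool.map(sort_chunk, chunks))
    # The higher levels of the tree: merge the m sorted runs.
    return list(merge_runs(*runs))

if __name__ == "__main__":
    print(parallel_sort([8, 3, 9, 5, 2, 1, 4, 7], m=2))
```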
Another example: searching (e.g., processing queries against a database).
If we have to examine all the records in a database (or some fixed fraction of the number of records) then the time is clearly O(n). For large databases, this can be prohibitive, especially for online queries.
However, if we can put multiple processors on the task, the complexity remains O(n), but the constant of proportionality is reduced. For example, a query that requires 1 hour - making it prohibitive for an interactive user - would require an acceptable 6 minutes with 10 processors.
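A sketch of this kind of parallel scan in Python; the record layout and the query predicate here are invented for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

def matches(record):
    """Hypothetical query predicate: balance over 1000."""
    return record["balance"] > 1000

def scan_chunk(chunk):
    """Each processor linearly scans its own n/m records."""
    return [r for r in chunk if matches(r)]

def parallel_query(database, m=4):
    """Total work is still O(n); elapsed time is roughly n/m."""
    n = len(database)
    size = max(1, -(-n // m))                  # ceil(n / m)
    chunks = [database[i:i + size] for i in range(0, n, size)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        parts = pool.map(scan_chunk, chunks)
    return [r for part in parts for r in part]

if __name__ == "__main__":
    db = [{"id": i, "balance": 100 * i} for i in range(50)]
    print(len(parallel_query(db, m=4)))        # 39 records match
```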
One of the most important areas of research in concurrent algorithms has been numerical linear algebra. Many (if not most) scientific problems that can be solved using a computer involve solving linear systems.
Two of the most important operations on vectors of data are the linear combination of two vectors (also called a SAXPY operation due to the name of the FORTRAN subroutine in the LINPACK library that performed this operation) and the inner (or dot) product of two vectors.
The linear combination operation is
```
// do a linear combination (SAXPY)
for i := 1 to n do
    c[i] := a[i] + s * b[i];
```

The loop consists of n iterations. Suppose that we want to perform this operation in parallel on m processors. We can give n/m iterations to each processor, and these operations can be done completely in parallel. (Note, however, that some computers like to have long vectors, so splitting them up for separate processors is not always the best thing to do.)
The linear combination has an operation count of 2n (n multiplications and n additions) but, as in the case of searching, the overall time-complexity on m processors is reduced by a factor of m, to 2n/m.
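A minimal Python sketch of the chunked SAXPY (the chunking scheme and names are ours; a real implementation would more likely use a vector library):

```python
from concurrent.futures import ProcessPoolExecutor

def saxpy_chunk(args):
    """Compute c[i] = a[i] + s * b[i] for one slice of the vectors."""
    a, b, s = args
    return [ai + s * bi for ai, bi in zip(a, b)]

def parallel_saxpy(a, b, s, m=4):
    """Give n/m of the n independent iterations to each processor."""
    n = len(a)
    size = max(1, -(-n // m))                  # ceil(n / m)
    tasks = [(a[i:i + size], b[i:i + size], s) for i in range(0, n, size)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        parts = pool.map(saxpy_chunk, tasks)
    return [x for part in parts for x in part]

if __name__ == "__main__":
    a, b = [1.0] * 8, [2.0] * 8
    print(parallel_saxpy(a, b, 3.0, m=2))      # eight copies of 7.0
```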
The inner product operation is different. It looks like
```
// do an inner (dot) product
sum := 0;
for i := 1 to n do
    sum := sum + a[i] * b[i];
```

It is not quite so simple to parallelize this code. We can easily give n/m of the loop's iterations to each of the m processors, and in this way compute m partial sums. However, we still need to combine these into a single sum. This combining process is called a fan-in and requires log m steps on m processors, so the overall time-complexity on m processors is 2n/m + log m. Notice that as m increases the first term becomes smaller but the second term grows larger.
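A sketch of the partial-sum-plus-fan-in scheme in Python. Here the fan-in rounds are simulated sequentially; on a real machine each round's additions would themselves run in parallel, which is where the log m term comes from:

```python
from concurrent.futures import ProcessPoolExecutor

def partial_dot(args):
    """Each processor forms a partial sum over its n/m iterations."""
    a, b = args
    return sum(x * y for x, y in zip(a, b))

def fan_in(values):
    """Pairwise combine: ceil(log2 m) rounds reduce m sums to one."""
    while len(values) > 1:
        paired = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:                    # odd value carries over
            paired.append(values[-1])
        values = paired
    return values[0]

def parallel_dot(a, b, m=4):
    n = len(a)
    size = max(1, -(-n // m))                  # ceil(n / m)
    tasks = [(a[i:i + size], b[i:i + size]) for i in range(0, n, size)]
    with ProcessPoolExecutor(max_workers=m) as pool:
        partials = list(pool.map(partial_dot, tasks))
    return fan_in(partials)                    # the log m stage

if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [8, 7, 6, 5, 4, 3, 2, 1]
    print(parallel_dot(a, b, m=4))             # 120
```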
While potential parallelism in some algorithms is easily discovered, in others it is not. There is much research to be done in this area.
We can formulate some general rules to help us locate portions of a sequential program that can be done in parallel.
Given: a sequential program S1; S2; S3; S4 ... Sn.
The Sk may be compound statements, procedure calls, or what have you. If the program contains loops, we may wish to unwind them first; e.g.,

for i := 1 to n do Si

becomes S1; S2; ... Sn.
We wish to determine whether any given pair of statements Si and Sj can be done in parallel. It turns out that, for i < j, Si and Sj can be executed in parallel if all three of the following conditions hold (here ^ denotes set intersection and {} the empty set):
1. { variables written by Si } ^ { variables read by Sj } = {}

   Rationale: in the sequential form, Sj sees the values of all variables written by Si; but in the parallel form, Sj might read a value before Si writes it.

2. { variables read by Si } ^ { variables written by Sj } = {}

   Rationale: if Si and Sj are done in parallel, and Sj changes a variable before Si gets to read it, then the result of Si will differ from the sequential case.

3. { variables written by Si } ^ { variables written by Sj } = {}

   Rationale: in the sequential form, the value written by Sj replaces the value written by Si; but in the parallel form, Si might write the variable after Sj does, leaving the wrong final value.
Example:

```
S1: b := d;
S2: a := b + c;
S3: d := e + f;
S4: e := a;
S5: f := b;
S6: g := e + f;
```

| Statement | Read set R(Si) | Write set W(Si) |
|---|---|---|
| S1 | {d} | {b} |
| S2 | {b,c} | {a} |
| S3 | {e,f} | {d} |
| S4 | {a} | {e} |
| S5 | {b} | {f} |
| S6 | {e,f} | {g} |

Applying the conditions to each pair of statements:

S1 must precede S2 since W(S1) ^ R(S2) = {b}
S1 must precede S3 since R(S1) ^ W(S3) = {d}
S1 must precede S5 since W(S1) ^ R(S5) = {b}
S2 must precede S4 since W(S2) ^ R(S4) = {a}
S3 must precede S4 since R(S3) ^ W(S4) = {e}
S3 must precede S5 since R(S3) ^ W(S5) = {f}
S4 must precede S6 since W(S4) ^ R(S6) = {e}
S5 must precede S6 since W(S5) ^ R(S6) = {f}

These constraints yield the following precedence graph (the arc from S1 to S5 is not needed, since S1 precedes S3 and S3 precedes S5):

```
         S1
        /  \
      S2    S3
        \  /  \
         S4    S5
           \   /
            S6
```

S2 can be done in parallel with S3 and also with S5; S4 can be done in parallel with S5.
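These three conditions are easy to mechanize. The sketch below (in Python, with the read/write sets transcribed from the example) prints exactly the precedence constraints found above; note that the test is pairwise, so pairs it leaves unconstrained may still be ordered transitively:

```python
def must_precede(R_i, W_i, R_j, W_j):
    """Si must precede Sj if any of the three intersections is non-empty."""
    return bool(W_i & R_j) or bool(R_i & W_j) or bool(W_i & W_j)

# Read and write sets from the example program above.
R = {1: {"d"}, 2: {"b", "c"}, 3: {"e", "f"}, 4: {"a"}, 5: {"b"}, 6: {"e", "f"}}
W = {1: {"b"}, 2: {"a"}, 3: {"d"}, 4: {"e"}, 5: {"f"}, 6: {"g"}}

for i in R:
    for j in R:
        if i < j and must_precede(R[i], W[i], R[j], W[j]):
            print(f"S{i} must precede S{j}")
# Prints the eight constraints derived above (including the redundant
# S1-before-S5 arc). Unprinted pairs, such as S2 with S3 or S4 with S5,
# are candidates for parallel execution.
```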
These notes were written by Prof. R. Bjork of Gordon College. In February 1998 they were edited, converted into HTML and augmented by J. Senning of Gordon College.