CS321 Lecture: External Sorting 3/21/88 - revised 11/22/99
Materials: Transparencies of basic merge sort, balanced 2-way merge sort,
Natural merge, stable natural merge, merge with internal run
generation, merge with replacement selection during run generation.
Transparency of Knuth "snowplow" analogy, polyphase analysis.
Introduction
------------
A. We have seen that the algorithms we use for searching tables stored on
disk are quite different from those used for searching tables stored in
main memory, because the disk access time dominates the processing time.
B. For much the same reason, we use different algorithms for sorting
information stored on disk than for sorting information in main memory.
1. We call an algorithm that sorts data contained in main memory an
INTERNAL SORTING algorithm, while one that sorts data on disk is
called an EXTERNAL SORTING algorithm.
2. In the simplest case - if all the data fits in main memory - we
can simply read the data from disk into main memory, sort it using
an internal sort, and then write it back out.
3. The more interesting case - and the one we consider here - arises
when the file to be sorted does not all fit in main memory.
4. Historically, external sorting algorithms were developed in the context
of systems that used magnetic tapes for file storage, and the
literature still uses the term "tape", even though files are most often
kept on some form of disk. It turns out, though, that the storage
medium being used doesn't really matter because the algorithms we will
consider all read/write data sequentially.
I. A Survey of External Sorting Methods
- - ------ -- -------- ------- -------
A. Most external sorting algorithms are variants of a basic algorithm
known as EXTERNAL MERGE sort. Note that there is also an internal
version of merge sort that we have considered. External merging
reads data one record at a time from each of two or more files, and
writes records to one or more output files. As was the case with
internal merging, external merging is O(n log n) for time, but O(n)
for extra space, and (if done carefully) it is stable.
B. First, though, we need to review some definitions:
1. A RUN is a sequence of records that are in the correct relative order.
2. A STEPDOWN normally occurs at the boundary between runs. Instead
of the key value increasing from one record to the next, it
decreases.
Example: In the following file: B D E C F A G H
- we have three runs (B D E, C F, A G H)
- we have two stepdowns (E C, F A)
3. Observe that an unsorted file can have up to n runs, and up
to n-1 stepdowns. In general (unless the file is exactly
backwards) there will be fewer runs and stepdowns than this,
due to pre-existing order in the file.
4. Observe that a sorted file consists of one run, and has no
stepdowns.
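These definitions translate directly into code. The following sketch (Python, used here purely for illustration; the function name is ours) splits a sequence into its natural runs, with a new run starting at each stepdown:

```python
def natural_runs(records):
    """Split a sequence into its natural runs; a stepdown
    (a key smaller than its predecessor) marks each run boundary."""
    runs = []
    current = []
    for key in records:
        if current and key < current[-1]:   # stepdown: close the current run
            runs.append(current)
            current = []
        current.append(key)
    if current:
        runs.append(current)
    return runs
```

For the example file above, natural_runs("BDECFAGH") yields the three runs B D E, C F, and A G H; the number of stepdowns is then one less than the number of runs.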
C. We begin with a variant of external merge sort that one would not use
directly, but which serves as the foundation on which all the other
variants build.
1. In the simplest merge sort algorithm, we start out by regarding
the file as composed of n runs, each of length 1. (We ignore any
runs which may already be present in the file.) On each pass, we
merge pairs of runs to produce runs of double length.
a. After pass 1, we have n/2 runs of length 2.
b. After pass 2, we have n/4 runs of length 4.
c. The total number of passes will be ceil(log n). [Where ceil
is the ceiling function - the smallest integer greater than or
equal to its argument.] After the last pass, we have 1 run of
length n, as desired.
d. Of course, unless our original file length is a power of 2, there
will be some irregularities in this pattern. In particular, we
let the last run in the file be smaller than all the rest -
possibly even of length zero.
Example: To sort a file of 6 records:
Initially: 6 runs of length 1
After pass 1: 3 runs of length 2 + 1 "dummy" run of length 0
After pass 2: 1 run of length 4 + 1 run of length 2
After pass 3: 1 run of length 6
2. We will use a total of three scratch files to accomplish the sort.
a. Initially, we distribute the input data over two files, so that
half the runs go on each. We do this alternately - i.e. first
we write a run to one file, then to the other - in order to
ensure stability.
b. After the initial distribution, each pass entails merging runs
from two of the scratch files and writing the generated runs on
the third. At the end of the pass, if we are not finished, we
redistribute the runs from the third file alternately back to the
first two.
Example: original file: B D E C F A G H
initial distribution: B E F G (File SCRATCH1)
D C A H (File SCRATCH2)
(remember we ignore runs existing in the raw data)
--------------------------------------------------
after first merge: BD CE AF GH (File SCRATCH3)
PASS 1
redistribution: BD AF (File SCRATCH1)
CE GH (File SCRATCH2)
--------------------------------------------------
after second merge: BCDE AFGH (File SCRATCH3)
PASS 2
redistribution: BCDE (File SCRATCH1)
AFGH (File SCRATCH2)
--------------------------------------------------
after third merge: ABCDEFGH (File SCRATCH3) PASS 3
(no redistribution)
3. Code: TRANSPARENCY
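The transparency is not reproduced here, but the algorithm can be sketched in Python (our own sketch: in-memory lists of runs stand in for the scratch files, and heapq.merge performs the stable two-run merge):

```python
from heapq import merge   # stable merge of sorted inputs

def basic_merge_sort(records):
    # Regard the file as n runs of length 1, ignoring pre-existing runs.
    runs = [[r] for r in records]
    while len(runs) > 1:
        # Distribute the runs alternately over two "scratch files" ...
        scratch1, scratch2 = runs[0::2], runs[1::2]
        # ... padding with a "dummy" run of length 0 if the count is odd,
        scratch2 += [[]] * (len(scratch1) - len(scratch2))
        # then merge pairs of runs onto the third scratch file.
        runs = [list(merge(a, b)) for a, b in zip(scratch1, scratch2)]
    return runs[0] if runs else []
```

Tracing basic_merge_sort(list("BDECFAGH")) reproduces the example above: the first distribution puts B E F G on one scratch file and D C A H on the other, and the passes produce BD CE AF GH, then BCDE AFGH, then ABCDEFGH.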
4. Analysis: we said earlier that all variants of external merge sort
are O(n log n) for time and O(n) for extra space. To compare different
variants, we will need to use a more precise figure than just these
"big O" bounds.
a. We will, therefore, measure space in terms of the total number of
records of scratch space needed. We will also give consideration
to main memory space needed for buffers for the files and (later)
for other purposes as well.
b. We will measure time in terms of the number of records read and
written.
i. Actually, since each record read is eventually written to
another file, it will suffice to simply count reads, and then
double this number to get total records transferred.
ii. Of course, data is normally transferred to/from the file
in complete blocks, rather than record-by-record. However,
the total number of transfers is directly proportional to
the number of records transferred, so we will use the number of
records transferred as our measure.
5. Analysis of the basic merge sort
a. Space: three files, one of length n and two of length n/2. We
can use the output file as one of the scratch files, so
the total additional space is two files of length n/2
= total scratch space for n records
In addition, we need internal memory for three buffers - one
for each of the three files. In general, each buffer needs to
be big enough to hold an entire block of data (based on the
blocksize of the device), rather than a single record.
b. Time:
- Initial distribution involves n reads
- Each pass except the last involves 2n reads due to merging
followed by redistribution. The last pass involves just n reads.
- Total reads = 2 n ceil(log n), so total IO operations =
4n ceil(log n)
D. Improving the basic merge sort: two possible improvements suggest
themselves immediately, and entail minimal extra work.
1. If we had four scratch files instead of three, we could combine the
merging and redistribution into one operation as follows: On each
pass, we use two files for input and two for output. We write the
runs generated alternately to the two output files. After the pass,
we switch the roles of the input and output files.
a. Example: original file: B D E C F A G H
initial distribution: B E F G (File SCRATCH1)
D C A H (File SCRATCH2)
(remember we ignore runs existing in the raw data)
after first pass: BD AF (File SCRATCH3)
CE GH (File SCRATCH4)
after second pass: BCDE (File SCRATCH1)
AFGH (File SCRATCH2)
after third pass: ABCDEFGH (File SCRATCH3)
b. Code: TRANSPARENCY
c. Analysis:
i. Space: four files, two of which must now be of length n,
since we don't know ahead of time, in general, whether we will
have an odd or even number of passes. (The sorted file will
end up on either the first or third scratch file.) The
other two files can be of length n/2. As before, we can use
the output file as one of the scratch files, so the extra
space needed is one file of length n plus two of length n/2
= total scratch space for 2n records
Plus internal memory for 4 file buffers.
ii. Time: Because we have an initial distribution pass before
we start merging, we have a total of
n (1 + ceil(log n)) reads = 2n(1 + ceil(log n)) transfers
iii. We have gained almost a factor of 2 in speed, at the cost
of doubling the scratch file space.
d. This algorithm is known as BALANCED 2-WAY MERGE SORT.
2. Another improvement arises from the observation that our original
algorithm started out assuming that the input file consists of
n runs of length 1 - the worst possible case (a totally backward
file.) In general, the file will contain many runs longer than one
just as a consequence of the randomness of the data, and we can use
these to reduce the number of passes.
a. Example: The sample file we have been using contains 3 runs, so
we could do our initial distribution as follows:
initial distribution: BDE AGH (File SCRATCH1)
CF (File SCRATCH2)
after first pass: BCDEF (File SCRATCH3)
AGH (File SCRATCH4)
after second pass: ABCDEFGH (File SCRATCH1)
(Note: we have assumed the use of a balanced merge; but
a non-balanced merge could also have been used.)
b. Code: TRANSPARENCY
c. This algorithm is called a NATURAL MERGE. The term "natural"
reflects the fact that it relies on runs naturally occurring in
the data.
d. However, this algorithm has a quirk we need to consider.
i. Since we merge one run at a time, we need to know where one run
ends and another run begins. In the case of the previous
algorithms, this was not a problem, since we knew the size of
each run. Here, though, the size will vary from run to run.
In the code we just looked at, the solution to this problem
involved recognizing that the boundary between runs is marked by
a stepdown. Thus, each time we read a new record from an input
file, we will keep track of the last key processed from that
file; and if our newly read key is smaller than that key, then we
know that we have finished processing one run from that file.
Example: in the initial distribution above, we placed two
runs in the first scratch file. The space between
them would not be present in the file; what we
would have is actually BDEAGH. But the run boundary
would be apparent because of the stepdown from E to A.
ii. However, if stability is important to us, we need to be very
careful at this point. In some cases, the stepdown between
two runs could disappear, and an unstable sort could result.
Consider the following file:
F E D C B A M1 Z N M2 (where records M1 and M2 have
identical keys.)
___ No stepdown here, so
| 2 runs look like one:
v
Initial distribution: F | D | B | N
E | C | A M1 Z | M2
F | D | B N
E | C | A M1 Z | M2
First pass: E F | A B M1 N Z
C D | M2
^
|___ No stepdown here, so two runs
look like one:
E F | A B M1 N Z
C D M2 |
Second pass: C D E F M2
A B M1 N Z
Third pass: A B C D E F M2 M1 N Z
^
|
In the case of equal keys, we take
record from first scratch file before
record from second, since first
scratch file should contain records
from earlier in original file.
iii. If stability is a concern, we can prevent this from occurring
by writing a special run-separator record between runs in our
scratch files. This might, for example, be a record whose
key is some impossibly big value like maxint or '~~~~~'.
Of course, processing these records takes extra overhead
that reduces the advantage gained by using the natural runs.
e. Code for a stable variant of natural merge: TRANSPARENCY
f. Analysis:
i. Space is the same as an ordinary merge if no run separator
records are used. However, in the worst case of a totally
backward input file, we would need n run separator records
on our initial distribution, thus potentially doubling the
scratch space needed.
ii. The time will be some fraction of the time needed by an
ordinary merge, and will depend on the average length of
the naturally occurring runs.
- If the naturally occurring runs are of average length 2, then
we save 1 pass - in effect we start where we would be on the
second pass of ordinary merge.
- In general, if the naturally occurring runs are of average
length m, we save at least floor(log m) passes. Thus, if we
use a balanced 2-way merge, our time will be
n (1 + ceil(log n - log m)) reads =
n (1 + ceil(log n/m)) reads or
2n (1 + ceil(log n/m)) IO operations
- Of course, if run separator records are used, then we actually
process more than n records on each pass. This costs
additional time for
n/m reads on first pass
n/2m reads on second pass
n/4m reads on third pass
...
= (2n/m - 1) additional reads,
or about 4n/m extra IO operations
- Obviously, a lot depends on the average run length in the
original data (m). It can be shown that, in totally random
data, the average run length is 2 - which translates into
a savings of 1 merge pass, or 2n IO operations. However, if
we use separator records, we would need 2n extra IO operations
to process them - so we gain nothing! (We could still gain
a little bit by omitting separator records if stability were
not an issue, though.)
- In many cases, though, the raw data does contain considerable
natural order, beyond what is expected randomly. In this
case, natural merging can help us a lot.
E. Two further improvements to the basic merge sort are possible at the cost
of significant added complexity. The first improvement builds on the
idea of the natural merge by using an internal sort during the
distribution phase to CREATE runs of some size.
1. The initial distribution pass now looks like this - assuming we
have room to sort s records at a time internally:
while not eof(infile) do
read up to s records into main memory
sort them
write them to one of the scratch files
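A direct Python rendering of this loop (a sketch of ours - it returns the runs in memory for illustration, where a real implementation would write each one to a scratch file):

```python
from itertools import islice

def generate_runs(infile, s):
    """Read up to s records at a time, sort each batch internally,
    and emit each sorted batch as one run."""
    infile = iter(infile)
    runs = []
    while True:
        batch = list(islice(infile, s))   # read up to s records
        if not batch:                     # eof(infile)
            break
        runs.append(sorted(batch))        # the internal sort
    return runs
```

For example, generate_runs("BDECFAGH", 4) produces the two runs B C D E and A F G H.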
2. Clearly, the effect of this is to reduce the merge time by a
factor of (log (n/s)) / (log n). For example, if s = sqrt(n),
we reduce the merge time by a factor of 2. The overall time is
not reduced as much, of course, because
a. The distribution pass still involves the same number of reads.
b. We must now add time for the internal sorting!
c. Nonetheless, the IO time saved makes internal run generation
almost always worthwhile.
Example: suppose we need to sort 65536 records, and have room
to internally sort 1024 at a time.
- The time for a simple merge sort is
65536 * (1 + log 65536) reads + the same number of writes
= 65536 * 17 * 2 = 2,228,224 IO operations
- The time with internal run generation is
65536 * (1+ log 65536/1024) reads + the same number of writes +
internal sort time
= 65536 * 7 * 2 = 917,504 IO operations + 64 1024-record sorts
3. Code: TRANSPARENCY
Note that this particular algorithm is unstable. Why? (ASK CLASS)
a. The merging process is perfectly stable.
b. But we use an unstable sort - quicksort - for initial run
generation.
c. Clearly, if stability were an issue, we would have to switch to
a stable internal sort.
4. Actually, we have room for further improvements here. The limiting
factor on the size of the initial runs we can generate is obviously
the amount of main memory available for the initial sort.
a. If we had enough room, we could just sort all of the records
internally at one time and be done with the job. The fact that
we are using external sorting presupposes limited main memory
capacity.
b. However, let's consider what happens after we have completed our
internal sort and we are writing our run out to a scratch file.
Each time we write a record out, we vacate a slot in main memory
that could be used to hold a new record from the input file.
Suppose we read one in at this point. There are two possibilities:
i. It could have a key >= the key of the record we just wrote out.
If so, we can insert it into its proper position vis-a-vis the
remaining records, and include it in the present run, thus
increasing the run size by 1.
ii. It could have a key < the key of the record we just wrote out.
In this case, it cannot go into the present run, since it would
cause a stepdown. So we must put it at one end of the buffer
to keep it out of the way.
c. This process is known as REPLACEMENT SELECTION. As each record
is written to the file, a new one is read, and is either included
in the present run or stuck away at the end of the internal buffer.
Eventually, the records stuck away will fill the buffer, of course -
at which point the run terminates.
d. It can be shown that, with random data, replacement selection
increases the average run size by a factor of 2 - i.e. it is like
doubling the size of the internal sorting buffer.
Knuth gives an argument to show this based on an analogy to a
snowplow plowing snow in a steady storm. (TRANSPARENCY - READ TEXT)
e. Code: TRANSPARENCY
Note that this code is unstable, for three reasons:
i. Use of unstable internal sorting method - heapsort - initially.
ii. The replacement selection process is unstable. A newly read
record, because it is placed on top of the heap, could get
ahead of a preceding record with the same key.
iii. We detect run boundaries by using stepdowns, which can allow
two runs to collapse into one in the absence of separator
records (which we don't use.)
iv. Two of these three problems (the first and third) could be
easily fixed. As it turns out, though, the second cannot be
fixed without going to an O(n^2) kind of process for readjusting
the internal buffer after each new record is brought in.
Thus, we leave the algorithm in an unstable form.
5. Data structures for replacement selection
a. There are several different internal structures that can be used
to facilitate replacement selection. In general, we want to
sort the raw data, and then output records one at a time - each time
replacing the record just output with a new one from the input and -
if possible - putting it into its proper place vis-a-vis the
remaining records. Ordinarily we would like to do the original
sorting in O(n log n) time, and we would like each replacement to
take O(log n) time. We do n replacements, so the total time for
the output/replacement phase would then also be O(n log n).
b. The example we just looked at uses a variant of heapsort, but
builds an inverted heap in which the SMALLEST key is on top of
the heap (rather than the largest). Further, we require each
other key to be >= its parent (not <= as in standard heapsort.)
i. This configuration can be achieved by running just the first
phase of heapsort, with the inequalities reversed. (See
BuildInvertedHeap in TRANSPARENCY). Clearly, this takes
O(n log n) time.
ii. As we output each record, we run one iteration of the second
phase of heapsort (with inequalities reversed) to bring a
new record from the top of the heap to its right place.
(See AdjustInvertedHeap in TRANSPARENCY.)
- If possible, the new record we put on top of the heap
is from the input file.
- Otherwise, we promote a record from the rear of the heap,
and put the newly-read record into its place - thus
reducing the effective size of the heap.
Either way, each adjustment takes O(log n) time, so the
expected n adjustments take O(n log n) time.
iii. Example: suppose we have internal storage to sort three
records, and are generating runs from the following input
file:
D B G F A H C I E
Initial read-in: D
B G
Build heap B
D G
Output B to run, replace with F: F
D G
Adjust heap D
F G
Output D to run. Since A is too small to participate in this run,
replace D with G, G with A, and reduce size of heap:
G __
F / A <- not part of heap
Adjust heap F __
G /A
Output F to run, replace with H: H ___
G / A
Adjust heap: G ___
H / A
Output G to run. Since C is too small to participate in this run,
replace G with H, H with C, and reduce size of heap:
H
----
C A <- not part of heap
Adjust heap (no change)
Output H to run, replace with I: I
----
C A
Adjust heap (no change)
Output I to run. Since E is too small to participate in this run,
replace I with E and reduce size of heap:
E <- all not on heap
C A
Since the buffer is now filled up with records not eligible for
the current run, terminate the current run. The records in
the buffer can then be converted into a heap to start the
next run.
Resulting run: B D F G H I
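The same run generation can be sketched compactly with a library heap. This is a sketch of ours, not the transparency's code: it tags each key with a run number rather than physically parking ineligible records at the rear of the buffer, but the effect is identical:

```python
import heapq
from itertools import islice

def replacement_selection(records, capacity):
    """Generate runs by replacement selection with a buffer of
    `capacity` records, using a min-heap of (run_number, key) pairs."""
    it = iter(records)
    # Fill the buffer and heap-order it (the initial internal "sort").
    heap = [(0, key) for key in islice(it, capacity)]
    heapq.heapify(heap)
    runs, current, current_run = [], [], 0
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no != current_run:        # buffer now holds only next-run records
            runs.append(current)
            current, current_run = [], run_no
        current.append(key)              # output the smallest eligible key
        nxt = next(it, None)
        if nxt is not None:
            # A key smaller than the one just written would cause a
            # stepdown, so tag it for the next run instead.
            tag = current_run if nxt >= key else current_run + 1
            heapq.heappush(heap, (tag, nxt))
    if current:
        runs.append(current)
    return runs
```

With the input D B G F A H C I E and room for three records, this produces the run B D F G H I from the trace above, followed by the run A C E.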
c. Other structures can also be used - see Horowitz or Knuth.
The structures they discuss are based on the idea of a tournament
in which pairs of players compete; then the winners of each pair
compete in new pairs etc until an overall winner is chosen.
(Here the "winner" is the record with the smallest key).
The "winner" is output to the run, and a new record - if
eligible - is allowed to work its way through the tournament
along the path the winner previously followed.
d. None of these structures produces a stable sort, however.
For a stable sort, one would have to use an insertion or
selection sort-like structure, with O(n) time for each
replacement instead of O(log n) time.
F. Another improvement involves increasing the MERGE ORDER.
1. In our balanced merge sort, we merge two input files to produce
two output files, then exchange roles and keep going - using a
total of four files.
2. Suppose, however, that we could use six files. This would mean
merging runs from 3 files at a time, and distributing the resulting
runs alternately into 3 output files. How would this affect the
overall run time?
a. Each merge pass would still process all n records, requiring
2n IO operations (one read + one write for each record.)
b. However, each pass would cut the number of runs by a factor of 3.
Thus, we would need ceil(log_3 n) passes instead of ceil(log_2 n).
Since (log_3 n) / (log_2 n) = log_3 2 = 0.63, this translates into
about a 37% savings in the number of passes and hence almost as
big a savings in the number of IO operations. (Our proportional
savings overall are less than 37% because we still have to do
the initial distribution pass.)
3. Of course, further savings are possible by increasing the number
of files. With eight files, we merge 4 runs at a time, and thus
cut the number of passes in half. In general, a BALANCED M-WAY
MERGE SORT has the following behavior:
a. Time: 2n (1 + ceil(log_m n)) IO operations.
(Number of merge passes reduced from 2-way merge by a factor
of log_m 2.)
b. Space: 2m files - of which two are length n and the rest are
of length n/m. If we use the output file as one of these,
then the total scratch space needed is room for
n + (2m - 2)(n/m) = 3n - 2n/m records
plus room for 2m buffers in main memory
(Notice that, when m=2, these reduce to our previous formulas
for balanced 2-way merge sort, as we would expect.)
4. We can use both internal sorting and increased merge order together
to get even better performance. For example, an m-way merge sort of
n records, assuming the ability to sort s records at a time internally,
would require log_m (n/s) passes.
Example: 65536 records, sorted 1024 at a time, using 4-way merge:
Number of IOs = 65536 * (1 + log_4 (65536/1024)) * 2
= 65536 * (1 + 3) * 2 = 524,288 IO operations - almost twice as good
as our previous example using 1024 record internal sorting and
2-way merge.
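The IO counts in this example (and the earlier 917,504 and 2,228,224 figures) all follow from one formula; a small helper of our own makes the comparison easy to reproduce (it uses an integer ceiling-log to avoid float rounding):

```python
def ceil_log(x, base):
    """Smallest k with base**k >= x (for integer x >= 1)."""
    k, power = 0, 1
    while power < x:
        power *= base
        k += 1
    return k

def merge_io_ops(n, s, m):
    """Total IOs (reads + writes) for a balanced m-way merge sort of
    n records, given initial runs of length s from internal sorting."""
    initial_runs = -(-n // s)                    # ceiling division
    return 2 * n * (1 + ceil_log(initial_runs, m))
```

merge_io_ops(65536, 1, 2), merge_io_ops(65536, 1024, 2), and merge_io_ops(65536, 1024, 4) give 2,228,224, 917,504, and 524,288 respectively, matching the figures in the text.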
G. Polyphase merging
1. We have seen that we can speed up an external merge sort by
increasing the merge order. Obviously, if we carry this to its
limit, we could sort a file of n records in just one pass by using
n-way merging.
a. The external storage required for scratch files would be room for
3n - (2n/n) = about 3n records, which seems manageable.
b. However, the fly in the ointment would be the need for 2n buffers
in main memory - each capable of holding at least one record.
If we had internal memory sufficient to hold 2n records, we wouldn't
be using external sorting in the first place!
c. Thus, in general, internal memory considerations will limit the
merge order to some value m (m >= 2), such that we have room for
2m buffers in main memory and no more.
2. These computations have assumed that we use a balanced merge sort.
Suppose, however, that we had room for 2m buffers, but used an
unbalanced merge - i.e.
a. We merge 2m - 1 runs at a time to produce runs in a single output
file.
b. We then redistribute those runs over 2m - 1 files, as in our
original basic merge sort algorithm.
c. The space needed is obviously the same as for balanced merging.
What about the time?
- we now need log_(2m-1) n passes, as compared to 1 + log_m n.
- but each pass requires 4n IO operations as opposed to 2n because
we merge and redistribute in separate steps. (The initial
distribution and the fact that no redistribution is needed after
the last pass cancel each other.)
- Thus, total IOs = 4n log_(2m-1) n versus 2n (1 + log_m n)
- alas, the figure for the unbalanced version is always slightly
bigger than the figure for the balanced version. This is
because the log is not quite cut in half, while the number of
IOs per pass are doubled due to the need to redistribute.
3. We now consider a sorting algorithm known as POLYPHASE merging that
uses 2m files to get a 2m-1 way merge WITHOUT a separate redistribution
of runs after every pass. It thus has the advantages of an unbalanced
sort without the disadvantage.
4. Rather than discuss the strategy in general, let's look at an example.
We will do a 2-way merge using 3 files:
Original file: B D E C F A G H
Initial distribution: B | E | F | G | H (File SCRATCH1)
D | C | A (File SCRATCH2)
(empty) (File SCRATCH3)
(note that one file contains more runs than the other - 5
versus 3, and we have one empty file to receive the results
of our first merge pass. This is all part of the plan.
Also, for simplicity we are ignoring runs existing in the
raw data, treating it as 8 runs of length 1.)
After first phase: G | H (File SCRATCH1)
(empty) (File SCRATCH2)
BD | CE | AF (File SCRATCH3)
(note that we didn't use up all the runs in the first
scratch file. Rather, we stopped the pass when we ran out
of runs in one of the scratch files - namely the second.
That is, it is no longer the case that each pass processes
all of the records. Thus, we use the term "phase" instead
of "pass" to avoid confusion.)
After second phase: (empty) (File SCRATCH1)
BDG | CEH (File SCRATCH2)
AF (File SCRATCH3)
After third phase: ABDFG (File SCRATCH1)
CEH (File SCRATCH2)
(empty) (File SCRATCH3)
After fourth phase: (empty) (File SCRATCH1)
(empty) (File SCRATCH2)
ABCDEFGH (File SCRATCH3)
Note that we end up with four phases - as opposed to three passes
in a standard 2-way merge sort. However, we don't process all
the records on each phase. The total number of IOs is:
2 * (8 [initial distribution] + 6 [phase 1] + 6 [phase 2] +
5 [phase 3] + 8 [phase 4]) = 2 * 33 = 66 - compared with
2 * 8 * (1 + log_2 8) = 64 for a balanced 2-way merge
or 4 * 8 * (log_2 8) = 96 for an unbalanced 2-way merge
That is, we got nearly the performance of a balanced 2-way merge
(that would require 4 files) - but we only used three files!
5. Obviously, one key to the polyphase merge is the way runs are
distributed. Let's look again at the numbers of runs on the
two non-empty files:
Original file: 8
Initial distribution: 5 and 3
After phase 1: 3 and 2
After phase 2: 2 and 1
After phase 3: 1 and 1
Finally: 1
a. What do all these numbers have in common (ASK CLASS)?
- They are all Fibonacci numbers!
b. To see why this works, note that, at any time, we have one file
containing Fib_n runs, and another containing Fib_(n-1). We merge
Fib_(n-1) runs, yielding one file with Fib_(n-1) and one with
Fib_n - Fib_(n-1) = Fib_(n-2). This continues until we end up
with Fib_2 (= 1) runs in one file, and Fib_1 (= 1) run in the
other, yielding one file with one run, as desired.
c. Now it may appear that this pattern poses a problem, since it
appears only to work if the TOTAL number of runs initially is
a Fibonacci number. However, we can make it work in other cases
by assuming the existence of one or more dummy runs in either or
both files.
Example: suppose we are sorting a file with 6 runs - say the
following (ignoring pre-existing order):
Original file: B D E C F A
We distribute initially as follows: B | E | C | A + dummy
(Note: we will look at the algorithm D | F | + dummy
that does this shortly)
It turns out that our merging is slightly more efficient if we
regard the dummy runs as being at the START of the file, rather than
the end.
i. If we regard the dummies as being at the end, then our merging
goes like this:
Initial distribution B | E | C | A + dummy
12 read/writes D | F | + dummy
After Phase 1 (merging 3 runs): A + dummy
(empty)
10 read/writes BD | EF | C
After Phase 2 (merging 2 runs): (empty)
ABD | EF
10 read/writes C
After Phase 3 (merging 1 run): ABCD
EF
8 read/writes (empty)
After Phase 4 (merging 1 run): (empty)
(empty)
12 read/writes ABCDEF
Total read writes = 12 + 10 + 10 + 8 + 12 = 52
ii. However, if we regard the dummies as being at the beginning
of the files, then we have:
Initial distribution dummy + B | E | C | A
12 read/writes dummy + D | F |
After Phase 1 (merging 3 runs) C | A
(empty)
8 read/writes dummy + BD | EF
After Phase 2 (merging 2 runs) (empty)
C | ABD
8 read/writes EF
After Phase 3 (merging 1 run) CEF
ABD
6 read/writes (empty)
After Phase 4 (merging 1 run): (empty)
(empty)
12 read/writes ABCDEF
Total read writes = 12 + 8 + 8 + 6 + 12 = 46
6. We have been considering a 2-way polyphase merge. If we have
room for more scratch files, we can use a higher order merge.
a. In this case, we base the distribution on GENERALIZED FIBONACCI
NUMBERS.
The pth order Fibonacci numbers F(p)_n are defined as follows:
F(p)_n = 0 for 0 <= n <= p - 2
F(p)_n = 1 for n = p - 1
F(p)_n = F(p)_(n-1) + F(p)_(n-2) + ... + F(p)_(n-p) for n >= p
(I.e. after the basis cases, a given generalized Fibonacci number
of order p is simply the sum of its p predecessors.)
Example: the third-order Fibonacci numbers are the following
sequence (where the first is counted as F(3)_0):
0 0 1 1 2 4 7 13 24 44 ...
Note that the ordinary Fibonacci numbers are, in fact, Fibonacci
numbers of order 2 by this definition.
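The definition translates directly into code; this sketch of ours reproduces the third-order sequence shown above:

```python
def gen_fib(p, count):
    """First `count` pth-order Fibonacci numbers: p-1 zeros, then a
    one, after which each term is the sum of its p predecessors."""
    seq = [0] * (p - 1) + [1]
    while len(seq) < count:
        seq.append(sum(seq[-p:]))   # sum of the p preceding terms
    return seq[:count]
```

gen_fib(3, 10) gives 0 0 1 1 2 4 7 13 24 44, and gen_fib(2, n) gives the ordinary Fibonacci numbers.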
b. At the start of any phase of a P-way polyphase merge, the
P input files contain
F(p)_n, F(p)_n + F(p)_(n-1), F(p)_n + F(p)_(n-1) + F(p)_(n-2),
... , F(p)_n + ... + F(p)_(n-p+1)
[1 term] [2 terms] [3 terms] [p terms]
runs, respectively, for some n. During this phase, F(p)_n runs
will be merged, leaving the following number of runs in each file:
0, F(p)_(n-1), F(p)_(n-1) + F(p)_(n-2), ... ,
F(p)_(n-1) + ... + F(p)_(n-p+1)
[1 term] [2 terms] [p-1 terms]
In addition, the output file will now contain
F(p)_n = F(p)_(n-1) + ... + F(p)_(n-p) runs [p terms]
But this has the same form as the initial distribution, with n
replaced by n-1, and with the file that was the output file now
occupying the last position in the list.
c. Eventually, the merge will reduce down to the pattern
F(p)_(p-1), F(p)_(p-1) + F(p)_(p-2), ... , F(p)_(p-1) + ... + F(p)_0
= 1, 1, ... , 1
- i.e. one run in each input file. The final phase merges these
into a single run - which is our sorted file
d. Of course, we will not, in general, have a perfect number of runs
to produce a Fibonacci distribution, so we will have to put some
number of dummy runs at the front of one or more files.
e. Example - 3-way polyphase merge of our original 8 records.
(We will compare to a balanced 2-way merge that uses the same number
of scratch files, and required a total of 64 record transfers.)
Original file: B D E C F A G H
Initial distribution B | C | A | G
16 read/writes D | F | H
dummy + E
Note: # of runs = 2, 3, 4 = F(3)_4, F(3)_4 + F(3)_3, F(3)_4 + F(3)_3 + F(3)_2
Phase 1 - merge 2 runs: A | G
10 read/writes H
(empty)
BD | CEF
Phase 2 - merge 1 run: G
8 read/writes (empty)
ABDH
CEF
Phase 3 - merge 1 run: (empty)
16 read/writes ABCDEFGH
(empty)
(empty)
Total IO Operations = 16 + 10 + 8 + 16 = 50
f. There is a fairly straightforward way to calculate the
distribution by hand:
i. If you are doing a P-way polyphase merge, you will use P+1
files - so draw P+1 columns on scratch paper. We will develop
the distributions by working BACKWARDS from the final state - so
the first distribution we create will be that for the last phase
of the merge, etc.
ii. Each row will record the distribution as of the start of one
phase. The final distribution will, of course, be:
0 0 .... 0 1
And the distribution just before it will be:
1 1 .... 1 0
In each succeeding phase, the empty file will move left one
slot - so at the start of the second-to-the-last phase, we have:
x x .... 0 x <- where the x's represent values to be filled in
Each x will be the sum of the value in this column for the
previously calculated phase plus the value in the column that
is now zero (1 in this case). Thus, our next distribution is
2 2 .. 2 0 1
And the next is
2+2 2+2 .. 0 0+2 1+2 =
4 4 .. 0 2 3
iii. Continue this process until the total number of runs at the
start of the phase is >= the number of runs in the original
input. Add dummy runs if necessary.
iv. Example: 3-way polyphase merge sort of raw data containing 17
runs initially (4 files used):
Final state 0 0 0 1 Total 1
Before last phase 1 1 1 0 Total 3
Before 2nd to last 2 2 0 1 Total 5
Before 3rd to last 4 0 2 3 Total 9
Before 4th to last 0 4 6 7 Total 17
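This backward calculation is easily mechanized. A small Python sketch
(the function name is mine) builds one row per phase, stopping once the
total reaches the required run count; for p = 3 and 17 runs it reproduces
the table above.

```python
def polyphase_distribution(p, runs_needed):
    """Work backward from the final state (0 ... 0 1) until the total
    run count reaches runs_needed.  Returns one row per phase; the last
    row is the initial distribution."""
    rows = [[0] * p + [1]]          # final state: sorted file in last slot
    empty = p                       # the empty slot moves left each step
    while sum(rows[-1]) < runs_needed:
        cur = rows[-1]
        bump = cur[empty]           # value in the column that is now zero
        prev = [c + bump for c in cur]
        prev[empty] = 0
        rows.append(prev)
        empty = (empty - 1) % (p + 1)
    return rows

for row in polyphase_distribution(3, 17):
    print(row, "Total", sum(row))
```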
7. Algorithm:
a. Assume a p-way polyphase merge, using p+1 files - here designated
file[1] .. file[p+1]. The initial distribution will spread runs
over file[1] .. file[p] - leaving file[p+1] empty. (In fact, if
one wished to economize on files, file[p+1] could serve as the input
file - but of course the input file would then be destroyed.)
b. The first merge phase will put its runs in file[p+1]; the second in
file[p]; the third in file[p-1], etc. After a phase puts its runs
in file[1], the next will put its runs in file[p+1] and thus repeat
the cycle.
c. Variables used in both initial distribution and merging process:
level: current level number. Level 1 = final merge, 0 = done.
distribution[i]: total number of runs in file i (1 <= i <= p+1)
dummy[i]: number of dummy runs in file i (1 <= i <= p+1)
d. Initial distribution phase:
set level = 1, distribution[i] = 1 for all i <= p,
dummy[i] = 1 for all i <= p,
distribution[p+1] = dummy[p+1] = 0
set f = 1 <- we will put the next run into file f
generate one run and put it into file f
(* Run generation may be by any method - treat each record as a
run, natural runs, internal sorting, sorting with replacement
selection *)
dummy[f] := 0
while the input file is not empty do
    if dummy[f] < dummy[f+1] then
        f := f + 1
    else
        begin
            if dummy[f] = 0 then
                begin
                    (* At this point, all values of dummy[] must be zero *)
                    level := level + 1;
                    compute new values of distribution[] and dummy[]
                end;
            f := 1  (* back to the first file, whether or not a new
                       level was just started *)
        end;
    generate one run and put it into file f
    dummy[f] := dummy[f] - 1
where we compute new values of distribution[] and dummy[] as
follows:
merges_this_level := distribution[1];
for i := 1 to p do
dummy[i] := merges_this_level + distribution[i+1] -
distribution[i]
distribution[i] := distribution[i] + dummy[i]
Example: Distribute 17 runs for a 3-way merge (p = 3)
Level Distribution[i]/Dummy[i] f #runs
so far
1 2 3 4
1 1/1 1/1 1/1 0/0 1 0 Initialization
1/0 1 Generate 1st run
1/0 2 2 First time thru loop
1/0 3 3 2nd time thru loop
Enter loop 3rd time
Compute new distribution. Merges_this_level = 1.
2 2/1 2/1 1/0 1
2/0 4 Finish loop
2/0 2 5 2nd time thru loop
Enter loop 3rd time
Compute new distribution. Merges_this_level = 2.
3 4/2 3/1 2/1 1
4/1 6 Finish loop
4/0 7 2nd time thru loop
3/0 2 8 3rd time thru loop
2/0 3 9 4th time thru loop
Enter loop 5th time
Compute new distribution. Merges_this_level = 4.
4 7/3 6/3 4/2 1
7/2 10 Finish loop
6/2 2 11 2nd time thru loop
7/1 1 12 3rd time thru loop
6/1 2 13 4th time thru loop
4/1 3 14 5th time thru loop
7/0 1 15 6th time thru loop
6/0 2 16 7th time thru loop
4/0 3 17 8th time thru loop
OUT OF RUNS - 17 GENERATED SO FAR
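The distribution algorithm can be rendered directly in code. The sketch
below (Python; distribute_runs is a hypothetical name, and run generation
is abstracted to a simple run count) follows the pseudocode above; for 17
runs and p = 3 it ends in the same state as the trace.

```python
def distribute_runs(num_runs, p):
    """Place num_runs runs on files 1..p for a p-way polyphase merge
    (file p+1 stays empty).  Returns the final level, distribution[1..p],
    and dummy[1..p]."""
    distribution = [0] + [1] * p + [0]    # index 0 unused; p+1 is sentinel
    dummy = [0] + [1] * p + [0]
    level = 1
    f = 1
    # generate the first run and put it into file 1
    dummy[f] -= 1
    for _ in range(num_runs - 1):         # while the input is not empty
        if dummy[f] < dummy[f + 1]:
            f += 1
        else:
            if dummy[f] == 0:             # all dummies are zero: new level
                level += 1
                merges = distribution[1]
                for i in range(1, p + 1):
                    dummy[i] = merges + distribution[i + 1] - distribution[i]
                    distribution[i] += dummy[i]
            f = 1
        # generate one run and put it into file f
        dummy[f] -= 1
    return level, distribution[1:p + 1], dummy[1:p + 1]

print(distribute_runs(17, 3))   # -> (4, [7, 6, 4], [0, 0, 0])
```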
e. Actually doing the merges
for i := 1 to level do
case i mod (p + 1) of
1: merge from files 1..p to p+1
2: merge from files p+1, 1 .. p-1 to p
3: merge from files p .. p+1, 1 .. p-2 to p-1
...
0: merge from files 2 .. p+1 to 1
Where for each merge, we do the following:
for j := 1 to distribution[last input file] do
if dummy[] > 0 for all input files then
dummy[output_file] := dummy[output_file] + 1
for each input file:
if dummy[this_file] > 0 then
dummy[this_file] := dummy[this_file] - 1
end_of_run[this_file] := true
else
end_of_run[this_file] := eof(file[this_file])
while not all files at end of run do
transfer record with smallest key among those not
at end of run to output file
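As an aside, the case statement selecting the output file follows a
simple arithmetic pattern; a one-line sketch (the function name is mine):

```python
def output_file(phase, p):
    """File (1..p+1) that receives the output of the given merge phase:
    phase 1 -> p+1, phase 2 -> p, ..., then the cycle repeats."""
    return p + 1 - ((phase - 1) % (p + 1))

print([output_file(i, 3) for i in range(1, 6)])   # prints [4, 3, 2, 1, 4]
```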
8. Note that polyphase merging is NOT STABLE.
9. Knuth discusses, in detail, the analysis of polyphase merge
sorting, various improvements to it, plus other similar algorithms.
The following summarizes the time behavior as a function of
the number of "tapes".
TRANSPARENCY
II. The following summarizes the external sorts we have considered:
-- ---- -------- ---------- --- -------- ----- -- ---- ----------
Method            Scratch space  Records                    Internal        Stable
                  (records)      read/written               space/time
Basic merge       n              4n ceil(log2 n)            3 buffers       yes
Balanced 2-way    2n             2n (1+ceil(log2 n))        4 buffers       yes
Balanced natural  2n             2n (1+ceil(log2 (n/m)))    4 buffers       no *
 - initial runs of
   average size m
Balanced with     2n             2n (1+ceil(log2 (n/s)))    4 buffers +     yes
 internal sort of                                           space to sort
 s records                                                  s records;
                                                            (n/s) log(s)
                                                            internal sort
                                                            time
Balanced with     2n             2n (1+ceil(log2 (n/2s)))   [SAME]          no
 internal sort &                 = 2n ceil(log2 (n/s))
 replacement
 selection -
 random data
Balanced m-way    3n - 2n/m      2n (1+ceil(log_m n))       2m buffers      yes
Balanced m-way    3n - 2n/m      2n (1+ceil(log_m (n/s)))   2m buffers +    yes
 with internal                                              space to sort
 sort of size s                                             s records
 (no replacement
 selection)
Polyphase m-way   [must analyze distribution case-by-case]  m+1 files       no
* = can be made stable by using separator records
III. Sorting with multiple keys
--- ------- ---- -------- ----
A. Thus far, we have assumed that each record in the file to be sorted
contains one key field. What if the record contains multiple keys -
e.g. a last name, first name, and middle initial?
1. We wish the records to be ordered first by the primary key (last
name).
2. In the case of duplicate primary keys, we wish ordering on the
secondary key (first name).
3. In the case of ties on both keys, we wish ordering on the tertiary
key (middle initial).
etc - to any number of keys.
B. The approach we will discuss here applies to BOTH INTERNAL AND EXTERNAL
SORTS.
C. There are two techniques that can be used for cases like this:
1. We can modify an existing algorithm to consider multiple keys when
it does comparisons - e.g.
a. Original algorithm says:
if item[i].key < item[j].key then
b. Revised algorithm says:
   if (item[i].primary_key < item[j].primary_key) or
      ((item[i].primary_key = item[j].primary_key) and
       (item[i].secondary_key < item[j].secondary_key)) or
      ((item[i].primary_key = item[j].primary_key) and
       (item[i].secondary_key = item[j].secondary_key) and
       (item[i].tertiary_key < item[j].tertiary_key)) then
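In a language with composite comparisons, the nested test collapses to a
tuple comparison. A Python sketch (the record layout is invented for
illustration):

```python
from collections import namedtuple

# Hypothetical record layout for illustration
Item = namedtuple("Item", "primary_key secondary_key tertiary_key")

def multi_key_less(a, b):
    # Lexicographic comparison - primary first, then secondary, then
    # tertiary - equivalent to the nested boolean expression above
    return (a.primary_key, a.secondary_key, a.tertiary_key) < \
           (b.primary_key, b.secondary_key, b.tertiary_key)

print(multi_key_less(Item("Adams", "John", "A"),
                     Item("Smith", "Jane", "Q")))   # prints True
```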
2. We can sort the same file several times, USING A STABLE SORT.
a. First sort is on least significant key.
b. Second sort is on second least significant key.
c. Etc.
d. Final sort is on primary key.
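Steps a-d can be demonstrated with any stable sort. Python's list.sort is
guaranteed stable, so a minimal sketch (the sample records are invented):

```python
# Hypothetical sample records: (last name, first name, middle initial)
records = [
    ("Smith", "Jane", "Q"),
    ("Adams", "John", "A"),
    ("Smith", "Jane", "B"),
    ("Smith", "Alan", "Z"),
]

# Sort on the LEAST significant key first; because list.sort is stable,
# each later pass preserves the order established by the earlier ones.
records.sort(key=lambda r: r[2])   # tertiary key: middle initial
records.sort(key=lambda r: r[1])   # secondary key: first name
records.sort(key=lambda r: r[0])   # primary key: last name

print(records)
```

After the final pass the list is ordered by last name, with ties broken by
first name and then middle initial.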
3. The first approach is usable when we are embedding a sort in a
specific application package; the second is more viable when we are
building a utility sorting routine for general use [but note that we
are then forced to use a stable algorithm.]
Copyright ©1999 - Russell C. Bjork