CS321 Lecture: External Sorting          3/21/88 - revised 11/22/99

Materials: Transparencies of basic merge sort, balanced 2-way merge sort,
natural merge, stable natural merge, merge with internal run generation,
merge with replacement selection during run generation. Transparency of
Knuth "snowplow" analogy, polyphase analysis.

Introduction
------------

A. We have seen that the algorithms we use for searching tables stored on
   disk are quite different from those used for searching tables stored
   in main memory, because the disk access time dominates the processing
   time.

B. For much the same reason, we use different algorithms for sorting
   information stored on disk than for sorting information in main
   memory.

   1. We call an algorithm that sorts data contained in main memory an
      INTERNAL SORTING algorithm, while one that sorts data on disk is
      called an EXTERNAL SORTING algorithm.

   2. In the simplest case - if all the data fits in main memory - we
      can simply read the data from disk into main memory, sort it using
      an internal sort, and then write it back out.

   3. The more interesting case - and the one we consider here - arises
      when the file to be sorted does not all fit in main memory.

   4. Historically, external sorting algorithms were developed in the
      context of systems that used magnetic tapes for file storage, and
      the literature still uses the term "tape", even though files are
      most often kept on some form of disk. It turns out, though, that
      the storage medium being used doesn't really matter, because the
      algorithms we will consider all read/write data sequentially.

I. A Survey of External Sorting Methods
-  - ------ -- -------- ------- -------

A. Most external sorting algorithms are variants of a basic algorithm
   known as EXTERNAL MERGE sort. (Note that there is also an internal
   version of merge sort, which we have considered.) External merging
   reads data one record at a time from each of two or more input files,
   and writes records to one or more output files.
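The core file-to-file operation can be sketched as follows (a minimal
sketch in Python, used for illustration throughout; lists stand in for
sequential files, and the function name is hypothetical):

```python
def merge_files(in1, in2, out):
    """Merge two sorted sequential 'files' (here: iterables of records)
    into one output 'file', reading one record at a time from each."""
    i1, i2 = iter(in1), iter(in2)
    r1, r2 = next(i1, None), next(i2, None)
    while r1 is not None and r2 is not None:
        if r1 <= r2:          # <= (not <) on equal keys keeps the merge stable
            out.append(r1)
            r1 = next(i1, None)
        else:
            out.append(r2)
            r2 = next(i2, None)
    # copy whichever input still has records
    while r1 is not None:
        out.append(r1)
        r1 = next(i1, None)
    while r2 is not None:
        out.append(r2)
        r2 = next(i2, None)

out = []
merge_files(["A", "C", "F"], ["B", "D", "E"], out)
# out is now ["A", "B", "C", "D", "E", "F"]
```

Note the use of <= rather than < when comparing keys: on a tie, the
record from the first input is taken first, which is what makes the
merge step stable.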
   As was the case with internal merging, external merging is O(n log n)
   for time and O(n) for extra space, and (if done carefully) it is
   stable.

B. First, though, we need to review some definitions:

   1. A RUN is a sequence of records that are in the correct relative
      order.

   2. A STEPDOWN normally occurs at the boundary between runs: instead
      of the key value increasing from one record to the next, it
      decreases.

      Example: In the following file:  B D E C F A G H

      - we have three runs (B D E, C F, A G H)
      - we have two stepdowns (E C, F A)

   3. Observe that an unsorted file can have up to n runs and up to n-1
      stepdowns. In general (unless the file is exactly backwards),
      there will be fewer runs and stepdowns than this, due to
      pre-existing order in the file.

   4. Observe that a sorted file consists of one run, and has no
      stepdowns.

C. We begin with a variant of external merge sort that one would not use
   directly, but which serves as the foundation on which all the other
   variants build.

   1. In the simplest merge sort algorithm, we start out by regarding
      the file as composed of n runs, each of length 1. (We ignore any
      runs which may already be present in the file.) On each pass, we
      merge pairs of runs to produce runs of double length.

      a. After pass 1, we have n/2 runs of length 2.

      b. After pass 2, we have n/4 runs of length 4.

      c. The total number of passes will be ceil(log n). [Where ceil is
         the ceiling function - smallest integer greater than or equal
         to its argument.] After the last pass, we have 1 run of length
         n, as desired.

      d. Of course, unless our original file length is a power of 2,
         there will be some irregularities in this pattern. In
         particular, we let the last run in the file be smaller than all
         the rest - possibly even of length zero.

         Example: To sort a file of 6 records:

         Initially:     6 runs of length 1
         After pass 1:  3 runs of length 2 + 1 "dummy" run of length 0
         After pass 2:  1 run of length 4 + 1 run of length 2
         After pass 3:  1 run of length 6

   2.
      We will use a total of three scratch files to accomplish the sort.

      a. Initially, we distribute the input data over two files, so that
         half the runs go on each. We do this alternately - i.e. first
         we write a run to one file, then to the other - in order to
         ensure stability.

      b. After the initial distribution, each pass entails merging runs
         from two of the scratch files and writing the generated runs on
         the third. At the end of the pass, if we are not finished, we
         redistribute the runs from the third file alternately back to
         the first two.

         Example: original file: B D E C F A G H

         initial distribution:  B E F G      (File SCRATCH1)
                                D C A H      (File SCRATCH2)
         (remember we ignore runs existing in the raw data)
         --------------------------------------------------
         after first merge:     BD CE AF GH  (File SCRATCH3)  PASS 1
         redistribution:        BD AF        (File SCRATCH1)
                                CE GH        (File SCRATCH2)
         --------------------------------------------------
         after second merge:    BCDE AFGH    (File SCRATCH3)  PASS 2
         redistribution:        BCDE         (File SCRATCH1)
                                AFGH         (File SCRATCH2)
         --------------------------------------------------
         after third merge:     ABCDEFGH     (File SCRATCH3)  PASS 3
         (no redistribution)

   3. Code: TRANSPARENCY

   4. Analysis: we said earlier that all variants of external merge sort
      are O(n log n) for time and O(n) for extra space. To compare
      different variants, we will need to use a more precise figure than
      just these "big O" bounds.

      a. We will, therefore, measure space in terms of the total number
         of records of scratch space needed. We will also give
         consideration to main memory space needed for buffers for the
         files and (later) for other purposes as well.

      b. We will measure time in terms of the number of records read and
         written.

         i. Actually, since each record read is eventually written to
            another file, it will suffice to simply count reads, and
            then double this number to get total records transferred.

         ii. Of course, data is normally transferred to/from the file in
             complete blocks, rather than record-by-record.
             However, the total number of transfers is directly
             proportional to the number of records transferred, so we
             will use the number of records transferred as our measure.

   5. Analysis of the basic merge sort

      a. Space: three files, one of length n and two of length n/2. We
         can use the output file as one of the scratch files, so the
         total additional space is two files of length n/2

         = total scratch space for n records

         In addition, we need internal memory for three buffers - one
         for each of the three files. In general, each buffer needs to
         be big enough to hold an entire block of data (based on the
         blocksize of the device), rather than a single record.

      b. Time:

         - Initial distribution involves n reads.
         - Each pass except the last involves 2n reads, due to merging
           followed by redistribution. The last pass involves just n
           reads.
         - Total reads = 2n ceil(log n), so total IO operations =
           4n ceil(log n)

D. Improving the basic merge sort: two possible improvements suggest
   themselves immediately, and entail minimal extra work.

   1. If we had four scratch files instead of three, we could combine
      the merging and redistribution into one operation as follows: on
      each pass, we use two files for input and two for output. We write
      the runs generated alternately to the two output files. After the
      pass, we switch the roles of the input and output files.

      a. Example: original file: B D E C F A G H

         initial distribution:  B E F G    (File SCRATCH1)
                                D C A H    (File SCRATCH2)
         (remember we ignore runs existing in the raw data)

         after first pass:      BD AF      (File SCRATCH3)
                                CE GH      (File SCRATCH4)

         after second pass:     BCDE       (File SCRATCH1)
                                AFGH       (File SCRATCH2)

         after third pass:      ABCDEFGH   (File SCRATCH3)

      b. Code: TRANSPARENCY

      c. Analysis:

         i. Space: four files, two of which must now be of length n,
            since we don't know ahead of time, in general, whether we
            will have an odd or even number of passes. (The sorted file
            will end up on either the first or third scratch file.) The
            other two files can be of length n/2.
            As before, we can use the output file as one of the scratch
            files, so the extra space needed is one file of length n
            plus two of length n/2

            = total scratch space for 2n records

            plus internal memory for 4 file buffers.

         ii. Time: Because we have an initial distribution pass before
             we start merging, we have a total of

             n (1 + ceil(log n)) reads = 2n (1 + ceil(log n)) transfers

         iii. We have gained almost a factor of 2 in speed, at the cost
              of doubling the scratch file space.

      d. This algorithm is known as BALANCED 2-WAY MERGE SORT.

   2. Another improvement arises from the observation that our original
      algorithm started out assuming that the input file consists of n
      runs of length 1 - the worst possible case (a totally backward
      file). In general, the file will contain many runs longer than
      one, just as a consequence of the randomness of the data, and we
      can use these to reduce the number of passes.

      a. Example: The sample file we have been using contains 3 runs, so
         we could do our initial distribution as follows:

         initial distribution:  BDE AGH    (File SCRATCH1)
                                CF         (File SCRATCH2)

         after first pass:      BCDEF      (File SCRATCH3)
                                AGH        (File SCRATCH4)

         after second pass:     ABCDEFGH   (File SCRATCH1)

         (Note: we have assumed the use of a balanced merge; but a
         non-balanced merge could also have been used.)

      b. Code: TRANSPARENCY

      c. This algorithm is called a NATURAL MERGE. The term "natural"
         reflects the fact that it relies on runs naturally occurring in
         the data.

      d. However, this algorithm has a quirk we need to consider.

         i. Since we merge one run at a time, we need to know where one
            run ends and another begins. In the case of the previous
            algorithms, this was not a problem, since we knew the size
            of each run. Here, though, the size will vary from run to
            run. In the code we just looked at, the solution to this
            problem involved recognizing that the boundary between runs
            is marked by a stepdown.
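That stepdown test can be sketched as a small helper (hypothetical
Python helper, with bare keys standing in for records):

```python
def read_run(records, start):
    """Return (run, next_position): the natural run beginning at
    records[start], ended by a stepdown (a key smaller than its
    predecessor) or by end-of-file."""
    run = [records[start]]
    i = start + 1
    while i < len(records) and records[i] >= records[i - 1]:
        run.append(records[i])
        i += 1
    return run, i

# The file B D E A G H contains two runs; the boundary is visible
# only as the stepdown from E to A:
data = list("BDEAGH")
run1, pos = read_run(data, 0)    # ['B', 'D', 'E']
run2, pos = read_run(data, pos)  # ['A', 'G', 'H']
```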
            Thus, each time we read a new record from an input file, we
            will keep track of the last key processed from that file;
            and if our newly read key is smaller than that key, then we
            know that we have finished processing one run from that
            file.

            Example: in the initial distribution above, we placed two
            runs in the first scratch file. The space between them would
            not be present in the file; what we would actually have is
            BDEAGH. But the run boundary would be apparent because of
            the stepdown from E to A.

         ii. However, if stability is important to us, we need to be
             very careful at this point. In some cases, the stepdown
             between two runs can disappear, and an unstable sort can
             result. Consider the following file:

             F E D C B A M1 Z N M2

             (where records M1 and M2 have identical keys)

             Initial distribution:  F | D | B | N        (File SCRATCH1)
                                    E | C | A M1 Z | M2  (File SCRATCH2)

             There is no stepdown between B and N in SCRATCH1, so those
             two runs look like one run B N.

             First pass:            E F | A B M1 N Z     (File SCRATCH3)
                                    C D | M2             (File SCRATCH4)

             Now there is no stepdown between D and M2 in SCRATCH4, so
             those two runs look like one run C D M2.

             Second pass:           C D E F M2           (File SCRATCH1)
                                    A B M1 N Z           (File SCRATCH2)

             Third pass:            A B C D E F M2 M1 N Z

             M2 now precedes M1 - the sort is unstable. (In the case of
             equal keys, we take the record from the first scratch file
             before the record from the second, since the first scratch
             file should contain records from earlier in the original
             file.)

         iii. If stability is a concern, we can prevent this from
              occurring by writing a special run-separator record
              between runs in our scratch files. This might, for
              example, be a record whose key is some impossibly big
              value like maxint or '~~~~~'. Of course, processing these
              records takes extra overhead that reduces the advantage
              gained by using the natural runs.

      e. Code for a stable variant of natural merge: TRANSPARENCY

      f. Analysis:

         i. Space is the same as an ordinary merge if no run-separator
            records are used. However, in the worst case of a totally
            backward input file, we would need n run-separator records
            on our initial distribution, thus potentially doubling the
            scratch space needed.

         ii.
             The time will be some fraction of the time needed by an
             ordinary merge, and will depend on the average length of
             the naturally occurring runs.

             - If the naturally occurring runs are of average length 2,
               then we save 1 pass - in effect, we start where we would
               be on the second pass of an ordinary merge.

             - In general, if the naturally occurring runs are of
               average length m, we save at least floor(log m) passes.
               Thus, if we use a balanced 2-way merge, our time will be

               n (1 + ceil(log n - log m)) reads
               = n (1 + ceil(log n/m)) reads
               or 2n (1 + ceil(log n/m)) IO operations

             - Of course, if run separator records are used, then we
               actually process more than n records on each pass. This
               costs additional time for

               n/m reads on the first pass
               n/2m reads on the second pass
               n/4m reads on the third pass
               ...
               = (2n/m - 1) additional reads, or about 4n/m extra IO
               operations

             - Obviously, a lot depends on the average run length in the
               original data (m). It can be shown that, in totally
               random data, the average run length is 2 - which
               translates into a savings of 1 merge pass, or 2n IO
               operations. However, if we use separator records, we
               would need 2n extra IO operations to process them - so we
               gain nothing! (We could still gain a little bit by
               omitting separator records if stability were not an
               issue, though.)

             - In many cases, though, the raw data does contain
               considerable natural order, beyond what is expected
               randomly. In this case, natural merging can help us a
               lot.

E. Two further improvements to the basic merge sort are possible, at the
   cost of significant added complexity. The first improvement builds on
   the idea of the natural merge by using an internal sort during the
   distribution phase to CREATE runs of some size.

   1. The initial distribution pass now looks like this - assuming we
      have room to sort s records at a time internally:

      while not eof(infile) do
          read up to s records into main memory
          sort them
          write them to one of the scratch files

   2.
      Clearly, the effect of this is to reduce the merge time by a
      factor of (log (n/s)) / (log n). For example, if s = sqrt(n), we
      reduce the merge time by a factor of 2. The overall time is not
      reduced as much, of course, because

      a. The distribution pass still involves the same number of reads.

      b. We must now add time for the internal sorting!

      c. Nonetheless, the IO time saved makes internal run generation
         almost always worthwhile.

         Example: suppose we need to sort 65536 records, and have room
         to internally sort 1024 at a time.

         - The time for a simple merge sort is

           65536 * (1 + log 65536) reads + the same number of writes
           = 65536 * 17 * 2 = 2,228,224 IO operations

         - The time with internal run generation is

           65536 * (1 + log (65536/1024)) reads
             + the same number of writes + internal sort time
           = 65536 * 7 * 2 = 917,504 IO operations
             + 64 1024-record sorts

   3. Code: TRANSPARENCY

      Note that this particular algorithm is unstable. Why? (ASK CLASS)

      a. The merging process is perfectly stable.

      b. But we use an unstable sort - quicksort - for initial run
         generation.

      c. Clearly, if stability were an issue, we would have to switch to
         a stable internal sort.

   4. Actually, we have room for further improvements here. The limiting
      factor on the size of the initial runs we can generate is
      obviously the amount of main memory available for the initial
      sort.

      a. If we had enough room, we could just sort all of the records
         internally at one time and be done with the job. The fact that
         we are using external sorting presupposes limited main memory
         capacity.

      b. However, let's consider what happens after we have completed
         our internal sort and we are writing our run out to a scratch
         file. Each time we write a record out, we vacate a slot in main
         memory that could be used to hold a new record from the input
         file. Suppose we read one in at this point. There are two
         possibilities:

         i. It could have a key >= the key of the record we just wrote
            out.
            If so, we can insert it into its proper position vis-a-vis
            the remaining records, and include it in the present run,
            thus increasing the run size by 1.

         ii. It could have a key < the key of the record we just wrote
             out. In this case, it cannot go into the present run, since
             it would cause a stepdown. So we must put it at one end of
             the buffer to keep it out of the way.

      c. This process is known as REPLACEMENT SELECTION. As each record
         is written to the file, a new one is read, and is either
         included in the present run or stuck away at the end of the
         internal buffer. Eventually, the records stuck away will fill
         the buffer, of course - at which point the run terminates.

      d. It can be shown that, with random data, replacement selection
         increases the average run size by a factor of 2 - i.e. it is
         like doubling the size of the internal sorting buffer. Knuth
         gives an argument to show this based on an analogy to a
         snowplow plowing snow in a steady storm.
         (TRANSPARENCY - READ TEXT)

      e. Code: TRANSPARENCY

         Note that this code is unstable, for three reasons:

         i. Use of an unstable internal sorting method - heapsort -
            initially.

         ii. The replacement selection process is unstable. A newly read
             record, because it is placed on top of the heap, could get
             ahead of a preceding record with the same key.

         iii. We detect run boundaries by using stepdowns, which can
              allow two runs to collapse into one in the absence of
              separator records (which we don't use).

         iv. Two of these three problems (the first and third) could be
             easily fixed. As it turns out, though, the second cannot be
             fixed without going to an O(n^2) kind of process for
             readjusting the internal buffer after each new record is
             brought in. Thus, we leave the algorithm in an unstable
             form.

   5. Data structures for replacement selection

      a. There are several different internal structures that can be
         used to facilitate replacement selection.
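As a concrete sketch of the replacement-selection process itself
(Python's heapq plays the role of the internal structure here; the
function name and the buffer-size parameter s are illustrative, not
from the transparency):

```python
import heapq

def replacement_selection(records, s):
    """Generate runs from `records` using a buffer of s records.
    Each record written out is replaced by a new one, which joins the
    current run if its key is >= the key just written, and is
    otherwise held back to start the next run."""
    it = iter(records)
    heap = [r for _, r in zip(range(s), it)]   # initial read-in
    heapq.heapify(heap)
    runs = []
    while heap:
        run, held_back = [], []
        while heap:
            smallest = heapq.heappop(heap)
            run.append(smallest)               # output smallest key
            nxt = next(it, None)
            if nxt is None:
                continue                       # input exhausted
            if nxt >= smallest:
                heapq.heappush(heap, nxt)      # joins the current run
            else:
                held_back.append(nxt)          # would cause a stepdown
        runs.append(run)                       # buffer full of held-back
        heap = held_back                       # records: start next run
        heapq.heapify(heap)
    return runs
```

On the lecture's example input D B G F A H C I E with s = 3, this
produces the run B D F G H I followed by the run A C E.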
         In general, we want to sort the raw data, and then output
         records one at a time - each time replacing the record we
         output with a new one from the input and - if possible -
         putting it into its proper place vis-a-vis the remaining
         records. Ordinarily we would like to do the original sorting in
         O(n log n) time, and we would like each replacement to take
         O(log n) time. We do n replacements, so the total time for the
         output/replacement phase would then also be O(n log n).

      b. The example we just looked at uses a variant of heapsort, but
         builds an inverted heap in which the SMALLEST key is on top of
         the heap (rather than the largest). Further, we require each
         other key to be >= its parent (not <= as in standard heapsort).

         i. This configuration can be achieved by running just the first
            phase of heapsort, with the inequalities reversed. (See
            BuildInvertedHeap in TRANSPARENCY.) Clearly, this takes
            O(n log n) time.

         ii. As we output each record, we run one iteration of the
             second phase of heapsort (with inequalities reversed) to
             bring a new record from the top of the heap to its right
             place. (See AdjustInvertedHeap in TRANSPARENCY.)

             - If possible, the new record we put on top of the heap is
               from the input file.

             - Otherwise, we promote a record from the rear of the heap,
               and put the newly-read record into its place - thus
               reducing the effective size of the heap.

             Either way, each adjustment takes O(log n) time, so the
             expected n adjustments take O(n log n) time.

         iii. Example: suppose we have internal storage to sort three
              records, and are generating runs from the following input
              file:  D B G F A H C I E

              Initial read-in:       D B G

              Build heap:              B
                                      / \
                                     D   G

              Output B to run,         F
              replace with F:         / \
                                     D   G

              Adjust heap:             D
                                      / \
                                     F   G

              Output D to run. Since A is too small to participate in
              this run, replace D with G, G with A, and reduce the size
              of the heap:

                                       G
                                      / \
                                     F  (A)  <- not part of heap

              Adjust heap:             F
                                      / \
                                     G  (A)

              Output F to run,         H
              replace with H:         / \
                                     G  (A)

              Adjust heap:             G
                                      / \
                                     H  (A)

              Output G to run.
              Since C is too small to participate in this run, replace G
              with H, H with C, and reduce the size of the heap:

                                       H
                                      / \
                                   (C)  (A)  <- not part of heap

              Adjust heap: (no change)

              Output H to run,         I
              replace with I:         / \
                                   (C)  (A)

              Adjust heap: (no change)

              Output I to run. Since E is too small to participate in
              this run, replace I with E and reduce the size of the
              heap:

                                      (E)
                                      / \
                                   (C)  (A)  <- all not on heap

              Since the buffer is now filled up with records not
              eligible for the current run, terminate the current run.
              The records in the buffer can then be converted into a
              heap to start the next run.

              Resulting run: B D F G H I

      c. Other structures can also be used - see Horowitz or Knuth. The
         structures they discuss are based on the idea of a tournament
         in which pairs of players compete; then the winners of each
         pair compete in new pairs, etc., until an overall winner is
         chosen. (Here the "winner" is the record with the smallest
         key.) The "winner" is output to the run, and a new record - if
         eligible - is allowed to work its way through the tournament
         along the path the winner previously followed.

      d. None of these structures produces a stable sort, however. For a
         stable sort, one would have to use an insertion or selection
         sort-like structure, with O(n) time for each replacement
         instead of O(log n) time.

F. Another improvement involves increasing the MERGE ORDER.

   1. In our balanced merge sort, we merge two input files to produce
      two output files, then exchange roles and keep going - using a
      total of four files.

   2. Suppose, however, that we could use six files. This would mean
      merging runs from 3 files at a time, and distributing the
      resulting runs alternately into 3 output files. How would this
      affect the overall run time?

      a. Each merge pass would still process all n records, requiring 2n
         IO operations (one read + one write for each record).

      b. However, each pass would cut the number of runs by a factor of
         3. Thus, we would need ceil(log_3 n) passes instead of
         ceil(log_2 n) passes.
         Since (log_3 n) / (log_2 n) = log_3 2 = 0.63, this translates
         into about a 37% savings in the number of passes, and hence
         almost as big a savings in the number of IO operations. (Our
         proportional savings overall are less than 37% because we still
         have to do the initial distribution pass.)

   3. Of course, further savings are possible by increasing the number
      of files. With eight files, we merge 4 runs at a time, and thus
      cut the number of passes in half. In general, a BALANCED M-WAY
      MERGE SORT has the following behavior:

      a. Time: 2n (1 + ceil(log_m n)) IO operations.

         (Number of merge passes reduced from a 2-way merge by a factor
         of log_m 2.)

      b. Space: 2m files - of which two are of length n and the rest are
         of length n/m. If we use the output file as one of these, then
         the total scratch space needed is room for

         n + (2m - 2)(n/m) = 3n - 2n/m records

         plus room for 2m buffers in main memory.

         (Notice that, when m = 2, these reduce to our previous formulas
         for balanced 2-way merge sort, as we would expect.)

   4. We can use both internal sorting and increased merge order
      together to get even better performance. For example, an m-way
      merge sort of n records, assuming the ability to sort s records at
      a time internally, would require log_m (n/s) passes.

      Example: 65536 records, sorted 1024 at a time, using 4-way merge:

      Number of IOs = 65536 * (1 + log_4 (65536/1024)) * 2
                    = 65536 * (1 + 3) * 2
                    = 524,288 IO operations

      - almost twice as good as our previous example using 1024-record
      internal sorting and 2-way merge.

G. Polyphase merging

   1. We have seen that we can speed up an external merge sort by
      increasing the merge order. Obviously, if we carried this to its
      limit, we could sort a file of n records in just one pass by using
      n-way merging.

      a. The external storage required for scratch files would be room
         for 3n - (2n/n) = about 3n records, which seems manageable.

      b.
         However, the fly in the ointment would be the need for 2n
         buffers in main memory - each capable of holding at least one
         record. If we had internal memory sufficient to hold 2n
         records, we wouldn't be using external sorting in the first
         place!

      c. Thus, in general, internal memory considerations will limit the
         merge order to some value m (m >= 2), such that we have room
         for 2m buffers in main memory and no more.

   2. These computations have assumed that we use a balanced merge sort.
      Suppose, however, that we had room for 2m buffers, but used an
      unbalanced merge - i.e.

      a. We merge 2m - 1 runs at a time to produce runs in a single
         output file.

      b. We then redistribute those runs over 2m - 1 files, as in our
         original basic merge sort algorithm.

      c. The space needed is obviously the same as for balanced merging.
         What about the time?

         - We now need log_(2m-1) n passes, as compared to
           1 + log_m n.

         - But each pass requires 4n IO operations as opposed to 2n,
           because we merge and redistribute in separate steps. (The
           initial distribution and the fact that no redistribution is
           needed after the last pass cancel each other.)

         - Thus, total IOs = 4n log_(2m-1) n
           versus 2n (1 + log_m n)

         - Alas, the figure for the unbalanced version is always
           slightly bigger than the figure for the balanced version.
           This is because the log is not quite cut in half, while the
           number of IOs per pass is doubled due to the need to
           redistribute.

   3. We now consider a sorting algorithm known as POLYPHASE merging
      that uses 2m files to get a (2m-1)-way merge WITHOUT a separate
      redistribution of runs after every pass. It thus has the
      advantages of an unbalanced sort without the disadvantage.

   4. Rather than discuss the strategy in general, let's look at an
      example.
      We will do a 2-way merge using 3 files:

      Original file: B D E C F A G H

      Initial distribution:  B | E | F | G | H   (File SCRATCH1)
                             D | C | A           (File SCRATCH2)
                             (empty)             (File SCRATCH3)

      (Note that one file contains more runs than the other - 5 versus
      3 - and we have one empty file to receive the results of our first
      merge pass. This is all part of the plan. Also, for simplicity, we
      are ignoring runs existing in the raw data, treating it as 8 runs
      of length 1.)

      After first phase:     G | H               (File SCRATCH1)
                             (empty)             (File SCRATCH2)
                             BD | CE | AF        (File SCRATCH3)

      (Note that we didn't use up all the runs in the first scratch
      file. Rather, we stopped the phase when we ran out of runs in one
      of the scratch files - namely the second. That is, it is no longer
      the case that each pass processes all of the records. Thus, we use
      the term "phase" instead of "pass" to avoid confusion.)

      After second phase:    (empty)             (File SCRATCH1)
                             BDG | CEH           (File SCRATCH2)
                             AF                  (File SCRATCH3)

      After third phase:     ABDFG               (File SCRATCH1)
                             CEH                 (File SCRATCH2)
                             (empty)             (File SCRATCH3)

      After fourth phase:    (empty)             (File SCRATCH1)
                             (empty)             (File SCRATCH2)
                             ABCDEFGH            (File SCRATCH3)

      Note that we end up with four phases - as opposed to three passes
      in a standard 2-way merge sort. However, we don't process all the
      records on each phase. The total number of IOs is:

      2 * (8 [initial distribution] + 6 [phase 1] + 6 [phase 2]
           + 5 [phase 3] + 8 [phase 4]) = 2 * 33 = 66

      - compared with 2 * 8 * (1 + log 8) = 64 for a balanced 2-way
      merge, or 4 * 8 * (log 8) = 96 for an unbalanced 2-way merge.

      That is, we got nearly the performance of a balanced 2-way merge
      (which would require 4 files) - but we only used three files!

   5. Obviously, one key to the polyphase merge is the way runs are
      distributed. Let's look again at the numbers of runs on the two
      non-empty files:

      Original file:          8
      Initial distribution:   5 and 3
      After phase 1:          3 and 2
      After phase 2:          2 and 1
      After phase 3:          1 and 1
      Finally:                1

      a. What do all these numbers have in common (ASK CLASS)?
         - They are all Fibonacci numbers!

      b. To see why this works, note that, at any time, we have one file
         containing Fib(n) runs, and another containing Fib(n-1). We
         merge Fib(n-1) runs, yielding one file with Fib(n-1) runs and
         one with Fib(n) - Fib(n-1) = Fib(n-2). This continues until we
         end up with Fib(2) = 1 runs in one file and Fib(1) = 1 run in
         the other, yielding one file with one run, as desired.

      c. Now it may appear that this pattern poses a problem, since it
         appears only to work if the TOTAL number of runs initially is a
         Fibonacci number. However, we can make it work in other cases
         by assuming the existence of one or more dummy runs in either
         or both files.

         Example: suppose we are sorting a file with 6 runs - say the
         following (ignoring pre-existing order):

         Original file: B D E C F A

         We distribute initially as follows:

         B | E | C | A + dummy     (Note: we will look at the algorithm
         D | F | + dummy            that does this shortly)

         It turns out that our merging is slightly more efficient if we
         regard the dummy runs as being at the START of the file, rather
         than at the end.

         i. If we regard the dummies as being at the end, then our
            merging goes like this:

            Initial distribution:   B | E | C | A + dummy
                                    D | F | + dummy
                                                        12 read/writes

            After Phase 1 (merging 3 runs):
                                    A + dummy
                                    (empty)
                                    BD | EF | C         10 read/writes

            After Phase 2 (merging 2 runs):
                                    (empty)
                                    ABD | EF
                                    C                   10 read/writes

            After Phase 3 (merging 1 run):
                                    ABCD
                                    EF
                                    (empty)              8 read/writes

            After Phase 4 (merging 1 run):
                                    (empty)
                                    (empty)
                                    ABCDEF              12 read/writes

            Total read/writes = 12 + 10 + 10 + 8 + 12 = 52

         ii.
            However, if we regard the dummies as being at the beginning
            of the files, then we have:

            Initial distribution:   dummy + B | E | C | A
                                    dummy + D | F
                                                        12 read/writes

            After Phase 1 (merging 3 runs):
                                    C | A
                                    (empty)
                                    dummy + BD | EF      8 read/writes

            After Phase 2 (merging 2 runs):
                                    (empty)
                                    C | ABD
                                    EF                   8 read/writes

            After Phase 3 (merging 1 run):
                                    CEF
                                    ABD
                                    (empty)              6 read/writes

            After Phase 4 (merging 1 run):
                                    (empty)
                                    (empty)
                                    ABCDEF              12 read/writes

            Total read/writes = 12 + 8 + 8 + 6 + 12 = 46

   6. We have been considering a 2-way polyphase merge. If we have room
      for more scratch files, we can use a higher-order merge.

      a. In this case, we base the distribution on GENERALIZED FIBONACCI
         NUMBERS. The pth-order Fibonacci numbers F(p)_n are defined as
         follows:

         F(p)_n = 0                                for 0 <= n <= p-2
         F(p)_n = 1                                for n = p-1
         F(p)_n = F(p)_{n-1} + F(p)_{n-2} + ... + F(p)_{n-p}
                                                   for n >= p

         (I.e. after the basis cases, a given generalized Fibonacci
         number of order p is simply the sum of its p predecessors.)

         Example: the third-order Fibonacci numbers are the following
         sequence (where the first is counted as F(3)_0):

         0 0 1 1 2 4 7 13 24 44 ...

         Note that the ordinary Fibonacci numbers are, in fact,
         Fibonacci numbers of order 2 by this definition.

      b. At the start of any phase of a p-way polyphase merge, the p
         input files contain

         F(p)_n,  F(p)_n + F(p)_{n-1},  F(p)_n + F(p)_{n-1} + F(p)_{n-2},
         ...,  F(p)_n + ... + F(p)_{n-p+1}
         [1 term]  [2 terms]  [3 terms]  ...  [p terms]

         runs, respectively, for some n. During this phase, F(p)_n runs
         will be merged, leaving the following numbers of runs in each
         file:

         0,  F(p)_{n-1},  F(p)_{n-1} + F(p)_{n-2},
         ...,  F(p)_{n-1} + ... + F(p)_{n-p+1}
         [1 term]  [2 terms]  ...  [p-1 terms]

         In addition, the output file will now contain

         F(p)_n = F(p)_{n-1} + ... + F(p)_{n-p} runs
                  [p terms]

         But this has the same form as the initial distribution, with n
         replaced by n-1, and with the file that was the output file now
         occupying the last position in the list.

      c.
         Eventually, the merge will reduce down to the pattern

         F(p)_{p-1},  F(p)_{p-1} + F(p)_{p-2},  ...,
         F(p)_{p-1} + ... + F(p)_0  =  1, 1, ..., 1

         - and one final phase merges these into a single file
         containing one run, which is our sorted file.

      d. Of course, we will not, in general, have a perfect number of
         runs to produce a Fibonacci distribution, so we will have to
         put some number of dummy runs at the front of one or more
         files.

      e. Example - 3-way polyphase merge of our original 8 records. (We
         will compare to a balanced 2-way merge that uses the same
         number of scratch files, and required a total of 64 record
         transfers.)

         Original file: B D E C F A G H

         Initial distribution:    B | C | A | G     16 read/writes
                                  D | F | H
                                  dummy + E
                                  (empty)

         (Note: # of runs = 2, 3, 4
          = F(3)_4,  F(3)_4 + F(3)_3,  F(3)_4 + F(3)_3 + F(3)_2)

         Phase 1 - merge 2 runs:  A | G             10 read/writes
                                  H
                                  (empty)
                                  BD | CEF

         Phase 2 - merge 1 run:   G                  8 read/writes
                                  (empty)
                                  ABDH
                                  CEF

         Phase 3 - merge 1 run:   (empty)           16 read/writes
                                  ABCDEFGH
                                  (empty)
                                  (empty)

         Total IO operations = 16 + 10 + 8 + 16 = 50

      f. There is a fairly straightforward way to calculate the
         distribution by hand:

         i. If you are doing a p-way polyphase merge, you will use p+1
            files - so draw p+1 columns on scratch paper. We will
            develop the distributions by working BACKWARDS from the
            final state - so the first distribution we create will be
            that for the last phase of the merge, etc.

         ii. Each row will record the distribution as of the start of
             one phase. The final distribution will, of course, be:

             0 0 .... 0 1

             And the distribution just before it will be:

             1 1 .... 1 0

             In each succeeding phase (working backwards), the empty
             file moves left one slot - so at the start of the
             second-to-last phase, we have:

             x x .... 0 x   <- where the x's represent values to be
                               filled in

             Each x will be the sum of the value in this column for the
             previously calculated phase plus the value in the column
             that is now zero (1 in this case). Thus, our next
             distribution is

             2 2 .. 2 0 1

             And the next is

             2+2 2+2 .. 0 0+2 1+2  =  4 4 .. 0 2 3

         iii.
iii. Continue this process until the total number of runs at
                 the start of the phase is >= the number of runs in
                 the original input.  Add dummy runs if necessary.

            iv.  Example: 3-way polyphase merge sort of raw data
                 containing 17 runs initially (4 files used):

                    Final state           0   0   0   1     Total  1
                    Before last phase     1   1   1   0     Total  3
                    Before 2nd to last    2   2   0   1     Total  5
                    Before 3rd to last    4   0   2   3     Total  9
                    Before 4th to last    0   4   6   7     Total 17

      7. Algorithm:

         a. Assume a p-way polyphase merge, using p+1 files - here
            designated file[1] .. file[p+1].  The initial distribution
            will spread runs over file[1] .. file[p] - leaving
            file[p+1] empty.  (In fact, if one wished to economize on
            files, file[p+1] could serve as the input file - but of
            course the input file would then be destroyed.)

         b. The first merge phase will put its runs in file[p+1]; the
            second in file[p]; the third in file[p-1], etc.  After a
            phase puts its runs in file[1], the next will put its runs
            in file[p+1] and thus repeat the cycle.

         c. Variables used in both initial distribution and merging
            process:

               level:            current level number.  Level 1 =
                                 final merge, 0 = done.
               distribution[i]:  total number of runs in file i
                                 (1 <= i <= p+1)
               dummy[i]:         number of dummy runs in file i
                                 (1 <= i <= p+1)
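The backwards hand-calculation in f above can also be sketched in code.  The following Python fragment is my own cross-check (polyphase_distribution is a made-up name; a distribution is represented simply as a list of run counts, one per file):

```python
def polyphase_distribution(p, total_runs):
    """Work backwards from the state just before the final merge
    (1 1 ... 1 0) until the row total reaches total_runs, moving the
    empty column left one slot per step as in section f.  Returns the
    starting distribution for a p-way merge over p+1 files; if its sum
    exceeds total_runs, the difference is made up with dummy runs."""
    if total_runs <= 1:
        return [0] * p + [1]        # already sorted: one run, last file
    row = [1] * p + [0]             # distribution before the last phase
    zero = p                        # index of the currently empty column
    while sum(row) < total_runs:
        new_zero = (zero - 1) % (p + 1)
        add = row[new_zero]         # value the newly emptied column held
        row = [v + add for v in row]
        row[new_zero] = 0
        zero = new_zero
    return row
```

For the 17-run, 3-way example above this yields 0 4 6 7, matching the last line of the hand-calculated table; with, say, only 15 real runs the same distribution would be used with 2 dummy runs.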
d. Initial distribution phase:

               set level = 1,
                   distribution[i] = 1 for all i <= p,
                   dummy[i] = 1 for all i <= p,
                   distribution[p+1] = dummy[p+1] = 0
               set f = 1    <- we will put the next run into file f
               generate one run and put it into file f
                  (* Run generation may be by any method - treat each
                     record as a run, natural runs, internal sorting,
                     sorting with replacement selection *)
               dummy[f] := 0
               while the input file is not empty do
                  if dummy[f] < dummy[f+1] then
                     f := f + 1
                  else
                     begin
                        if dummy[f] = 0 then
                           begin
                              (* At this point, all values of dummy[]
                                 must be zero *)
                              level := level + 1;
                              compute new values of distribution[]
                                 and dummy[]
                           end;
                        f := 1
                     end;
                  generate one run and put it into file f
                  dummy[f] := dummy[f] - 1

            where we compute new values of distribution[] and dummy[]
            as follows:

               merges_this_level := distribution[1];
               for i := 1 to p do
                  begin
                     dummy[i] := merges_this_level
                                 + distribution[i+1] - distribution[i];
                     distribution[i] := distribution[i] + dummy[i]
                  end

            Example: Distribute 17 runs for a 3-way merge (p = 3)

            Level  Distribution[i]/Dummy[i]   f   # runs so far
                     1     2     3     4
              1    1/1   1/1   1/1   0/0     1       0  Initialization
                   1/0                               1  Generate 1st run
                         1/0                 2       2  First time thru loop
                               1/0           3       3  2nd time thru loop
                                                        Enter loop 3rd time -
                                                        compute new distribution;
                                                        merges_this_level = 1
              2    2/1   2/1   1/0           1
                   2/0                               4  Finish loop
                         2/0                 2       5  2nd time thru loop
                                                        Enter loop 3rd time -
                                                        compute new distribution;
                                                        merges_this_level = 2
              3    4/2   3/1   2/1           1
                   4/1                               6  Finish loop
                   4/0                       1       7  2nd time thru loop
                         3/0                 2       8  3rd time thru loop
                               2/0           3       9  4th time thru loop
                                                        Enter loop 5th time -
                                                        compute new distribution;
                                                        merges_this_level = 4
              4    7/3   6/3   4/2           1
                   7/2                              10  Finish loop
                         6/2                 2      11  2nd time thru loop
                   7/1                       1      12  3rd time thru loop
                         6/1                 2      13  4th time thru loop
                               4/1           3      14  5th time thru loop
                   7/0                       1      15  6th time thru loop
                         6/0                 2      16  7th time thru loop
                               4/0           3      17  8th time thru loop

            OUT OF RUNS - 17 GENERATED SO FAR

         e. Actually doing the merges:

               for i := 1 to level do
                  case i mod (p + 1) of
                     1: merge from files 1 .. p to p+1
                     2: merge from files p+1, 1 .. p-1 to p
                     3: merge from files p .. p+1, 1 .. p-2 to p-1
                     ...
0: merge from files 2 .. p+1 to 1

            where for each merge, we do the following:

               for j := 1 to distribution[last input file] do
                  begin
                     if dummy[] > 0 for all input files then
                        dummy[output_file] := dummy[output_file] + 1;
                     for each input file:
                        if dummy[this_file] > 0 then
                           begin
                              dummy[this_file] := dummy[this_file] - 1;
                              end_of_run[this_file] := true
                           end
                        else
                           end_of_run[this_file] := eof(file[this_file]);
                     while not all files at end of run do
                        transfer record with smallest key among those
                        not at end of run to output file
                  end

      8. Note that polyphase merging is NOT STABLE.

      9. Knuth discusses, in detail, the analysis of polyphase merge
         sorting, various improvements to it, plus other similar
         algorithms.  The following summarizes the time behavior as a
         function of the number of "tapes".

         TRANSPARENCY

II. The following summarizes the external sorts we have considered:
--  --- --------- ---------- --- -------- ----- -- ---- ----------

   Method            Scratch space  Records                 Internal       Stable
                     (records)      read/written            space/time

   Basic merge       n              4n ceil(log n)          3 buffers      yes

   Balanced 2-way    2n             2n (1+ceil(log n))      4 buffers      yes

   Balanced natural  2n             2n (1+ceil(log n/m))    4 buffers      no *
     - initial runs of
       average size m

   Balanced with     2n             2n (1+ceil(log n/s))    4 buffers +    yes
   internal sort of                                         space to sort
   s records                                                s records;
                                                            (n/s) log(s)
                                                            internal sort
                                                            time

   Balanced with     2n             2n (1+ceil(log n/2s))   [SAME]         no
   internal sort &                  = 2n (ceil(log n/s))
   replacement
   selection
     - random data

   Balanced m-way    3n - 2n/m      2n (1+ceil(log_m n))    2m buffers     yes

   Balanced m-way    3n - 2n/m      2n (1+ceil(log_m n/s))  2m buffers +   yes
   with internal                                            space to sort
   sort (no                                                 s records
   replacement
   selection)

   Polyphase m-way   [must analyze distribution             m+1 buffers    no
                      case-by-case]

   * = can be made stable by using separator records

III. Sorting with multiple keys
---- ------- ---- -------- ----

   A. Thus far, we have assumed that each record in the file to be
      sorted contains one key field.  What if the record contains
      multiple keys - e.g. a last name, first name, and middle initial?
1. We wish the records to be ordered first by the primary key
         (last name).

      2. In the case of duplicate primary keys, we wish ordering on
         the secondary key (first name).

      3. In the case of ties on both keys, we wish ordering on the
         tertiary key (middle initial).

         etc. - to any number of keys.

   B. The approaches we will discuss here apply to BOTH INTERNAL AND
      EXTERNAL SORTS.

   C. There are two techniques that can be used for cases like this:

      1. We can modify an existing algorithm to consider multiple keys
         when it does comparisons - e.g.

         a. Original algorithm says:

               if item[i].key < item[j].key then

         b. Revised algorithm says:

               if (item[i].primary_key < item[j].primary_key) or
                  ((item[i].primary_key = item[j].primary_key) and
                   (item[i].secondary_key < item[j].secondary_key)) or
                  ((item[i].primary_key = item[j].primary_key) and
                   (item[i].secondary_key = item[j].secondary_key) and
                   (item[i].tertiary_key < item[j].tertiary_key)) then

      2. We can sort the same file several times, USING A STABLE SORT.

         a. First sort is on the least significant key.
         b. Second sort is on the second least significant key.
         c. Etc.
         d. Final sort is on the primary key.

      3. The first approach is usable when we are embedding a sort in
         a specific application package; the second is more viable
         when we are building a utility sorting routine for general
         use [but note that we are now forced to use a stable
         algorithm.]
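Both techniques can be illustrated with an internal sort in Python, whose built-in sorted() is guaranteed stable.  This is my own sketch (the sample records and field positions are illustrative, not from the notes):

```python
# Records: (last name, first name, middle initial)
records = [
    ("Smith", "Jane", "Q"),
    ("Adams", "John", "B"),
    ("Smith", "Jane", "A"),
    ("Smith", "Alan", "Z"),
]

# Technique 1: one sort whose comparison considers all the keys.
# Python compares tuples field by field, which has the same effect as
# the nested primary/secondary/tertiary comparison in C.1.b.
by_composite = sorted(records, key=lambda r: (r[0], r[1], r[2]))

# Technique 2: several passes with a STABLE sort, least significant
# key first (C.2).  Each later pass preserves, among records that tie
# on its key, the order established by the earlier passes.
by_passes = sorted(records, key=lambda r: r[2])     # tertiary key first
by_passes = sorted(by_passes, key=lambda r: r[1])   # then secondary key
by_passes = sorted(by_passes, key=lambda r: r[0])   # primary key last

assert by_composite == by_passes   # the two techniques agree
```

Note that technique 2 would silently break with an unstable sort in place of sorted() - which is exactly the restriction mentioned in C.3.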