CS222 Lecture: Pipelining                                       3/23/99

Objectives
----------

1. To introduce the basic concept of CPU speedup through pipelining
2. To explain how data and branch hazards arise as a result of pipelining, and
   various means by which they can be resolved.
3. To introduce superpipelining and superscalar processors as means to get 
   further speedup, including techniques for dealing with more complex hazard
   conditions that can arise.

Materials: Transparency of Patterson and Hennessy figure 6.12

I. Introduction
-  ------------

   A. We saw at the start of the course that the total time for the execution
      of a given program is given by:

        Time = cycle time * # of instructions * CPI

             = # of instructions * CPI
               ----------------------
                  clock-rate
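
      For example (with illustrative numbers, not measurements of any
      real machine): a program of 50 million instructions with an
      average CPI of 2, run on a 100 MHz processor, takes

        Time = 50,000,000 * 2 / 100,000,000 = 1 second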

   B. This equation suggests three basic strategies for running a given program
      in less time:  (ASK CLASS TO MAKE SUGGESTIONS FOR EACH)

      1. Reduce the cycle time (increase the clock rate)

         a. Can be achieved by use of improved hardware design/manufacturing
            techniques.  In particular, reducing the FEATURE SIZE (chip area
            needed for one component) results in lower capacitance and
            inductance, which allows the chip to run at a higher frequency.

         b. Do less computation on each cycle (which increases CPI, of course!)

      2. Reduce the instruction count.

         a. Better algorithms.

         b. More powerful instruction sets - an impetus for the development of
            CISCs.  (This, however, leads to increased CPI!)

      3. Reduce CPI

         a. Simplify the instruction set - an impetus for the development of
            RISCs.  (This, however, leads to increased program length!).

         b. Do more work per clock.  (This, however, requires a longer
            clock cycle and leads to a lower clock rate!).

      4. Note that, in the case of clock rate and instruction count, there are
         speedup techniques that are clear "wins" - utilizing them does not
         adversely affect the other two components of the equation. It appears,
         though, that it is only possible to reduce CPI at the cost of more 
         instructions or a slower clock.

      5. While there is no way to reduce the total number of clocks needed for
         an individual instruction without adversely impacting some other
         component of performance, it is possible to reduce the AVERAGE CPI
         by doing portions of two or more instructions in parallel.  That is
         the topic we look at in the next few lectures.

   C. We now look at several speed-up techniques - of increasing sophistication
      - whereby the different phases of several successive instructions are done
      in parallel.  To some extent, what we will look at is applicable to any 
      architecture; but it becomes most fully implementable for RISCs.

      1. The timing of the CPU operations as we have discussed them thus far
         can be pictured as follows, using a Gantt chart (where S1, S2, S3 are
         a series of instructions being executed one after the other):

        Step 5                 -S1-                -S2-                -S3-

        Step 4             -S1-                -S2-                -S3-

        Step 3         -S1-                -S2-                -S3-

        Decode     -S1-                -S2-                -S3-

        Fetch  -S1-                -S2-                -S3-

        Time ---------->

      2. Notes:

         a. This chart is a bit different in style from the ones in your book.
            The horizontal axis is time, and the vertical axis is the various
            steps of instruction execution.  An instruction moves UPWARD and TO
            THE RIGHT as it passes through the various steps.

         b. To avoid tying the chart to any specific architecture, generic names
            are used for the instruction phases after the first two. 
            (Regardless of architecture, the first two steps in any instruction
            are fetching it and decoding it.)

         c. The chart makes two simplifying assumptions that may or may not be
            valid for a given architecture:

            i. All instructions have the same number of steps.

           ii. Each step takes the same amount of time.

            In reality, neither of these assumptions is typically valid for
            CISCs, but they do hold for RISCs (if we allow that, for
            instructions that need fewer steps, we consider the last few
            steps to be "do-nothings".)

      3. By building additional hardware, it is possible to speed up execution
         by doing portions of two or more instructions in parallel.

   D. One of the simplest speedup techniques is PRE-FETCHING of instructions.

        Step 5                 -S1-            -S2-            -S3-

        Step 4             -S1-            -S2-            -S3-

        Step 3         -S1-            -S2-             -S3-

        Decode     -S1-            -S2-            -S3-

        Fetch  -S1-            -S2-            -S3-

        Time ---------->

      1. Though each instruction still takes 5 cycles from start to finish,
         the average number of clocks per instruction - in the steady state -
         is four - i.e. one instruction completes every 4th cycle.  Since
         the average time per instruction drops from 5 cycles to 4, this
         represents a (5 - 4)/5 = 20% speedup.

      2. One complication can arise if the current instruction is a conditional
         branch.  In this case, one cannot know while the instruction is being
         executed whether to prefetch the next sequential instruction or the
         one at the branch address.  This is known as a BRANCH DEPENDENCY or
         BRANCH (or CONTROL) HAZARD.

         a. Prefetching can be suspended during execution of a branch 
            instruction until the outcome is known.

         b. Some machines always prefetch the next instruction, or always
            prefetch the branch target, or use some heuristic rule for 
            "guessing" which way the branch will turn out.  If the guess is 
            wrong, then another fetch of the correct instruction must occur 
            before further computation can be done.

         c. Other machines attempt to prefetch both ways and then discard the
            wrong instruction.

         d. We will look at some additional alternatives shortly.
        
      3. Prefetching may not seem possible on machines that have variable
         length instructions, since we don't know how long an instruction is
         until we have decoded it (and perhaps partially executed it).
         However, many variable length instruction machines still do a form of
         prefetching by using an instruction queue that simply holds as yet
         uninterpreted bytes from the instruction stream.

         a. Whenever execution calls for another byte from the instruction
            stream, one is taken from the queue if possible.

         b. Whenever the memory is idle, if there is room in the queue then
            one or more bytes are fetched ahead.

         c. If the outcome of a conditional branch sends the program down a 
            different path from the one the CPU has been prefetching on, then 
            the queue is flushed.
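
         In outline, the queue logic is quite simple.  The following C
         fragment is a minimal sketch of the idea (the byte-wide
         interface and all of the names are assumptions made for
         illustration):

            #define QSIZE 8                 /* queue capacity in bytes */

            static unsigned char queue[QSIZE];  /* undecoded bytes     */
            static int head = 0, count = 0;  /* oldest byte and depth  */
            static unsigned fetch_pc;        /* next address to fetch  */

            /* Stub standing in for the memory interface. */
            static unsigned char read_memory(unsigned addr) {
                return (unsigned char) addr;
            }

            /* Called whenever memory is idle: fetch ahead if there
               is room in the queue.                                */
            static void prefetch_if_idle(void) {
                if (count < QSIZE) {
                    queue[(head + count) % QSIZE] =
                        read_memory(fetch_pc++);
                    count++;
                }
            }

            /* Called when execution wants the next byte of the
               instruction stream: take it from the queue, fetching
               on demand if the queue is empty.                     */
            static unsigned char next_instruction_byte(void) {
                unsigned char b;
                if (count == 0)
                    prefetch_if_idle();
                b = queue[head];
                head = (head + 1) % QSIZE;
                count--;
                return b;
            }

            /* Called when a taken branch goes down a different path
               than the one being prefetched: discard queued bytes. */
            static void flush_queue(unsigned new_pc) {
                count = 0;
                fetch_pc = new_pc;
            }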

   E. Further parallelism between instructions can be achieved at the cost
      of increased complexity of hardware.  

      1. For example, suppose that we not only prefetched instructions, but also
         overlapped the decoding of each instruction with the execution of its
         predecessor.

        Step 5                 -S1-        -S2-        -S3-

        Step 4             -S1-        -S2-        -S3-

        Step 3         -S1-        -S2-         -S3-

        Decode     -S1-        -S2-        -S3-

        Fetch  -S1-        -S2-        -S3-

        Time ---------->

        [In this - idealized - case we have reduced average per instruction
         time by 40%]

      2. The ultimate degree of parallelism would be total overlap between all 
         phases of different instructions - e.g.

        Step 5                 -S1--S2--S3--S4--S5--S6--S7--S8-

        Step 4             -S1--S2--S3--S4--S5--S6--S7--S8-

        Step 3         -S1--S2--S3--S4--S5--S6--S7--S8-

        Decode     -S1--S2--S3--S4--S5--S6--S7--S8-

        Fetch  -S1--S2--S3--S4--S5--S6--S7--S8-

        Time ---------->

         [We have now reduced average per instruction time by 80% !]

         a. We call this a FULLY PIPELINED CPU.  In the steady state, it 
            completes one instruction on every cycle, so its average CPI is 1.

         b. Of course, an average CPI of 1 is attainable only when the pipeline
            is full of valid instructions.  When the pipeline has been
            flushed (e.g. after a branch), it may take several cycles for the
            pipeline to fill up again.  As a result, a fully pipelined machine
            ends up in practice having a CPI somewhat bigger than 1.
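
            For a rough, illustrative calculation (the numbers here are
            assumptions, not measurements): if 20% of instructions are
            taken branches, and each one wastes 3 cycles while the
            pipeline refills, the effective CPI becomes

                1 + (0.2 * 3) = 1.6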

      3. This latter degree of parallelism introduces some further 
         complications, however.

         a. Suppose instruction S2 uses some register as one of its
            operands, and suppose that the result of S1 is stored in this
            same register - e.g.

                S1:     lw $2, some-address
                S2:     addi $3, $2, 1

            (Clearly, the intention is for S2 to use the value stored by S1)
            If S2 needs this operand on its first execution step (Step 3) and 
            S1 doesn't store it until its last step (Step 5), then the value 
            that S2 gets will be the PREVIOUS VALUE in the register - not the
            one stored by S1, as intended by the programmer.

         b. This sort of situation is called a DATA HAZARD or DATA DEPENDENCY.
            We will discuss how it can be handled shortly.

   F. So far in our discussion, we have assumed that the time for the actual
      computation (EXEC) phase of an instruction is a single cycle.  This is 
      realistic for simple operations like AND, fixed point ADD etc.  However, 
      for some instructions multiple cycles are needed for the actual 
      computation.

      1. These include fixed-point multiply and divide and all floating point 
         operations.

      2. To deal with this issue, some pipelined CPU's simply exclude
         such instructions from their instruction set - relegating them to
         co-processors.

      3. If such long operations are common (as would be true in a machine
         dedicated to scientific computations), further parallelism might
         be considered in which the computation phases of two or more
         instructions overlap.  We will not discuss this now, but will come
         back to it when we get to vector processors toward the end of the
         course.

   G. The book discusses MIPS pipelining in detail.  Before considering this,
      we will look at a simpler pipeline structure, based on that of one of
      the earliest RISC's: Berkeley RISC.

II. A Three-Stage Pipeline
--  - ----------- --------

   A. Before discussing the MIPS pipeline, which has five stages, we will
      first look at a simpler, three-stage pipeline.

      1. Stage one is instruction fetch, which both fetches an instruction
         from the memory address pointed to by the PC and updates the PC.

      2. Stage two is an ALU operation

         a. Processing at this stage involves

            i. Reading two source values out of appropriate register(s)
               and/or a field in the instruction.

           ii. Performing some basic operation (e.g. add, sub, etc.)

          iii. Storing the result back into a register if appropriate.

         (Note: MIPS does each of these in a separate stage.  There is no
          reason why they cannot all be done in one stage if we are willing
          to accept a longer clock cycle - which is, of course, the motivation 
          for breaking them into three stages on MIPS!)

         b. The precise function performed depends on the instruction type

            i. For R-Type instructions, this step does the actual computation.

           ii. For memory reference instructions, this step calculates the 
               address.

          iii. For branch type instructions, this step calculates the target
               address and (if appropriate) updates the PC with the new value.

      3. Stage three is a memory operation (read or write) - if needed.  The
         value is transferred between memory and a register in a single
         cycle (not two as on MIPS).  This stage is a "do-nothing" stage for
         R-Type and branch type instructions.

      4. To allow these three stages to be pipelined, the CPU is organized as
         follows:

             IF Unit                    ALU Unit                  Load/store
        - +4 -                 I ----------------------------> I     Unit
       |     |                 n                               n
      PC -----> Instruction -> s  Register ----> Arithmetic    s  Data
       ^        Memory         t  File     ----> Logic ------  t  Memory
       |                            ^            Unit       |       ^
       |                       R    |                       |  R    |
       |                       e    |                       |  e    |       
       |                       g    |                       |  g    |
       -----------------------------------------------------<--------

         a. There are three functional units - an instruction fetch unit, an 
            ALU, and a load/store unit - corresponding to the three stages of 
            instruction execution.  To facilitate this, there are two access
            paths to memory - one used by the instruction fetch stage
            (instruction memory) and one used by the load/store stage (data
            memory.)

         b. As an instruction is executed, it is passed from stage to stage -
            like stations on an assembly line.   It spends one basic clock
            cycle at each stage.  There are two instruction registers - one
            between the instruction fetch stage and the ALU stage, and one
            between the ALU stage and the load/store stage.

            i. At the end of each cycle, the instruction fetch unit puts a new
               instruction into the IR between the first two stages.

           ii. At the end of each cycle, the instruction that is in the first
               IR is copied into the second one.

         c. Three instructions are in the pipeline at any time - one at each
            stage.  Thus, although it takes up to 3 clock cycles to execute any
            one instruction, one instruction is completed on each clock and
            so the effective time for an instruction is only one cycle - a
            threefold speedup.

         d. Actually, this pipeline - as we have described it - contains one
            problematic feature.  R-Type instructions write their result to
            the register file in stage 2, while load-type instructions write
            their result to the register file in stage 3.  Thus, if a load
            is immediately followed by an R-Type instruction, two writes to
            the register file would occur at the same time (though hopefully
            not to the same register!).  This can be handled - but in fact,
            most actual pipelines avoid the problem altogether by having
            more stages, as we shall see.

   B. Because the pipeline is so regular in its operation, the compiler can
      use knowledge about the pipeline to optimize the code it generates.

      1. One problem faced by pipelined CPU's is data dependencies.

         a. In this case, if one instruction loads a memory value into a 
            register and the very next uses that same register as an 
            input, then we have a problem since the load/store phase of the 
            first (which actually transfers the data from memory) overlaps 
            the ALU phase of the second (which uses it.)

         b. One approach to handle such a situation is a pipeline stall, or
            "bubble".

            i. The hardware can detect the situation where the second IR
               contains a load instruction whose destination register is the
               same as one of the source operands of the instruction in the
               first IR.  (This is a simple comparison between IR field
               contents that is easily implemented with just a few gates -
               see the sketch below.)

           ii. In such cases, the hardware can replace the instruction in the
               first IR with a NOOP and force the IF stage to refetch the same
               instruction instead of going on to the next.

          iii. Of course, this means wasting a clock cycle, since the NOOP
               does no useful work.
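
            The comparison described in (i) amounts to a few field
            extractions and equality tests.  A minimal sketch in C (the
            field layout and opcode value are assumptions for
            illustration, not those of any particular machine):

               #include <stdint.h>

               /* Assumed instruction layout: opcode in bits 31..26,
                  destination register in 25..21, and the two source
                  registers in 20..16 and 15..11.                    */
               #define OPCODE(ir) (((ir) >> 26) & 0x3f)
               #define RD(ir)     (((ir) >> 21) & 0x1f)
               #define RS1(ir)    (((ir) >> 16) & 0x1f)
               #define RS2(ir)    (((ir) >> 11) & 0x1f)

               #define OP_LOAD 0x23    /* assumed load opcode value */

               /* Returns 1 if the instruction in the first IR (ir1)
                  reads the register that the load in the second IR
                  (ir2) has yet to write - i.e. if the hardware must
                  insert a one-cycle bubble.                         */
               int load_use_hazard(uint32_t ir1, uint32_t ir2) {
                   return OPCODE(ir2) == OP_LOAD &&
                          (RS1(ir1) == RD(ir2) ||
                           RS2(ir1) == RD(ir2));
               }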

         c. A better approach is to require the compiler to anticipate such a 
            problem by never emitting an instruction that uses the result of a 
            load immediately after that load.  (The compiler can either put some
            other, unrelated instruction after the load, or it can emit a NOOP 
            if all else fails.)

               Example: suppose a programmer writes:

                        d := a + b + c + 1

               This could be translated to the following.

                        load    r10, a          ; r10 <- a
                        load    r11, b          ; r11 <- b
                        noop    -- inserted to prevent data conflict
                        add     r11, r10, r11   ; r11 <- r10 + r11
                        load    r12, c          ; r12 <- c
                        noop    -- inserted to prevent data conflict
                        add     r12, r11, r12   ; r12 <- r11 + r12 
                        add     r12, r12, #1    ; r12 <- r12 + 1
                        store   r12, d          ; d <- r12

              However, the NOOPs can be eliminated by rearranging the
              operations.  

                        load    r10, a          ; r10 <- a
                        load    r11, b          ; r11 <- b
                        load    r12, c          ; r12 <- c
                        add     r11, r10, r11   ; r11 <- r10 + r11
                        add     r12, r11, r12   ; r12 <- r11 + r12 
                        add     r12, r12, #1    ; r12 <- r12 + 1
                        store   r12, d          ; d <- r12

         d. This latter approach - requiring that code not use a register
            immediately after it has been loaded - is called DELAYED LOAD.

      2. Another source of potential problems is branch dependencies.

         a. Branch instructions are executed in the second stage of the
            pipeline, and the branch target address (and decision whether or
            not the branch is to be taken) are not available until the end of
            the cycle.  Thus, while the branch is being executed, the next
            sequential instruction is being fetched by the instruction fetch
            unit.

         b. In this case, attempting to somehow predict the outcome of the
            branch does not help, because the TARGET ADDRESS of the branch
            becomes available at the same time the outcome becomes known - if
            we predict the branch to be taken in advance of knowing its outcome,
            we cannot do anything with the prediction because we don't yet know
            where the branch will go.

         c. Again, one way to handle this situation is with a "bubble" - if
            the branch is taken, then the instruction behind it is converted
            to a NOOP.

         d. An alternate approach (which was used on Berkeley RISC and is used
            by a number of other RISCS) is called DELAYED BRANCH.

            i. All control transfer instructions (subroutine calls and returns 
               as well as jumps) take effect AFTER the next instruction in 
               sequence is executed.

               Example:

               Suppose we were compiling

                        if something then
                            a := a + 1;
                        else
                            ...

               Suppose further that a is a local variable allocated to
               reside in r16.

               Then the code for the "then" part would consist of an add 1 to
               r16 plus a branch to skip over the "else" part.  This could be
               done this way:

                         add     r16, r16, #1            r16 <- r16 + 1
                        jmp     end_if
                        noop
               else_part:
                        ....
               end_if:

               The noop is needed because the instruction after the jmp is
               always executed - it's in the pipeline before the branch is
               actually done.

               However, a good compiler would emit the code this way:

                        jmp     end_if
                        add     r16, r16, #1            r16 <- r16 + 1
               else_part:
                        ....
               end_if:

           ii. As the above example illustrates, the compiler can normally work
               with this feature of the hardware by inserting the JUMP
               instruction ahead of the last instruction to be done in the
               current block of code.  In some cases, though, a NOOP must be
               inserted (e.g. if the jump is a conditional that depends on the
               last operation to be done before the jump is taken.)

III. MIPS Pipelining
---  ---- ----------

   A. As discussed in the book, MIPS instructions are implemented in up to
      five steps.  In a pipelined implementation, EVERY instruction has
      all five steps (though some may not actually do any useful work), and
      the pipeline has 5 stages:

      TRANSPARENCY - FIGURE 6.12

      1. IF - both of the following (in parallel)

         a. instruction fetch

         b. program counter increment

      2. ID - all of the following (in parallel)

         a. instruction decode

         b. register file read into ALU input holding registers A and B

         c. branch target address calculation

         (not all of these results may in fact be used.)

      3. EXEC - one of the following

         a. ALU operation for an R-Type instruction

         b. address calculation for a memory reference instruction

         In either case, the inputs come from input holding registers A and B
         and/or a field in the instruction, and the output goes to the
         ALU Out holding register

      4. MEM - one of the following

         a. Memory operation - read or write.

         b. On non memory-reference instructions, this step does nothing

      5. WB - one of the following

         a. Write ALU Out register (computed in step 3) back to appropriate
            destination in the register file (R-Type instruction)

         b. Write value read from memory (in step 4) into the appropriate
            destination in the register file (load instruction)

         c. On all other instructions, do nothing

      6. Note the presence of pipeline registers between each pair of stages.
         Since there are five instructions in the pipeline at any time, there
         is a need to keep copies of four instructions (or at least portions
         of them) in registers at any time.  (There are only four instructions 
         in registers, because one is coming out of memory during stage 1).
         In addition, we need to keep certain data in these registers.

         a. IF/ID holds an instruction, plus the incremented PC value of where
            it came from.

         b. ID/EXEC holds the op-code (possibly in some decoded form) plus the
            destination register specifier, immediate value, and funct fields of
            the instruction, plus the A and B source values read out of the 
            register file, plus the PC passed on from the IF/ID register.

         c. EXEC/MEM holds the op-code and destination register specifier 
            fields of the instruction (copied from ID/EXEC), plus the ALUOut 
            value, plus the data that is to be written to memory if the 
            instruction is a store (contents of register specified by rt).

         d. MEM/WB holds the op-code and destination register specifier
            fields of the instruction and the ALUOut value (copied from 
            ID/EXEC) plus the value just read from memory if the instruction
            is a load.
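
          In C terms, the contents just described might be pictured as
          the following structs (a sketch; names follow the book's
          terminology, and the field widths are illustrative):

             #include <stdint.h>

             typedef struct {            /* IF/ID */
                 uint32_t instruction;   /* fetched instruction word */
                 uint32_t pc;            /* incremented PC value     */
             } IF_ID;

             typedef struct {            /* ID/EXEC */
                 uint8_t  opcode;        /* possibly decoded         */
                 uint8_t  dest_reg;      /* destination specifier    */
                 uint8_t  funct;         /* funct field              */
                 int32_t  immediate;     /* immediate value          */
                 uint32_t a, b;          /* source values read       */
                 uint32_t pc;            /* passed on from IF/ID     */
             } ID_EXEC;

             typedef struct {            /* EXEC/MEM */
                 uint8_t  opcode;        /* copied from ID/EXEC      */
                 uint8_t  dest_reg;
                 uint32_t alu_out;       /* ALU result or address    */
                 uint32_t store_data;    /* value to write (from rt) */
             } EXEC_MEM;

             typedef struct {            /* MEM/WB */
                 uint8_t  opcode;        /* copied from EXEC/MEM     */
                 uint8_t  dest_reg;
                 uint32_t alu_out;       /* result for an R-Type     */
                 uint32_t mem_data;      /* value read (for a load)  */
             } MEM_WB;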

   B. The motivation for going to a five-stage pipeline on MIPS appears to be
      the following:

      1. Doing the register file read, ALU operation, and write back of the
         ALU result in one step (as is the case for the three-stage pipeline)
         would require a longer clock cycle for this stage, and thus for all
         stages in the pipeline.

      2. Likewise, doing a memory read and writing the item read back to a
         register in one step (as is also the case for the three-stage
         pipeline) would pose a similar problem.

      3. Since the load instruction ends up requiring 5 steps (making it the
         longest instruction), a 5-stage pipeline is called for.

   C. Actually, the way the book describes the MIPS pipeline - and the way we
      have described it here - is a bit oversimplified.  The actual pipeline
      on most MIPS implementations has five stages, but uses only four clocks
      because two of the stages are just half a clock long.  (Recall that the
      clock is a square wave, and a complete cycle includes both rising and
      falling edges).  Here is the actual structure:
                                         
                                              ----------------------------------
WB (1/2 cycle)                                | S1 |/////| S2 |/////| S3 | ... 
                                   ---------------------------------------------
MEM                                |    S1    |    S2    |    S3    |    S4    |
                        --------------------------------------------------------
EXEC                    |    S1    |    S2    |    S3    |    S4    | ...
                   --------------------------------------------------
ID (1/2 cycle)     | S1 |/////| S2 |/////| S3 |/////| S4 | ...
        --------------------------------------------------
IF      |    S1    |    S2    |    S3    |    S4    | ...
        ---------------------------------------------

        ///// = idle on this half-cycle

      (We will stick with the simplified version used in the book for most of
       our discussion, since the basic issues are not affected.)
        
   D. The use of a five-stage pipeline allows for a shorter clock cycle, but
      the downside is that it complicates dealing with hazards.

      1. At first glance, it would appear that the branch hazard problem
         would be exacerbated, because instructions are normally executed in
         stage 3 (meaning that there would now be 2 instructions behind a
         branch in the pipeline by the time it is executed.)  However, the
         MIPS hardware is arranged so that branch instructions are executed
         in stage 2 of the pipeline, and MIPS deals with the single
         instruction behind it in the pipeline by using delayed branching.

         (Note: It would appear that use of delayed branching would have posed
          a problem when you were writing MIPS assembly language programs in
          lab, since you were unaware of this at the time.  However, the MIPS
          assembler automatically inserts a NOOP after a branch, so this was
          not an issue.)

      2. On the other hand, the data hazard issue is made much worse.
         In the 3-stage pipeline, a data hazard could arise only from using
         the value read from memory by a load instruction too soon.  In
         the 5-stage pipeline, data hazards can also arise from dependent
         sequences of computational instructions.

         a. Example: Consider the following program fragment:

                S1:     add     $2, $4, $5
                S2:     add     $3, $6, $7
                S3:     add     $3, $2, $3
                S4:     add     $2, $2, $8

            (where it is the intention that S3 use the values in $2 and $3
             computed by S1 and S2, and S4 uses the value in $2 computed by S1)

            i. This program would work correctly on the 3-stage pipeline.

           ii. But consider what happens with a 5-stage pipeline

        WB                                      S1:     S2:     ...     ...
                                                $2 <-   $3 <-
                                                $4+$5   $6+$7

        MEM                             S1:     S2:     ...     ...
                                        (pass   (pass
                                        ALUOut  ALUOut
                                        thru)   thru)

        EXEC                    S1:     S2:     S3:     S4:
                                ALUOut  ALUOut  ALUOut  ALUOut
                                <- A+B  <- A+B  <- A+B  <- A+B
                                ($4+$5) ($6+$7) ($2+$3) ($2+$8)

        ID              S1:     S2:     S3:     S4:
                        A<-$4   A<-$6   A<-$2   A<-$2
                        B<-$5   B<-$7   B<-$3   B<-$8
                                        (BOTH   (A
                                         VALUES  VALUE
                                         WRONG!) WRONG!)
        
        IF      S1      S2      S3      S4

            How many bubbles, NOOPs, or other instructions would need to be
            inserted between S2 and S3 to make S3 and S4 get the right values?

            ASK

            - One would take care of getting the right $2 for S4, but
              would not help S3 at all.

            - Two would take care of getting the right $2 for S3 as well, 
              but $3 would still be wrong

            - Three would make everything work correctly:

        WB                                      S1:     S2:     ...     ...
                                                $2 <-   $3 <-
                                                $4+$5   $6+$7

        MEM                             S1:     S2:     ...     ...     ...
                                        (pass   (pass
                                        ALUOut  ALUOut
                                        thru)   thru)

        EXEC                    S1:     S2:     ...     ...     ...     ...
                                ALUOut  ALUOut
                                <- A+B  <- A+B
                                ($4+$5) ($6+$7)

        ID              S1:     S2:     extra1  extra2  extra3  S3:     S4:
                        A<-$4   A<-$6                           A<-$2   A<-$2
                        B<-$5   B<-$7                           B<-$3   B<-$8
        
        IF      S1      S2      extra1  extra2  extra3  S3      S4

         b. Requiring three instructions between the time a value is computed
            and the time it is used would have a very severe negative impact
            on performance, so some other solution is desirable.  MIPS uses
            two.

         c. A one instruction reduction in the size of the problem is
            achieved automatically by the fact that ID and WB are half cycles.
            (Refer to the more accurate timing diagram above.)

         d. To squeeze the remaining two delays out, observe that the values
            needed by S3 EXIST at the time S3 needs them - they're just not
            in the right places.

            i. The value which will go into $2 is sitting in the ALUOut 
               portion of the MEM/WB pipeline register when S3 needs to use it 
               during its EXEC step.

           ii. The value which will go into $3 is sitting in the ALUOut
               portion of the EXEC/MEM pipeline register when S3 needs to use
               it during its EXEC step.

          iii. The two stage delay that would otherwise be needed between
               computing a result and using it could be avoided if the ALU
               input selection logic were modified to take a value from any
               of the following:

               - The register A or B portion of the ID/EXEC pipeline register
                 (as the case may be)

           or  - The ALUOut register portion of the EXEC/MEM pipeline register

           or  - The ALUOut register portion of the MEM/WB pipeline register

               (This must be handled separately for each of the two ALU
                inputs).

        WB                                      S1:     S2:     ...     ...
                                                $2 <-   $3 <-
                                                $4+$5   $6+$7

        MEM                             S1:     S2:     ...     ...
                                        (pass   (pass
                                        ALUOut  ALUOut
                                        thru)   thru)

        EXEC                    S1:     S2:     S3:     S4:
                                ALUOut  ALUOut  ALUOut  ALUOut
                                <- A+B  <- A+B  <- M/W  <- A+B
                                                + E/M   
                                                ALUOuts
                                ($4+$5) ($6+$7) ($2+$3) ($2+$8)

        ID              S1:     S2:     S3:     S4:
                        A<-$4   A<-$6   A<-$2   A<-$2
                        B<-$5   B<-$7   B<-$3   B<-$8
                                        (BOTH   
                                         VALUES 
                                         WRONG!)
        
        IF      S1      S2      S3      S4


           iv. This strategy is called DATA FORWARDING.
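
               The selection logic for one ALU input might be sketched in
               C as follows (all names are illustrative assumptions; real
               hardware implements this as a multiplexor, not sequential
               code):

                  #include <stdint.h>

                  /* src_reg: register number this input was read from
                     during ID.  id_ex_val: the (possibly stale) value
                     in the ID/EXEC register.  The *_dest, *_writes and
                     *_alu_out arguments describe the two instructions
                     farther down the pipe.                            */
                  uint32_t alu_input(uint8_t src_reg,
                                     uint32_t id_ex_val,
                                     uint8_t ex_mem_dest,
                                     int ex_mem_writes,
                                     uint32_t ex_mem_alu_out,
                                     uint8_t mem_wb_dest,
                                     int mem_wb_writes,
                                     uint32_t mem_wb_alu_out) {
                      if (src_reg != 0) { /* $0 is always 0 on MIPS */
                          /* Newest result wins: EXEC/MEM first.    */
                          if (ex_mem_writes && ex_mem_dest == src_reg)
                              return ex_mem_alu_out;
                          if (mem_wb_writes && mem_wb_dest == src_reg)
                              return mem_wb_alu_out;
                      }
                      return id_ex_val; /* no hazard: use ID value */
                  }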

      3. We have examined the impact of the five-stage pipeline on R-Type
         instructions, and have seen that data forwarding can prevent
         hazards.  What about load-type instructions (where even the
         three-stage pipeline had a hazard)?

         a. Both R-Type instructions and load instructions write their value
            to a register in the last stage of the pipeline, so the basic
            issue is the same.

         b. However, a key difference is that, with an R-Type instruction,
            the value is actually available at the end of the EXEC stage
            (stage 3) and can be forwarded from there, whereas with a load
            instruction the value does not become available until the end
            of the MEM stage (stage 4).

         c. As a result, by use of data forwarding, we can reduce the delay
            to the same as that experienced in the 3-stage pipeline - i.e.
            we must have one instruction intervening between a load
            instruction and any other instruction that uses its result.  On
            MIPS, this is handled by using DELAYED LOAD, as discussed
            previously.  (Once again, this is hidden from the programmer by
            the compiler or assembler.)
       
IV. Moving Beyond Basic Pipelining
--  ------ ------ ----- ----------

   A. The potential speedup from pipelining is a function of the number of
      stages in the pipeline.

      1. For example, suppose that an instruction that would take 40 ns to
         execute is implemented using a 4 stage pipeline with each stage
         taking 10 ns.  Then the speedup gained by pipelining is

         w/o pipeline  - 1 instruction / 40 ns
         with pipeline - 1 instruction / 10 ns
 
         40ns/10ns = 4:1

         Now if, instead, we could implement the same instruction using a
         5 stage pipeline with each stage taking 8ns, we could get a 5:1
         speedup instead.

      2. This leads to a desire to make the pipeline consist of as many
         stages as possible, each as short as possible.  This strategy is
         known as SUPERPIPELINING, and is used by a number of CPU's.
         However, superpipelining does run into some problems that prevent
         the theoretical maximum speedup from being achieved.

         a. If the individual steps into which an instruction is broken
            don't all take the same time (a likely situation), then the time
            for the longest step becomes the time for all steps.

            Example: if in going from 4 stages to 5 in our example above it
                     turned out that one of the steps needed only 6 ns but
                     another needed 10 (still 40 ns total), the cycle time
                     would have to be 10ns, and the speedup would still be
                     only 4:1

         b. The longer the pipeline, the greater the potential waste of
            time due to data and branch hazards.

      3. Note that superpipelining attempts to maintain CPI at 1 (or as close
         as possible) while using a longer pipeline to allow the use of a
         faster clock.

   B. It would appear, at first, that a CPI of 1 is as good as we can get -
      so there is nothing further that can be done beyond full pipelining to
      reduce CPI.  Actually, though, we can get CPI less than one if we
      execute two or more instructions fully in parallel (i.e. fetch them at
      the same time, do each of their steps at the same time, etc) by
      duplicating major portions of the instruction execution hardware.
      
      1. If we can start 2 instructions at the same time and finish them at
         the same time, we complete 2 instructions per clock, so average CPI
         drops to 0.5.  If we can do 4 at a time, average CPI drops to 0.25.

      2. Because various hazards make it impossible to always achieve the
         maximum degree of parallelism, we speak of a machine as ISSUING
         some number of instructions on a given clock.  (An instruction is
         issued when it is actually allowed to begin execution.)
         Potentially, for a given machine, this might be 2 or 4; however, the 
         actual number issued on a given clock may be less.

      3. This is facilitated by taking advantage of the fact that many
         CPU's have separate execution units for executing different types
         of instructions - e.g. there may be:

         a. An integer execution unit used for executing integer instructions
            like add, bitwise or, shift, etc.

         b. A floating point execution unit for executing floating point 
            arithmetic instructions.  (Note that many architectures use
            separate integer and floating point register sets).
        
         c. A branch execution unit used for executing branch instructions.

         If two instructions need two different execution units (e.g. if one
         is an integer instruction and one is floating point) then they can
         be issued simultaneously and execute totally in parallel with each
         other, without needing to replicate execution hardware (though
         decode and issue hardware does need to be replicated.)

         Note that, for example, many scientific programs contain a mixture of 
         floating point operations (that do the bulk of the actual computation),
         integer operations (used for subscripting arrays of floating point 
         values and for loop control), and branch instructions (for loops).
         For such programs, issuing multiple instructions at the same time
         becomes very feasible.

         Note: in effect, a group of instructions that do a computation -
               say, on an array element - and then update a pointer to the
               next element becomes like a single CISC instruction with
               autoincrement!
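
         For example, the inner loop of a routine that sums an array of
         floating point values contains all three kinds of instructions
         (a C sketch):

            /* Each trip through this loop involves a floating point
               add (the real work), integer operations (indexing and
               the compare), and a conditional branch (loop control) -
               one candidate instruction for each kind of execution
               unit on every iteration.                               */
            double sum(const double *a, int n) {
                double total = 0.0;
                int i;
                for (i = 0; i < n; i++)
                    total += a[i];
                return total;
            }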

      4. The earliest scheme used for doing this was the VERY LONG INSTRUCTION
         WORD architecture.  In this architecture, a single instruction could
         specify more than one operation to be performed - in fact, it could
         specify one operation for each execution unit on the machine.

         a. The instruction contains one group of fields for each type of
            instruction - e.g. one to specify an integer operation, one to
            specify a floating point operation, etc.

         b. If it is not possible to find operations that can be done at the
            same time for all functional units, then the instruction may
            contain a NOOP in the group of fields for unneeded units.
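
             In C terms, one might picture a single VLIW instruction as
             something like the following struct (purely illustrative;
             the number of fields and their widths vary by machine):

                #include <stdint.h>

                /* One very long instruction word, carrying one
                   operation (or a NOOP) per execution unit.      */
                typedef struct {
                    uint32_t integer_op;  /* integer unit         */
                    uint32_t fp_op;       /* floating point unit  */
                    uint32_t branch_op;   /* branch unit          */
                } VLIWWord;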

         c. The VLIW architecture requires the compiler to be very knowledgeable
            of implementation details of the target computer, and may require a
            program to be recompiled if moved to a different implementation of 
            the same architecture.  

          d. Because most instruction words contain some NOOP's, VLIW
             programs tend to be very long.

       5. Current practice - found on a number of RISCs including the DEC
          Alpha and the PowerPC - is to use SUPERSCALAR architecture.

         a. A superscalar CPU fetches groups of instructions at a time -
            typically two (64 bits) or four (128 bits) and decodes them in
            parallel.

         b. A superscalar CPU has just one instruction fetch unit (which
            fetches a whole group of instructions), but it has 2 or 4
            decode units and a number of different execution units.

         c. If the instructions fetched together need different execution units,
            then they are issued at the same time.   If two instructions need 
            the same execution unit, then only the first is issued; the second 
            is issued on the next clock.  (This is called a STRUCTURAL HAZARD).

         d. To reduce the number of structural hazards that occur, some
            superscalar CPU's have two or more integer execution units,
            along with a branch unit and a floating point unit, since integer
            operations are more frequent.  Or, they might have a unit that
            handles integer multiply and divide and one that does add and
            subtract.

      6. Once again, the issue of data and branch hazards becomes more
         complicated when multiple instructions are issued at once: the
         result of an instruction cannot be used by any instruction issued
         at the same time, nor by instructions issued on the next one or
         more clocks.  With multiple instructions issued per clock, this
         increases the potential for interaction between instructions, of
         course.

         a. Example: If a CPU issues 4 instructions per clock, then up to
            seven instructions following a branch might be in the pipeline
            by the time the branch instruction finishes computing its target
            address.  (If the branch is the first of a group of 4, that is
            the remaining 3 of its own group plus a second group of 4.)

         b. Example: If a CPU issues 4 instructions per clock, then there may
            need to be a delay of up to seven instructions before one can
            use the result of a load instruction, even with data forwarding
            as described above.

   C. Dealing with Hazards on a Superscalar Machine

      1. Data hazards

          a. We have previously seen how data forwarding can be used to
             eliminate data hazards between successive instructions where one
             instruction uses a result computed by an immediately-preceding
             one.  However, if "producer" and "consumer" instructions are
             executed simultaneously in different execution units, forwarding
             no longer helps.  Likewise, the unavoidable one cycle delay
             needed by a load could affect many successive instructions.

         b. Superscalar machines typically incorporate hardware interlocks
            to prevent data hazards from leading to wrong results.  When an
            instruction that will store a value into a particular register is
            issued, a lock bit is set for that register that is not cleared
            until the value is actually stored - typically several cycles
            later.  An instruction that uses a locked register as a data input
            is not issued until the register(s) it needs is/are unlocked.
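
             A minimal sketch of such an interlock in C (all names are
             illustrative; real hardware does this with a small amount
             of combinational logic, not code):

                #include <stdint.h>

                /* Bit r is set while register r has a write pending. */
                static uint32_t lock_bits;

                /* An instruction may issue only if neither of its
                   source registers is locked.                        */
                int may_issue(uint8_t src1, uint8_t src2) {
                    return (lock_bits &
                            ((1u << src1) | (1u << src2))) == 0;
                }

                /* Set the lock when an instruction that will write
                   register dest is issued ...                        */
                void lock_on_issue(uint8_t dest) {
                    lock_bits |= (1u << dest);
                }

                /* ... and clear it when the value is finally stored. */
                void unlock_on_writeback(uint8_t dest) {
                    lock_bits &= ~(1u << dest);
                }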

         c. Further refinements on this include a provision that allows the
            hardware to schedule instructions dynamically, so that a "later"
            instruction that does not depend on a currently executing
            instruction might be issued after an "earlier" instruction that
            does.  (This is called OUT OF ORDER EXECUTION.)

      2. Branch hazards

         a. To reduce the impact of branch hazards, some machines (CISCs as
            well as highly-pipelined machines) make use of BRANCH PREDICTION.  
            When a conditional branch instruction is encountered, the fetch 
            unit makes a guess as to whether or not the branch is going to be 
            taken.  If the prediction is that the branch will be taken (or the 
            branch is unconditional), then the fetch unit begins to fetch 
            instructions from the  new location; otherwise, it keeps fetching 
            as usual.  Only if the prediction is wrong does a pipeline flush 
            occur.

            i. How can such a prediction be done?

              - One way to do the prediction is to use the following rule of
                thumb: assume that forward conditional branches will not be 
                taken, and backward conditional branches will be taken.

                Why?  ASK

                - Forward branches typically arise from a construct like

                  if something then
                        common case
                  else
                        less common case

                - Backward branches typically result from loops - and only the
                  last time the branch is encountered will it not be taken.

              - Some machines incorporate bits into the format for branch 
                instructions whereby the compiler can furnish a hint as to
                whether the branch will be taken.

              - Some machines maintain a branch history table which stores
                the address from which a given branch instruction was fetched
                and an indication as to whether it was taken the last time
                it was executed.

           ii. In any case, prediction requires the ability to reach a 
               definitive decision about whether the branch is going to be 
               taken before any following instructions have stored any values 
               into memory or registers.

          iii. Prediction cannot eliminate all problems, because if the branch
               is predicted to be taken, it is not possible to begin 
               prefetching down the new path until the target address has
               been computed (in stage 2 of the pipelines we have been using
               for examples.)  CPU's that record branch history can avoid this
               problem by also storing the target address of the branch.
               (Since the target address of a branch instruction is generally
               computed by PC + displacement in instruction, a given branch
               instruction will always point to the same target.)
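
               A branch history table combining these ideas (last outcome
               plus stored target) might look like this in C - a sketch
               only; the size, direct-mapped indexing, and field layout
               are illustrative assumptions:

                  #include <stdint.h>

                  #define BHT_SIZE 256  /* illustrative; power of two */

                  typedef struct {
                      uint32_t branch_pc; /* where branch was fetched */
                      uint32_t target;    /* its target address       */
                      int      taken;     /* outcome last time        */
                  } BHTEntry;

                  static BHTEntry bht[BHT_SIZE];

                  /* Predict taken (and supply the target, so fetching
                     can begin at once) only if this branch has been
                     seen before and was taken last time.             */
                  int predict_taken(uint32_t pc, uint32_t *target) {
                      BHTEntry *e = &bht[(pc >> 2) & (BHT_SIZE - 1)];
                      if (e->branch_pc == pc && e->taken) {
                          *target = e->target;
                          return 1;
                      }
                      return 0;
                  }

                  /* Record what actually happened once the branch
                     resolves.                                        */
                  void record_outcome(uint32_t pc, uint32_t target,
                                      int taken) {
                      BHTEntry *e = &bht[(pc >> 2) & (BHT_SIZE - 1)];
                      e->branch_pc = pc;
                      e->target    = target;
                      e->taken     = taken;
                  }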

          b. An alternative to branch prediction that is being used on the new
             Intel RISC architecture in development (IA-64) is called
            PREDICATION.  In this strategy, the CPU includes a number of one
            bit predicate registers that can be set by conditional instructions.
            The instruction format includes a number of bits that allow
            execution of an instruction to be contingent on a particular
            predicate register being true (or false).  Further, a predicated
            instruction can begin executing before the value of its predicate
            is actually known, as long as the value becomes known before the
            instruction needs to store its result.  This can eliminate the
            need for a lot of branch instructions.

            Example:

                if r10 = r11 then
                    r9 = r9 + 1
                else
                    r9 = r9 - 1

            Would be translated on MIPS as:

                bne     $10, $11, else
                nop                     # Branch delay slot
                j       endif
                addi    $9, $9, 1       # In branch delay slot - always done
            else:
                addi    $9, $9, -1
            endif:

            Which is 5 instructions long and needs 4 clocks if $10 = $11 
            and 3 if not.

            But on a machine with predication as:

                set predicate register 1 true if $10 = $11
                (if predicate register 1 is true) addi $9, $9, 1
                (if predicate register 1 is false) addi $9, $9, -1

            Which is 3 instructions long (all of which can be done in
            parallel, provided the set predicate instruction sets the
            predicate register earlier in its execution than the other
            two store their results.)
                
   D. Advanced CPU's use both superpipelining and superscalar techniques.
      (E.g. DEC Alpha is both superpipelined and superscalar.)  The benefits
      that can be achieved are, of course, dependent on the ability of the
      compiler to arrange instructions in the program so that when one
      instruction depends upon another it occurs enough later in the program
      to prevent hazards from stalling execution and wasting the speedup that
      could otherwise be attained.

Copyright ©1999 - Russell C. Bjork