CS222 Lecture: CPU Implementation                       revised 10/20/2000

Materials: Transparency of Patterson-Hennessy figures 5.35, 5.32

I. Introduction
-  ------------

   A. We now shift the focus of the course from computer architecture to
      computer organization - going from describing how a computer behaves at
      the machine/assembly language level to how it is implemented.

   B. At the start of the course, we noted that a computer system may be viewed
      at various levels of abstraction:  (ASK)

      1. The user level

      2. The higher-level language programming level

      3. The machine/assembly language programming level

      4. The hardware design level

      5. The solid-state physics level

      We have spent the first half of this course at the machine/assembly
      language programming level.  We are now going to drop down to the
      hardware design level, to see how the functionality we have been
      studying can actually be realized.

   C. It turns out that each of these levels can be divided into sublevels
      for detailed study.

      1. For example, we studied machine language as one level for a couple of 
         weeks, and then built assembly language as a higher level on top of 
         that.

      2. At the hardware design level, we can talk about the following
         sublevels:

         a. The system level (coming in a week or so)

         b. The CPU implementation level (we start this today)

         c. The logic design level (we did this in CS221)

      3. Our particular concern at this point is with the implementation of
         the CPU, which is the component that implements the architectural
         capabilities we have been studying for the first half of the
         course.  From there, we will move on to consider how the CPU is
         combined with other subsystems (IO, memory, etc.) to build a
         complete computer.

   D. Rather than attempting to describe the implementation of a specific CPU,
      we will talk in more general terms about what might be needed to
      implement an Instruction Set Architecture (ISA).  Our examples will
      mostly be based on a one-address architecture machine.

   E. Consider a typical one-address machine instruction - say ADD.  

        ADD     X       ; Meaning AC <- AC + contents of memory cell "X"

      What must take place to perform this instruction (beginning with fetching
      the instruction itself)?  (ASK)

      1. Instruction fetch (IF) - go to the memory location pointed to by the
         PC, and fetch the instruction stored there; then update the PC.

      2. Instruction decode (IOD) - determine what the instruction is.
         (Exactly how this is done depends on how the control unit of the CPU is
         implemented, something we will discuss in the next series of lectures.)

      NOTE: The above two steps are always the same, regardless of what
            instruction is being executed.

      3. Operand address calculation (OAC) - this depends on what addressing 
         mode is being used (e.g. it might involve simply extracting an absolute
         address from the instruction; or it might involve adding a displacement
         to the PC or some other register; or it might even involve a trip
         to memory if some sort of deferred mode is being used).

      4. Operand fetch (OF) - get the value of the operand from the memory 
         address just calculated.

      5. Execution (EXEC) - add the value just fetched to the AC, and store the
         result back into the AC.

      NOTE: The actual execution of the computation is just a small part of
            the effort involved in executing the instruction.

   F. Of course, the exact series of steps (after instruction fetch and
      decode) will vary with the instruction being executed.

      1. An instruction like STORE X will not need an operand fetch step, but
         will need an operand store (OS) step as its last step - store the AC
         in the location whose address was calculated in the OAC step.

      2. An instruction like branch will require an OAC step, but not an OF
         or OS step.

      3. An instruction that doesn't involve any memory location (e.g.
         shift the AC left one place) won't need OAC, OF, or OS - but will
         still need IF, IOD, and EXEC.
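
      The variation just described can be summarized as a lookup table from
      instruction to step sequence.  The Python sketch below is illustrative
      only: the step abbreviations come from these notes, but the table
      itself, the name steps_for, and the EXEC step shown for BRANCH (loading
      the computed address into the PC) are assumptions made for the example.

```python
# Step abbreviations follow the notes: IF, IOD, OAC, OF, EXEC, OS.
# The table is an illustrative assumption for a one-address machine.
STEPS = {
    "ADD":    ["IF", "IOD", "OAC", "OF", "EXEC"],  # memory operand added to AC
    "STORE":  ["IF", "IOD", "OAC", "OS"],          # no operand fetch; AC stored last
    "BRANCH": ["IF", "IOD", "OAC", "EXEC"],        # assumed: EXEC loads target into PC
    "SHIFT":  ["IF", "IOD", "EXEC"],               # works on AC only, no memory operand
}

def steps_for(mnemonic):
    """Return the list of steps needed to carry out one instruction."""
    return STEPS[mnemonic]

for name in STEPS:
    print(name, "->", " ".join(steps_for(name)))
```

      Every entry begins with IF and IOD; only the steps after decode differ.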

   G. The steps will also vary with the architecture of the machine:

      1. A memory-memory architecture machine may require either 2 (two-address 
         architecture) or 3 (three-address architecture) OAC steps; plus two OF
         steps and one OS step, all for the same instruction.

      2. An instruction on a load-store machine will have either an OAC step
         plus an OF or OS step, or an EXEC step - but not both.

      3. A machine with variable length instructions (like the VAX) may require
         additional portions of the IF step after initial IOD.

      etc.

   H. We now look at the hardware required to carry out these steps.

II. Overview of CPU Components
--  -------- -- --- ----------

   A. Though CPU's vary widely in design, many can be modelled by a structure 
      like the following:

    |------------------------------------------------------------|
    |           |===============|                                v
    |           ||             ||                            ---------------
    |           ||        ------------     -------           |             |
    |           ||       /    ALU     \ -->|Flags|---------->| Control     |
    |           ||      /              \   -------           |     --------|
    |           ||      ----------------                     |     | State |
    |           ||        /||\    /||\                       ---------------
    |           ||         ||      ||                            ^
    |           ||      |-----|  |-----|                         |
    |           ||      |_____|  |_____|                         |
---------       ||         ||      ||                            |
| Clock |       ||         |========|                            |
|       |       ||             ||                                |
---------       ||      -----------------                        |
    |           ||      | CPU Registers |                        |
    |           ||      | (Some Visible |                        |
    |------------------>|  to the user, |                        |
    |           ||      |  others not)  |                        |
    |           ||      |               |                        |
    |           ||      -----------------                        |
    |           ||      |     MAR       |                        |
    |           ||      |---------------|                     -------
    |           ||      |     MBR       | ===================>| IR  |
    |           ||      -----------------                     -------
    |           ||          /||\   /||\                   
    |           ||           ||     ||
    |           |=============|     ||
    |                              \||/
    |
    |--------------------> Memory and/or IO Buses

   B. The most visible component of the CPU is the register set.  As we have
      seen, the particular assortment of registers will vary from machine to
      machine, but will typically include one or more registers that can 
      function as accumulators, index registers, a program counter etc.

      1. Often, the register set will include some registers used for internal 
         purposes that are not directly accessible to the assembly language 
         programmer; these can be used as scratch pads for more complex 
         computations such as multiplication, division, or floating point 
         operations, and to provide support for operating system functions
         such as memory management that are not directly visible to the
         ordinary programmer.

      2. Also included among the registers are an MBR and MAR which serve as
         the interface between the CPU and the memory system.  

         a. To perform a read access to memory, the CPU places the address 
            desired in the MAR and then issues a "read" command to the memory
            system.  When the memory completes the operation, the data requested
            will be in the MBR, from which the CPU can transfer it to wherever
            it is needed.

         b. A write is similar, but the CPU places the data to be written into
            the MBR before issuing the command to the memory.

         c. Strictly speaking, these registers may not be necessary on a given 
            CPU; the memory bus may connect directly to the input and output 
            of the ALU.  However, it is conceptually simpler to think of them
            as being present.

      3. Coming out of the register system are two buses serving as inputs to
         the ALU.  

         a. The busses may be formed by using tri-state outputs on each 
            individual register.

         b. Or, there may be a set of multiplexers - one for each bit position -
            with one bit of each register connected to each MUX.

      4. There is also a bus which carries the output of the ALU back to the
         register set, where it can be loaded into a specific register at
         the end of the cycle.  This can be accomplished by giving each register
         a parallel load capability.

   C. The ALU would generally include several subunits for functions like:

      1. Addition/subtraction.

      2. Logical operations: AND, OR etc.

      3. Shifts

      4. On higher power machines, the ALU may include special hardware
         for fast multiplication/division and/or floating point operations, as
         we discussed earlier under arithmetic algorithms.

   D. The ALU is also connected to a set of flags that can be set according
      to the outcome of an operation - e.g. the following are frequently found:

      1. A Z flag set if the result is zero

      2. An N flag set if the result is negative

      3. A C flag set if there is carry out from addition/subtraction or
         containing the bit shifted out by a shift.

      4. A V flag set if there is overflow on addition/subtraction or a shift.

      5. Other flags may be included: parity odd or even, half carry for
         decimal operations etc.

   E. The control unit controls the operation of the remaining units by
      enabling select inputs to the various MUXes etc.

      1. This would be determined by a system clock, internal state information
         in the control, and the contents of:

         a. An IR holding the current machine language instruction being 
            executed.

         b. The ALU flags

      2. The output of control would include lines to select:

         a. Which registers serve as input to the ALU (selection lines to MUXes
            or address to RAM or the like.)

         b. What function(s) the ALU performs.

         c. Which register receives the ALU output.

         d. Other functions such as control of the memory system etc.

         These lines are activated at the start of a clock cycle, the
         computation they specify is carried out during the clock cycle,
         and the final result is stored in a register and/or the flags at
         the end of the clock cycle.

      3. The control is by far the most complex part of the CPU.  For this
         reason, we will devote a separate lecture to it.  For now, we go on
         to consider the various kinds of operations that occur within the CPU.

   F. As we have drawn it, many of the components are used by two or more
      steps in performing an instruction.

      1. E.g. the memory bus/MAR/MBR are used by IF, OAC (possibly), OF, OS

      2. The ALU is used by IF (update PC), OAC, and EXEC

      etc.

   G. It is also possible to build a CPU with functional units dedicated to the
      various steps - e.g. an instruction fetch unit, an address
      calculation unit, a data memory interface unit, an instruction execution
      unit etc.  This entails replication of certain hardware components, but
      allows enhanced speed through parallelism.

III. The Register-Transfer Level of System Description
---  --- ----------------- ----- -- ------ -----------

   A. We have just noted that computer systems can be described at various 
      levels.  Associated with each of these levels is one or more "languages" 
      or systems of notation.

      1. At the user level, computer systems have command "languages", which
         may be textual (e.g. DCL, DOS) or graphical.

      2. At the HLL level, we have languages like Pascal, C, C++ ...

      3. At the machine language programming level, we have used two languages:
         machine language and assembly language.

      4. At the hardware design level, we have seen that we can divide it into
         sublevels, each of which will have a language of its own:

         a. The logic design level is described using the language of gates,
            flip-flops, and finite state machines, as we have already seen.

         b. The CPU implementation level uses a notation called REGISTER
            TRANSFER LANGUAGE, which we are about to learn.

         c. We will also utilize a system of notation for describing the
            overall system level.

   B. In describing the organization of the CPU, we need a system of
      notation to describe the basic operations that are allowed to take place,
      and the circumstances under which they occur.  Such a system of notation
      is called a register-transfer language, and resembles a programming
      language in the sense that it is used to describe the series of steps
      needed to accomplish some task - only in this case the steps are primitive
      hardware operations such as transferring a word from one register to
      another, placing data on a bus, or computing a sum in an adder.

   C. Each primitive operation described by RTL is called a micro-operation.

      1. A micro-operation is a primitive data transfer or transformation
         operation accomplished by the hardware in A SINGLE CLOCK CYCLE.

      2. In contrast, a macro-operation is a single machine instruction as
         seen by an assembly-language programmer.

         a. When you studied assembly language, you learned that a single
            higher-level language statement might require several machine
            language instructions.  Consider the implementation of the
            following Pascal statement on a one-accumulator machine using
            one-address instructions:

                Pascal:         X := Y + Z

                machine:        LOAD    Y
                                ADD     Z
                                STORE   X

         b. Likewise, each machine language instruction (macro-operation) will 
            be implemented as a series of micro-operations.  For example, take 
            the ADD instruction above, and assume the use of absolute addressing.
            The following series of micro-operations may be used to actually
            perform it.

                machine:        ADD     Z

                becomes:        MAR <- PC                       (IF)
                                MBR <- M[MAR]                    "
                                PC <- PC + size of instruction   "
                                IR <- op code portion of MBR    (IOD)
                                MAR <- address portion of MBR   (OAC)
                                MBR <- M[MAR]                   (OF)
                                AC <- AC + MBR                  (EXEC)
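
            The micro-operation sequence for ADD can be traced in software.
            The following Python sketch assumes a hypothetical 16-bit word
            with a 4-bit op code and a 12-bit absolute address, an op code
            of 1001 for ADD, and one word per instruction; none of these
            details is fixed by the notes, and run_add is a made-up name.

```python
# Hypothetical machine: 16-bit words, 4-bit op code, 12-bit absolute address.
OP_ADD = 0b1001          # assumed op code for ADD

def run_add(mem, pc, ac):
    """Perform the micro-operations for one ADD instruction; return (pc, ac)."""
    mar = pc                    # MAR <- PC                       (IF)
    mbr = mem[mar]              # MBR <- M[MAR]                    "
    pc = pc + 1                 # PC <- PC + size of instruction   "
    ir = mbr >> 12              # IR <- op code portion of MBR    (IOD)
    if ir != OP_ADD:
        raise ValueError("this sketch only handles ADD")
    mar = mbr & 0x0FFF          # MAR <- address portion of MBR   (OAC)
    mbr = mem[mar]              # MBR <- M[MAR]                   (OF)
    ac = (ac + mbr) & 0xFFFF    # AC <- AC + MBR                  (EXEC)
    return pc, ac

Z = 0x020                                   # address of operand Z
memory = {0x000: (OP_ADD << 12) | Z, Z: 7}  # ADD Z at address 0, Z holds 7
print(run_add(memory, pc=0x000, ac=5))      # (1, 12)
```

            Each assignment corresponds to one micro-operation, so the seven
            lines in the function body mirror the seven lines listed above.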
            
   D. RTL not only allows us to describe primitive operations, but also the
      conditions under which those operations occur.  This is accomplished
      by preceding the micro-operation with a logical expression followed
      by a colon.  For example, in the above, suppose that the OP code is
      stored in the instruction register (IR), and that the op code for ADD
      is 1001.  Suppose further that the addition step takes place when an
      internal timing signal T7 is true.  Then the last micro-operation could
      be written:

        IR = 1001 and T7: AC <- AC + MBR

      1. Note that the portion before the colon becomes the description for a
         combinational network that must be implemented to generate the control
         signals necessary to effect the specified transfer - e.g. the parallel
         load enable input to the AC plus possible inputs to several MUX's to
         select AC and MBR as inputs to the adder (assuming other inputs are
         possible) and the adder output as input to the AC (assuming other
         inputs to the AC are possible).

      2. At the RTL level, a system can be viewed as consisting of two parts:

         a. A data part, consisting of registers, data paths (busses) and the
            data transformation elements that comprise the ALU.

         b. A control part that generates the necessary enable and selection
            inputs to the devices in the data part at the correct time.

         c. A micro-operation specification in RTL can be read as:

            if the following conditions are true, then 
                the control unit must generate the control signals needed
                to cause ___ to occur.

   E. RTL allows us to specify that a number of micro-operations occur in
      parallel, by separating them by commas.  

      1. For example, most CPU's are constructed in such a way that the 
         operation of incrementing the program counter uses different hardware 
         than that used to actually read a word from memory.

      2. Thus, in the ADD instruction example we considered above, these two 
         steps could be done in parallel:

                                MBR <- M[MAR], PC <- PC + instruction size

      3. In the ADD instruction, these are the only steps that can be done
         in parallel, because each of the other steps depends on the result
         of the previous step.
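
      The meaning of the comma can be made concrete with a small simulation.
      The sketch below assumes edge-triggered registers, so that every source
      is read as it stood at the start of the cycle before any destination
      changes; clock_tick and the register values are hypothetical names and
      data chosen for illustration.

```python
def clock_tick(regs, transfers):
    """Apply a list of guarded transfers in parallel and return new registers.

    transfers holds (condition, dest, source) triples; condition and source
    are functions of the register values at the START of the cycle.
    """
    old = dict(regs)                 # snapshot taken before anything changes
    new = dict(regs)
    for condition, dest, source in transfers:
        if condition(old):           # the part before the colon in RTL
            new[dest] = source(old)  # the transfer after the colon
    return new

mem = {100: 0x9020}
regs = {"PC": 100, "MAR": 100, "MBR": 0, "A": 1, "B": 2}
always = lambda r: True

# MBR <- M[MAR], PC <- PC + instruction size   (one word assumed)
after_fetch = clock_tick(regs, [
    (always, "MBR", lambda r: mem[r["MAR"]]),
    (always, "PC", lambda r: r["PC"] + 1),
])

# A <- B, B <- A   (a swap works because both sources are read first)
swapped = clock_tick(regs, [
    (always, "A", lambda r: r["B"]),
    (always, "B", lambda r: r["A"]),
])

print(after_fetch["MBR"] == 0x9020, after_fetch["PC"])  # True 101
print(swapped["A"], swapped["B"])                       # 2 1
```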

   F. Basic RTL nomenclature:

      1. Registers are referred to by all capital letter names - e.g. AC, R3
         etc.

      2. Busses are referred to similarly.

      3. A single bit of a register or bus is referred to by using a subscript-
         e.g. R3
                2

      4. A group of bits of a register or bus are referred to by enclosing the
         bit numbers or a mnemonic name in parentheses - e.g. IR(15-0),
         MBR(AD).

      5. An arrow is used to denote the loading of a value into a register or
         its gating onto a bus - e.g. AC <- AC + 1 or ABUS <- AC.  Cf the
         assignment operation of Pascal: AC := AC + 1.

      6. A colon separates the conditions under which a micro-operation is to
         be done (boolean expression) from the operation itself.  Cf the
         if..then of Pascal:

                IR=1001 and T7: AC <- AC + MBR

                if (IR=1001) and T7 then
                   AC := AC + MBR

      7. Commas are used to separate micro-operations done in parallel (at the
         same time.)

IV. Survey of typical micro-operations:
--  ------ -- ------- ----------------

    A. Parallel transfer:       condition: dest <- source

       1. Meaning: If condition is true, then all bits of destination register 
          are loaded with the corresponding bits of source on the next clock pulse.

       2. Implementation: destination register is a register with parallel
          load.  Its inputs are tied to corresponding outputs of source, which
          may be another register, an array of arithmetic elements (e.g. a
          16 bit adder) or a bus.  Parallel load enable is activated by a
          network realizing the specified boolean condition.

          Example:      for     xy: A(3-0) <- B(3-0)

                 ____            ____________
           x ---|    \  Load    |            |
                |     )---------| Register A |----------- Clock
           y ---|____/  Enable  |____________|
                                   | | | | Inputs
                                   | | | |
                                   | | | | Outputs
                                 __|_|_|_|___
                                |            |
                                | Register B |
                                |____________|
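
          The same example can be expressed in software.  This Python sketch
          models only the behavior at a clock pulse; the function name
          parallel_transfer is hypothetical, and x and y are assumed to be
          single-bit signals (0 or 1).

```python
def parallel_transfer(a, b, x, y):
    """xy: A(3-0) <- B(3-0).  Return the value of A after the clock pulse."""
    load_enable = x & y                  # AND gate realizing the condition xy
    if load_enable:
        return (a & ~0xF) | (b & 0xF)    # low four bits of A take B's bits
    return a                             # load not enabled: A keeps its value

print(bin(parallel_transfer(0b0101, 0b1010, 1, 1)))  # 0b1010
print(bin(parallel_transfer(0b0101, 0b1010, 1, 0)))  # 0b101
```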

   B. Bus transfer:     condition:      BUS <- source
                or      condition:      dest <- BUS
                or      condition:      BUS <- source, dest <- BUS

         (In the first case, the bus is presumably used as input to some
          functional unit such as the adder; in the second case, the bus is
          presumably the output from some functional unit such as the adder;
          in the third case the bus is used to route data directly from one
          register to another without any intervening computation.)

      1. Meaning: If condition is true, then one of several possible sources 
         is selected and placed onto a common bus and/or at the next clock
         pulse all bits appearing on the bus are copied into destination.

      2. Implementation: each of the sources connects to the bus in one of
         two ways:

         a. One input to a MUX.

         b. A tri-state gate.

         The bus connects to the parallel input of the destination.  A logical
         network realizing condition is used to enable either the correct MUX
         channel or the tri-state gates, as well as parallel load.

         Example: Consider a CPU having four 4-bit registers, any one of which
                  can be transferred to the input of a fifth such register.
                  We consider how to do this two ways: with MUXes, and with
                  tristate outputs on the registers.

         - with MUXes (note: selection lines of all four MUXes would be
           tied together)

                                 ________
                                | 4 bits |
                                |________|
                                 | | | |
            _____________________| | | |_____________________
            |               _______| |_______               |
        ____|____       ____|____       ____|____       ____|____
        | 4 x 1 |       | 4 x 1 |       | 4 x 1 |       | 4 x 1 |
        |  MUX  |       |  MUX  |       |  MUX  |       |  MUX  |
        |_______|       |_______|       |_______|       |_______|
         | | | |         | | | |         | | | |         | | | |________
         | | | |         | | | |         | | | |_________|_|_|________ |
         | | | |         | | | |         | | |____       | | |       | |
         | | | |         | | | |_________|_|_____|_______|_|_|______ | |
         | | | |         | | |___________|_|____ |       | | |     | | |
 ________| | | |_________|_|_____________|_|___|_|_______|_|_|____ | | |
 |         | |___________|_|_____________|_|__ | |       | | |   | | | |
 |         |______       | |             | | | | |       | | |   | | | |
 | ______________|_______| |             | | | | |       | | |   | | | |
 | |             | ________|             | | | | |       | | |   | | | |
 | | ____________|_|_____________________| | | | |       | | |   | | | |
 | | | __________|_|_______________________|_|_|_|_______| | |   | | | |
 | | | |         | | ______________________| | | |         | |   | | | |
 | | | |         | | | ______________________|_|_|_________| |   | | | |
 | | | |         | | | |                     | | | __________|   | | | |
_|_|_|_|_       _|_|_|_|_                   _|_|_|_|_           _|_|_|_|_
| 4 bits |      | 4 bits |                  | 4 bits |          | 4 bits |
|________|      |________|                  |________|          |________|

         - with tri-states: (note: only one of the four registers would
           have its tri-state enable input active.)

                         ________
                        | 4 bits |
                        |________|
                         | | | |
    +--------------------+-|-|-|-------------+-------------------+
    | +------------------|-+-|-|-------------|-+-----------------|-+
    | | +----------------|-|-+-|-------------|-|-+---------------|-|-+
    | | | +--------------|-|-|-+-------------|-|-|-+-------------|-|-|-+  
   _|_|_|_|_            _|_|_|_|_           _|_|_|_|_           _|_|_|_|_
---| 4 bits |        ---| 4 bits |      ---| 4 bits |        ---| 4 bits |
   |________|           |________|         |________|           |________|
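
         The two bus implementations differ in how a source is chosen: a MUX
         takes an encoded selection, while tri-state outputs take one enable
         per register.  The Python sketch below models that difference for
         four 4-bit source registers; the function names and the treatment of
         an undriven bus as None are assumptions made for illustration.

```python
def mux_bus(sources, select):
    """MUX-based bus: a 2-bit select value picks one of four registers."""
    return sources[select] & 0xF

def tristate_bus(sources, enables):
    """Tri-state bus: a register drives the bus only while it is enabled."""
    drivers = [s for s, en in zip(sources, enables) if en]
    if len(drivers) > 1:
        raise ValueError("bus contention: more than one driver enabled")
    if not drivers:
        return None                  # nothing enabled: bus is floating
    return drivers[0] & 0xF

regs = [0b0001, 0b0010, 0b0100, 0b1000]
print(bin(mux_bus(regs, 2)))                   # 0b100
print(bin(tristate_bus(regs, [0, 0, 0, 1])))   # 0b1000
```

         With a MUX it is impossible to select two sources at once; with
         tri-states the control logic must guarantee it.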


   C. Memory transfer:  condition:      MAR <- address  or      MAR <- address
                                        MBR <- M[MAR]           MBR <- source
                                        dest <- MBR             M[MAR] <- MBR

      1. Meaning: if condition is true, a word of data is transferred to/from
         a specified memory address.

      2. Implementation: The memory system interfaces to the rest of the
         system via two registers, MAR and MBR, which have parallel load
         capabilities.  Ordinary parallel transfer techniques are used to
         move data between them and CPU registers; the memory itself is
         controlled by lines such as chip select and write enable.  (We will
         discuss all this later.)

      Note: sometimes these are abbreviated to  

                dest <- M[address] or   M[address] <- source

      But when we do so we are normally describing a SERIES of micro-operations
      that take place over a series of cycles.
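
      The MAR/MBR interface can be modeled as follows.  This is a minimal
      Python sketch, assuming a word-addressed memory that completes a read
      or write in one step; the class and method names are hypothetical.

```python
class Memory:
    """Memory reached only through MAR (address) and MBR (data)."""

    def __init__(self, size):
        self.cells = [0] * size
        self.mar = 0
        self.mbr = 0

    def read(self):
        self.mbr = self.cells[self.mar]    # MBR <- M[MAR]

    def write(self):
        self.cells[self.mar] = self.mbr    # M[MAR] <- MBR

mem = Memory(16)

# Write sequence:  MAR <- address, MBR <- source, M[MAR] <- MBR
mem.mar = 5
mem.mbr = 42
mem.write()

# Read sequence:   MAR <- address, MBR <- M[MAR], dest <- MBR
mem.mbr = 0
mem.mar = 5
mem.read()
dest = mem.mbr
print(dest)   # 42
```

      The abbreviated form dest <- M[5] stands for the three-step read
      sequence shown in the second half of the example.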

   D. Arithmetic Operations

      1. The most basic arithmetic operation is ADD, which takes two full-size
         operands plus a one-bit carry in and produces a full-size result plus
         a one-bit carry out.  

         a. RTL:        Dest <- Source1 + Source2

         b. This can be implemented by using an array of full adders with
            ripple carry or carry-lookahead, as discussed previously.

      2. By allowing the inputs to the adders to come from multiplexers, we can
         realize several different kinds of addition operations, as follows:

                   Carry-out   Sum
                          |    ||
                      ____|____||____
                     / n Full adders \ <------ carry - in
                    /  _____________  \
                   /__/             \__\
                    ||               ||
              ______||_____    ______||_____          
Select  -----| n 2x1 MUXes |  | n 4x1 MUXes |---- Select
             |_____________|  |_____________|----
                ||   ||        ||  ||  ||  ||
                      _            _
                 A    A        B   B   0   -1 (all 1's)

         Possible functions:

         Function       A MUX select    B MUX select    Carry - in

         A + B          A  (0)          B  (00)         0
         A + B + 1      A               B               1
         A + B'         A               B' (01)         0
         A - B          A               B' (01)         1
         A              A               0  (10)         0
         A + 1          A               0  (10)         1
         A - 1          A               -1 (11)         0
         A'             A' (1)          0  (10)         0
         -A             A'              0               1
         B - A          A'              B               1

         (Others are possible but would probably not be useful.)
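
         The function table can be checked with a short simulation.  The
         Python sketch below assumes a 4-bit word (n = 4) and uses the select
         encodings shown in the table; adder_unit is a hypothetical name.

```python
N = 4                      # word size assumed for the example
MASK = (1 << N) - 1

def adder_unit(a, b, a_sel, b_sel, carry_in):
    """Return (sum, carry_out) for the MUX-fed adder in the diagram."""
    a &= MASK
    b &= MASK
    a_in = (a, ~a & MASK)[a_sel]               # 2x1 MUX: A (0) or A' (1)
    b_in = (b, ~b & MASK, 0, MASK)[b_sel]      # 4x1 MUX: B, B', 0, -1
    total = a_in + b_in + carry_in             # the n full adders
    return total & MASK, total >> N

print(adder_unit(5, 3, 0, 0, 0))   # A + B  -> (8, 0)
print(adder_unit(5, 3, 0, 1, 1))   # A - B  -> (2, 1)
print(adder_unit(5, 3, 1, 2, 1))   # -A     -> (11, 0), i.e. 1011 = -5
```

         Subtraction falls out of the identity A - B = A + B' + 1 in 2's
         complement, which is why no separate subtractor is needed.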

   E. Logic operations.

      1. A typical instruction set requires us to perform certain kinds of
         bit-wise logical operations on pairs of operands - e.g. some subset
         of bitwise or, and, xor, bit-clear.

      2. It is possible to realize a general logic network in which a
         MUX is used to select one of the possible logic operations.  For
         example, the following network realizes one of the four functions
         A^B, AvB, A xor B, A^B' as selected by a two-bit selection line.

                       ____|____
                      | 4x1 MUX |-- Select
                      |_________|--
                 ____   | | | |
           A ---|    \  | | | |
                |     )-+ | | |
           B ---|____/    | | |
                 ____     | | |
           A ----\   \    | | |
                  )   >---+ | |
           B ----/___/      | |
                 ____       | |
           A --\-\   \      | |
                ) )   >-----+ |
           B --/-/___/        |
                 ____         |
           A ---|    \        |
                |     )-------+
           B'---|____/

      3. Logical operations can be used to manipulate individual bits or
         groups of bits, as in converting ASCII to decimal, unpacking
         packed data etc.  The following operations are useful in this
         regard:
                                        _
         a. Selective clear:    A <- A ^ B   - a 1 in B causes the
                                               corresponding bit in A to be
                                               cleared.
         b. Selective set:      A <- A v B   - a 1 in B causes the
                                               corresponding bit in A to be 
                                               set.
         c. Mask                A <- A ^ B   - only the bits in A corresponding
                                               to 1's in B remain set.  Note
                                               that this is an alternative to
                                               selective clear.
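
      These operations map directly onto bitwise operators.  The Python
      sketch below assumes 8-bit operands and assigns select codes 0-3 to
      the four functions in the order they were listed above; both choices
      are assumptions, as is the name logic_unit.

```python
MASK8 = 0xFF               # 8-bit operands assumed

def logic_unit(a, b, select):
    """MUX-selected logic network: 0 = and, 1 = or, 2 = xor, 3 = bit-clear."""
    results = (a & b, a | b, a ^ b, a & ~b)
    return results[select] & MASK8

ascii_seven = 0x37                                 # the character '7'
print(logic_unit(ascii_seven, 0x0F, 0))            # mask: keeps low nibble -> 7
print(logic_unit(ascii_seven, 0x30, 3))            # selective clear -> 7
print(hex(logic_unit(0x07, 0x30, 1)))              # selective set -> 0x37
```

      Masking with 0x0F and selectively clearing with 0x30 give the same
      result here, which illustrates why mask is described as an alternative
      to selective clear.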

   F. Shift operations:

      1. Shift operations can be classified as to direction and type.

         a. Direction: left or right.

         b. Type: logical, rotate, arithmetic.  The distinction is what gets
            shifted into the sign position and into the bit position vacated.

            i. Logical shifts shift a zero into the vacated bit.  Thus
               shl(1111) --> 1110; shr(1111) --> 0111

           ii. Rotate shifts the bit shifted out into the vacated bit.  Thus
               cil(1000) --> 0001; cir(0001) --> 1000

          iii. Arithmetic shifts implement the operations *2 (ashl) and
               div 2 (ashr).  The rules depend on the sign-convention in use:

               a. Unsigned: same as logical shift.  If a one is shifted out
                  on an ashl, then overflow has occurred.

               b. Sign-magnitude: use rule for unsigned on all bits except
                  the sign, which is left unchanged.

               c. 2's complement:

                  i. Left shift: the sign is left unchanged, and 0 is shifted
                     into the low order bit.  If the bit shifted out is not
                     the same as the sign, then overflow has occurred.

                        example: ashl(0001) -> 0010     1 -> 2
                                 ashl(1100) -> 1000     -4 -> -8
                                 ashl(0100) -> 0000 ovf 4 -> 0 should be 8

                 ii. Right shift: the sign is propagated.

                        example: ashr(0011) -> 0001     3 -> 1
                                 ashr(1001) -> 1100     -7 -> -4
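
      The six one-place shifts on 4-bit values can be written out as
      follows.  This is a Python sketch using the mnemonics from the
      examples above; it assumes inputs are 4-bit patterns (0 to 15), treats
      ashl and ashr as 2's complement shifts, and has ashl return an
      overflow indication alongside the result.

```python
N = 4
MASK = (1 << N) - 1
SIGN = 1 << (N - 1)

def shl(a):
    return (a << 1) & MASK                       # 0 enters at the right

def shr(a):
    return a >> 1                                # 0 enters at the left

def cil(a):
    return ((a << 1) | (a >> (N - 1))) & MASK    # high bit wraps to low

def cir(a):
    return (a >> 1) | ((a & 1) << (N - 1))       # low bit wraps to high

def ashl(a):
    """Sign kept, 0 into the low bit.  Returns (result, overflow)."""
    sign = a & SIGN
    shifted_out = (a << 1) & SIGN                # old bit N-2, moved up one
    result = sign | ((a << 1) & (MASK >> 1))
    return result, shifted_out != sign

def ashr(a):
    return (a >> 1) | (a & SIGN)                 # sign is propagated

print(format(shl(0b1111), "04b"), format(shr(0b1111), "04b"))   # 1110 0111
print(format(cil(0b1000), "04b"), format(cir(0b0001), "04b"))   # 0001 1000
print(ashl(0b0100))                                             # (0, True)
print(format(ashr(0b1001), "04b"))                              # 1100
```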

      2. Implementation - two choices:

         a. Use of a shift register, with appropriate shift in/out connections.

         b. Use of a gating network.  Example: a network that realizes all 6
            operations on 4 bit 2's comp numbers, controlled by 3 select lines
            with two combinations unused.  (shl A, shr A, cil A, cir A,
            ashl A, ashr A in that order.   Bits numbered with 0 = lsb)

                Result            Result           Result          Result
                   |  3             |   2            |   1            |  0
               ____|____        ____|____        ____|____        ____|____
              | 8x1 MUX |      | 8x1 MUX |      | 8x1 MUX |      | 8x1 MUX |
              |_________|      |_________|      |_________|      |_________|
              | | | | | |      | | | | | |      | | | | | |      | | | | | |
              A 0 A A A A      A A A A A A      A A A A A A      0 A A A 0 A
               2   2 0 3 3      1 3 1 3 1 3      0 2 0 2 0 2       1 3 1   1
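
            The network above can be modeled as a lookup table, one row per
            output MUX.  This is only a sketch: the table name MUX_INPUTS, the
            use of None for a constant 0, and the assignment of select codes
            0..5 to the six operations in their listed order are assumptions.
            Overflow detection for ashl is not modeled.

```python
# Source of each result bit for each select code, read off the diagram.
# An integer i means "input bit A_i"; None means a constant 0.
OPS = ["shl", "shr", "cil", "cir", "ashl", "ashr"]
MUX_INPUTS = {
    3: [2, None, 2, 0, 3, 3],
    2: [1, 3, 1, 3, 1, 3],
    1: [0, 2, 0, 2, 0, 2],
    0: [None, 1, 3, 1, None, 1],
}

def shift_network(a, select):
    """Return the 4-bit result for input a under the given select code."""
    if not 0 <= select < len(OPS):
        raise ValueError("select codes 6 and 7 are unused")
    result = 0
    for out_bit, sources in MUX_INPUTS.items():
        src = sources[select]
        bit = 0 if src is None else (a >> src) & 1   # the selected MUX input
        result |= bit << out_bit
    return result
```

            For example, shift_network(0b1001, OPS.index("ashr")) gives 0b1100.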

         c. Sometimes it is desirable to shift more than one place in a single
            cycle.  A network like that described above can, of course, be
            custom-designed to shift any number of places (up to the word 
            length). To get a general shift capability, one can use a 
            LOGARITHMIC SHIFTER.  For example, the following could shift its 
            input any number of places from 0 to 15:

                        ||||||||||||||||
                        -----------------
                        | 0 or 8 place  |
                        | shift         |--- control
                        -----------------
                        ||||||||||||||||
                        -----------------
                        | 0 or 4 place  |
                        | shift         |--- control
                        -----------------
                        ||||||||||||||||
                        -----------------
                        | 0 or 2 place  |
                        | shift         |--- control
                        -----------------
                        ||||||||||||||||
                        -----------------
                        | 0 or 1 place  |
                        | shift         |--- control
                        -----------------
                        ||||||||||||||||
                        
           Each stage shifts its input either the specified number of places,
           or no places at all, based on its control input.

           To do, say, a 13 place shift, one would enable the 8, 4, and 1
           place shifters and disable the 2 place shifter.  (Note that the
           binary representation of the number of places to shift becomes
           the set of control signals to the various stages!)
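
           A minimal Python sketch of the four-stage shifter follows.  The
           diagram does not say which kind of shift each stage performs, so a
           logical left shift on 16-bit words is assumed here; the function
           names are hypothetical.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1
STAGES = (8, 4, 2, 1)              # top stage of the diagram first

def enabled_stages(places):
    """Stages whose control input is 1 for the requested shift count."""
    return [s for s in STAGES if places & s]

def log_shift_left(value, places):
    """Logical left shift of a 16-bit value by 0..15 places."""
    if not 0 <= places < WIDTH:
        raise ValueError("places must be in 0..15")
    for stage in STAGES:
        if places & stage:         # this bit of the count is the control
            value = (value << stage) & MASK
    return value
```

           Here enabled_stages(13) returns [8, 4, 1], the same set of stages
           named in the 13 place example above.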

V. Execution of Machine-Language Instructions - One Address Machine
-  --------- -- ---------------- ------------ - --- ------- -------

   A. We have noted that each macro-instruction (instruction in the set that
      is visible to the assembly-language programmer) is implemented by an 
      appropriate series of micro-operations like the above, one per clock 
      cycle.  

   B. As we shall see in our next lecture, it is the task of the control part
      of the CPU to arrange for these micro-operations to occur in the correct
      order.

   C. Typically, the execution of each instruction is done in a series of
      phases, each consisting of one or more micro-operations.  (But some
      phases may not be needed for some instructions.)

      For illustration purposes, we will give below a series of micro-operations
      for each phase of an instruction for a one-address machine.  These are 
      meant to give an idea of what MIGHT occur - actual implementations will 
      vary. We assume an instruction format in which the op-code and address are
      fetched from memory together as a single unit - e.g.

                __________________
                | op  | address  |
                ------------------

      1. Instruction fetch:

                MAR <- PC
                MBR <- M[MAR]
                PC <- PC + size of instruction

         (Note: on machines with variable-format instructions, several memory
          references may be needed.)

      2. Instruction decoding

                IR <- MBR

      3. Operand address calculation (perhaps for several operands).

         a. Example: direct addressing:

                MAR <- address part of IR

         b. Example: displacement addressing:

                MAR <- address part of IR + designated register

         c. Example: indirect addressing:

                MAR <- address part of IR
                MBR <- M[MAR]
                MAR <- MBR

      4. Operand fetch (if needed).

                MBR <- M[MAR]           -- once for each operand

      5. Instruction execution

         a. Example: ADD memory location to R0

                R0 <- R0 + MBR

         b. Example: Branch to some address if N condition code is set

                N: PC <- branch address

      6. Storing the result and/or setting condition codes (if needed)

         Example:
        
                MBR <- result of operation
                M[MAR] <- MBR

      Note that (1) involves a read from memory, and (3) and (4) may involve
      one or more reads from memory, while (6) may involve a write to memory.
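
      The phases above can be traced with a small Python sketch.  Everything
      about the encoding is hypothetical: instructions are stored as
      (op, mode, address) tuples, each occupies one word, X stands for the
      designated register in displacement addressing, and the op names ADD,
      STORE and BN are invented for the example.

```python
def run_one(m, regs):
    """Execute one instruction; m is memory (a list), regs a dict."""
    # 1. Instruction fetch
    regs["MAR"] = regs["PC"]
    regs["MBR"] = m[regs["MAR"]]
    regs["PC"] += 1                      # one word per instruction here
    # 2. Instruction decoding
    regs["IR"] = regs["MBR"]
    op, mode, addr = regs["IR"]
    # 3. Operand address calculation
    if mode == "direct":
        regs["MAR"] = addr
    elif mode == "disp":
        regs["MAR"] = addr + regs["X"]
    elif mode == "indirect":
        regs["MAR"] = addr
        regs["MBR"] = m[regs["MAR"]]
        regs["MAR"] = regs["MBR"]
    if op == "ADD":
        regs["MBR"] = m[regs["MAR"]]     # 4. operand fetch
        regs["R0"] = regs["R0"] + regs["MBR"]   # 5. execution
        regs["N"] = regs["R0"] < 0       # 6. set condition code
    elif op == "STORE":
        regs["MBR"] = regs["R0"]         # 6. store the result
        m[regs["MAR"]] = regs["MBR"]
    elif op == "BN":
        if regs["N"]:                    # 5. N: PC <- branch address
            regs["PC"] = regs["MAR"]
    else:
        raise ValueError(f"unknown op {op!r}")

def demo():
    mem = [("ADD", "direct", 10), ("ADD", "indirect", 11),
           ("STORE", "disp", 12), ("BN", "direct", 0)]
    mem += [0] * 6 + [5, 13, 0, -7, 0]   # data occupies locations 10..14
    regs = {"PC": 0, "MAR": 0, "MBR": 0, "IR": None,
            "R0": 0, "X": 2, "N": False}
    for _ in range(4):
        run_one(mem, regs)
    return mem, regs
```

      In demo(), R0 ends up holding 5 + (-7) = -2, that value is stored at
      location 12 + 2 = 14, and the branch is taken because N is set.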

   D. In the simplest case, the various phases of an instruction are done in
      sequence - one after another.  Thus, at any given moment of time, the CPU 
      is carrying out one particular phase of one particular instruction - 
      fetching it, or decoding it, or calculating the address of its operand(s),
      or ...

   E. However, one very important way of improving system performance is by 
      the use of various forms of parallelism in the instruction 
      fetch/decode/execute cycle.  We will discuss this under the topic of
      "pipelining" later in the course.

   F. In our discussion of the control unit which comes next, we will stick
      with a simplified model of the CPU in which all phases of execution of
      an instruction are done sequentially.  We will come back to parallelism
      later.

VI. Execution of Machine-Language Instructions - Load-Store Machine
--  --------- -- ---------------- ------------ - ---------- -------

   A. Thus far, our discussions of CPU implementation have been based on a
      one-address architecture machine.  We now look at how this would change
      for a load-store machine such as MIPS.

   B. For a one-address machine, each instruction is performed by a subset
      of the following steps - always done in the relative order shown:

        IF -   Instruction fetch
        IOD -  Instruction decode
        OAC -  Operand Address calculation
        OF -   Operand fetch
        EXEC - Instruction execution
        OS -   Operand store

      Each of these stages makes use of some subset of the basic functional
      units of the CPU - e.g.

        IF uses PC, MAR, MBR, Memory system, and ALU
        IOD uses MBR, IR
        OAC uses MAR, MBR, ALU, and (maybe) Memory system
        OF uses MAR, MBR, and Memory system
        EXEC uses some subset of PC, MBR, ALU, AC, and flags
        OS uses MAR, MBR, Memory system

   C. For a load-store architecture machine, as presented in our text, the 
      basic steps and usage of functional units are similar, but not identical.

      1. Instead of a single memory system, there may be two:

         a. Instruction memory

         b. Data memory

         (These are not ultimately two totally separate systems, but rather
          separate subsystems of a single memory system.  We will say more about
          how this is accomplished later, when we talk about memory systems.)

         c. Some ramifications:

            i. No need for an MAR, as such:

               - For accesses to instruction memory, the PC always furnishes
                 the address.

               - For accesses to data memory, the address is always calculated
                 by the ALU - hence the ALU output furnishes the address.

           ii. No need for an MBR, as such:

               - For accesses to instruction memory, the data read always goes
                 into the IR.

               - For accesses to data memory, the data either comes from or
                 goes to one of the general registers.

      2. In the one address machine, all computational instructions use the AC
         as one source, and as the destination, and use the MBR as the other
         source.  In the load-store architecture, each of the sources and the
         destination can be any of the programmer-visible registers.

         a. For the one address machine, it is reasonable to do the complete
            execution of a computational instruction as a single step, in one
            clock.

         b. For the load-store machine, execution of a computational instruction
            needs to be broken into several steps if the clock cycle time is
            not to be made unreasonably long:

            i. Reading the two source registers into holding registers at the
               ALU inputs.

           ii. Doing the actual computation and writing the result into a
               holding register at the ALU output.

          iii. Writing the result back into the appropriate register.

      3. The separation of memory reference instructions (load/store) from
         computational instructions means that an instruction will either do
         an operand address calculation or an arithmetic/logic computation,
         but not both.  Further, the same field(s) in the instruction specify
         the source register(s) and immediate value (if any) for whichever
         type of computation is being done.  Thus, these two steps can be 
         folded into one.

      4. The fact that all instructions have the same length (1 word) and
         basic format means that certain computations can be done
         speculatively during the IOD step - i.e. before it is known for
         certain that their results will be needed.  This is safe because
         they take no extra time and do not change any programmer-visible
         part of the machine state:

         a. Source registers can be read out of the register file into the
            ALU input holding registers, even if the instruction turns out to
            be one that does not use them.

         b. The target address for a branch instruction can be computed and
            stored in the ALU output holding register, even if the instruction
            is not actually a branch or is a conditional branch that is not 
            taken.
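
      The points in C can be pulled together in a minimal Python sketch of
      one instruction's execution on a load-store machine.  It is not the
      datapath from the text: instructions are pre-decoded
      (op, rs, rt, rd, imm) tuples, memory is word-indexed so the PC advances
      by 1, and the variable names a, b, alu_out and loaded are hypothetical
      stand-ins for the holding registers.

```python
def execute(instr_mem, data_mem, reg, pc):
    """Run one instruction and return the updated PC."""
    # Step 1: instruction fetch - the PC supplies the address, no MAR needed
    ir = instr_mem[pc]
    pc += 1                            # word-indexed memory (simplification)
    op, rs, rt, rd, imm = ir
    # Step 2: decode, plus speculative work that is harmless if unused
    a, b = reg[rs], reg[rt]            # ALU input holding registers
    alu_out = pc + imm                 # possible branch target
    if op == "beq":
        if a == b:                     # Step 3: compare and maybe branch
            pc = alu_out
    elif op in ("add", "sub"):
        alu_out = a + b if op == "add" else a - b   # Step 3: compute
        reg[rd] = alu_out                           # Step 4: write back
    elif op == "sw":
        alu_out = a + imm              # Step 3: address calculation
        data_mem[alu_out] = b          # Step 4: store from a register
    elif op == "lw":
        alu_out = a + imm              # Step 3: address calculation
        loaded = data_mem[alu_out]     # Step 4: read data memory
        reg[rt] = loaded               # Step 5: write the register
    else:
        raise ValueError(f"unknown op {op!r}")
    return pc

def demo():
    program = [
        ("lw", 1, 3, 0, 2),     # reg[3] <- data_mem[reg[1] + 2]
        ("add", 3, 2, 4, 0),    # reg[4] <- reg[3] + reg[2]
        ("sw", 1, 4, 0, 3),     # data_mem[reg[1] + 3] <- reg[4]
        ("beq", 4, 4, 0, -4),   # always taken: back to instruction 0
    ]
    reg = [0, 10, 32, 0, 0]
    data_mem = {12: 7}
    pc = 0
    for _ in range(len(program)):
        pc = execute(program, data_mem, reg, pc)
    return reg, data_mem, pc
```

      Note that the address calculation for lw/sw and the computation for
      add/sub occupy the same step, and that the branch target is ready
      before the comparison is made.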

   D. This leads to the following series of steps for various types of
      instructions:

      TRANSPARENCY: Patterson-Hennessy figure 5.35

      1. R-Type instructions require a total of 4 steps

      2. Memory reference instructions require 4 (store) or 5 (load)

      3. Branch instructions require 3 steps.
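
      As a worked example of these counts, the sketch below totals the steps
      for a short instruction sequence.  The sequence itself is hypothetical,
      and the total equals the number of clock cycles only under the
      assumption that each step takes one clock.

```python
# Steps per instruction class, as listed above
STEPS = {"r-type": 4, "store": 4, "load": 5, "branch": 3}

def total_steps(classes):
    """Sum the step counts for a sequence of instruction classes."""
    return sum(STEPS[c] for c in classes)

# Hypothetical sequence: load, R-type, store, branch
sequence = ["load", "r-type", "store", "branch"]
print(total_steps(sequence))           # 5 + 4 + 4 + 3 = 16
```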

   E. Structure of CPU Data Paths for a MIPS-like machine

      TRANSPARENCY: Patterson-Hennessy figure 5.32

      Go over 

      (Note: In this example, there is one memory system which can accept
       an address from one of two sources (PC or ALU) and can send data
       read to one of two destinations (IR or Memory data register).)

      (Note: figure 5.33 adds some additional logic to support the jump
       instruction, which we won't discuss now.)

Copyright ©2000 - Russell C. Bjork