CS311 Lecture: Floating Point Arithmetic 9/25/03
Objectives
1. Review IEEE 754 representation
2. Discuss the basic process of doing floating point arithmetic
3. Introduce MIPS facilities for floating point arithmetic
I. Review of basic issues
- ------ -- ----- ------
A. Recall that we learned earlier in the course about the standard way to
represent real numbers internally as a mantissa times a power of some
radix - i.e.
        m * r^e
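        For example, 6.25 can be written as 0.625 * 10^1 in radix 10, or as
        1.1001 * 2^2 in radix 2 (the form used by IEEE 754).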
B. We also met the most widely-used representation for floating-point
numbers - IEEE 754
1. The basic format
         31 30       23 22                     0
         ----------------------------------------
         | s |   exp    |       fraction        |
         ----------------------------------------
s = sign of mantissa: 0 = +, 1 = -
exp = exponent (as a power of 2), stored in excess-127 form - i.e. the
value stored in this field is 127 + the true value of the exponent.
The significand (magnitude of mantissa) is normalized to lie in the
range 1.0 <= |m| < 2.0. This implies that the leftmost bit is a 1.
It is not actually stored - the 23 bits allocated are used to store
the bits to the right of the binary point, and a 1 is inserted to
the left of the binary point by the hardware when doing arithmetic.
(This is called hidden-bit normalization, and is why this field is
labelled "fraction".) In effect, we get 24 bits of precision by
storing 23 bits.
2. We saw that certain exponent field values (0, 255) are reserved for
   special purposes, and when they appear the interpretation of the
   significand changes:
     exp = 0:   zero (fraction = 0) or denormalized numbers (fraction != 0)
     exp = 255: +/- infinity (fraction = 0) or NaN (fraction != 0)
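   Example (a worked encoding of the basic format described in 1 above):
   representing -6.25 in single precision.
     -6.25 = -110.01 (binary) = -1.1001 * 2^2
     s        = 1                            (negative)
     exp      = 127 + 2 = 129 = 10000001 (binary)
     fraction = 10010000000000000000000      (the bits to the right of the
                                              binary point; the leading 1
                                              is the hidden bit)
     Stored word: 1 10000001 10010000000000000000000 = 0xC0C80000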
C. Arithmetic on floating point numbers is, of course, much more complex
than integer (or fixed-point) arithmetic.
1. Floating point addition or subtraction entails the following steps:
a. Reinsertion of the hidden bit.
b. Denormalization: the mantissa of the operand with the smaller exponent
   is shifted right until the two exponents are equal.
c. The addition/subtraction proper.
d. Renormalization: shifting the result and adjusting its exponent, if
   necessary, so that the mantissa again lies in the range 1.0 <= m < 2.0.
e. Preparation for storage (rounding using the guard bits,
removal of the hidden bit)
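   Example (binary mantissas, a few bits shown for clarity):
     1.100 * 2^2  +  1.001 * 2^0
     Denormalize the smaller operand:  1.001 * 2^0 = 0.01001 * 2^2
     Add:                              1.10000 + 0.01001 = 1.11001 * 2^2
     The sum is already normalized, so it is rounded and stored.
   Had the sum reached 2.0 or more (e.g. 1.1 + 1.1 = 11.0), renormalization
   would shift it right one place and add 1 to the exponent.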
2. Floating point division and multiplication are - relatively speaking -
simpler than addition and subtraction.
a. The basic rule for multiplication is
i. Reinsert the hidden bit.
ii. Multiply the mantissas
iii. Add the exponents
iv. If necessary, normalize the product by shifting right and increase
    the exponent by 1. (Note that if the mantissas are normalized, they
    lie in the range 1 <= m < 2; therefore, the product of the mantissas
    lies in the range 1 <= m < 4, so at most one right shift is needed.)
v. Store the result less the hidden bit after appropriate rounding.
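   Example (binary mantissas):
     (1.100 * 2^1) * (1.100 * 2^2)              i.e. 3.0 * 6.0
     Multiply mantissas: 1.100 * 1.100 = 10.01      (1.5 * 1.5 = 2.25)
     Add exponents:      1 + 2 = 3
     Normalize:          10.01 * 2^3 = 1.001 * 2^4  (= 18.0)
     Round, drop the hidden bit, and store.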
b. The basic rule for division is
i. Reinsert the hidden bit.
ii. Divide the mantissas
iii. Subtract the exponents
iv. If necessary, normalize the quotient by shifting left and decrease
    the exponent by 1. (Note that if the mantissas are normalized, they
    lie in the range 1 <= m < 2; therefore, the quotient of the mantissas
    lies in the range 0.5 < m < 2.0, so at most one left shift is needed.)
v. Store the result less the hidden bit after appropriate rounding.
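   Example (binary mantissas):
     (1.000 * 2^3) / (1.100 * 2^1)              i.e. 8.0 / 3.0
     Divide mantissas:   1.000 / 1.100 = 0.10101...     (1 / 1.5 = 0.666...)
     Subtract exponents: 3 - 1 = 2
     Normalize:          0.10101... * 2^2 = 1.0101... * 2^1  (= 2.666...)
     Round, drop the hidden bit, and store.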
3. As can be seen, a floating point arithmetic unit needs to be able to
   add and subtract exponents, and to shift, add, subtract, multiply, and
   divide mantissas. The mantissa multiplication and division can be done
   either by using the same hardware as the integer multiply/divide
   operations or by special, dedicated hardware.
II. Floating Point on MIPS
-- -------- ----- -- ----
A. MIPS floating point arithmetic is performed by a coprocessor, which is
   typically part of the main CPU chip but has its own dedicated registers
   and arithmetic circuitry.
B. The MIPS floating point coprocessor has its own register set, consisting
of 32 floating point registers.
1. On 32 bit MIPS architectures, these registers are 32 bits each.
Double precision arithmetic is done by using PAIRS of registers.
(The instruction specifies the even-numbered member of the pair,
and even single-precision arithmetic is restricted to using only
the even numbered registers.)
2. On 64 bit MIPS architectures, each register can hold a double-precision
   number by itself; however, to support older software the 64 bit version
   of the architecture can emulate the 32 bit version (using pairs of
   registers.)
3. These registers are denoted $f0, $f1 ... Note that $f0 is a regular
register - it is not hardwired to zero.
C. The following are the basic MIPS floating point instructions. (There
are quite a few more!)
1. Load/Store floating point register:
lwc1: Load word into coprocessor 1 register. (The Floating Point
coprocessor is considered coprocessor 1)
swc1: Store word from coprocessor 1 register
Both of these use I format. rs - the base register used in
calculating the memory address - is always interpreted as one of the
INTEGER registers, and rt - the register that is loaded or stored -
is one of the FLOATING POINT registers.
2. Move value between floating point registers and integer registers:
mtc1: Move word from an integer register to floating point register
mfc1: Move word from a floating point register to an integer register
3. Move value between floating point registers (single and double
precision versions):
mov.s mov.d
Note that this is a special instruction - in contrast to integer
arithmetic, we don't use "add to zero" to move values between registers
for two reasons:
- There is no floating point register hardwired to zero
- Floating add consumes more clock cycles than a simple move
4. Floating point arithmetic (single and double precision versions.)
add.s add.d
sub.s sub.d
mul.s mul.d
div.s div.d
These all use R format, but interpret the register specifier fields
as designating floating point registers.
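   Example - a short SPIM-style sketch pulling together the load/store,
   move, and arithmetic instructions above (the labels and data values are
   made up purely for illustration; loading through a label like this is a
   pseudo-instruction that the assembler expands using an integer base
   register, as described in 1 above):

             .data
   a:        .float  2.5
   b:        .float  1.5
   result:   .float  0.0

             .text
   main:     lwc1   $f0, a          # load a into coprocessor 1 register $f0
             lwc1   $f2, b          # load b into $f2
             add.s  $f4, $f0, $f2   # $f4 = a + b        (single precision)
             mul.s  $f4, $f4, $f0   # $f4 = (a + b) * a
             mov.s  $f6, $f4        # copy the result into $f6
             swc1   $f4, result     # store the result back to memory
             jr     $ra             # return to the caller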
D. MIPS handles testing the results of a floating point operation in
a different way from integers, by using a special bit in the coprocessor
called a CONDITION CODE.
1. The floating point comparison instructions compare two floating point
   numbers and set the condition code bit based on the result.
a. There are actually four orderings possible for a pair of floating
point numbers a and b:
a < b
a == b
a > b
a and b are unordered (true whenever at least one of them is NaN)
b. Since a > b can be tested as b < a (or by branching on the opposite
   outcome of <=), there is no separate test for it. However, we have to
   allow for all combinations of the remaining tests:
eq    ==
olt   <
ole   <=
un    unordered
ueq   == or unordered
ult   <  or unordered
ule   <= or unordered
c. The instructions for comparing two floating point operands
   specify two registers and a condition to be tested (single and
   double-precision versions):
c.x.s reg, reg where x is the condition to test - e.g.
c.x.d reg, reg
Example: c.eq.s $f0, $f2
2. There are two instructions for testing the condition code bit:
bc1t Branch if coprocessor 1 condition code set (true)
bc1f Branch if coprocessor 1 condition code clear (false)
Used after a comparison that sets the bit as desired
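   Example - computing the single-precision minimum of $f0 and $f2 into $f4
   (a sketch using one of the comparisons listed above; it assumes neither
   operand is NaN):

             c.olt.s  $f0, $f2      # condition code := ($f0 < $f2)
             bc1t     take_a        # branch if the condition code is set
             mov.s    $f4, $f2      # here $f0 >= $f2, so $f2 is the minimum
             j        done
   take_a:   mov.s    $f4, $f0      # here $f0 < $f2, so $f0 is the minimum
   done:     ...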
E. There are a lot of complexities that I have avoided discussing, arising
from the need to handle specific exceptional conditions spelled out
in the IEEE standard (overflow, underflow, arithmetic with NaN, etc.)
Copyright ©2003 - Russell C. Bjork