CS311 Lecture: Floating Point Arithmetic 9/25/03
Objectives
1. Review IEEE 754 representation
2. Discuss the basic process of doing floating point arithmetic
3. Introduce MIPS facilities for floating point arithmetic
I. Review of basic issues
- ------ -- ----- ------
A. Recall that we learned earlier in the course about the standard way to
represent real numbers internally as a mantissa times a power of some
radix - i.e.
        m * r^e
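        For example, 6.25 can be written as 0.625 * 10^1 in radix 10, or as
        1.1001 * 2^2 in radix 2 (the form used by IEEE 754).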
B. We also met the most widely-used representation for floating-point
numbers - IEEE 754
1. The basic format
         31 30       23 22                     0
         ----------------------------------------
         | s |   exp    |       fraction        |
         ----------------------------------------
s = sign of mantissa: 0 = +, 1 = -
exp = exponent (as a power of 2), stored in excess-127 form - i.e. the
value stored in this field is 127 + the true value of the exponent.
The significand (magnitude of mantissa) is normalized to lie in the
range 1.0 <= |m| < 2.0. This implies that the leftmost bit is a 1.
It is not actually stored - the 23 bits allocated are used to store
the bits to the right of the binary point, and a 1 is inserted to
the left of the binary point by the hardware when doing arithmetic.
(This is called hidden-bit normalization, and is why this field is
labelled "fraction".) In effect, we get 24 bits of precision by
storing 23 bits.
2. We saw that certain exponent field values (0, 255) are reserved for
   special purposes, and when they appear the interpretation of the
   significand changes:
     exp = 0:   zero (fraction = 0) or denormalized numbers (fraction != 0)
     exp = 255: +/- infinity (fraction = 0) or NaN (fraction != 0)
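   Example (a worked encoding of the basic format described in 1 above):
   representing -6.25 in single precision.
     -6.25 = -110.01 (binary) = -1.1001 * 2^2
     s        = 1                            (negative)
     exp      = 127 + 2 = 129 = 10000001 (binary)
     fraction = 10010000000000000000000      (the bits to the right of the
                                              binary point; the leading 1
                                              is the hidden bit)
     Stored word: 1 10000001 10010000000000000000000 = 0xC0C80000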
C. Arithmetic on floating point numbers is, of course, much more complex
than integer (or fixed-point) arithmetic.
1. Floating point addition or subtraction entails the following steps:
a. Reinsertion of the hidden bit.
b. Denormalization: the mantissa of the operand with the smaller exponent
   is shifted right until the two exponents are equal.
c. The addition/subtraction proper.
d. Renormalization: shifting the result and adjusting its exponent, if
   necessary, so that the mantissa again lies in the range 1.0 <= m < 2.0.
e. Preparation for storage (rounding using the guard bits,
removal of the hidden bit)
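   Example (binary mantissas, a few bits shown for clarity):
     1.100 * 2^2  +  1.001 * 2^0
     Denormalize the smaller operand:  1.001 * 2^0 = 0.01001 * 2^2
     Add:                              1.10000 + 0.01001 = 1.11001 * 2^2
     The sum is already normalized, so it is rounded and stored.
   Had the sum reached 2.0 or more (e.g. 1.1 + 1.1 = 11.0), renormalization
   would shift it right one place and add 1 to the exponent.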
2. Floating point division and multiplication are - relatively speaking -
simpler than addition and subtraction.
a. The basic rule for multiplication is
i. Reinsert the hidden bit.
ii. Multiply the mantissas
iii. Add the exponents
iv. If necessary, normalize the product by shifting right and increase
    the exponent by 1. (Note that if the mantissas are normalized, they
    lie in the range 1 <= m < 2; therefore, the product of the mantissas
    lies in the range 1 <= m < 4, so at most one right shift is needed.)
v. Store the result less the hidden bit after appropriate rounding.
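   Example (binary mantissas):
     (1.100 * 2^1) * (1.100 * 2^2)              i.e. 3.0 * 6.0
     Multiply mantissas: 1.100 * 1.100 = 10.01      (1.5 * 1.5 = 2.25)
     Add exponents:      1 + 2 = 3
     Normalize:          10.01 * 2^3 = 1.001 * 2^4  (= 18.0)
     Round, drop the hidden bit, and store.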
b. The basic rule for division is
i. Reinsert the hidden bit.
ii. Divide the mantissas
iii. Subtract the exponents
iv. If necessary, normalize the quotient by shifting left and decrease
    the exponent by 1. (Note that if the mantissas are normalized, they
    lie in the range 1 <= m < 2; therefore, the quotient of the mantissas
    lies in the range 0.5 < m < 2.0, so at most one left shift is needed.)
v. Store the result less the hidden bit after appropriate rounding.
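   Example (binary mantissas):
     (1.000 * 2^3) / (1.100 * 2^1)              i.e. 8.0 / 3.0
     Divide mantissas:   1.000 / 1.100 = 0.10101...     (1 / 1.5 = 0.666...)
     Subtract exponents: 3 - 1 = 2
     Normalize:          0.10101... * 2^2 = 1.0101... * 2^1  (= 2.666...)
     Round, drop the hidden bit, and store.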
3. As can be seen, a floating point arithmetic unit needs to be able to
   add and subtract exponents, and to shift, add, subtract, multiply, and
   divide mantissas. The mantissa multiplication and division can be done
   either by using the same hardware as the integer multiply/divide
   operations or by special, dedicated hardware.
II. Floating Point on MIPS
-- -------- ----- -- ----
A. MIPS floating point arithmetic is performed by a coprocessor, which is
   typically part of the main CPU chip but has its own dedicated registers
   and arithmetic circuitry.
B. The MIPS floating point coprocessor has its own register set, consisting
of 32 floating point registers.
1. On 32 bit MIPS architectures, these registers are 32 bits each.
Double precision arithmetic is done by using PAIRS of registers.
(The instruction specifies the even-numbered member of the pair,
and even single-precision arithmetic is restricted to using only
the even numbered registers.)
2. On 64 bit MIPS architectures, each register can hold a double-precision
   number by itself; however, to support older software the 64 bit version
   of the architecture can emulate the 32 bit version (using pairs of
   registers.)
3. These registers are denoted $f0, $f1 ... Note that $f0 is a regular
register - it is not hardwired to zero.
C. The following are the basic MIPS floating point instructions. (There
are quite a few more!)
1. Load/Store floating point register:
lwc1: Load word into coprocessor 1 register. (The Floating Point
coprocessor is considered coprocessor 1)
swc1: Store word from coprocessor 1 register
Both of these use I format. rs - the base register used in
calculating the memory address - is always interpreted as one of the
INTEGER registers, and rt - the register that is loaded or stored -
is one of the FLOATING POINT registers.
2. Move value between floating point registers and integer registers:
mtc1: Move word from an integer register to floating point register
mfc1: Move word from a floating point register to an integer register
3. Move value between floating point registers (single and double
precision versions):
mov.s mov.d
Note that this is a special instruction - in contrast to integer
arithmetic, we don't use "add to zero" to move values between registers
for two reasons:
- There is no floating point register hardwired to zero
- Floating add consumes more clock cycles than a simple move
4. Floating point arithmetic (single and double precision versions.)
add.s add.d
sub.s sub.d
mul.s mul.d
div.s div.d
These all use R format, but interpret the register specifier fields
as designating floating point registers.
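   Example - a short SPIM-style sketch pulling together the load/store,
   move, and arithmetic instructions above (the labels and data values are
   made up purely for illustration; loading through a label like this is a
   pseudo-instruction that the assembler expands using an integer base
   register, as described in 1 above):

             .data
   a:        .float  2.5
   b:        .float  1.5
   result:   .float  0.0

             .text
   main:     lwc1   $f0, a          # load a into coprocessor 1 register $f0
             lwc1   $f2, b          # load b into $f2
             add.s  $f4, $f0, $f2   # $f4 = a + b        (single precision)
             mul.s  $f4, $f4, $f0   # $f4 = (a + b) * a
             mov.s  $f6, $f4        # copy the result into $f6
             swc1   $f4, result     # store the result back to memory
             jr     $ra             # return to the caller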
D. MIPS handles testing the results of a floating point operation in
a different way from integers, by using a special bit in the coprocessor
called a CONDITION CODE.
1. The floating point comparison instructions compare two floating point
   numbers and set the condition code bit based on the result.
a. There are actually four orderings possible for a pair of floating
point numbers a and b:
a < b
a == b
a > b
a and b are unordered (true whenever at least one of them is NaN)
b. Since a > b can be tested as b < a (or by branching on the opposite
   outcome of <=), there is no separate test for it. However, we have to
   allow for all combinations of the remaining tests:
eq    ==
olt   <
ole   <=
un    unordered
ueq   == or unordered
ult   <  or unordered
ule   <= or unordered
c. The instructions for comparing two floating point operands
   specify two registers and a condition to be tested (single and
   double-precision versions):
c.x.s reg, reg where x is the condition to test - e.g.
c.x.d reg, reg
Example: c.eq.s $f0, $f2
2. There are two instructions for testing the condition code bit:
bc1t Branch if coprocessor 1 condition code set (true)
bc1f Branch if coprocessor 1 condition code clear (false)
Used after a comparison that sets the bit as desired
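   Example - computing the single-precision minimum of $f0 and $f2 into $f4
   (a sketch using one of the comparisons listed above; it assumes neither
   operand is NaN):

             c.olt.s  $f0, $f2      # condition code := ($f0 < $f2)
             bc1t     take_a        # branch if the condition code is set
             mov.s    $f4, $f2      # here $f0 >= $f2, so $f2 is the minimum
             j        done
   take_a:   mov.s    $f4, $f0      # here $f0 < $f2, so $f0 is the minimum
   done:     ...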
E. There are a lot of complexities that I have avoided discussing, arising
from the need to handle specific exceptional conditions spelled out
in the IEEE standard (overflow, underflow, arithmetic with NaN, etc.)
Copyright ©2003 - Russell C. Bjork