CS311 Lecture: Error Detecting and Correcting Codes             11/21/01

I. Introduction
-  ------------

   A. With any technology, there is a danger that information will be corrupted
      due to physical imperfections in storage media or electronic "noise".
      Since an undetected error of even 1 bit can be catastrophic, we wish to
      take measures to detect such errors as might occur.  As a minimum, we
      seek error detection; but error correction is even better.

   B. We discuss error-detecting and correcting codes here - in the context of
      memory systems - because information in storage is vulnerable to
      corruption.  The same principles are often used when transmitting data
      from one place to another over a network - the other place where data
      is especially vulnerable to corruption (even more so than in storage).

II. Simple Error detecting/correcting codes
--  ------ ----- -------------------- -----

   A. All error detection/correction schemes depend on using more bits to
      store information than one actually needs.  Of course, the simplest
      such scheme is parity.  With each data item, we store an extra bit
      called the parity bit.

      1. One convention, called odd parity, specifies that the parity bit be
         set so that the total number of 1 bits (including the parity bit) is
         odd.  Example:

                Data item       01010101
                Item + parity  101010101

                Data item       10000011
                Item + parity  010000011

      2. An alternate convention, called even parity, sets the parity bit so
         that the total number of 1 bits (including the parity bit) is even.
         The choice of which convention to use depends on what sort of
         catastrophic error is judged most likely to occur.  For example, if
         a complete system failure would normally result in all data being
         converted to 0's, then odd parity would report something wrong but
         even parity would not.  On the other hand, if system failure would
         turn all data to 1's (and the total length of data plus parity is
         odd), then even parity would be preferred.

      3. To check the correctness of an item, the receiver simply counts the
         number of 1 bits and verifies that the count is odd or even, as the
         case may be.  (Of course, both the originator and the receiver of the
         item must use the same convention as to odd/even parity.)  A short
         code sketch of this computation and check follows point 4 below.

      4. Parity is what we call a 1-bit error detecting code.

         a. It can tell us that some bit is corrupt - but not which one is
            corrupt.  When a parity error occurs, any data bit might be in
            error, or the data might itself be fine and the parity bit
            corrupt!

         b. It can be fooled by a multiple error.  If 2 bits are corrupted,
            a parity test will tell us everything is ok.
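
      To make the parity convention concrete, here is a minimal sketch in
      Python.  (The function names parity_bit and parity_ok are invented for
      this illustration; in a real memory system the parity logic is a tree
      of XOR gates, not software.)

         def parity_bit(data_bits, odd=True):
             # Choose the parity bit so that the total number of 1 bits
             # (data bits plus parity bit) is odd (or even, if odd=False).
             ones = sum(data_bits)
             if odd:
                 return 0 if ones % 2 == 1 else 1
             return 0 if ones % 2 == 0 else 1

         def parity_ok(word_bits, odd=True):
             # word_bits includes the parity bit; True if the count of 1 bits
             # matches the chosen convention.
             ones = sum(word_bits)
             return (ones % 2 == 1) if odd else (ones % 2 == 0)

         # The first example from point 1: 01010101 -> 101010101 (odd parity)
         data = [0, 1, 0, 1, 0, 1, 0, 1]
         word = [parity_bit(data)] + data
         print(word, parity_ok(word))    # [1, 0, 1, 0, 1, 0, 1, 0, 1] True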

   B. More sophisticated schemes provide for not only detecting but also
      correcting errors, and may be able to detect multi-bit errors.  One
      rather simple example is a scheme often used on magnetic tape.

      1. Data on tape is written in blocks, composed of frames.  Each frame 
         typically has 8 data bits plus a frame parity bit.  Each block has
         an extra frame called the block checksum, each of whose bits is a
         parity check on one row position throughout the length of the block -
         e.g.

                d d d d d d d d d d l <- this bit checks all the d's to its left
                d d d d d d d d d d l
                d d d d d d d d d d l
                d d d d d d d d d d l
                d d d d d d d d d d l   d = data
                d d d d d d d d d d l   f = frame parity
                d d d d d d d d d d l   l = longitudinal parity
                d d d d d d d d d d l
                f f f f f f f f f f l <- this bit checks both all the l's above
                ^                        it and all the d's to its left
                |
                This bit checks all d's above it

      2. Consider what would happen if there were a single bit in error in
         the block.  When the receiver recomputes the f's and l's, it would
         find that one f and one l are incorrect.  Together, these two
         isolate the one bit in error, which can be corrected by inverting it.
         Thus, we have 1 bit error correction.  (A code sketch of this
         procedure appears at the end of part B below.)

      3. Now suppose two bits were corrupted.

         a. If they were in the same frame, we would have 2 l's in error.

         b. If they were in the same row, we would have 2 f's in error.

         c. If they were in different frames and different rows, we would
            have two errors each in f's and l's.

          d. In any case, we would be able to detect a 2 bit error (though we
             could not correct it.)

      4. Finally, consider the effect of a 3 bit error.  In most cases, we
         would get multiple error indications in both f's and l's - letting
         us know that the data is corrupt.  However, there is one pathological
         case to be aware of.  Let e be the bits in error:

                e                       e       <- l ok here




                e                               <- l signals error here

                ^                       ^
                |                       |
                f ok here               f signals error here

         Unfortunately, this would look like a (correctable) 1 bit error.
         But we would correct the wrong bit, thus ending up with a 4 bit error!

      5. For this reason, this encoding scheme is called a one-bit error
         correcting, two-bit error detecting scheme.  Its usefulness is
         based on the assumption that errors involving three or more bits are
         so improbable that they can safely be ignored.

         a. Suppose that the probability of a bit being corrupted is
            10^-9.  (One per billion).  

            i. Such an error can be corrected and processing can proceed.  If
               we process a billion bits per second, such a situation will
               arise, on the average, once per second.

           ii. The probability of a two bit error is 10^-18.  Such an error
               can be detected - and some alternate path can be pursued to
               get the correct value.  (E.g. recomputing it from an old value
               and a log of transactions, or retransmitting if we are dealing
               with a message over a network.)  If we process a billion bits
               per second, this will occur, on the average, once every 10^9
               seconds - roughly once every 32 years.

          iii. The probability of a three bit error is 10^-27.  Such an error
               might be detected (many would be), but could escape detection.
               If we process a billion bits/second, this amounts to no more
               than one undetected error per 10^18 seconds - roughly 32
               billion years - probably fairly safe!

         b. Note that neither this scheme, nor any other, can GUARANTEE that
            an undetected error won't occur - it can only make such an error
            improbable enough to make us willing to trust the system.

         c. The fact that error-correcting and detecting schemes are only
            probably correct means that, in some sense, computer-processed
            data is never ABSOLUTELY GUARANTEED to be accurate.
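
      As a concluding illustration of points B.1 and B.2, here is a small
      sketch in Python.  It uses even parity (plain XOR) purely for
      convenience - the notes above do not fix a parity convention for tape -
      and the function names and the 4-frame example block are invented for
      this sketch.

         def xor_all(bits):
             # Even parity over a list of bits (0 iff the number of 1's is even).
             p = 0
             for b in bits:
                 p ^= b
             return p

         def encode_block(frames):
             # frames: list of 8-bit frames (the columns of d's in the diagram).
             # Append a frame parity bit to each frame, then append a checksum
             # frame whose bits are parity checks on the rows.
             with_f = [frame + [xor_all(frame)] for frame in frames]
             checksum = [xor_all([fr[row] for fr in with_f]) for row in range(9)]
             return with_f + [checksum]

         def locate_single_error(block):
             # Recompute the parities; the one bad frame (column) and the one
             # bad row together pinpoint a single corrupted bit, as in B.2.
             bad_cols = [c for c, fr in enumerate(block) if xor_all(fr) != 0]
             bad_rows = [r for r in range(9)
                         if xor_all([fr[r] for fr in block]) != 0]
             if len(bad_cols) == 1 and len(bad_rows) == 1:
                 return bad_cols[0], bad_rows[0]
             return None                        # clean block, or a multi-bit error

         frames = [[1, 0, 1, 0, 1, 0, 1, 0],
                   [1, 1, 0, 0, 1, 1, 0, 0],
                   [0, 0, 0, 0, 1, 1, 1, 1],
                   [1, 0, 0, 0, 0, 0, 1, 1]]
         block = encode_block(frames)
         block[2][5] ^= 1                       # corrupt one bit: frame 2, row 5
         print(locate_single_error(block))      # -> (2, 5); invert that bit to fix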

III. Hamming Codes
---  ------- -----

   A. Now we consider a scheme that can be used for error detection/correction
      in a single word of data.  The scheme is called a Hamming code.  The
      example we will develop is for a word length of 11 bits - but the idea
      could be extended to any word size.

      1. Let there be n data bits, numbered d0 .. d(n-1).

      2. We will add m correcting bits, where m is the smallest integer
         such that 2^m >= (n + m + 1).  (Example: 11 data bits - let m = 4 -
         2^4 = 16 = (11 + 4 + 1).)  We number these bits c0 .. c(m-1).
                                
      3. We will logically, though not necessarily physically, intersperse
         the correcting bits so that their positions in the overall word
         are powers of 2.  (We number bits in the overall word 1 .. m + n).

          Ex:   15  14  13  12  11  10   9   8   7   6   5   4   3   2   1
                d10 d9  d8  d7  d6   d5  d4  c3  d3  d2  d1  c2  d0  c1  c0

      4. Let each ci be a parity check on the remaining bits in logical
         positions which contain 2^i in their binary representation - e.g.

         c0 checks all bits in odd numbered positions
         c1 checks bits in positions 3, 6, 7, 10, 11, 14, 15
         c2 checks bits in positions 5, 6, 7, 12, 13, 14, 15
         c3 checks bits in positions 9 .. 15

         (Observe: no c checks any other c - only d's.  Thus, the c's can
          be computed knowing only the d's.)

      5. To store a word, add in the necessary correcting bits and store
         the combination.  To check a word retrieved from storage:

         a. Extract the d bits.

         b. Compute what the c bits should be.

         c. XOR the computed c bits with the actual c bits.

            i. If the result is all 0's, the stored word is ok.
           ii. If the result is non-0, treat the XOR result as the binary 
               representation of an integer.  This is the number of the bit
               position where the error occurred (assuming a 1 bit error.)

         Example (using odd parity): data originally:

                10101010101 => 1010101_010_1__ w/slots for c's

         Correcting bits: c0 = 0
                          c1 = 1
                          c2 = 0
                          c3 = 1

         Stored word is 101010110100110
                               -   - --

         Assume data bit 0 is corrupted in storage, so that word as read
         is 101010110100010

         The receiver would extract the data bits 10101010100, and would
         compute the c bits on this basis to be

                        c0 = 1
                        c1 = 0
                        c2 = 0
                        c3 = 1

         XORing with the c bits extracted from the stored data yields 0011 -
         indicating that the error is in bit 3 of the stored data, as desired.
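
         The encode/check procedure of this example can be sketched in code
         as follows.  (An illustrative Python sketch only: it hard-wires the
         11-data-bit / 4-check-bit case and the odd-parity convention used in
         the example, and the function names are invented here.)

            def hamming_encode(data_bits):
                # data_bits: [d0, d1, ..., d10].  Returns a 16-element list;
                # index 0 is unused, positions 1..15 hold the stored word with
                # c0..c3 at positions 1, 2, 4, 8 (odd parity, as above).
                word = [0] * 16
                i = 0
                for p in range(1, 16):
                    if p & (p - 1):                  # not a power of 2: data slot
                        word[p] = data_bits[i]
                        i += 1
                for c in range(4):
                    mask = 1 << c                    # position of c0 .. c3
                    ones = sum(word[p] for p in range(1, 16)
                               if (p & mask) and p != mask)
                    word[mask] = 0 if ones % 2 == 1 else 1   # make group total odd
                return word

            def hamming_syndrome(word):
                # Recompute each c from the data bits and XOR with the stored
                # c; a non-zero result is the position (1..15) of a single-bit
                # error.
                syndrome = 0
                for c in range(4):
                    mask = 1 << c
                    ones = sum(word[p] for p in range(1, 16)
                               if (p & mask) and p != mask)
                    expected = 0 if ones % 2 == 1 else 1
                    if expected != word[mask]:
                        syndrome |= mask
                return syndrome

            data = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]   # d0 .. d10 of 10101010101
            word = hamming_encode(data)
            print(''.join(str(word[p]) for p in range(15, 0, -1)))  # 101010110100110
            word[3] ^= 1                               # corrupt d0 (position 3)
            print(hamming_syndrome(word))              # -> 3, the corrupted position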

   B. The code we have presented gives 1 bit error correction, but could be
      fooled by a 2 bit error.  (We would, in fact, create a 3-bit error
      by "correcting" the wrong bit!)

   C. To add 2-bit error detection, we simply add a conventional parity bit.

      1. In the absence of error, the parity bit will indicate no error,
         and the correcting bits will show no error (exact match between
         computed and stored values.)

      2. A 1-bit error will cause the parity bit to report an error, and
         the correcting bits can be used to correct it.

      3. A 2-bit error will cause the parity bit to NOT report an error,
         but the correcting bits WILL report an error - this is taken as
         an indication of a double error that we can detect but not fix.
            
         (There is no way to corrupt exactly two bits in such a way as to
          cause the correcting bits to report no error: every bit position -
          data or correcting - affects a different, non-empty set of
          correcting-bit checks, so the effects of two corrupted bits can
          never cancel each other out completely.)
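
      The decision rule in points 1 through 3 can be written as a small
      function.  (A Python sketch; the name classify is invented here, and
      the second branch - a clean syndrome but a failing parity bit, meaning
      only the extra parity bit itself was corrupted - is the one remaining
      case not listed explicitly above.)

         def classify(syndrome, parity_error):
             # syndrome: result of the Hamming check (0 means the c bits match);
             # parity_error: True if the extra whole-word parity bit fails.
             if not parity_error and syndrome == 0:
                 return "no error"
             if parity_error and syndrome == 0:
                 return "only the extra parity bit was corrupted - data is fine"
             if parity_error:
                 return "single-bit error at position %d - invert it" % syndrome
             return "double-bit error - detected but not correctable"

         print(classify(0, False))   # no error
         print(classify(3, True))    # single-bit error at position 3 - invert it
         print(classify(5, False))   # double-bit error - detected but not correctable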

IV. More Sophisticated Schemes
--- ---- ------------- -------

   A. There are, of course, other error detecting and correcting schemes.
      The most widely used scheme is based on cyclic-redundancy polynomials,
      which we can't go into here.  Such schemes can be made to yield not
      only 1-bit error correction and 2-bit error detection, but also
      detection of longer error bursts (successive corrupted bits).

   B. A final general observation on the cost of error detection/correction
      in terms of added bits.  Some terminology:

      1. Any error detection/correction scheme operates by creating an
         extended encoding for the data in such a way that certain bit patterns
         are legal and others are not.  (For example, odd parity creates an
         encoding in which words having an odd number of 1's are legal and
         those having an even number are not.)

      2. For any such encoding, we can define the distance between two legal
         words as the number of bits that would need to change to go from one
         to the other.  For example, the distance from 101010 to 101100 is 2.

      3. For any encoding scheme, we can define the minimum distance to be
         the smallest distance between any two legal words.  For example,
         for simple parity the minimum distance is two.

      4. Now observe:

         a. For a 1-bit error detecting code, a minimum distance of 2 is 
            necessary.

         b. For a 1-bit error correcting code, a minimum distance of 3 is 
            necessary.  This guarantees that for any erroneous word
            there will be at most 1 legal word that has a distance of 1 from
            it.  We take that legal word to be the original data that was
            corrupted.

         c. For 1-bit error correction plus 2 bit error detection, we
            need a minimum distance of 4.

         d. In general, for n-bit error correction we need a minimum distance
            of 1 + 2*n, and for m-bit error detection (m >= n) we need to
            increase the distance by m-n.

       5. We can reduce the sheer number of added bits by generating the
          detecting/correcting bits for longer blocks of data.

          Example: assume we want 1-bit error correction and 2-bit error
                   detection; we need a minimum distance of 4.

          If we have a 128 byte message, we could achieve this by adding
          5 error-detecting/correcting bits to each byte, for a total of
          128 * (8 + 5) = 1664 bits.    (2^4 >= 8+4+1; add 1 for parity)

          Or, we could treat the message as 32 "words" of 4 bytes each,
          appending 7 error-detecting/correcting bits to each word, for a
          total of 32 * (32 + 7) = 1248 bits.  (2^6 >= 32+6+1; add 1 for parity)

          Or, we could treat the message as 16 "words" of 8 bytes each,
          appending 8 error-detecting/correcting bits to each word, for a
          total of 16 * (64 + 8) = 1152 bits. (2^7 >= 64+7+1; add 1 for parity)

          Or, we could treat the entire message as a single long "word",
          appending 12 error-detecting/correcting bits to it, for a total
          of 1036 bits! (2^11 >= 1024+11+1; add 1 for parity)
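
          These totals can be verified with a few lines of code.  (A Python
          sketch; the function names are invented for this illustration.)

             def check_bits_needed(n):
                 # Smallest m with 2^m >= n + m + 1 (the Hamming condition),
                 # plus one overall parity bit for 2-bit error detection.
                 m = 0
                 while 2 ** m < n + m + 1:
                     m += 1
                 return m + 1

             def total_bits(message_bits, word_bits):
                 # Total stored bits if the message is split into words of
                 # word_bits bits, each with its own check bits appended.
                 words = message_bits // word_bits
                 return words * (word_bits + check_bits_needed(word_bits))

             for w in (8, 32, 64, 1024):
                 print(w, total_bits(1024, w))
             # -> 8 1664, 32 1248, 64 1152, 1024 1036  (the totals above)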

Copyright ©2001 - Russell C. Bjork