CS222 Lecture: Memory Systems                           Last revised 4/23/99

Introduction

   A. In CS221, we looked at the basic building blocks of memory systems: the
      individual devices - chips, disks, tapes, etc.

   B. We now focus on complete memory systems.  We will see that memory systems
      are seldom composed of just one type of memory; instead, they are
      HIERARCHICAL systems composed of a mixture of technologies aimed at
      achieving a good tradeoff between speed, capacity, cost, and physical
      size.

I. The memory hierarchy

   A. Since every instruction executed by the CPU requires at least one
      memory access (to fetch the instruction) and often more, the performance
      of memory has a strong impact on system performance.

   B. With present technologies, it turns out to be possible to build very fast
      memories - that are quite expensive - or to build quite inexpensive
      memories - that are relatively slow.  The following table summarizes the
      currently available technologies:

   Technology  typical quantity in   access time   transfer rate    $/bit
               a multi-user                        (bytes/sec)
               system (bytes)

   CPU registers  10^1..10^5            ~ 10 ns         10^9            -
   and "on chip"
   memory

   Static         0..10^6               ~ 20 ns         10^8        10^-3..10^-4
   RAM

   Dynamic        10^6..10^8             ~ 60 ns         10^7        10^-5..10^-6
   RAM

   Disk           10^8..10^11           ~ 10 ms         10^6        10^-7

   Tape           10^8..10^10 / reel    > 1 sec (due    10^6        10^-7..10^-9
                  (unlimited # reels)     to mounting)

   C. Thus, sophisticated systems will often have a HIERARCHY of memories, 
      using several different technologies, with a relatively small amount of 
      very fast memory, a much larger amount of fast memory (MOS), and a very
      large amount of slow memory (disk and tape.)

      NOTE: The quantity figures above represent a range of systems, from
            moderate PC's to multi-user systems.

      1. The hierarchy can be pictured this way:

                        _________________________________
        Registers       | CPU registers   (part of CPU) |
        ----------------|-------------------------------|---------------
        "Memory"        | Very fast memory      (cache) |
                        | (on CPU chip and/or separate) |
                        |-------------------------------|
                        | Fast memory           (main)  |
                        | (Dynamic RAM)                 |
                        |                               |
                        |                               |
                        |                               |
                        |                               |
                        |                               |
                        |                               |
                        |-------------------------------|
                        | Slow memory         (virtual) |
                        | (Disk)                        |
                        |                               |
                        |                               |
                        |                               |
                        |                               |
                        |                               |
                        |                               |
                        :                               :
                        :                               :
        ----------------|-------------------------------|---------------
        File system     | Disk and tape                 |
                        :                               :
                        :                               :

      2. Note that the portion described as "memory" corresponds to memory as
         it is viewed by the assembly language program (locations that can be
         specified by the operand address part of an instruction.)   Within
         this portion, only main memory is absolutely needed.

         a. Cache memory serves to provide fast access to a subset of
            memory that is needed often, and contains copies of frequently
            accessed locations in main memory.

         b. Virtual memory serves to provide additional capacity beyond what
            is physically available in main memory.  

         c. Both look to the programmer like main memory, but are not 
            physically implemented as such.

      3. Note, too, that disk plays two roles - as part of the memory system 
         and as part of the file system, and that tape is treated here as part 
         of the file system.  This is because disk and tape are often accessed
         from programs using IO statements, and are physically connected into
         the system as part of the IO system.  We consider only the virtual
         memory role for disk in this lecture.

   D. This hierarchy can provide very good performance by taking advantage of
      the principle of LOCALITY OF REFERENCE: if one observes the memory
      references generated by a program during a short window of time, it will
      be the case that most of the references will pertain to a small subset
      of the total address space of the program.

      1. This happens because most of the execution time of a program is spent
         executing loops.

      2. Within a loop, the same body of code is being executed over and over;
         in addition, it is usually the case that there are some data items
         being accessed over and over.

      3. The basic idea, then, is to keep in the fastest memory in the 
         hierarchy those data items that are being used currently, with the
         moderately fast memory used for items that will be needed soon and
         slow memory used for items that will not be needed until the distant
         future.  

         a. Assembly language programmers and optimizing compilers try to
            make good use of the CPU registers toward that end.  This will not
            be something we will discuss here.

         b. We will focus on the memory system proper.

   E. We will proceed soon to discuss cache memory and virtual memory in
      detail.  First, though, we must consider a little more of the interface
      between the CPU and the memory system as a whole.

      1. As a user or system program runs, it generates a stream of memory
         references for instruction fetches, operand fetches, and result
         stores.  These take the form of an address plus a read/write
         specifier - e.g.

              CPU ---> memory

              Read  1703
              Write 3424
              Read  1704

      2. The majority of requests are reads:

         a. Instruction fetches: every instruction involves at least one read

         b. In evaluating expressions, intermediate results are typically kept
            in CPU registers with a final store at the end.

              Ex:   X := A + B * C - D

            would need (on a 1-address plus general register machine):

            5 instruction fetches
            4 operand fetches
            1 store

            So 90% of the accesses are reads.  (70% - 90% reads is typical)

         c. Therefore, the primary design goal in the memory system is to 
            optimize reads from memory - without, of course, penalizing writes.

      3. Another issue we need to discuss is memory management:

         a. Many systems place some address translation hardware between the 
            CPU and the memory system.   This is called a memory management 
            unit, and may be physically part of the CPU or may be a separate
            device.

                CPU ---- Memory management ---- Memory system

         b. Each logical address generated by the CPU is translated into a
            physical address by the MMU.

         c. The original incentive for doing this was multiprogramming - 
            allowing several different programs (perhaps belonging to several
            different users) to be resident in memory at the same time.

            i. To prevent conflicts between the programs, it is necessary to
               ensure that each references a distinct region of physical
               memory.

           ii. However, when a program is written it is generally not possible
               to know what region of physical memory it will ultimately occupy
               when it is running.

          iii. Thus, programs might be written on the assumption that their 
               range of accessible memory addresses ranges from 0 .. some upper
               limit.  It is the task of the MMU to translate the addresses
               generated by the program into real physical addresses.

         d. Memory management also supports virtual memory - the MMU may
            translate some addresses into a reference to data currently on
            disk, rather than in main memory, and may cause the data to be
            brought from disk into main memory.

         e. We will say more about memory management later.
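
          (As a concrete illustration - a pure base-and-limit relocation
          scheme, not any particular machine's MMU - the following C sketch
          checks each logical address against a limit register and then adds
          a base register.  The register values and the fault handling here
          are hypothetical.)

             #include <stdio.h>
             #include <stdint.h>

             /* Hypothetical base-and-limit MMU state */
             typedef struct {
                 uint32_t base;   /* start of the program's physical region */
                 uint32_t limit;  /* size of the region */
             } mmu_t;

             /* Translate a logical address: returns 1 and fills *phys on
                success, 0 on a fault (address outside the allowed region) */
             int translate(const mmu_t *mmu, uint32_t logical, uint32_t *phys)
             {
                 if (logical >= mmu->limit)
                     return 0;                   /* protection fault */
                 *phys = mmu->base + logical;    /* relocate */
                 return 1;
             }

             int main(void)
             {
                 mmu_t mmu = { 0x40000, 0x10000 }; /* loaded at 256K, 64K long */
                 uint32_t phys;
                 if (translate(&mmu, 0x1703, &phys))
                     printf("logical 0x1703 -> physical 0x%X\n", (unsigned)phys);
                 return 0;
             }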

II. Cache Memory

   A. One technique used to improve overall memory system performance is CACHE
      MEMORY.

      1. At one time, cache memory was a feature generally found only in 
         higher-end computer systems.  However, as CPU speeds have continued to
         increase while memory speeds have not, cache memory has become a
         necessity on even desktop computer systems.

         a. About 10 years ago, common CPU clock speeds were on the
            order of 8-16 MHz, and DRAM cycle times on the order of 70-80 ns.
            Under these circumstances, it would be possible to perform a
            memory access every 1-2 clock cycles.

         b. Today, CPU clocks have gone to 200-500 MHz, while DRAM access time
            has improved only slightly, to about 60 ns.  Thus, an access to
            main memory takes on the order of 20-30 clocks.

         c. Since a RISC is designed to execute one instruction per clock,
            and since each instruction must be fetched from memory, it would
            seem that RISC's couldn't function at the clock speeds they do
            using memory of the sort available today.

         d. Execution of one (or more) instructions per clock is critically
            dependent on the use of cache memory.
   
      2. Cache memory is a small, high-speed memory, logically located between 
         the CPU and the rest of the memory system.  

         a. At one point in time, cache memory was usually separate from the
            CPU.

         b. Today's high-speed RISCs depend on having cache memory on the CPU
            chip that operates at the same speed as the CPU itself.  This has
            been made possible by improved chip manufacturing techniques that
            allow more transistors per chip.

         c. Many systems now use a two-level cache, with a small primary cache
            on the CPU chip and a larger, separate secondary cache.

      3. Cache memory works because of the phenomenon of locality of reference -
         in any given interval of time, a program tends to do most of its memory
         accesses in a limited region of memory.  This comes about due to loops
         in the code and frequently-used data structures in the data.  The goal
         is to keep as many as possible of the currently-being-used memory 
         locations in the cache.

         a. Each memory read is first tried against the cache.  If the data is
            found there (a "hit"), the processor can proceed at maximum speed.
        
         b. Otherwise, we have a cache miss and the processor must wait for a 
            slower access to secondary cache or (if there is a miss there
            too) to main memory.
   
         c. Well-designed caches can achieve hit rates of 95% or more much of 
            the time.

         d. To see why this is important, suppose we have a 200 MHz CPU with
            a memory system that requires 100 ns to do an access (including 
            bus overhead) and a single, on-chip cache.

            i. Theoretically, the time to execute an instruction is 5 ns.

           ii. Suppose, however, that the cache has a hit rate of 90%.  Then
               90% of instructions can be fetched in 5 ns, but 10% require
               100 ns.  Now the average time per instructions becomes
               .9 x 5 ns + .1 x 100 ns = 14.5 ns - which is equivalent to
               reducing the clock rate by a factor of three!  (Plus any loss
               of time due to data cache misses.)

          iii. With a 95% hit rate primary cache, and a 20 ns secondary
               cache that hits 90% of the primary cache misses, we get an
               average instruction fetch time of

               .95 x 5 ns + (.05)(.9) x 20 ns + (.05)(.1) x 100 ns = 6.15 ns
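
               (The arithmetic in these examples is easy to check with a short
                C program; the hit rates and access times are just the figures
                assumed above.)

                  #include <stdio.h>

                  int main(void)
                  {
                      /* assumed: 5 ns on-chip cache, 20 ns secondary
                         cache, 100 ns main memory */
                      double t1 = 5.0, t2 = 20.0, tm = 100.0;

                      /* one cache level, 90% hit rate */
                      printf("one level:  %.2f ns\n",
                             0.90*t1 + 0.10*tm);

                      /* two levels: 95% primary hits; the secondary
                         catches 90% of the primary's misses */
                      printf("two levels: %.2f ns\n",
                             0.95*t1 + 0.05*0.90*t2 + 0.05*0.10*tm);
                      return 0;
                  }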
               
   B. In principle, a cache is an associative memory containing pairs of the
      form:
   
                   memory address               value
   
      (Note: On a byte-addressable machine, it is common to have each entry in
       the cache contain several successive bytes - called a line or block.
       For example, most VAXes use a cache based on entries holding 8 
       consecutive bytes of data.  Thus, the lower three bits of an address 
       would be ignored when looking up an entry, but once the entry is found 
       they would be used to select the correct portion of it.)

      1. However, a fully associative memory is impractical.  For example, in
         a cache of 1000 entries, each address emitted by the CPU would have to 
         be compared to all 1000 entries at the same time.  Even if the required
         number of comparators could be economically built, the incoming address
         would have to drive 1000 logic loads.  This would require several 
         layers of buffering (since a typical gate output can drive < 10 
         others), which would inject intolerable delays.  So one of several 
         approximations to associative memory is used.

      2. Direct mapping.
   
         a. The cache is constructed as a set of pairs of the form tag, value.
            The data portion of the pair is called a CACHE LINE, or CACHE
            BLOCK.  Typically, the size of the line or block corresponds to the 
            system data bus size - e.g. 4-16 bytes - for first-level cache,
            but may be much bigger for a second-level cache (as big as, say,
            128 bytes).

         b. The number of entries is always a power of 2.  Suppose it is 2^n.
   
         c. Then the address from the CPU is broken down as follows, assuming
            each line contains 8 bytes of data:
   
                   tag        | entry number |  position in entry
                              |    n  bits   |  3 bits
    
            i. Bits 3 .. n + 2 of the address select one of the entries in
               the cache.  If the tag of that entry matches the rest of the
               address, we have a hit.  (The tag comparison is needed because
               many addresses map to the same cache entry.)
   
           ii. Otherwise, a complete line of data is obtained from memory.
               It is stored (along with the tag portion of its address) in the 
               cache for future use, replacing the entry currently there.
   
          iii. Note that this scheme implies that at any time at most one entry
               with any given pattern in its n entry-number bits can be in the
               cache.  This is usually not a problem - but may be if a loop
               accesses a data structure whose address is a multiple of the
               cache size (2^(n+3) bytes, given 8-byte lines) away from one of
               the instructions in the loop.
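
          (A minimal C sketch of this lookup, assuming 8-byte lines and
          2^n entries with n = 10; the structure and the names here are
          illustrative, not any particular machine's.)

             #include <stdint.h>

             #define N       10           /* 2^N = 1024 entries (assumed) */
             #define ENTRIES (1u << N)

             typedef struct {
                 int      valid;
                 uint32_t tag;
                 uint8_t  data[8];        /* one 8-byte line */
             } line_t;

             line_t cache[ENTRIES];

             /* Returns the addressed byte on a hit, NULL on a miss */
             uint8_t *lookup(uint32_t addr)
             {
                 uint32_t offset = addr & 0x7;                  /* bits 2..0 */
                 uint32_t index  = (addr >> 3) & (ENTRIES - 1); /* next N    */
                 uint32_t tag    = addr >> (3 + N);             /* the rest  */

                 line_t *e = &cache[index];
                 if (e->valid && e->tag == tag)
                     return &e->data[offset];   /* hit */
                 return 0;   /* miss: fetch the line from memory and store
                                it here with this tag, replacing the old
                                entry */
             }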

      3. Set Associative
   
         a. This is an improvement on direct mapping.  It addresses the problem
            of a loop that happens to access a data item whose address is a
            multiple of the cache size away from that of some instruction in
            the loop.
            The low order bits of the address select not one entry in the
            cache, but a set of entries.  (A set size of two or four is common.)
            The tags for each entry are compared in parallel with the incoming 
            address, and if one matches there is a hit.

         b. When a reference is not found in the cache, one of the entries
            in the set must be replaced.  This is typically done either in 
            FIFO fashion, LRU fashion, or randomly.

         Example: The cache memory on a VAX 11/780 is a two-way set associative
                  cache using 8-byte lines.  There are a total of 1024 entries,
                  organized as 512 sets of two entries each - so the cache 
                  capacity is 8K bytes.  Random replacement is used.

                  An address generated by the CPU is broken down as follows:

                   29                 12 11    3  2    0
                  | tag                 | index | offset |

                  (Note: a VAX physical address is 30 bits)

                  Each entry consists of 64 bits of data plus an 18 bit tag

                  Suppose that the processor generates the following physical
                  address (hex) 0001AC44, and wants to access a 4-byte longword.
                  The cache interprets this as:

                  tag = 0001A   index = 188     offset = 4

                  If set 188 (hex) contains an entry with tag 0001A, then bytes
                  4..7 of that entry are returned to the CPU.  Otherwise, the
                  contents of memory locations 0001AC40 .. 0001AC47 are fetched.
                  One of the two entries in set 188 (randomly chosen) is
                  replaced with a tag of 0001A and a value equal to the 8 bytes
                  fetched.  In addition, 4 of the 8 bytes fetched are returned
                  to the CPU.  (Note: Because the memory bus on the VAX is 8
                  bytes wide, the entire line can be fetched in one main memory
                  cycle.)
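
         (The field extraction in this example can be written out in C as
          follows; the fill and replacement steps are only suggested in
          comments.)

             #include <stdio.h>
             #include <stdint.h>

             int main(void)
             {
                 uint32_t addr = 0x0001AC44;     /* 30-bit physical address */

                 uint32_t offset = addr & 0x7;            /* bits 2..0   */
                 uint32_t index  = (addr >> 3) & 0x1FF;   /* bits 11..3  */
                 uint32_t tag    = addr >> 12;            /* bits 29..12 */

                 printf("tag=%05X index=%03X offset=%X\n",
                        (unsigned)tag, (unsigned)index, (unsigned)offset);

                 /* On a hit, the tag is compared with both entries of set
                    'index' in parallel.  On a miss, the 8-byte line at
                    0001AC40..0001AC47 is fetched and replaces one of the
                    two entries in the set, chosen at random. */
                 return 0;
             }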

   C. Issues with regard to caches
   
      1. Replacement policies

         a. Because of its size, the cache can only store a small fraction of
            all the addresses lying in the address space of the current program.

         b. Thus, when a reference occurs to an item that is not in the cache,
            some item currently in the cache must be removed to make room for 
            the item.

         c. With a direct mapping cache, there is no choice involved, since 
            each address maps to one and only one cache location.  As noted
            above, with a set associative scheme LRU, FIFO, or random may be
            used.

      2. Write-through versus write-back
   
         a. What happens in the cache when a memory access is a write rather 
            than a read?  (and the location referenced is in the cache).
   
         b. In a write-through cache, the data is written to Mp and to cache
            at the same time. This slows the system down some (though the CPU
            can go on to the next instruction while the memory write occurs),
            but not drastically since writes are proportionally rare.
   
         c. In a write-back cache, the data is written only to the cache.
            A written-in bit is associated with the item, so that when it is
            selected for replacement by a new entry, it is then written to 
            Mp.  This avoids waiting for multiple writes to a single location
            in Mp; but there is a potential problem if a DMA IO device is to 
            access data that has not yet been written back.  This can be
            handled by forcing the cache to be flushed to main memory before
            a DMA operation is initiated.
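
          (In C-like terms, the difference between the two policies comes down
          to where a store goes and whether a per-line "written in" (dirty)
          bit must be maintained.  A hedged sketch - mp_write is a stand-in
          for a real write to Mp:)

             #include <stdio.h>
             #include <stdint.h>

             typedef struct {
                 int     valid, dirty;
                 uint8_t data[8];
             } line_t;

             /* Stand-in for a write to Mp */
             static void mp_write(uint32_t addr, uint8_t value)
             {
                 printf("Mp[%08X] <- %02X\n", (unsigned)addr, value);
             }

             /* A store that hits in the cache, under either policy */
             void cache_write(line_t *e, uint32_t addr, uint8_t v,
                              int write_back)
             {
                 e->data[addr & 0x7] = v;  /* always update the cached copy */
                 if (write_back)
                     e->dirty = 1;         /* Mp updated at replacement time */
                 else
                     mp_write(addr, v);    /* write-through: Mp updated now */
             }

             /* A write-back cache must copy a dirty victim to Mp before
                replacing it (and before any DMA device reads the region) */
             void flush_if_dirty(line_t *e, uint32_t line_addr)
             {
                 int i;
                 if (e->valid && e->dirty) {
                     for (i = 0; i < 8; i++)
                         mp_write(line_addr + i, e->data[i]);
                     e->dirty = 0;
                 }
             }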

      3. Validity of cache items.
   
         a. When the system is first started up, or when there is a change of
            user in a multiprogrammed environment, the cache will not contain
            valid data until a sufficient number of reads have been done.
            Therefore, each entry in the cache must contain a valid bit - to
            be cleared initially, and to be set when an entry is copied from
            Mp.  (When the valid bit is clear, the cache misses on any 
            attempt that maps to that item.  Of course, in a set associative
            cache each member of a set must have its own bit.)
   
         b. There is often a provision for the operating system to invalidate 
            cache entries when a context change to a new user is done.
            (If the cache is write-back, this also results in changed entries
            being written back to main memory.)

      4. Placement of the cache
   
         a. As we shall see, many systems include memory management hardware
            between the CPU and Mp to perform translation of the addresses.
            The cache can either go between the CPU and the translation 
            hardware or between the translation hardware and memory.
   
                   CPU -- M.cache -- K.mapping -- Mp
   
             or    CPU -- K.mapping -- M.cache -- Mp
   
         b. Which is better?

            i. In the first position, it affects only CPU accesses to memory; in
               the second, both CPU and DMA IO.  The former has the problem that
               a DMA write access to memory could make data in the cache 
               incorrect.  This can be handled in one of two ways:
   
               - Software restrictions - no CPU access to a region of memory 
                 while DMA data is being transferred to it.
   
               - The cache can "listen in" on the memory bus (though this is
                  difficult with translation hardware in between.)

           ii. However, the first position has advantages, too:
   
               - The cache only serves the CPU and is not cluttered with 
                 one-time DMA accesses.
   
               - The cache can help reduce the number of address translations 
                 that must be done - if a CPU address is found in the cache, it 
                 need not be translated.

      5. Set size (for set associative caches).

         a. When designing a set associative cache, a set size must be chosen
            (always a power of 2).  (Note that using a larger set size means
            there can be fewer sets for a given total cache size.)

         b. A set size of two is commonly used, because it is simpler to build.
            A set size of four has been found experimentally to give marginally
            better hit/miss performance.  Experimental evidence suggests set
            sizes greater than 4 produce no significant gain.

III. Mapping Logical Memory Addresses to Physical Memory Addresses

   A. We noted earlier that the CPU generates a stream of addresses
      representing accesses to the memory.

   B. We will speak of the stream of addresses generated by the CPU as
      LOGICAL ADDRESSES.  (Sometimes the term virtual address is used, even if
      the system is not a virtual memory system - but we will reserve the
      latter term for virtual memory systems per se.)

      1. We call them LOGICAL because many systems apply some mapping function
         to the address before actually presenting it to the physical memory.

      2. Logical addresses are generated by the following mechanisms:

         a. Contents of the PC: instruction fetches
      
         b. The various operand addressing modes: direct, relative, indirect 
            etc.

   C. Physically, Mp is organized as an array of individual addressable units
      (bytes or words).  These units are numbered 0 .. total size-1.  We call
      this numbering a PHYSICAL ADDRESS.

   D. We call the range of possible logical addresses the LOGICAL ADDRESS
      SPACE, and the range of possible physical addresses the PHYSICAL ADDRESS
      SPACE.

      1. The logical address space is dictated by the CPU architecture.  For
         example, a CPU that has a 16 bit PC and 16 bit registers will
         generally have a logical address space of 64K.

      2. The physical address space is dictated by the number of memory chips
         installed.  This in turn may be limited by the number of bits
         provided on the memory address bus.  For example, a 24-bit memory
         address bus would dictate a maximum physical space of 16 meg - though
         a given system could have less if not all possible chips are
         present.

   E. There are a number of possibilities for the relationship between these
      two spaces:

      1. They might be equal.  

         a. This is, for example, true on small microcomputer systems, 
            particularly those using 8 or 16-bit CPU's.  It was also the case 
            on early mainframes and minis.

         b. Conceptually, this is the simplest possibility.  The address
            output of the CPU is simply connected directly to the address bus
            of the memory system.

         c. But it is inflexible.  The system supports a single size of
            memory, with no room for expansion.

         d. In multiprogrammed systems, partitioning memory is a problem.
            Each user's program must be compiled to run in a specific partition
            of memory, unless relative addressing is used throughout.

      2. Logical space might be smaller than physical space.  This possibility
         is more a matter of historical significance than something that is
         true of modern systems.

         a. Example: The PDP-11 had a 16-bit logical address (64K); but
            PDP-11 systems used either an 18 or 22 bit physical address bus
            (256K or up to 4Meg).

         b. PC's based on Intel 8086/80186/80286 chips used a 20-bit logical
            address (1 Meg with 640K available for RAM) but used various schemes
            to allow access to a physical address space as big as 4 Meg.

         c. To make use of all of physical memory, some sort of mapping scheme
            becomes necessary.  

            i. Different tasks running on the system may have the same logical 
               address mapped to different physical addresses - so any one task 
               is still of limited size, but available memory is used to
               support several tasks.

           ii. Or, the mapping scheme may be changeable "on the fly" to allow
               a program to access different regions of physical memory at
               different times.

      3. Logical space might be larger than physical space.  Here, there are
         two possibilities.

         a. Some logical addresses might simply be illegal and unused.

            Example: The Power PC chips used in Macintoshes have a 32
                     bit address, which would allow 4 gigabytes of memory.  But 
                     the standard configuration is 16 to 64 meg of RAM plus
                     128K to 256K of ROM, so many addresses are unused.

         b. The system might use virtual memory, which we shall discuss 
            shortly.

IV. Virtual Memory Systems

   A. Virtual memory systems are characterized by logical address space >
      physical address space; but rather than making some logical addresses
      unusable, disk storage is used as an extension of main memory.  (Note:
      from here on out we will call "logical address space" "VIRTUAL address
      space" in recognition of the fact that this makes the memory available
      to a program look much larger than it actually is.)

   B. The basic idea is this: only a portion of a program's virtual address
      space is actually resident in physical memory at any given time.  The
      remainder is kept on secondary storage, to be brought in when needed.

   C. As with systems where logical space < physical space, a mapping scheme
      is used to translate virtual addresses into physical.  However, this
      scheme includes one possibility not found in the earlier case.

      1. The mapping scheme described earlier had three possibilities:

         a. Success
         b. Failure due to attempting to access an unmapped portion of
            virtual space.
         c. Failure due to writing a readonly portion of virtual space.

      2. With virtual memory, a fourth possibility is introduced: the access
         may be valid, but the portion of memory desired may not be physically
         resident.

   D. To accomplish the necessary mapping of addresses, many systems (but not
      all) divide virtual space into fixed size regions called pages, and
      physical space into fixed size regions called page frames.  (The sizes
      are equal, so one page will fit into one frame.)

      Ex: DEC VAX - the pages and page frames are 512 bytes each.

      1. A virtual address is broken into two fields: a page number and an
         offset in page.

         Ex: On the VAX, the low order 9 bits of the virtual address are the
             offset, and the remaining 23 bits comprise the page number.

      2. A physical address is likewise broken into two fields: a page frame
         number and an offset.

         Ex: On a VAX with 8 Meg of physical memory, a (valid) physical
             address would be 23 bits long.  Of these, the high order 14 bits
             are the frame number.

      3. For each user, the operating system maintains a table in memory
         called the page table.  A memory management system register holds
         the address of the first word of the page table for the user whose
         program is currently running.  The page portion of a virtual address 
         is used as an index into this table to select an entry.

         a. One portion of the entry contains various bits to control access
            (such as readonly etc.)  One key bit is a valid bit that indicates
            if the page in question is currently resident in physical memory.

         b. The rest of the entry has two possible uses:

            i. If the page is physically resident, the entry will contain
               the number of the frame where it is to be found.

           ii. Otherwise, the entry may contain an indication as to where to
               go on disk to find the page if it is needed.

         c. An example of mapping, based on the DEC VAX:

            Suppose the page table contains the following entries, among others:

            Entry No    Valid    Frame Number

            ...
            4           1        101010101010101010101
            5           0        ----
            ...

            If the CPU generates the virtual address

               00000000000000000000100101010101

            The memory management hardware divides this into a page number and
            offset:

               00000000000000000000100 101010101

            Now, since the page number is 4, the hardware goes to entry 4 in
            the page table.  Since this is a valid entry, it extracts the
            frame number

               101010101010101010101

            and forms the physical address

               101010101010101010101101010101

            On the other hand, a virtual address

               00000000000000000000101101010101

            would access page table entry 5.  Since this is flagged as not
            valid, a trap to the operating system would occur.  The current
            program would be suspended while the OS gets the proper page from
            disk, loads it into an available frame, resets the page table
            entry to point to the chosen frame (with valid bit now 1), and
            then restarts the failed instruction.
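
            (A C sketch of this lookup, assuming the VAX-like 9-bit offset.
             The tiny table is illustrative, and the length check anticipates
             the page table size register described in 4 below.)

               #include <stdio.h>
               #include <stdint.h>

               #define OFFSET_BITS 9      /* 512-byte pages, as on the VAX */

               typedef struct {
                   unsigned valid;
                   uint32_t frame;        /* frame number, if valid */
               } pte_t;

               static pte_t    page_table[16];  /* illustrative, tiny table */
               static uint32_t table_len = 16;  /* page table size register */

               /* Returns 1 and fills *phys; 0 means a trap to the OS */
               int map(uint32_t vaddr, uint32_t *phys)
               {
                   uint32_t page   = vaddr >> OFFSET_BITS;
                   uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);

                   if (page >= table_len)        /* illegal page number */
                       return 0;
                   if (!page_table[page].valid)  /* page fault */
                       return 0;
                   *phys = (page_table[page].frame << OFFSET_BITS) | offset;
                   return 1;
               }

               int main(void)
               {
                   uint32_t phys;
                   page_table[4].valid = 1;
                   page_table[4].frame = 0x155555; /* 101010101010101010101 */
                   if (map((4u << OFFSET_BITS) | 0x155, &phys))
                       printf("page 4 -> physical %08X\n", (unsigned)phys);
                   if (!map((5u << OFFSET_BITS) | 0x155, &phys))
                       printf("page 5 -> fault (not resident)\n");
                   return 0;
               }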

      4. To make this scheme work, an additional control register is needed.

         a. In principle, the page table would have to have one entry for each
            possible page number.  For example, a VAX page table would need
            2^23 = 8 million entries.  (Since each entry is several bytes long,
            this is all of physical memory several times over!)

         b. In practice, the total virtual space for any given user is usually
            much, much less than the maximum implied by the architecture.
            Therefore, the actual page table usually has only as many entries
            as are needed for the pages actually used by the program.  Thus,
            in addition to the register that contains the address of the
            beginning of a user's page table, there is usually also a page
            table size register.  Any page number that exceeds the value in
            this register causes the mapping process to fail.

      5. Note that, when a page needs to be brought into main memory, the
         operating system must find a free frame to hold it.  This often means
         that a currently-resident page must be paged out to disk to make room.

         a. If the page has not been written in since it was last loaded, and
            if the original is still on disk, the page does not need to be
            written out to disk; its page table entry can be flagged as invalid
            and it can be displaced.  However, if it has been modified it must
            first be written back to disk.

         b. To facilitate this, each page table entry must include a "written
            in" bit that is set by the hardware when a write access to the page
            is done.

   E. An alternative to paging is segmentation.  The difference is that virtual
      space is divided into variable-size segments instead of fixed-size pages.

      1. One problem with paging is INTERNAL FRAGMENTATION of memory.  Since
         a user's actual needs seldom comprise a whole number of pages, the
         last page allocated to each user will contain some unneeded and
         therefore wasted space.  Segmentation avoids this.

      2. This also has the advantage that the division of virtual space may 
         mirror program logic - e.g. a segment may be a single procedure or a 
         single large data structure.  This facilitates sharing of code among 
         different users.

      3. The virtual address is now treated as a segment number plus offset
         within segment.  The segment number indexes a segment table, which is
         like a page table except that each entry must include a segment length
         field as well as written-in and valid bits and a frame address.  Also,
         since segments are variable size, frames must be variable size too;
         therefore the frame address must be a complete address, not just the
         high order bits.

   F. Segmentation runs into a physical memory fragmentation problem, since
      segments can be of any size.  

      1. In time, memory can become checkerboarded - e.g.

            --------------
            | 8 K in use |              Suppose a program needs to bring in a
            --------------              4K segment from disk.  Since a total
            | 2 K free   |              of 5K of physical memory is available,
            --------------              this should be possible.  But since
            | 6 K in use |              the 5K is in two non-contiguous
            --------------              portions, this cannot be done.
            | 3 K free   |
            --------------
            | rest in use|

      2. This is called EXTERNAL FRAGMENTATION.

      3. Some systems solve this problem by using segmentation with paging:
         Each segment is composed of 1 or more fixed-size pages.  The segment
         table now points to a page table for the segment.

         a. The virtual address now has three fields:

              segment number | page number within segment | offset within page

         b. The mapping process is as follows:

            i. Check segment number against segment limit.  If it is in range,
               use it as an index into the segment table.

           ii. From the segment table entry, extract the address and size of the 
               page table for the segment.

          iii. Check the page within segment to be sure it is within range.  If
               so, use it as an index into the segment's page table.

           iv. From the page table entry extract control bits plus (if the valid
               bit is set) the frame number of the physical address. 
               Concatenate the frame number and offset within page to form a 
               physical address.
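
          (A hedged C sketch of this three-field mapping; the table layouts
          and sample contents are illustrative.  Compare the paging-only
          sketch earlier.)

             #include <stdio.h>
             #include <stdint.h>

             #define PAGE_BITS 9

             typedef struct { unsigned valid; uint32_t frame; } pte_t;
             typedef struct {
                 uint32_t npages;    /* segment length, in pages */
                 pte_t   *ptbl;      /* the segment's own page table */
             } seg_t;

             /* One illustrative segment of 4 pages, 2 of them resident */
             static pte_t seg0_pages[4] = {{1,7}, {1,12}, {0,0}, {0,0}};
             static seg_t seg_table[]   = {{ 4, seg0_pages }};
             static uint32_t seg_table_len = 1;

             /* vaddr fields: segment number, page within segment, offset */
             int map(uint32_t seg, uint32_t page, uint32_t off, uint32_t *phys)
             {
                 if (seg >= seg_table_len)              /* bad segment number */
                     return 0;
                 if (page >= seg_table[seg].npages)     /* beyond segment end */
                     return 0;
                 if (!seg_table[seg].ptbl[page].valid)  /* page fault */
                     return 0;
                 *phys = (seg_table[seg].ptbl[page].frame << PAGE_BITS) | off;
                 return 1;
             }

             int main(void)
             {
                 uint32_t phys;
                 if (map(0, 1, 0x0AC, &phys))   /* segment 0, page 1 */
                     printf("-> physical %08X\n", (unsigned)phys);
                 return 0;
             }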

      4. We have now reintroduced internal fragmentation, while getting rid
         of external fragmentation (a more manageable problem).  At the same
         time, we have retained the logical advantages of segmentation for
         code sharing etc.

      Example: The Intel 80386/80486/Pentium chips offer the choice of pure 
               segmentation or segmentation with paging - as determined by the 
               setting of a bit in a CPU control register.

               In either mode, logical addresses are composed of two parts -
               a segment selector, and an offset in segment.  The segment
               selector is contained in one of 6 CPU segment registers, and the 
               offset is computed using the addressing mode specified by the 
               instruction.  (The choice of segment register is normally 
               implicit in the type of reference being done - e.g. instruction 
               fetches use the CS segment register, references to the stack use 
               the SS segment register, and ordinary data references use the DS 
               segment register.  The programmer can override the default by 
               prefixing the instruction with a special segment prefix.)  

               The segment selector is used to reference a segment descriptor
               contained in one of two segment descriptor tables - a global
               table shared by all tasks, and a local table for each task.  One
               bit in the selector specifies which table to use, and 13 bits
               are used to select one of 8192 entries in the appropriate table.

               The descriptor, in turn, contains a segment base address, a
               segment limit (size), and some validity and protection bits.
               If the segment is valid and the access is allowed by the
               protection, then the offset is compared to the segment 
               limit; if it is <= the limit then the offset is added to the 
               segment base to form a "linear address" (i.e. an address within
               a linear, unsegmented space.)

               At this point, if paging is enabled, the resulting value is
               mapped using a two-level page table; otherwise, it is used
               directly as the physical address.

   G. Paging and both forms of segmentation suffer from an important
      overhead problem.

      1. Since page/segment tables are stored in memory, each memory access
         turns into multiple accesses:

         a. With paging: one access to the page table, then one to the data -
            two in all.

         b. With pure segmentation: one access to the segment table, then one
            to the data - two in all.

         c. With segmentation with paging: one access to the segment table, one
            to the segment's page table, and one to the data - three in all!
            (Or on the 80x86 four in all, because the page table has two 
            levels!)

      2. This overhead would, of course, be intolerable.  There are two ways
         to avoid or reduce it.

          a. If the tables can be kept small, they can be kept in high-speed
             registers rather than memory.

            i. The 80x86 does something like this with the segment table.  The
               CPU contains 6 segment registers, each of which can be loaded
               with a segment number.  Most instructions implicitly use one of 
               these segment registers in forming a logical address - e.g. code 
               references in branches etc. are always made using the code 
               segement register (CS).  That is, most instructions generate 
               only the 32 bit offset portion of the address directly.

               As a hidden step, whenever a program loads a segment register the
               CPU also fetches the corresponding segment descriptor from the
               segment table and stores it in a hidden part of the segment
               register.  Thus, the descriptor is always available when needed.

           ii. Storing page/segment tables in registers is generally not useful 
               with virtual memory page tables, which tend to be quite large 
               because virtual memory is intended to support large address 
               spaces.

          b. An alternative (that is more commonly used) is a set of registers
             to store the most recently used virtual -> physical address
             translations.  These registers are sometimes called a translation
             look-aside buffer.  They are organized as an approximation to an 
             associative memory (like a cache).  When translating a virtual 
             address, the hardware first looks to see if the translation is 
             available in the set of registers.  If so, no additional memory 
             accesses are needed.

                        Pc -- K.mapping -- Mp
                                  |
                                M.translation look-aside
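
             (A C sketch of the idea: consult the look-aside registers first,
              and walk the page table in memory only on a miss.  The size,
              the sequential search, the replacement rule, and the stand-in
              table walk are all simplifications - a real TLB compares its
              entries in parallel.)

                #include <stdio.h>
                #include <stdint.h>

                #define TLB_SIZE 64       /* illustrative size */

                typedef struct {
                    int      valid;
                    uint32_t page;        /* virtual page number */
                    uint32_t frame;       /* its cached translation */
                } tlb_entry_t;

                static tlb_entry_t tlb[TLB_SIZE];

                /* Stand-in for the slow path: one or more accesses to the
                   page table(s) in Mp */
                static uint32_t walk_page_table(uint32_t page)
                {
                    return page + 100;    /* dummy translation */
                }

                uint32_t translate(uint32_t page)
                {
                    int i;
                    for (i = 0; i < TLB_SIZE; i++)  /* parallel in hardware */
                        if (tlb[i].valid && tlb[i].page == page)
                            return tlb[i].frame;    /* hit: no memory access */

                    uint32_t frame = walk_page_table(page);  /* miss */
                    tlb_entry_t e = { 1, page, frame };
                    tlb[page % TLB_SIZE] = e;       /* simplistic replacement */
                    return frame;
                }

                int main(void)
                {
                    printf("%u\n", (unsigned)translate(5));  /* miss */
                    printf("%u\n", (unsigned)translate(5));  /* hit */
                    return 0;
                }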

      3. Cache memory can also help, if it is installed between the CPU and the
         mapping hardware.  Now, only memory references that miss in the cache
         are translated at all - typically less than 10% of all references.

   H. Page Replacement Schemes

      1. In any virtual memory system, we must have a policy for deciding which
         page is to be removed from memory to provide a page frame for a newly
         requested page when a page-fault occurs.  When a process is just
         starting up, we normally give it an initial page frame allocation; and
         as it needs more room we give it an additional page frame each time it
         faults, so that the total allocation for the process grows dynamically.
         However, there must be an upper limit set for the process, and there 
         must be some mechanism for shrinking the allocation to a process 
         whose requirements have decreased.
   
      2. This is primarily a concern of a course on operating systems.  However,
         certain schemes require various sorts of hardware assistance,
         which we consider now.

      3. The optimal policy would be to replace the page whose next reference 
         is furthest in the future, treating a page that will not be referenced
         again at all as having infinite time until its next reference.
   
         a. It can be shown that this policy would minimize page faults.

         b. But, except in very rare situations, it is totally impractical.

         c. The other policies that are used in practice are approximations 
            aimed at achieving close to the same effect.

      4. All schemes require a per-page written-in bit, so that if a page is
         to be removed from memory it need be rewritten to disk only if it
         has been modified.  (In fact, many schemes give priority to removing
         from memory pages which have not been modified, so as to reduce disk
         traffic.  If a modified page is kept around long enough, one of two
         things will happen: it will be read again (and hence the write-out
         and read-in traffic is saved), or the task using it will complete,
         and it need not be written at all.)  Normally a copy of this bit is
         maintained in the TLB.  When a write occurs to a page already in the
         TLB, if the bit there is set then no update to the page table need
         occur; otherwise, the written-in bit is set both in the TLB and in
         the page table.
   
      5. Several schemes make use of a per-page "referenced" bit - normally 
         stored in the page table along with the valid and written-in bits.

         a. When the page is first brought into memory - and perhaps at other
            times as well - the operating system software clears this bit.

         b. When any reference is made to the page, the hardware sets the bit
            automatically.  Thus, the operating system can determine whether any
            reference has been made to the page within the interval since it
            last cleared the bit.
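
          (How the operating system uses these bits is really a topic for an
          operating systems course, but the following hedged sketch shows one
          simple possibility: prefer to replace a page that is neither
          recently referenced nor modified.)

             typedef struct {
                 unsigned valid, referenced, written;
             } pte_t;

             /* Pick a victim page: class 0 = unreferenced and clean (best),
                class 3 = referenced and dirty (worst).  Assumes pt[0..n-1]
                describes the process's resident pages. */
             int choose_victim(const pte_t *pt, int n)
             {
                 int i, best = -1, best_class = 4;
                 for (i = 0; i < n; i++) {
                     if (!pt[i].valid)
                         continue;
                     int c = 2 * pt[i].referenced + pt[i].written;
                     if (c < best_class) { best_class = c; best = i; }
                 }
                 /* The OS periodically clears the referenced bits, so that
                    "referenced" means "used since the last sweep"; the
                    hardware sets a page's bit again on any access. */
                 return best;
             }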

      6. To summarize: hardware support for operating system's page replacement
         scheme includes:

         a. Minimally - a per-page "valid" bit and "written-in" bit.
         b. In some cases, a per-page "referenced" bit.

Copyright ©1999 - Russell C. Bjork