Memory Hierarchies

CS311 Lecture: Memory Hierarchies                       last revised 11/13/03

I. Introduction
-  ------------

   A. In the previous lecture, we looked at the basic building blocks of
      memory systems, the individual devices: chips, disks, tapes, etc.

   B. We now focus on complete memory systems.  We will see that memory systems
      are seldom composed of just one type of memory; instead, they are
      HIERARCHICAL systems composed of a mixture of technologies aimed at
      achieving a good tradeoff between speed, capacity, cost, and physical 
      size.

II. The memory hierarchy
--  --- ------ ---------

   A. Since every instruction executed by the CPU requires at least one
      memory access (to fetch the instruction) and often more, the performance
      of memory has a strong impact on system performance.

   B. With present technologies, it turns out to be possible to build very fast
      memories - that are quite expensive - or to build quite inexpensive
      memories - that are relatively slow.  The following table summarizes the
      currently available technologies as they might be used on a moderate
      desktop system with a 1 GHz CPU.  (All columns are order of magnitude)

   Technology           typical quantity in a   access time     cost
                        desktop system (bytes)                  ($/byte)

   CPU registers        < 10^5                  < 1 ns          - (can't buy
   and "on chip"                                                   separately)
   memory

   Static RAM           < 10^6                  3-10 ns         >= 10^-4

   Dynamic RAM          < 10^9                  ~ 10^2 ns       >= 10^-7

   Non-removable        < 10^11                 ~ 10 ms         >= 10^-9
   disk                                         (= 10^7 ns)

   Removable media      (potentially            > 1 sec (due    >= 10^-9
   (disk/tape)           unlimited)             to mounting)

   C. Unfortunately, technological realities come into conflict with
      system requirements at this point.

      1. Since each instruction executed requires at least one memory access
         (to fetch the instruction) and possibly two, the memory system must
         support fast accesses.  For example, on a RISC with a 1 GHz clock,
         we must be able to access an instruction every clock cycle -
         implying an access time requirement of 1 ns.  Note that only
         memory on the CPU chip meets this requirement.

      2. If the system is supporting multiple concurrent tasks, and even
         concurrent users (multi-user system or server), it may require
         100's of megabytes or even gigabytes of memory to be available - 
         possibly more than even available Dynamic RAM can hold.

      3. To address this problem, sophisticated systems will have a HIERARCHY 
         of memories, using several different technologies, with smaller
         quantities of very fast memory and larger quantities of slow memory.

         The hierarchy can be pictured this way.  (Not to scale)

                        _________________________________
        Registers       | CPU registers   (part of CPU) |
        ----------------|-------------------------------|---------------
        "Memory"        | Memory on CPU chip (L1 cache) |
                        |-------------------------------|
                        | Static RAM chips   (L2 cache) |
                        |-------------------------------|
                        | Dynamic RAM     (main memory) |
                        |                               |
                        | (Note: this region is         |
                        |  referred to as Mp - primary  |
                        |  memory)                      |
                        |                               |
                        |                               |
                        |-------------------------------|
                        | Disk         (virtual memory) |
                        |                               |
                        | (Note: this region is         |
                        |  referred to as Ms -          |
                        |  secondary memory)            |
                        |                               |
                        |                               |
                        |                               |
                        :                               :
                        :                               :
        ----------------|-------------------------------|---------------
        File system     | Disk and tape                 |
                        :                               :
                        :                               :

      4. Note that the portion described as "memory" corresponds to memory as
         it is viewed by the assembly language program (locations that can be
         specified by the operand address part of an instruction such as lw or
         sw.)   Within this portion, only main memory is absolutely needed.

         a. Cache memory serves to provide fast access to a subset of
            memory that is needed often, and contains copies of frequently
            accessed locations in main memory.  Without this, a CPU clock
            rate in excess of 10-20 MHz would be wasted, because the memory
            system couldn't keep up with the need for instruction accesses!

         b. Virtual memory serves to provide additional capacity beyond what
            is physically available in main memory.  

         c. Both look to the programmer like main memory, but are not 
            physically implemented as such.

      5. Note, too, that disk plays two roles - as part of the memory system 
         and as part of the file system, and that tape is treated here as part 
         of the file system.  This is because disk and tape are often accessed
         from programs using IO statements, and are physically connected into
         the system as part of the IO system.  We consider only the virtual
         memory role for disk in this lecture.

   D. This hierarchy can provide very good performance by taking advantage of
      the principle of LOCALITY OF REFERENCE: 

      1. TEMPORAL LOCALITY: if one observes the memory references generated 
         by a program during a short window of time, it will be the case that 
         most of the references will pertain to a small subset of the total 
         address space of the program.

         a. This happens because most of the execution time of a program is
            spent executing loops.

         b. Within a loop, the same body of code is being executed over and
            over; in addition, it is usually the case that some or all of
            the data items being accessed are accessed over and over.

      2. SPATIAL LOCALITY: often, successive references to memory refer to
         locations that are physically adjacent to one another.

         a. Absent branch instructions, code in successive locations in
            memory is executed in sequence.

         b. Data structures such as arrays and objects are stored in successive
            locations in memory.
            
      3. The basic idea, then, is to keep in the fastest memory in the 
         hierarchy those data items that are being used currently, with the
         moderately fast memory used for items that will be needed soon and
         slow memory used for items that will not be needed until the distant
         future.  When we copy information to a faster level in the hierarchy, 
         we may try to copy a larger block of data, anticipating that
         other locations in the same block will be needed soon. 

         a. Assembly language programmers and optimizing compilers try to
            make good use of the CPU registers toward that end.  This will not
            be something we will discuss here.

         b. We will focus on the memory system proper.

      4. Note that a memory hierarchy is based on a subset relationship - any
         information that is found in a higher level of the hierarchy is also
         found in all lower levels.  (This may become false when we do data
         writes, however, as we shall see.)

   E. We will proceed soon to discuss cache memory and virtual memory in
      detail.  First, though, we must consider a little more of the interface
      between the CPU and the memory system as a whole.

      1. As a user or system program is running, it generates a stream of memory
         references for instruction fetch, operand fetch, and result stores.  
         These take the form of an address plus a read/write specifier - e.g.

              CPU ---> memory

              Read  1703
              Write 3424
              Read  1704

      2. The majority of requests are reads

         a. Instruction fetches: every instruction involves at least one read

         b. In evaluating expressions, intermediate results are typically kept
            in CPU registers with a final store at the end.

              Ex:   X = A + B * C - D

            would need (on a 1-address plus general register machine, where
            each arithmetic instruction names one memory operand):

            5 instruction fetches
            4 operand fetches
            1 store

            So 90% of the accesses are reads.  (70% - 90% reads is typical)

         c. Therefore, the primary design goal in the memory system is to 
            optimize reads from memory - without, of course, unduly penalizing 
            writes.
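
         As a rough check of the count above, the following Python sketch
         tallies the memory accesses for X = A + B * C - D.  The
         five-instruction sequence and its mnemonics are assumptions made for
         illustration (a one-address-plus-register encoding), not the
         instruction set of any particular machine.

```python
# Memory accesses for X = A + B * C - D on an assumed
# one-address-plus-register machine. Mnemonics are illustrative only.
program = [
    ("LOAD",  "B"),   # R <- B
    ("MUL",   "C"),   # R <- R * C
    ("ADD",   "A"),   # R <- R + A
    ("SUB",   "D"),   # R <- R - D
    ("STORE", "X"),   # X <- R
]

instruction_fetches = len(program)                            # one read each
operand_reads = sum(1 for op, _ in program if op != "STORE")  # operands read
stores = sum(1 for op, _ in program if op == "STORE")         # memory writes

reads = instruction_fetches + operand_reads
total = reads + stores
read_fraction = reads / total
print(f"{reads} reads out of {total} accesses = {read_fraction:.0%}")
```

         With this encoding the sketch reports 9 reads out of 10 accesses,
         matching the 90% figure.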

III. Cache Memory
---  ----- ------

   A. One technique used to improve overall memory system performance is CACHE
      MEMORY.

      1. At one time, cache memory was a feature generally found only in 
         higher-end computer systems.  However, as CPU speeds have continued to
         increase while memory speeds have not, cache memory has become a
         necessity on even desktop computer systems.

         a. About 10-15 years ago, common CPU clock speeds were on the
            order of 8-16 MHz, and DRAM cycle times on the order of 70-80 ns.
            Under these circumstances, it would be possible to perform a
            memory access every 1-2 clock cycles.

         b. Today, CPU clocks have gone above 1GHz, while DRAM access time
            has improved only slightly, to about 60 ns.  Thus, an access to
            main memory takes on the order of 60 clocks or more!

         c. Since a RISC is designed to execute one instruction per clock,
            and since each instruction must be fetched from memory, it would
            seem that RISC's couldn't function at the clock speeds they do
            using memory of the sort available today.

         d. Execution of one (or more) instructions per clock is critically
            dependent on the use of cache memory.
   
      2. Cache memory is a small, high-speed memory, logically located between 
         the CPU and the rest of the memory system.  

         a. At one point in time, cache memory was usually separate from the
            CPU.

         b. Today's high-speed CPU's depend on having cache memory on the CPU
            chip that operates at the same speed as the CPU itself.  This has
            been made possible by improved chip manufacturing techniques that
            allow more transistors per chip.

         c. Most systems now use a two level cache, with a small primary cache 
            on the CPU chip and a larger, separate (and slower) secondary cache.
            (The need for this is dictated by how much cache can actually fit
            on the CPU chip, due to space and power/heat considerations.)
            (Some systems have a three level cache)

         d. It is also common to find - at least at the primary cache level -
            that separate caches are used for instructions and data.  This
            facilitates having separate paths to memory for the instruction
            fetch unit and the data memory access unit of a pipelined machine.

      3. Cache memory works because of the phenomenon of locality of reference.

         a. Each memory read is first tried against the cache.  If the data is
            found there (a "hit"), the processor can proceed at maximum speed.
        
         b. Otherwise, we have a cache miss and the processor must wait for a 
            slower access to secondary cache or (if there is a miss there
            too) to main memory.
   
         c. Well-designed caches can achieve hit rates of 95% or more much of 
            the time.

         d. To see why this is important, suppose we have a 1 GHz CPU with
            a memory system that requires 100 ns to do an access (including 
            bus overhead) and a single, on-chip cache.

            i. Theoretically, the time to execute an instruction is 1 ns.

           ii. Suppose, however, that the cache has a hit rate of 90%.  Then
               90% of instructions can be fetched in 1 ns, but 10% require
               100 ns.  Now the average time per instruction becomes
               .9 x 1 ns + .1 x 100 ns = 10.9 ns - which is equivalent to
               reducing the clock rate by a factor of almost eleven!  (Plus 
               any loss of time due to data cache misses.)

          iii. With a 95% hit rate primary cache, and a 10 ns secondary
               cache that hits 90% of the primary cache misses, we get an
               average instruction fetch time of

               .95 x 1 ns + (.05)(.9) x 10 ns + (.05)(.1) x 100 ns = 1.9 ns
                
               (which is like cutting the clock speed in half)

           iv. Of course, higher hit ratios - especially on the primary
               cache - further improve performance.  Nonetheless, CPU clock
               rate alone tells you little about overall execution speed
               without some knowledge of how the caches perform!
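
            The two averages above can be reproduced with a small Python
            sketch.  The function name and argument layout are hypothetical;
            the cost model is the simplified one used in this lecture, where
            an access that hits at some level costs only that level's access
            time.

```python
def average_access_time(levels, memory_ns):
    """Average fetch time under the lecture's simplified cost model.

    levels: list of (hit_rate, access_ns), fastest cache first.
    memory_ns: time for an access that misses every cache level.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_rate, access_ns in levels:
        total += reach * hit_rate * access_ns
        reach *= 1.0 - hit_rate
    return total + reach * memory_ns

# Single on-chip cache, 90% hit rate, 100 ns main memory.
one_level = average_access_time([(0.90, 1.0)], 100.0)
# 95% primary cache plus a 10 ns secondary cache catching 90% of its misses.
two_level = average_access_time([(0.95, 1.0), (0.90, 10.0)], 100.0)
print(f"{one_level:.1f} ns, {two_level:.1f} ns")
```

            Changing the hit rates in the calls shows how sensitive the
            average is to the primary cache in particular.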

   B. In principle, a cache is an associative memory containing pairs of the
      form:
                   memory address               value
   
      (Note: On a byte-addressable machine, it is common to have each entry in
       the cache contain several successive bytes - called a line or block.
       For example, many systems use a cache based on entries holding 8 
       consecutive bytes of data.  Thus, the lower three bits of an address 
       would be ignored when looking up an entry, but once the entry is found 
       they would be used to select the correct portion of it.)

      1. However, a large fully associative memory is impractical.  For example,
         in a cache of 10,000 entries, each address emitted by the CPU would
         have to be compared to all 10,000 entries at the same time.  Even if
         the required number of comparators could be economically built, the 
         incoming address would have to drive 10,000 logic loads.  This would 
         require several layers of buffering (since a typical gate output can 
         drive about 10 others), which would inject intolerable delays.  So one
         of several approximations to associative memory is used.

      2. Direct mapping.
   
         a. The cache is constructed as a set of pairs of the form tag, value.
            The data portion of the pair is called a CACHE LINE, or CACHE
            BLOCK.  Typically, the size of the line or block corresponds to the 
            system data bus size - e.g. 4-16 bytes - for first-level cache,
            but may be much bigger for a second-level cache (as big as, say,
            128 bytes - this to take advantage of spatial locality).

         b. The number of entries is always a power of 2.  Suppose it is 2^n.
   
         c. Then the address from the CPU is broken down as follows, assuming
            each line contains 8 bytes of data:
   
                   tag        | entry number |  position in entry
                              |    n  bits   |  3 bits
    
            i. Bits 3 .. n + 2 of the address select one of the entries in
               the cache.  If the tag of that entry matches the rest of the
               address, we have a hit.  (The tag comparison is needed because
               many addresses map to the same cache entry.)
   
           ii. Otherwise, a complete line of data is obtained from memory.
               It is stored (along with the tag portion of its address) in the 
               cache for future use, replacing the entry currently there.
   
          iii. Note that this scheme implies that at any time at most one entry
               with any given pattern in its n entry-number bits can be in the
               cache.  This is usually not a problem - but may be if a loop
               calls a procedure whose address is some multiple of the cache
               size (2^n lines, or 2^(n+3) bytes here) away from the loop body,
               or if there are accesses in a loop to two data structures whose
               addresses are some multiple of the cache size apart.
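
         The address breakdown and lookup just described can be sketched in
         Python as follows.  This is a minimal model under stated assumptions:
         8-byte lines as in the text, n = 4 (16 entries) chosen arbitrarily,
         and only tags are stored, since hit/miss behavior does not depend on
         the data.  All names are hypothetical.

```python
LINE_BITS = 3    # 8-byte lines, as in the text
INDEX_BITS = 4   # n = 4, so 2^4 = 16 entries (arbitrary small choice)

def split_address(addr):
    """Return (tag, entry number, position in entry) for a byte address."""
    offset = addr & ((1 << LINE_BITS) - 1)
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (LINE_BITS + INDEX_BITS)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self):
        # Each entry holds the tag of the line stored there; None = empty.
        self.tags = [None] * (1 << INDEX_BITS)
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.misses += 1
        self.tags[index] = tag   # replace whatever was in this entry
        return False

cache = DirectMappedCache()
# Eight successive bytes share one line: one miss, then seven hits.
sequential = [cache.read(a) for a in range(0x100, 0x108)]
# Addresses 2^(n+3) = 128 bytes apart use the same entry and evict each other.
conflict = [cache.read(a) for a in (0x200, 0x280, 0x200, 0x280)]
print(cache.hits, cache.misses)
```

         The second access pattern misses every time even though the cache is
         almost empty, which is the weakness noted in item iii.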

      3. Set Associative
   
         a. This is an improvement on direct mapping.  It can address the 
            problem we just discussed.  

            i. The cache entries are divided into sets - typically involving 2
               or 4 entries.

           ii. The low order bits of the address select not one entry in the
               cache, but a set of entries.  The tags for each entry are 
               compared in parallel with the incoming address, and if one 
               matches there is a hit.

          iii. Note that, for a given total cache size, a set-associative
               cache with set size two contains half as many sets as the
               direct mapping cache contains lines - but each set can hold
               information from two different places in memory.

         b. When a reference is not found in the cache, one of the entries
            in the set must be replaced.  This is typically done either in 
            FIFO fashion, LRU fashion, or randomly.
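
         A set associative lookup with LRU replacement might be sketched as
         below.  The class and its parameters are hypothetical, and an
         OrderedDict stands in for the parallel tag comparison and LRU
         bookkeeping that real hardware would do.  Setting ways=1 gives a
         direct mapping cache of the same capacity for comparison.

```python
from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, num_sets, ways, line_bytes=8):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One OrderedDict per set; keys are tags, least recently used first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def read(self, addr):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)       # now the most recently used
            return True
        if len(entries) == self.ways:
            entries.popitem(last=False)    # evict the least recently used
        entries[tag] = None
        return False

# Same total capacity (16 lines), organized two ways.
direct = SetAssociativeCache(num_sets=16, ways=1)
two_way = SetAssociativeCache(num_sets=8, ways=2)
pattern = [0x200, 0x280] * 4   # two lines that collide when ways=1
direct_hits = sum(direct.read(a) for a in pattern)
two_way_hits = sum(two_way.read(a) for a in pattern)
print(direct_hits, two_way_hits)
```

         For this pattern the direct mapping organization never hits, while
         the two-entry sets hold both lines after the first two misses.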

   C. Issues with regard to caches
   
      1. Replacement policies

         a. Because of its size, the cache can only store a small fraction of
            all the addresses lying in the address space of the current program.

         b. Thus, when a reference occurs to an item that is not in the cache,
            some item currently in the cache must be removed to make room for 
            the item.

         c. With a direct mapping cache, there is no choice involved, since 
            each address maps to one and only one cache location.  As noted
            above, with a set associative scheme LRU, FIFO, or random may be
            used.

      2. Write-through versus write-back
   
         a. What happens in the cache when a memory access is a write rather
            than a read (and the location referenced is in the cache)?
   
         b. In a write-through cache, the data is written to Mp and to cache
            at the same time. This slows the system down some (though the CPU
            can go on to the next instruction while the memory write occurs),
            but not drastically since writes are proportionally rare.
   
         c. In a write-back cache, the data is written only to the cache.
            A "written in" bit is associated with the item, so that when it is
            selected for replacement by a new entry, it is then written to 
            Mp.  This avoids waiting for multiple writes to a single location
            in Mp; but there is a potential problem if a DMA IO device is to 
            access data that has not yet been written back.  This can be
            handled by forcing the cache to be flushed to main memory before
            a DMA operation is initiated.
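
         The difference between the two policies can be illustrated with a
         toy Python model that counts writes reaching Mp.  It is a sketch
         only: the class is hypothetical, it tracks single locations rather
         than lines, and it simply allocates an entry on every write so that
         the write policy is the only thing being modeled.

```python
class WriteCache:
    """Toy cache in which only the write policy is modeled."""
    def __init__(self, write_back):
        self.write_back = write_back
        self.lines = {}          # address -> [value, written_in flag]
        self.memory = {}         # stands in for Mp
        self.memory_writes = 0

    def write(self, addr, value):
        if self.write_back:
            self.lines[addr] = [value, True]    # defer the write to Mp
        else:
            self.lines[addr] = [value, False]
            self._write_memory(addr, value)     # write Mp right away

    def evict(self, addr):
        value, written_in = self.lines.pop(addr)
        if written_in:
            self._write_memory(addr, value)     # write back on replacement

    def flush(self):
        """Write all modified entries to Mp, e.g. before a DMA operation."""
        for addr, entry in self.lines.items():
            if entry[1]:
                self._write_memory(addr, entry[0])
                entry[1] = False

    def _write_memory(self, addr, value):
        self.memory[addr] = value
        self.memory_writes += 1

through = WriteCache(write_back=False)
back = WriteCache(write_back=True)
for cache in (through, back):
    for value in range(10):
        cache.write(0x40, value)   # ten writes to the same location
stale = back.memory.get(0x40)      # Mp has not yet seen any of these writes
back.flush()
print(through.memory_writes, back.memory_writes)
```

         Ten writes to one location cost ten Mp writes with write-through but
         only one with write-back; until the flush, Mp holds no copy of the
         new value, which is exactly the DMA hazard described above.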

      3. Validity of cache items.
   
         a. When the system is first started up, or when there is a change of
            user in a multiprogrammed environment, the cache will not contain
            valid data until a sufficient number of reads have been done.
            Therefore, each entry in the cache must contain a valid bit - to
            be cleared initially, and to be set when an entry is copied from
            Mp.  (When the valid bit is clear, the cache misses on any 
            attempt that maps to that item.  Of course, in a set associative
            cache each member of a set must have its own bit.)
   
         b. There is often a provision for the operating system to invalidate 
            cache entries when a context change to a new user is done.
            (If the cache is write-back, this also results in changed entries
            being written back to main memory.)

      4. Set size (for set associative caches).

         a. When designing a set associative cache, a set size must be chosen
            (generally a power of 2).  (Note that using a larger set size means
            there can be fewer sets for a given total cache size.)

         b. A set size of two is commonly used, because it is simpler to build.
            A set size of four has been found experimentally to give marginally
            better hit/miss performance.  Experimental evidence suggests set
            sizes greater than 4 produce no significant gain.

IV. Mapping Logical Memory Addresses to Physical Memory Addresses
--  ------- ------- ------ --------- -- -------- ------ ---------

   A. We noted earlier that the CPU generates a stream of addresses
      representing accesses to the memory.

   B. We will speak of the stream of addresses generated by the CPU as
      LOGICAL ADDRESSES.  (Sometimes the term virtual address is used, even if
      the system is not a virtual memory system - but we will reserve the latter
      term for virtual memory systems per se.)

      1. We call them LOGICAL because many systems apply some mapping function
         to the address before actually presenting it to the physical memory.

      2. Logical addresses are generated by the following mechanisms:

         a. Contents of the PC: instruction fetches
      
         b. Computed operand addresses for load and store instructions.

   C. Physically, main memory is organized as an array of individual
      addressable units (bytes or words).  These units are numbered 0 ..
      total size-1.  We call the number of a unit its PHYSICAL ADDRESS.

   D. We call the range of possible logical addresses the LOGICAL ADDRESS
      SPACE, and the range of possible physical addresses the PHYSICAL ADDRESS
      SPACE.

      1. The logical address space is dictated by the CPU architecture.  For
         example, a CPU that has a 32 bit PC and 32 bit registers will
         generally have a logical address space of 4GB.

      2. The physical address space is dictated by the number and capacity of 
         memory chips actually installed.  

   E. There are a number of possibilities for the relationship between these
      two spaces:

      1. They might be equal.  

         a. This is, for example, true on small microcomputer systems, 
            particularly those using 8 or 16-bit CPU's.  It was also the case 
            on early mainframes and minis.

         b. Conceptually, this is the simplest possibility.  The address
            output of the CPU is simply connected directly to the address bus
            of the memory system.

         c. But it is inflexible.  The system supports a single size of
            memory, with no room for expansion.

         d. In multiprogrammed systems, partitioning memory is a problem.
            Each user's program must be compiled to run in a specific partition
            of memory, unless relative addressing is used throughout.

      2. Logical space might be smaller than physical space.  This possibility
         is more a matter of historical significance than something that is
         true of modern systems; however, it was in the context of these
         sorts of systems that fundamental concepts of memory management were
         developed that are still very much in use today.

         a. Example: The PDP-11 had a 16-bit logical address (64K); but
            PDP-11 systems used either an 18 or 22 bit physical address bus
            (256K or up to 4Meg).

         b. PC's based on Intel 8086/80186/80286 chips used a 20-bit logical
            address (1 Meg with 640K available for RAM) but used various schemes
            to allow access to a physical address space as big as 4 Meg.

         c. To make use of all of physical memory, some sort of mapping scheme
            became necessary.  

            i. Some sort of address translation hardware is placed between the
               CPU and the memory system.  This is called a memory management
               unit (MMU), and may be physically part of the CPU or may be a
               separate device.

                CPU ---- Memory management ---- Memory system

           ii. Each logical address generated by the CPU is translated into a
               physical address by the MMU.

          iii. The original incentive for doing this was multiprogramming - 
               allowing several different programs (perhaps belonging to several
               different users) to be resident in memory at the same time.

               - To prevent conflicts between the programs, it is necessary to
                 ensure that each references a distinct region of physical
                 memory.

               - However, when a program is compiled, it is generally not 
                 possible to know what region of physical memory it will
                 ultimately occupy when it is running.

               - Thus, programs might be compiled on the assumption that their 
                 range of accessible memory addresses ranges from 0 .. some 
                 upper limit.  It is the task of the MMU to translate the 
                 addresses generated by the program into the real physical 
                 addresses (which might be different if different users are
                 running the same program).

           iv. Different tasks running on the system may have the same logical 
               address mapped to different physical addresses - so any one task 
               is still of limited size, but available memory is used to
               support several tasks. 
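
         One simple way such a translation could work is sketched below in
         Python.  The base-and-limit scheme shown here is an assumption chosen
         for illustration; the text above does not commit to a particular
         mapping mechanism, and all names are hypothetical.

```python
class AddressFault(Exception):
    """Raised when a logical address lies outside the task's region."""

class RelocationMMU:
    """Assumed scheme: physical = base + logical, provided logical < limit."""
    def __init__(self, base, limit):
        self.base = base
        self.limit = limit

    def translate(self, logical):
        if not 0 <= logical < self.limit:
            raise AddressFault(hex(logical))
        return self.base + logical

# Two tasks compiled for addresses 0 .. 0x3FFF, placed in different regions.
task_a = RelocationMMU(base=0x10000, limit=0x4000)
task_b = RelocationMMU(base=0x20000, limit=0x4000)
a_phys = task_a.translate(0x0100)
b_phys = task_b.translate(0x0100)
print(hex(a_phys), hex(b_phys))
```

         The same logical address 0x0100 lands in two distinct physical
         locations, so the tasks cannot interfere with each other.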

      3. Logical space might be larger than physical space.  Here, there are
         two possibilities.

         a. Some logical addresses might simply be illegal and unused.

         b. The system might use virtual memory, which we shall discuss 
            shortly.

V. Virtual Memory Systems
-  ------- ------ -------

   A. Virtual memory systems are characterized by logical address space >
      physical address space; but rather than making some logical addresses
      unusable, disk storage is used as an extension of main memory.  (Note:
      from here on out we will call "logical address space" "VIRTUAL address
      space" in recognition of the fact that this makes the memory available
      to a program look much larger than it actually is.)

   B. The basic idea is this: only a portion of a program's virtual address
      space is actually resident in physical memory at any given time.  The
      remainder is kept on secondary storage, to be brought in when needed.

   C. As with systems where virtual space < physical space, a mapping scheme
      is used to translate virtual addresses into physical.  However, this
      scheme includes one possibility not found in the earlier case.

      1. The mapping scheme described earlier had three possibilities:

         a. Success

         b. Failure due to attempting to access an unmapped portion of
            virtual space.

         c. Failure due to writing a readonly portion of virtual space.

      2. With virtual memory, a fourth possibility is introduced: the access
         may be valid, but the portion of memory desired may not be physically
         resident.

   D. To accomplish the necessary mapping of addresses, many systems (but not
      all) divide virtual space into fixed size regions called pages, and
      physical space into fixed size regions called page frames.  (The sizes
      are equal, so one page will fit into one frame.)

      Ex: DEC VAX - the pages and page frames are 512 bytes each.

      1. A virtual address is broken into two fields: a page number and an
         offset in page.

         Ex: On the VAX, the low order 9 bits of the virtual address are the
             offset, and the remaining 23 bits comprise the page number.

      2. A physical address is likewise broken into two fields: a page frame
         number and an offset.

         Ex: On a VAX with 512 Meg of physical memory, a (valid) physical
             address would be 29 bits long.  Of these, the high order 20 bits
             are the frame number.

      3. For each user, the operating system maintains a table in memory
         called the page table.  A memory management system register holds
         the address of the first word of the page table for the user whose
         program is currently running.  The page portion of a virtual address 
         is used as an index into this table to select an entry.

         a. One portion of the entry contains various bits to control access
            (such as readonly etc.)  One key bit is a valid bit that indicates
            if the page in question is currently resident in physical memory.

         b. The rest of the entry has two possible uses:

            i. If the page is physically resident, the entry will contain
               the number of the frame where it is to be found.

           ii. Otherwise, the entry may contain an indication as to where to
               go on disk to find the page if it is needed.

         c. An example of mapping, based on the DEC VAX:

            Suppose the page table contains the following entries, among others:

            Entry No    Valid    Frame Number

            ...
            4           1        10101010101010101010101
            5           0        ----
            ...

            If the CPU generates the 32-bit virtual address

               00000000000000000000100101010101

            The memory management hardware divides this into a 23-bit page
            number and a 9-bit offset:

               00000000000000000000100 101010101

            Now, since the page number is 4, the hardware goes to entry 4 in
            the page table.  Since this is a valid entry, it extracts the
            frame number

               10101010101010101010101

            and forms the physical address

               10101010101010101010101101010101

            On the other hand, a virtual address

               00000000000000000000101101010101

            would access page table entry 5.  Since this is flagged as not
            valid, a trap to the operating system would occur.  The current
            program would be suspended while the OS gets the proper page from
            disk, loads it into an available frame, resets the page table
            entry to point to the chosen frame (with valid bit now 1), and
            then restarts the failed instruction.
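
            The mapping steps in this example can be sketched in Python.
            This is an illustration under assumptions: the page table is a
            plain dictionary, the fault handler merely pretends to have read
            the page from disk, and frame 7 is an arbitrary "free" frame.
            All function and variable names are hypothetical.

```python
PAGE_BITS = 9                       # 512-byte pages, as on the VAX
OFFSET_MASK = (1 << PAGE_BITS) - 1

class PageFault(Exception):
    """The access is legal, but the page is not resident."""

def translate(page_table, virtual_address):
    """page_table maps page number -> (valid bit, frame number)."""
    page = virtual_address >> PAGE_BITS
    offset = virtual_address & OFFSET_MASK
    valid, frame = page_table[page]
    if not valid:
        raise PageFault(page)
    return (frame << PAGE_BITS) | offset

def handle_fault(page_table, page, free_frame):
    """Stand-in for the OS: pretend the page was loaded into free_frame."""
    page_table[page] = (1, free_frame)

# Entry 4 holds the 23-bit alternating frame number 1010...101 from the text.
FRAME = int("10" * 11 + "1", 2)
page_table = {4: (1, FRAME), 5: (0, None)}

va_page4 = (4 << PAGE_BITS) | 0b101010101
physical = translate(page_table, va_page4)

va_page5 = (5 << PAGE_BITS) | 0b101010101
try:
    translate(page_table, va_page5)           # entry 5 is not valid
except PageFault as fault:
    handle_fault(page_table, fault.args[0], free_frame=7)
retried = translate(page_table, va_page5)     # restart the failed access
print(format(physical, "b"), retried)
```

            The first translation simply swaps the page number for the frame
            number and keeps the offset; the second succeeds only after the
            simulated OS has made entry 5 valid.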

   4. To make this scheme work, an additional control register is needed.

      a. In principle, the page table would have to have one entry for each
         possible page number.  For example, a VAX page table would need
         2^23 = 8 million entries, each several bytes long.  Since each
         process needs its own page table, on a system with a moderate number
         of users all of physical memory might be consumed by page tables!  

      b. In practice, the total virtual space for any given user is usually
         much, much less than the maximum implied by the architecture.  
         Therefore, the actual page table usually has only as many entries as
         are needed for the pages actually used by the program.
         Consequently, in addition to the register that contains the address of
         the beginning of a user's page table, there is usually also a
         page table size register.  Any page number that exceeds the value in
         this register causes the mapping process to fail.
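
         As a rough worked example of both points, the sketch below computes
         the size of a full page table and models the size-register check.
         Two assumptions are made that the notes do not state: each entry is
         taken to be 4 bytes ("several bytes" in the text), and the size
         register is taken to hold the number of entries, so valid page
         numbers run from 0 to size - 1.  All names are hypothetical.

```python
PAGE_NUMBER_BITS = 23  # from the text: 2^23 possible pages
ENTRY_BYTES = 4        # assumption: the text says only "several bytes"

full_table_bytes = (1 << PAGE_NUMBER_BITS) * ENTRY_BYTES
print(full_table_bytes // (1024 * 1024), "MB per process")  # 32 MB per process


class LengthViolation(Exception):
    """Raised when a page number lies beyond the end of the page table."""


def check_page_number(page, table_size_register):
    """Reject page numbers beyond the entries that actually exist."""
    if page >= table_size_register:
        raise LengthViolation(page)
    return page
```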

   5. Note that, when a page needs to be brought into main memory, the
      operating system must find a free frame to hold it.  This often means
      that a currently-resident page must be paged out to disk to make room.

      a. If the page has not been written in since it was last loaded, and
         if the original is still on disk, the page does not need to be
         written out to disk; its page table entry can be flagged as invalid
         and it can be displaced.  However, if it has been modified it must
         first be written back to disk.

      b. To facilitate this, each page table entry must include a "written
         in" bit that is set by the hardware when a write access to the page
         is done.
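
         A minimal sketch of how the "written in" bit might be used when a
         page is displaced.  The PageTableEntry class and the write_access
         and evict functions are hypothetical names for illustration; in a
         real system the bit is set by hardware and the eviction is done by
         the operating system.

```python
from dataclasses import dataclass


@dataclass
class PageTableEntry:
    valid: bool = False
    written_in: bool = False  # set by hardware on any write to the page
    frame: int = -1


def write_access(entry):
    """Model the hardware side: a write marks the page as modified."""
    entry.written_in = True


def evict(entry, write_to_disk):
    """Free the frame; copy the page out only if it was modified."""
    wrote = False
    if entry.written_in:
        write_to_disk(entry.frame)
        wrote = True
    frame = entry.frame
    entry.valid = False
    entry.written_in = False
    return frame, wrote


disk_writes = []
clean = PageTableEntry(valid=True, frame=3)
dirty = PageTableEntry(valid=True, frame=7)
write_access(dirty)
evict(clean, disk_writes.append)
evict(dirty, disk_writes.append)
print(disk_writes)  # [7]
```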

   E. An alternative to paging is segmentation.  The difference is that virtual
      space is divided into variable-size segments instead of fixed-size pages.

      1. One problem with paging is INTERNAL FRAGMENTATION of memory.  Since
         a user's actual needs seldom come to an exact whole number of pages,
         the last page allocated to each user will contain some unneeded and
         therefore wasted space.  Segmentation avoids this.

      2. Segmentation also has the advantage that the division of virtual
         space may mirror program logic - e.g. a segment may be a single
         procedure or a single large data structure.  This facilitates sharing
         of code among different users.

      3. The virtual address is now treated as a segment number plus offset
         within segment.  The segment number indexes a segment table, which is
         like a page table except that each entry must include a segment length
         field as well as written-in and valid bits and a frame address.  Also,
         since segments are variable size, frames must be variable size too;
         therefore the frame address must be a complete address, not just the
         high order bits.

   F. Segmentation runs into a physical memory fragmentation problem, since
      segments can be of any size.  

      1. In time, memory can become checkerboarded - e.g.

            --------------
            | 8 K in use |              Suppose a program needs to bring in a
            --------------              4K segment from disk.  Since a total
            | 2 K free   |              of 5K of physical memory is available,
            --------------              this should be possible.  But since
            | 6 K in use |              the 5K is in two non-contiguous
            --------------              portions, this cannot be done.
            | 3 K free   |
            --------------
            | rest in use|

      2. This is called EXTERNAL FRAGMENTATION.
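
         The checkerboard example can be reproduced with a very small
         sketch.  It uses a first-fit search over the free holes, which is
         an assumption made here for illustration; the notes do not name a
         placement policy.  The names free_holes_kb and first_fit are
         hypothetical.

```python
# Free holes from the diagram, in KB (their addresses do not matter here).
free_holes_kb = [2, 3]


def first_fit(holes, request_kb):
    """Return the index of the first hole big enough, or None."""
    for index, size in enumerate(holes):
        if size >= request_kb:
            return index
    return None


total_free = sum(free_holes_kb)
placement = first_fit(free_holes_kb, 4)
print(total_free, placement)  # 5 None
```

         Although 5K is free in total, no single hole can hold the 4K
         segment, so the request fails.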

      3. Some systems solve this problem by using segmentation with paging:
         Each segment is composed of 1 or more fixed-size pages.  The segment
         table now points to a page table for the segment.

         a. The virtual address now has three fields:

              segment number | page number within segment | offset within page

         b. The mapping process is as follows:

            i. Check segment number against segment limit.  If it is in range,
               use it as an index into the segment table.

           ii. From the segment table entry, extract the address and size of the 
               page table for the segment.

          iii. Check the page within segment to be sure it is within range.  If
               so, use it as an index into the segment's page table.

           iv. From the page table entry extract control bits plus (if the valid
               bit is set) the frame number of the physical address. 
               Concatenate the frame number and offset within page to form a 
               physical address.
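
            Steps i through iv might be modeled as follows.  The field
            widths (8-bit page number, 9-bit offset), the table contents,
            and all names are hypothetical, chosen only to keep the example
            small.

```python
PAGE_BITS = 8    # hypothetical field widths, chosen only for illustration
OFFSET_BITS = 9


class MappingFault(Exception):
    """Raised when any step of the mapping process fails."""


# Hypothetical tables.  Each segment table entry holds its page table and
# that page table's size; each page table entry is (valid, frame).
segment_table = [
    {"page_table": [(1, 0x2A), (0, None), (1, 0x07)], "size": 3},
]
segment_limit = len(segment_table)


def translate(virtual_address):
    offset = virtual_address & ((1 << OFFSET_BITS) - 1)
    page = (virtual_address >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = virtual_address >> (OFFSET_BITS + PAGE_BITS)
    if segment >= segment_limit:              # step i
        raise MappingFault("segment out of range")
    entry = segment_table[segment]            # step ii
    if page >= entry["size"]:                 # step iii
        raise MappingFault("page out of range")
    valid, frame = entry["page_table"][page]  # step iv
    if not valid:
        raise MappingFault("page fault")
    return (frame << OFFSET_BITS) | offset


# Segment 0, page 2, offset 0x155 maps to frame 0x07.
print(hex(translate((2 << OFFSET_BITS) | 0x155)))  # 0xf55
```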

      4. We have now reintroduced internal fragmentation, while getting rid
         of external fragmentation.  (Internal fragmentation is the more
         manageable problem.)  At the same time, we have retained the logical
         advantages of segmentation for code sharing etc.

      Example: The Intel 80386/80486/Pentium chips offer the choice of pure 
               segmentation or segmentation with paging - as determined by the 
               setting of a bit in a CPU control register.

               In either mode, logical addresses are composed of two parts -
               a segment selector, and an offset in segment.  The segment
               selector is contained in one of 6 CPU segment registers, and the 
               offset is computed using the addressing mode specified by the 
               instruction.  (The choice of segment register is normally 
               implicit in the type of reference being done - e.g. instruction 
               fetches use the CS segment register, references to the stack use 
               the SS segment register, and ordinary data references use the DS 
               segment register.  The programmer can override the default by 
               prefixing the instruction with a special segment prefix.)  

               The segment selector is used to reference a segment descriptor
               contained in one of two segment descriptor tables - a global
               table shared by all tasks, and a local table for each task.  One
               bit in the selector specifies which table to use, and 13 bits
               are used to select one of 8192 entries in the appropriate table.

               The descriptor, in turn, contains a segment base address, a
               segment limit (size), and some validity and protection bits.
               If the segment is valid and the access is allowed by the
               protection, then the offset is compared to the segment 
               limit; if it is <= the limit then the offset is added to the 
               segment base to form a "linear address" (i.e. an address within
               a linear, unsegmented space.)

               At this point, if paging is enabled, the resulting value is
               mapped using a two-level page table; otherwise, it is used
               directly as the physical address.
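
               A simplified sketch of the selector decoding and limit check
               described above.  The bit positions follow the Intel
               selector layout (13-bit index in the high bits, one
               table-indicator bit, and 2 low-order bits holding a requested
               privilege level, which the notes do not discuss).  The
               function names are hypothetical, and the sketch ignores the
               validity and protection checks as well as the paging step.

```python
def decode_selector(selector):
    """Split a 16-bit selector into table index, table choice, and RPL."""
    index = selector >> 3                     # 13 bits: one of 8192 entries
    use_local_table = bool(selector & 0b100)  # the table-indicator bit
    privilege = selector & 0b11               # remaining 2 low-order bits
    return index, use_local_table, privilege


def linear_address(base, limit, offset):
    """Apply the limit check, then add the segment base."""
    if offset > limit:
        raise ValueError("offset beyond segment limit")
    return (base + offset) & 0xFFFFFFFF       # 32-bit linear address


print(decode_selector(0x002B))                      # (5, False, 3)
print(hex(linear_address(0x10000, 0xFFFF, 0x1234)))  # 0x11234
```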

   G. Paging and both forms of segmentation suffer from an important
      overhead problem.

      1. Since page/segment tables are stored in memory, each memory access
         turns into multiple accesses:

         a. With paging: one access to the page table, then one to the data -
            two in all.

         b. With pure segmentation: one access to the segment table, then one
            to the data - two in all.

         c. With segmentation with paging: one access to the segment table, one
            to the segment's page table, and one to the data - three in all!
            (Or on the 80x86 four in all, because the page table has two 
            levels!)

      2. This overhead would, of course, be intolerable.  There are two ways
         to avoid or reduce it.

         a. If page tables can be kept small, then the page table can be kept 
            in high-speed registers rather than memory.  

            i. The 80x86 does something like this with the segment table.  The
               CPU contains 6 segment registers, each of which can be loaded
               with a segment number.  Most instructions implicitly use one of 
               these segment registers in forming a logical address - e.g. code 
               references in branches etc. are always made using the code 
               segment register (CS).  That is, most instructions generate 
               only the 32 bit offset portion of the address directly.

               As a hidden step, whenever a program loads a segment register the
               CPU also fetches the corresponding segment descriptor from the
               segment table and stores it in a hidden part of the segment
               register.  Thus, the descriptor is always available when needed.

           ii. Keeping an entire table in registers is generally not feasible
               for virtual memory page tables, which tend to be quite large 
               because virtual memory is intended to support large address 
               spaces.

         b. An alternative (that is more commonly used) is a set of registers
            to store the most recently used virtual -> physical address
            translations.  These registers are sometimes called a translation
            look-aside buffer (TLB).  They are organized as an approximation to
            an associative memory (like a cache).  When translating a virtual 
            address, the hardware first looks to see if the translation is 
            available in the set of registers.  If so, no additional memory 
            accesses are needed.

                        Pc -- K.mapping -- Mp
                                  |
                                M.translation look-aside
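
            The behavior of such a buffer can be modeled roughly as below.
            This sketch assumes a fully associative buffer with
            least-recently-used replacement, which is one possible
            organization and not necessarily what any particular machine
            does.  The class name and the sample page table are
            hypothetical.

```python
from collections import OrderedDict


class TLB:
    """Small fully associative cache of page -> frame translations."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # stands in for the table in memory
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)    # mark as most recently used
            return self.entries[page]
        self.misses += 1                      # costs an extra memory access
        frame = self.page_table[page]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # drop least recently used
        self.entries[page] = frame
        return frame


tlb = TLB(capacity=2, page_table={0: 8, 1: 9, 2: 10})
for page in [0, 0, 1, 0, 2, 1]:
    tlb.lookup(page)
print(tlb.hits, tlb.misses)  # 2 4
```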

      3. Cache memory can also help, if it is installed between the CPU and the
         mapping hardware.  Now, only memory references that miss in the cache
         are translated at all - typically less than 10% of all references.

   H. Page Replacement Schemes

      1. In any virtual memory system, we must have a policy for deciding which
         page is to be removed from memory to provide a page frame for a newly
         requested page when a page-fault occurs.  When a process is just
         starting up, we normally give it an initial page frame allocation; and
         as it needs more room we give it an additional page frame each time it
         faults, so that the total allocation for the process grows dynamically.
         However, there must be an upper limit set for the process, and there 
         must be some mechanism for shrinking the allocation to a process 
         whose requirements have decreased.
   
      2. This is primarily a concern of a course on operating systems.  However,
         certain schemes require various sorts of hardware assistance,
         which we consider now.

      3. The optimal policy would be to replace the page whose next reference 
         is furthest in the future, treating a page that will not be referenced
         again at all as having infinite time until its next reference.
   
         a. It can be shown that this policy would minimize page faults.

         b. But, except in very rare situations, it is totally impractical.

         c. The other policies that are used in practice are approximations 
            aimed at achieving close to the same effect.
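
         The optimal policy is easy to state in code when the whole
         reference string is known in advance, which is exactly what makes
         it impractical in a real system.  The sketch below is only an
         offline illustration; the function names and the sample reference
         string are hypothetical.

```python
def optimal_victim(resident_pages, future_references):
    """Pick the resident page whose next use lies furthest in the future."""
    def next_use(page):
        try:
            return future_references.index(page)
        except ValueError:
            return float("inf")  # never referenced again
    return max(resident_pages, key=next_use)


def count_faults(references, frame_count):
    """Count page faults for a known reference string under this policy."""
    resident, faults = [], 0
    for position, page in enumerate(references):
        if page in resident:
            continue
        faults += 1
        if len(resident) >= frame_count:
            victim = optimal_victim(resident, references[position + 1:])
            resident.remove(victim)
        resident.append(page)
    return faults


print(count_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))  # 7
```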

      4. All schemes require a per-page written-in bit, so that if a page is to
         be removed from memory it can be rewritten to disk only if it has been 
         modified.  (In fact, many schemes give priority to removing from memory
         pages which have not been modified so as to reduce disk traffic.  If a 
         modified page is kept around long enough one of two things will happen:
         it will be read again (and hence the write-out and read-in traffic is 
         saved), or the task using it will complete, and it need not be written 
         at all.)  Normally a copy of this bit is maintained in the TLB.  When a
         write occurs to a page already in the TLB, if the bit there is set
         then no update to the page table need occur; otherwise, the
         written-in bit is set both in the TLB and in the page table.

      5. Several schemes make use of a per-page "referenced" bit - normally 
         stored in the page table along with the valid bit and written-in bit.

         a. When the page is first brought into memory - and perhaps at other
            times as well - the operating system software clears this bit.

         b. When any reference is made to the page, the hardware sets the bit
            automatically.  Thus, the operating system can determine whether any
            reference has been made to the page within the interval since it
            last cleared the bit.
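
            One well-known scheme built on this bit is the clock (or
            second-chance) algorithm.  It is not described in these notes
            and is shown here only as an example of how software might use
            the bit; the function name and the list of bits are
            hypothetical.

```python
def choose_victim(referenced, hand):
    """Sweep from 'hand', clearing referenced bits, until a clear one is found.

    'referenced' is a list of per-frame bits that hardware would set.
    Returns (victim frame index, new hand position).
    """
    while True:
        if referenced[hand]:
            referenced[hand] = 0  # give the page a second chance
            hand = (hand + 1) % len(referenced)
        else:
            return hand, (hand + 1) % len(referenced)


bits = [1, 1, 0, 1]
victim, hand = choose_victim(bits, 0)
print(victim, hand, bits)  # 2 3 [0, 0, 0, 1]
```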

      6. To summarize: hardware support for the operating system's page
         replacement scheme includes:

         a. Minimally - a per-page "valid" bit and "written-in" bit.
         b. In some cases, a per-page "referenced" bit.
Copyright ©2003 - Russell C. Bjork