CS311 Lecture: Memory Hierarchies last revised 11/13/03
I. Introduction
- ------------
A. In the previous lecture, we looked at the basic building blocks of
memory systems: the individual devices: chips, disks, tapes etc.
B. We now focus on complete memory systems. We will see that memory systems
are seldom composed of just one type of memory; instead, they are
HIERARCHICAL systems composed of a mixture of technologies aimed at
achieving a good tradeoff between speed, capacity, cost, and physical
size.
II. The memory hierarchy
-- --- ------ ---------
A. Since every instruction executed by the CPU requires at least one
memory access (to fetch the instruction) and often more, the performance
of memory has a strong impact on system performance.
B. With present technologies, it turns out to be possible to build very fast
memories - that are quite expensive - or to build quite inexpensive
memories - that are relatively slow. The following table summarizes the
currently available technologies as they might be used on a moderate
desktop system with a 1 GHz CPU. (All columns are order of magnitude)
Technology            Typical quantity in a     Access time       Cost ($/byte)
                      desktop system (bytes)

CPU registers and     < 10^5                    < 1 ns            - (can't buy
"on chip" memory                                                  separately)
Static RAM            < 10^6                    3-10 ns           >= 10^-4
Dynamic RAM           < 10^9                    ~ 10^2 ns         >= 10^-7
Non-removable disk    < 10^11                   ~ 10 ms           >= 10^-9
                                                (= 10^7 ns)
Removable media       (potentially unlimited)   > 1 sec (due      >= 10^-9
(disk/tape)                                     to mounting)
C. Unfortunately, technological realities come into conflict with
system requirements at this point.
1. Since each instruction executed requires at least one memory access
(to fetch the instruction) and possibly two, the memory system must
support fast accesses. For example, on a RISC with a 1 GHz clock,
we must be able to access an instruction every clock cycle -
implying an access time requirement of 1 ns. Note that only
memory on the CPU chip meets this requirement.
2. If the system is supporting multiple concurrent tasks, and even
concurrent users (multi-user system or server), it may require
hundreds of megabytes or even gigabytes of memory to be available -
possibly more than even available Dynamic RAM can hold.
3. To address this problem, sophisticated systems will have a HIERARCHY
of memories, using several different technologies, with smaller
quantities of very fast memory and larger quantities of slow memory.
The hierarchy can be pictured this way. (Not to scale)
_________________________________
Registers | CPU registers (part of CPU) |
----------------|-------------------------------|---------------
"Memory" | Memory on CPU chip (L1 cache) |
|-------------------------------|
| Static RAM chips (L2 cache) |
|-------------------------------|
| Dynamic RAM (main memory) |
| |
| (Note: this region is |
| referred to as Mp - primary |
| memory) |
| |
| |
|-------------------------------|
| Disk (virtual memory) |
| |
| (Note: this region is |
| referred to as Ms - |
| secondary memory) |
| |
| |
| |
: :
: :
----------------|-------------------------------|---------------
File system | Disk and tape |
: :
: :
4. Note that the portion described as "memory" corresponds to memory as
it is viewed by the assembly language program (locations that can be
specified by the operand address part of an instruction such as lw or
sw.) Within this portion, only main memory is absolutely needed.
a. Cache memory serves to provide fast access to a subset of
memory that is needed often, and contains copies of frequently
accessed locations in main memory. Without this, a CPU clock
rate in excess of 10-20 MHz would be wasted, because the memory
system couldn't keep up with the need for instruction accesses!
b. Virtual memory serves to provide additional capacity beyond what
is physically available in main memory.
c. Both look to the programmer like main memory, but are not
physically implemented as such.
5. Note, too, that disk plays two roles - as part of the memory system
and as part of the file system, and that tape is treated here as part
of the file system. This is because disk and tape are often accessed
from programs using IO statements, and are physically connected into
the system as part of the IO system. We consider only the virtual
memory role for disk in this lecture.
D. This hierarchy can provide very good performance by taking advantage of
the principle of LOCALITY OF REFERENCE:
1. TEMPORAL LOCALITY: if one observes the memory references generated
by a program during a short window of time, it will be the case that
most of the references will pertain to a small subset of the total
address space of the program.
a. This happens because most of the execution time of a program is spent
executing loops.
b. Within a loop, the same body of code is being executed over and
over; in addition, it is usually the case that some or all of
the data items being accessed are accessed over and over.
2. SPATIAL LOCALITY: often, successive references to memory refer to
locations that are physically adjacent to one another.
a. Absent branch instructions, code in successive locations in
memory is executed in sequence.
b. Data structures such as arrays and objects are stored in successive
locations in memory.
3. The basic idea, then, is to keep in the fastest memory in the
hierarchy those data items that are being used currently, with the
moderately fast memory used for items that will be needed soon and
slow memory used for items that will not be needed until the distant
future. When we copy information to a faster level in the hierarchy,
we may try to copy a larger block of data, anticipating that
other locations in the same block will be needed soon.
a. Assembly language programmers and optimizing compilers try to
make good use of the CPU registers toward that end. This will not
be something we will discuss here.
b. We will focus on the memory system proper.
4. Note that a memory hierarchy is based on a subset relationship - any
information that is found in a higher level of the hierarchy is also
found in all lower levels. (This may become false when we do data
writes, however, as we shall see.)
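To make locality concrete, here is a minimal Python sketch that counts how many distinct 8-byte lines a loop actually touches. The trace is hypothetical (a 10-instruction loop body at 0x1000 and a 100-word array at 0x8000 are assumptions for illustration only); the point is that many references land in a small part of the address space.

```python
def distinct_lines(addresses, line_bytes=8):
    """Number of distinct line-sized blocks touched by a list of addresses."""
    return len({a // line_bytes for a in addresses})

# Hypothetical trace: a 10-instruction loop body (4 bytes each) at 0x1000,
# executed 100 times, reading one successive 4-byte array element per pass.
trace = []
for i in range(100):
    trace += [0x1000 + 4 * k for k in range(10)]   # instruction fetches
    trace.append(0x8000 + 4 * i)                   # one array element

print(len(trace), distinct_lines(trace))
# 1100 55
```

The 1100 references touch only 55 lines: 5 for the loop body (temporal locality) and 50 for the array, where each line fetched serves two successive elements (spatial locality).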
E. We will proceed soon to discuss cache memory and virtual memory in
detail. First, though, we must consider a little more of the interface
between the CPU and the memory system as a whole.
1. As a user or system program is running, it generates a stream of memory
references for instruction fetch, operand fetch, and result stores.
These take the form of an address plus a read/write specifier - e.g.
CPU ---> memory
Read 1703
Write 3424
Read 1704
2. The majority of requests are reads
a. Instruction fetches: every instruction involves at least one read
b. In evaluating expressions, intermediate results are typically kept
in CPU registers with a final store at the end.
Ex: X = A + B * C - D
would need (on a load-store or a 1-address plus general register
machine):
5 instruction fetches
4 operand fetches
1 store
So 90% of the accesses are reads. (70% - 90% reads is typical)
c. Therefore, the primary design goal in the memory system is to
optimize reads from memory - without, of course, unduly penalizing
writes.
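The tally in the example above can be checked with a short Python sketch. The five-instruction sequence and its mnemonics are assumptions chosen to match a 1-address plus general register machine; they are not taken from any particular instruction set.

```python
# Assumed instruction sequence (hypothetical mnemonics) for X = A + B * C - D:
#   LOAD B ; MUL C ; ADD A ; SUB D ; STORE X
instructions = [
    ("LOAD", "B", "read"),
    ("MUL", "C", "read"),
    ("ADD", "A", "read"),
    ("SUB", "D", "read"),
    ("STORE", "X", "write"),
]

def read_fraction(program):
    """Return the fraction of memory accesses that are reads."""
    reads = writes = 0
    for _opcode, _operand, kind in program:
        reads += 1                  # instruction fetch
        if kind == "read":
            reads += 1              # operand fetch
        else:
            writes += 1             # result store
    return reads / (reads + writes)

print(read_fraction(instructions))
# 0.9
```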
III. Cache Memory
--- ----- ------
A. One technique used to improve overall memory system performance is CACHE
MEMORY.
1. At one time, cache memory was a feature generally found only in
higher-end computer systems. However, as CPU speeds have continued to
increase while memory speeds have not, cache memory has become a
necessity on even desktop computer systems.
a. About 10-15 years ago, common CPU clock speeds were on the
order of 8-16 MHz, and DRAM cycle times on the order of 70-80 ns.
Under these circumstances, it would be possible to perform a
memory access every 1-2 clock cycles.
b. Today, CPU clocks have gone above 1GHz, while DRAM access time
has improved only slightly, to about 60 ns. Thus, an access to
main memory takes on the order of 60 clocks or more!
c. Since a RISC is designed to execute one instruction per clock,
and since each instruction must be fetched from memory, it would
seem that RISC's couldn't function at the clock speeds they do
using memory of the sort available today.
d. Execution of one (or more) instructions per clock is critically
dependent on the use of cache memory.
2. Cache memory is a small, high-speed memory, logically located between
the CPU and the rest of the memory system.
a. At one point in time, cache memory was usually separate from the
CPU.
b. Today's high-speed CPU's depend on having cache memory on the CPU
chip that operates at the same speed as the CPU itself. This has
been made possible by improved chip manufacturing techniques that
allow more transistors per chip.
c. Most systems now use a two level cache, with a small primary cache
on the CPU chip and a larger, separate (and slower) secondary cache.
(The need for this is dictated by how much cache can actually fit
on the CPU chip, due to space and power/heat considerations.)
(Some systems have a three level cache)
d. It is also common to find - at least at the primary cache level -
that separate caches are used for instructions and data. This
facilitates having separate paths to memory for the instruction
fetch unit and the data memory access unit of a pipelined machine.
3. Cache memory works because of the phenomenon of locality of reference.
a. Each memory read is first tried against the cache. If the data is
found there (a "hit"), the processor can proceed at maximum speed.
b. Otherwise, we have a cache miss and the processor must wait for a
slower access to secondary cache or (if there is a miss there
too) to main memory.
c. Well-designed caches can achieve hit rates of 95% or more much of
the time.
d. To see why this is important, suppose we have a 1 GHz CPU with
a memory system that requires 100 ns to do an access (including
bus overhead) and a single, on-chip cache.
i. Theoretically, the time to execute an instruction is 1 ns.
ii. Suppose, however, that the cache has a hit rate of 90%. Then
90% of instructions can be fetched in 1 ns, but 10% require
100 ns. Now the average time per instruction becomes
.9 x 1 ns + .1 x 100 ns = 10.9 ns - which is equivalent to
reducing the clock rate by a factor of almost eleven! (Plus
any loss of time due to data cache misses.)
iii. With a 95% hit rate primary cache, and a 10 ns secondary
cache that hits 90% of the primary cache misses, we get an
average instruction fetch time of
.95 x 1 ns + (.05)(.9) x 10 ns + (.05)(.1) x 100 ns = 1.9 ns
(which is like cutting the clock speed in half)
iv. Of course, higher hit ratios - especially on the primary
cache - further improve performance. Nonetheless, CPU clock
rate alone tells you little about overall execution speed
without some knowledge of how the caches perform!
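The two averages worked out above can be reproduced with a small Python sketch. It uses the same simplified model as the text (an access that hits at some level costs only that level's access time), and the function name and argument layout are illustrative assumptions.

```python
def average_access_time(levels, memory_time_ns):
    """levels: list of (hit_rate, access_time_ns), fastest cache first.
    Each hit rate is local: the fraction of the accesses reaching that
    level that hit there. Misses at the last level go to main memory."""
    total = 0.0
    reach = 1.0                      # fraction of accesses reaching this level
    for hit_rate, time_ns in levels:
        total += reach * hit_rate * time_ns
        reach *= 1.0 - hit_rate
    return total + reach * memory_time_ns

one_level = average_access_time([(0.90, 1.0)], 100.0)
two_level = average_access_time([(0.95, 1.0), (0.90, 10.0)], 100.0)
print(round(one_level, 2), round(two_level, 2))
# 10.9 1.9
```

Trying a 99% primary hit rate in the same function shows how sensitive the average is to the first-level cache.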
B. In principle, a cache is an associative memory containing pairs of the
form:
memory address value
(Note: On a byte-addressable machine, it is common to have each entry in
the cache contain several successive bytes - called a line or block.
For example, many systems use a cache based on entries holding 8
consecutive bytes of data. Thus, the lower three bits of an address
would be ignored when looking up an entry, but once the entry is found
they would be used to select the correct portion of it.)
1. However, a large fully associative memory is impractical. For example,
in a cache of 10,000 entries, each address emitted by the CPU would
have to be compared to all 10,000 entries at the same time. Even if
the required number of comparators could be economically built, the
incoming address would have to drive 10,000 logic loads. This would
require several layers of buffering (since a typical gate output can
drive about 10 others), which would inject intolerable delays. So one
of several approximations to associative memory is used.
2. Direct mapping.
a. The cache is constructed as a set of pairs of the form tag, value.
The data portion of the pair is called a CACHE LINE, or CACHE
BLOCK. Typically, the size of the line or block corresponds to the
system data bus size - e.g. 4-16 bytes - for first-level cache,
but may be much bigger for a second-level cache (as big as, say,
128 bytes - this to take advantage of spatial locality).
b. The number of entries is always a power of 2. Suppose it is 2^n.
c. Then the address from the CPU is broken down as follows, assuming
each line contains 8 bytes of data:
tag                 | entry number | position in entry
(remaining bits)    | n bits       | 3 bits
i. Bits 3 .. n + 2 of the address select one of the entries in
the cache. If the tag of that entry matches the rest of the
address, we have a hit. (The tag comparison is needed because
many addresses map to the same cache entry.)
ii. Otherwise, a complete line of data is obtained from memory.
It is stored (along with the tag portion of its address) in the
cache for future use, replacing the entry currently there.
iii. Note that this scheme implies that at any time at most one entry
with any given pattern in its n entry-number bits can be in the
cache. This is usually not a problem - but may be if a loop
calls a procedure whose address is some multiple of 2^(n+3) bytes
(the total cache size, given 8-byte lines) away from the loop
body, or if there are accesses in a loop to two data structures
whose addresses are some multiple of 2^(n+3) bytes apart.
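A minimal Python sketch of the direct mapping scheme, assuming 8-byte lines as in the breakdown above. The class and method names are illustrative, and only tags are modeled (no data), which is enough to show hits, misses, and the conflict problem.

```python
LINE_BITS = 3  # 8-byte lines, as in the example above

class DirectMappedCache:
    def __init__(self, n):
        self.n = n                           # the cache has 2**n entries
        self.tags = [None] * (1 << n)        # None means "entry not valid"

    def split(self, address):
        """Return (tag, entry number, position in entry)."""
        offset = address & ((1 << LINE_BITS) - 1)
        entry = (address >> LINE_BITS) & ((1 << self.n) - 1)
        tag = address >> (LINE_BITS + self.n)
        return tag, entry, offset

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, entry, _ = self.split(address)
        if self.tags[entry] == tag:
            return True
        self.tags[entry] = tag               # replace whatever was there
        return False

cache = DirectMappedCache(n=4)               # 16 entries x 8 bytes = 128 bytes
a, b = 0x1000, 0x1000 + 128                  # same entry number, different tags
print([cache.access(x) for x in (a, a + 4, b, a)])
# [False, True, False, False]
```

The last access misses even though `a` was loaded earlier: `b` lies exactly one cache size away, maps to the same entry, and displaced it.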
3. Set Associative
a. This is an improvement on direct mapping. It can address the
problem we just discussed.
i. The cache entries are divided into sets - typically involving 2
or 4 entries.
ii. The low order bits of the address select not one entry in the
cache, but a set of entries. The tags for each entry are
compared in parallel with the incoming address, and if one
matches there is a hit.
iii. Note that, for a given total cache size, a set-associative
cache with set size two contains half as many sets as the
direct mapping cache contains lines - but each set can hold
information from two different places in memory.
b. When a reference is not found in the cache, one of the entries
in the set must be replaced. This is typically done either in
FIFO fashion, LRU fashion, or randomly.
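The following Python sketch models a set associative cache with LRU replacement (one of the three policies just mentioned). Names and parameters are illustrative assumptions, and as before only tags are tracked.

```python
from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, num_sets, ways, line_bytes=8):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        # One OrderedDict per set; keys are tags, least recently used first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        line = address // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)          # mark as most recently used
            return True
        if len(entries) == self.ways:
            entries.popitem(last=False)       # evict the least recently used
        entries[tag] = True
        return False

# Same total size (16 lines) as a 16-entry direct mapping cache: 8 sets x 2.
cache = SetAssociativeCache(num_sets=8, ways=2)
a, b = 0x1000, 0x1000 + 128
print([cache.access(x) for x in (a, b, a, b)])
# [False, False, True, True]
```

With `ways=1` and 16 sets the same class behaves as a direct mapping cache, and the same four accesses all miss.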
C. Issues with regard to caches
1. Replacement policies
a. Because of its size, the cache can only store a small fraction of
all the addresses lying in the address space of the current program.
b. Thus, when a reference occurs to an item that is not in the cache,
some item currently in the cache must be removed to make room for
the item.
c. With a direct mapping cache, there is no choice involved, since
each address maps to one and only one cache location. As noted
above, with a set associative scheme LRU, FIFO, or random may be
used.
2. Write-through versus write-back
a. What happens in the cache when a memory access is a write rather
than a read (and the location referenced is in the cache)?
b. In a write-through cache, the data is written to Mp and to cache
at the same time. This slows the system down some (though the CPU
can go on to the next instruction while the memory write occurs),
but not drastically since writes are proportionally rare.
c. In a write-back cache, the data is written only to the cache.
A "written in" bit is associated with the item, so that when it is
selected for replacement by a new entry, it is then written to
Mp. This avoids waiting for multiple writes to a single location
in Mp; but there is a potential problem if a DMA IO device is to
access data that has not yet been written back. This can be
handled by forcing the cache to be flushed to main memory before
a DMA operation is initiated.
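A rough Python sketch of the difference in main memory traffic between the two policies. It assumes every write hits in the cache and that all lines are eventually replaced; the function name and the sample addresses are hypothetical.

```python
def memory_writes(write_addresses, policy, line_bytes=8):
    """Count writes to main memory (Mp) for a series of CPU writes that all
    hit in the cache, followed by replacement of every cached line."""
    if policy == "write-through":
        return len(write_addresses)           # every write also goes to Mp
    if policy == "write-back":
        dirty_lines = {addr // line_bytes for addr in write_addresses}
        return len(dirty_lines)               # one write per written-in line
    raise ValueError(f"unknown policy: {policy}")

writes = [0x100, 0x100, 0x104, 0x100, 0x200]  # repeated stores to two lines
print(memory_writes(writes, "write-through"), memory_writes(writes, "write-back"))
# 5 2
```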
3. Validity of cache items.
a. When the system is first started up, or when there is a change of
user in a multiprogrammed environment, the cache will not contain
valid data until a sufficient number of reads have been done.
Therefore, each entry in the cache must contain a valid bit - to
be cleared initially, and to be set when an entry is copied from
Mp. (When the valid bit is clear, the cache misses on any
attempt that maps to that item. Of course, in a set associative
cache each member of a set must have its own bit.)
b. There is often a provision for the operating system to invalidate
cache entries when a context change to a new user is done.
(If the cache is write-back, this also results in changed entries
being written back to main memory.)
4. Set size (for set associative caches).
a. When designing a set associative cache, a set size must be chosen
(generally a power of 2). (Note that using a larger set size means
there can be fewer sets for a given total cache size.)
b. A set size of two is commonly used, because it is simpler to build.
A set size of four has been found experimentally to give marginally
better hit/miss performance. Experimental evidence suggests set
sizes greater than 4 produce no significant gain.
IV. Mapping Logical Memory Addresses to Physical Memory Addresses
-- ------- ------- ------ --------- -- -------- ------ ---------
A. We noted earlier that the CPU generates a stream of addresses
representing accesses to the memory.
B. We will speak of the stream of addresses generated by the CPU as
LOGICAL ADDRESSES. (Sometimes the term virtual address is used, even if
the system is not a virtual memory system - but we will reserve the latter
term for virtual memory systems per se.)
1. We call them LOGICAL because many systems apply some mapping function
to the address before actually presenting it to the physical memory.
2. Logical addresses are generated by the following mechanisms:
a. Contents of the PC: instruction fetches
b. Computed operand addresses for load and store instructions.
C. Physically, main memory is organized as an array of individual
addressable units (bytes or words). These units are numbered 0 ..
total size-1. We call this numbering a PHYSICAL ADDRESS.
D. We call the range of possible logical addresses the LOGICAL ADDRESS
SPACE, and the range of possible physical addresses the PHYSICAL ADDRESS
SPACE.
1. The logical address space is dictated by the CPU architecture. For
example, a CPU that has a 32 bit PC and 32 bit registers will
generally have a logical address space of 4GB.
2. The physical address space is dictated by the number and capacity of
memory chips actually installed.
E. There are a number of possibilities for the relationship between these
two spaces:
1. They might be equal.
a. This is, for example, true on small microcomputer systems,
particularly those using 8 or 16-bit CPU's. It was also the case
on early mainframes and minis.
b. Conceptually, this is the simplest possibility. The address
output of the CPU is simply connected directly to the address bus
of the memory system.
c. But it is inflexible. The system supports a single size of
memory, with no room for expansion.
d. In multiprogrammed systems, partitioning memory is a problem.
Each user's program must be compiled to run in a specific partition
of memory, unless relative addressing is used throughout.
2. Logical space might be smaller than physical space. This possibility
is more a matter of historical significance than something that is
true of modern systems; however, it was in the context of these
sorts of systems that fundamental concepts of memory management were
developed that are still very much in use today.
a. Example: The PDP-11 had a 16-bit logical address (64K); but
PDP-11 systems used either an 18 or 22 bit physical address bus
(256K or up to 4Meg).
b. PC's based on Intel 8086/80186/80286 chips used a 20-bit logical
address (1 Meg with 640K available for RAM) but used various schemes
to allow access to a physical address space as big as 4 Meg.
c. To make use of all of physical memory, some sort of mapping scheme
became necessary.
i. This requires address translation hardware between the CPU and
the memory system. This is called a memory management unit (MMU),
and may be physically part of the CPU or may be a separate device.
CPU ---- Memory management ---- Memory system
ii. Each logical address generated by the CPU is translated into a
physical address by the MMU.
iii. The original incentive for doing this was multiprogramming -
allowing several different programs (perhaps belonging to several
different users) to be resident in memory at the same time.
- To prevent conflicts between the programs, it is necessary to
ensure that each references a distinct region of physical
memory.
- However, when a program is compiled, it is generally not
possible to know what region of physical memory it will
ultimately occupy when it is running.
- Thus, programs might be compiled on the assumption that their
range of accessible memory addresses ranges from 0 .. some
upper limit. It is the task of the MMU to translate the
addresses generated by the program into the real physical
addresses (which might be different if different users are
running the same program).
iv. Different tasks running on the system may have the same logical
address mapped to different physical addresses - so any one task
is still of limited size, but available memory is used to
support several tasks.
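One simple way an MMU could perform this translation is base-plus-limit relocation, sketched below in Python. This particular mechanism is an assumption chosen for illustration (the text above does not commit to one), and the class name and addresses are hypothetical.

```python
class RelocationMMU:
    """Hypothetical MMU mapping logical 0..limit-1 onto physical
    base..base+limit-1 for the task that is currently running."""
    def __init__(self, base, limit):
        self.base, self.limit = base, limit

    def translate(self, logical):
        if not 0 <= logical < self.limit:
            raise MemoryError(f"logical address {logical:#x} out of range")
        return self.base + logical

# Two tasks compiled to start at logical address 0, placed in distinct regions.
task_a = RelocationMMU(base=0x10000, limit=0x4000)
task_b = RelocationMMU(base=0x20000, limit=0x4000)
print(hex(task_a.translate(0x1234)), hex(task_b.translate(0x1234)))
# 0x11234 0x21234
```

The same logical address maps to different physical addresses for the two tasks, and an address at or beyond the limit is rejected rather than reaching another task's memory.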
3. Logical space might be larger than physical space. Here, there are
two possibilities.
a. Some logical addresses might simply be illegal and unused.
b. The system might use virtual memory, which we shall discuss
shortly.
V. Virtual Memory Systems
- ------- ------ -------
A. Virtual memory systems are characterized by logical address space >
physical address space; but rather than making some logical addresses
unusable, disk storage is used as an extension of main memory. (Note:
from here on out we will call "logical address space" "VIRTUAL address
space" in recognition of the fact that this makes the memory available
to a program look much larger than it actually is.)
B. The basic idea is this: only a portion of a program's virtual address
space is actually resident in physical memory at any given time. The
remainder is kept on secondary storage, to be brought in when needed.
C. As with systems where virtual space < physical space, a mapping scheme
is used to translate virtual addresses into physical. However, this
scheme includes one possibility not found in the earlier case.
1. The mapping scheme described earlier had three possibilities:
a. Success
b. Failure due to attempting to access an unmapped portion of
virtual space.
c. Failure due to writing a read-only portion of virtual space.
2. With virtual memory, a fourth possibility is introduced: the access
may be valid, but the portion of memory desired may not be physically
resident.
D. To accomplish the necessary mapping of addresses, many systems (but not
all) divide virtual space into fixed size regions called pages, and
physical space into fixed size regions called page frames. (The sizes
are equal, so one page will fit into one frame.)
Ex: DEC VAX - the pages and page frames are 512 bytes each.
1. A virtual address is broken into two fields: a page number and an
offset in page.
Ex: On the VAX, the low order 9 bits of the virtual address are the
offset, and the remaining 23 bits comprise the page number.
2. A physical address is likewise broken into two fields: a page frame
number and an offset.
Ex: On a VAX with 512 Meg of physical memory, a (valid) physical
address would be 29 bits long. Of these, the high order 20 bits
are the frame number.
3. For each user, the operating system maintains a table in memory
called the page table. A memory management system register holds
the address of the first word of the page table for the user whose
program is currently running. The page portion of a virtual address
is used as an index into this table to select an entry.
a. One portion of the entry contains various bits to control access
(such as readonly etc.) One key bit is a valid bit that indicates
if the page in question is currently resident in physical memory.
b. The rest of the entry has two possible uses:
i. If the page is physically resident, the entry will contain
the number of the frame where it is to be found.
ii. Otherwise, the entry may contain an indication as to where to
go on disk to find the page if it is needed.
c. An example of mapping, based on the DEC VAX:
Suppose the page table contains the following entries, among others:
Entry No    Valid    Frame Number
...
4           1        10101010101010101010101
5           0        ----
...
If the CPU generates the 32-bit virtual address
00000000000000000000100101010101
The memory management hardware divides this into a 23-bit page number
and a 9-bit offset:
00000000000000000000100 101010101
Now, since the page number is 4, the hardware goes to entry 4 in
the page table. Since this is a valid entry, it extracts the
frame number
10101010101010101010101
and forms the physical address
10101010101010101010101101010101
On the other hand, a virtual address
00000000000000000000101101010101
would access page table entry 5. Since this is flagged as not
valid, a trap to the operating system would occur. The current
program would be suspended while the OS gets the proper page from
disk, loads it into an available frame, resets the page table
entry to point to the chosen frame (with valid bit now 1), and
then restarts the failed instruction.
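The mapping example can be traced with a short Python sketch. The dictionary-based page table and the PageFault exception are illustrative stand-ins for the in-memory table and the trap to the operating system; only the 512-byte page size and the two table entries come from the example.

```python
PAGE_BITS = 9                                   # 512-byte pages

class PageFault(Exception):
    """Raised where the hardware would trap to the operating system."""

def translate(virtual_address, page_table):
    """page_table maps page number -> (valid bit, frame number)."""
    page = virtual_address >> PAGE_BITS
    offset = virtual_address & ((1 << PAGE_BITS) - 1)
    valid, frame = page_table[page]
    if not valid:
        raise PageFault(page)
    return (frame << PAGE_BITS) | offset        # frame number, then offset

page_table = {4: (1, 0b10101010101010101010101), 5: (0, None)}
va = (4 << PAGE_BITS) | 0b101010101             # page 4, offset 101010101
print(f"{translate(va, page_table):032b}")
# 10101010101010101010101101010101
```

Translating an address in page 5 raises PageFault, which corresponds to the point where the OS would fetch the page from disk and restart the instruction.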
4. To make this scheme work, an additional control register is needed.
a. In principle, the page table would have to have one entry for each
possible page number. For example, a VAX page table would need
2^23 = 8 million entries, each several bytes long. Since each
process needs its own page table, on a system with a moderate number
of users all of physical memory might be consumed by page tables!
b. In practice, the total virtual space for any given user is usually
much, much less than the maximum implied by the architecture.
Therefore, the actual page table usually only has as many entries as
are needed for all of the pages actually used by the program.
Therefore, in addition to the register that contains the address of
the beginning of a user's page table, there is usually also a
page table size register. Any page number that exceeds the value in
this register causes the mapping process to fail.
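The size argument can be checked with a few lines of Python. The 4-byte entry size is an assumption (the text says only "several bytes"), and check_page_number is a hypothetical stand-in for the comparison against the page table size register.

```python
VIRTUAL_BITS, PAGE_BITS = 32, 9
ENTRY_BYTES = 4                                  # assumed entry size

full_entries = 1 << (VIRTUAL_BITS - PAGE_BITS)   # one entry per possible page
full_table_bytes = full_entries * ENTRY_BYTES
print(full_entries, full_table_bytes // 2**20)
# 8388608 32

def check_page_number(page, table_length):
    """Mimic the page table size register: fail if the page is out of range."""
    if page >= table_length:
        raise IndexError(f"page {page} is beyond the page table ({table_length} entries)")
    return page

# A program using 1 MiB of virtual space needs far fewer entries.
small_entries = (1 << 20) >> PAGE_BITS
print(small_entries, small_entries * ENTRY_BYTES)
# 2048 8192
```

Under these assumptions a full table would take 32 MiB per process, while a table sized to a 1 MiB program takes 8 KiB.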
5. Note that, when a page needs to be brought into main memory, the
operating system must find a free frame to hold it. This often means
that a currently-resident page must be paged out to disk to make room.
a. If the page has not been written in since it was last loaded, and
if the original is still on disk, the page does not need to be
written out to disk; its page table entry can be flagged as invalid
and it can be displaced. However, if it has been modified it must
first be written back to disk.
b. To facilitate this, each page table entry must include a "written
in" bit that is set by the hardware when a write access to the page
is done.
E. An alternative to paging is segmentation. The difference is that virtual
space is divided into variable-size segments instead of fixed-size pages.
1. One problem with paging is INTERNAL FRAGMENTATION of memory. Since
a user's actual needs seldom comprise an exact whole number of pages,
the last page allocated to each user will contain some unneeded and
therefore wasted space. Segmentation avoids this.
2. This also has the advantage that the division of virtual space may
mirror program logic - e.g. a segment may be a single procedure or a
single large data structure. This facilitates sharing of code among
different users.
3. The virtual address is now treated as a segment number plus offset
within segment. The segment number indexes a segment table, which is
like a page table except that each entry must include a segment length
field as well as written-in and valid bits and a frame address. Also,
since segments are variable size, frames must be variable size too;
therefore the frame address must be a complete address, not just the
high order bits.
F. Segmentation runs into a physical memory fragmentation problem, since
segments can be of any size.
1. In time, memory can become checkerboarded - e.g.
--------------
| 8 K in use | Suppose a program needs to bring in a
-------------- 4K segment from disk. Since a total
| 2 K free | of 5K of physical memory is available,
-------------- this should be possible. But since
| 6 K in use | the 5K is in two non-contiguous
-------------- portions, this cannot be done.
| 3 K free |
--------------
| rest in use|
2. This is called EXTERNAL FRAGMENTATION.
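The checkerboard situation pictured above can be expressed as a tiny Python sketch. A first-fit search is assumed here purely for illustration; any placement policy fails the same way, because no single free region is large enough.

```python
def first_fit(free_holes, request_kb):
    """Return the index of the first free region large enough, or None."""
    for i, size in enumerate(free_holes):
        if size >= request_kb:
            return i
    return None

free_holes = [2, 3]                 # the two free regions in the picture, in K
print(sum(free_holes), first_fit(free_holes, 4))
# 5 None
```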
3. Some systems solve this problem by using segmentation with paging:
Each segment is composed of 1 or more fixed-size pages. The segment
table now points to a page table for the segment.
a. The virtual address now has three fields:
segment number | page number within segment | offset within page
b. The mapping process is as follows:
i. Check segment number against segment limit. If it is in range,
use it as an index into the segment table.
ii. From the segment table entry, extract the address and size of the
page table for the segment.
iii. Check the page within segment to be sure it is within range. If
so, use it as an index into the segment's page table.
iv. From the page table entry extract control bits plus (if the valid
bit is set) the frame number of the physical address.
Concatenate the frame number and offset within page to form a
physical address.
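The four mapping steps can be sketched in Python as follows. The table layout (a list of per-segment page tables, each a list of valid/frame pairs) and the exception types are assumptions made for the sketch; real hardware would hold table addresses and sizes in segment table entries.

```python
def translate_segmented(segment, page, offset, segment_table, page_bits):
    """segment_table: list of page tables; each page table is a list of
    (valid bit, frame number) entries. Raises if any check fails."""
    if segment >= len(segment_table):            # step i: segment limit check
        raise IndexError("segment number out of range")
    page_table = segment_table[segment]          # step ii: find the page table
    if page >= len(page_table):                  # step iii: page limit check
        raise IndexError("page number out of range for segment")
    valid, frame = page_table[page]              # step iv: extract the entry
    if not valid:
        raise LookupError("page fault")
    return (frame << page_bits) | offset         # concatenate frame and offset

segment_table = [
    [(1, 7), (0, None)],                         # segment 0: two pages
    [(1, 3)],                                    # segment 1: one page
]
print(hex(translate_segmented(1, 0, 0x1F, segment_table, page_bits=9)))
# 0x61f
```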
4. We have now reintroduced internal fragmentation, while getting rid
of external fragmentation. (Internal fragmentation is the more
manageable problem.) At the same time, we have retained the logical
advantages of segmentation for code sharing etc.
Example: The Intel 80386/80486/Pentium chips offer the choice of pure
segmentation or segmentation with paging - as determined by the
setting of a bit in a CPU control register.
In either mode, logical addresses are composed of two parts -
a segment selector, and an offset in segment. The segment
selector is contained in one of 6 CPU segment registers, and the
offset is computed using the addressing mode specified by the
instruction. (The choice of segment register is normally
implicit in the type of reference being done - e.g. instruction
fetches use the CS segment register, references to the stack use
the SS segment register, and ordinary data references use the DS
segment register. The programmer can override the default by
prefixing the instruction with a special segment prefix.)
The segment selector is used to reference a segment descriptor
contained in one of two segment descriptor tables - a global
table shared by all tasks, and a local table for each task. One
bit in the selector specifies which table to use, and 13 bits
are used to select one of 8192 entries in the appropriate table.
The descriptor, in turn, contains a segment base address, a
segment limit (size), and some validity and protection bits.
If the segment is valid and the access is allowed by the
protection, then the offset is compared to the segment
limit; if it is <= the limit then the offset is added to the
segment base to form a "linear address" (i.e. an address within
a linear, unsegmented space.)
At this point, if paging is enabled, the resulting value is
mapped using a two-level page table; otherwise, it is used
directly as the physical address.
G. Paging and both forms of segmentation suffer from an important
overhead problem.
1. Since page/segment tables are stored in memory, each memory access
turns into multiple accesses:
a. With paging: one access to the page table, then one to the data -
two in all.
b. With pure segmentation: one access to the segment table, then one
to the data - two in all.
c. With segmentation with paging: one access to the segment table, one
to the segment's page table, and one to the data - three in all!
(Or on the 80x86 four in all, because the page table has two
levels!)
2. This overhead would, of course, be intolerable. There are two ways
to avoid or reduce it.
a. If page tables can be kept small, then the page table can be kept
in high-speed registers rather than memory.
i. The 80x86 does something like this with the segment table. The
CPU contains 6 segment registers, each of which can be loaded
with a segment number. Most instructions implicitly use one of
these segment registers in forming a logical address - e.g. code
references in branches etc. are always made using the code
segment register (CS). That is, most instructions generate
only the 32 bit offset portion of the address directly.
As a hidden step, whenever a program loads a segment register the
CPU also fetches the corresponding segment descriptor from the
segment table and stores it in a hidden part of the segment
register. Thus, the descriptor is always available when needed.
ii. Keeping an entire table in registers is generally not practical
for virtual memory page tables, which tend to be quite large
because virtual memory is intended to support large address
spaces.
b. An alternative (that is more commonly used) is a set of registers
to store the most recently used virtual -> physical address
translations. These registers are sometimes called a translation
look-aside buffer. They are organized as an approximation to an
associative memory (like a cache). When translating a virtual
address, the hardware first looks to see if the translation is
available in the set of registers. If so, no additional memory
accesses are needed.
Pc -- K.mapping -- Mp
          |
   M.translation look-aside
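A minimal software model of a translation look-aside buffer might look like the following. It is a sketch, not a description of any particular hardware: the class name TLB, the 4096-byte page size, the single-level page table held in a dictionary, and the least-recently-used eviction rule are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # dict: virtual page number -> frame number
        self.entries = OrderedDict()  # cached translations, most recent last
        self.table_accesses = 0       # page-table reads that went to memory

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.entries.move_to_end(vpn)         # hit: no extra memory access
        else:
            self.table_accesses += 1              # miss: read the page table
            self.entries[vpn] = self.page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # drop least recently used
        return self.entries[vpn] * PAGE_SIZE + offset
```

Repeated references to the same page leave table_accesses unchanged after the first one, which is the saving the buffer is meant to provide.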
3. Cache memory can also help, if it is installed between the CPU and the
mapping hardware. Now, only memory references that miss in the cache
are translated at all - typically less than 10% of all references.
H. Page Replacement Schemes
1. In any virtual memory system, we must have a policy for deciding which
page is to be removed from memory to provide a page frame for a newly
requested page when a page-fault occurs. When a process is just
starting up, we normally give it an initial page frame allocation; and
as it needs more room we give it an additional page frame each time it
faults, so that the total allocation for the process grows dynamically.
However, there must be an upper limit set for the process, and there
must be some mechanism for shrinking the allocation to a process
whose requirements have decreased.
2. This is primarily a concern of a course on operating systems. However,
certain schemes require various sorts of hardware assistance,
which we consider now.
3. The optimal policy would be to replace the page whose next reference
is furthest in the future, treating a page that will not be referenced
again at all as having infinite time until its next reference.
a. It can be shown that this policy would minimize page faults.
b. But, except in very rare situations, it is totally impractical.
c. The other policies that are used in practice are approximations
aimed at achieving close to the same effect.
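The optimal policy is easy to simulate when the whole reference string is known in advance, which is exactly what a running system lacks. The sketch below counts faults under that policy; the function name optimal_faults and the representation of the reference string as a list of page numbers are assumptions made for illustration.

```python
def optimal_faults(refs, frames):
    """Count page faults when the victim is always the resident page whose
    next reference lies furthest in the future."""
    resident = set()
    faults = 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) >= frames:
            def next_use(p):
                # Position of the next reference to p, or infinity if none.
                for j in range(i + 1, len(refs)):
                    if refs[j] == p:
                        return j
                return float("inf")
            resident.remove(max(resident, key=next_use))
        resident.add(page)
    return faults
```

Note that next_use looks ahead in refs, so the simulation needs future knowledge; that is why the policy serves as a yardstick for other schemes rather than as something to implement.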
4. All schemes require a per-page "written-in" bit, so that when a page is
removed from memory it is rewritten to disk only if it has been
modified. (In fact, many schemes give priority to removing from memory
pages which have not been modified, so as to reduce disk traffic. If a
modified page is kept around long enough, one of two things will happen:
it will be used again (and hence the write-out and read-in traffic is
saved), or the task using it will complete, and it need not be written
at all.) Normally a copy of this bit is maintained in the TLB. When a
write occurs to a page already in the TLB, if the bit there is set
then no update to the page table need occur; otherwise, the
written-in bit is set both in the TLB and in the page table.
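The handling of the written-in bit can be sketched as follows. The names PageTableEntry, write_to_page and evict are hypothetical, the TLB copy of the bit is modelled as a dictionary keyed by virtual page number, and a list stands in for the disk.

```python
class PageTableEntry:
    def __init__(self, frame):
        self.frame = frame
        self.written_in = False  # set on the first write to the page


def write_to_page(vpn, tlb_written_in, page_table):
    """Record a write to a page whose translation is already in the TLB.
    Returns True if the page table in memory had to be updated."""
    if tlb_written_in.get(vpn, False):
        return False                      # bit already set: nothing to update
    tlb_written_in[vpn] = True            # set the copy in the TLB ...
    page_table[vpn].written_in = True     # ... and the bit in the page table
    return True


def evict(vpn, page_table, disk_writes):
    """Remove a page from memory, writing it out only if it was modified."""
    if page_table[vpn].written_in:
        disk_writes.append(vpn)
    del page_table[vpn]
```

Only the first write to a page costs a page-table update; later writes find the bit already set in the TLB.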
5. Several schemes make use of a per-page "referenced" bit - normally
stored in the page table along with the valid bit and written-in bit.
a. When the page is first brought into memory - and perhaps at other
times as well - the operating system software clears this bit.
b. When any reference is made to the page, the hardware sets the bit
automatically. Thus, the operating system can determine whether any
reference has been made to the page within the interval since it
last cleared the bit.
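One way the operating system might use the referenced bit is sketched below. This is an illustration only: the names Page, touch and end_interval are hypothetical, touch stands in for what the hardware does automatically, and reporting idle pages at the end of each interval is just one possible use of the bit.

```python
class Page:
    def __init__(self, number):
        self.number = number
        self.referenced = False  # cleared by the OS when the page is loaded


def touch(page):
    """Stands in for the hardware, which sets the bit on any reference."""
    page.referenced = True


def end_interval(pages):
    """OS side: list the pages not referenced since the bits were last
    cleared, then clear every bit to start a new interval."""
    idle = [p.number for p in pages if not p.referenced]
    for p in pages:
        p.referenced = False
    return idle
```

Pages reported as idle are natural candidates for replacement, since they have not been used during the most recent interval.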
6. To summarize: hardware support for the operating system's page
replacement scheme includes:
a. Minimally - a per-page "valid" bit and "written-in" bit.
b. In some cases, a per-page "referenced" bit.
Copyright ©2003 - Russell C. Bjork