CS322: Memory Management

Introduction

  1. Multiprogramming (sharing of system resources by 2 or more users)

    We have seen that one of the key concepts in the design of many modern operating systems is multiprogramming: the sharing of the system resources (particularly the CPU) among 2 or more simultaneous users.

  2. A process must be in main memory before it can be run

    In our discussion of scheduling, we noted that it is the task of the short term scheduler to choose a process to receive the use of the CPU whenever the CPU becomes available as a result of an IO operation or interrupt. Of course, before a process can make use of the CPU, its code and data must be resident in main memory.

    1. Earliest systems kept only one process in main memory

      Some of the earliest multiprogramming systems maintained only one process in main memory at a time (along with the monitor, of course.) When a process switch occurred, it was necessary to swap processes wholesale between main memory and backing storage such as drum, disk, or extended core storage. However, this kind of operation is very costly, and leaves the CPU either unused (in the case of DMA devices) or totally taken up with swapping overhead (in the case of extended core.)

    2. Efficient use of CPU requires that more than one process be kept in main memory

      Thus, virtually all multiprogramming systems achieve efficient use of the CPU by keeping more than one process resident in memory. In these systems, the short term scheduler is constrained to choose a process to run from among those currently resident. Swapping may still be used, but in parallel with computation: the CPU is not given to a non-resident process. The choice of processes to be swapped in/out is the province of a medium-term scheduler.

  3. Notice that the sharing of memory by processes is different from the sharing of the CPU and most other resources:

    1. The CPU is mutually exclusive

      A process either has the CPU or it doesn't. When a process has the use of the CPU, it has the entire CPU. No other process has any use of the CPU during this time (apart from interrupt service.)

    2. "Pieces" of memory are mutually exclusive

      A memory-resident process has the use of its memory at all times, whether it is running, ready, or waiting. But it only has a portion of memory; main memory is partitioned among various processes.

  4. In this set of lectures, we approach this topic in two phases:

    1. An exploration of some issues that arose early in the history of multiprogramming.

    2. A consideration of some new hardware approaches (paging and/or segmentation) that address these issues in an elegant, but potentially costly way.

Basic issues

  1. Strategy for allocating memory:

    1. Early systems used blocks of contiguous memory

      Early multiprogramming systems built upon approaches developed in the days of one-user systems. A program on a one user system has available to it a contiguous region of memory, usually extending from the end of the monitor region to the physical end of memory. Early multiprogramming systems (and some in use today) emulated this model by giving each process a contiguous region of memory, bounded on either side by the monitor, another process, and/or physical end of memory. There are two principal contiguous allocation schemes:

      1. Fixed partitions

        memory is divided into several regions of fixed size. Generally, a process runs in a partition that is the best fit to its memory needs, though some internal fragmentation is almost always involved. (A small sketch of this best-fit selection appears at the end of this discussion of fixed partitions.)

        An example is IBM MFT (Multiprogramming with a Fixed number of Tasks) which was part of OS/360. The fixed partitions were set up by the operator in the morning and remained the same for the rest of the day.

        Example: Three jobs of size 2K, 7K, 20K running on a system with fixed partitions of 4K, 8K, and 24K:

        +---------------------------+
        | 2K job                    |
        | /////////////////////     |    4K partition
        +---------------------------+
        | 7K job                    |
        | /////////////////////     |    8K partition
        +---------------------------+
        | 20K job                   |
        |                           |
        | /////////////////////     |   24K partition
        +---------------------------+

        Here "/////////////////////" represents fragmentation: space that is allocated but not used (here 7K out of a total of 36K)

        • Lack of flexibility

          The basic problem with this scheme is its lack of flexibility. For example, in the above situation, a job of size 4K might be in the queue. This could fit in the space not needed by the 20K job; but this would not be possible since partitions are fixed.

        • Maximum job size determined by size of largest partition

          A related problem: the size of the largest partition determines the size of the largest runnable job. This either means that no unusually large jobs can ever run, or else one very large partition has to be created, most of whose space is unused at most times.
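        As a small illustration of the best-fit rule mentioned at the start of this discussion, here is a C sketch that picks the smallest vacant partition able to hold a job. The partition sizes come from the example above; the free_ array and the rest of the code are invented for illustration and do not reflect any particular system.

          /* Hedged sketch: choose the smallest vacant fixed partition that can
             hold a job (best fit).  Sizes are in K, as in the example above. */
          #include <stdio.h>

          #define NPART 3

          int main(void) {
              int size[NPART]  = { 4, 8, 24 };  /* fixed partition sizes          */
              int free_[NPART] = { 1, 1, 1 };   /* 1 = partition currently vacant */
              int job = 7;                      /* incoming job needs 7K          */

              int best = -1;
              for (int i = 0; i < NPART; i++)   /* smallest partition >= job size */
                  if (free_[i] && size[i] >= job && (best < 0 || size[i] < size[best]))
                      best = i;

              if (best >= 0)
                  printf("%dK job -> %dK partition (%dK internal fragmentation)\n",
                         job, size[best], size[best] - job);
              else
                  printf("no partition large enough; the job must wait\n");
              return 0;
          }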

      2. Variable partitions

        each process is given just the memory allocation it needs (or sometimes slightly more). Partitions change location, size, and often even quantity as processes come and go. (Ex: IBM MVT) As processes leave the system, this results in external fragmentation.

        Example: The following processes arrive on a 56K system: 8K, 4K, 12K, 6K, 20K. The 4K and 6K processes terminate, leaving three holes (including the unallocated memory at the end) of size 4K, 6K, and 6K:

        +---------------------------+
        | 8K process                |
        +---------------------------+
        | (4K vacated)              |
        +---------------------------+
        | 12K process               |
        +---------------------------+
        | (6K vacated)              |
        +---------------------------+
        | 20K process               |
        +---------------------------+
        | (6K remaining)            |
        +---------------------------+

        Now, suppose a process arrives needing 8K of memory. Observe that it cannot be fit in, even though the total available storage is enough for two such processes.

        • Basic problem is that over time memory will become fragmented into many small unusable units

        • Problem can be overcome by compaction

        • Partial compaction

          As a block of storage becomes free, a check is made to see if either or both of its neighbors are free. If so, then the freed block is combined with its neighbor(s) to make one large block. This can be done fairly simply - especially in a situation like this, where the total number of blocks is relatively small.

          Ex: In the above, if the 12K process terminates, we would combine its allocation with its 4K and 6K neighbors, to make one block of 22K instead of three blocks of 4, 6, and 12K.
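          A minimal sketch of this coalescing step in C, assuming memory is described by an ordered array of blocks; the Block structure and the function name are invented for illustration:

            /* Hedged sketch of partial compaction (coalescing on free).
               Memory is modeled as an ordered array of blocks; the real data
               structures of any particular system will differ. */
            typedef struct {
                int base;    /* start address (K) */
                int size;    /* length (K)        */
                int in_use;  /* 0 = free hole     */
            } Block;

            /* Free block i, then merge it with a free right and/or left neighbor. */
            void free_and_coalesce(Block blocks[], int *nblocks, int i) {
                blocks[i].in_use = 0;
                if (i + 1 < *nblocks && !blocks[i + 1].in_use) {      /* absorb right */
                    blocks[i].size += blocks[i + 1].size;
                    for (int j = i + 1; j + 1 < *nblocks; j++) blocks[j] = blocks[j + 1];
                    (*nblocks)--;
                }
                if (i > 0 && !blocks[i - 1].in_use) {                 /* absorb left  */
                    blocks[i - 1].size += blocks[i].size;
                    for (int j = i; j + 1 < *nblocks; j++) blocks[j] = blocks[j + 1];
                    (*nblocks)--;
                }
            }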

        • Full compaction

          All existing allocations are moved so as to leave all free space in a single large block, typically at one end of memory.

          Ex: In the above, we could move storage in use down toward the start of memory to produce:

          +---------------------------+
          | 8K process                |
          +---------------------------+
          | 12K process               |
          +---------------------------+
          | 20K process               |
          +---------------------------+
          | (16K free)                |
          +---------------------------+


          But this is a non-trivial task, due to:

          • Processor time involved in doing the movement.

            On many machines, memory would have to be moved word by word, using a loop with as many as four instructions: move the word, increment pointer, test for completion, branch to loop. Some machines have block move instructions that make this faster.
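            In C, the word-by-word movement amounts to a loop like the following sketch. It assumes the destination lies below the source (as when compacting toward the start of memory) and ignores the updating of mapping information that real compaction also requires.

              /* Hedged sketch of moving an allocation one word at a time.  On the
                 machines described above this compiles to roughly the four-step
                 loop mentioned: move a word, increment pointers, test, branch. */
              void move_region(unsigned short *dst, const unsigned short *src, long nwords) {
                  while (nwords-- > 0)
                      *dst++ = *src++;    /* move one word and advance */
              }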

          • Relocation

            More serious is the problem of relocation, to be discussed as a separate issue below. Without some form of hardware relocation, this problem makes compaction a virtual impossibility.

    2. The basic problem with both of these schemes is that they require all of the memory allocated to a process to be contiguous.

      This was unavoidable on older architecture machines; but many newer machines designed for multiprogramming have hardware provisions that facilitate non-physically-contiguous allocation of memory. We will come back to this later under the heading of paging, segmentation, and combinations of segmentation and paging - as well as next time when we discuss virtual memory.

  2. Scheduling strategies: selection of jobs to become memory resident and possibly selection of jobs to be swapped out.

    1. Long-term scheduler may consider memory availability

      As we discussed a while ago, the long-term scheduler (job scheduler) selects processes to be allowed on the system on the basis of various considerations including external priority and projected utilization of system resources. Memory availability may be an issue for the scheduler to consider. The issue varies slightly depending on the allocation scheme in use.

      1. Fixed partitions: associate job queues with memory partitions

        Under fixed partitions, it is common to associate a job queue with each partition (or group of identical-size partitions.) A newly arrived job is placed in the queue corresponding to the smallest partition that can hold it. What happens if there are processes waiting for a smaller partition, and a larger partition becomes vacant, with no jobs waiting in its queue?

        Ex: The above. Suppose the 24K partition is vacant, but there are jobs waiting in the 4K and/or 8K queues.

        • We could run the smaller job in the larger partition.

          True, internal fragmentation would be massive; but the memory wasted would be less than if we left the partition vacant, and CPU utilization and throughput would improve.

        • but... new large jobs wait for smaller jobs

          On the other hand, a newly-arrived job awaiting the larger partition would have to wait for the smaller job to clear it. If the smaller job happens to be long running, this could bottle up the larger job without possibility of relief.

      2. Variable partitions: single job queue

        Under variable partitions, there is normally only one job queue. What if the next job in the queue is too big to fit any available partition, but there are smaller jobs behind it that would fit? Do we jump them over the bigger job? This has obvious advantages; but in a heavily loaded situation it is possible that this could lead to indefinite postponement for the larger job as the smaller jobs keep grabbing space as it becomes available, rather than allowing it to accumulate and be compacted to a sufficient size.

    2. Intermediate scheduler selecting jobs for swapping

      There is also the possibility of an intermediate scheduler selecting jobs to be swapped out of memory to make room for others. This is a somewhat costly operation, particularly for large jobs, so it needs to be done with caution.

      1. Batch-only systems: may not be worthwhile

        On a batch-only system, it may not be worthwhile. The typical candidate for swapping would be a waiting job; but in a batch environment the typical IO wait time is of a magnitude comparable with swap time. However, if there were several compute bound jobs on the system and insufficient IO bound jobs to keep the IO devices busy, it might be better to swap a compute bound job to make room for one or more IO bound jobs. Likewise, if there is no compute bound job available to keep the CPU busy when all other processes are doing IO, it might be good to swap one in.

      2. Time-shared system: swapping may be worthwhile

        On a time-shared system, swapping might be more profitable, since minutes or even hours can pass between periods of interactive IO activity. Further, swapping may be unavoidable if the total memory requirements of all users desiring to access the system exceeds the memory available. (Note that even if a user is not admitted unless there is memory available, memory requirements can increase as a user moves from, say, just executing the command interpreter to a compilation using a large compiler.) However, as a pragmatic concern, swapping can degrade system performance. Often as system load increases, there is a sharply noticeable performance degradation as the load reaches the point where swapping becomes necessary.

      3. Swapping problem: processes waiting for IO

        If swapping is done, one problem that must be faced is how to handle processes waiting for IO. Typically, a process requesting IO has specified a buffer in its own memory space as the location to/from which the transfer is to occur. We don't want to have a swapped out user's IO activity taking place to/from the space allocated to the user swapped in in his place.

        • Use system memory for IO buffers

          One approach is to do all IO activity to/from buffers located in operating system space.

          On a write, the operating system transfers the data from the user's buffer to its own, then it initiates the actual transfer. On a read, when the transfer has been completed (to a system buffer), the operating system transfers the data to the user's buffer.

          Of course, this involves added overhead.

        • Don't swap processes waiting for IO

          Or, swapping of a process awaiting IO may be forbidden. But this means that an interactive process awaiting user command input that may not come for hours would be locked into memory.

        • Retain part of memory containing buffer

          In a paged/segmented system, only the actual page/segment containing the buffer need remain in physical memory.

        • Compromise approach

          A compromise approach may be taken - e.g. VMS: disk IO is done to/from buffers in user space, and so requires that the pages containing these buffers be locked until the IO is complete. Terminal IO is done to/from system buffers, with a copy between the system buffer and the user buffer needed. This allows interactive processes to be completely swapped out during potentially long terminal input waits (or output waits if the user pushes No Scroll.)

      4. Physical location in memory (is it important?)

        Another issue with swapping concerns the location in memory to/from which the program is swapped. Must a program always be swapped back into the same memory location it was swapped out of? This is clearly inconvenient (to say the least); but if not, then the program must be relocated to reflect its new "home".

  3. Relocation

    Any user program contains numerous addresses that refer either to other instructions in the program (e.g. for branches, calls) or to data items. Further, as it runs various pointers to code and data may be created dynamically by such operations as procedure calls (return address => pointer to code; parameters => pointers to data) or dynamic storage allocation as by the Pascal new procedure or the C++ new operator. Obviously, these addresses must reflect in some way the actual location in memory occupied by the program.

    1. Absolute code

      The earliest translators produced absolute code, that had to be located at a specified address in memory. This was no problem on a one-user system; but on an MFT system it means that a program would be constrained to run in the specific partition for which it was translated. Such programs cannot be run on an MVT system without some provision for hardware relocation.

    2. Relocatable code

      More recent translators produce relocatable code that contains within itself the information needed to allow the code to later be modified to run in any location in memory. Sometimes, this information is used by the operating system's loader to tailor the program to the specific memory allocation given to it on a given run. More often, however, this relocation is done by a linkage editor as part of the translation process - so the effect is the same as with absolute translators. (The advantage of relocation comes out in the ability to link together separately compiled programs and library routines, not in the ability to relocate the program at load time.)

    3. Need for hardware support for relocation

      Even if a program can be software relocated at load time, there is still a problem with full compaction or swapping, since either can lead to a program being moved in mid-execution. Since it is not generally feasible to find all the address references in the program (especially those created at run time), some form of hardware relocation is called for.

      1. Address = base address + offset

        Some machines generate all run-time addresses as the sum of a programmer-accessible base register plus an offset provided by the programmer.

        Ex: IBM 360/370 family: each address is specified as the sum of a 12 bit offset plus the contents of one of 15 general registers (R1..R15).

        Ex: Intel 8086/8088: each address is specified in terms of a 16 bit offset plus the contents of one of 4 segment registers dedicated to this purpose alone (CS,DS,SS,ES). (Later members of the family use a similar approach, but can do much more sophisticated address mapping to be discussed later.)

        Relocation could be achieved by modifying the contents of these registers - but only if they are used in a carefully controlled way. In particular, if the programmer stores the value of a base register in a memory location temporarily, with the intention of reloading it later, and if the program is relocated before the register is reloaded, then it will be reloaded with the wrong value since the operating system will not know about the stored value. This temporary storage of pointers is hard to avoid on either of the example machines.

      2. Address = os register + offset

        A better approach is to do all relocation through a separate hardware register accessible only to the operating system. An example of such an approach is given below.

    4. The Macintosh operating system uses an interesting approach to the problem of relocation.

      1. Memory compaction is important on windowed systems

        Windowed systems like the Mac rely heavily on dynamic allocation of memory for windows, dialog boxes, fonts and other resources, as well as regular dynamic allocation needed by the application. Without some scheme that allows for periodic memory compaction, fragmentation could become a severe problem.

      2. The Macintosh OS supports two kinds of memory allocation requests:

        • Non-relocatable

          Requests for a non-relocatable block return a pointer to a block that the OS will never move during compaction. The application may store as many copies of this pointer as it wants to where it wants to. The OS tries to keep all absolute blocks near one end of memory where they won't get in the way of compaction, and applications are encouraged to use non-relocatable blocks only when necessary and to allocate them early in execution.

        • Relocatable

          Most requests are for a relocatable block, which the OS is allowed to move. What is returned in this case is a handle: a pointer to a single master pointer that in turn points to the block. The master pointer is updated whenever the block is moved, and the application references the data through double indirection (**p). (The master pointer itself is in a non-relocatable block, and so never moves; but since master pointers are small, a supply can be allocated at one end of memory at application startup, where they will be out of the way of later compaction activity.)

          Applications can store as many copies of the handle as they want wherever they want, but must be careful about storing a copy of the current value of the master pointer, since this may change. An exception is that an application may do what it wants to with a copy of a master pointer during any portion of program execution in which it is known that no calls to routines that can dynamically allocate memory will be made, since the OS only performs compaction when it needs to in order to satisfy an allocation request.
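          The flavor of handle-based access can be sketched in C. This shows only the double-indirection idea for a single block; the names are invented, and the real Macintosh Memory Manager calls (NewHandle and friends) and their bookkeeping are more involved.

            /* Hedged sketch of a relocatable block accessed through a handle.
               All names are invented; this is not Macintosh Toolbox code. */
            #include <stdlib.h>
            #include <string.h>

            typedef char **Handle;        /* a handle: pointer to a master pointer */

            static char  *master;         /* the single master pointer in this sketch */
            static size_t block_size;

            Handle new_block_sketch(size_t n) {
                block_size = n;
                master = malloc(n);       /* the relocatable block itself          */
                return &master;           /* the application keeps only the handle */
            }

            /* The OS may move the block during compaction; only the master pointer
               changes, so every stored copy of the handle remains valid. */
            void compact_sketch(void) {
                char *moved = malloc(block_size);
                memcpy(moved, master, block_size);
                free(master);
                master = moved;
            }

            /* Application access is through double indirection: (*h)[i]. */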

  4. Memory Protection

    Clearly multiple memory-resident processes must be protected from each other, just as the monitor must be. This calls for a hardware solution.

    1. Fence (base) register

      The notion of a fence register used to protect the monitor on a one job at a time system may be extended to a pair of registers that delimit the space the current user has access to:

      1. Lower and upper limit registers

        One possibility is lower and upper limit registers, with each address generated being compared to both limits to see if it is in range.

      2. Base and bounds (limit) registers

        Or, one can have a base and bounds register pair. Addresses generated by the process are checked to be sure they lie in the range 0..bounds. (Actually, only one check is needed since addresses are usually regarded as unsigned numbers.) Then, the user-generated address is added to the base register to yield a physical address.

        Ex: In the example above, consider the 7K job in the 8K partition - addresses 4096 .. 12287, assuming it is allowed access to the full 8K.

        • With low and high limit registers, the low limit register would contain 4096 and the high limit register 12287. All addresses generated by the user would lie in this range.

        • With base and bounds registers, the base register would contain 4096 and the bounds register 8191. The user process would generate addresses in the range 0..8191 that would be mapped to 4096 .. 12287.

        This latter approach will also solve the relocation problem.
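        In hardware this check and addition happen on every reference, but the effect can be sketched in C using the numbers from the example above (the trap is modeled by returning -1; the names are illustrative):

          /* Hedged sketch of base/bounds address mapping.  The registers are
             modeled as variables; a real MMU does this in hardware. */
          long base = 4096, bounds = 8191;      /* the 8K partition in the example */

          long map_address(long logical) {
              if ((unsigned long)logical > (unsigned long)bounds)
                  return -1;                    /* would raise an addressing trap  */
              return base + logical;            /* maps 0..8191 to 4096..12287     */
          }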

    2. Protection codes

      Another way to protect memory is by the use of protection codes. Memory is divided up into blocks with a protection key for each block (e.g., in IBM MFT each block was 2K and the key size was 4 bits). The PSW (program status word) contained the key for a process, and this key was assigned to each block of memory that the process was using. If a process tried to access memory from a block that had a different key, the OS would cause an illegal access trap.

    3. Memory management unit (MMU)

      What if appropriate registers are not provided in the CPU (e.g. most microprocessors)? Does this mean that multiprogramming such a processor is difficult or impossible? No: relocation can be implemented in the memory system:

      1. Process in CPU sees memory 0 .. limit

        If this is done, then the CPU sees memory as an array of bytes/words numbered 0 .. limit. The mapping unit converts this viewpoint into an actual physical range of addresses.

      2. OS must have access to MMU registers

        Of course, the operating system must have access to the mapping unit's registers. This can be done by treating the mapping unit as an IO device, so that the operating system can read/write its registers like any other device - but only when in kernel mode.

Paged and Segmented Memory Organizations

  1. Motivation

    In our discussion above, we have seen that multiprogramming leads to memory management problems. The most difficult of these is the one we considered first: allocation. The choice between fixed partitions, with internal fragmentation, and variable partitions, with external fragmentation, seems to be a no-win situation.

    1. Physically contiguous memory - do we need it?

      However, this dilemma springs from the requirement that each process must have a memory allocation that is physically contiguous. What if this could be relaxed?

    2. Use hardware memory mapping

      In our discussion of relocation, we introduced the notion of memory mapping: some hardware mechanism for mapping one set of addresses generated by the CPU to a different set of actual physical addresses.

    3. Three different memory mapping schemes

      Historically, this form of memory mapping was soon extended to a more general scheme that provides a better solution to the memory allocation problem. In fact, three different memory mapping schemes have evolved from the basic concept of mapping through a base register:

      1. Pure paging.
      2. Pure segmentation.
      3. Combined schemes.

  2. Pure paging

    1. We've seen how logical addresses can be converted to physical addresses

      In the base and bounds register mapping scheme we discussed, each logical address generated by the CPU is translated into a physical address by adding in the contents of a base register. The CPU sees memory as a contiguous array of elements numbered 0.., which map to a contiguous sequence of physical memory locations.

    2. Addresses composed of page number and offset

      In a paging system, the single base register/bounds register pair is replaced by an array of base registers. Normally, the number of base registers is a power of two.

      The logical address generated by the CPU is divided into two parts: a page number (consisting of some number of the highest order bits), and an offset within the page (the rest of the bits.)

      The page number is used to select one of the base registers to accomplish the address mapping. Thus, the number of bits in the page number portion of the address is related to the number of base registers as follows:

      (Number of registers) = 2^(Number of page number bits)

      Note, too, that the pages are of a fixed size:

      (Page Size) = 2^(Number of offset bits)

      1. Example: DEC PDP-11

        Many members of the DEC PDP-11 family use a mapping scheme based on 8 base/bounds register pairs. For now, we ignore the bounds registers, however, since in a pure paging scheme they are not needed because the page size is fixed. (The mapping scheme used by PDP-11 operating systems is actually a very limited form of segmentation; but the same hardware could just as easily be used for pure paging. We shall develop this possibility first.) Logical addresses generated by the CPU - which are 16 bits long - are treated as a 3 bit page number field and a 13 bit offset field:

        		
                15    13 12                                   0
                ------------------------------------------------
                | Page  |        Offset                        |
                ------------------------------------------------
        

      2. The logical address space is thus divided into a number of pages.

        Example: In the above, each page is 8192 = 2^13 bytes long. (However, it is more common to speak of the page length as 4096 words; a PDP-11 word is 2 bytes.)

      3. Page boundaries are invisible to the programmer

        For example, if a PDP-11 programmer started a ten byte array at address 017774 (octal), then the first four bytes of the array would lie in page 0 and the next six bytes in page 1. However, this would pose no problem; a pointer to the array could increment from 017777 to 020000, crossing the page boundary without difficulty.
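      Putting the two formulas above into code, the following C sketch splits a 16-bit logical address into its page number and offset using the PDP-11 layout just described (the constants and names are chosen for illustration):

        /* Hedged sketch: splitting a logical address into (page, offset),
           using the 3-bit page number / 13-bit offset layout shown above. */
        #include <stdio.h>

        #define OFFSET_BITS 13
        #define PAGE_MASK   ((1u << OFFSET_BITS) - 1)   /* 017777 octal */

        int main(void) {
            unsigned addr   = 017774 + 4;            /* crosses into page 1      */
            unsigned page   = addr >> OFFSET_BITS;   /* 2^3  = 8 possible pages  */
            unsigned offset = addr &  PAGE_MASK;     /* 2^13 = 8192-byte pages   */
            printf("address %o -> page %o, offset %o\n", addr, page, offset);
            return 0;
        }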

    3. Logical address space divided into pages, physical address space divided in to frames

      Just as logical address space is divided into pages, so physical address space is divided into page frames. In a pure paging system, the size of each page frame is exactly the same as the size of one page. Thus, the operating system can allocate any physical page frame to any logical page within a program by setting the base register for that page correctly.

      Example: A PDP-11 with 128K words (256K bytes) of physical memory has 32 4K frames. Suppose that the memory looks like:

      +--------------------------+
      | Operating System         |   Frames 0..6
      +--------------------------+
      | Unused                   |   Frame 7
      +--------------------------+
      | User Job                 |   Frames 8..14
      +--------------------------+
      | Unused                   |   Frame 15
      +--------------------------+
      | User Job                 |   Frames 16..22
      +--------------------------+
      | Unused                   |   Frame 23
      +--------------------------+
      | User Job                 |   Frames 24..30
      +--------------------------+
      | Reserved for IO system   |   Frame 31
      +--------------------------+

      Now suppose a new process arrives, needing 3 pages of memory. Under the previous rule of contiguous allocation, it could not be accommodated, unless we somehow compacted memory. Now, however, we can fit it in by giving it the three free pages (not counting the reserved IO page). When this process is running, the memory mapping registers will be set as follows:

      Logical Page    Logical Range (Octal)    Base Address (Octal)
      ------------    ---------------------    --------------------
      0               000000 .. 017777         160000
      1               020000 .. 037777         360000
      2               040000 .. 057777         560000
      3               060000 .. 077777         invalid
      4               100000 .. 117777         invalid
      5               120000 .. 137777         invalid
      6               140000 .. 157777         invalid
      7               160000 .. 177777         invalid

    4. Invalid pages

      Note that, as in the above example, it may not be the case that a given process needs as many pages as the addressing scheme allows. In such a case, the unused entries in the page table contain a code indicating that an attempt to reference an address in this page is invalid. This is usually done by setting a special bit associated with the base register.

    5. Some observations about pure paging:

      1. Flexibility without external fragmentation problems

        It gives much of the flexibility of MVT without the problems of external fragmentation and the need for compaction. External fragmentation cannot occur, since all page frames are of the same size and thus are 100% interchangeable.

      2. Relocation is never a problem

        even if a page must be moved from one physical location to another. An entire page can be relocated by simply resetting the pointer to it in the page table.

      3. It does not totally eliminate internal fragmentation.

        Since memory is allocated in units of one page, a process will generally receive some excess memory. The average fragmentation is 1/2 page per process.
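        For instance, using the 4096-word PDP-11 page and a made-up process size, the arithmetic of this waste is simply:

          /* Hedged arithmetic sketch: pages allocated and words wasted for a
             process of a given size (sizes in words; 4096-word pages). */
          #include <stdio.h>

          int main(void) {
              long page = 4096, need = 7000;            /* hypothetical 7000-word process */
              long pages = (need + page - 1) / page;    /* round up: 2 pages              */
              printf("%ld pages allocated, %ld words wasted\n",
                     pages, pages * page - need);       /* 2 pages, 1192 words wasted     */
              return 0;
          }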

      4. Page size vs. fragmentation

        To reduce fragmentation, it would seem desirable to use a relatively small page size. Indeed, page sizes of 512 to 2048 bytes are common. But this introduces a new problem: if we halve the page size, we double the number of page registers needed. For example, a PDP-11 using pages of 512 bytes (256 words) would need 128 page registers.

    6. Too many page registers

      In our discussion so far, we have assumed that the page registers are, in fact, hardware registers. When there are only eight of them, this is quite feasible. However, a logical address space of only 64K is quite small. Many paged machines have much larger logical address spaces, and generally use page sizes much smaller than those of the PDP-11. Thus, it is quite possible to need thousands of page registers to support a particular mapping scheme - an impossible situation.

      Example: the mapping scheme of the IBM 370 would allow each process to have up to 4096 pages.

    7. Replace page registers with a page table

      To get around this problem, paging systems commonly replace the idea of an array of page registers with a page table stored in memory. (Each process has its own page table, and a single register in the mapping unit points to the beginning of the in-memory page table for the currently running process.) The page table contains one entry per logical page, and points to the physical frame containing the particular logical page of the process it is mapping, or a code to indicate that that particular page is not mapped.

      1. Problem: doubles number of memory references

        This raises an obvious problem: If the page table is kept in memory, then each memory access required by a program actually leads to two memory references: one to look up the mapping data in the page table, and one to do the requested operation.

      2. Translation look-aside buffer (TLB)

        Since such a situation would slow down the processor by a factor of 2, systems using page tables in memory also incorporate a hardware speed up device called a translation look-aside buffer. This is a small array of high speed registers that contain pairs: (Logical page number, Physical page frame address).

      3. When the processor generates a memory address, its page number is translated to a page frame base address as follows:

        • The look-aside buffer is checked to see if there is an entry corresponding to the logical page number. If so, the corresponding physical page frame address is used.

        • Otherwise, a memory reference is made to the page table, and the page frame address found there is used. This information is also entered in the translation look-aside buffer, displacing a previous entry.

      4. Principle of locality of reference

        Because of the principle of locality of reference, most memory references can be completed by looking up the page number in the look-aside buffer. Some memory references do, however, still require two memory accesses. These can generally be held to less than 10% of all references made.
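      The lookup sequence just described can be sketched in C as follows. The table sizes, the round-robin replacement rule, and all of the names are assumptions made for illustration; a real translation look-aside buffer is an associative hardware search, not a loop.

        /* Hedged sketch of paged address translation with a tiny TLB. */
        #define OFFSET_BITS 13
        #define NPAGES      8
        #define TLB_SIZE    4

        typedef struct { int valid; unsigned page; unsigned long frame_base; } TLBEntry;

        unsigned long page_table[NPAGES];   /* frame base address per logical page */
        TLBEntry      tlb[TLB_SIZE];
        int           next_victim;          /* simple round-robin replacement      */

        unsigned long translate(unsigned addr) {
            unsigned page   = addr >> OFFSET_BITS;
            unsigned offset = addr & ((1u << OFFSET_BITS) - 1);

            for (int i = 0; i < TLB_SIZE; i++)           /* 1. check the TLB          */
                if (tlb[i].valid && tlb[i].page == page)
                    return tlb[i].frame_base + offset;

            unsigned long frame = page_table[page];      /* 2. extra memory reference */
            tlb[next_victim] = (TLBEntry){ 1, page, frame };
            next_victim = (next_victim + 1) % TLB_SIZE;  /* install, displacing one   */
            return frame + offset;
        }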

    8. Summary

      The paging scheme we have discussed makes it possible for different size processes to be allocated and deallocated memory without external fragmentation and compaction problems. It does lead, however, to some internal fragmentation. Hardware requirements consist of a mapping unit that either

      1. Contains a set of base registers - one per logical page.

        or

      2. Contains a pointer to the beginning of an in-memory page table. To avoid undue processor slow downs, such a mapping unit also contains a translation look-aside buffer to store the most recently used page table entries, on the assumption that they will be soon used again.

  3. Pure Segmentation

    1. Pages are of a fixed size

      In the paging scheme we have discussed, pages are of a fixed size, and the division of a process's address space into pages is of little interest to the programmer. The beginning of a new page comes logically just after the end of the previous page.

    2. Segments are of variable sizes

      An alternate approach, called segmentation, divides the process's address space into a number of segments - each of variable size. A logical address is conceived of as containing a segment number and offset within segment. Mapping is done through a segment table, which is like a page table except that each entry must now store both a physical mapping address and a segment length (i.e. a base register and a bounds register) since segment size varies from segment to segment.

    3. No (or little) internal fragmentation, but we now have external fragmentation

      Whereas paging suffers from the problem of internal fragmentation due to the fixed size pages, a segmented scheme can allocate each process exactly the memory it needs (or very close to it - segment sizes are often constrained to be multiples of some small unit such as 16 bytes.) However, the problem of external fragmentation now comes back, since the available spaces between allocated segments may not be of the right sizes to satisfy the needs of an incoming process. Since this is a more difficult problem to cope with, it may seem, at first glance, to make segmentation a less-desirable approach than paging.

    4. Segments can correspond to logical program units

      However, segmentation has one crucial advantage that pure paging does not. Conceptually, a program is composed of a number of logical units: procedures, data structures etc. In a paging scheme, there is no relationship between the page boundaries and the logical structure of a program. In a segmented scheme, each logical unit can be allocated its own segment.

      1. Example with shared segments

        Example: A Pascal program consists of three procedures plus a main program. It uses the standard Pascal IO library for read, write etc. At runtime, a stack is used for procedure activation records. This program might be allocated memory in seven segments:

        • One segment for the main routine.
        • Three segments, one for each procedure.
        • One segment for Pascal library routines.
        • One segment for global data.
        • One segment for the runtime stack.

      2. Several user programs can reference the same segment

        Some of the segments of a program may consist of library code shareable with other users. In this case, several users could simultaneously access the same copy of the code. For example, in the above, the Pascal library could be allocated as a shared segment. In this case, each of the processes using the shared code would contain a pointer to the same physical memory location.

        Segment table          Segment table          Segment table
        user A                 user B                 user C
        -------------------    -------------------    -------------------
        Ptr to private code    Ptr to private code    Ptr to private code
        Ptr to private code    Ptr to shared code     Ptr to private code
        Ptr to shared code     Ptr to private code    Ptr to private code
        Ptr to private code                           Ptr to shared code
        Ptr to private code                           Ptr to private code

        This would not be possible with pure paging, since there is no one-to-one correspondence between page table entries and logical program units.

      3. Protection issues

        Of course, the sharing of code raises protection issues. This is most easily handled by associating with each segment table entry an access control field - perhaps a single bit. If set, this bit might allow a process to read from the segment in question, but not to write to it. If clear, both read and write access might be allowed. Now, segments that correspond to pure code (user written or library) are mapped read only. Data is normally mapped read-write. Shared code is always mapped read only; shared data might be mapped read-write for one process and read only for others.
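        Combining the mapping and the access-control bit just described, a segment-table lookup might be sketched in C as follows (the entry layout and the -1 trap convention are invented for illustration):

          /* Hedged sketch of segmented address translation with a read-only bit. */
          typedef struct {
              unsigned long base;       /* physical base of the segment */
              unsigned long limit;      /* segment length (bounds)      */
              int           read_only;  /* access control bit           */
              int           valid;
          } SegEntry;

          long seg_translate(SegEntry table[], unsigned seg,
                             unsigned long offset, int is_write) {
              if (!table[seg].valid)                 return -1;  /* invalid segment  */
              if (offset >= table[seg].limit)        return -1;  /* bounds violation */
              if (is_write && table[seg].read_only)  return -1;  /* protection trap  */
              return (long)(table[seg].base + offset);
          }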

    5. Related to this is the potential with segmentation for dynamic linking of code.

      1. Programs on contiguous or paged memory systems each have entire code for libraries, etc.

        With conventional memory management (either contiguous or paged), an executable image stored on disk contains all of the routines needed for its execution: both user-written and library. Thus, routines from standard libraries are stored on disk scores, hundreds, or thousands of times. At any given time, several copies of a library routine may be resident in memory, each belonging to a different process. Moreover, if any of these routines needs to be changed, then every program using it must be relinked.

      2. Segments make shared libraries possible (or at least easier to implement)

        In a segmented scheme, it is possible for library routines to not be included in the on-disk copy of an executable image. Instead, the image contains a reference to the library routine name. When the program is loaded, the segment table entries for such library routines can be set up in one of two ways:

        • The operating system can check to see if another process has loaded the library routine in question. If so, the new process's segment table entry can be set to point to the same copy. Otherwise, the routine can be loaded from the library on disk, with the segment table entry set to point to it.

        • The segment table entry can be flagged as invalid, with the intention of not actually loading the routine until the program calls it. When a call does occur, the operating system can find the routine as above. This technique can be helpful because some library routines may only be called under extraordinary circumstances - e.g. to handle an error.

    6. Summary

      1. Segmentation works with variable size segments, thus eliminating internal fragmentation at the expense of external fragmentation.

      2. Segmentation facilitates the sharing of common code by processes.

      3. The hardware required is similar to that needed for paging, except that each segment table entry is a pair: base, bounds. Thus, in general, a segment table entry is about twice as big as a page table entry; but since segments can be very long, the total number of entries may be less.

  4. Combined schemes

    1. Various schemes combining features of both segmentation and paging are possible.

    2. RSTS/E - a hybrid OS for the PDP-11. Seeks to minimize both internal and external fragmentation.

      Example: Consider the PDP-11 scheme we discussed earlier in terms of paging; we noted that the hardware is really oriented toward segmentation, since each page table entry includes both a base and a bounds register. The RSTS/E operating system uses a hybrid scheme that seeks to minimize both internal and external fragmentation:

      1. Page size is 4K words, but memory is allocated in 1K word blocks

        Memory is allocated in units of 1K words (2K bytes). Thus, each allocated page is either 1K, 2K, 3K, or 4K words long. This reduces the otherwise severe internal fragmentation that would result from a page size of 4K words. Typically, all pages allocated to a process, save the last, are of the maximum size.

      2. Still have external fragmentation, but fragments can coalesce into page frames

        Physical memory allocations always start on 1K boundaries. Thus, while memory can become externally fragmented, all fragments will either be big enough to hold an entire page of maximum size, or else will be of size 1K, 2K or 3K. In most cases, if there is sufficient total memory for an incoming process, then the right size pieces can be found; if not, then a mini-compaction involving the movement of only a page or two might do the job. (e.g. if a 4K page is needed, and two 2K free fragments are separated by an active page, then that page can be moved 2K in either direction to compact the two free fragments into one.)

    3. Example: The Multics system on the GE 645 achieved the advantages of segmentation without the problems of external fragmentation by paging the segments.

      1. Each segment was composed of 1 to 64 pages of 1K each. The 34 bit address is now interpreted as follows:

        	33                  16 15     10 9         0
        	---------------------------------------------
        	|    Segment no.      | Page in | Offset in |
        	|                     | segment | page      |
        	---------------------------------------------
        

        The segment number within a logical address is used to select an entry in the segment table associated with the running process. The segment table entry for each segment points to a page table for that segment. The page in segment field is used to select an entry in that page table, which in turn is added to the offset in page to generate a physical address.
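        The two-level lookup can be sketched in C following the 18/6/10-bit split in the diagram. The structure names are invented, and real Multics descriptors contained protection and fault information not shown here.

          /* Hedged sketch of segment-then-page translation, MULTICS style. */
          typedef struct {
              unsigned long *page_table;   /* frame base address per page   */
              unsigned       pages;        /* segment length, 1..64 pages   */
          } SegDesc;

          long multics_translate(SegDesc seg_table[], unsigned long long addr) {
              unsigned seg    = (unsigned)(addr >> 16) & 0x3FFFF;  /* 18-bit segment number  */
              unsigned page   = (unsigned)(addr >> 10) & 0x3F;     /*  6-bit page in segment */
              unsigned offset = (unsigned) addr        & 0x3FF;    /* 10-bit offset in page  */

              SegDesc *d = &seg_table[seg];          /* 1st memory reference (segment table) */
              if (page >= d->pages) return -1;       /* beyond the segment's length          */
              return (long)(d->page_table[page] + offset);  /* 2nd reference (page table);
                                                               the data access is the 3rd    */
          }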

      2. Note that it is now possible for a single memory reference operation in a program to require three memory accesses: one to the segment table, one to a page table, and one to the actual location addressed. However, as with the other schemes, a translation look-aside buffer can minimize this.

      3. Actually, on MULTICS this went a bit further. The segment table could potentially get very large, so it too was paged, making for the possibility of up to four memory accesses to complete a single reference!



These notes were written by R. Bjork of Gordon College. They were edited, revised and converted to HTML by J. Senning of Gordon College in March 1998.