CS122 Lecture: Data Abstraction 1/26/89 - last revised 1/7/99
Objectives:
1. To discuss overall approaches to software design
2. To introduce the notion of abstract data types as a design methodology
3. To show how ADT's are defined in terms of a set of values and a set of
operations.
4. To show how ADT's are implemented by means of a storage structure and a set
of procedures.
5. To show how to use ADT's as a software design tool
6. To show how to realize ADT's in VAX Pascal
Materials: Handout #4, plus demonstration version of example program (in
CS122.HANDOUTS).
I. Introduction
- ------------
A. Which phase of the software life-cycle is the most challenging - in terms
of the knowledge and skill called for to carry it out successfully?
ASK
1. In my opinion, the HIGH-LEVEL (SYSTEM) DESIGN PHASE is the most
challenging phase. (At least, it's the most challenging phase to
teach!)
2. Much of your experience thus far has been with fairly simple programs -
programs that someone can successfully write with a minimum of advance
planning. However, a significant software project typically begins
with a set of software requirements that are much too complex for
anyone to implement directly. The major task accomplished during the
design phase is decomposing the original, large project into smaller
pieces that can be implemented - often going through many levels of
decomposition before arriving at components of manageable size.
3. In our discussion of software engineering, we noted that one of the
factors involved in the software crisis is the rapid growth in the SIZE
of the tasks we expect computers to do. This means that the difficulty
of the design phase keeps increasing.
a. This growth has been made possible, in part, by the rapid rise in
memory capacities - the size of available memory limits what you can
do. In the 1960's, memories for large mainframes were typically
under 100K; today, a personal computer with 32 Meg is considered
"bare minimal"; and larger systems may have 100's or 1000's of
megabytes. (Note: I revised this lecture this year from the version
I used last year, and had to change this number from 16 M to 32 in
just one year's time.)
b. Software development effort is not a linear function of program
size, due to interaction between different parts of the program.
(e.g. if a program consists of 2 modules, there is one interaction
to worry about; if it consists of 3 modules, there are three
interactions; if it consists of 4, there are 6; if it consists of
n, there are n(n-1)/2 interactions to worry about.)
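The counts in this example come from treating every pair of modules as a potential interaction, so n modules give n(n-1)/2 pairs. A minimal Pascal sketch that tabulates this growth (the program and function names are illustrative only, not taken from the course materials):

```pascal
program Interactions(output);
{ Each of the moduleCount modules can interact with the other
  moduleCount - 1, and each pair is counted once. }
var
  n: integer;

function pairCount(moduleCount: integer): integer;
begin
  pairCount := moduleCount * (moduleCount - 1) div 2
end;

begin
  for n := 2 to 10 do
    writeln(n:3, ' modules: ', pairCount(n):3, ' possible interactions')
end.
```

The table grows roughly as the square of the number of modules, which is why doubling the size of a program tends to much more than double the design effort.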
4. One writer has defined Computer Science as "the science of managing
complexity". While this is arguably an incomplete definition, it
actually does have a lot of truth to it. Over the years, computer
scientists have borrowed or developed a number of key principles which
help us to manage the ever-growing complexity of the tasks we are asked
to perform:
a. The principle of MODULARITY - actually borrowed from other
engineering disciplines.
b. The principle of ABSTRACTION - also borrowed in part from other
engineering disciplines.
c. The principle of ENCAPSULATION or INFORMATION HIDING - unique to CS.
B. Most engineering disciplines make use of the principle of MODULARITY as
a key tool both for design and to facilitate maintenance.
1. Example:
a. Suppose one day you press down on the brake pedal on your car, and
it goes all the way to the floor. Obviously, something is wrong!
b. If you know something about automobiles, you realize that a car is
composed of many different component modules. In this case, you are
able to eliminate most of them from consideration; you conclude that
the culprit is almost certainly one of the following:
i. The master cylinder.
ii. The brake lines from the master cylinder to the wheels.
iii. One of the wheel cylinders.
However, you would not suspect low air in the spare tire or a bad
battery!
c. Suppose there is a puddle of brake fluid lying on the ground under
one of the wheels, with evidence of drips coming out of the wheel
cylinder. Now we have a pretty good idea where the problem lies!
At this point, and only at this point, do we begin to worry about
the internal functioning of the wheel cylinder - and we never worry
about the master cylinder or the lines.
2. We have been talking about modularity in the context of isolating the
problem when something goes wrong (debugging); but modularity is
obviously important when doing the initial design, too. For example,
in the design of a new model car:
a. One engineer (or more likely team of engineers) is responsible
for the overall design, including allocation of space for the
major components such as the engine, drive train, etc.
b. The individual subsystems (engine etc.) are each designed by
their own team, separate from the team that is responsible for the
overall design.
c. In fact, it is common for one subsystem design to be used in more
than one model car - perhaps over a period of many years. Thus,
the designers of a new model car may be able to make use of
previously-designed major components without having to literally
"reinvent the wheel".
3. For modularity to help us, it is necessary that the overall system (e.g.
the system that stops the car when we press the brake pedal) be capable
of decomposition into a series of subsystems, each interacting with
the rest of the system in predictable ways. (In our example, it would
make life very complex indeed if low air in the spare tire could
CAUSE a leak from a wheel cylinder!)
a. We speak of these subsystems as MODULES.
b. We speak of these interactions between modules as INTERFACES.
4. In the realm of software, a good modular design exhibits two major
characteristics:
a. HIGH COHESION - all the parts of one module should logically belong
together because they contribute to the performance of a single,
well-defined subtask.
b. LOW COUPLING - the extent to which one module depends on
information about the structure of another, on shared global
variables, etc. should be minimized.
This allows independent programmers or teams to work on each module,
so that the overall task can be completed more quickly.
(But bad design will result in severe problems at system
integration time when all the modules are put together!)
C. The principle of ABSTRACTION is closely related to the principle of
modularity.
1. The basic idea is this: each module in a design should perform a
well defined function that can be understood solely in terms of
its interface, without having to understand the details of its inner
workings.
a. This makes it possible to focus on one portion of the design at a
time, without becoming overwhelmed by a lot of detail about other
portions. When we are working on a module at one level, we think
of lower level modules only in terms of their interfaces (without
worrying about the details of their inner workings).
i. Example: the engineers who are responsible for the overall design
of the car think of it in terms of the abstractions:
Engine and drive train - Fuel goes in, rotation comes out
Steering system - Motion of the steering wheel goes in, change of
direction of the car comes out.
Braking system - Foot pressure goes in, slowing down comes out.
At a high level, they need to ensure that these functions are
present, allocate space for them within the car, etc.; but the
details of designing each subsystem are left to others.
ii. Example: The engineers who design the brake system of a car do
not need to know about the inner workings of the engine;
indeed, their task might become impossible if their
design did depend on the details of other subsystems.
b. During maintenance, abstraction makes it possible to isolate a
problem to a particular module, without needing to have a detailed
understanding of the inner workings of every part of the system.
Example: In the case of a brake system, each of the components is
itself a unit composed of several sub-assemblies
(cylinders, pistons, O-rings, lines, tubing, couplings
etc.) But when trying to solve a brake system problem,
it is not necessary to worry about that - we seek to
isolate the problem to the one offending component that
is not performing its assigned task - i.e. fulfilling
the abstraction specified by its interface.
2. Historically, the concept of abstraction has been approached in three
different ways in Computer Science.
a. The first form of abstraction used was PROCEDURAL ABSTRACTION: A
given task is broken down into subtasks, each of which is realized
by a procedure (or function).
i. You should already be familiar with this - it is the process
of STEPWISE REFINEMENT taught in CS121.
ii. In this form of abstraction, we focus on the VERBS in the
problem statement:
"My goal is to do X"
"To do X, I must do Y and Z"
"To do Y, I must do ..."
b. More recently, it was discovered that the same principles can
be applied to data, leading to DATA ABSTRACTION.
i. This is what we focus on today. We will be learning about the
concept of ABSTRACT DATA TYPES, or ADT's for short.
ii. In data abstraction, we focus on the NOUNS in the problem
statement.
"I need to manipulate objects of type X"
"An X has the following component parts ..."
c. A currently very active subfield within software engineering has
to do with OBJECT-ORIENTED DESIGN, which goes beyond data
abstraction. Here at Gordon, we currently teach OO at the junior
level - though this will change to the freshman year beginning
next year.
3. In any case, a key principle of Computer Science is this: ABSTRACTION
IS THE WAY TO CONTROL COMPLEXITY.
D. The principle of INFORMATION HIDING / ENCAPSULATION builds on the notion
of abstraction:
1. An abstraction (procedure, data type, or object) encapsulates certain
behavior as specified by its interface.
2. An abstraction hides the details of how it provides this behavior
from its clients, so its clients can focus on larger issues.
II. Introduction to Abstract Data Types
-- ------------ -- -------- ---- -----
A. We are familiar with the notion of "data types" from Pascal.
1. Scalar types, such as integer, real, boolean, char.
2. Structured types: arrays, records, sets and files.
3. Programmer-defined types built up from these.
B. What do we mean, though, when we speak of a data type? What is a
data type?
1. ASK
2. Basically, a data type can be characterized by two sets:
a. A set of values.
b. A set of operations.
3. For example, consider the data type integer.
a. The data type integer is characterized by the following sets:
Values: { -maxint .. maxint }
Operations: { +, -, *, div, mod, pred, succ, ord,
comparisons }
b. We can go further, and give an interface specification for
each operation - e.g.
+ input: two integers
output: an integer
precondition: sum of the integers must be within the
range -maxint .. +maxint
postcondition: output is the sum of the two inputs
...
= input: two integers
output: a boolean
preconditions: none
postcondition: output is true iff the two inputs are identical
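The precondition on + can be made concrete. The following is a minimal sketch, assuming only standard Pascal's maxint; the function name canAdd and the test values are hypothetical. It checks whether the precondition holds without ever forming an out-of-range sum:

```pascal
program CheckedAddDemo(output);
{ Illustrative only: spells out the precondition of integer "+". }

function canAdd(a, b: integer): boolean;
{ True iff a + b lies within -maxint .. maxint. The test is arranged
  so that no intermediate value leaves that range. }
begin
  if b >= 0 then
    canAdd := a <= maxint - b
  else
    canAdd := a >= -maxint - b
end;

begin
  writeln(canAdd(2, 3));          { precondition holds }
  writeln(canAdd(maxint, 1));     { precondition violated }
  writeln(canAdd(-maxint, -1))    { precondition violated }
end.
```

A client that calls + only when canAdd is true can rely on the postcondition; what happens otherwise is left open by the specification.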
4. As a more complicated example, consider an array data type. Of course,
we now have not just one data type, but a family of types. A specific
array type is defined by giving its indextype and basetype. Let's
consider a definition for the type:
type demoarray = array[1..10] of real
a. The data type demoarray is characterized by the following two sets:
Values: { sequences of 10 real numbers }
Operations: { [] (selection by subscript)}
b. We can further characterize the operation [] (on objects of type
demoarray) as follows:
[] input: a demoarray and an integer
output: a real
precondition: the integer must lie in the range 1..10
postcondition: result is the specified member of the sequence
C. You will notice that our specifications of the data types integer and
demoarray have been ABSTRACT: we have said how they behave, but not how
this behavior is realized.
1. Of course, someone does have to address this issue. Thus, associated
with every abstraction will be an IMPLEMENTATION.
2. An implementation is specified by giving:
a. A STORAGE STRUCTURE for mapping the values of the type to memory
b. A set of "procedures" for realizing the operations.
3. Sometimes, the implementation of a data type is built into the
hardware of the computer we are using. This is almost always the
case with the data type integer.
a. The typical storage structure for integers is a machine word which
holds the binary encoding of the integer.
b. The "procedures" for realizing the operations on integers are
generally basic machine instructions - e.g. most computers have
hardware instructions to add, subtract, multiply, divide, and
compare integers.
4. In other cases, the implementation of a data type is provided by
the programming language translator. This is the case with array
types.
a. The typical storage structure for an array is a series of adjacent
memory locations, each big enough to hold one element of the array.
b. The subscript operation is realized by using the following formula.
If an array is declared as
array [lo .. hi] of sometype
then the address of element[i] =
address of element[lo] + (i-lo) * size of one item of sometype
(Code to perform this computation is emitted by the Pascal compiler
whenever a subscript is encountered in a program.)
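The address formula can be tried out by hand. This sketch mimics the arithmetic for the demoarray bounds 1..10; baseAddress and itemSize are made-up numbers chosen for illustration, not values any particular compiler is known to use:

```pascal
program SubscriptDemo(output);
{ Mimics the address arithmetic a compiler might emit for a[i]. }
const
  lo = 1;
  hi = 10;
  baseAddress = 1000;   { hypothetical address of element[lo] }
  itemSize = 4;         { hypothetical size of one item, in bytes }
var
  i: integer;

function elementAddress(index: integer): integer;
{ Precondition: lo <= index <= hi }
begin
  elementAddress := baseAddress + (index - lo) * itemSize
end;

begin
  for i := lo to hi do
    writeln('element[', i:2, '] is at address ', elementAddress(i):5)
end.
```

Note that the formula itself does not check the precondition on the subscript; an out-of-range index simply yields an address outside the array.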
5. We will be working with abstract data types of our own invention where
we must furnish both the interface and the implementation.
6. One of the advantages of data abstraction is that the same abstraction
can sometimes be realized in different ways. Different implementations
may be used in different contexts.
a. For example, an alternative implementation for integers is to
store them as character strings, with subroutines being written to
perform the necessary operations. If a given Pascal implementation
chose to do this, the programmer who uses integers might never know
the difference.
i. Each method of implementation has advantages and disadvantages.
ii. For example, the typical binary implementation is compact and
fast, because it uses facilities built into the hardware. But
there is a limit to the range of integers that can be represented
(e.g. -maxint .. maxint).
iii. A character string implementation would allow virtually
unlimited range - e.g. a 100-digit integer should be no problem.
(Since all computers are finite, there is an ultimate limit, of
course).
b. If the user of the abstraction and the implementer of the
abstraction agree on an interface (the specifications of the set
of values and operations), then the implementer is free to change
the implementation without the user of the abstraction having to
change what he has done.
c. However, if the user of an abstraction relies on details of how
the abstraction is implemented on a particular system, then
any change in the implementation may damage the correctness of
a program using it. To prevent this, we use the principle of
INFORMATION HIDING:
i. The user of a data type is permitted to rely on the
specification of the type in terms of its set of values and
set of operations.
ii. The user of a data type is NOT permitted to rely on the details
of how the type is implemented - in fact, he need not know this
(and perhaps should not, to prevent inadvertent reliance on
them.)
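To make the character-string alternative for integers concrete, here is a rough sketch of how addition might be realized under that storage structure. It handles non-negative values only, and DigitString, fromInteger, addDigits and show are hypothetical names invented for this illustration:

```pascal
program DigitStringDemo(output);
{ Sketch of a character-string implementation of non-negative integers. }
const
  maxDigits = 30;
type
  DigitString = packed array [1..maxDigits] of char;
  { decimal digits, right-justified and padded on the left with '0' }
var
  a, b, sum: DigitString;

procedure fromInteger(n: integer; var d: DigitString);
{ Precondition: n >= 0 }
var
  k: integer;
begin
  for k := maxDigits downto 1 do
  begin
    d[k] := chr(ord('0') + n mod 10);
    n := n div 10
  end
end;

procedure addDigits(a, b: DigitString; var sum: DigitString);
{ Schoolbook addition, rightmost digit first.
  Precondition: the true sum has at most maxDigits digits. }
var
  k, carry, total: integer;
begin
  carry := 0;
  for k := maxDigits downto 1 do
  begin
    total := (ord(a[k]) - ord('0')) + (ord(b[k]) - ord('0')) + carry;
    sum[k] := chr(ord('0') + total mod 10);
    carry := total div 10
  end
end;

procedure show(d: DigitString);
{ Prints d without leading zeros (but always at least one digit). }
var
  k, first: integer;
begin
  first := 1;
  while (first < maxDigits) and (d[first] = '0') do
    first := first + 1;
  for k := first to maxDigits do
    write(d[k]);
  writeln
end;

begin
  fromInteger(maxint, a);
  fromInteger(maxint, b);
  addDigits(a, b, sum);   { 2 * maxint is out of range for type integer }
  show(sum)
end.
```

A client that only calls operations like these never sees whether digits or machine words lie underneath, which is exactly what lets the implementer switch between the two.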
D. We are now ready to define the notion of an "ABSTRACT DATA TYPE".
1. An abstract data type is defined by a SPECIFICATION that describes its
values and the operations permitted on objects of that type.
2. An abstract data type is implemented by appropriate programming
language constructs such as type declarations, procedures etc -
in such a way as to HIDE implementation details from the clients using
the type.
3. Some newer programming languages have had as one of their design
goals the provision of support for specifying abstract data types in
such a way as to hide the details of their implementation.
a. Standard Pascal is weak on this count; but, as we shall see later,
VAX Pascal incorporates some extensions that help quite a bit.
b. Languages with stronger support for ADT's include Modula-2
(Niklaus Wirth's successor to Pascal), C++ and Ada.
c. Object-oriented languages such as C++ and Java also incorporate
strong support for data abstraction.
III. Using ADT's to Design Software
--- ----- ----- -- ------ --------
A. As we indicated earlier, data abstraction can be used along with
procedural abstraction in designing software (both high-level and
low-level.) Basically, we proceed by a process of stepwise refinement.
1. We begin by asking the question "What are the major abstract data
types needed to model the reality we are working with?".
2. We then develop a specification for each type, by spelling out
the kind of values it can assume, and by defining each of the
needed operations.
3. Often, the new type will need to be defined in terms of a mixture
of standard data types (e.g. integers) and additional new abstract
types. Each of these new abstract types then must itself be
refined.
B. Sometimes, we design abstract data types in something of
a bottom-up, rather than top-down, fashion.
1. We essentially create a set of utility operations which we believe
to be needed to realize the higher-level operations uncovered in the
top down design.
2. This leads to a "sandwich" approach to design - we start from the
functional specifications and work down, and we start from the
reality to be modelled and work up, with the part in the middle being
done last.
C. An example
1. Consider the process of developing software for managing a video
rental store. What are the major entities that need to be
represented?
ASK
a. Customers
b. Videos
c. Late charges
etc.
2. Thinking about late charges for a moment, what information must be
recorded about a given late charge?
ASK
a. Customer owing the charge
b. Video that the charge is assessed for
c. Date video was due to be returned
d. Date video was actually returned
e. Amount of the charge.
etc.
The set of values of the abstract data type "Late charge" would be
tuples containing these various pieces of information.
Values = { (customer, video, date due, date returned, amount) }
3. What operations need to be performed on the data objects that represent
late charges?
ASK
a. Creating a new object (e.g. when a video is returned late)
b. Printing out information about it (if the customer wants to know
why he has been charged).
c. Summarizing the late charges owed by a given customer (total
amount due).
d. Recording that a charge has been paid.
etc.
Operations = { create, print, summarize by customer, pay }
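One possible rendering of this ADT in Pascal is sketched below. It is not the course's handout code: CustomerId and VideoId are stand-ins for the Customer and Video ADT's, the paid field is an added assumption so that "pay" has something to record, and the fee rate and all identifiers are invented for illustration.

```pascal
program LateChargeDemo(output);
{ Sketch of the "Late charge" ADT: a record for the values, plus one
  routine per operation. }
const
  centsPerDayLate = 100;      { hypothetical late fee }
  maxCharges = 100;
type
  Date = integer;             { assumed: days since some base date }
  CustomerId = integer;
  VideoId = integer;
  LateCharge = record
    customer: CustomerId;
    video: VideoId;
    dateDue, dateReturned: Date;
    amount: integer;          { in cents }
    paid: boolean
  end;
  ChargeList = array [1..maxCharges] of LateCharge;
var
  charges: ChargeList;

function subtractDates(startDate, finishDate: Date): integer;
begin
  subtractDates := finishDate - startDate
end;

procedure createCharge(c: CustomerId; v: VideoId; due, returned: Date;
                       var charge: LateCharge);
{ Precondition: returned is later than due }
begin
  charge.customer := c;
  charge.video := v;
  charge.dateDue := due;
  charge.dateReturned := returned;
  charge.amount := subtractDates(due, returned) * centsPerDayLate;
  charge.paid := false
end;

procedure payCharge(var charge: LateCharge);
begin
  charge.paid := true
end;

procedure printCharge(charge: LateCharge);
begin
  writeln('Customer ', charge.customer:1, ', video ', charge.video:1, ': ',
          subtractDates(charge.dateDue, charge.dateReturned):1,
          ' day(s) late, ', charge.amount:1, ' cents')
end;

function totalOwed(var list: ChargeList; count: integer;
                   c: CustomerId): integer;
{ Sums the unpaid charges for customer c among list[1..count]. }
var
  k, total: integer;
begin
  total := 0;
  for k := 1 to count do
    if (list[k].customer = c) and not list[k].paid then
      total := total + list[k].amount;
  totalOwed := total
end;

begin
  createCharge(17, 4, 100, 103, charges[1]);
  createCharge(17, 9, 100, 101, charges[2]);
  payCharge(charges[2]);
  printCharge(charges[1]);
  writeln('Customer 17 owes ', totalOwed(charges, 2, 17):1, ' cents')
end.
```

Note that createCharge computes the number of days late through an operation on dates rather than by subtracting the fields directly, so the late charge code does not depend on how dates are stored.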
D. Another example
1. Two of the pieces of information we need to record about a late
charge are calendar dates: the date the video was due, and the date
it was actually returned.
a. Actually, calendar dates are used by many, many applications.
b. However, most programming languages do not provide a built-in
data type for representing a date.
c. Thus, developing a date as an ADT can be quite useful for many
situations.
2. What is the set of values for a data object representing a calendar
date?
ASK
a. The problem is a bit more complex than it appears at first, because
practical considerations typically dictate that we actually limit
ourselves to some subrange of all possible dates. (Finite storage
always dictates this - compare the limitation of integers to
-maxint .. + maxint.)
b. As we all know, one choice that has often been made in practice
has been to limit the range of dates to one century - implicit in
the common decision to use a two-digit integer to represent the
year. As 2000 draws near, this choice looks very bad!
c. But note that even a choice involving using 4 digits for a year
still has its limitations (though we probably won't have to worry
about the "Y10K" problem becoming an issue for most software we
develop!)
d. There is also the issue of the EARLIEST date we allow ourselves
to represent. For example, if we are developing a date data type
for use by historians, we will have to allow for representing
dates both BC and AD.
e. The actual range of values needed will obviously be dictated by
the problem we are trying to solve. For our example, it would
probably be satisfactory to be able to represent dates running from
the present time until the end of the next century, though being
able to represent a larger range may end up happening as a natural
result of the representation we choose.
e.g.
Values = { x | x is a calendar date & today <= x < 1-jan-2100 }
3. What are the operations we need to be able to perform on a date?
ASK
a. We need some mechanism to convert back and forth between the
way we represent dates internally and a suitable printed form.
In the example code I will be distributing, I will call these
stringToDate and dateToString.
b. We need some mechanism for getting at today's date.
In the example code I will be distributing, I will call this
operation todaysDate.
c. We need the ability to compare two dates - e.g. to determine
whether a video is being returned early, on time, or late.
In the example code I will be distributing, this will be handled
by using one of the other operations we will be defining.
d. We need to be able to perform certain kinds of arithmetic on
dates.
i. Add an integer to a date (needed to calculate a date due).
ii. Subtract two dates, producing an integer (needed to calculate
number of days a video is late.)
In the example code I will be distributing, I will call these
addToDate and subtractDates.
iii. As an aside, note that the arithmetic operations on dates
involve operands/results of type integer as well as dates,
and that the set of meaningful operations is quite limited -
e.g. the following operations would be absurd:
- Adding two dates together
- Multiplying two dates, or a date by an integer
- Dividing two dates, or a date by an integer
e. To fully specify an operation on an ADT, we give its name, the
type and meaning of its parameters and result, and its preconditions
and postconditions. We will do this for the two arithmetic
operations as an example:
i. addToDate (startDate, interval) returns Date
Preconditions: startDate is a valid Date
interval is an integer
Result of operation must fall within range of
representable dates.
Postconditions: Result is a Date the specified number of days
before or after startDate (after if interval is
positive; before if interval is negative)
ii. subtractDates (startDate, finishDate) returns integer
Preconditions: startDate and finishDate are valid Dates
Postconditions: Result is number of days between these dates -
positive if finishDate is after startDate,
negative if before; zero if the same.
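As a sketch of how these two specifications might be met, assume for the moment that a Date is stored as a count of days since an arbitrary base date. The bounds firstDay and lastDay are hypothetical, and the handout's actual code may well differ:

```pascal
program DateArithmeticDemo(output);
{ Sketch of addToDate and subtractDates over a day-count representation. }
const
  firstDay = 0;       { hypothetical earliest representable date }
  lastDay = 30000;    { hypothetical latest representable date }
type
  Date = firstDay..lastDay;
var
  rented, due, returned: Date;

function addToDate(startDate: Date; interval: integer): Date;
{ Precondition: startDate + interval lies in firstDay..lastDay.
  Postcondition: result is interval days after startDate
  (before it, if interval is negative). }
begin
  addToDate := startDate + interval
end;

function subtractDates(startDate, finishDate: Date): integer;
{ Postcondition: result is positive if finishDate is after startDate,
  negative if before, zero if the same. }
begin
  subtractDates := finishDate - startDate
end;

begin
  rented := 100;
  due := addToDate(rented, 3);
  returned := addToDate(due, 2);
  writeln('Days late: ', subtractDates(due, returned):1)
end.
```

The function bodies are one-liners only because of the representation assumed here; the pre- and postconditions would stay the same under any other storage structure.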
4. The final step in developing an ADT is producing an actual
implementation - which entails:
a. Developing a storage structure - a way of representing the values
of the ADT using either built in types or other, simpler ADT's.
b. Writing functions and/or procedures to implement the various
operations, using the chosen storage structure.
c. It is frequently the case, at this point, that we need to examine
a variety of possible storage structures that might be used -
choosing the one that lets us implement the desired operations
most easily and efficiently.
Example: What storage structures might be used to represent the
ADT Date?
ASK
i. Character string
- stringToDate and dateToString operations are trivial
- arithmetic and comparison operations are extremely complex
ii. Record with fields for month, day, and year
- stringToDate and dateToString are more complex, but not
horrible
- arithmetic and comparison are complex, but doable
Subordinate question: how to represent the individual fields?
iii. Integer representing the number of days since some base date
- This one is the most widely used - e.g. most operating systems
use something like this for their internal representation.
VMS: number of 100-nanosecond "ticks" since the Smithsonian
base date - November 17, 1858.
Unix: number of seconds since January 1, 1970
(Will eventually lead to a problem when # of time units exceeds
range of integer representation used - e.g. VMS systems, with a
signed 64-bit tick count, will face a "31086 problem" and Unix
systems, with a signed 32-bit count of seconds, will face a "2038
problem".)
- stringToDate and dateToString are complex, but most systems
provide library routines for handling this for the standard
internal format. (Conversion from "ticks" or seconds since
base date to days since base date is a trivial matter of
dividing by "ticks" or seconds per day.)
- Arithmetic and comparison operations are trivial.
- Would be a poor choice for a system that has to represent
dates in the past as well as the future.
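The difference in difficulty shows up even in the simplest operation, deciding whether one date comes before another. The sketch below contrasts the record and day-count representations; RecordDate, DayNumber and both function names are hypothetical:

```pascal
program CompareDemo(output);
{ Contrasts "is a before b?" under two candidate storage structures. }
type
  RecordDate = record
    year, month, day: integer
  end;
  DayNumber = integer;   { days since some base date }
var
  r1, r2: RecordDate;

function recordBefore(a, b: RecordDate): boolean;
{ Compare the most significant field first, moving on to the next
  field only on a tie. }
begin
  if a.year <> b.year then
    recordBefore := a.year < b.year
  else if a.month <> b.month then
    recordBefore := a.month < b.month
  else
    recordBefore := a.day < b.day
end;

function dayNumberBefore(a, b: DayNumber): boolean;
begin
  dayNumberBefore := a < b
end;

begin
  r1.year := 1999; r1.month := 1; r1.day := 7;
  r2.year := 1999; r2.month := 1; r2.day := 26;
  writeln(recordBefore(r1, r2));
  writeln(dayNumberBefore(100, 119))
end.
```

Arithmetic on the record form is harder still, since it must know the length of each month and the leap-year rule, whereas the day-count form needs only integer addition and subtraction.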
d. Because different storage structures have different strengths and
weaknesses, we build a "wall of abstraction" between the
SPECIFICATION of the ADT and the IMPLEMENTATION of the ADT. The
specification should be independent of the internal representation
used, and users of the ADT should depend only on the specification,
not the implementation. (Principle of INFORMATION HIDING). This
would allow the implementation to be changed at some future time
without affecting the code that uses the ADT.
(In the case of dates, some change of representation is needed on
many systems as 2000 or some other critical year approaches. In
fact, the current Y2K problem would be MUCH easier to solve if
good data abstraction techniques had been consistently in use in
software design).
IV. Realizing ADT's in VAX Pascal
-- --------- ----- -- --- ------
A. In keeping with the principle of information hiding, it is desirable
that each abstract data type used in a program be implemented as a
module of its own, containing the definition of the storage structure
for the type plus the operations on it. This module should, in turn,
be composed of two parts:
1. A specification part, containing the information needed by other
parts of the program that use the ADT.
2. An implementation part, whose details should not be visible to the
rest of the program.
B. An approximation to this can be achieved in VAX Pascal by using
modules. (These are a major extension to standard Pascal.) (Note: this
is not the only reason why this facility of VAX Pascal exists.)
DISTRIBUTE, GO OVER HANDOUT 4. DEMO COMPILATION OF AND RUNNING PROGRAM
1. A VAX Pascal program may consist of a main program and any number of
modules. Each module resides in a separate file, and is compiled
separately - as is the main program. The main program and the other
modules are then joined together by the LINK command.
2. A main program begins with a program heading, and a module begins
with a module heading - e.g. the module for the ADT Date begins:
module Dates;
Note: if the module uses any files, they should appear in the module
heading, just like files used in the main program appear in the
program heading.
3. The bulk of the module consists of declarations of constants, types,
variables and/or procedures and functions.
4. The module terminates with an end statement and a period. There is
no begin associated with this end, since nothing in the module is
executed until it is called by the main program.
C. To make the declarations in the module accessible to other modules or
the main program, the module heading can be preceded with an environment
directive.
1. This causes the compiler to create an environment file containing an
internalized representation of the declarations in the module.
2. Users of the module can begin with an inherit directive to access
this environment file.
D. In keeping with the principle of data abstraction, a module can be
physically divided into two parts - the specification part, and the
implementation part.
1. Declarations which users of the ADT need should be placed in the
specification part, and others in the implementation part.
2. In the case of procedures and functions, the forward declaration
can be used to separate the procedure's heading (specification)
from its implementation.
E. The handout example illustrates one limitation of the module facility of
VAX Pascal for data abstraction.
1. We had to declare the type Date in the visible part; thus, modules
that use Date "know" that dates are actually represented by
integers. (In fact, we would use some convention that a Date
is represented as the number of days that have elapsed since some
arbitrarily chosen "Day 0". The actual choice of base date, of
course, remains hidden - e.g. the VMS operating system uses
November 17, 1858 for this purpose.)
2. The problem with this is that someone might be tempted to bypass
the use of addToDate and subtractDates by just doing ordinary
integer addition and subtraction on dates. This will work fine -
until the implementation of dates is later changed to, say,
use a record with three fields for month, day, and year; or to
use an integer representing seconds since the base date instead of
days.
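The danger can be demonstrated in a few lines. In this sketch, Date is visibly an integer and the implementer has changed the unit from days to something finer; hours stand in for seconds here purely to keep the numbers small, and unitsPerDay and the variable names are hypothetical:

```pascal
program BypassDemo(output);
{ Shows client code that bypasses the ADT breaking after a change of
  representation, while code that uses the operations keeps working. }
const
  unitsPerDay = 24;     { the new representation counts hours, not days }
type
  Date = integer;
var
  rented, dueViaAdt, dueViaBypass: Date;

function addToDate(startDate: Date; interval: integer): Date;
begin
  addToDate := startDate + interval * unitsPerDay
end;

function subtractDates(startDate, finishDate: Date): integer;
begin
  subtractDates := (finishDate - startDate) div unitsPerDay
end;

begin
  rented := 100 * unitsPerDay;
  dueViaAdt := addToDate(rented, 3);      { still means "3 days later" }
  dueViaBypass := rented + 3;             { now means "3 hours later" }
  writeln('Via addToDate: ', subtractDates(rented, dueViaAdt):1, ' days');
  writeln('Via bypass:    ', subtractDates(rented, dueViaBypass):1, ' days')
end.
```

The bypassing line still compiles, because Date is just an integer as far as the compiler is concerned; nothing warns the client that its meaning has changed.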
3. Newer data abstraction and OO languages make it possible to declare
"private" data types. With a private type, the user of the ADT
knows that a type of a given name exists, but does not know how
it is declared.
Copyright ©1999 - Russell C. Bjork