Data Abstraction

CS122 Lecture: Data Abstraction         1/26/89 - last revised 1/7/99

Objectives:

1. To discuss overall approaches to software design
2. To introduce the notion of abstract data types as a design methodology
3. To show how ADT's are defined in terms of a set of values and a set of
   operations.
4. To show how ADT's are implemented by means of a storage structure and a set
   of procedures.
5. To show how to use ADT's as a software design tool
6. To show how to realize ADT's in VAX Pascal

Materials: Handout #4, plus demonstration version of example program (in
           CS122.HANDOUTS).

I. Introduction
-  ------------

   A. Which phase of the software life-cycle is the most challenging - in terms
      of the knowledge and skill called for to carry it out successfully?

      ASK

      1. In my opinion, the HIGH-LEVEL (SYSTEM) DESIGN PHASE is the most 
         challenging phase.  (At least, it's the most challenging phase to 
         teach!)

      2. Much of your experience thus far has been with fairly simple programs -
         programs that someone can successfully write with a minimum of advance
         planning.  However, a significant software project typically begins 
         with a set of software requirements that are much too complex for 
         anyone to implement directly.  The major task accomplished during the
         design phase is decomposing the original, large project into smaller 
         pieces that can be implemented - often going through many levels of 
         decomposition before arriving at components of manageable size.
      
      3. In our discussion of software engineering, we noted that one of the 
         factors involved in the software crisis is the rapid growth in the SIZE
         of the tasks we expect computers to do.  This means that the difficulty
         of the design phase keeps increasing.

         a. This challenge has been made possible, in part, by the rapid rise in
            memory capacities - the size of available memory limits what you can
            do.  In the 1960's, memories for large mainframes were typically 
            under 100K; today, a personal computer with 32 Meg is considered 
            "bare minimal"; and larger systems may have 100's or 1000's of 
            megabytes.  (Note: I revised this lecture this year from the version
            I used last year, and had to change this number from 16 M to 32 in
            just one year's time.)

         b. Software development effort is not a linear function of program
            size, due to interaction between different parts of the program.
            (e.g. if a program consists of 2 modules, there is one interaction
            to worry about; if it consists of 3 modules, there are three
            interactions; if it consists of 4, there are 6; if it consists of
            n, there are n(n-1)/2 interactions to worry about.)
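
            The count of interactions is just the number of distinct pairs
            of modules, n(n-1)/2.  A minimal sketch that tabulates it,
            assuming the worst case in which every pair of modules may
            interact (the program and function names are made up for
            illustration):

```pascal
program PairCount(output);
{ Tabulates the number of potential pairwise interactions among n modules,
  assuming the worst case: every pair of modules may interact. }

var
  n: integer;

function interactions(n: integer): integer;
begin
  { one interaction per unordered pair of modules }
  interactions := n * (n - 1) div 2
end;

begin
  n := 2;
  while n <= 64 do
  begin
    writeln(n:4, ' modules: ', interactions(n):5, ' possible interactions');
    n := n * 2
  end
end.
```

            Doubling the number of modules roughly quadruples the number of
            possible interactions, which is why effort grows faster than
            program size.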

      4. One writer has defined Computer Science as "the science of managing
         complexity".  While this is arguably an incomplete definition, it 
         actually does have a lot of truth to it.  Over the years, computer
         scientists have borrowed or developed a number of key principles which 
         help us to manage the ever-growing complexity of the tasks we are asked
         to perform:

         a. The principle of MODULARITY - actually borrowed from other 
            engineering disciplines.

         b. The principle of ABSTRACTION - also borrowed in part from other
            engineering disciplines.

         c. The principle of ENCAPSULATION or INFORMATION HIDING - unique to CS.

   B. Most engineering disciplines make use of the principle of MODULARITY as
      a key tool both for design and to facilitate maintenance.

      1. Example: 

         a. Suppose one day you press down on the brake pedal on your car, and
            it goes all the way to the floor.  Obviously, something is wrong!

         b. If you know something about automobiles, you realize that a car is
            composed of many different component modules.  In this case, you are
            able to eliminate most of them from consideration; you conclude that
            the culprit is almost certainly one of the following:

            i. The master cylinder.
           ii. The brake lines from the master cylinder to the wheels.
          iii. One of the wheel cylinders.

            However, you would not suspect low air in the spare tire or a bad
            battery!

         c. Suppose there is a puddle of brake fluid lying on the ground under
            one of the wheels, with evidence of drips coming out of the wheel
            cylinder.  Now we have a pretty good idea where the problem lies!
            At this point, and only at this point, do we begin to worry about
            the internal functioning of the wheel cylinder - and we never worry 
            about the master cylinder or the lines.

      2. We have been talking about modularity in the context of isolating the 
         problem when something goes wrong (debugging); but modularity is 
         obviously important when doing the initial design, too.  For example,
         in the design of a new model car:

         a. One engineer (or more likely team of engineers) is responsible
            for the overall design, including allocation of space for the
            major components such as the engine, drive train, etc.

         b. The individual subsystems (engine etc.) are each designed by
            their own team, separate from the team that is responsible for the
            overall design.  

         c. In fact, it is common for one subsystem design to be used in more 
            than one model car - perhaps over a period of many years.  Thus,
            the designers of a new model car may be able to make use of
            previously-designed major components without having to literally
            "reinvent the wheel".

      3. For modularity to help us, it is necessary that the overall system
         (e.g. the system that stops the car when we press the brake pedal) be
         capable of decomposition into a series of subsystems, each interacting
         with the rest of the system in predictable ways.  (In our example, it
         would make life very complex indeed if low air in the spare tire could
         CAUSE a leak from a wheel cylinder!)

         a. We speak of these subsystems as MODULES.

         b. We speak of these interactions between modules as INTERFACES.

      4. In the realm of software, a good modular design exhibits two major
         characteristics:

         a. HIGH COHESION - all the parts of one module should logically belong 
            together because they contribute to the performance of a single,
            well-defined subtask.

         b. LOW COUPLING - the extent to which one module depends on 
            information about the structure of another, shared global variables,
            etc. should be minimized.

         This allows independent programmers or teams to work on each module, 
         so that the overall task can be completed more quickly.
               
         (But bad design will result in severe problems at system
          integration time when all the modules are put together!)

   C. The principle of ABSTRACTION is closely related to the principle of
      modularity.

      1. The basic idea is this: each module in a design should perform a
         well defined function that can be understood solely in terms of
         its interface, without having to understand the details of its inner 
         workings. 

         a. This makes it possible to focus on one portion of the design at a
            time, without becoming overwhelmed by a lot of detail about other
            portions.  When we are working on a module at one level, we think 
            of lower level modules only in terms of their interfaces (without 
            worrying about the details of their inner workings).  

            i. Example: the engineers who are responsible for the overall design
               of the car think of it in terms of the abstractions:

               Engine and drive train - Fuel goes in, rotation comes out
               Steering system - Motion of the steering wheel goes in, change of
                                 direction of the car comes out.
               Braking system - Foot pressure goes in, slowing down comes out.

               At a high level, they need to ensure that these functions are 
               present, allocate space for them within the car, etc.; but the 
               details of designing each subsystem are left to others.

           ii. Example: The engineers who design the brake system of a car do 
               not need to know about the inner workings of the engine;
               indeed, their task might become impossible if their 
               design did depend on the details of other subsystems.

         b. During maintenance, abstraction makes it possible to isolate a 
            problem to a particular module, without needing to have a detailed
            understanding of the inner workings of every part of the system.

            Example: In the case of a brake system, each of the components is 
                     itself a unit composed of several sub-assemblies
                     (cylinders, pistons, O-rings, lines, tubing, couplings 
                     etc.)  But when trying to solve a brake system problem,
                     it is not necessary to worry about that - we seek to 
                     isolate the problem to the one offending component that
                     is not performing its assigned task - i.e. fulfilling
                     the abstraction specified by its interface.

      2. Historically, the concept of abstraction has been approached in three
         different ways in Computer Science.

         a. The first form of abstraction used was PROCEDURAL ABSTRACTION:  A
            given task is broken down into subtasks, each of which is realized
            by a procedure (or function).  

            i. You should already be familiar with this - it is the process
               of STEPWISE REFINEMENT taught in CS121.

           ii. In this form of abstraction, we focus on the VERBS in the
               problem statement:

               "My goal is to do X"
               "To do X, I must do Y and Z"
               "To do Y, I must do ..."

         b. More recently, it was discovered that the same principles can
            be applied to data, leading to DATA ABSTRACTION.

            i. This is what we focus on today.  We will be learning about the 
               concept of ABSTRACT DATA TYPES, or ADT's for short.

           ii. In data abstraction, we focus on the NOUNS in the problem
               statement.

               "I need to manipulate objects of type X"
               "An X has the following component parts ..."

         c. A currently very active subfield within software engineering has
            to do with OBJECT-ORIENTED DESIGN, which goes beyond data 
            abstraction.  Here at Gordon, we currently teach OO at the junior
            level - though this will change to the freshman year beginning
            next year.

      3. In any case, a key principle of Computer Science is this: ABSTRACTION 
         IS THE WAY TO CONTROL COMPLEXITY.

   D. The principle of INFORMATION HIDING / ENCAPSULATION builds on the notion
      of abstraction:

      1. An abstraction (procedure, data type, or object) encapsulates certain
         behavior as specified by its interface.

      2. An abstraction hides the details of how it provides this behavior
         from its clients, so its clients can focus on larger issues.

II. Introduction to Abstract Data Types
--  ------------ -- -------- ---- -----

   A. We are familiar with the notion of "data types" from Pascal.

      1. Scalar types, such as integer, real, boolean, char.

      2. Structured types: arrays, records, sets and files.

      3. Programmer-defined types built up from these.

   B. What do we mean, though, when we speak of a data type?  What is a
      data type?

      1. ASK

      2. Basically, a data type can be characterized by two sets:

         a. A set of values.

         b. A set of operations.

      3. For example, consider the data type integer.

         a. The data type integer is characterized by the following sets:

            Values:     { -maxint .. maxint }

            Operations: { +, -, *, div, mod, pred, succ, ord,
                          comparisons }

         b. We can go further, and give an interface specification for
            each operation - e.g.

            +   input:          two integers
                output:         an integer

                precondition:   sum of the integers must be within the
                                 range -maxint .. +maxint
        
                postcondition:  output is the sum of the two inputs

            ...

            =   input:          two integers
                output:         a boolean

                preconditions:  none

                postcondition:  output is true iff the two inputs are identical
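
            A precondition like the one given for + can be made explicit in
            code.  The following is only a sketch: canAdd is a hypothetical
            helper, not part of Pascal, and it assumes both operands already
            lie in the range -maxint .. maxint.

```pascal
program CheckedAdd(output);
{ Sketch: tests the precondition of integer "+" before the addition is done.
  canAdd is a hypothetical name introduced for illustration. }

function canAdd(a, b: integer): boolean;
{ True iff a + b lies within -maxint .. maxint.
  The comparison is arranged so that the test itself cannot overflow. }
begin
  if b >= 0 then
    canAdd := a <= maxint - b
  else
    canAdd := a >= -maxint - b
end;

begin
  writeln('1 + 2 allowed?        ', canAdd(1, 2));
  writeln('maxint + 1 allowed?   ', canAdd(maxint, 1));
  writeln('-maxint - 1 allowed?  ', canAdd(-maxint, -1))
end.
```

            Only the first of the three additions satisfies the precondition;
            the specification promises nothing about the other two.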

      4. As a more complicated example, consider an array data type.  Of course,
         we now have not just one data type, but a family of types.  A specific
         array type is defined by giving its indextype and basetype.  Let's
         consider a definition for the type:

                type demoarray = array[1..10] of real

         a. The data type demoarray is characterized by the following two sets:

            Values:     { sequences of 10 real numbers }

            Operation:  { [] (selection by subscript)}

         b. We can further characterize the operation [] (on objects of type
            demoarray) as follows:

            []  input:          a demoarray and an integer
                output:         a real

                precondition:   the integer must lie in the range 1..10
                
                postcondition:  result is the specified member of the sequence

   C. You will notice that our specifications of the data types integer and
      demoarray have been ABSTRACT: we have said how they behave, but not how 
      this behavior is realized.

      1. Of course, someone does have to address this issue.  Thus, associated
         with every abstraction will be an IMPLEMENTATION.

      2. An implementation is specified by giving:

         a. A STORAGE STRUCTURE for mapping the values of the type to memory

         b. A set of "procedures" for realizing the operations.

      3. Sometimes, the implementation of a data type is built into the
         hardware of the computer we are using.  This is almost always the
         case with the data type integer.

         a. The typical storage structure for integers is a machine word which
            holds the binary encoding of the integer.

         b. The "procedures" for realizing the operations on integers are 
            generally basic machine instructions - e.g. most computers have
            hardware instructions to add, subtract, multiply, divide, and
            compare integers.

      4. In other cases, the implementation of a data type is provided by
         the programming language translator.  This is the case with array
         types.

         a. The typical storage structure for an array is a series of adjacent
            memory locations, each big enough to hold one element of the array.

         b. The subscript operation is realized by using the following formula.
            If an array is declared as

                array [lo .. hi] of sometype

            then the address of element[i] =

                address of element[lo] + (i-lo) * size of one item of sometype

            (Code to perform this computation is emitted by the Pascal compiler
             whenever a subscript is encountered in a program.)
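
            A worked instance of the formula, using made-up numbers: the
            starting address 1000 and the item size of 4 bytes are
            assumptions chosen only to make the arithmetic easy to follow.

```pascal
program AddressDemo(output);
{ Worked example of the subscript formula
    address of element[i] = address of element[lo] + (i - lo) * item size
  for an array declared as array[1..10] of real. }
const
  lo = 1;            { lower bound of the index type }
  hi = 10;           { upper bound of the index type }
  base = 1000;       { hypothetical address of element[lo] }
  itemSize = 4;      { hypothetical size, in bytes, of one real }

var
  i: integer;

function addressOf(i: integer): integer;
{ Precondition: lo <= i <= hi }
begin
  addressOf := base + (i - lo) * itemSize
end;

begin
  for i := lo to hi do
    writeln('element[', i:2, '] is at address ', addressOf(i):5)
end.
```

            With these numbers element[1] is at address 1000 and element[10]
            is at 1000 + 9 * 4 = 1036.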

      5. We will be working with abstract data types of our own invention where
         we must furnish both the interface and the implementation.

      6. One of the advantages of data abstraction is that the same abstraction
         can sometimes be realized in different ways.  Different implementations
         may be used in different contexts.

         a. For example, an alternative implementation for integers is to 
            store them as character strings, with subroutines being written to
            perform the necessary operations.  If a given Pascal implementation
            chose to do this, the programmer who uses integers might never know
            the difference.

            i. Each method of implementation has advantages and disadvantages.

           ii. For example, the typical binary implementation is compact and 
               fast, because it uses facilities built into the hardware.  But
               there is a limit to the range of integers that can be represented
               (e.g. -maxint .. maxint).

          iii. A character string implementation would allow virtually 
               unlimited range - e.g. a 100-digit integer should be no problem.
               (Since all computers are finite, there is an ultimate limit, of
               course).
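
            To make the alternative concrete, here is a minimal sketch of
            integers stored as decimal digit characters, with addition done
            by a subroutine.  The names BigNat, setFromInt, add and print are
            hypothetical, and only non-negative values are handled.

```pascal
program DigitStringAdd(output);
{ Sketch: non-negative integers stored as decimal digit characters, with
  addition performed by a subroutine rather than a hardware instruction. }
const
  MaxDigits = 100;
type
  BigNat = record
    len: integer;                         { number of digits in use }
    digit: array[1..MaxDigits] of char    { digit[1] is least significant }
  end;
var
  x, y, z: BigNat;
  k: integer;

procedure setFromInt(var n: BigNat; value: integer);
{ Precondition: value >= 0 }
begin
  n.len := 0;
  repeat
    n.len := n.len + 1;
    n.digit[n.len] := chr(ord('0') + value mod 10);
    value := value div 10
  until value = 0
end;

procedure add(var a, b, sum: BigNat);
{ Preconditions: the result fits in MaxDigits digits;
                 sum is a different variable from a and b }
var
  i, carry, d: integer;
begin
  carry := 0;
  i := 0;
  while (i < a.len) or (i < b.len) or (carry > 0) do
  begin
    i := i + 1;
    d := carry;
    if i <= a.len then d := d + ord(a.digit[i]) - ord('0');
    if i <= b.len then d := d + ord(b.digit[i]) - ord('0');
    sum.digit[i] := chr(ord('0') + d mod 10);
    carry := d div 10
  end;
  sum.len := i
end;

procedure print(var n: BigNat);
var
  i: integer;
begin
  for i := n.len downto 1 do
    write(n.digit[i]);
  writeln
end;

begin
  setFromInt(x, maxint);
  setFromInt(y, 1);
  add(x, y, z);          { maxint + 1: out of range for the built-in type }
  print(z);
  for k := 1 to 5 do     { keep doubling, well past maxint }
  begin
    add(z, z, x);
    z := x
  end;
  print(z)
end.
```

            The price of the larger range is visible in the code: one loop
            iteration per digit instead of a single machine instruction.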

         b. If the user of the abstraction and the implementer of the 
            abstraction agree on an interface (the specifications of the set
            of values and operations), then the implementer is free to change 
            the implementation without the user of the abstraction having to
            change what he has done.

         c. However, if the user of an abstraction relies on details of how
            the abstraction is implemented on a particular system, then
            any change in the implementation may damage the correctness of
            a program using it.  To prevent this, we use the principle of
            INFORMATION HIDING:

            i. The user of a data type is permitted to rely on the
               specification of the type in terms of its set of values and
               set of operations.

           ii. The user of a data type is NOT permitted to rely on the details
               of how the type is implemented - in fact, he need not know them
               (and perhaps should not, to prevent inadvertent reliance on
               them).

   D. We are now ready to define the notion of an "ABSTRACT DATA TYPE".

      1. An abstract data type is defined by a SPECIFICATION that describes its
         values and the operations permitted on objects of that type.

      2. An abstract data type is implemented by appropriate programming
         language constructs such as type declarations, procedures etc -
         in such a way as to HIDE implementation details from the clients using
         the type.

      3. Some newer programming languages have had as one of their design
         goals the provision of support for specifying abstract data types in
         such a way as to hide the details of their implementation.

         a. Standard Pascal is weak on this count; but, as we shall see later,
            VAX Pascal incorporates some extensions that help quite a bit.

         b. Languages with stronger support for ADT's include Modula-2
            (Niklaus Wirth's successor to Pascal), C++ and Ada.

         c. Object-oriented languages such as C++ and Java also incorporate
            strong support for data abstraction.

III. Using ADT's to Design Software
---  ----- ----- -- ------ --------

   A. As we indicated earlier, data abstraction can be used along with
      procedural abstraction in designing software (both high-level and
      low-level.)  Basically, we proceed by a process of stepwise refinement.

      1. We begin by asking the question "What are the major abstract data
         types needed to model the reality we are working with?".

      2. We then develop a specification for each type, by spelling out
         the kind of values it can assume, and by defining each of the
         needed operations.

      3. Often, the new type will need to be defined in terms of a mixture
         of standard data types (e.g. integers) and additional new abstract
         types.  Each of these new abstract types then must itself be
         refined.

   B. Sometimes, we design abstract data types in something of a bottom up,
      rather than top down, fashion.

      1. We essentially create a set of utility operations which we believe 
         to be needed to realize the higher-level operations uncovered in the 
         top down design.

      2. This leads to a "sandwich" approach to design - we start from the
         functional specifications and work down, and we start from the
         reality to be modelled and work up, with the part in the middle being
         done last.

   C. An example

      1. Consider the process of developing software for managing a video
         rental store.  What are the major entities that need to be
         represented?

         ASK

         a. Customers

         b. Videos

         c. Late charges

         etc.

      2. Thinking about late charges for a moment, what information must be
         recorded about a given late charge?

         ASK

         a. Customer owing the charge

         b. Video that the charge is assessed for

         c. Date video was due to be returned

         d. Date video was actually returned

         e. Amount of the charge.

         etc.

         The set of values of the abstract data type "Late charge" would be 
         tuples containing these various pieces of information.

         Values = { (customer, video, date due, date returned, amount) }

      3. What operations need to be performed on the data objects that represent
         late charges?

         ASK

         a. Creating a new object (e.g. when a video is returned late)

         b. Printing out information about it (if the customer wants to know
            why he has been charged).

         c. Summarizing the late charges owed by a given customer (total
            amount due).

         d. Recording that a charge has been paid.

         etc.

         Operations = { create, print, summarize by customer, pay }
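
         One way this specification might be turned into Pascal is sketched
         below.  The field names, the use of integer ID numbers for the
         customer and video, integer day numbers for the dates, amounts in
         cents, and the extra "paid" field needed by the pay operation are
         all assumptions made for illustration.

```pascal
program LateChargeSketch(output);
{ Sketch of the ADT "Late charge": a record for the values, plus one
  routine for each of the operations create, print, summarize, pay. }
const
  MaxCharges = 100;
type
  Date = integer;               { placeholder: a day number }
  LateCharge = record
    customer: integer;          { hypothetical customer number }
    video: integer;             { hypothetical video number }
    dateDue: Date;
    dateReturned: Date;
    amount: integer;            { in cents }
    paid: boolean
  end;
  ChargeList = array[1..MaxCharges] of LateCharge;
var
  charges: ChargeList;

procedure create(var c: LateCharge; customer, video: integer;
                 dateDue, dateReturned: Date; centsPerDay: integer);
{ Precondition: dateReturned is later than dateDue }
begin
  c.customer := customer;
  c.video := video;
  c.dateDue := dateDue;
  c.dateReturned := dateReturned;
  c.amount := (dateReturned - dateDue) * centsPerDay;
  c.paid := false
end;

procedure print(var c: LateCharge);
begin
  writeln('Customer ', c.customer:1, ', video ', c.video:1, ': ',
          (c.dateReturned - c.dateDue):1, ' day(s) late, ',
          c.amount:1, ' cents')
end;

function totalOwed(var list: ChargeList; count, customer: integer): integer;
{ Summarize by customer: total of the unpaid charges for one customer }
var
  i, total: integer;
begin
  total := 0;
  for i := 1 to count do
    if (list[i].customer = customer) and not list[i].paid then
      total := total + list[i].amount;
  totalOwed := total
end;

procedure pay(var c: LateCharge);
begin
  c.paid := true
end;

begin
  create(charges[1], 17, 204, 100, 103, 150);   { 3 days late }
  create(charges[2], 17, 311, 100, 101, 150);   { 1 day late }
  create(charges[3], 42, 204, 110, 112, 150);   { a different customer }
  print(charges[1]);
  writeln('Customer 17 owes ', totalOwed(charges, 3, 17):1, ' cents');
  pay(charges[1]);
  writeln('After payment:   ', totalOwed(charges, 3, 17):1, ' cents')
end.
```

         Customer 17 owes 600 cents before the payment and 150 afterward.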

   D. Another example

      1. Two of the pieces of information we need to record about a late
         charge are calendar dates: the date the video was due, and the date
         it was actually returned.

         a. Actually, calendar dates are used by many, many applications.

         b. However, most programming languages do not provide a built-in
            data type for representing a date.

         c. Thus, developing a date as an ADT can be quite useful for many
            situations.

      2. What is the set of values for a data object representing a calendar
         date?

         ASK

         a. The problem is a bit more complex than it appears at first, because
            practical considerations typically dictate that we actually limit
            ourselves to some subrange of all possible dates.  (Finite storage
            always dictates this - compare the limitation of integers to
            -maxint .. + maxint.)

         b. As we all know, one choice that has often been made in practice
            has been to limit the range of dates to one century - implicit in
            the common decision to use a two-digit integer to represent the
            year.  As 2000 draws near, this choice looks very bad!

         c. But note that even a choice involving using 4 digits for a year
            still has its limitations (though we probably won't have to worry
            about the "Y10K" problem becoming an issue for most software we
            develop!)

         d. There is also the issue of the EARLIEST date we allow ourselves
            to represent.  For example, if we are developing a date data type
            for use by historians, we will have to allow for representing
            dates both BC and AD.

         e. The actual range of values needed will obviously be dictated by
            the problem we are trying to solve.  For our example, it would
            probably be satisfactory to be able to represent dates running from
            the present time until the end of the next century, though being
            able to represent a larger range may end up happening as a natural
            result of the representation we choose.

            e.g. 

            Values = { x | x is a calendar date & today <= x < 1-jan-2100 }

      3. What are the operations we need to be able to perform on a date?

         ASK

         a. We need some mechanism to convert back and forth between the
            way we represent dates internally and a suitable printed form.

            In the example code I will be distributing, I will call these
            stringToDate and dateToString.

         b. We need some mechanism for getting at today's date.

            In the example code I will be distributing, I will call this
            operation todaysDate.

         c. We need the ability to compare two dates - e.g. to determine
            whether a video is being returned early, on time, or late.

            In the example code I will be distributing, this will be handled
            by using one of the other operations we will be defining.

         d. We need to be able to perform certain kinds of arithmetic on
            dates.

            i. Add an integer to a date (needed to calculate a date due).

           ii. Subtract two dates, producing an integer (needed to calculate
               number of days a video is late.)

            In the example code I will be distributing, I will call these
            addToDate and subtractDates.  

          iii. As an aside, note that the arithmetic operations on dates
               involve operands/results of type integer as well as dates,
               and that the set of meaningful operations is quite limited -
               e.g. the following operations would be absurd:

               - Adding two dates together

               - Multiplying two dates, or a date by an integer

               - Dividing two dates, or a date by an integer

         e. To fully specify an operation on an ADT, we give its name, the
            type and meaning of its parameters and result, and its preconditions
            and postconditions.  We will do this for the two arithmetic
            operations as an example:

            i. addToDate (startDate, interval) returns Date

               Preconditions:  startDate is a valid Date

                               interval is an integer

                               Result of operation must fall within range of
                               representable dates.

               Postconditions: Result is a Date the specified number of days
                               before or after startDate (after if interval is
                               positive; before if interval is negative)

           ii. subtractDates (startDate, finishDate) returns integer

               Preconditions:  startDate and finishDate are valid Dates

               Postconditions: Result is number of days between these dates -
                               positive if finishDate is after startDate,
                               negative if before; zero if the same.
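
            These two specifications might be realized as follows.  This is
            a sketch only: storing a Date as a count of days since an
            arbitrary "day 0" is an assumption made so that the code can run,
            and isValidDate, FirstDay and LastDay are hypothetical names that
            are not part of the specification above.

```pascal
program DateArithmetic(output);
{ Sketch of the two arithmetic operations on the ADT Date, assuming a Date
  is stored as a number of days since an arbitrary "day 0". }
const
  FirstDay = 0;        { assumed earliest representable date }
  LastDay = 30000;     { assumed latest representable date }
type
  Date = integer;
var
  rented, due, returned: Date;

function isValidDate(d: integer): boolean;
begin
  isValidDate := (d >= FirstDay) and (d <= LastDay)
end;

function addToDate(startDate: Date; interval: integer): Date;
{ Preconditions:  startDate is a valid Date;
                  startDate + interval is a representable date
  Postcondition:  result is interval days after startDate
                  (before startDate if interval is negative) }
begin
  addToDate := startDate + interval
end;

function subtractDates(startDate, finishDate: Date): integer;
{ Preconditions:  startDate and finishDate are valid Dates
  Postcondition:  result is positive if finishDate is after startDate,
                  negative if before, zero if they are the same }
begin
  subtractDates := finishDate - startDate
end;

begin
  rented := 100;
  due := addToDate(rented, 3);
  returned := 105;
  if isValidDate(due) and isValidDate(returned) then
    writeln('Days late: ', subtractDates(due, returned):1)
end.
```

            The video here is 2 days late.  Note that the sign of the result
            of subtractDates also answers the comparison question (early, on
            time, or late), so no separate comparison operation is needed.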
        
      4. The final step in developing an ADT is producing an actual
         implementation - which entails:

         a. Developing a storage structure - a way of representing the values
            of the ADT using either built in types or other, simpler ADT's.

         b. Writing functions and/or procedures to implement the various
            operations, using the chosen storage structure.

         c. It is frequently the case, at this point, that we need to examine
            a variety of possible storage structures that might be used -
            choosing the one that lets us implement the desired operations
            most easily and efficiently.

            Example: What storage structures might be used to represent the
                     ADT Date?

            ASK

            i. Character string

               - stringToDate and dateToString operations are trivial

               - arithmetic and comparison operations are extremely complex

           ii. Record with fields for month, day, and year

               - stringToDate and dateToString are more complex, but not
                 horrible

               - arithmetic and comparison are complex, but doable

               Subordinate question: how to represent the individual fields?

          iii. Integer representing the number of days since some base date

               - This one is the most widely used - e.g. most operating systems
                 use something like this for their internal representation.

                 VMS:  number of 100-nanosecond "ticks" since the Smithsonian 
                       base date - November 17, 1858.

                 Unix: number of seconds since January 1, 1970

                 (Will eventually lead to a problem when # of time units exceeds
                 range of integer representation used - e.g. VMS systems  will 
                 face an "8664 problem" and Unix systems will face a "2038 
                 problem".)

               - stringToDate and dateToString are complex, but most systems
                 provide library routines for handling this for the standard 
                 internal format.  (Conversion from "ticks" or seconds since
                 base date to days since base date is a trivial matter of
                 dividing by "ticks" or seconds per day.)

               - Arithmetic and comparison operations are trivial.

               - Would be a poor choice for a system that has to represent
                 dates in the past as well as the future.
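
            The trade-off between choices ii and iii can be seen in a short
            sketch that converts between a month/day/year record and a day
            number.  The base date of January 1, 1999 and all of the names
            used are assumptions made for illustration; a century of day
            numbers would also need an integer type of more than 16 bits.

```pascal
program DateStorage(output);
{ Sketch: converting between a record representation of a date and a
  day-number representation.  Day 0 is January 1 of BaseYear. }
const
  BaseYear = 1999;     { assumed base year }
type
  Date = integer;
  DateRec = record
    month, day, year: integer
  end;
var
  due, back: DateRec;
  d: Date;

function isLeap(y: integer): boolean;
begin
  isLeap := ((y mod 4 = 0) and (y mod 100 <> 0)) or (y mod 400 = 0)
end;

function daysInYear(y: integer): integer;
begin
  if isLeap(y) then daysInYear := 366 else daysInYear := 365
end;

function daysInMonth(m, y: integer): integer;
begin
  case m of
    1, 3, 5, 7, 8, 10, 12: daysInMonth := 31;
    4, 6, 9, 11: daysInMonth := 30;
    2: if isLeap(y) then daysInMonth := 29 else daysInMonth := 28
  end
end;

function toDayNumber(r: DateRec): Date;
{ Precondition: r is a valid calendar date with r.year >= BaseYear }
var
  n, y, m: integer;
begin
  n := 0;
  for y := BaseYear to r.year - 1 do
    n := n + daysInYear(y);
  for m := 1 to r.month - 1 do
    n := n + daysInMonth(m, r.year);
  toDayNumber := n + r.day - 1
end;

procedure toRecord(d: Date; var r: DateRec);
{ Precondition: d >= 0 }
begin
  r.year := BaseYear;
  while d >= daysInYear(r.year) do
  begin
    d := d - daysInYear(r.year);
    r.year := r.year + 1
  end;
  r.month := 1;
  while d >= daysInMonth(r.month, r.year) do
  begin
    d := d - daysInMonth(r.month, r.year);
    r.month := r.month + 1
  end;
  r.day := d + 1
end;

begin
  due.month := 2;  due.day := 27;  due.year := 2000;
  d := toDayNumber(due) + 3;     { on day numbers, arithmetic is one add }
  toRecord(d, back);
  writeln(back.month:1, '/', back.day:1, '/', back.year:1)
end.
```

            Three days after February 27, 2000 is printed as 3/1/2000.  All
            of the calendar knowledge (month lengths, leap years) lives in
            the conversion routines; the arithmetic itself is a single
            integer addition.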

         d. Because different storage structures have different strengths and
            weaknesses, we build a "wall of abstraction" between the
            SPECIFICATION of the ADT and the IMPLEMENTATION of the ADT.  The
            specification should be independent of the internal representation
            used, and users of the ADT should depend only on the specification,
            not the implementation.  (Principle of INFORMATION HIDING).  This
            would allow the implementation to be changed at some future time
            without affecting the code that uses the ADT.

            (In the case of dates, some change of representation is needed on
             many systems as 2000 or some other critical year approaches.  In
             fact, the current Y2K problem would be MUCH easier to solve if
             good data abstraction techniques had been consistently in use in
             software design).

IV. Realizing ADT's in VAX Pascal
--  --------- ----- -- --- ------

   A. In keeping with the principle of information hiding, it is desirable
      that each abstract data type used in a program be implemented as a
      module of its own, containing the definition of the storage structure
      for the type plus the operations on it.  This module should, in turn,
      be composed of two parts:

      1. A specification part, containing the information needed by other
         parts of the program that use the ADT.

      2. An implementation part, whose details should not be visible to the
         rest of the program.

   B. An approximation to this can be achieved in VAX Pascal by using
      modules.  (These are a major extension to standard Pascal.)  (Note: this
      is not the only reason why this facility of VAX Pascal exists.)

      DISTRIBUTE, GO OVER HANDOUT 4.  DEMO COMPILATION OF AND RUNNING PROGRAM

      1. A VAX Pascal program may consist of a main program and any number of
         modules.  Each module resides in a separate file, and is compiled
         separately - as is the main program.  The main program and the other
         modules are then joined together by the LINK command.

      2. A main program begins with a program heading, and a module begins
         with a module heading - e.g. the module for the ADT Date begins:

         module Dates;

         Note: if the module uses any files, they should appear in the module
               heading, just like files used in the main program appear in the
               program heading.

      3. The bulk of the module consists of declarations of constants, types,
         variables and/or procedures and functions.

      4. The module terminates with an end statement and a period.  There is
         no begin associated with this end, since nothing in the module is
         executed until it is called by the main program.

   C. To make the declarations in the module accessible to other modules or
      the main program, the module heading can be preceded with an environment
      directive.

      1. This causes the compiler to create an environment file containing an
         internalized representation of the declarations in the module.

      2. Users of the module can begin with an inherit directive to access
         this environment file.

   D. In keeping with the principle of data abstraction, a module can be
      physically divided into two parts - the specification part, and the
      implementation part.

      1. Declarations which users of the ADT need should be placed in the
         specification part, and others in the implementation part.

      2. In the case of procedures and functions, the forward declaration
         can be used to separate the procedure's heading (specification)
         from its implementation.
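
      Put together, a module laid out this way might look roughly like the
      sketch below.  It is not the handout code: the attribute spelling and
      the environment file name reflect my understanding of VAX Pascal and
      should be treated as assumptions, and the bodies assume a Date is
      stored as an integer number of days.  The handout remains the
      authoritative version.

```pascal
{ Sketch of a VAX Pascal module for the ADT Date (file name assumed to be
  DATES.PAS; the environment file name is also an assumption). }
[environment ('dates.pen')]
module Dates;

{ ----- Specification part: what users of the ADT need to see ----- }

type
  Date = integer;

function addToDate (startDate: Date; interval: integer): Date; forward;
function subtractDates (startDate, finishDate: Date): integer; forward;

{ ----- Implementation part: details users should not rely on ----- }

function addToDate;
begin
  addToDate := startDate + interval
end;

function subtractDates;
begin
  subtractDates := finishDate - startDate
end;

end.
```

      A program that uses the module would place the corresponding inherit
      directive, naming the same environment file, ahead of its program
      heading.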

   E. The handout example illustrates one limitation of the module facility of 
      VAX Pascal for data abstraction.

      1. We had to declare the type Date in the visible part; thus, modules
         that use Date "know" that dates are actually represented by
         integers.  (In fact, we would use some convention that a Date
         is represented as the number of days that have elapsed since some
         arbitrarily chosen "Day 0".  The actual choice of base date, of
         course, remains hidden - e.g. the VMS operating system uses
         November 17, 1858 for this purpose.)

      2. The problem with this is that someone might be tempted to bypass
         the use of addToDate and subtractDates by just doing ordinary
         integer addition and subtraction on dates.  This will work fine -
         until the implementation of dates is later changed to, say,
         use a record with three fields for month, day, and year; or to
         use an integer representing seconds since the base date instead of
         days.

      3. Newer data abstraction and OO languages make it possible to declare
         "private" data types.  With a private type, the user of the ADT
         knows that a type of a given name exists, but does not know how
         it is declared.

Copyright ©1999 - Russell C. Bjork