CPS222 Lecture: Hashing                                last revised 1/25/2013

I. Introduction
-  ------------

   A. Recall that a map is a data structure that can be used to store and
      retrieve information associated with a certain key. The operations on
      such a structure can be pictured as follows:

      1. Insertion:
                        __________________
         Key, value    |       Map        |
         ---------->   | (key,value pairs)|
                       |__________________|

      2. Lookup:
                        ___________
         Key           |    Map    |   Value
         ---------->   |           |  ------->
                       |___________|

      3. Deletion:
                        ____________________________
         Key           |            Map             |
         ---------->   | (key and its value removed)|
                       |____________________________|

   B. We have, at various times in the past, considered the following
      search structures:

      Structure           Insert         Lookup          Delete
      ---------           ------         ------          ------
      "Pile"              O(1)           O(n)            O(n) [ to find
                                         (Sequential       "victim" ]
                                          search)

      Ordered array       O(n)           O(log n)        O(n)
                                         (Binary search)

      Linked list -       O(1)           O(n)            O(n) [ to find
      not ordered                        (Sequential       "victim" ]
                                          search)

      Binary search tree  O(log n)-O(n)  O(log n)-O(n)   O(log n)-O(n)

      (All binary search tree operations can be guaranteed to be O(log n)
      if we use a balancing strategy, as we will discuss shortly.)

   C. Today we introduce the last search structure (for main memory) that
      we will consider: the hash table. A hash table exhibits the following
      performance:

      Hash table          O(1)-O(n)      O(1)-O(n)       O(1)-O(n)

      Note that there is something of a gamble involved here. A hash table
      can perform unbeatably well, but it can also do as badly as the worst
      of the structures we have considered.

II. Hashing
--  -------

   A. In a map, the key may be an integer, a character string, or (more
      rarely) something else such as a real. For the purpose of hashing,
      though, we will want to always work with integer keys. This is
      possible because any non-integer key can be converted into an
      equivalent integer.

      Example: a character string composed only of the 26 letters A-Z plus
               space can be regarded as an integer radix 27.

      Example: Treated as an integer radix 27 (with space = 0, A = 1, ...
               Z = 26),

               BJORK = 2*27^4 + 10*27^3 + 15*27^2 + 18*27 + 11
                     = 1,271,144

      Example: The real number 3.14 has the (32 bit) binary representation
               using IEEE floating point

               0100 0000 0100 1000 1111 0101 1100 0011

               which is equivalent to the integer 1078523331
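      Example (a minimal C++ sketch of the string conversion - not from
               the original notes; the function name radix27 and the
               convention space = 0, A = 1 ... Z = 26 are assumptions
               consistent with the BJORK arithmetic above):

               #include <iostream>
               #include <string>

               // Treat a string of blanks and capital letters A-Z as an
               // integer radix 27, with ' ' = 0, 'A' = 1, ..., 'Z' = 26.
               long long radix27(const std::string& key) {
                   long long value = 0;
                   for (char c : key) {
                       int digit = (c == ' ') ? 0 : (c - 'A' + 1);
                       value = value * 27 + digit; // shift one radix-27 place
                   }
                   return value;
               }

               int main() {
                   std::cout << radix27("BJORK") << std::endl; // 1271144
               }

               Note that even modest key lengths overflow a 32-bit integer
               this way - one motivation for the folding technique
               discussed later.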
   B. If we had unlimited storage, we could create a very fast search
      structure by allocating an array whose subscripts would range over
      all possible values of the key. If a given key exists, it would be
      found in the slot indexed by its value.

      1. Example: in such a table an entry for BJORK would be in slot
         1271144.

      2. Obviously, however, such a table could not reasonably be created
         for keys of any significant size - e.g. 10-letter keys would
         require 27^10 slots (about 2 x 10^14) - the vast majority of which
         would be empty!

      3. Even if such a table could be created, initializing it would be
         computationally costly, since we would have to put an indicator in
         each slot showing that there is no value there as yet, to prevent
         wrong results when searching for a nonexistent key.

   C. Hashing builds on this basic idea, as follows: a hash table is an
      array of BUCKETS, each of which consists of one or more SLOTS. The
      buckets are typically numbered 0..b-1 or 1..b. (There are b buckets
      in all.) The total number of slots, then, is s*b, where s is the
      number of slots per bucket.

      1. A single slot is able to hold one key-value pair.

      2. When hashing is used for tables maintained in main memory (our
         focus here), it is common for each bucket to consist of a single
         slot - in which case the hash table becomes simply an array of
         slots. We will develop most of our initial examples along these
         lines - extension to the case of buckets having multiple slots is
         easy.

      3. The total number of slots is greater than the total number of
         keys expected to be entered in the table, but typically much less
         (often by many orders of magnitude) than the number of POSSIBLE
         keys.

      4. Initially, all the slots are set to some special value that
         indicates that the slot is empty - e.g. a key of 0 or the like.
         This is now feasible because the number of slots is comparable in
         magnitude to the number of keys we intend to store, not the
         number of possible keys.

   D. The basic idea behind hashing is this: we devise a key-to-address
      transformation algorithm that converts a LOGICAL key (such as a
      name) to a PHYSICAL "home" bucket for the data associated with that
      key.

                    ________________
                   | Key to address |
      Logical ---->| transformation | ----> Physical bucket
      key          | algorithm      |       number
                    ----------------

      1. This transformation must allow for the fact that the number of
         POSSIBLE logical keys is MUCH greater than the number of possible
         resulting bucket numbers.

         Example: Let's say we decide to use a hash table with 2000
                  buckets for Gordon students (to allow room for growth
                  etc.) Further, let's suppose we use student ID's as the
                  logical key. The transformation from Gordon ID to bucket
                  number maps 10,000,000 possible logical keys into only
                  2000 possible physical keys.

      2. As a result, any given algorithm has the possibility of mapping
         two different logical keys to the same physical bucket. Such keys
         are said to be SYNONYMS.

         a. As long as the number of synonyms for any given bucket is <=
            the number of slots per bucket, we have no problem. We simply
            store successive synonyms in successive slots of the same
            bucket. When we go to look up a key, we calculate its home
            bucket using the hashing function and then search all slots in
            the bucket to see if it is in one of them.

         b. Of course, if we have only one slot per bucket, then there is
            no room to put two or more synonyms in the same bucket. Even
            if we have multiple slots per bucket, a problem can arise if
            there are more actual synonym keys for a bucket than there are
            slots.

         c. The resultant condition is said to be a COLLISION. Since only
            one key-value pair can be stored in any given slot in the
            table, we will have to devise some strategy for handling these
            collisions.

      3. Thus, any hash table scheme is characterized by the following
         parameters:

         a. The number of buckets (b).

         b. The number of slots per bucket (s).

         c. The number of keys actually present in the table (n) - where
            n <= b*s.

         d. The hashing function that maps a key to a bucket number in the
            range 0..b-1 or 1..b.

         e. The strategy for handling collisions.

         Later in the lecture, we will consider the last two items in
         detail, exploring various alternatives. For now, we will consider
         one commonly used hashing function and one commonly used
         collision-handling strategy.

   E. As we shall see, there are many hashing functions that could be
      used. The simplest is one called the DIVISION REMAINDER METHOD:

         home-bucket = key mod b      (to produce a result in the range
                                       0..b-1)
      or
         home-bucket = 1 + key mod b  (to produce a result in the range
                                       1..b)

   F. Likewise, there are many possible strategies for handling
      collisions. The simplest is one called LINEAR PROBING, or LINEAR
      OPEN ADDRESSING:

      1. To insert a record:

         a. Compute the address of the home bucket, using the hash
            function.

         b. If that bucket has room, put the record there.
         c. Otherwise, begin looking at adjacent buckets (in increasing
            bucket number order) until a bucket with room is found. Put
            the record in the first vacant bucket.

            i. If you reach the last bucket in the table, then continue
               searching with the first bucket (i.e. treat the bucket
               numbers as if they wrap around modulo b).

            ii. If you come full circle back to the home bucket, then give
                up; the table is full. (In this case, one can replace the
                table with a larger one dynamically; but a new hash
                function would also be needed to take advantage of the
                added buckets. This would mean repositioning every record
                already in the table as well.)

         Example: Table with 5 buckets, 1 slot per bucket (initially
                  empty); entries consist of a numeric ID (key) plus a
                  name. Hash function = key mod 5.

                  Insert 17 AARDVARK: goes into bucket 2
                  Insert 23 BUFFALO:  goes into bucket 3
                  Insert 12 CAT:      should go into bucket 2, but ends
                                      up in 4
                  Insert 44 DOG:      should go into bucket 4, but ends
                                      up in 0

      2. To locate a record:

         a. Compute the address of the home bucket, using the hash
            function.

         b. If that bucket contains the record, we have succeeded. (One
            must actually check the data stored to be sure the key
            matches.) If the home bucket is vacant, then the record is not
            in the table.

         c. If the home bucket is full - but does not contain the desired
            key - begin searching successive buckets (as on insert), until
            either

            - The desired record is found.

            - A vacant bucket is found (in which case we conclude the
              record is not in the table, since otherwise insert would
              have found this bucket and put the record there.)

            - You come full circle to the home bucket (in which case
              conclude the record is not there, because you have tried
              every bucket!)

         Example: Trace lookup of each of 17, 23, 12, 44 in turn.
                  Trace lookup of records with key = 31, 30.

      3. To delete a record:

         a. First locate the record as above.

         b. Now, can we simply vacate the slot? No. Why not? Because then
            a later lookup on another record may fail.

            Example: suppose we deleted 17 AARDVARK by vacating slot 2.
                     What would happen when we try to look up 12?

         c. Therefore, we instead must replace the record with a dummy
            record that fills the slot, but will never match any key we
            are looking for. (E.g. if our key is numeric, we might store
            the letter D in the key field of the record.)

            - On insert, we treat such a slot as if it were, in fact,
              vacant, and put a new record there if we need to.

            - On lookup, we treat such a slot as occupied, since it once
              was.
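      Example (a minimal C++ sketch of a one-slot-per-bucket table using
               the division remainder method and linear probing, with the
               dummy-record treatment of deletion just described; the
               class name and the use of key 0 for "never used" and -1
               for "deleted" are illustrative assumptions):

               #include <string>
               #include <vector>

               class HashTable {
                   struct Slot { long key = 0; std::string value; };
                   std::vector<Slot> slots;
                   static const long EMPTY = 0, DELETED = -1;

                   int home(long key) const {
                       return (int)(key % (long)slots.size());
                   }

               public:
                   explicit HashTable(int b) : slots(b) {}

                   // Returns false only if the table is completely full.
                   bool insert(long key, const std::string& value) {
                       int start = home(key), i = start;
                       do {
                           // A deleted slot is treated as vacant here.
                           if (slots[i].key == EMPTY ||
                               slots[i].key == DELETED) {
                               slots[i].key = key;
                               slots[i].value = value;
                               return true;
                           }
                           i = (i + 1) % slots.size(); // wrap modulo b
                       } while (i != start);
                       return false;       // came full circle: table full
                   }

                   bool lookup(long key, std::string& value) const {
                       int start = home(key), i = start;
                       do {
                           // A never-used slot ends the search; a deleted
                           // slot is treated as occupied, so we keep going.
                           if (slots[i].key == EMPTY) return false;
                           if (slots[i].key == key) {
                               value = slots[i].value;
                               return true;
                           }
                           i = (i + 1) % slots.size();
                       } while (i != start);
                       return false;       // tried every bucket
                   }

                   // Replace the record with a dummy that matches no key.
                   bool remove(long key) {
                       int start = home(key), i = start;
                       do {
                           if (slots[i].key == EMPTY) return false;
                           if (slots[i].key == key) {
                               slots[i].key = DELETED;
                               return true;
                           }
                           i = (i + 1) % slots.size();
                       } while (i != start);
                       return false;
                   }
               };

               Constructing HashTable(5) and inserting keys 17, 23, 12, 44
               reproduces the bucket assignments in the example above.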
III. Additional Key-To-Address Transformation Techniques (Hashing Functions)
---  ---------- -------------- -------------- ---------- -------- ---------

   A. Any hashing function must meet two basic criteria:

      1. For all possible logical keys, it must produce a value in the
         range 0..b-1 or 1..b.

      2. It must disperse the logical keys uniformly - i.e. the
         probability that a randomly chosen key hashes to any particular
         bucket should be 1/b - or very close to this.

      3. Actually, the second criterion can be more complicated if the
         keys to be used exhibit some pattern or bias. For example, the
         following scheme would uniformly distribute random integer keys
         over 10 buckets:

            bucket = key mod 10

         However, if the last digits of the keys were not uniformly
         distributed (e.g. if there were a bias toward even keys), then
         the resulting distribution would be non-uniform.

      4. A further consideration is that the hashing function should not
         be computationally expensive, since we are trying to compete with
         an O(log n) search strategy and can lose our advantage if too
         much computation is required.

   B. We have already discussed the division-remainder method.

      1. home-bucket = key mod b      (to produce a result in the range
                                       0..b-1)
         or
         home-bucket = key mod b + 1  (to produce a result in the range
                                       1..b)

      2. Advantages:

         a. Computationally simple if the key is an integer to begin with,
            or if converting it to an integer is not too expensive.

         b. Provides good dispersion if b is a prime, or at least has no
            prime factors <= 20.

         c. Flexible choice of b values - many sizes to choose from.

   C. The mid-square method

      1. home-bucket := middle m bits of key^2.

      2. This requires that b be a power of 2 (b = 2^m).

      3. Example: Let b be 64 (so m = 6), and let keys be integers ranging
         from 1 to 1000. Then the square of the key is at most 20 bits
         long, and we choose the middle 6 bits - bits 7..12, numbering
         from 0 at the left. 50 would hash as follows:

                                  bits taken
                                   ________
            50^2 = 2500 = 0000 0000 1001 1100 0100

            home-bucket = 010011 = 19

      4. Advantages:

         a. Computationally simple if the key is an integer to begin with,
            or if converting it to an integer is not too expensive.

         b. Provides good dispersion, since the hash function depends on
            all bits of the original key.

         c. Tables whose size is a power of 2 are often natural anyway.

   D. Folding

      1. Folding is one way of avoiding the need to convert a non-integer
         key to an integer - which often poses a problem if the key is
         long enough that the resultant value would not fit in the word
         length of the underlying machine. (E.g. even a 7-letter
         alphabetic key, treated as a number radix 27, could have a value
         bigger than the largest 32 bit integer.)

      2. The key is divided into a number of pieces, each of which is
         treated as an integer. All of the pieces are added together,
         either straight or with alternate pieces reversed. (Both forms,
         and the mid-square method, are illustrated in the code sketch at
         the end of this section.)

         Example: 123456789012 might be treated as four pieces

                     123 456 789 012

                  which could be added together one of two ways:

            a. Shift folding:      b. Boundary folding:

                  123                    123
                  456                    654
                  789                    789
                  012                    210
                 ----                   ----
                 1380                   1776

   E. Digit analysis

      1. The previous methods required no advance knowledge of the actual
         set of keys to be used. If the keys are available in advance,
         though, a hashing function might be developed based on an
         analysis of them.

      2. One approach is to calculate the frequency distribution of the
         different values of each digit (letter) of the keys.

         Example: frequency analysis on last names of students in class

   F. One last topic we mention briefly is the notion of ORDER-PRESERVING
      hash functions.

      1. In general, it is not the case that if key1 < key2 then
         hash(key1) < hash(key2). However, there are some hash functions
         that do have this property. They are known as order-preserving
         hash functions.

      2. An order-preserving hash function would be used where one wishes
         to be able to process table entries in ascending order of key
         value, starting at some given point. Such processing is needed
         when looking for a RANGE of key values - e.g.

            JOHNS <= last_name < JOHNT
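      Example (the mid-square and folding calculations above, expressed as
               a C++ sketch under the same assumptions as the examples:
               keys up to 1000 with b = 64 for mid-square, and 3-digit
               pieces for folding; all names are illustrative):

               #include <algorithm>
               #include <cstdint>
               #include <iostream>
               #include <string>

               // Mid-square with b = 64 (m = 6): keys <= 1000 square to
               // at most 20 bits; dropping the low 7 bits and keeping the
               // next 6 extracts the middle six bits of the square.
               int midSquare(uint32_t key) {
                   uint32_t square = key * key;
                   return (square >> 7) & 0x3F;
               }

               // Shift folding: split into 3-digit pieces and add them.
               int shiftFold(const std::string& digits) {
                   int sum = 0;
                   for (size_t i = 0; i < digits.size(); i += 3)
                       sum += std::stoi(digits.substr(i, 3));
                   return sum;
               }

               // Boundary folding: as above, but reverse alternate pieces.
               int boundaryFold(const std::string& digits) {
                   int sum = 0;
                   bool rev = false;
                   for (size_t i = 0; i < digits.size(); i += 3) {
                       std::string piece = digits.substr(i, 3);
                       if (rev) std::reverse(piece.begin(), piece.end());
                       sum += std::stoi(piece);
                       rev = !rev;
                   }
                   return sum;
               }

               int main() {
                   std::cout << midSquare(50) << "\n";                // 19
                   std::cout << shiftFold("123456789012") << "\n";    // 1380
                   std::cout << boundaryFold("123456789012") << "\n"; // 1776
               }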
IV. Additional Collision Resolution Strategies
--  ---------- --------- ---------- ----------

   A. Though a good hashing scheme that disperses keys uniformly can
      reduce the number of synonyms and hence the probability of a
      collision, we cannot avoid having to deal somehow with such
      collisions as do occur.

   B. We have already considered linear probing, or linear open
      addressing:

      1. Basically, when we have a collision in a given bucket on insert,
         we try the next successive bucket, and we keep trying successive
         buckets (wrapping around from the last to the first if need be)
         until we find a bucket with a vacant slot - which must always
         occur, unless the table is 100% full. (A special case we should
         detect to prevent infinite looping!)

      2. Looking up a key mirrors the process of inserting a key. We keep
         trying successive buckets until we either find our key, or we
         find a bucket with a vacant slot - which is where the key would
         have gone if it were in the table.

      3. To delete a key, we need to leave behind some sort of marker that
         says the slot was occupied at one point in time. On insert, we
         consider such a slot vacant and use it; on lookup, we consider it
         occupied and keep looking.

      4. Comments on the efficiency of linear open addressing:

         a. At first glance, it appears that hashing with linear open
            addressing could be terribly inefficient: it could degenerate
            to searching the entire table.

         b. On the other hand, if the record we want is, in fact, in its
            home bucket or very near to it, then this method works quite
            well.

         c. The success of this method depends on two things:

            i. Allocating enough space in the table so that there are
               sufficient vacant slots to break up long searches. (A good
               rule of thumb is to never allow more than 80% of the slots
               to actually be used - e.g. if we wish to store records on
               1600 students, then use a table with at least 2000 slots,
               plus an appropriate hash function.)

            ii. Choosing a hash function that disperses the keys uniformly
                over the slots.

         d. One remaining problem that is hard to avoid, however, is the
            problem of CLUSTERING.

            i. Consider the following portion of a hash table:

                |_____________|
                |  Bucket x   |
                |_____________|
                |  Bucket x+1 |
                |_____________|
                |  Bucket x+2 |
                |_____________|
                |  Bucket x+3 |
                |_____________|
                      ....

               Suppose bucket x overflows. Then a key belonging to bucket
               x is inserted into bucket x+1.

            ii. Of course, the effect of this overflow is to increase the
                probability that bucket x+1 will also overflow, since it
                is now receiving keys that map to two different buckets.

            iii. When bucket x+1 overflows, it begins adding keys to
                 bucket x+2. This also becomes the place where further
                 overflows from bucket x must go, of course. So now bucket
                 x+2 becomes the target for keys hashing to three
                 different buckets. This further enhances the chances of
                 bucket x+2 overflowing, which would make bucket x+3 the
                 target for keys hashing to four different buckets ...

            iv. As you can see, linear probing suffers from the problem
                that clusters of overflowing buckets can develop, such
                that several buckets "compete" for the same overflow
                space. (In the above case, buckets x, x+1, x+2, x+3 and
                x+4 would all compete for space in bucket x+4.) Further,
                once this clustering starts to occur, it feeds on and
                compounds itself.

            v. There are several alternatives available to reduce this
               clustering problem.

   C. Quadratic probing

      1. Quadratic probing addresses the clustering problem of linear open
         addressing by using a quadratic function to choose overflow
         buckets.

         a. If a key belongs in bucket x, the following series of buckets
            is examined until one is found with room to hold it:

               x   (x+1) mod b   (x+4) mod b   (x+9) mod b ...

            i.e. the buckets probed are of the form (home + i^2) mod b.

         b. Notice how this breaks up clusters. In the above example,
            bucket x would first overflow to bucket x+1, increasing the
            probability of overflow there.
            But now once bucket x+1 overflows, further overflows from home
            bucket x would go into bucket x+4, while overflows from home
            bucket x+1 would go into bucket x+2. Thus, buckets x and x+1
            would not compete with each other for overflow space, and the
            reinforcing effect of local overflows would not occur.

      2. Of course, in looking for a key we must probe buckets in the same
         order as we do for insertion. We continue probing until we find
         the key, or we are forced to abandon the search because some
         probe leads us to a bucket with an empty slot (not the result of
         a deletion).

      3. Of course, we want to be sure that an insertion into a nearly
         full table will succeed if at all possible, which means that - if
         necessary - we will eventually probe each bucket in the table
         exactly once. It is possible to get this behavior if we use a
         variant of quadratic probing in which we go both forward and
         backward from a slot (i.e. we go +1, -1, +4, -4 ...). We will get
         the desired behavior if we use a table size that is a prime of
         the form 4j + 3 - i.e. a prime one less than a multiple of four.

         Example: Consider a table of size 7 (buckets numbered 0..6), and
                  a key that hashes to bucket 3. If the table is nearly
                  full, the following buckets will be probed in the order
                  shown in an attempt to find room for the key:

                  (home)        3
                  (home +/- 1)  4 2
                  (home +/- 4)  0 6
                  (home +/- 9)  5 1

                  Since we have now tried all seven buckets, any further
                  probe must fail.
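      Example (a small C++ sketch of the two-directional probe sequence
               just described, assuming a prime table size of the form
               4j + 3; the function name probeSequence is illustrative):

               #include <iostream>
               #include <vector>

               // Order in which buckets are probed for a given home
               // bucket: home, home+1, home-1, home+4, home-4, ...
               // If b is a prime of the form 4j + 3, this visits every
               // bucket exactly once (b is odd, so the sequence ends
               // exactly when all b buckets have been listed).
               std::vector<int> probeSequence(int home, int b) {
                   std::vector<int> order;
                   order.push_back(home);
                   for (int i = 1; order.size() < (size_t)b; i++) {
                       order.push_back((home + i * i) % b);           // +i^2
                       order.push_back(((home - i * i) % b + b) % b); // -i^2
                   }
                   return order;
               }

               int main() {
                   for (int bucket : probeSequence(3, 7))
                       std::cout << bucket << " ";  // prints 3 4 2 0 6 5 1
                   std::cout << std::endl;
               }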
   D. Rehashing

      1. Another approach to solving the clustering problem is REHASHING.
         Instead of using a single hash function, we use a series of hash
         functions f1, f2, f3 ...

      2. When an attempt is made to insert a key in the table, it is first
         hashed using f1. If the resultant bucket is full, then the key is
         hashed again using f2. If that bucket is full, then f3 is used,
         etc., until some hash function hashes the key to a bucket that is
         empty.

      3. The same series of hash functions is used, in turn, when
         searching for a key, until the key is found or some probe takes
         us to a bucket with an empty slot, in which case we abandon the
         search.

      4. One obvious challenge is developing a suitable series of hash
         functions. Ideally, the hash functions should have the property
         that if f1 hashes two keys to the same bucket, then f2 hashes
         them to different buckets, etc. In contrast to other overflow
         handling methods, this has the effect of causing two keys that
         collide initially not to collide again on overflow. On the other
         hand, it is hard to find functions having this property that also
         guarantee that every bucket will eventually be tried in the case
         of insertion into an almost-full table.

   E. Chaining

      1. One problem that all of the variants of hashing we have
         considered thus far suffer from is that overflows are handled by
         using the same space in the table that is used for the home
         buckets. Thus, it can eventually happen that an insertion will
         fail because the entire table is full. (E.g. the 1001st insertion
         into a table of 500 buckets of 2 slots each must fail.)

         a. The normal solution to this problem with the schemes we have
            considered previously is to rebuild the table with a larger
            size - either by increasing the number of buckets or by
            increasing the number of slots per bucket.

         b. This, of course, requires recopying the entire table - and may
            involve a change of hash function and a rehashing of all
            existing entries if the number of buckets changes. Naturally,
            this would make the insertion that triggers the restructuring
            appear to be very slow.

      2. An alternate approach is to make use of linked lists. This takes
         two forms.

         a. The hash table may be structured as an array of buckets, as
            before, but now we add to each bucket a link, initially NULL.

         b. Insertions are initially made into the home bucket, as before.
            However, should the home bucket overflow, the following
            approach is used:

            i. A new bucket is allocated from outside the table structure.

            ii. The new key is put into it.

            iii. The link of the home bucket is made to point to the
                 overflow bucket, and the link of the overflow bucket is
                 made NULL.

            iv. Should the overflow bucket itself become full, additional
                overflow bucket(s) are added to the chain as needed.

            Example: Table with 5 buckets, 1 slot per bucket (initially
                     empty); entries consist of a numeric ID (key) plus a
                     name. Hash function = key mod 5.

                     Insert 17 AARDVARK: goes into bucket 2
                     Insert 23 BUFFALO:  goes into bucket 3
                     Insert 12 CAT:      since bucket 2 is full, goes into
                                         an overflow bucket pointed to by
                                         bucket 2
                     Insert 44 DOG:      goes into bucket 4 (the collision
                                         with overflow from bucket 2 does
                                         not occur)

         c. Alternately, the hash table itself may be simply an array of
            pointers to lists of buckets. In this way, no bucket need be
            allocated for a given hash value until a key with that hash
            value actually occurs.

            Example: rework the above
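      Example (a minimal C++ sketch of the second form of chaining - the
               table as an array of pointers to lists of entries; names
               are illustrative):

               #include <list>
               #include <string>
               #include <vector>

               // Each table entry is a (possibly empty) list of key-value
               // pairs; nothing is allocated for a bucket until a key
               // actually hashes there.
               class ChainedHashTable {
                   struct Entry { long key; std::string value; };
                   std::vector<std::list<Entry>> table;

                   std::list<Entry>& chain(long key) {
                       return table[key % (long)table.size()];
                   }

               public:
                   explicit ChainedHashTable(int b) : table(b) {}

                   // Insert never fails for lack of table space; the
                   // chain simply grows.
                   void insert(long key, const std::string& value) {
                       chain(key).push_back({ key, value });
                   }

                   bool lookup(long key, std::string& value) {
                       for (const Entry& e : chain(key))
                           if (e.key == key) { value = e.value; return true; }
                       return false;
                   }

                   // Deletion needs no dummy records: we just unlink.
                   bool remove(long key) {
                       std::list<Entry>& c = chain(key);
                       for (auto it = c.begin(); it != c.end(); ++it)
                           if (it->key == key) { c.erase(it); return true; }
                       return false;
                   }
               };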
   F. Another approach to handling collisions is EXTENDIBLE HASHING.

      1. Several schemes have been proposed to allow the size of a hash
         table to grow dynamically in a smooth, efficient way. We consider
         only one here. For others, see Smith and Barnes, Files and
         Databases, pp. 124-135.

      2. All such schemes use a hash function that generates a large range
         of values. For example, on a 32-bit computer, a typical hash
         function used with such a scheme would produce a full 32-bit
         value.

      3. Initially, only a limited number of bits from the hash function
         are actually used; the rest are ignored. When a bucket overflows,
         however, it is split in two, and an additional bit of the hash
         function is then used to redistribute the keys between the
         halves.

      4. The scheme we consider here makes use of a table called the
         DIRECTORY, whose size is a power of two. Each entry in the
         directory points to a bucket of keys, but not necessarily a
         unique bucket. (That is, several directory entries may point to
         the same bucket.) When we do lookups or insertions, we use as
         many bits from the hash function as are needed to compute an
         index into this table, and then follow the pointer to the correct
         bucket.

         Example: The following is a hash table with bucket size 2, with
                  keys and hash values as shown. (Hash values are sums of
                  the ASCII values of the characters of the keys, with
                  bits in reverse order - not great, but OK.) At present,
                  three bits of the hash function are used to distribute
                  the keys.

                  ------
                  000 ------------------------> HIPPO    0000000110
                  ------                        CAT      0001101100
                  001 ---------------\
                  ------              \-------> AARDVARK 0011001001
                  010 -------------\
                  ------            ----------> DOG      0101101100
                  011 -------------/            JACKAL   0110010110
                  ------
                  100 ------------------------> ELEPHANT 1000101001
                  ------
                  101 ------------------------> GOPHER   1010001110
                  ------                        FOX      1011011100
                  110 --------------\
                  ------             ---------> BUFFALO  1111111110
                  111 --------------/
                  ------

                  We now consider the following insertions:

                  MONKEY 1100101110 - would go in the bucket with BUFFALO.

                  OSPREY 0100011110 - would cause the bucket containing
                                      DOG and JACKAL to split. As a
                                      result, the 010 entry in the table
                                      would point to a bucket containing
                                      DOG and OSPREY, while the 011 entry
                                      would now point to a bucket
                                      containing JACKAL.

                  IGUANA 1010110110 - would force us to go to using 4 bits
                                      to differentiate keys, since there
                                      are already two entries with 101 as
                                      their first 3 bits. The new table
                                      (including MONKEY and OSPREY from
                                      before) would look like this:

                  ------
                  0000 ------------\
                  ------            ----------> HIPPO    0000000110
                  0001 ------------/            CAT      0001101100
                  ------
                  0010 ------------\
                  ------            ----------> AARDVARK 0011001001
                  0011 ------------/
                  ------
                  0100 ------------\
                  ------            ----------> DOG      0101101100
                  0101 ------------/            OSPREY   0100011110
                  ------
                  0110 ------------\
                  ------            ----------> JACKAL   0110010110
                  0111 ------------/
                  ------
                  1000 ------------\
                  ------            ----------> ELEPHANT 1000101001
                  1001 ------------/
                  ------                        GOPHER   1010001110
                  1010 -----------------------> IGUANA   1010110110
                  ------
                  1011 -----------------------> FOX      1011011100
                  ------
                  1100 -----------\
                  ------           \
                  1101 ------------\
                  ------            ---------> MONKEY   1100101110
                  1110 ------------/           BUFFALO  1111111110
                  ------           /
                  1111 -----------/
                  ------
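      Example (a compact C++ sketch of this scheme - an illustration, not
               a definitive implementation: bucket size 2 and 10-bit hash
               values are chosen to match the example above, keys are
               represented by their hash values, and the directory starts
               at one bit and doubles as needed):

               #include <memory>
               #include <vector>

               const int BUCKET_SIZE = 2;
               const int HASH_BITS   = 10;

               struct Bucket {
                   int depth;                  // leading bits this bucket uses
                   std::vector<unsigned> keys; // "key" here = its hash value
                   Bucket(int d) : depth(d) {}
               };

               class ExtendibleTable {
                   int globalDepth = 1;
                   std::vector<std::shared_ptr<Bucket>> dir;

                   // Index the directory by the leading globalDepth bits.
                   int index(unsigned h) const {
                       return h >> (HASH_BITS - globalDepth);
                   }

               public:
                   ExtendibleTable() : dir(2) {
                       dir[0] = std::make_shared<Bucket>(1);
                       dir[1] = std::make_shared<Bucket>(1);
                   }

                   void insert(unsigned h) {
                       std::shared_ptr<Bucket> b = dir[index(h)];
                       while ((int)b->keys.size() >= BUCKET_SIZE) {
                           if (b->depth == globalDepth) {
                               // Directory must double: each old entry
                               // now covers two new entries.
                               dir.resize(dir.size() * 2);
                               for (int i = (int)dir.size() - 1; i >= 0; i--)
                                   dir[i] = dir[i / 2];
                               globalDepth++;
                           }
                           // Split b: one more leading bit now
                           // distinguishes its keys.
                           auto b0 = std::make_shared<Bucket>(b->depth + 1);
                           auto b1 = std::make_shared<Bucket>(b->depth + 1);
                           for (unsigned k : b->keys) {
                               int bit = (k >> (HASH_BITS - b0->depth)) & 1;
                               (bit ? b1 : b0)->keys.push_back(k);
                           }
                           // Repoint the directory entries that shared b.
                           for (size_t i = 0; i < dir.size(); i++)
                               if (dir[i] == b)
                                   dir[i] = ((i >> (globalDepth - b0->depth))
                                             & 1) ? b1 : b0;
                           b = dir[index(h)];  // retry (may split again)
                       }
                       b->keys.push_back(h);
                   }

                   bool contains(unsigned h) const {
                       for (unsigned k : dir[index(h)]->keys)
                           if (k == h) return true;
                       return false;
                   }
               };

               Inserting the twelve 10-bit hash values from the example
               above, in order, should reproduce the final 4-bit
               directory: GOPHER and IGUANA end up sharing a depth-4
               bucket, while all four entries 1100..1111 point to the
               bucket holding MONKEY and BUFFALO.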