Project 5: Incremental Rehash

Due: Tuesday, May 16, 8:59:59pm


Addenda

Additions and changes are highlighted in orange throughout.

Objectives

The objective of this programming assignment is to have you practice implementing a hash table.

Background

Your micro-manager of a boss from Project 4 found you out. Somehow he figured out that you did not implement real min-max heaps for him. After some verbal reprimand, somehow you got to keep your job. For your next assignment, your boss says, "You will implement the data structure EXACTLY as I specify!"

Your next assignment is to implement a hash table for strings. Your boss says you must use open addressing, linear probing and the division method. He even gives you the function to compute the hashcode of a string:

unsigned int hashCode(const char *str) {
   unsigned int val = 0 ;
   const unsigned int thirtyThree = 33 ;   // magic number from textbook

   int i = 0 ;
   while (str[i] != '\0') {
      val = val * thirtyThree + str[i] ;
      i++ ;
   }
   return val ;
}

Really?? Linear probing? Doesn't he know that linear probing is vulnerable to primary clustering? You decide that you will follow your boss's instructions, but you will nevertheless add your own optimizations.

When a hash table is "full" or has too many collisions, a last resort is to rehash. When you rehash, you construct another hash table and insert every item in the old hash table into the new hash table. The new hash table might be bigger and could use a different hash function. The hope is that you were just "unlucky" with collisions and that the new hash table will not have as many.

Rehashing is very expensive in terms of running time since you have to re-insert every item from the old hash table into the new hash table. Alternatively, you can rehash incrementally and not move the items all at once. Incremental rehashing will not lower the overall running time of rehashing, but it will prevent your program from stalling (think spinning beach balls) while a hash table is being copied.

During an incremental rehash, there are two hash tables in the data structure: the old hash table and the new hash table. When you insert a new string, you insert it in the new hash table. When you search for a string, the string may be in either the old hash table or the new hash table. You have to look in both. Similarly, a remove operation will have to look in both tables and remove the item from the appropriate one. During the incremental rehash phase, every operation (i.e., insert, find and remove) will also move a few items from the old table to the new table. After a number of operations, most of the items will have migrated to the new table. The stragglers in the old table will be moved to the new table and the old table can be discarded.


Your Assignment

Your assignment is to implement a hash table data structure that uses incremental rehashing as described above.

Rehashing is triggered in two ways. First, if the load factor of the table exceeds 0.5 (i.e., more than half of the hash slots are filled), your data structure should begin the incremental rehashing. Second, if linear probing during an insert, find or remove operation examines more than 10 hash slots, then you should also initiate the incremental rehashing. The new hash table should have a size that would bring the load factor to 0.25 if all of the elements of the old table were moved to the new table at once. That is, the size of the new table should be approximately four times the number of items in the old table at the time the incremental rehashing is triggered. If incremental rehashing was triggered by a long linear probe, the new table might actually be smaller than the old table. In any case, the new table size must be prime, and the new table size must be different from the old table size. We will also keep our table sizes between 101 and 199,999 for this project.

Instead of moving individual items from the old table to the new table, you will move clusters. These are contiguous hash slots that are occupied by items. Clusters are bad for linear probing because they lengthen the running time for a search. (We do not know that an item is not in the hash table until we have reached the end of the cluster.) Also, large clusters tend to get larger because the probability that a new item hashes to a slot in a cluster increases with the size of a cluster. Thus, when incremental rehashing has been triggered, every time a slot t in the old table is accessed (during insert, find or remove), we will rehash the cluster in the old hash table surrounding t.

For example, the following is part of a dump from a hash table of size 137. The number in parentheses after each word is the hash index of the word. The words in slots 67 through 87 form a cluster. If we search for "aberdeen", which hashes to slot 68, the linear probing does not find the word until it reaches slot 80. This triggers the incremental rehashing, since the probing had to examine more than 10 slots to find "aberdeen". Thus, every item in the cluster from slot 67 through 87 will be moved to the new table. Note that there are items in the cluster before slot 68, where "aberdeen" hashed to. Also, in the linear probing scheme, when an item is deleted, the slot it occupied is marked as deleted rather than emptied (otherwise linear probing would not work for later searches). So the cluster also includes deleted slots, but obviously there is no item to move to the new table for such a slot. Subsequently, if we search for "abbreviation", then the cluster in slots 89 through 91 will be moved to the new table.

H[ 60] =
H[ 61] =
H[ 62] = abeam (62)
H[ 63] =
H[ 64] =
H[ 65] =
H[ 66] =
H[ 67] = abbreviated (67)
H[ 68] = abductors (67)
H[ 69] = abattoir (69)
H[ 70] = aardwolf (70)
H[ 71] = abated (70)
H[ 72] = abatement (71)
H[ 73] = abbeys (71)
H[ 74] = abalone (74)
H[ 75] = abandons (75)
H[ 76] = abdicates (69)
H[ 77] = abased (77)
H[ 78] = DELETED
H[ 79] = abel (76)
H[ 80] = aberdeen (68)
H[ 81] = abhorred (67)
H[ 82] = abbreviates (82)
H[ 83] = abducts (82)
H[ 84] = abducted (84)
H[ 85] = abates (85)
H[ 86] = abet (84)
H[ 87] = abhorrence (78)
H[ 88] =
H[ 89] = abdomens (89)
H[ 90] = abbey (90)
H[ 91] = abbreviation (90)
H[ 92] =
H[ 93] =
H[ 94] =
H[ 95] =
H[ 96] = abdominal (96)
H[ 97] = abhor (97)
H[ 98] = abasement (98)
H[ 99] = abhorrent (96)
H[100] =

Note that in the example above, if we have not entered incremental rehashing and we did a search for "abel", this would not trigger an incremental rehash because "abel" hashes to index 76 and we found it in slot 79. The linear probing looked at only 4 slots. That is, incremental rehashing is not triggered by the existence of large clusters — it is triggered when we encounter a long linear probe sequence.

When there is an ongoing rehash, we must also check if the number of items in the old table has dropped below 3% of the total number of items in the data structure. If that is the case, then we should wrap up the rehashing and copy all of the remaining items in the old hash table to the new hash table. The old hash table can then be discarded and we will exit from the "rehash mode". This check should be done at the beginning of every insert, find and remove operation.

Finally, what should we do if, during an incremental rehash, the load factor of the new table exceeds 0.5, or a long linear probe occurs in the new table? Then we throw up our hands and give up on incremental rehashing. We make a third table whose size is about 4 times the total number of items in the old and new tables combined, and move all of the items into the third table at once.



Requirements

You must design and implement a C++ class HashTable that uses the incremental rehashing scheme as described above. The following are some firm requirements (i.e., necessary for grading):

We will be hashing strings. The strings are given in a global array words[]. None of the member functions in your implementation of the HashTable class should use this global array. The driver programs should be the only code that uses the words[] array. The global variable numWords has the size of the words[] array.

#ifndef _WORDS_H_
#define _WORDS_H_

const int numWords = 58109 ;

const char *words[numWords] = {
   "aardvark", "aardwolf", "aaron", "aback", "abacus", "abaft",
   "abalone", "abandon", "abandoned", "abandonment", "abandons",
   . . .
   "zoomed", "zooming", "zooms", "zooplankton", "zoos", "zulu", "zulus"
} ;

#endif

Download: words.h.

Note that the items in the words[] array are const char * strings and not C++ strings. When you store a string in your hash table you should make a copy of the given string.

Similarly, there is an array of prime numbers between 100 and 200,000 in the primes.h file. You should use binary search in the primes[] array to find a prime number for your table size. If, during rehashing, you have to create a table with more than 199,999 items, then just give up and throw an out_of_range exception.

const int numPrimes = 17959 ;

const int primes[numPrimes] = {
   101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157,
   . . .
   199873, 199877, 199889, 199909, 199921, 199931, 199933,
   199961, 199967, 199999
} ;

Download: primes.h.



Additional Specifications

Here are some additional specifications for the HashTable class member functions that you have to implement. You will definitely want to add other data members and member functions to the HashTable class, but the prototypes for the member functions listed here should not be changed. You may create additional classes if you prefer, but all class declarations should be placed in HashTable.h and all implementations should be placed in HashTable.cpp.


Implementation Notes


Test Programs

The following programs should be compiled using g++ testX.cpp HashTable.cpp where testX.cpp is one of test1.cpp, test2.cpp, test3.cpp, test4.cpp or test5.cpp. Run these programs under valgrind to check for memory leaks and memory read/write errors.

What to Submit

You must submit the following files:

The main function in Driver.cpp should exercise your HashTable functions and show what your program has implemented successfully. (That is, if your code in Driver.cpp does not produce any output or it segfaults, then we will assume that you have not implemented very much.)