Project 5: Incremental Rehash

Due: Tuesday, December 11, 8:59:59pm



Objectives

The objective of this programming assignment is to have you practice implementing a hash table.

Background

Your final programming assignment is to implement a hash table for strings. Your implementation will use open addressing, linear probing, and the division method. We even give you the function to compute the hash code of a string:

unsigned int hashCode(const char *str) {
    unsigned int val = 0 ;
    const unsigned int thirtyThree = 33 ;  // magic number from textbook
    int i = 0 ;
    while (str[i] != '\0') {
        val = val * thirtyThree + str[i] ;
        i++ ;
    }
    return val ;
}
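Under the division method, a string's home slot is just its hash code modulo the table size. A minimal sketch (the `homeSlot` helper is our name, not part of the required interface):

```cpp
#include <cassert>

// The hash function given in the assignment.
unsigned int hashCode(const char *str) {
    unsigned int val = 0;
    const unsigned int thirtyThree = 33;  // magic number from textbook
    int i = 0;
    while (str[i] != '\0') {
        val = val * thirtyThree + str[i];
        i++;
    }
    return val;
}

// Division method: the home slot is the hash code mod the table size.
unsigned int homeSlot(const char *str, unsigned int tableSize) {
    return hashCode(str) % tableSize;
}
```

For example, `hashCode("ab")` is 97 * 33 + 98 = 3299, so with a table of size 137 the home slot of "ab" is 3299 % 137 = 11.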

The problem is that linear probing is vulnerable to primary clustering. So, as with previous projects, you will add some optimizations to the basic design.

When a hash table is "full" or has too many collisions, a last resort is to rehash. When you rehash, you construct another hash table and insert every item in the old hash table into the new hash table. The new hash table might be bigger and could use a different hash function. The hope is that you were just "unlucky" with collisions and that the new hash table will not have as many.

Rehashing is very expensive in terms of running time since you have to re-insert every item from the old hash table into the new hash table. Alternatively, you can rehash incrementally and not move the items all at once. Incremental rehashing will not lower the overall running time of rehashing, but it will prevent your program from stalling for a long time while a hash table is being copied. During an incremental rehash, there are two hash tables in the data structure: the old hash table and the new hash table. When you insert a new string, you insert it in the new hash table. When you search for a string, the string may be in either the old hash table or the new hash table. You have to look in both. Similarly, a remove operation will have to look in both tables and remove the item from the appropriate one. During the incremental rehash phase, every operation (i.e., insert, find and remove) will also move a few items from the old table to the new table. After a number of operations, most of the items will have migrated to the new table. The stragglers in the old table will be moved to the new table and the old table can be discarded.


Your Assignment

Your assignment is to implement a hash table data structure that uses incremental rehashing as described above. We are assuming the concept of a standard hash table that uses open addressing and linear probing is already familiar to you from class. We are also assuming you understand lazy deletion and why it is necessary when using open addressing. Here, we will cover just the incremental rehashing process.

Rehashing is triggered in two ways. First, if the load factor of the table exceeds 0.5 (i.e., more than half of the hash slots are filled), your data structure should begin incremental rehashing. (There is no ambiguity between >= 50% and > 50%, since the restriction that the table size must be prime and at least 101 means the size is always odd.) You should check the load factor at the beginning of every insert, find, and remove operation. For insert, this makes sense: if you are going to rehash because the load factor exceeds 0.5, then you want to start inserting into the new hash table right away. For find and remove, checking the load factor first and triggering the rehash if needed also clarifies what you are supposed to do with the cluster you encounter during the operation: if you are already in rehash mode, yes, you should move that cluster to the new table. Also, since the load factor is checked before the insert or remove is performed, you do not need to account for the size change caused by that operation when checking the load factor.
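The load-factor trigger itself is a one-line check. A sketch, with our own function name; whether DELETED slots count as "filled" is an implementation decision this sketch does not settle:

```cpp
#include <cassert>

// Rehashing begins once more than half of the slots are filled.
// Since table sizes are odd primes >= 101, the load factor can never
// be exactly 0.5, so a strict integer comparison is unambiguous.
bool shouldStartRehash(int filledSlots, int tableSize) {
    return 2 * filledSlots > tableSize;
}
```

With a table of size 101, for example, 50 filled slots does not trigger the rehash but 51 does.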

The second way that incremental rehashing is triggered is if linear probing during an insert, find, or remove operation requires examining 10 or more hash slots. We usually prefer not to be too specific about implementation details that do not affect the performance of a data structure, but in order to prevent a flood of questions and also to make grading easier, let's be really specific about measuring the length of a linear probe. Suppose a string x hashes to index 67: the probe length is the number of slots examined, counting slot 67 itself and every slot visited up to and including the slot where the probe stops. For example, if x is found in slot 78, the probe examined slots 67 through 78, for a probe length of 12.

In either trigger case, the new hash table should have a size that would bring the load factor down to 0.25 if all of the elements of the old table were moved to the new table at once. That is, the size of the new table should be four times the number of items in the old table when the incremental rehashing is triggered. Next, the new size should be rounded up to the next prime, of course. Third, the new table size must be different from the old table size; if the number you computed is the same as the current size, increase it to the next prime. Last, be sure to maintain a minimum table size of 101 and a maximum size of 199,999 for this project. If, during rehashing, you would have to create a table with more than 199,999 slots, then just give up and throw an out_of_range exception. Note that if incremental rehashing was triggered because of a long linear probe, the new table might actually be smaller than the old table.
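The size computation above can be sketched as one small function. The `isPrime`/`roundUpPrime` helpers here are only a stand-in for the `roundUpPrime()` provided in primes.cpp (which you should use in the actual project), and `newTableSize` is our own name:

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for the provided roundUpPrime(): rounds n up to the next
// prime, returning 0 if the result would exceed 199,999.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

int roundUpPrime(int n) {
    while (n <= 199999 && !isPrime(n)) n++;
    return (n > 199999) ? 0 : n;
}

// New table size: 4x the current item count, clamped to the minimum of
// 101, rounded up to a prime, and forced to differ from the old size.
int newTableSize(int itemCount, int oldSize) {
    int n = 4 * itemCount;
    if (n < 101) n = 101;                        // enforce the minimum
    n = roundUpPrime(n);
    if (n == oldSize) n = roundUpPrime(n + 1);   // must differ from old size
    if (n == 0) throw std::out_of_range("table size would exceed 199,999");
    return n;
}
```

For example, 30 items in a table of size 137 give 4 * 30 = 120, which rounds up to the prime 127; 25 items in a table of size 101 give 100, which is clamped to 101, collides with the old size, and becomes 103.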

Instead of moving individual items from the old table to the new table, you will move clusters. These are contiguous hash slots that are occupied by items. Clusters are bad for linear probing because they lengthen the running time for a search. (We do not know that an item is not in the hash table until we have reached the end of the cluster.) Also, large clusters tend to get larger because the probability that a new item hashes to a slot in a cluster increases with the size of a cluster. Thus, when incremental rehashing has been triggered, every time a slot t in the old table is accessed (during insert, find or remove), we will rehash the cluster in the old hash table surrounding t.
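Finding the cluster around a slot t is just a scan in both directions until an empty slot is reached. A sketch under our own slot representation (EMPTY, DELETED, or occupied); for simplicity it ignores wrap-around at the table boundary, which a real implementation must handle:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A slot is EMPTY, DELETED, or holds a string. A cluster is a maximal
// run of non-EMPTY slots (occupied or DELETED) containing slot t.
struct Slot {
    enum State { EMPTY, DELETED, OCCUPIED } state = EMPTY;
    std::string value;
};

// Returns the inclusive [first, last] range of the cluster around t.
std::pair<int, int> clusterAround(const std::vector<Slot> &table, int t) {
    int first = t, last = t;
    while (first > 0 && table[first - 1].state != Slot::EMPTY)
        first--;
    while (last + 1 < (int)table.size() && table[last + 1].state != Slot::EMPTY)
        last++;
    return std::make_pair(first, last);
}
```

Once the boundaries are known, every OCCUPIED slot in the range is re-inserted into the new table and the whole range is emptied in the old one; DELETED slots contribute nothing to move.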

For example, the following is part of a dump from a hash table of size 137. The number in parentheses after each word is the hash index of the word. The words in slots 67 through 87 form a cluster. If we search for "aberdeen", which hashes to slot 68, linear probing will not find the word until it reaches slot 80. This triggers incremental rehashing, since the probe had to examine 10 or more slots (13, to be exact) to find "aberdeen". Thus, every item in the cluster from slot 67 through slot 87 will be moved to the new table. Note that there are items in the cluster before slot 68, where "aberdeen" hashed to. Also, in the linear probing scheme, when an item is deleted, the slot it occupied is marked as DELETED rather than emptied (otherwise linear probing would not work for later searches). So, the cluster also includes deleted slots, although there is obviously no item to move to the new table for those slots. Subsequently, if we search for "abbreviation", the cluster in slots 89 through 91 will be moved to the new table.

H[ 60] =
H[ 61] =
H[ 62] = abeam (62)
H[ 63] =
H[ 64] =
H[ 65] =
H[ 66] =
H[ 67] = abbreviated (67)
H[ 68] = abductors (67)
H[ 69] = abattoir (69)
H[ 70] = aardwolf (70)
H[ 71] = abated (70)
H[ 72] = abatement (71)
H[ 73] = abbeys (71)
H[ 74] = abalone (74)
H[ 75] = abandons (75)
H[ 76] = abdicates (69)
H[ 77] = abased (77)
H[ 78] = DELETED
H[ 79] = abel (76)
H[ 80] = aberdeen (68)
H[ 81] = abhorred (67)
H[ 82] = abbreviates (82)
H[ 83] = abducts (82)
H[ 84] = abducted (84)
H[ 85] = abates (85)
H[ 86] = abet (84)
H[ 87] = abhorrence (78)
H[ 88] =
H[ 89] = abdomens (89)
H[ 90] = abbey (90)
H[ 91] = abbreviation (90)
H[ 92] =
H[ 93] =
H[ 94] =
H[ 95] =
H[ 96] = abdominal (96)
H[ 97] = abhor (97)
H[ 98] = abasement (98)
H[ 99] = abhorrent (96)
H[100] =

Note that in the example above, if we have not entered incremental rehashing and we did a search for "abel", this would not trigger an incremental rehash because "abel" hashes to index 76 and we found it in slot 79. The linear probing looked at only 4 slots. That is, incremental rehashing is not triggered by the existence of large clusters — it is triggered when we encounter a long linear probe sequence.

After incremental rehashing has started, when do clusters move from the old table to the new table? The answer is: Whenever you "touch" a string in the old table, the cluster it belongs to should move to the new table. The idea here is that we want to migrate strings from the old table to the new table as much as possible without running down the old table looking for them. (We do that for the final 3%, but that's unavoidable.) So, if you happen to see a string in the old table, you should go ahead and move the cluster that it is part of to the new table. We move clusters and not individual strings, because moving an individual string from the old table would break the linear probe searching (and we do still need to search in the old table).

In the example with table size 137 given above, after rehashing has started, suppose someone calls remove("trolley"). Since "trolley" is not in the data structure at all, you have to look for "trolley" in both the new table and the old table. With a table of size 137, "trolley" hashes to slot 98. You have to probe slots 98, 99, and 100 of the old table to determine that "trolley" is, in fact, not in the table. Even though the search is unsuccessful, you still did find some strings in the old table, so you should move the cluster consisting of "abdominal", "abhor", "abasement", and "abhorrent" in slots 96 through 99 from the old table to the new table.

When there is an ongoing rehash, we must also check if the number of items in the old table has dropped below 3% of the total number of items in the data structure. If that is the case, then we should wrap up the rehashing and copy all of the remaining items in the old hash table to the new hash table. The old hash table can then be discarded and we will exit from the "rehash mode". This check should be done at the beginning of every insert, find and remove operation.
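The wrap-up condition is a simple ratio test. A sketch, with our own function name; this version treats exactly 3% as "not yet below", and uses integer arithmetic to avoid floating-point rounding at the boundary:

```cpp
#include <cassert>

// Finish the rehash when the old table holds less than 3% of all
// items in the data structure (old table + new table combined).
bool shouldFinishRehash(int oldCount, int newCount) {
    int total = oldCount + newCount;
    if (total == 0) return true;  // nothing left to migrate
    // oldCount < 0.03 * total, rewritten with integers:
    return 100 * oldCount < 3 * total;
}
```

When this returns true, all remaining items in the old table are moved to the new table, the old table is discarded, and the data structure leaves rehash mode.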

Finally, what should we do if, during an incremental rehash, the load factor of the new table exceeds 0.5, or there is a long linear probe in the new table? Then we throw up our hands and give up on incremental rehashing: we make a third table whose size is about 4 times the total number of items in the old and new tables, and move all of the items into the third table at once.



Requirements

You must design and implement a C++ class HashTable that uses the incremental rehashing scheme as described above. The following are some firm requirements (i.e., necessary for grading):

We will be hashing strings. The strings are given in a global array words[]. None of the member functions in your implementation of the HashTable class should use this global array. The driver programs should be the only code that uses the words[] array. The global variable numWords has the size of the words[] array.

#ifndef _WORDS_H_
#define _WORDS_H_

const int numWords = 58109 ;

const char *words[numWords] = {
   "aardvark",
   "aardwolf",
   "aaron",
   "aback",
   "abacus",
   "abaft",
   "abalone",
   "abandon",
   "abandoned",
   "abandonment",
   "abandons",
   . . .
   "zoomed",
   "zooming",
   "zooms",
   "zooplankton",
   "zoos",
   "zulu",
   "zulus"
} ;

#endif

Download: words.h.

Note that the items in the words[] array are const char * strings and not C++ strings. When you store a string in your hash table you should make a copy of the given string.

We are providing a function called int roundUpPrime(int n) that helps you compute good sizes for your hash tables. It takes as a parameter the desired size (either from the user, or the new size computed for rehashing) and rounds it up to the next prime. If the parameter is already prime, it is not rounded up. If the parameter is too large (> 199,999), the function returns 0.

Download: primes.cpp.



Additional Specifications

Here are some additional specifications for the HashTable class member functions that you have to implement. You will definitely want to add other data members and member functions to the HashTable class, but the prototypes for the member functions listed here should not be changed. You may create additional classes if you prefer, but all class declarations should be placed in HashTable.h and all implementations should be placed in HashTable.cpp.


Implementation Notes


Test Programs

The following programs should be compiled using g++ testX.cpp HashTable.cpp where testX.cpp is one of test1.cpp, test2.cpp, test3.cpp, test4.cpp or test5.cpp. Run these programs under valgrind to check for memory leaks and memory read/write errors.

What to Submit

You must submit the following files:

The main function in Driver.cpp should exercise your HashTable functions and show what your program has implemented successfully. (I.e., if your code in Driver.cpp does not produce any output or it seg faults, then we will assume that you have not implemented very much.)