Project 5: Incremental Rehash

Due: Tuesday, December 11, 8:59:59pm

Objectives

The objective of this programming assignment is to have you practice implementing a hash table.

Background

Your final programming assignment is to implement a hash table for strings. Your implemenation will use open addressing, linear probing and the division method. We even give you the function to compute the hashcode of a string:

unsigned int hashCode(const char *str) { unsigned int val = 0 ; const unsigned int thirtyThree = 33 ; // magic number from textbook int i = 0 ; while (str[i] != '\0') { val = val * thirtyThree + str[i] ; i++ ; } return val ; }

The problem is that linear probing is vulnerable to primary clustering. So, as with previous projects, you will add some optimizations to the basic design.

When a hash table is "full" or has too many collisions, a last resort is to rehash. When you rehash, you construct another hash table and insert every item in the old hash table into the new hash table. The new hash table might be bigger and could use a different hash function. The hope is that you were just "unlucky" with collisions and that the new hash table will not have as many.

Rehashing is very expensive in terms of running time since you have to re-insert every item from the old hash table into the new hash table. Alternatively, you can rehash incrementally and not move the items all at once. Incremental rehashing will not lower the overall running time of rehashing, but it will prevent your program from stalling for a long time while a hash table is being copied. During an incremental rehash, there are two hash tables in the data structure: the old hash table and the new hash table. When you insert a new string, you insert it in the new hash table. When you search for a string, the string may be in either the old hash table or the new hash table. You have to look in both. Similarly, a remove operation will have to look in both tables and remove the item from the appropriate one. During the incremental rehash phase, every operation (i.e., insert, find and remove) will also move a few items from the old table to the new table. After a number of operations, most of the items will have migrated to the new table. The stragglers in the old table will be moved to the new table and the old table can be discarded.

Your Assignment

Your assignment is to implement a hash table data structure that uses incremental rehashing as described above. We are assuming the concept of a standard hash table that uses open addressing and linear probing is already familiar to you from class. We are also assuming you understand lazy deletion and why it is necessary when using open addressing. Here, we will cover just the incremental rehashing process.

Rehashing is triggered in two ways. First, if the load factor of the table exceeds 0.5 (i.e., more than half of the hash slots are filled), your data structure should begin the incremental rehashing. (There is no ambiguity about the >= 50% vs > 50% question, since the restriction that the table size must be prime and at least 101 implies the size must be odd.) You should check the load factor at the beginning of every insert, find and remove operation. For insert, this makes sense because if you are going to rehash due to a load factor exceeding 0.5, then you want to start inserting in the new hash table right away. For find and remove, checking the load factor first, and triggering the rehash if needed also clarifies what you are supposed to do with the cluster you find in the operation. If you are already in rehash mode, yes, you should move that cluster to the new table. Also, since the load factor is being checked before we do the insert and remove, you don't need to take into account the size change from the insert and remove when checking the load factor.

The second way that incremental rehashing is triggered is if linear probing during an insert, find or remove operation requires examining 10 or more hash slots. We usually prefer not to be too specific about implementation details that do not affect the performance of a data structure, but in order to prevent a bunch of questions and also to make grading easier, let's be really specific about measuring the length of a linear probe. Suppose a string x hashes to index 67:

If x is inserted in slot 75, the probe length is 9 (since you had to examine slots 67 thru 75) and this does not trigger a rehash. If x is inserted in slot 76, then the probe length is 10 and a rehash would be triggered. The current cluster that contains slot 76 should be moved to the new hash table immediately.
If x is found in slot 75 by a find operation, the probe length is 9 and we do not rehash. If x is found in slot 76, then we do start a rehash. The current cluster that contains slot 76 should be moved to the new hash table immediately.
The remove operation is the same as the find operation, since we have to look for an item before we can remove it. The slot marked DELETED is still part of the cluster, there is no item to insert in the new table, but the deleted slot joins the slots before it and the slots after it into one cluster. This cluster should be moved immediately after the deletion, if a rehash is triggered during the operation.

In either trigger case, the new hash table should have a size that brings the load factor to 0.25 if all of the elements of the old table were moved to the new table at once. That is, the size of the new table should be four times the number of items in the old table when the incremental hashing is triggered. Next, the new size should then be rounded up to the next prime, of course. Third, the new table size must be different from the old table size; if the number you computed is the same as the current size, increment the size to the next prime. Last, be sure to maintain a minimum table size of 101 and a maximum size of 199,999 for this project. If, during rehashing, you have to create a table with more than 199,999 items, then just give up and throw an out_of_range exception. Note that if incremental rehashing was triggered because of a long linear probe, the new table might actually be smaller than the old table.

Instead of moving individual items from the old table to the new table, you will move clusters. These are contiguous hash slots that are occupied by items. Clusters are bad for linear probing because they lengthen the running time for a search. (We do not know that an item is not in the hash table until we have reached the end of the cluster.) Also, large clusters tend to get larger because the probability that a new item hashes to a slot in a cluster increases with the size of a cluster. Thus, when incremental rehashing has been triggered, every time a slot t in the old table is accessed (during insert, find or remove), we will rehash the cluster in the old hash table surrounding t.

For example, the following is a part of the dump from a hash table with size 137. The numbers in the parentheses after each word is the hash index of the word. The words in slots 67 through 87 form a cluster. If we did a search for "aberdeen", which hashes to slot 68, we won't find the word in the linear probing until we reach slot 80. This will trigger the incremental rehashing since the linear probing had to examine 10 or more slots (13, to be exact) to find "aberdeen". Thus, every item in the cluster from slot 67 through 87 will be moved to the new table. Note that there are items in the cluster before slot 68 where "aberdeen" hashed to. Also, in the linear probing scheme, when an item is deleted, the slot it occupied is marked as deleted rather than emptied (otherwise linear probing won't work for later searches). So, the cluster also includes deleted slots, but obviously there is no item to move to the new table for that slot. Subsequently, if we do search on "abbreviation" then the cluster in slots 89 through 91 will be moved to the new table.

H[ 60] = H[ 61] = H[ 62] = abeam (62) H[ 63] = H[ 64] = H[ 65] = H[ 66] = H[ 67] = abbreviated (67) H[ 68] = abductors (67) H[ 69] = abattoir (69) H[ 70] = aardwolf (70) H[ 71] = abated (70) H[ 72] = abatement (71) H[ 73] = abbeys (71) H[ 74] = abalone (74) H[ 75] = abandons (75) H[ 76] = abdicates (69) H[ 77] = abased (77) H[ 78] = DELETED H[ 79] = abel (76) H[ 80] = aberdeen (68) H[ 81] = abhorred (67) H[ 82] = abbreviates (82) H[ 83] = abducts (82) H[ 84] = abducted (84) H[ 85] = abates (85) H[ 86] = abet (84) H[ 87] = abhorrence (78) H[ 88] = H[ 89] = abdomens (89) H[ 90] = abbey (90) H[ 91] = abbreviation (90) H[ 92] = H[ 93] = H[ 94] = H[ 95] = H[ 96] = abdominal (96) H[ 97] = abhor (97) H[ 98] = abasement (98) H[ 99] = abhorrent (96) H[100] =

Note that in the example above, if we have not entered incremental rehashing and we did a search for "abel", this would not trigger an incremental rehash because "abel" hashes to index 76 and we found it in slot 79. The linear probing looked at only 4 slots. That is, incremental rehashing is not triggered by the existence of large clusters — it is triggered when we encounter a long linear probe sequence.

After incremental rehashing has started, when do clusters move from the old table to the new table? The answer is: Whenever you "touch" a string in the old table, the cluster it belongs to should move to the new table. The idea here is that we want to migrate strings from the old table to the new table as much as possible without running down the old table looking for them. (We do that for the final 3%, but that's unavoidable.) So, if you happen to see a string in the old table, you should go ahead and move the cluster that it is part of to the new table. We move clusters and not individual strings, because moving an individual string from the old table would break the linear probe searching (and we do still need to search in the old table).

In the example with table size 137 given below, after rehashing has started, suppose someone calls remove("trolley"). Since "trolley" is not in the data structure at all, you have to look for "trolley" in both the new table and the old table. With a table of size 137, "trolley" hashes to slot 98. You have to probe slots 98, 99 and 100 of the old table to determine that "trolley" is, in fact, not in the table. Even though the search is unsuccessful, you still did find some strings in the old table, so you should move the cluster consisting of "abdominal", "abhor", "abasement" and "abhorrent" in slots 96 through 99 from the old table to the new table.

When there is an ongoing rehash, we must also check if the number of items in the old table has dropped below 3% of the total number of items in the data structure. If that is the case, then we should wrap up the rehashing and copy all of the remaining items in the old hash table to the new hash table. The old hash table can then be discarded and we will exit from the "rehash mode". This check should be done at the beginning of every insert, find and remove operation.

Finally, what should we do if during an incremental rehash, the load factor of the new table exceeds 0.5? or if there is a long linear probe in the new table? Then, we throw up our hands and give up on incremental rehashing. We make a third table that has a table size about 4 times the total number of items in the old and new tables, and move all of the items into the third table at once.

Requirements

You must design and implement a C++ class HashTable that uses the incremental rehashing scheme as described above. The following are some firm requirements (i.e., necessary for grading):

The name of your class must be HashTable.
You must use HashTable.h and HashTable.cpp for the filenames (case sensitive) of your header and implementation files.
You must use the hashCode() function given above.
You must use open addressing and linear probing.
You must use the division method where hash index is the value returned by hashCode() modulo the table size.
Your table sizes must be prime numbers.
If either the user-requested table size or the computed rehashing table size is greater than 199,999, you should
You may not use STL classes, not even string.
You must dynamically allocate arrays of type char * to hold your C strings. I.e., the declaration of your table should be something like:
Your program must not leak memory.

We will be hashing strings. The strings are given in a global array words[]. None of the member functions in your implementation of the HashTable class should use this global array. The driver programs should be the only code that uses the words[] array. The global variable numWords has the size of the words[] array.

#ifndef _WORDS_H_ #define _WORDS_H_ const int numWords = 58109 ; const char *words[numWords] = { "aardvark", "aardwolf", "aaron", "aback", "abacus", "abaft", "abalone", "abandon", "abandoned", "abandonment", "abandons", . . . "zoomed", "zooming", "zooms", "zooplankton", "zoos", "zulu", "zulus" } ; #endif

Download: words.h.

Note that the items in the words[] array are const char * strings and not C++ strings. When you store a string in your hash table you should make a copy of the given string.

We are providing a function called int roundUpPrime(int n) that helps you compute good sizes for your hash tables. It takes as a parameter the desired size (either from the user, or the new size computed for rehashing) and rounds it up to the next prime. If the parameter is already prime, it is not rounded up. If the parameter is too large (> 199,999), the function returns 0.

Download: primes.cpp.

Additional Specifications

Here are some additional specifications for the HashTable class member functions that you have to implement. You will definitely want to add other data members and member functions to the HashTable class, but the prototypes for the member functions listed here should not be changed. You may create additional classes if you prefer, but all class declarations should be placed in HashTable.h and all implementations should be placed in HashTable.cpp.

This is the default constructor for the HashTable class. The parameter n is the requested size of the hash table. Note that it has a default value of 101: that is the minimum allowed size. If the user requests a size smaller than 101, you will override the request and set the size to 101 (i.e., the minimum table size is 101). The maximum allowed size is 199,999, so if the requested size exceeds this, then just give up and throw an out_of_range exception. Lastly, the table size must be a prime number, so you must round up the the requested size to the nearest prime number greater than or equal to n. (We provide a function called int roundUpPrime(int n) that computes this value for you.) The HashTable object created by the constructor should be ready for insertion, search and deletion without any additional initialization.

This is the destructor. Make sure you deallocate all memory for this object. The strings in the hash table must be deallocated using free() since they are C strings (i.e., don't use remove for C strings).
This is the copy constructor for the HashTable class. If the other hash table is not in incremental rehash, then copying the hash table is very straightforward. Just make sure you allocate memory for the new hash table and use strdup() to copy the strings.

On the other hand, if the other hash table has an ongoing rehash, then it doesn't make sense to make a duplicate of a hash table that is in the middle of being copied. Instead, force the other table to finish its rehashing and copy over the resulting single table.

Note that the usual const modifier has been removed from the parameter of this copy constructor, since the HashTable object in the parameter might be asked to finish incremental rehashing. (There are other ways to resolve this, but they are too cumbersome.)

Do not call the assignment operator from the copy constructor. If you do not want to have duplicate code, then create a third function that handles the common parts of the both functions.
This is the overloaded assignment operator for the HashTable class. As with the copy constructor, the process is fairly standard if the rhs does not have an ongoing incremental rehash. If there is an ongoing rehash, then force the rhs to finish its rehashing and copy over the resulting single hash table.

Note that the usual const modifier has been removed from the parameter of this assignment operator, since the HashTable object in the parameter might be asked to finish incremental rehashing.
This function inserts a copy of the C string str into the hash table. It has no return value. (Note: use strdup() to copy C strings.)

The insert() function should insert in the new table if there is an ongoing incremental rehash.

Calling insert() with a string that is already in the hash table should have no effect. (I.e., do not insert a second copy of the same value.) Make sure you don't have a copy of a string that you didn't insert floating around. That's a memory leak.

The insert() function should trigger incremental rehashing when appropriate as described above. The insert() operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%.
The find() function looks for str in the hash table. The function returns true if found, false otherwise. The find() function look in both the old and the new hash tables if there is an ongoing incremental rehashing.

The find() function should trigger incremental rehashing when appropriate as described above. The find() operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%.
The remove() function removes str from the hash table and returns the pointer. If str is not in the hash table, remove() returns NULL.

It is the responsibility of the code that calls remove() to deallocate the string that is returned. (Again, use free(), not delete to deallocate.)

Recall that for open addressing, when an item is removed, the slot it occupied should be specially marked as "deleted", and not set back to NULL, which would break linear probing. But what special value can you use? You need a pointer value that cannot point to any real C-string. Use the following definition inside your HashTable class:

Then in the .cpp file, initialize with:

The constant DELETED can then be stored in your hash table slot. DELETED is a safe sentinel value because no string in the program can have that address--it is occupied by the class variable DELETED itself (in the global data section).

The remove() function should trigger incremental rehashing when appropriate as described above. The remove() operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%.
These functions are used for grading purposes, so we can examine your hash table(s). The isRehashing() function returns true if there is an ongoing incremental rehash. The tableSize() function returns the size of the hash table. When there is an ongoing rehash, tableSize(0) should return the size of the old table and tableSize(1) should return the size of the new table. Similarly, size() returns the number of items currently in the table(s).

The at() function returns a pointer to the string stored at the index slot of the hash table specified by table. If the index is invalid (i.e., less than 0 or greater than or equal to table size), then at() should throw an out_of_range exception (already defined in stdexcept). The pointer returned by at() has type const char * to prevent the string stored in the hash table from being changed. The calling function can make a copy if desired.
Dump should print some vital statistics and the contents of the hash table(s) to stdout. You should include the table size and number of items in the hash table(s). When you print out the string in each hash slot, include the item's hash index in parentheses (see example above.)

Implementation Notes

Remember to mod out by the table size when you are working with hash table indices.
Again, in hash tables, the indices wrap around to 0 at the bottom of the hash table. You must take this into account when you use linear probing for insert, find and remove.
One more time, the indices wrap. This means clusters can straddle the end of the hash table. When you move a cluster that is at the beginning or at the end of the table, you have to check if the cluster wraps around. Also, note that for loops do not work very well in this situation, because doesn't do the right thing when the cluster wraps around.
If you are thinking of working with negative indices for your hash table, remember that the modulo operator in C/C++ does not work mathematically (i.e., correctly): -2 % 10 evaluates to -2, not 8. So, the negative indices will not wrap around to the end of the hash table as you might expect.
You are working with C null-terminated strings, not C++ strings. If you haven't used C strings in a while, please review. For example, a C string is just an array of char, so the type is char *. A dynamically allocated array of C strings has type char * * since it is a pointer to an array of pointers to char. Here are Mr. Lupoli's notes on C strings.
You must use the strcmp() function to make string comparisons. You cannot use the == operator. That will do pointer comparison, not string comparison. To use strcmp(), make sure you or
Your hash table should make a copy of the string inserted, and not just store the pointer. Copies should be made using the strdup() function.
The strdup() function allocates memory using malloc() instead of new. To deallocate this memory, you must use free() instead of delete. While it is possible to allocate an array of char using new like this: it would be very confusing for one program to sometimes use malloc() and sometimes use new to allocate strings. So, just stick to using malloc() and free() to allocate and deallocate C strings. Having said that, you don't really need to use malloc() at all, if you use strdup() to copy strings, since strdup() allocates memory. Finally, for all other memory allocations not involving C strings, just use new.
Some of the strings that you are working with have type const char *. This means the string being pointed to cannot be changed. (The string is immutable in Python-speak.) Make sure that you know what this means. If you assign a const char * pointer to a char * pointer, the compiler will give you an error.
You should avoid copying strings whenever possible. For example, when you move a string from the old hash table to the new hash table during a rehash, you do not need to make a copy of the string. This means you probably want to have two versions of the insert() function. The standard one makes copies (because that's what your client is expecting). The internal one that you use for moving should not make copies. (Yes, one can call the other.)
You should periodically run your program under valgrind during development. This is so you can catch memory leaks as soon as possible. Also, if valgrind complains about memory read or memory write errors, this means you have a bug in your program. You should not ignore these errors. You should fix the bug as soon as possible because memory errors tend to manifest themselves in other places and are difficult to debug. You want to know if your program has any memory errors as soon as possible, because it is likely that the bug is in the last few modifications you have made in your program.
Some common errors to watch out for:
- forgetting to update the number of items after inserting or removing a string
- taking modulo of number of items instead of table capacity
- using the capacity of the wrong table
- trying to strdup(DELETED)
- trying to use DELETED as a string
- forgetting to intialize the entries of a new table to NULL

Test Programs

The following programs should be compiled using g++ testX.cpp HashTable.cpp where testX.cpp is one of test1.cpp, test2.cpp, test3.cpp, test4.cpp or test5.cpp. Run these programs under valgrind to check for memory leaks and memory read/write errors.

Basic test of insert(), find() and remove() without triggering incremental rehashing. Code (test1.cpp) and sample output (test1.txt).
Rehashing triggered. Clusters moved from old hash table to new hash table. Rehashing ends and only one table remains. Code (test2.cpp) and sample output (test2.txt).
Rehashing triggered. Further insertions cause long probe sequence in new table. Rehashing stopped and all items are consolidated in one hash table. Code (test3.cpp) and sample output (test3.txt).
Robust test with thousands of calls to insert(), find() and remove(). Resulting hash table is checked against equivalent STL set. Code (test4.cpp) and sample output (test4.txt).
Test of copy constructor and assignment operator. Code (test5.cpp) and sample output (test5.txt).

What to Submit

You must submit the following files:

HashTable.h
HashTable.cpp
Driver.cpp

The main function in Driver.cpp should exercise your HashTable functions and show what your program has implemented successfully. (I.e., if your code in Driver.cpp does not produce any output or it seg faults, then we will assume that you have not implemented very much.)

CMSC 341 Data Structures — All Sections — Fall 2018