Project 5: Incremental Rehash

Due: Tuesday, May 16, 8:59:59pm

Addenda

[Thu 5/11/17 14:35] Changed the comments in test3.cpp that indicated the slots where "heliosphere", "obstructs" and "peripatetic" would end up in the new table. (Should be 72, 73 and 77 instead of 71, 72 and 73.)
[Mon 5/8/17 09:25]
Question: After incremental rehashing has started, when do clusters move from the old table to the new table?
Answer: Whenever you "touch" a string in the old table, the cluster it belongs to should move to the new table. The idea here is that we want to migrate strings from the old table to the new table as much as possible without running down the old table looking for them. (We do that for the final 3%, but that's unavoidable.) So, if you happen to see a string in the old table, you should go ahead and move the cluster that it is part of to the new table. We move clusters and not individual strings, because moving an individual string from the old table would break the linear probe searching (and we do still need to search in the old table).
In the example with table size 137 given below, after rehashing has started, suppose someone calls remove("trolley"). Since "trolley" is not in the data structure at all, you have to look for "trolley" in both the new table and the old table. With a table of size 137, "trolley" hashes to slot 98. You have to probe slots 98, 99 and 100 of the old table to determine that "trolley" is, in fact, not in the table. Even though the search is unsuccessful, you still did find some strings in the old table, so you should move the cluster consisting of "abdominal", "abhor", "abasement" and "abhorrent" in slots 96 through 99 from the old table to the new table.
[Mon 5/8/17 09:25] Common bugs seen so far: forgetting to update the number of items after inserting or removing a string, taking modulo of number of items instead of table capacity, using the capacity of the wrong table, trying to strdup(DELETED), trying to use DELETED as a string, forgetting to initialize the entries of a new table to NULL. In other words, run-of-the-mill bugs. Let's be careful out there.
[Sun 5/7/17 20:10] Test programs posted.
[Thu 5/4/17 14:00] Correction on the syntax for the DELETED. The class definition in the .h file should look like: This has to be initialized in the .cpp file: Note the change in the positioning of the keyword const.
[Wed 5/3/17 09:15] const has been removed from the parameters of the copy constructor and assignment operator, since the HashTable object in the parameter will be asked to finish incremental rehashing. The alternative is to make most of the data members mutable, but that is too cumbersome.
[Mon 5/1/17 16:30] We usually prefer not to be too specific about implementation details that do not affect the performance of a data structure, but in order to prevent a bunch of questions and also to make grading easier, let's be really specific about the length of a linear probe. Suppose a string x hashes to index 67:
- If x is inserted in slot 75, the probe length is 9 and this does not trigger a rehash. If x is inserted in slot 76, then the probe length is 10 and a rehash would be triggered. The current cluster that contains slot 76 should be moved to the new hash table immediately.
- If x is found in slot 75 by a find operation, the probe length is 9 and we do not rehash. If x is found in slot 76, then we do start a rehash. The current cluster that contains slot 76 should be moved to the new hash table immediately.
- The remove operation is the same as the find operation, since we have to look for an item before we can remove it. The slot marked DELETED is still part of the cluster. There is no item to insert in the new table for that slot, but the deleted slot joins the slots before it and the slots after it into one cluster. If a rehash is triggered during the operation, this cluster should be moved immediately after the deletion.
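The probe-length counting in the three cases above can be sketched as follows. This is only an illustration: probeLength, probeTriggersRehash and PROBE_LIMIT are hypothetical names, and the formula assumes that both the home slot and the slot where the probe stops are counted.

```cpp
#include <cassert>

// Number of slots examined when a probe starts at `home` and stops at `slot`,
// counting both ends and allowing for wrap-around past the last index.
int probeLength(int home, int slot, int capacity) {
    assert(capacity > 0);
    assert(home >= 0 && home < capacity);
    assert(slot >= 0 && slot < capacity);
    return (slot - home + capacity) % capacity + 1;
}

// Threshold from the addendum: a probe of length 10 starts a rehash.
const int PROBE_LIMIT = 10;   // hypothetical name

bool probeTriggersRehash(int home, int slot, int capacity) {
    return probeLength(home, slot, capacity) >= PROBE_LIMIT;
}
```

With a string that hashes to index 67, stopping in slot 75 gives a probe length of 9 and stopping in slot 76 gives 10, which matches the cases listed above.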
Let's be similarly specific about checking the load factor. First of all, the table size is prime and >100, so it must be odd. That means we do not have to worry about the >= 50% vs > 50% question. We should check the load factor at the beginning of every insert, find and remove operation. For insert, this makes sense because if we are going to rehash due to a load factor exceeding 0.5, then we want to start inserting in the new hash table right away. For find and remove, checking the load factor first, and triggering the rehash if needed, also clarifies what you are supposed to do with the cluster you find in the operation. If you are already in rehash mode, yes, you should move that cluster to the new table. Also, since the load factor is checked before we do the insert or remove, we don't take into account the size change from that insert or remove when we check the load factor.
As stated previously, these stipulations do not really affect the performance of the data structure. It just makes grading easier. It is also just easier to say, "Check the load factor before you do anything with insert."
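One way to write the load factor check without floating point is sketched below. The function name is hypothetical. Because the table size is odd, the load factor can never be exactly 0.5, so a strict comparison is enough.

```cpp
#include <cassert>

// True when more than half of the slots are filled (load factor > 0.5).
// Uses integer arithmetic: numItems / capacity > 1/2  <=>  2 * numItems > capacity.
bool loadFactorExceedsHalf(int numItems, int capacity) {
    assert(capacity > 0 && numItems >= 0);
    return 2 * numItems > capacity;
}
```

For a table of size 101, 50 items do not trigger a rehash and 51 items do.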
[Mon 5/1/17 16:10] Clarified that the smallest table size we want to consider is 101.
[Mon 5/1/17 14:50] Clarified that checking whether the old table has fewer than 3% of the items should be done at the beginning of each insert, find and remove operation. (It doesn't really matter, but this avoids a bunch of questions.)
[Mon 5/1/17 13:06] Added warning about negative numbers and the modulo operator in Implementation Notes.


Objectives

The objective of this programming assignment is to have you practice implementing a hash table.

Background

Your micro-manager of a boss from Project 4 found you out. Somehow he figured out that you did not implement real min-max heaps for him. After some verbal reprimand, somehow you got to keep your job. For your next assignment, your boss says, "You will implement the data structure EXACTLY as I specify!"

Your next assignment is to implement a hash table for strings. Your boss says you must use open addressing, linear probing and the division method. He even gives you the function to compute the hashcode of a string:

unsigned int hashCode(const char *str) {

   unsigned int val = 0 ;
   const unsigned int thirtyThree = 33 ;  // magic number from textbook

   int i = 0 ;
   while (str[i] != '\0') {
      val = val * thirtyThree + str[i] ;
      i++ ;
   }
   return val ;
}

Really?? Linear probing? Doesn't he know that linear probing is vulnerable to primary clustering? You decide that you will follow your boss's instructions, but you will nevertheless add your own optimizations.

When a hash table is "full" or has too many collisions, a last resort is to rehash. When you rehash, you construct another hash table and insert every item in the old hash table into the new hash table. The new hash table might be bigger and could use a different hash function. The hope is that you were just "unlucky" with collisions and that the new hash table will not have as many.

Rehashing is very expensive in terms of running time since you have to re-insert every item from the old hash table into the new hash table. Alternatively, you can rehash incrementally and not move the items all at once. Incremental rehashing will not lower the overall running time of rehashing, but it will prevent your program from stalling (think spinning beach balls) while a hash table is being copied. During an incremental rehash, there are two hash tables in the data structure: the old hash table and the new hash table. When you insert a new string, you insert it in the new hash table. When you search for a string, the string may be in either the old hash table or the new hash table. You have to look in both. Similarly, a remove operation will have to look in both tables and remove the item from the appropriate one. During the incremental rehash phase, every operation (i.e., insert, find and remove) will also move a few items from the old table to the new table. After a number of operations, most of the items will have migrated to the new table. The stragglers in the old table will be moved to the new table and the old table can be discarded.

Your Assignment

Your assignment is to implement a hash table data structure that uses incremental rehashing as described above.

Rehashing is triggered in two ways. First, if the load factor of the table exceeds 0.5 (i.e., more than half of the hash slots are filled), your data structure should begin the incremental rehashing. Second, if linear probing during an insert, find or remove operation examines 10 or more hash slots (see the addendum above for exactly how the probe length is counted), then you should also initiate the incremental rehashing. The new hash table should have a size that brings the load factor to 0.25 if all of the elements of the old table were moved to the new table at once. That is, the size of the new table should be approximately four times the number of items in the old table when the incremental rehashing is triggered. If incremental rehashing was triggered by a long linear probe, the new table might actually be smaller than the old table. In any case, the new table size must be prime, and the new table size must be different from the old table size. We will also keep our table sizes between 101 and 199,999 for this project.

Instead of moving individual items from the old table to the new table, you will move clusters. These are contiguous hash slots that are occupied by items. Clusters are bad for linear probing because they lengthen the running time for a search. (We do not know that an item is not in the hash table until we have reached the end of the cluster.) Also, large clusters tend to get larger because the probability that a new item hashes to a slot in a cluster increases with the size of a cluster. Thus, when incremental rehashing has been triggered, every time a slot t in the old table is accessed (during insert, find or remove), we will rehash the cluster in the old hash table surrounding t.

For example, the following is a part of the dump from a hash table with size 137. The number in parentheses after each word is the hash index of the word. The words in slots 67 through 87 form a cluster. If we did a search for "aberdeen", which hashes to slot 68, we won't find the word in the linear probing until we reach slot 80. This will trigger the incremental rehashing since the linear probing had to examine more than 10 slots to find "aberdeen". Thus, every item in the cluster from slot 67 through 87 will be moved to the new table. Note that the cluster includes items before slot 68, where "aberdeen" hashed to. Also, in the linear probing scheme, when an item is deleted, the slot it occupied is marked as deleted rather than emptied (otherwise linear probing won't work for later searches). So, the cluster also includes deleted slots, but obviously there is no item to move to the new table for such a slot. Subsequently, if we do a search for "abbreviation", then the cluster in slots 89 through 91 will be moved to the new table.

H[ 60] =
H[ 61] =
H[ 62] = abeam (62)
H[ 63] =
H[ 64] =
H[ 65] =
H[ 66] =
H[ 67] = abbreviated (67)
H[ 68] = abductors (67)
H[ 69] = abattoir (69)
H[ 70] = aardwolf (70)
H[ 71] = abated (70)
H[ 72] = abatement (71)
H[ 73] = abbeys (71)
H[ 74] = abalone (74)
H[ 75] = abandons (75)
H[ 76] = abdicates (69)
H[ 77] = abased (77)
H[ 78] = DELETED
H[ 79] = abel (76)
H[ 80] = aberdeen (68)
H[ 81] = abhorred (67)
H[ 82] = abbreviates (82)
H[ 83] = abducts (82)
H[ 84] = abducted (84)
H[ 85] = abates (85)
H[ 86] = abet (84)
H[ 87] = abhorrence (78)
H[ 88] =
H[ 89] = abdomens (89)
H[ 90] = abbey (90)
H[ 91] = abbreviation (90)
H[ 92] =
H[ 93] =
H[ 94] =
H[ 95] =
H[ 96] = abdominal (96)
H[ 97] = abhor (97)
H[ 98] = abasement (98)
H[ 99] = abhorrent (96)
H[100] =

Note that in the example above, if we have not entered incremental rehashing and we did a search for "abel", this would not trigger an incremental rehash because "abel" hashes to index 76 and we found it in slot 79. The linear probing looked at only 4 slots. That is, incremental rehashing is not triggered by the existence of large clusters — it is triggered when we encounter a long linear probe sequence.
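The hash indices shown in parentheses in the dump can be reproduced with the division method. The sketch below repeats hashCode() as given in the Background section and adds a small helper, hashIndex, which is a hypothetical name and not part of the required interface. Note that the modulo is taken on the unsigned value before converting to int, so the index is never negative.

```cpp
#include <cassert>

// hashCode() as given in the assignment.
unsigned int hashCode(const char *str) {
    unsigned int val = 0;
    const unsigned int thirtyThree = 33;  // magic number from textbook
    int i = 0;
    while (str[i] != '\0') {
        val = val * thirtyThree + str[i];
        i++;
    }
    return val;
}

// Division method: hash index = hashCode modulo the table size.
// The remainder is computed in unsigned arithmetic, then narrowed to int.
int hashIndex(const char *str, int capacity) {
    assert(capacity > 0);
    return static_cast<int>(hashCode(str) % static_cast<unsigned int>(capacity));
}
```

With a table size of 137, hashIndex("abel", 137) gives 76 and hashIndex("abeam", 137) gives 62, as in the dump above.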

When there is an ongoing rehash, we must also check if the number of items in the old table has dropped below 3% of the total number of items in the data structure. If that is the case, then we should wrap up the rehashing and copy all of the remaining items in the old hash table to the new hash table. The old hash table can then be discarded and we will exit from the "rehash mode". This check should be done at the beginning of every insert, find and remove operation.
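The 3% check can be done in integer arithmetic, as in the sketch below. The function name and the two counters are hypothetical. Treating an empty old table as "finished" is an assumption made here to cover the case where everything has already migrated.

```cpp
#include <cassert>

// True when the old table holds fewer than 3% of all items in the data structure.
// oldItems / total < 3/100  <=>  100 * oldItems < 3 * total
bool shouldFinishRehash(int oldItems, int newItems) {
    assert(oldItems >= 0 && newItems >= 0);
    if (oldItems == 0) return true;   // assumption: nothing left to migrate
    int total = oldItems + newItems;
    return 100 * oldItems < 3 * total;
}
```

For example, 2 items in the old table out of 100 total is below 3%, while 3 out of 100 is not.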

Finally, what should we do if, during an incremental rehash, the load factor of the new table exceeds 0.5, or if there is a long linear probe in the new table? Then, we throw up our hands and give up on incremental rehashing. We make a third table that has a table size about 4 times the total number of items in the old and new tables, and move all of the items into the third table at once.

Requirements

You must design and implement a C++ class HashTable that uses the incremental rehashing scheme as described above. The following are some firm requirements (i.e., necessary for grading):

The name of your class must be HashTable.
You must use HashTable.h and HashTable.cpp for the filenames (case sensitive) of your header and implementation files.
You must use the hashCode() function given above.
You must use open addressing and linear probing.
You must use the division method where hash index is the value returned by hashCode() modulo the table size.
Your table sizes must be prime numbers.
You may not use STL classes, not even string.
You must dynamically allocate arrays of type char * to hold your C strings. I.e., your table should be declared as a pointer to an array of C strings, which has type char **.
Your program must not leak memory.

We will be hashing strings. The strings are given in a global array words[]. None of the member functions in your implementation of the HashTable class should use this global array. The driver programs should be the only code that uses the words[] array. The global variable numWords has the size of the words[] array.

#ifndef _WORDS_H_
#define _WORDS_H_

const int numWords = 58109 ;

const char *words[numWords] = {
"aardvark",
"aardwolf",
"aaron",
"aback",
"abacus",
"abaft",
"abalone",
"abandon",
"abandoned",
"abandonment",
"abandons",
. . .
"zoomed",
"zooming",
"zooms",
"zooplankton",
"zoos",
"zulu",
"zulus"
} ;

#endif

Download: words.h.

Note that the items in the words[] array are const char * strings and not C++ strings. When you store a string in your hash table you should make a copy of the given string.

Similarly, there is an array of prime numbers between 100 and 200,000 in the primes.h file. You should use binary search in the primes[] array to find a prime number for your table size. If, during rehashing, you have to create a table with more than 199,999 slots, then just give up and throw an out_of_range exception.

const int numPrimes = 17959 ;

const int primes[numPrimes] = {
101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157,
. . .
199873, 199877, 199889, 199909, 199921, 199931, 199933, 199961, 199967, 199999
} ;

Download: primes.h.
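The binary search and the sizing rule for a rehash (about four times the number of items, prime, different from the old size) might be sketched as follows. The function names are hypothetical, and the prime list is passed as a parameter so the sketch does not depend on primes.h. Moving to the next larger prime when the first choice equals the old size is an assumption; the assignment only requires that the sizes differ.

```cpp
#include <cassert>
#include <stdexcept>

// Index of the smallest entry >= target in a sorted array, found by binary
// search. Returns count if every entry is smaller than target.
int lowerBoundIndex(const int *sorted, int count, int target) {
    int lo = 0, hi = count;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] < target) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// Table size for a rehash: the smallest listed prime >= 4 * numItems,
// skipping the old size. Throws if no listed prime is large enough.
int rehashTableSize(const int *primeList, int count, int numItems, int oldSize) {
    assert(count > 0 && numItems >= 0);
    int i = lowerBoundIndex(primeList, count, 4 * numItems);
    if (i < count && primeList[i] == oldSize) i++;   // must differ from old size
    if (i >= count) throw std::out_of_range("table size too large");
    return primeList[i];
}
```

Since the list starts at 101, a small target automatically yields the minimum table size of 101.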

Additional Specifications

Here are some additional specifications for the HashTable class member functions that you have to implement. You will definitely want to add other data members and member functions to the HashTable class, but the prototypes for the member functions listed here should not be changed. You may create additional classes if you prefer, but all class declarations should be placed in HashTable.h and all implementations should be placed in HashTable.cpp.

This is the default constructor for the HashTable class. The parameter n is the requested size of the hash table. If no size is given or if n is less than 101, then use the prime number 101 for the table size. (I.e., the minimum table size is 101.) If the requested size is a prime number greater than 100, then you should use that size for the table size. If the requested size is not a prime number, you should use the next highest prime number from primes.h. If the requested size exceeds 199,999, then just give up and throw an out_of_range exception. The HashTable object created by the constructor should be ready for insertion, search and deletion without any additional initialization.

This is the destructor. Make sure you deallocate all memory for this object. The strings in the hash table must be deallocated using free() since they are C strings (i.e., don't use delete for C strings).
This is the copy constructor for the HashTable class. If the other hash table is not in incremental rehash, then copying the hash table is very straightforward. Just make sure you allocate memory for the new hash table and use strdup() to copy the strings.

On the other hand, if the other hash table has an ongoing rehash, then it doesn't make sense to make a duplicate of a hash table that is in the middle of being copied. Instead, force the other table to finish its rehashing and copy over the resulting single table.

Do not call the assignment operator from the copy constructor. If you do not want to have duplicate code, then create a third function that handles the common parts of both functions.
This is the overloaded assignment operator for the HashTable class. As with the copy constructor, the process is fairly standard if the rhs does not have an ongoing incremental rehash. If there is an ongoing rehash, then force the rhs to finish its rehashing and copy over the resulting single hash table.
This function inserts a copy of the C string str into the hash table. It has no return value. (Note: use strdup() to copy C strings.)

The insert() function should insert in the new table if there is an ongoing incremental rehash.

Calling insert() with a string that is already in the hash table should have no effect. (I.e., do not insert a second copy of the same value.) Make sure you don't have a copy of a string that you didn't insert floating around. That's a memory leak.

The insert() function should trigger incremental rehashing when appropriate as described above. The insert() operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%.
The find() function looks for str in the hash table. The function returns true if found, false otherwise. The find() function should look in both the old and the new hash tables if there is an ongoing incremental rehash.

The find() function should trigger incremental rehashing when appropriate as described above. The find() operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%.
The remove() function removes str from the hash table and returns a pointer to the removed string (the copy that was stored in the table). If str is not in the hash table, remove() returns NULL.

It is the responsibility of the code that calls remove() to deallocate the string that is returned. (Again, use free(), not delete to deallocate.)
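The ownership rule can be illustrated with two tiny stand-in functions. Both names are hypothetical: copyForTable plays the role of the copy made on insertion, and releaseRemoved plays the role of the calling code that receives the pointer returned by remove(). strdup() is a POSIX function declared in <string.h> on the systems this course uses.

```cpp
#include <cassert>
#include <stdlib.h>   // free
#include <string.h>   // strdup, strcmp

// Stand-in for the copy made on insertion: memory comes from malloc() via strdup().
char *copyForTable(const char *str) {
    assert(str != NULL);
    return strdup(str);
}

// Stand-in for the caller's side of remove(): the caller owns the returned
// string and must release it with free(), never with delete.
void releaseRemoved(char *removed) {
    free(removed);   // free(NULL) does nothing, so a NULL result is safe to pass
}
```

The copy has the same contents as the original but lives at a different address, which is why comparing pointers is not the same as comparing strings.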

When an item is removed, the slot it occupied should be marked as deleted and not set to NULL, which would break linear probing. Use the declaration in your HashTable class:

Then in the .cpp file, initialize with:

The constant DELETED can then be stored in your hash table slot. This assumes that memory address 1 is never returned by the memory manager. (That's a pretty safe assumption.)
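A declaration and initialization consistent with the description above might look like the sketch below. Treat it as an assumption rather than the exact required text: HashTableSketch is a hypothetical stand-in for the HashTable class, and holdsString and markDeleted are hypothetical helpers. The const comes after the asterisk, so the pointer itself is constant.

```cpp
#include <cassert>
#include <cstddef>

struct HashTableSketch {            // hypothetical stand-in for HashTable
    static char * const DELETED;    // const pointer: the address never changes
};

// This definition would go in the .cpp file. Address 1 is assumed never to be
// returned by the memory manager.
char * const HashTableSketch::DELETED = reinterpret_cast<char *>(1);

// A slot holds a real string only if it is neither empty nor marked deleted.
bool holdsString(const char *slot) {
    return slot != NULL && slot != HashTableSketch::DELETED;
}

// Marks a slot as deleted. The caller is assumed to have saved the old pointer,
// since it is the value that a removal hands back.
void markDeleted(char **table, int index) {
    assert(holdsString(table[index]));
    table[index] = HashTableSketch::DELETED;
}
```

Checks like holdsString() help avoid two of the common bugs listed in the addenda: calling strdup() on DELETED and using DELETED as a string.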

The remove() function should trigger incremental rehashing when appropriate as described above. The remove() operation should also wrap up the incremental rehashing if the number of items in the old table drops below 3%.
These functions are used for grading purposes, so we can examine your hash table(s). The isRehashing() function returns true if there is an ongoing incremental rehash. The tableSize() function returns the size of the hash table. When there is an ongoing rehash, tableSize(0) should return the size of the old table and tableSize(1) should return the size of the new table. Similarly, size() returns the number of items currently in the table(s).

The at() function returns a pointer to the string stored at the index slot of the hash table specified by table. If the index is invalid (i.e., less than 0 or greater than or equal to table size), then at() should throw an out_of_range exception (already defined in stdexcept). The pointer returned by at() has type const char * to prevent the string stored in the hash table from being changed. The calling function can make a copy if desired.
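The bounds check described for at() can be sketched as a free function. slotAt is a hypothetical name, and the table and its capacity are passed in explicitly here only to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Returns the pointer stored in the given slot, or throws out_of_range if the
// index is less than 0 or greater than or equal to the table size.
const char *slotAt(char * const *table, int capacity, int index) {
    assert(table != NULL);
    if (index < 0 || index >= capacity)
        throw std::out_of_range("hash table index out of range");
    return table[index];   // may be NULL for an empty slot
}
```

Returning const char * means the caller can read the string but cannot modify it through this pointer.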
Dump should print some vital statistics and the contents of the hash table(s) to stdout. You should include the table size and number of items in the hash table(s). When you print out the string in each hash slot, include the item's hash index in parentheses (see the example above).

Implementation Notes

Remember to mod out by the table size when you are working with hash table indices.
Again, in hash tables, the indices wrap around to 0 at the bottom of the hash table. You must take this into account when you use linear probing for insert, find and remove.
One more time, the indices wrap. This means clusters can straddle the end of the hash table. When you move a cluster that is at the beginning or at the end of the table, you have to check if the cluster wraps around. Also, note that for loops do not work very well in this situation, because a simple loop condition (such as comparing the index with the last slot of the cluster) doesn't do the right thing when the cluster wraps around.
If you are thinking of working with negative indices for your hash table, remember that the modulo operator in C/C++ does not work mathematically (i.e., correctly): -2 % 10 evaluates to -2, not 8. So, the negative indices will not wrap around to the end of the hash table as you might expect.
You are working with C null-terminated strings, not C++ strings. If you haven't used C strings in a while, please review. For example, a C string is just an array of char, so the type is char *. A dynamically allocated array of C strings has type char * * since it is a pointer to an array of pointers to char. Here are Mr. Lupoli's notes on C strings.
You must use the strcmp() function to make string comparisons. You cannot use the == operator. That will do pointer comparison, not string comparison. To use strcmp(), make sure you include <string.h> or <cstring>.
Your hash table should make a copy of the string inserted, and not just store the pointer. Copies should be made using the strdup() function.
The strdup() function allocates memory using malloc() instead of new. To deallocate this memory, you must use free() instead of delete. While it is possible to allocate an array of char using new[], it would be very confusing for one program to sometimes use malloc() and sometimes use new to allocate strings. So, just stick to using malloc() and free() to allocate and deallocate C strings. Having said that, you don't really need to call malloc() at all if you use strdup() to copy strings, since strdup() allocates the memory for you. Finally, for all other memory allocations not involving C strings, just use new.
Some of the strings that you are working with have type const char *. This means the string being pointed to cannot be changed. (The string is immutable in Python-speak.) Make sure that you know what this means. If you assign a const char * pointer to a char * pointer, the compiler will give you an error.
You should avoid copying strings whenever possible. For example, when you move a string from the old hash table to the new hash table during a rehash, you do not need to make a copy of the string. This means you probably want to have two versions of the insert() function. The standard one makes copies (because that's what your client is expecting). The internal one that you use for moving should not make copies. (Yes, one can call the other.)
You should periodically run your program under valgrind during development. This is so you can catch memory leaks as soon as possible. Also, if valgrind complains about memory read or memory write errors, this means you have a bug in your program. You should not ignore these errors. You should fix the bug as soon as possible because memory errors tend to manifest themselves in other places and are difficult to debug. You want to know if your program has any memory errors as soon as possible, because it is likely that the bug is in the last few modifications you have made in your program.
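The wrap-around notes above can be made concrete with a sketch that finds the cluster around a slot and then walks it. All names here are hypothetical. Stepping backward uses (i + capacity - 1) % capacity so that no negative index is ever passed to the modulo operator, and the walk uses a loop that stops when it reaches the last slot instead of comparing indices with <.

```cpp
#include <cassert>
#include <cstddef>

int prevSlot(int i, int capacity) { return (i + capacity - 1) % capacity; }
int nextSlot(int i, int capacity) { return (i + 1) % capacity; }

// Finds the first and last slot of the cluster containing slot t. Any non-NULL
// slot belongs to the cluster, so deleted markers count. Returns false if slot
// t is empty.
bool clusterBounds(char * const *table, int capacity, int t, int &first, int &last) {
    assert(t >= 0 && t < capacity);
    if (table[t] == NULL) return false;
    first = t;
    int steps = 1;                      // slots known to be in the cluster
    while (steps < capacity && table[prevSlot(first, capacity)] != NULL) {
        first = prevSlot(first, capacity);
        steps++;
    }
    if (steps == capacity) {            // every slot occupied: one big cluster
        last = prevSlot(first, capacity);
        return true;
    }
    last = t;
    while (table[nextSlot(last, capacity)] != NULL)
        last = nextSlot(last, capacity);
    return true;
}

// Walks the cluster from first to last, wrapping if needed, and counts the
// slots that hold real strings (not the deleted marker).
int countStrings(char * const *table, int capacity, int first, int last,
                 const char *deleted) {
    int count = 0;
    int i = first;
    while (true) {
        if (table[i] != deleted) count++;
        if (i == last) break;
        i = nextSlot(i, capacity);
    }
    return count;
}
```

In a table of size 7 with slots 5, 6, 0 and 1 occupied, the cluster around slot 0 runs from slot 5 to slot 1, which a loop of the form "for i from first while i <= last" would never visit.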

Test Programs

The following programs should be compiled using g++ testX.cpp HashTable.cpp where testX.cpp is one of test1.cpp, test2.cpp, test3.cpp, test4.cpp or test5.cpp. Run these programs under valgrind to check for memory leaks and memory read/write errors.

Basic test of insert(), find() and remove() without triggering incremental rehashing. Code (test1.cpp) and sample output (test1.txt).
Rehashing triggered. Clusters moved from old hash table to new hash table. Rehashing ends and only one table remains. Code (test2.cpp) and sample output (test2.txt).
Rehashing triggered. Further insertions cause long probe sequence in new table. Rehashing stopped and all items are consolidated in one hash table. Code (test3.cpp) and sample output (test3.txt).
Robust test with thousands of calls to insert(), find() and remove(). Resulting hash table is checked against equivalent STL set. Code (test4.cpp) and sample output (test4.txt).
Test of copy constructor and assignment operator. Code (test5.cpp) and sample output (test5.txt).

What to Submit

You must submit the following files:

HashTable.h
HashTable.cpp
Driver.cpp

The main function in Driver.cpp should exercise your HashTable functions and show what your program has implemented successfully. (I.e., if your code in Driver.cpp does not produce any output or it seg faults, then we will assume that you have not implemented very much.)

CMSC 341 Data Structures — All Sections — Spring 2017