Project 5, Perfect Hashing

Due: Tuesday, December 12, 8:59pm


Addenda


Objectives

The objective of this programming assignment is for you to gain experience working with hash tables and C++ functor classes.

Background

Perfect Hashing

Hash tables are nice and fast, but dealing with collisions is a pain. Wouldn't it be nice to have a hash table without collisions? This is not possible if we use a fixed hash function, because once the fixed hash function is known, a devil's advocate can always select a set of keys that all hash to the same slot of the hash table.

However, there is a situation where we can have nice things, in this case, a collision-free hash table. In some applications, we know the set of keys that we will hash ahead of time. For example, we might want to put the filenames of a CD we want to burn in a hash table for quick look up. After the CD has been made, the filenames will not change. (But, who burns CDs anymore?) Or perhaps, the keys are the commands available to the users of an application. Until the next version of the app, the set of commands is fixed.

If we know that there are n keys to hash, we can randomly choose a hash function and hash the n keys into a table with n² slots. It can be shown that the probability of having zero collisions is greater than 50% in this situation. What if there is a collision? Then we just randomly pick another hash function and see if that hash function has any collisions. We can keep picking hash functions until there are no collisions. The expected number of hash functions we have to try is less than 2.
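For the curious, here is a sketch of where the 50% comes from. It assumes something not stated above: that the hash function is drawn from a universal family, so any two distinct keys collide with probability at most 1/m, where m = n² is the table size. The first line counts the expected number of colliding pairs, the second applies Markov's inequality, and the third treats the number of tries as a geometric random variable with success probability p > 1/2.

```latex
\begin{align*}
\mathbb{E}[\text{colliding pairs}]
  &\le \binom{n}{2}\cdot\frac{1}{n^{2}}
   = \frac{n(n-1)}{2n^{2}} < \frac{1}{2}, \\
\Pr[\text{at least one collision}]
  &\le \mathbb{E}[\text{colliding pairs}] < \frac{1}{2}, \\
\mathbb{E}[\text{hash functions tried}]
  &= \frac{1}{p} < 2 \quad\text{where } p = \Pr[\text{no collision}] > \frac{1}{2}.
\end{align*}
```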

How do we randomly pick a hash function? One of the hashing methods presented in our textbook is the so called MAD method (MAD = Multiply Add Divide). The hash function is

    h(k) = ( a · k + b ) % m

where k is the key and m is the table size. If we choose the values of a and b randomly, then each time we pick a different value of a and b we get a different hash function. Note that the table size m must be a prime number. Also, 1 ≤ a ≤ m − 1 and 0 ≤ b ≤ m − 1. Thus, we need to be a bit careful when we choose these parameters.
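As a rough illustration of picking a and b within those ranges, here is a minimal sketch on plain integer keys. The names MadParams, pickMad and madHash are made up for this example and are not part of the project files; the sketch also simply trusts the caller to pass a prime m.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical struct for illustration; not part of PerfectHT.h.
struct MadParams {
    unsigned int a ;   // multiplier, 1 <= a <= m - 1
    unsigned int b ;   // shift,      0 <= b <= m - 1
    unsigned int m ;   // table size, assumed to be prime
} ;

// Pick a and b at random within the ranges the MAD method requires.
MadParams pickMad(unsigned int m) {
    assert(m >= 2) ;
    MadParams p ;
    p.m = m ;
    p.a = ( rand() % (m - 1) ) + 1 ;   // never 0
    p.b = rand() % m ;                 // may be 0
    return p ;
}

// h(k) = (a * k + b) % m
unsigned int madHash(const MadParams& p, unsigned int k) {
    return ( p.a * k + p.b ) % p.m ;
}
```

Calling pickMad() again with the same m gives (very likely) a different a and b, and hence a different hash function over the same table.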

There is another issue. A hash table with n² slots is really big! With only n keys, the vast majority of the slots are empty. This is a total waste of memory. To fix this problem, we first hash the keys into a table with just n slots. We call this table the primary hash table. We expect the primary hash table to have collisions, but for a given slot in the primary hash table, only a few keys will collide in that slot (we expect something in the range of 2 to 5 keys). Now, for each slot that has a collision, we create a secondary hash table that has t² slots where t is the number of keys that collided in that slot. With high probability, after picking a few hash functions randomly, we have a secondary hash table that is collision free. This is called perfect hashing (because there are no collisions, not because it is actually perfect).

So, how does a perfect hash table work after the keys are stored? Suppose that you want to look up whether a key k is in the perfect hash table. First you apply the primary hash table's hash function h1 to the key k. Let's say h1( k ) evaluates to j. Then we look in the jth slot of the primary hash table. If that slot is empty, then k is not in the hash table. If the jth slot is not empty, there may or may not have been any collisions. If there weren't any collisions in this slot, then we simply check if the key stored in that slot is equal to k. Otherwise there is a secondary hash table associated with the jth slot. We retrieve the hash function h2 of this secondary hash table and compute ℓ = h2(k). Then, we look in the ℓth slot of the secondary hash table for k. (Note that every secondary hash table has its own h2.) The "average" running time for this search is O(1) since it is basically two hash table searches, each of which takes O(1) time.

Why does this scheme save space? Because quadratic functions are not linear. Consider this example. Suppose that we want to store 100 keys. If we created a single hash table of 100² slots, that would use 10,000 entries. Instead, if we are very unlucky in a perfect hash table, the 100 keys might hash into 20 slots with 5 keys colliding at each slot. The secondary hash tables would have 25 slots, but 20 × 25 is only 500, which is much smaller than 10,000. Furthermore, most of the time, we would not be so unlucky. In fact, for a perfect hash table with n keys, the expected total number of slots used in the primary hash table and all of the secondary hash tables is 2n. (The analysis is not that hard but I will not reproduce it here.)
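The arithmetic in that example can be checked with a few lines of code. This is only a sketch: totalSlots is a hypothetical helper, and it ignores the rounding up to a prime table size that the real hash function does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: total slots used by a two-level table, given the
// number of primary slots and how many keys landed in each non-empty slot.
// A slot holding t >= 2 keys gets a secondary table of t * t slots;
// a slot holding a single key needs no secondary table.
std::size_t totalSlots(std::size_t primarySlots,
                       const std::vector<std::size_t>& bucketSizes) {
    assert(bucketSizes.size() <= primarySlots) ;
    std::size_t total = primarySlots ;
    for (std::size_t i = 0 ; i < bucketSizes.size() ; i++) {
        std::size_t t = bucketSizes[i] ;
        if (t >= 2) total += t * t ;
    }
    return total ;
}
```

For the unlucky case above (100 primary slots, 20 buckets of 5 keys each), this gives 100 + 20 × 25 = 600 slots in total, still far below 10,000.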

Functors

So, that's perfect hashing. What the heck is a functor? In mathematics, a functor is a mapping between categories — basically a higher order function. In computer science, we co-opted the word to mean objects that behave like functions. In particular, in C++, functors are objects that have the () operator defined. (Yes, you can overload the () operator and yes, you should think of () as an operator the way you think of [] and + as operators.)

Here's a simple example of a functor class Increment.cpp:

    #include <iostream>
    using namespace std ;

    class Increment {

    public:
       Increment(int n=1) { m_addThis = n ; }

       int operator() (int x) { return x + m_addThis ; }

       int m_addThis ;
    } ;

    int main() {

       Increment add1 ;

       cout << add1(5) << endl ;   // prints 6
       cout << add1(8) << endl ;   // prints 9

       Increment add7(7) ;

       cout << add7(5) << endl ;   // prints 12
       cout << add7(8) << endl ;   // prints 15

       cout << add1(5) << endl ;   // still prints 6
    }

In main(), add1 and add7 are Increment class objects. Since the Increment class has the () operator defined, placing () after add1 and add7 invokes the member function operator(). The call add1(5) passes the value 5 to the parameter x in the parameter list for operator(). That means add1(5) will return 6. Why? Since add1 was declared without any parameters in

    Increment add1 ;

the constructor's parameter n takes on the default value of 1. This is assigned to the data member m_addThis. When add1(5) is executed, the operator() member function returns 5 + m_addThis which evaluates to 6.

On the other hand, when add7 was declared by

    Increment add7(7) ;

the data member m_addThis was set to 7. So, the calls add7(5) and add7(8) return 12 and 15 respectively. Note that calling add1(5) still returns 6, because m_addThis is a data member. Thus, add1 and add7 each have their own copies of m_addThis. If you simply remember that add1 and add7 are objects, this makes perfect sense.

So, functor objects can be used like functions, but it is easier to think of them as objects. They can be stored in variables, like any other object. They can be passed as parameters. If the functor class has dynamically allocated data, they may even need copy constructors, destructors and overloaded assignment operators.
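To make "passed as parameters" concrete, here is a small sketch. It restates Increment (with a const operator() and a private data member) so that the block stands on its own; applyToAll and checkIncrement are hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class Increment {
public:
    Increment(int n = 1) : m_addThis(n) { }
    int operator()(int x) const { return x + m_addThis ; }
private:
    int m_addThis ;
} ;

// Hypothetical helper: applies any function-like object f to each element.
template <typename Func>
std::vector<int> applyToAll(const std::vector<int>& in, Func f) {
    std::vector<int> out ;
    for (std::size_t i = 0 ; i < in.size() ; i++) {
        out.push_back( f(in[i]) ) ;   // f is called like a function
    }
    return out ;
}

// Small self-check using the same values as the Increment example.
void checkIncrement() {
    std::vector<int> v ;
    v.push_back(5) ;
    v.push_back(8) ;
    std::vector<int> r = applyToAll(v, Increment(7)) ;
    assert(r[0] == 12 && r[1] == 15) ;
}
```

Because applyToAll is a template, it accepts an Increment object, a plain function pointer, or anything else that can be called with an int.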

But why have functors at all? Well, they are a bit more robust than the function pointers we used in Project 4. They are certainly easier to declare. Plus, functors are the C++ answer to the mantra from Programming Languages that functions should be treated as first-class objects. In C++, functors are first-class objects because they are objects. It remains debatable whether functors are functions. (Plus, the Programming Languages people really mean that all functions should be treated as first-class objects. Here, we have some function-like things that are first-class objects. So, it doesn't really count.)

But wait, why do we want functors for this project? Well, we want our hash functions to be object-oriented. That way, when we declare a hash function, it is already randomly selected. Also, if we don't like a particular hash function (because it causes a collision), then we can just ask the hash function to go "reboot" itself and become a different hash function.


Assignment

Your assignment is to implement perfect hashing using a functor class for hash functions. To keep things straightforward at the end of the semester, you will follow the design given in the header file PerfectHT.h shown below. You may make changes to this header file, but you must keep the function signatures for the member functions that are called by the test programs. Some functions have been implemented for you already. Those you should not change.

File: PerfectHT.h

    // File: PerfectHT.h
    //
    // UMBC CMSC 341 Fall 2017 Project 5
    //
    // Class declarations for HashFunction,
    // SecondaryHT and PerfectHT.
    //
    // Version: 2017-11-30
    //

    #ifndef _PERFECTHASH_H_
    #define _PERFECTHASH_H_

    #include <stdexcept>
    #include <vector>
    #include <string>
    using namespace std ;

    //
    // "Functor" class for hash functions.
    //
    class HashFunction {

    public:

       // Constructor.
       // Pass in requested hash table size via parameter n.
       // The constructor picks the smallest prime # greater than
       // or equal to n for the table size.
       //
       // Initializes other hash function parameters randomly.
       //
       HashFunction (int n=100) ;

       // Function that maps string to unsigned int.
       // Return value can be much larger than table size.
       // Uses m_multiplier data member.
       //
       unsigned int hashCode(const char *str) const ;

       // Getter for table size
       //
       int tableSize() const ;

       // Overloaded () operator that makes this a "functor" class.
       // Returns the slot in the hash table for str.
       // Uses the MAD method: h(x) = (ax + b) % m where
       // the parameters are:
       //    a = m_factor
       //    b = m_shift
       //    m = m_tsize
       //
       unsigned int operator()(const char *str) const ;

       // Pick new parameters for MAD method and the hashCode function.
       // Note: can change table size.
       //
       void reboot() ;

       // Find smallest prime number greater than or equal to m.
       // Relies on global array primes[].
       //
       static int findPrime(int m) ;

       // Set random seed for the random number generator.
       // Call once at the start of the main program.
       // Uses srand() and rand() from cstdlib which is
       // shared with other code. For compatibility with
       // C++98, it does not have a private random number
       // generator (e.g., mt19937).
       //
       static void setSeed(unsigned int seed) ;

       static bool m_debug ;   // print debugging statements?

    private:
       unsigned int m_tsize ;       // tablesize, must be prime #
       unsigned int m_factor ;      // must have 0 < m_factor < m_tsize
       unsigned int m_shift ;       // must have 0 <= m_shift < m_tsize
       unsigned int m_multiplier ;  // used in hashCode function
       int m_reboots ;              // number of times reboot() was called
    } ;

    // Derived exception class that is thrown when
    // SecondaryHT constructor fails to create a
    // collision-free secondary hash table after
    // many attempts.
    class very_unlucky : public std::runtime_error {
    public:
       very_unlucky(const string& what) : runtime_error(what) { /* do nothing else */ }
    } ;

    // Secondary hash table for PerfectHT class.
    //
    class SecondaryHT {

    public:

       // Create a secondary hash table using the char * strings
       // stored in "words". Makes copy of each item in words.
       //
       SecondaryHT(vector<const char *> words) ;

       // Copy constructor, destructor and assignment operator.
       // Just the usual.
       //
       SecondaryHT(const SecondaryHT& other) ;
       ~SecondaryHT() ;
       const SecondaryHT& operator=(const SecondaryHT& rhs) ;

       // returns whether given word is in this hash table.
       //
       bool isMember(const char* word) const ;

       // getter
       //
       int tableSize() const ;

       // Pretty print for debugging
       //
       void dump() const ;

       static bool m_debug ;   // print debugging info?

    private:

       // number of attempts, before we give up.
       //
       static const int maxAttempts = 20 ;

       // hash function from the functor class.
       //
       HashFunction hash ;

       // The actual hash table. Each entry points
       // to an array of null-terminated char string.
       //
       char **T2 ;

       // keep track number of attempts to reboot the
       // hash function to achieve collision-free
       // hashing.
       //
       int m_attempts ;

       // number of items stored in this secondary hash table.
       //
       int m_size ;
    } ;

    // Perfect Hash Table class.
    //
    class PerfectHT {

    public:

       // Create a Perfect Hashing table using the first n strings
       // from the words array.
       //
       PerfectHT(const char * words[], int n) ;

       // Copy constructor, destructor and assignment operator.
       // Just the usual.
       //
       PerfectHT(const PerfectHT& other) ;
       ~PerfectHT() ;
       const PerfectHT& operator=(const PerfectHT& rhs) ;

       // Returns whether word is stored in this hash table.
       //
       bool isMember(const char * word) const ;

       // Returns number of items stored in this table.
       //
       int tableSize() const ;

       // Print stats for debugging.
       //
       void dump() const ;

       static bool m_debug ;   // print debugging info?

    private:

       HashFunction hash ;   // hash function "functor" object

       // PHT = Perfect Hash Table, pronounced "fuut" with a long "oo".
       //
       // The following two arrays are arrays of pointers.
       //
       // If a primary hash slot contains only one char * string,
       // then the string is stored in PHT1.
       //
       // If a primary hash slot contains two or more char * strings,
       // then they are stored in a secondary hash table.
       //
       // Note that PHT2[i] is a POINTER to a SecondaryHT and not
       // a SecondaryHT object.
       //
       // Space must be allocated for these two arrays.
       // Copies of the stored strings must be made using strdup().
       // Space for the array and for the strings must be deallocated
       // by the destructor.
       //
       char **PHT1 ;          // array of plain old char pointers
       SecondaryHT **PHT2 ;   // array of secondary hash table pointers
    } ;

    #endif

There are three classes defined in PerfectHT.h: HashFunction, SecondaryHT and PerfectHT. The HashFunction class is the functor class. The role of the HashFunction class is to create and maintain hash functions. It does not create any space for actual hash tables. The SecondaryHT class creates collision-free hash tables. It takes a vector of C strings (null-terminated char * pointers) and creates a hash table big enough to hold all of the strings without collision. The PerfectHT class is the class that would be used by the client programmer. It is given an array (not vector) of C strings and creates a perfect hash table from that array.


Step 1: HashFunction

Many of the member functions in the HashFunction class have already been implemented for you, including the hashCode function. (See PerfectHT.cpp.) Recall that our textbook divides hashing into two steps: computing the hash code and compression. The purpose of hashCode() is to map strings into numbers:

    // Function that maps string to unsigned int.
    // Return value can be much larger than table size.
    // Uses m_multiplier data member.
    // Return value must be unsigned for overflow to work correctly.
    //
    unsigned int HashFunction::hashCode(const char *str) const {

       unsigned int val = 0 ;

       int i = 0 ;
       while (str[i] != '\0') {
          val = val * m_multiplier + str[i] ;
          i++ ;
       }
       return val ;
    }

The data member m_multiplier is the "secret sauce". For English words, studies have shown that 33 is a good value to pick for m_multiplier. The HashFunction constructor does initialize m_multiplier with 33, but if there are many collisions, we may need to change that value.
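To see what changing the multiplier does, here is a free-function copy of the hashCode() loop with the multiplier passed in as a parameter. The name polyHash is hypothetical and exists only for this illustration.

```cpp
#include <cassert>

// Free-function copy of the hashCode() loop, with the multiplier passed in.
// Unsigned arithmetic wraps around on overflow, which is well defined.
unsigned int polyHash(const char *str, unsigned int multiplier) {
    assert(str != 0) ;
    unsigned int val = 0 ;
    for (int i = 0 ; str[i] != '\0' ; i++) {
        val = val * multiplier + str[i] ;
    }
    return val ;
}
```

With ASCII codes 'a' = 97 and 'b' = 98, polyHash("ab", 33) works out to 97 × 33 + 98 = 3299, while polyHash("ab", 35) is 97 × 35 + 98 = 3493. The same string gets a different hash code once the multiplier changes.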

The compression phase of the hash function is computed by the overloaded () operator:

    // Overloaded () operator that makes this a "functor" class.
    // Returns the slot in the hash table for str.
    // Uses the MAD method: h(x) = (ax + b) % m where
    // the parameters are:
    //    a = m_factor
    //    b = m_shift
    //    m = m_tsize
    //
    unsigned int HashFunction::operator() (const char *str) const {
       return ( m_factor * hashCode(str) + m_shift ) % m_tsize ;
    }

The data members m_factor, m_shift and m_tsize are initialized in the constructor. Note that m_factor and m_shift are chosen "randomly" subject to the restrictions of the MAD method. The table size m_tsize must be a prime number, so the parameter n given to the HashFunction constructor is just a "suggestion".

    // Constructor.
    // The constructor picks the smallest prime # greater than or
    // equal to n for the table size. Default value of n defined
    // in header.
    //
    // Initializes other hash function parameters randomly.
    //
    HashFunction::HashFunction(int n /* =100*/ ) {

       // note: maxPrime defined in prime.h
       //
       if (n > maxPrime) throw out_of_range("HashTable size too big.\n") ;

       m_tsize = findPrime(n) ;
       m_factor = ( rand() % (m_tsize - 1) ) + 1 ;
       m_shift = rand() % m_tsize ;
       m_multiplier = 33 ;   // magic number from textbook
       m_reboots = 0 ;
    }

Your job in this step is to implement two member functions for the HashFunction class: findPrime() and reboot(). These functions are already declared in the header file. Please follow those signatures.

Calling findPrime(n) should return the smallest prime number greater than or equal to the parameter n. An array of prime numbers is included in the file primes.h. (See links for all files below.) Make use of the global constants numPrimes and maxPrime. If n is greater than maxPrime then throw an out_of_range exception. For efficiency, you must use binary search to find the correct prime.
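If binary search for "the smallest element greater than or equal to a target" is rusty, here is the general pattern on a plain sorted array. This is a sketch of the technique only: lowerBoundIndex is a hypothetical name, and the function deliberately knows nothing about primes.h, numPrimes, maxPrime or exceptions.

```cpp
#include <cassert>

// Returns the index of the smallest element >= target in a sorted array,
// or -1 if every element is smaller than target.
int lowerBoundIndex(const int sorted[], int len, int target) {
    assert(len >= 0) ;
    int lo = 0 ;
    int hi = len ;              // search window is [lo, hi)
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2 ;
        if (sorted[mid] < target) {
            lo = mid + 1 ;      // answer lies strictly to the right of mid
        } else {
            hi = mid ;          // sorted[mid] is a candidate; keep looking left
        }
    }
    return (lo < len) ? lo : -1 ;
}
```

On the array {2, 3, 5, 7, 11, 13}, a target of 6 and a target of 7 both give index 3 (the element 7), and a target of 14 gives -1. Each pass halves the window, so the search takes O(log n) comparisons.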

The second member function you have to implement is reboot(). This function randomly selects new values for a and b in the MAD method for hash functions. Recall that we have the restrictions 1 ≤ a ≤ m − 1 and 0 ≤ b ≤ m − 1, where m is the table size. The purpose of reboot() is to give us a new hash function for the secondary hash table if the current hash function results in collisions. So, make sure that m_factor and m_shift are actually different values from before to guarantee that we have a different hash function.

However, having new values for a and b is not enough to avoid collisions. If we have two strings str1 and str2 where hashCode(str1) and hashCode(str2) differ by a multiple of m_tsize, then the two strings will still collide when we pick new values for m_factor and m_shift. To get around this problem, every third call to reboot() should also change the value of m_multiplier (just increase it by 2) and every fifth call to reboot() should change the table size. This will require you to keep track of the number of times that reboot() has been called. (This is another reason to have a HashFunction functor class.)
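The claim about hash codes that differ by a multiple of the table size can be checked by brute force on small numbers. This sketch uses made-up names (mad, alwaysCollide) and 64-bit arithmetic with small values, so overflow plays no role here.

```cpp
#include <cassert>

// MAD compression on plain integers, using 64-bit arithmetic so that the
// small values in this illustration cannot overflow.
unsigned long long mad(unsigned long long a, unsigned long long b,
                       unsigned long long m, unsigned long long code) {
    assert(m > 0) ;
    return (a * code + b) % m ;
}

// True when two hash codes land in the same slot for every valid (a, b),
// that is, for all 1 <= a <= m - 1 and 0 <= b <= m - 1.
bool alwaysCollide(unsigned long long m,
                   unsigned long long code1, unsigned long long code2) {
    for (unsigned long long a = 1 ; a < m ; a++) {
        for (unsigned long long b = 0 ; b < m ; b++) {
            if (mad(a, b, m, code1) != mad(a, b, m, code2)) return false ;
        }
    }
    return true ;
}
```

With m = 7, the codes 10 and 31 differ by 21 = 3 × 7, so alwaysCollide(7, 10, 31) is true: no choice of a and b separates them. The codes 10 and 11 do not have this problem. Changing the multiplier or the table size is what breaks the tie.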

In reboot(), print out the new values for m_tsize, m_multiplier, m_factor and m_shift. But do this only in "debug mode", selected by m_debug, so we do not get thousands of lines of output when we run larger test cases.

After implementing findPrime() and reboot(), your code should compile with the first two test programs, p5test1.cpp and p5test2.cpp.

If you run your programs on machines other than GL, the output may be different because the implementation of the rand() function is not standardized. Notice in the output of p5test2.cpp that about a third of the slots are empty and roughly a quarter of the slots have collisions. The average number of collisions in a non-empty slot is only around 1.6. That means we would expect to have small secondary hash tables.


Step 2: SecondaryHT

Now that your HashFunction functor class is working, you can use it in the SecondaryHT class. The HashFunction class doesn't actually allocate any space to hold data. It is only responsible for providing the hash function. The SecondaryHT class, on the other hand, is responsible for creating a collision-free hash table. So, it must allocate space to store strings and it must make sure there are no collisions.

As before, several member functions of the SecondaryHT class have been implemented for you already. For example, the isMember() function returns whether a string is stored in the table:

    // returns whether given word is in this hash table.
    //
    bool SecondaryHT::isMember (const char *word) const {

       int slot = hash(word) ;

       assert ( 0 <= slot && slot < hash.tableSize() ) ;

       if (T2[slot] == NULL) return false ;

       if ( strcmp(word, T2[slot]) != 0 ) return false ;

       return true ;
    }

The constructor for the SecondaryHT class must set up T2, an array of char * strings. A SecondaryHT object must have its own copies of these strings. The easiest way to copy a C string is to use the strdup() function from the cstring library. The strdup() function will allocate space and copy the string. The catch is that strdup() comes from the C library and uses malloc() instead of new. When you deallocate a string that was created by strdup(), you have to use free() to deallocate the memory instead of delete. (Just type "free(str)" instead of "delete str".)
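Here is a minimal sketch of the strdup()/free() pairing. The helper names copyWord and releaseWord are hypothetical; you do not need helpers like these, but they show which call goes with which.

```cpp
#include <cassert>
#include <cstdlib>   // free
#include <cstring>   // strdup (POSIX), strcmp

// Hypothetical helper: returns a heap copy of str made with strdup().
// The caller owns the copy and must release it with free(), not delete.
char *copyWord(const char *str) {
    assert(str != NULL) ;
    return strdup(str) ;
}

// Releases a string obtained from strdup() and clears the pointer.
void releaseWord(char *&str) {
    free(str) ;      // free(NULL) is a no-op, so empty slots are safe
    str = NULL ;
}
```

Setting the pointer back to NULL after free() is optional, but it makes an accidental second release harmless.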

Your assignment for the SecondaryHT class is to implement the constructor, copy constructor, destructor and assignment operator. The last three functions are straightforward. The constructor has the signature:

    // Create a secondary hash table using the char * strings
    // stored in "words". Makes copy of each item in words.
    //
    SecondaryHT::SecondaryHT(vector<const char *> words) {

       //
       // Keep trying until a table with no collisions is found
       //
       // wrap debugging statements in "if (m_debug) { ... }"
       //
       // remember that calling HashFunction::reboot() can change
       // the table size!
       //
       // use strdup to copy char * strings
       //
    }

Here words is a vector of const char * strings. The constructor must check whether there are any collisions using the current hash function on the strings in words. Remember that the secondary hash table in the perfect hashing scheme tries to hash n strings into a table with n² slots. We expect that half of the time there will not be any collisions, but if there is a collision then the hash function has to be rebooted. If there are more than maxAttempts reboots, then throw a very_unlucky exception. For efficiency, you should not allocate space for T2 and copy strings over until after you have determined that the current hash function has no collisions. Finally, remember that the table size might change after a hash function reboot.
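One way to test for collisions before allocating T2 is to mark slots as used in a temporary array of flags. The sketch below works on slot numbers that have already been computed; hasCollision is a hypothetical helper, and other approaches work just as well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: given the slot computed for each word and the
// current table size, reports whether any two words share a slot.
bool hasCollision(const std::vector<unsigned int>& slots,
                  unsigned int tableSize) {
    std::vector<bool> used(tableSize, false) ;
    for (std::size_t i = 0 ; i < slots.size() ; i++) {
        assert(slots[i] < tableSize) ;
        if (used[ slots[i] ]) return true ;   // slot already taken
        used[ slots[i] ] = true ;
    }
    return false ;
}
```

Note that the flags must be sized from the current table size on every attempt, since a reboot can change it.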

After implementing the SecondaryHT constructor, check it against these test programs:

After your SecondaryHT constructor passes these tests, write the copy constructor, destructor and assignment operator for SecondaryHT. Then, use valgrind and the following test program to check for memory leaks.


Step 3: PerfectHT

Finally, complete the implementation of the PerfectHT class. Recall from the discussion above that, for efficiency, the primary hash table separates the case when just one string is hashed to a slot from the case when there is a collision. To handle this, there are two arrays for the primary hash table. PHT1 is just an array of char * strings. If there is only one string in slot i, then PHT1[i] holds a copy of that string. If there are no strings in that slot or when there are two or more strings in that slot, PHT1[i] should be NULL. When there is a collision then PHT2[i] is a pointer to a SecondaryHT that holds the strings. To see this better, look over the implementation of isMember():

    // Returns whether word is stored in this hash table.
    //
    bool PerfectHT::isMember(const char * str) const {
       int slot = hash(str) ;

       if (PHT1[slot] == NULL && PHT2[slot] == NULL) return false ;

       if (PHT1[slot] != NULL) return strcmp(str,PHT1[slot]) == 0 ;

       return PHT2[slot]->isMember(str) ;
    }

For the PerfectHT class, you have to implement the constructor, copy constructor, destructor and assignment operator. The constructor is declared as:

    // Create a Perfect Hashing table using the first n strings
    // from the words array.
    //
    PerfectHT::PerfectHT (const char *words[], int n) {

       // Implement constructor for PerfectHT here.
       //
       // You will need an array of vectors of char * strings.
       // Something like;
       //
       //    vector<const char *> *hold = new vector<const char *>[m] ;
       //
       // Each hold[i] is a vector that holds the strings
       // that hashed to slot i. Then hold[i] can be passed
       // to the SecondaryHT constructor.
       //
    }

The parameter passed to the PerfectHT constructor is an array of char * strings and an int value n that specifies the length of the array. The string array is defined in the main program and holds literal strings. Hence, the const designation. This requirement to work with const values propagates throughout the class declarations for the PerfectHT, SecondaryHT and HashFunction classes. That is why you see a lot more const designations for the parameters and member functions.
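The "array of vectors" idea from the constructor comments can be sketched as follows. This version uses a vector of vectors instead of new[], and a toy hash so that the block compiles on its own; groupBySlot and FirstCharMod3 are hypothetical names that do not appear in the project files.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: distributes words into m buckets, where bucket i
// holds every word that the function-like object h maps to slot i.
template <typename Hash>
std::vector< std::vector<const char *> >
groupBySlot(const char *words[], int n, Hash h, unsigned int m) {
    std::vector< std::vector<const char *> > hold(m) ;
    for (int i = 0 ; i < n ; i++) {
        unsigned int slot = h(words[i]) ;
        assert(slot < m) ;
        hold[slot].push_back(words[i]) ;   // pointer only; no copy made yet
    }
    return hold ;
}

// Toy stand-in hash for illustration only: first character modulo 3.
struct FirstCharMod3 {
    unsigned int operator()(const char *s) const {
        return static_cast<unsigned char>(s[0]) % 3 ;
    }
} ;
```

After grouping, the size of each bucket decides what happens to that slot: an empty bucket leaves both PHT1[i] and PHT2[i] NULL, a bucket of one string goes into PHT1[i], and a bucket of two or more strings is handed to a SecondaryHT.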

After you have implemented these functions for PerfectHT, test them against the following programs. Run them in valgrind to check for memory leaks.

Then you are done!

Implementation Notes

Most of the usual warnings against pitfalls are sprinkled in the discussion and directions given above. Here are some general remarks:


All the files in one place

On GL, you can also copy the files from the directory: /afs/umbc.edu/users/c/h/chang/pub/www/cs341.f17/projects/proj5files/ (Also available zipped up: cmsc341proj5.zip.)


What to Submit

Before submitting, remove all extraneous output from your program or wrap it in if (m_debug) { ... }.

You must submit the following files to the proj5 directory: PerfectHT.h and PerfectHT.cpp.

Please do not submit other files. Make sure your project works with only these two files submitted.

If you followed the instructions in the Project Submission page to set up your directories, you can submit your code using this Unix command:

cp PerfectHT.h PerfectHT.cpp ~/cs341proj/proj5/

Your code should compile with the test programs using these Unix commands on GL:

    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test1.cpp -o t1.out
    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test2.cpp -o t2.out
    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test3.cpp -o t3.out
    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test4.cpp -o t4.out
    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test5.cpp -o t5.out
    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test6.cpp -o t6.out
    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test7.cpp -o t7.out
    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test8.cpp -o t8.out
    g++ -I ../../00Proj5/ -I . PerfectHT.cpp ../../00Proj5/p5test9.cpp -o t9.out

Good luck with your final exams and catch up on some sleep over Winter Break!