Project 5: Hash Table Strategies & Performance

Due: Tuesday, May 15, 8:59:59pm


Addenda

Additions and changes are highlighted in orange throughout.

Objectives

The objective of this programming assignment is for you to gain experience working with hash tables and measuring execution times

Background

Okay, no more silly, condescending made-up stories about dumb managers that you have to fool. Prof. Park is your manager now, and he's no dummy. He decides to immediately set about evaluating your ability to do clean implementations and rigorous empirical evaluations of your latest course topic: hash tables. No more joking around--coding is a serious matter. Well, okay, one joke, but that's it--make it last:

A man walks into a bar. He asks the bartender for three beers, all at the same time. He proceeds to take turns drinking some from each glass, until they're all empty. He then gets up and leaves. The bartender thinks this is weird, but he's seen weirder. Some UMBC students came in the week before...

The next week, the man comes in again and does the exact same thing: always three beers, always drinking all of them simultaneously, then quietly leaving. This goes on week after week for months. Finally, one night the man comes in and orders only two beers. The bartender can't contain his curiosity and asks: why the multiple beers? The man explains that he has two brothers who have moved far away, but they agreed they would all stay spiritually connected by "sharing a drink" this way every Saturday night at 9, wherever they may each be. The bartender immediately gets the significance of the two beers and says: "I'm so sorry for your loss." The man asks "why??". The bartender answers: "Oh, I'm assuming one of your brothers must have died." "Ah, no", the man replies; "my brothers are still alive--I've just given up drinking."

Okay, back to business.

Your assignment is to implement and compare the performance of two different types of hash tables. You must implement one version that uses hash buckets with chaining, and a second version that uses open addressing with linear probing. Recall that these two kinds of hash tables have fundamentally different approaches to dealing with collisions: Chaining uses a more elaborate data structure for the hash buckets that allows us to store multiple items in each bucket. This leads to additional implementation complexity, depending on what data structure you use to implement each bucket. What it loses in terms of complexity it makes up for by the fact that a handling a collision amounts to searching through a list of exactly those items that hashed to the identical bucket, but no others. An additional benefit is that you can actually have a load factor greater than one: in other words, you can store >N items in an N-bucket hash table.

Open addressing, on the other hand, stores only a single item in each bucket, and deals with collisions by trying alternative buckets in an ordered fashion until it finds one that is empty. Data structure-wise, it is much simpler to implement. However, it has some performance issues when dealing with collisions: first, it might "collide" with earlier values that initially hashed to a different initial index but wound up landing in the same bucket that the current value wants to be in. Then, when we probe other alternatives, it again might collide with any number of other values that ended where they are through various means. Then, for the more simplistic probing strategies, there is the matter of clustering, which leads to an increased number of probe attempts. Then, there is the matter of having to implement lazy deletions. To make life simpler for you, for this project, the open addressing implementation will use a very simple linear probing strategy for handling collisions.

Since theoretical analyses of the performance of these hash tables is complex, you are going to compare how the two above designs perform by doing an empirical evaluation, adding code to your implmentation to measure runtimes for various operations. We will show you how to make time measurements; it is your responsibility to decide how to use these tools to actually profile the performance of your implementation.

One more design detail: Since the two types of hash tables are closely related, and have the exact same external interface, you will be implementing them as derived classes of a common base class. In other words, you will be using polymorphism to allow us to use the same driver code to test out both implementations, by using a common base class pointer but having them point to hash table objects of different subclasses. Additionally, we want your hash tables to be able to store arbitrary data types, which means you must define them as templated classes. Then, since it is not possible to have an effective hash function that works universally across different types, one of the templateconstructor parameters will be a hash function appropriate for the type, computing a hash code that is an unsigned int. You will then convert this into a hash table index using the division method.


Your Assignment

Your assignment is to implement a hash table class hierarchy that consists of a base class that defines the basic interface for a hash table, as well as two derived classes: one that implements a hash table with chaining, and another version that uses open addressing. The base class definition declares the function prototypes for the various member functions a hash table needs, like insert(), find() and remove(). Of course, these will not be defined in the base class, but instead in the specialized subclasses you will be creating. This allows polymorphism to work. (Review the concept from your Comp Sci II notes if you forgot.) Note that other than the interface definitions, there is little else we can put in the base class, since almost every other part, even the underlying hash table array, depends on what type of collision strategy we're using.

We are providing the base class definition to you. You should not change any part of those files, nor should you submit it. Here is the definition of the class from HashTable.h:

#ifndef HASHTABLE_H #define HASHTABLE_H #include <vector> template <typename T> class HashTable { public: virtual ~HashTable() {}; // Functions in a standard hash table interface, // independent of implementation: virtual bool insert(const T &data) = 0; virtual bool find(const T &data) = 0; virtual T remove(const T &data, bool &found) = 0; // Functions for debugging and grading: virtual void dump() = 0; virtual int at(int index, std::vector<T> &contents) = 0; // Place to store pointer to hash code generating function unsigned int (*hashFunc)(const T &data); // There are no other data members that the different derived classes // would have in common }; #endif

Note that the base class--and also the derived classes--must be templated. This is because we want the flexibility to be able to store any type of data in our hash tables, not just some fixed type like ints or strings. Unfortunately, this makes it difficult to have a generic hash function defined inside the class, since we don't even know the type of the key. We resolve this by having the user provide an appropriate hash function at the time the hash table is constructed, which we then store a pointer to in our class (in the "(*hashFunc)(T &data)" member as seen above).

Another complication of being type-agnostic is that the user of our hash tables must make sure the type being stored has an overloaded operator== (and operator!=) so that we can tell when we've found the desired item in the hash table. Note that this is especially true if the user wants to store pointers-to-objects in our hash table. Then, the user-provided overloaded comparison function would dereference the pointer first and compare the actual data being pointed to. The standard pointer equality operator would merely test whether the pointed-to addresses are the same. You can see an example of this in the first test program we provide, test1.cpp. Note that we simply want to know if/when we've found the specific item in the table, nothing fancier, so we do not need any of the other comparison operators (like '<', and '>') to be overloaded.

The ChainHashTable class

The first concrete subclass of the HashTable class that you will be defining is the ChainHashTable class. This is a version that uses chaining to handle collisions. That means that the hash table is a bucket array where every bucket can hold a collection of values. As discussed in class, any of a number of different data structures could be used to store the items for each bucket. For this project, we are requiring that you use the STL singly-linked list: the forward_list class. The forward_list class only allows simple insertion at the head of the list, but this is fine for our needs; when you add new elements to a bucket, make sure you add it to the head of the list. To search the forward_list to look for an item, and to subsequently remove the item if desired, you must use the forward_list's iterator. std::list class. When you add new elements to a bucket, you must add it to the head of the list, using the push_front() member function. To search the list to look for an item, and to subsequently remove the item if desired, you must use the std::list's iterator. You should be very familiar with iterators by this point.

There isn't much else to the ChainHashTable class. At construction time, the user provides a pointer to a hash function appropriate to generate a hash code for the data type. Your hash table operations use this to generate a hash code, and then uses the division method (simple modulo division by the size of the hash table) for compression to generate an index into the hash table. It then adds the item to the bucket (i.e., forward_ list) at the index. Note that the hash table should be initialized to an array of empty forward_ list's at construction time, so that you can start adding items immediately after construction.

The ProbeHashTable class

The second subclass of the HashTable class that you will be defining is the ProbeHashTable class. This is a version that uses open addressing to handle collisions. That means that the table is an array where each member only holds at most a single value. You might be tempted to therefore declare the hash table as a simple array of elements of type T (the templated type parameter), but this has a problem. First, you need a way to mark empty buckets. Also, a hash table that uses open addressing cannot just free the bucket up by marking it as empty when an item is deleted: it must mark that slot as a special lazy-deleted bucket; otherwise linear probing won't work for later searches. (Review your notes on how lazy deletion works.) So, it must have two special unique values that it can use to mark empty buckets and lazy-deleted buckets, respectively, and these must be distinct from any actual data value the user might insert. However, since you don't know the type of T or what the user is using it for, you cannot know what two values are free for you to use. So, the cleaner, but slightly space-intensive, solution you will use is to define a helper class called HashTableEntry that has two data members:m_flag and m_data. The m_flag member is of type int, and has one of 3 possible values at any given time: 0 if that bucket is empty, -1 if that bucket is marked as lazy-deleted, and 1 if that bucket currently holds a value. (Obviously, you should use named constants instead of these magic numbers.) The m_data member is of type T, and holds the value the user inserted, if m_flag == 1; else, it is garbage. The bucket array is then declared as an array of elements of type HashTableEntry. You should define this HashTableEntry helper class as a nested class of ProbeHashTable, and you should make all of its data members public. It does not need any member functions: you are using it just as a struct.

Like the ChainHashTable, the ProbeHashTable should use the hash function and the division method to generate an starting index into the hash table. In the case of a collision, it should use simple linear probing to look for the first empty or lazy-deleted bucket and insert the data item there. Note that the hash table should be initialized to an array of HashTableEntry objects, each with its m_flag member initialized to 0, for "empty", so that you can start adding items immediately after construction.

Measuring Execution Times

An important part of this project is computing average execution times for the various operations on the hash table. To do that, we are providing a function called getCurrentCPU(), in the files getCurrentCPU.{h,cpp} which returns the cummulative amount of CPU time (measured in some defined units) that your program has used so far. By calling this function immediately before, and then just after a series of hash table operations, you can get a measure of how much CPU time those operations took, and you can use that to compute the average time per insertion, deletion, etc. We give you an example of its usage in test2.cpp; you can just adapt that to do your own performance measurements.



Requirements

The following are some firm requirements, i.e., necessary for grading. (When we refer to "both hash table classes" here, we are stating the requirement applies to both the ChainHashTable and ProbeHashTable classes.)



Additional Specifications

Here are some additional specifications for the ChainHashTable and ProbeHashTable class member functions that you have to implement. You will definitely want to add other data members and member functions to your derived hash table classes, but the prototypes for the member functions listed here should not be changed. You may create additional classes if you prefer, but all class declarations should be placed in one of your submitted .h files (not HashTable.h!) and all implementations should be placed in the corresponding .cpp file.


Implementation Notes


Test Programs

Note: this whole section is new.

The provided sources (HashTable.h, getCurrentCPU.{h,cpp}, words.h, test programs, sample output) are all available in the GL directory /afs/umbc.edu/users/p/a/park/pub/www/cs341.s18/projects/proj5files. You can use the "cp" command to copy these to your own directory.

The following programs should be compiled using

g++ -g testX.cpp getCurrentCPU.cpp -o testX.out where testX.cpp is one of test1.cpp, test2.cpp, test3.cpp, etc.. Run these programs under valgrind to check for memory leaks and memory read/write errors.