Project 5: Anagram Hashing

Due: Tuesday, May 14, before 9:00 pm


Addenda


Objectives

The objective of this programming assignment is to practice implementing hash functions and hash tables.


Background

Hash collisions are bad, right? Well, maybe not always. For this project you are going to use a hash table to quickly identify anagrams; to do this, you need a hash function that collides when the two inputs are anagrams (good collisions) but doesn't collide too often when the inputs are not anagrams (bad collisions). For example, if the hash function is H, then it must be that H(“parsley”) and H(“players”) are equal, but it would be good if H(“earned”) and H(“canters”) are not.

Anagramming is slightly more subtle than it might first appear. In particular, capitalization and spaces do not count when determining if two words or phrases are anagrams. For example, the following should all have the same value:

H("Sesame Street") H("Tesserae Stem") H("Teases Me Set")

Think of it this way: if you convert each of the phrases to upper case and remove the spaces, they would comprise exactly the same letters; thus they are anagrams.

Where would one even begin to find such a hash function? Start with the textbook. An additive hash code has the property that simple anagrams (“pots”, “stop”, “tops”) hash to the same value.

The hash table is an array of sets of strings (set<string> *m_ht) with the table m_ht allocated dynamically by the constructor. A location in the hash table, say m_ht[h] contains strings which are all anagrams of each other and have hash value h.

What if a bad collision occurs: there is some string S with hash value h but S is not an anagram of the entries in mt_h? Then we choose a different location at which to save S using linear probing.

As a small example, suppose we insert the strings “canters”, “scanter”, and “replays”, all of which hash to three, and that our hash table has size six. First we insert “canters” by computing it's hash (3) and, checking that the set at index three is empty, insert it into the set:

m_ht[0] (empty) m_ht[1] (empty) m_ht[2] (empty) m_ht[3] { "canters" } m_ht[4] (empty) m_ht[5] (empty)

Now, to insert “scanter”, we compute it's hash (3), but we see that the set at index three is non-empty. We have to check that “scanter” is actually an anagram of an element of the set using a secondary test. A secondary test is an exact but more computationally expensive test that we use when hashing alone can't give us a definite answer: we can't be sure that the two strings are anagrams even though they have the same hash value since it could just be a “bad” collision. It is up to you how to implement your secondary test. With this example, we can see that “canters” and “scanter” are anagrams, so we just insert “scanter” into the list at index three:

m_ht[0] (empty) m_ht[1] (empty) m_ht[2] (empty) m_ht[3] { "canters", "scanter" } m_ht[4] (empty) m_ht[5] (empty)

Finally we need to insert “replays”. It also has hash value three, but our secondary test will show that it is not an anagram of “canters”, and so does not belong in the list at index three. We use linear probing to determine where to save the string which, in this case, is at index four:

m_ht[0] (empty) m_ht[1] (empty) m_ht[2] (empty) m_ht[3] { "canters", "scanter" } m_ht[4] { "replays" } m_ht[5] (empty)

In general, linear probing makes removal from hash tables more difficult, but for this project, we never remove anything, so we don't have to worry about that.

The end result after inserting a set of strings is a hash table in which:

  1. If an entry m_ht[h] is non-empty, then it contains a set of anagrams.
  2. If two strings S1 and S2 are anagrams, then they are in the same set m_ht[h] for some index h.

That is, our hash table partitions the input strings into sets of anagrams.


Your Assignment

Your assignment is to complete the anagram hashing class Anahash. To get you started, you are provided with the following files:

First, you must design and implement a hash function that always gives equal hash values for strings that are anagrams of each other but minimizes the number of “bad” collisions. It may be easier to design and test the hash function outside of the class. Once you have a hash function, you can implement the hash tables operations (described below).


Specifications

Requirement: Private helper functions must be declared in Anahash.h. No other modifications to Anahash.h are permitted!

Requirement: You must use a hash table data structure to store and organize the anagrams. The entries of the hash table must be sets of strings. The size of the hash table can be no more than HT_SIZE_LIMIT, a constant defined in Anahash.h.

Requirement: Two strings S1 and S2 are anagrams if they consist of exactly the same letters after conversion to upper case and removal of spaces. However, strings must be stored in the hash table with their orginal capitalization and spaces.

Requirement: Linear probing must be used to resolve “bad” collisions.

Requirement: Each non-empty entry m_ht[h] must contain strings which are anagrams of each other. If two strings S1 and S2 are anagrams, then they must be stored in the same entry m_ht[h] for some h.

Requirement: You may use the STL set container and the sort method of the algorithm library. No other STL containers or algorithms may be used. However, you may use standard types and function libraries such as string and cstring.


These are the member functions of the Anahash class.


Test Programs

The following program may be used as a basic test of your implementation. Remember, it is your responsibility to thoroughly test your implementation.

These and all the provided project files are available on GL:

/afs/umbc.edu/users/c/m/cmarron/pub/www/cs341.s19/projects/proj5files/


Implementation Notes


What to Submit

You must submit the following files to the proj5 directory.

The main function in Driver.cpp should exercise your Anahash functions and show what your program functions correctly.

If you followed the instructions in the Project Submission page to set up your directories, you can submit your code using this Unix command command:

cp Anahash.h Anahash.cpp Driver.cpp ~/cs341proj/proj5/