Project 4, Median Heaps
Due: Saturday, November 17, 8:59pm
Objectives
The objective of this programming assignment is for you to gain experience working with binary heaps, templates and function pointers.Background
The premise of this project is that you are asked to design a data structure for keeping track of three statistics: minimum, maximum and median. The data structure has to support efficient insertion, and it needs to also support search and remove, though these do not have to be efficient at all.
These requirements can be handled easily by using two binary heaps: a max heap and a min heap. Recall that a max heap has the largest value at the root and a min heap has the smallest value at the root. Our intention is to store half of the numbers in the max heap and the other half in the min heap. Furthermore, we want to have all of the numbers in the max heap be smaller than the numbers in the min heap. In such an arrangement, the median will be either the root of the max heap or the root of the min heap. (Take a moment now to digest that the smaller numbers go in the max heap.)
When we insert more items in this data structure, we can compare the item with the current median. If it is smaller than the current median then we would store it in the max heap. Otherwise, it should go in the min heap. Since insertion in a binary heap is O(log n) time, this satisfies our requirement for efficient insertion.
Now, this can mess up our balance — the sizes of the max heap and min heap should never differ by more than one. To "rebalance" our heaps when one heap has two more items than the other heap, we simply remove the root of the larger heap and insert it in the other heap. Since removing the root and inserting take only O(log n) time, we continue to maintain a O(log n) running time.
In this scheme, we can actually report the median item in O(1) time since the median item is one of the roots. We also want to be able to report the minimum and maximum item in O(1) time. This can be accomplished by simply remembering which items are the smallest and largest seen so far. When you insert an item, check to see if it is the new minimum or the new maximum. These scheme works until you have to remove something from the data structure.
Binary heaps do not handle search very well. If you have to look for something in a binary heap, you cannot do better than O(n) time because a binary heap is not a binary search tree. The remove operation we want to support in this data structure looks for a particular item and deletes it from the heap. The search takes O(n) time but the actual deletion from the heap takes O(log n) time. These are pretty straightforward operations. The complication is that whenever you remove an item from our data structure, you have to check if you happen to have removed the minimum or the maximum item. If that is the case, you have to go and find the new minimum or new maximum item, so that the data structure can continue to report the minimum and maximum in O(1) time. This search for new minimum/maximum will take O(n) which should be considered very slow. Anyone using this data structure should be told that while the remove operation is supported, it should be used very sparingly.
Those are the only operations you need to support: insert, remove, get median, get minimum and get maximum (plus the usual C++ constructor, destructor, ...) The "catch" is that your implementation must be templated and you are required to use only one binary heap implementation that works for both min heaps and max heaps. To do this, you have to work with function pointers.
In C/C++, you can have pointers to functions. You can then use a function pointer to invoke the function. The tricky part about function pointers is actually specifying the types. Function pointers can only point to functions of a specified signature. So, when you declare a function pointer, you have to say "this pointer points to a function that takes one int parameter and returns an int". This is accomplished by the following syntax:
Here's a small program that demonstrates the use of function pointers
For this project, you want to have function pointers that compare two items. The idea is that code for a min heap and a max heap are identical except you need to know what "less than" means. So, if you have a function pointer that points to a function that does the comparison, you can write the same code that works for both a min heap and a max heap. The declaration of a function that takes two int values and returns a bool would look like:
Assignment
Note: Running time is one of the most important considerations in the implementation of a data structure. Programs that produce the desired output but exceed the required running times are considered wrong implementations and will receive substantial deductions during grading.
Your assignment is to implement the MedianHeap data as described above. No header files are provided so you are free to design your own C++ classes, subject to the requirements below.
- The name of the class must be MedianHeap.
- You must also have a separate class called Heap.
- The MedianHeap class must use two Heap data members: a min heap and a max heap.
- The Heap implementation must be array based and start at index 1. (Just leave index 0 unused.)
- Both MedianHeap and Heap must be templated and work with base C++ types (e.g., int and float) and any well-implemented classes.
- The Heap class must store function pointers to comparison functions and use the function pointers to invoke functions that compare the data items. In particular, it should not assume that the template parameters are instantiated with types that have <, <=, ==, >= and > defined.
- No STL container classes may be used in this programming project. In particular, you may not use vector instead of an array in your Heap class.
Specific Requirements
Your MedianHeap class must have these member functions. There are additional requirements for your Heap class (other than the ones already specified above), since external code should not be invoking the Heap class member functions directly.-
Your class definitions must be in a single file called
MedianHeap.h (case sensitive). Client programs
that use the MedianHeap class should be able to
do so simply using:
#include "MedianHeap.h" -
The MedianHeap class must have a constructor with the
signature:
template First two parameters are function pointers that compare items of type T. The first function returns true if the first parameter is less than the second parameter. The second function returns true if the first parameter is greater than the second. The constructor must create a MedianHeap object capable of holding cap items. The object created by the constructor must be useable immediately.MedianHeap ::MedianHeap( bool (*lt) (const T&, const T&), bool (*gt) (const T&, const T&), int cap=100 ) ; -
The MedianHeap class must have a copy constructor with the signature:
template The constructor must create a copy of the MedianHeap object given in the parameter. The copied object must have its own allocated memory.MedianHeap ::MedianHeap(const MedianHeap & otherH) ; It is possible that your class design allows your MedianHeap to use the compiler-supplied copy constructor. In that case, you do not have to write a explicit copy constructor, but it is still your responsibility to make sure that the compiler-supplied copy constructor works correctly.
-
The MedianHeap class must have a destructor with the signature:
template The destructor must deallocate any dynamically allocated memory.MedianHeap ::~MedianHeap() ; It is possible that your class design allows your MedianHeap to use the compiler-supplied destructor. In that case, you do not have to write a explicit destructor, but it is still your responsibility to make sure that the compiler-supplied destructor works correctly.
-
The MedianHeap class must have an overloaded assignment operator with the signature:
template The assignment operator must deallocate memory in the host object and copy the rhs MedianHeap into the host.const MedianHeap & MedianHeap ::operator=(const MedianHeap & rhs) ; It is possible that your class design allows your MedianHeap to use the compiler-supplied assignment operator. In that case, you do not have to write a explicit assignment operator, but it is still your responsibility to make sure that the compiler-supplied assignment operator works correctly.
-
The MedianHeap class must have a member function called size() that returns the total number of items in the MedianHeap. The size() function must have the signature:
template int MedianHeap ::size() ; The size() function must run in constant time to receive full credit.
-
The MedianHeap class must have a member function called capacity() that returns the maximum number of items that can be stored in the MedianHeap. The capacity() function must have the signature:
template int MedianHeap ::capacity() ; The capacity() function must run constant time to receive full credit.
-
The MedianHeap class must have a member function called insert() that adds the item given in the parameter to the MedianHeap. The insert() function must have the signature:
template If the MedianHeap cannot hold any more items (i.e., it has reached capacity), then insert() should throw an out_of_range exception.void MedianHeap ::insert(const T& item) ; The insert() function must run in O(log n) time to receive full credit.
-
The MedianHeap class must have member functions called getMedian(), getMin() and getMax() that returns a copy of an object with the median, minimum and maximum keys, respectively. These functions should have the signatures:
template These functions should throw an out_of_range exception if the MedianHeap is empty. For full credit, these functions must run in constant time.T MedianHeap ::getMedian() ; template T MedianHeap ::getMin() ; template T MedianHeap ::getMax() ; Note that in computer science, the median value is always one of the stored values. In case the number of items in the MedianHeap is even, either one of the two objects closest to the middle is acceptable as the median value. Do not average these two values. (If the type T does not have the + operator defined, you cannot add them anyway.)
-
The MedianHeap class must have a member function called deleteItem with the following signature:
template The deleteItem() function should look for an item stored in the MedianHeap that is equal to the givenItem parameter according to the equalTo() function. The equalTo() function does not have to determine equality based on the key value used by the MedianHeap. For example, in the City class given in the test programs below, the MedianHeap might be keyed on the population of the cities, but we may invoke the deleteItem() function to remove a city with a particular name. In that case, the equalTo() function might compare the name and state of the city. You may not assume that givenItem has a valid entry in every field. If we are removing a city based on its name, the population field might have a garbage value. So, you will need to check both the min heap and the max heap to see if there is an item that matches.bool MedianHeap ::deleteItem(T& givenItem, bool (*equalTo) (const T&, const T&) ) ; If deleteItem() successfully finds and removes an item, it should return true, and copy the values of the found item in the givenItem parameter. (That is why the parameter is passed by reference.) In the City class example, you should be able to delete "Anchorage, AK" from the MedianHeap and obtain its statistics (latitude, longitude, population and elevation) from the reference parameter.
If deleteItem() does not find an item to remove, it should return false. The MedianHeap data structure should not be modified in this case. The parameter givenItem should also not be modified.
If the MedianHeap is empty, the deleteItem() function should throw an out_of_range exception.
The deleteItem() function must run in O(n) time to receive full credit.
-
The MedianHeap class must have a member function called dump that prints out the contents of the MedianHeap including the positions of each key in the max heap and the min heap. See formatting examples in the sample output below. The dump() function must have the signature:
template Calling dump() should not modify the data structure in any way.void MedianHeap ::dump() ; -
The MedianHeap class must have four member functions used for grading with the following signatures:
template These functions are used for grading purposes. So, you should make sure that they are implemented in any submission.int MedianHeap ::maxHeapSize() ; template int MedianHeap ::minHeapSize() ; template T MedianHeap ::locateInMaxHeap(int pos) ; template T MedianHeap ::locateInMinHeap(int pos) ; The functions maxHeapSize() and minHeapSize() should return the number of items in the max heap and min heap. It should be the case that the sum of these two sizes equals the value returned by MedianHeap::size(). (I.e., don't have any items, such as the median item, set aside and not stored in one of the two heaps.)
The functions locateInMaxHeap() and locateInMinHeap() returns a copy of the item in position pos in the max heap and min heap, respectively. Remember that you are required to store the items starting at index 1. (The slot in index 0, is unused.)
If the pos parameter has an invalid index, these functions should throw an out_of_range exception.
Your code must run without segmentation faults and without memory leaks. For grading purposes, memory leaks are considered as bad as segmentation faults. This is because many segmentation faults are cause by poorly written destructors. A program with an empty destructor might avoid some segmentation faults but will leak memory horribly. Thus, not implementing a destructor or not deleting unused memory must incur a penalty that is equivalent to a segmentation fault.
Test Programs
Here are some test programs for checking the correctness of your code. As usual, you should also do your own testing. Your programs should compile and run with the provided tests without seg faulting or leaking memory. However, good runs with these programs do not guarantee that your code is bug free.
Some of the code use the City class, defined and implemented: City.h and City.cpp.
A simple test using MedianHeap<int>:
p4test1.cpp and p4test1.txt.A bigger test using MedianHeap<int>. This version also runs the sanityCheck() function:
p4test2.cpp and p4test2.txt.A simple test using MedianHeap<City> (compile with City.cpp):
p4test3.cpp and p4test3.txt.This test shows that your MedianHeap class can handle function pointers. One of the MedianHeaps is keyed on the population of the cities, while the second one is keyed on the elevation.
Similar to the previous case, except this program uses City pointers (i.e., MedianHeap<City *>:
p4test4.cpp and p4test4.txt.A bigger test using the City class. This program expects to receive data in standard input. Test data are in:
CityData10.txt, CityDataMD.txt and CityDataUS.txt.
p4test5.cpp and p4test5.txt.You can run the program in Unix this way:
linux1% g++ City.cpp p4test5.cpp -o t5.out linux1% ./t5.out < CityData10.txt linux1% ./t5.out < CityDataMD.txt linux1% ./t5.out < CityDataUS.txt Testing the copy constructor, destructor and assignment operator. You should run this version in valgrind to check for memory leaks.
p4test6.cpp and p4test6.txt.Timing runs. This program takes a command line argument that specifies the number of repeititions of the timing runs.
p4test7.cpp and p4test7.txt.
Implementation Notes
Here we list some recommendations and point out some traps and pitfalls.
- Template implementations can either be in a separate file called MedianHeap.cpp or they can be placed in MedianHeap.h after the templated class definitions. (There are pros and cons for each style.) Choose whichever style you prefer, but do recall that in the first method you have to #include the .cpp file at the bottom of the .h file and that the .cpp file has to be guarded.
- Please review how to implement templates. Do this before you start coding, and certainly do this before you post questions on Piazza.
- Do not define Heap as a nested class within MedianHeap. Yes, it actually makes sense, but the C++ syntax for templated inner classes are horrendous. Just don't.
- You should think about which data members and member functions belong to MedianHeap and which ones belong to Heap. Good design will make your coding much simpler. Simpler code is easier to debug.
- Remember that we are not using the array slot in index 0. Make sure that you allocate the correct amount of space for the min heap and the max heap arrays. Also, note that one of the arrays will be bigger than the other one if there's an odd number of items in the MedianHeap.
-
Potential Complex Edge Case to Watch For:
If the dataset consists of a set of items with the same key value
(which is completely allowed in our project), a situation can
arise that might cause you to report the min item and max item
incorrectly:
Suppose that we are using the City class and we have data from some county in Kansas where all of the cities have the same elevation. Depending on how you keep track of the min elevation and how you break ties, it is possible that the min item was the root of the max heap which then gets moved from the max heap to the min heap (to rebalance heap sizes) and then gets removed from the min heap by the deleteItem() function.
In that situation, you might end up reporting a city as having the min elevation that is not a city that is still in the MedianHeap. This is a very unusual situation because the entire max heap will have to have the same elevation in order for the max item to have the min elevation.
You can fix this situation by making "deleteRoot" of the max heap check to see if happens to equal the min item and make the last item of the heap the new min item, because the only way that can happen is if the entire heap has the same elevation.
Here's an example program that illustrates this situation. It is tuned to expose the bug in a particular implementation. If you want to see if you have the same bug, you have to change the program around.
- Main program that adds 10 cities, all with eleveation 50: obscure.cpp.
- In the buggy version, the output shows that "Town 1" gets moved from the min heap to the max heap and then back to the min heap. Town 1 eventually gets deleted, but is still reported as the min and max items: buggy.txt.
- The fixed version detects when the root of a max heap happens to be the min value (and when the root of a min heap happens to be the max value): debugged.txt.
What to Submit
Before submitting, remove all extraneous output statements from your program. (Just comment them out.) Your typescript file should have only a few hundred lines of output; if they have significantly more than that, you're doing something wrong, and We will likely not look at the typescript file (and you will therefore not get credit for it!).
You must submit the following files to the proj4 directory.
- MedianHeap.h
- MedianHeap.cpp
If you followed the instructions in the Project Submission page to set up your directories, you can submit your code using this Unix command command.
Your code should compile with the test programs using these Unix commands on GL:
Finally, remember to exit from script.