Project 4, Median Heaps

Due: Tuesday, November 28, 8:59pm


Addenda

Corrections/clarifications in orange below.

Objectives

The objective of this programming assignment is for you to gain experience working with binary heaps, templates and function pointers.

Background

The premise of this project is that you are asked to design a data structure for keeping track of three statistics: minimum, maximum and median. The data structure has to support efficient insertion, but does not have to support efficient search or remove.

These requirements can be handled easily using two binary heaps: a max heap and a min heap. Recall that a max heap has the largest value at the root and a min heap has the smallest value at the root. Our intention is to store half of the numbers in the max heap and the other half in the min heap. Furthermore, we want to have all of the numbers in the max heap to be smaller than the numbers in the min heap. In such an arrangement, the median will be either the root of the max heap or the root of the min heap. (Take a moment now to digest that the smaller numbers go in the max heap.)

When we insert more items in this data structure, we can compare the item with the current median. If it is smaller than the current median then we would store it in the max heap. Otherwise, it should go in the min heap. Since insertion in a binary heap is O(log n) time, this satisfies our requirement for efficient insertion.

Now, this can mess up our balance — the sizes of the max heap and min heap should never differ by more than one. To "rebalance" our heaps when one heap has two more items than the other heap, we simply remove the root of the larger heap and insert it in the other heap. Since removing the root and inserting take only O(log n) time, we continue to maintain a O(log n) running time.

In this scheme, we can actually report the median item in O(1) time since the median item is one of the roots. We also want to be able to report the minimum and maximum item in O(1) time. This can be accomplished by simply remembering which items are the smallest and largest seen so far. When you insert an item, check to see if it is the new minimum or the new maximum. These scheme works until you have to remove something from the data structure.

Binary heaps do not handle search very well. If you have to look for something in a binary heap, you cannot do better than O(n) time because a binary heap is not a binary search tree. The remove operation we want to support in this data structure looks for a particular item and deletes it from the heap. The search takes O(n) time but the actual deletion from the heap takes O(log n) time. These are pretty straightforward operations. The complication is that whenever you remove an item from our data structure, you have to check if you happen to have removed the minimum or the maximum item. If that is the case, you have to go and find the new minimum or new maximum item, so that the data structure can continue to report the minimum and maximum in O(1) time. This search for new minimum/maximum will take O(n) which should be considered very slow. Anyone using this data structure should be told that while the remove operation is supported, it should be used very sparingly.

Those are the only operations you need to support: insert, remove, get median, get minimum and get maximum (plus the usual C++ constructor, destructor, ...) The "catch" is that your implementation must be templated and you are required to use only one binary heap implementation that works for both min heaps and max heaps. To do this, you have to work with function pointers.

In C/C++, you can have pointers to functions. You can then use a function pointer to invoke the function. The tricky part about function pointers is actually specifying the types. Function pointers can only point to functions of a specified signature. So, when you declare a function pointer, you have to say "this pointer points to a function that takes one int parameter and returns an int". This is accomplished by the following syntax:

int (*fptr)(int ) ; As usual, C/C++ declarations should be read "inside out". Here you should think: if I dereferenced fptr, I get "int (int )" and that's a function that takes one int and returns an int. Similarly, if you want to have a function pointer for functions that take two int parameters and returns an int parameter, you should have something like: int (*anotherFptr) (int, int) ;

Here's a small program that demonstrates the use of function pointers

// File: fptr.cpp // // CMSC 341 Fall 2017 // // Simple use of function pointers // #include <iostream> using namespace std ; // An example function // int add3(int x) { return x + 3 ; } // Another function that has same // signature as add3() // int add5(int x) { return x + 5 ; } // function that takes two parameters // int addpair3(int x, int y) { return x + y + 3 ; } // Another function that has same // signature as addpair3() // int addpair5(int x, int y) { return x + y + 5 ; } int main() { // Declare function pointer named "fptr" that can // point to functions that take an int parameter // and returns an int. // int (*fptr)(int ) ; int n ; // use fptr to invoke add3 // fptr = add3 ; // assigns function to function pointer n = fptr(8) ; cout << "n = " << n << endl ; // should be 11 // use fptr to invoke add5 // fptr = add5 ; // assigns function to function pointer n = fptr(8) ; cout << "n = " << n << endl ; // should be 13 // Declare a function pointer named "anotherFptr" // that points to a function that takes two // int parameters and returns an int. // int (*anotherFptr) (int, int) ; anotherFptr = addpair3 ; // function pointer assignment n = anotherFptr(17, 13) ; cout << "n = " << n << endl ; // should be 33 anotherFptr = addpair5 ; // function pointer assignment n = anotherFptr(17, 13) ; cout << "n = " << n << endl ; // should be 35 return 0 ; }

For this project, you want to have function pointers that compare two items. The idea is that code for a min heap and a max heap are identical except you need to know what "less than" means. So, if you have a function pointer that points to a function that does the comparison, you can write the same code that works for both a min heap and a max heap. The declaration of a function that takes two int values and returns a bool would look like:

bool (*compare)(int, int) ; That is simple enough, but for this project you have to work with templates, so you might try something like: template <typename T> bool (*compare)(T, T) ; However, you don't know anything about T, so the parameters should be passed by reference to reduce copying. Furthermore, a comparison function should not need to modify its parameters (it's "just looking"), thus a more reasonable declaration is: template <typename T> bool (*compare)(const T&, const T&) ; Thus, your templated heap implementation should have a data member that is a function pointer with the type above.

Assignment

Note: Running time is one of the most important considerations in the implementation of a data structure. Programs that produce the desired output but exceed the required running times are considered wrong implementations and will receive substantial deductions during grading.

Your assignment is to implement the MedianHeap data as described above. No header files are provided so you are free to design your own C++ classes, subject to the requirements below.


Specific Requirements

Your MedianHeap class must have these member functions. There are additional requirements for your Heap class (other than the ones already specified above), since external code should not be invoking the Heap class member functions directly.

  1. Your class definitions must be in a single file called MedianHeap.h (case sensitive). Client programs that use the MedianHeap class should be able to do so simply using: #include "MedianHeap.h"

  2. The MedianHeap class must have a constructor with the signature:

    template <typename T> MedianHeap<T>::MedianHeap( bool (*lt) (const T&, const T&), bool (*gt) (const T&, const T&), int cap=100 ) ;

    First two parameters are function pointers that compare items of type T. The first function returns true if the first parameter is less than the second parameter. The second function returns true if the first parameter is greater than the second. The constructor must create a MedianHeap object capable of holding cap items. The object created by the constructor must be useable immediately.
  3. The MedianHeap class must have a copy constructor with the signature:

    template <typename T> MedianHeap<T>::MedianHeap(const MedianHeap<T>& otherH) ;

    The constructor must create a copy of the MedianHeap object given in the parameter. The copied object must have its own allocated memory.

    It is possible that your class design allows your MedianHeap to use the compiler-supplied copy constructor. In that case, you do not have to write a explicit copy constructor, but it is still your responsibility to make sure that the compiler-supplied copy constructor works correctly.

  4. The MedianHeap class must have a destructor with the signature:

    template <typename T> MedianHeap<T>::~MedianHeap() ;

    The destructor must deallocate any dynamically allocated memory.

    It is possible that your class design allows your MedianHeap to use the compiler-supplied destructor. In that case, you do not have to write a explicit destructor, but it is still your responsibility to make sure that the compiler-supplied destructor works correctly.

  5. The MedianHeap class must have an overloaded assignment operator with the signature:

    template <typename T> const MedianHeap<T>& MedianHeap<T>::operator=(const MedianHeap<T>& rhs) ;

    The assignment operator must deallocate memory in the host object and copy the rhs MedianHeap into the host.

    It is possible that your class design allows your MedianHeap to use the compiler-supplied assignment operator. In that case, you do not have to write a explicit assignment operator, but it is still your responsibility to make sure that the compiler-supplied assignment operator works correctly.

  6. The MedianHeap class must have a member function called size() that returns the total number of items in the MedianHeap. The size() function must have the signature:

    template <typename T> int MedianHeap<T>::size() ;

    The size() function must run in constant time to receive full credit.

  7. The MedianHeap class must have a member function called capacity() that returns the maximum number of items that can be stored in the MedianHeap. The capacity() function must have the signature:

    template <typename T> int MedianHeap<T>::capacity() ;

    The capacity() function must run constant time to receive full credit.

  8. The MedianHeap class must have a member function called insert() that adds the item given in the parameter to the MedianHeap. The insert() function must have the signature:

    template <typename T> void MedianHeap<T>::insert(const T& item) ;

    If the MedianHeap cannot hold any more items (i.e., it has reached capacity), then insert() should throw an out_of_range exception.

    The insert() function must run in O(log n) time to receive full credit.

  9. The MedianHeap class must have member functions called getMedian(), getMin() and getMax() that returns a copy of an object with the median, minimum and maximum keys, respectively. These functions should have the signatures:

    template <typename T> T MedianHeap<T>::getMedian() ; template <typename T> T MedianHeap<T>::getMin() ; template <typename T> T MedianHeap<T>::getMax() ;

    These functions should throw an out_of_range exception if the MedianHeap is empty. For full credit, these functions must run in constant time.

    Note that in computer science, the median value is always one of the stored values. In case the number of items in the MedianHeap is even, either one of the two objects closest to the middle is acceptable as the median value. Do not average these two values. (If the type T does not have the + operator defined, you cannot add them anyway.)

  10. The MedianHeap class must have a member function called deleteItem with the following signature:

    template <typename T> bool MedianHeap<T>::deleteItem(T& givenItem, bool (*equalTo) (const T&, const T&) ) ;

    The deleteItem() function should look for an item stored in the MedianHeap that is equal to the givenItem parameter according to the equalTo() function. The equalTo() function does not have to determine equality based on the key value used by the MedianHeap. For example, in the City class given in the test programs below, the MedianHeap might be keyed on the population of the cities, but we may invoke the deleteItem() function to remove a city with a particular name. In that case, the equalTo() function might compare the name and state of the city. You may not assume that givenItem has a valid entry in every field. If we are removing a city based on its name, the population field might have a garbage value. So, you will need to check both the min heap and the max heap to see if there is an item that matches.

    If deleteItem() successfully finds and removes an item, it should return true, and copy the values of the found item in the givenItem parameter. (That is why the parameter is passed by reference.) In the City class example, you should be able to delete "Anchorage, AK" from the MedianHeap and obtain its statistics (latitude, longitude, population and elevation) from the reference parameter.

    If deleteItem() does not find an item to remove, it should return false. The MedianHeap data structure should not be modified in this case. The parameter givenItem should also not be modified.

    If the MedianHeap is empty, the deleteItem() function should throw an out_of_range exception.

    The deleteItem() function must run in O(n) time to receive full credit.

  11. The MedianHeap class must have a member function called dump that prints out the contents of the MedianHeap including the positions of each key in the max heap and the min heap. See formatting examples in the sample output below. The dump() function must have the signature:

    template <typename T> void MedianHeap<T>::dump() ;

    Calling dump() should not modify the data structure in any way.
  12. The MedianHeap class must have four member functions used for grading with the following signatures:

    template <typename T> int MedianHeap<T>::maxHeapSize() ; template <typename T> int MedianHeap<T>::minHeapSize() ; template <typename T> T MedianHeap<T>::locateInMaxHeap(int pos) ; template <typename T> T MedianHeap<T>::locateInMinHeap(int pos) ;

    These functions are used for grading purposes. So, you should make sure that they are implemented in any submission.

    The functions maxHeapSize() and minHeapSize() should return the number of items in the max heap and min heap. It should be the case that the sum of these two sizes equals the value returned by MedianHeap::size(). (I.e., don't have any items, such as the median item, set aside and not stored in one of the two heaps.)

    The functions locateInMaxHeap() and locateInMinHeap() returns a copy of the item in position pos in the max heap and min heap, respectively. Remember that you are required to store the items starting at index 1. (The slot in index 0, is unused.)

    If the pos parameter has an invalid index, these functions should throw an out_of_range exception.

Your code must run without segmentation faults and without memory leaks. For grading purposes, memory leaks are considered as bad as segmentation faults. This is because many segmentation faults are cause by poorly written destructors. A program with an empty destructor might avoid some segmentation faults but will leak memory horribly. Thus, not implementing a destructor or not deleting unused memory must incur a penalty that is equivalent to a segmentation fault.


Test Programs

Here are some test programs for checking the correctness of your code. As usual, you should also do your own testing. Your programs should compile and run with the provided tests without seg faulting or leaking memory. However, good runs with these programs do not guarantee that your code is bug free.

Some of the code use the City class, defined and implemented: City.h and City.cpp.


Implementation Notes

Here we list some recommendations and point out some traps and pitfalls.


What to Submit

Before submitting, remove all extraneous output from your program. (Just comment them out.) Your typescript file should have only a few hundred lines of output, not 195 megabytes of text. (Yes, that's the record.) We will not look at the typescript file beyond the first 1000 lines.

You must submit the following files to the proj4 directory.

If you put the implementation of the MedianHeap class in the .h file, then you only have to submit MedianHeap.h.

If you followed the instructions in the Project Submission page to set up your directories, you can submit your code using this Unix command command.

cp MedianHeap.h MedianHeap.cpp ~/cs341proj/proj4/

Your code should compile with the test programs using these Unix commands on GL:

linux1% g++ -I . ../../00Proj4/p4test1.cpp -o t1.out linux1% g++ -I . ../../00Proj4/p4test2.cpp -o t2.out linux1% g++ -I ../../00Proj4/ -I . ../../00Proj4/City.cpp ../../00Proj4/p4test3.cpp -o t3.out linux1% g++ -I ../../00Proj4/ -I . ../../00Proj4/City.cpp ../../00Proj4/p4test4.cpp -o t4.out linux1% g++ -I ../../00Proj4/ -I . ../../00Proj4/City.cpp ../../00Proj4/p4test5.cpp -o t5.out linux1% g++ -I ../../00Proj4/ -I . ../../00Proj4/City.cpp ../../00Proj4/p4test6.cpp -o t6.out linux1% g++ -I . ../../00Proj4/p4test7.cpp -o t7.out

Use the Unix script command to record yourself compiling these programs. Then, use p4test7.cpp to perform some timing runs: linux1% time ./t7.out 15000 1.038u 0.006s 0:01.04 99.0% 0+0k 0+0io 0pf+0w linux1% time ./t7.out 30000 2.237u 0.014s 0:02.25 99.5% 0+0k 0+0io 0pf+0w linux1% time ./t7.out 60000 4.365u 0.023s 0:04.39 99.7% 0+0k 0+0io 0pf+0w

Finally, remember to exit from script.