Almost Perfect Hashing
:A Case Study Using Integral Values of Strings as Keys
Chris Esposito
Student, UMBC
12/7/98
http://www.gl.umbc.edu/~cespos2/
Is there a way to perform perfect? The answer to this question is not a clear one. There are different techniques that can be used to approximate perfect hashing. One can never have a perfect hash function, because there is always the possibility of more data possibilities that key values, so collisions are always possibility. Collisions however can be made less probable by picking keys that are well spread among the data members of the data set. From my experimentation I found that the larger the hash table, (the more distinct keys that are possible) the lower the probability of collisions is. The question is: What time space tradeoffs are worthwhile? If the header table is ten times the size of the data set, it is not always favorable due to the wasted space. However, if the header table is the size of the data set or close to it, it may also be unfavorable due to the number of collisions which occur, resulting in a slower running time. From my data sample I found that the table size with the best time space tradeoff occurs at four times the data size.
Hashing: Hashing is a method of ordering data members by a hash value in a hash table using a hash function that determines the hash value by performing some operations on the data with expected O(n) asymptotic performance.
Hash Key: A value that can be used to identify a data member or group of data members that are in a hash table.
Hash Value: The value that is computed by sending the key of data that is to be inserted or searched for to the hash function.
Hash Table: The place where data is stored and indexed by the hash values.
Hash Function: A function given a data member k it outputs a hash value h(k) for the given hash table.
Division Method: The division method is a method of hashing where the hash function h(k) takes the key value k and returns k modulo m, where m is a predetermined integral constant.
h( k ) = k mod m
Multiplication Method: The multiplication method is a method of hashing where the hash function h(k) takes the key value k and returns m time the floor of k multiplied by A modulo 1, where m is a predetermined integral constants, and a is a real number where 0<A<1.
h( k ) = ë m ( k A mod 1 ) û
Reciprocal Method: The reciprocal method is a method of hashing where the hash function h(k) takes the key value k and returns the floor of A divided by the quantity B multiplied by k plus C quantity modulo m, where m, A, B ,C are some predetermined integral constants.
h( k ) = ë A / ( B * k + C ) û mod m
Open Addressing: Open addressing is an implementation of a hash table where all data elements are stored in the hash table, and a slot in the table can have data in it or is set to nil. When searching the table for data, the table is systematically examined until the data is found or it is clear that the data being searched for is not there.
Linear Probing: Linear Probing is a method of hashing where the hash function h(k,i) takes the key value k and index i and returns the value of computing some new hash function h’ of k plus i modulo a integral constant m.
h( k , i ) = ( h’( k ) + i ) mod m
Quadratic Probing: Quadratic Probing is a method of hashing where the hash function h(k,i) takes the key value k and index i and returns the value of computing some new hash function h’ of k multiplied by i and some integral constant c1 plus i squared multiplied by another integral constant c2 modulo a third integral constant m.
h( k , i ) = ( h’( k ) c1 * i + c2 * i2 ) mod m
Double Hashing: Double Hashing is a method of hashing where the hash function h(k,i) takes the key value k and index i and returns the value of computing some new hash function h1 of k plus i multiplied by another hash function h2 modulo an integral constant m.
h( k , i ) = ( h1 ( k ) + i * h2( k ) ) mod m
Big-Oh notation (f(n) = O(g(n))): Big-Oh notation is a notation denoting that one function g(n) is asymptotic upper bound of function f(n) provided that n is a sufficiently large integer.
Motivation: My motivation for doing this project was to determine which of my experimental techniques, division, multiplication, reciprocal or the provided method, for hashing resulted in performance closest to the theoretical running time of O(n) for hashing n elements into a data table.

Methods: I determined the number of collisions occurring when inserting a given set of words into a hash table using the following four hashing techniques. The key to send to the hash function was determined by summing the values of the characters in the word using the following formula:
ci is the ascii value of the ith character in the word.
Division Method: For my division method testing I used the header table size as my m value for the modulo calculation.
h( k ) = k mod HeaderTableSize
Multiplication Method: For my multiplication method testing I used 0.9 as my A value, 17 as my m value and took the modulo of the result by the header table size.
h( k ) = ë 17 ( k 0.9 mod 1 ) û mod HeaderTableSize
Reciprocal Method: For my reciprocal method testing I used 15 as my A value, 29 as my B value, 13 as my C value and the header table size as my m value.
h( k ) = ë 15 / ( 29 * k + 13 ) û mod HeaderTableSize
Experimental Method: For the experimental method testing I used the header table size as the m value for the modulo calculation. In this technique the calculations were complicated and were similar to that of linear probing when collisions occurred and the new hash function hi was of the listed form. I had problems with the given implementation, because I could not find m values that would give unique values since some of my data members produced identical k values.
h( k ) = k mod HeaderTableSize
hi( k , r ) =( ( k mod 2 * i + 100 * r + 1 ) mod r ) mod HeaderTableSize

|
Header Table Size |
Division Collisions |
Multplication Collisions |
Reciprocal Collisions |
Experimental Collisions |
|
504 |
181 |
195 |
348 |
181 |
|
550 |
174 |
179 |
396 |
174 |
|
625 |
146 |
148 |
382 |
146 |
|
750 |
134 |
129 |
455 |
134 |
|
875 |
117 |
118 |
340 |
117 |
|
1000 |
107 |
106 |
318 |
107 |
|
1250 |
79 |
78 |
286 |
79 |
|
1500 |
75 |
70 |
405 |
75 |
|
1750 |
69 |
60 |
237 |
69 |
|
2000 |
53 |
52 |
213 |
53 |
|
2250 |
55 |
48 |
360 |
55 |
|
2500 |
39 |
50 |
179 |
39 |
|
2750 |
51 |
44 |
174 |
51 |
|
3000 |
41 |
34 |
318 |
41 |
|
3250 |
33 |
45 |
155 |
33 |
|
3500 |
34 |
40 |
144 |
34 |
|
3750 |
30 |
27 |
286 |
30 |
|
4000 |
26 |
31 |
126 |
26 |
|
4250 |
29 |
39 |
113 |
29 |
|
4500 |
33 |
24 |
258 |
33 |
|
4750 |
25 |
34 |
116 |
25 |
|
5000 |
22 |
24 |
107 |
22 |
|
5500 |
33 |
25 |
101 |
33 |
|
6000 |
23 |
18 |
213 |
23 |
|
7000 |
18 |
26 |
82 |
18 |
|
8000 |
16 |
18 |
77 |
16 |
|
9000 |
21 |
13 |
159 |
21 |
|
10000 |
23 |
19 |
53 |
23 |
|
20000 |
9 |
13 |
26 |
9 |
|
30000 |
8 |
9 |
53 |
8 |
|
40000 |
8 |
12 |
16 |
8 |
|
50000 |
9 |
10 |
12 |
9 |
There is no perfect hash function! There is no absolute time space tradeoff to pick when creating a hash function. This is what I have determined by analyzing the experimental data obtained in my research. First I had to arrive at a function that would change strings into almost unique keys.
I experimented with some methods of making keys from strings:
There are always considerations to be made when making a determination of what is a suitable tradeoff between time and space when hashing. I came up with the following considerations to use analysis of the tested hash functions:
For the division method it was very east to implement because it just involved one constant which was easy to vary by just choosing the size of the hash table. The reciprocal and multiplication methods were also easy to implement, because they only involved choosing constants that would minimize duplication of hash values. The experimental implementation, however was difficult to implement, because m was hard to find, and some keys were exactly the same so there would be no m value, which would produces unique results.
All of the implementations used take the size of the header table and the number of data pieces multiplied by the size of a data member to store the data, this memory makes up the data.
Collision occurrence with respect to data table size was almost identical between the division, multiplication, and experimental methods of hashing. However, with the reciprocal method data was in two trends a high and a low range. The general trend was of approaching eight as the table size goes to infinity. This is to be expected, since there must be atleast eight collisions due to the data set used.
Collisions take the same time for all trial functions, since they all deal with the collision by searching the data table for a set of consecutive empty spaces equal to the number of duplicate elements with the given key. After that the values with the same key are moved to the free slots.
Any of the algorithms that were experimented with would be suitable for hashing values. The reciprocal method may have had poor running time due to the choice of values of A, B, and C. To improve insert time of colliding items a pointer could be kept to the first free slot in the data table that would be set if a slot is cleared or go to the next slot after the last inserted so the entire table would not have to be searched to find a free slot to insert data. This would improve running time, because on average a data item to be inserted would have a key not already inserted in the table.
The optimal tradeoff between time and space as I can see from the plot of space versus collisions occurs at table size equal to four times the data size. This seems to be the place where the graph levels off, indicating a good trade off point.
All source code compiled and tested on the GL machine of UMBC using the gcc compiler.
Dr. Kalpakis and Dr. Sherman of UMBC provided the experimental algorithm skeleton.
Cormen, Leiserson,, and Rivest; Introduction to Algorithms, MIT Press, 1990, p.23, pp. 219-239.
A: Source code for division method search and insert functions
B: Source code for multiplication method search and insert functions
C: Source code for reciprocal method search and insert functions
D: Source code for experimental method search and insert functions