Almost Perfect Hashing:

A Case Study Using Integral Values of Strings as Keys

Chris Esposito

Student, UMBC

12/7/98

cespos2@gl.umbc.edu

http://www.gl.umbc.edu/~cespos2/

Abstract

Key Words

Report Body

Motivation

Methods

Results

Discussion

Conclusion

Acknowledgements

References

Appendices

 

Abstract

Is there a way to perform perfect? The answer to this question is not a clear one. There are different techniques that can be used to approximate perfect hashing. One can never have a perfect hash function, because there is always the possibility of more data possibilities that key values, so collisions are always possibility. Collisions however can be made less probable by picking keys that are well spread among the data members of the data set. From my experimentation I found that the larger the hash table, (the more distinct keys that are possible) the lower the probability of collisions is. The question is: What time space tradeoffs are worthwhile? If the header table is ten times the size of the data set, it is not always favorable due to the wasted space. However, if the header table is the size of the data set or close to it, it may also be unfavorable due to the number of collisions which occur, resulting in a slower running time. From my data sample I found that the table size with the best time space tradeoff occurs at four times the data size.

Key Words

Hashing: Hashing is a method of ordering data members by a hash value in a hash table using a hash function that determines the hash value by performing some operations on the data with expected O(n) asymptotic performance.

Hash Key: A value that can be used to identify a data member or group of data members that are in a hash table.

Hash Value: The value that is computed by sending the key of data that is to be inserted or searched for to the hash function.

Hash Table: The place where data is stored and indexed by the hash values.

Hash Function: A function given a data member k it outputs a hash value h(k) for the given hash table.

Division Method: The division method is a method of hashing where the hash function h(k) takes the key value k and returns k modulo m, where m is a predetermined integral constant.

h( k ) = k mod m

Multiplication Method: The multiplication method is a method of hashing where the hash function h(k) takes the key value k and returns m time the floor of k multiplied by A modulo 1, where m is a predetermined integral constants, and a is a real number where 0<A<1.

h( k ) = ë m ( k A mod 1 ) û

Reciprocal Method: The reciprocal method is a method of hashing where the hash function h(k) takes the key value k and returns the floor of A divided by the quantity B multiplied by k plus C quantity modulo m, where m, A, B ,C are some predetermined integral constants.

h( k ) = ë A / ( B * k + C ) û mod m

Open Addressing: Open addressing is an implementation of a hash table where all data elements are stored in the hash table, and a slot in the table can have data in it or is set to nil. When searching the table for data, the table is systematically examined until the data is found or it is clear that the data being searched for is not there.

Linear Probing: Linear Probing is a method of hashing where the hash function h(k,i) takes the key value k and index i and returns the value of computing some new hash function h’ of k plus i modulo a integral constant m.

h( k , i ) = ( h’( k ) + i ) mod m

Quadratic Probing: Quadratic Probing is a method of hashing where the hash function h(k,i) takes the key value k and index i and returns the value of computing some new hash function h’ of k multiplied by i and some integral constant c1 plus i squared multiplied by another integral constant c2 modulo a third integral constant m.

h( k , i ) = ( h’( k ) c1 * i + c2 * i2 ) mod m

Double Hashing: Double Hashing is a method of hashing where the hash function h(k,i) takes the key value k and index i and returns the value of computing some new hash function h1 of k plus i multiplied by another hash function h2 modulo an integral constant m.

h( k , i ) = ( h1 ( k ) + i * h2( k ) ) mod m

Big-Oh notation (f(n) = O(g(n))): Big-Oh notation is a notation denoting that one function g(n) is asymptotic upper bound of function f(n) provided that n is a sufficiently large integer.

Report Body

Motivation: My motivation for doing this project was to determine which of my experimental techniques, division, multiplication, reciprocal or the provided method, for hashing resulted in performance closest to the theoretical running time of O(n) for hashing n elements into a data table.

Methods: I determined the number of collisions occurring when inserting a given set of words into a hash table using the following four hashing techniques. The key to send to the hash function was determined by summing the values of the characters in the word using the following formula:

ci is the ascii value of the ith character in the word.

Division Method: For my division method testing I used the header table size as my m value for the modulo calculation.

h( k ) = k mod HeaderTableSize

Multiplication Method: For my multiplication method testing I used 0.9 as my A value, 17 as my m value and took the modulo of the result by the header table size.

h( k ) = ë 17 ( k 0.9 mod 1 ) û mod HeaderTableSize

Reciprocal Method: For my reciprocal method testing I used 15 as my A value, 29 as my B value, 13 as my C value and the header table size as my m value.

h( k ) = ë 15 / ( 29 * k + 13 ) û mod HeaderTableSize

Experimental Method: For the experimental method testing I used the header table size as the m value for the modulo calculation. In this technique the calculations were complicated and were similar to that of linear probing when collisions occurred and the new hash function hi was of the listed form. I had problems with the given implementation, because I could not find m values that would give unique values since some of my data members produced identical k values.

h( k ) = k mod HeaderTableSize

hi( k , r ) =( ( k mod 2 * i + 100 * r + 1 ) mod r ) mod HeaderTableSize

 

Results

Header Table Size

Division Collisions

Multplication Collisions

Reciprocal Collisions

Experimental Collisions

504

181

195

348

181

550

174

179

396

174

625

146

148

382

146

750

134

129

455

134

875

117

118

340

117

1000

107

106

318

107

1250

79

78

286

79

1500

75

70

405

75

1750

69

60

237

69

2000

53

52

213

53

2250

55

48

360

55

2500

39

50

179

39

2750

51

44

174

51

3000

41

34

318

41

3250

33

45

155

33

3500

34

40

144

34

3750

30

27

286

30

4000

26

31

126

26

4250

29

39

113

29

4500

33

24

258

33

4750

25

34

116

25

5000

22

24

107

22

5500

33

25

101

33

6000

23

18

213

23

7000

18

26

82

18

8000

16

18

77

16

9000

21

13

159

21

10000

23

19

53

23

20000

9

13

26

9

30000

8

9

53

8

40000

8

12

16

8

50000

9

10

12

9

 

Discussion

There is no perfect hash function! There is no absolute time space tradeoff to pick when creating a hash function. This is what I have determined by analyzing the experimental data obtained in my research. First I had to arrive at a function that would change strings into almost unique keys.

I experimented with some methods of making keys from strings:

  1. I tried type casting the strings directly which had almost no variance in values.
  2. I tried summing the ascii values of each character in the string which gave many unique values but not to the extent I desired.
  3. I then tried summing the ascii multiplied by the position in the word with a few other alterations and this gave only 8 duplicated key values. Thus there will be atleast 8 collisions when hashing.

There are always considerations to be made when making a determination of what is a suitable tradeoff between time and space when hashing. I came up with the following considerations to use analysis of the tested hash functions:

  1. How easy is it to write the code to implement the method?
  2. For the division method it was very east to implement because it just involved one constant which was easy to vary by just choosing the size of the hash table. The reciprocal and multiplication methods were also easy to implement, because they only involved choosing constants that would minimize duplication of hash values. The experimental implementation, however was difficult to implement, because m was hard to find, and some keys were exactly the same so there would be no m value, which would produces unique results.

  3. How much memory will the implementation use?
  4. All of the implementations used take the size of the header table and the number of data pieces multiplied by the size of a data member to store the data, this memory makes up the data.

  5. How often do collisions occur?
  6. Collision occurrence with respect to data table size was almost identical between the division, multiplication, and experimental methods of hashing. However, with the reciprocal method data was in two trends a high and a low range. The general trend was of approaching eight as the table size goes to infinity. This is to be expected, since there must be atleast eight collisions due to the data set used.

  7. How much time does it take to handle collisions?

Collisions take the same time for all trial functions, since they all deal with the collision by searching the data table for a set of consecutive empty spaces equal to the number of duplicate elements with the given key. After that the values with the same key are moved to the free slots.

Conclusion

Any of the algorithms that were experimented with would be suitable for hashing values. The reciprocal method may have had poor running time due to the choice of values of A, B, and C. To improve insert time of colliding items a pointer could be kept to the first free slot in the data table that would be set if a slot is cleared or go to the next slot after the last inserted so the entire table would not have to be searched to find a free slot to insert data. This would improve running time, because on average a data item to be inserted would have a key not already inserted in the table.

The optimal tradeoff between time and space as I can see from the plot of space versus collisions occurs at table size equal to four times the data size. This seems to be the place where the graph levels off, indicating a good trade off point.

Acknowledgements

All source code compiled and tested on the GL machine of UMBC using the gcc compiler.

Dr. Kalpakis and Dr. Sherman of UMBC provided the experimental algorithm skeleton.

References

Cormen, Leiserson,, and Rivest; Introduction to Algorithms, MIT Press, 1990, p.23, pp. 219-239.

Appendices

A: Source code for division method search and insert functions

B: Source code for multiplication method search and insert functions

C: Source code for reciprocal method search and insert functions

D: Source code for experimental method search and insert functions

E: Data used for testing the insert and search functions