[ACE] HashTables

HashTables


Hash tables are a concrete data structure which is suitable for implementing maps.
Basic idea: convert each key into an index in a big array
Lookup of keys and insertion and deletion in a hash table usually runs in O(1) time

Hash Function and Hash tables

A hash function h maps keys of a given type to integers in a fixed interval [0, N-1]
The integer h(k) is called the hash value of key k
Example of hash function: h(k) = k mod N
  • A hash table for a given key type consists of
  • Hash function h
  • Array (called table) of size N
    When implementing a map with a hash table, the goal is to store them (k,v) at index i = h(k)

Collision handling

Collisions occur when different elements are mapped to the same cell.
A lot of the theory and practice of hashing consists of devising better ways to avoid or handle collisions.
enter image description here

Hash functions

A hash function is usually specified as the composition of two functions:
  • Hash code: h1: keys->integers
  • java: key.hashCode()
  • Compression function: h2: integers -> [0, N-1]
  • The hash code is applied first, and the compression function is applied next on the result.
  • h(x) = h2(h1(x))
  • The goal of the hash function is to “disperse” the keys in an apparently random way.

Why disperse?

to reduce numbers of collisions

Why random?

if there is an obvious pattern then the incoming data might have a matching pattern that leads to many collisions

Compression function


  • Division:
    h2(y) = y mod N
    The size N of the hash table is usually chosen to be a prime
  • Multiply, add and divide (MAD):
    h2(y) = (ay + b)mod N
a and b are nonnegative integers such that a mod N ≠ 0, otherwise every integer would map to the same value b

Separate chaining(one way of collision handling)

let each cell in the table point to (e.g.) a linked list of entries that map there.

Map Methods with Separate Chaining used for Collisions

enter image description here
enter image description here
enter image description here
  • Separate chaining is simple and fast, but requires additional memory outside the table.
  • When memory is critical then try harder to stick with using the existing memory

Open addressing (Another way)

Open addressing: The colliding item is placed in a different cell of the table.

Linear probing

Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell(variant: cell + c where c is a constant, usually 1)
Disadvantages: colliding items lump together, causing future collisions to cause a longer sequence of probes.
enter image description here

Search with linear probing

consider a hash table A and how we use get(k)
  • We start at cell h(k)
  • We probe consecutive location until one of the following occurs
  • an item with key k is found or,
  • an empty cell is found, or
  • N cells have been unsuccessfully probed

How do you remove an element?

Find x using `get’ and set the entry back to blank, i.e. null or empty (which sometimes write as ‘#’)
Fix the sequence on its right-hand-side.
- WHY!?: If any entry on the right used linear probing then it might no longer be discoverable by ‘get’ because it will stop at the blank!!!
- Fix: Move such entries, e.g. by removing them and then re-inserting them all
Another way –lazy deletion
Don’t mark the entry as a blank, but as a “deleted” and fix the entries later.
E.g. in the find then skip over a ‘deleted’ entry rather than stopping
The code of linear probing hash table is here
enter image description here

Double hashing(another way of open addressing)

Double hashing uses a secondary hash function d(k) and handles collisions by replacing an item in the first available cell of series
(h(k) + j d(k)) mod N for j = 0, 1, …, N-1
The secondary hash function d(k) cannot have zero values
Linear probing is just d(k) = 1
The table size N must be a prime to allow probing of all the cells.
  • Common choice of compression function for the secondary hash function:
  • d(k) = q - (k mod q)
  • where q< N and q is a prime
  • The possible values for d(k) are 1,2,…q

why N need to be prime?

Consider d(k) = 4
With N = 12 then the positions are 4,8,0,4…
With N = 11 then the positions are 4,8,1,5,9,2,…
With a prime N, then eventually all table positions will be reached.
enter image description here

performance of hashing and load factor

  • In the worst case, searches, insertions and removals on a hash table take O(n) time
  • The worst case occurs when all the keys inserted into the map collide
  • The load factor a = n/N affects the performance of a hash table
  • In Java, maximal load factor is 0.75 (75%) – after that, rehashed
  • The expected running time of all the map ADT operations in a hash table is O(1)
  • Hash tables with chaining can work efficiently even with load factor more than 1. At the same time, tables based on open addressing scheme require load factor not to exceed 0.7 to be efficient. Hence, 30% of slots remain empty, which leads to obvious memory waste.

Rehashing

When the table gets too full then “re-hash”: create a new larger table and new hash function
  • need to (eventually) transfer all the entries from the old table to the new one
  • If do so immediately, then
  • one can amortise the cost over many entries (as for Vector) and so get an average cost of O(1) again
  • but the worst case might be O(n) when the table is rehashed and this might be bad for a real time system
  • Option:
    do not transfer all entries “in one go” but do “a few at a time”

Comparison between separate chaining and open addressing

enter image description here

评论

此博客中的热门博文

[MLE] W2 Multivariate linear regression

[MLE] W1 Introduction

[AIM] MetaHeuristics