[ACE] HashTables

二月 13, 2017

HashTables

Hash tables are a concrete data structure which is suitable for implementing maps.
Basic idea: convert each key into an index in a big array
Lookup of keys and insertion and deletion in a hash table usually runs in O(1) time

Hash Function and Hash tables

A hash function h maps keys of a given type to integers in a fixed interval [0, N-1]
The integer h(k) is called the hash value of key k
Example of hash function: h(k) = k mod N

A hash table for a given key type consists of
Hash function h
Array (called table) of size N
When implementing a map with a hash table, the goal is to store them (k,v) at index i = h(k)

Collision handling

Collisions occur when different elements are mapped to the same cell.
A lot of the theory and practice of hashing consists of devising better ways to avoid or handle collisions.
enter image description here

Hash functions

A hash function is usually specified as the composition of two functions:

Hash code: h1: keys->integers
java: key.hashCode()
Compression function: h2: integers -> [0, N-1]
The hash code is applied first, and the compression function is applied next on the result.
h(x) = h2(h1(x))
The goal of the hash function is to “disperse” the keys in an apparently random way.

Why disperse?

to reduce numbers of collisions

Why random?

if there is an obvious pattern then the incoming data might have a matching pattern that leads to many collisions

Compression function

Division:
h2(y) = y mod N
The size N of the hash table is usually chosen to be a prime
Multiply, add and divide (MAD):
h2(y) = (ay + b)mod N

a and b are nonnegative integers such that a mod N ≠ 0, otherwise every integer would map to the same value b

Separate chaining(one way of collision handling)

let each cell in the table point to (e.g.) a linked list of entries that map there.

Map Methods with Separate Chaining used for Collisions

Separate chaining is simple and fast, but requires additional memory outside the table.
When memory is critical then try harder to stick with using the existing memory

Open addressing (Another way)

Open addressing: The colliding item is placed in a different cell of the table.

Linear probing

Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell(variant: cell + c where c is a constant, usually 1)
Disadvantages: colliding items lump together, causing future collisions to cause a longer sequence of probes.
enter image description here

Search with linear probing

consider a hash table A and how we use get(k)

We start at cell h(k)
We probe consecutive location until one of the following occurs
an item with key k is found or,
an empty cell is found, or
N cells have been unsuccessfully probed

How do you remove an element?

Find x using `get’ and set the entry back to blank, i.e. null or empty (which sometimes write as ‘#’)
Fix the sequence on its right-hand-side.
- WHY!?: If any entry on the right used linear probing then it might no longer be discoverable by ‘get’ because it will stop at the blank!!!
- Fix: Move such entries, e.g. by removing them and then re-inserting them all

Another way –lazy deletion

Don’t mark the entry as a blank, but as a “deleted” and fix the entries later.
E.g. in the find then skip over a ‘deleted’ entry rather than stopping
The code of linear probing hash table is here
enter image description here

Double hashing(another way of open addressing)

Double hashing uses a secondary hash function d(k) and handles collisions by replacing an item in the first available cell of series
(h(k) + j d(k)) mod N for j = 0, 1, …, N-1
The secondary hash function d(k) cannot have zero values
Linear probing is just d(k) = 1
The table size N must be a prime to allow probing of all the cells.

Common choice of compression function for the secondary hash function:
d(k) = q - (k mod q)
where q< N and q is a prime
The possible values for d(k) are 1,2,…q

why N need to be prime?

Consider d(k) = 4
With N = 12 then the positions are 4,8,0,4…
With N = 11 then the positions are 4,8,1,5,9,2,…
With a prime N, then eventually all table positions will be reached.
enter image description here

performance of hashing and load factor

In the worst case, searches, insertions and removals on a hash table take O(n) time
The worst case occurs when all the keys inserted into the map collide
The load factor a = n/N affects the performance of a hash table
In Java, maximal load factor is 0.75 (75%) – after that, rehashed
The expected running time of all the map ADT operations in a hash table is O(1)
Hash tables with chaining can work efficiently even with load factor more than 1. At the same time, tables based on open addressing scheme require load factor not to exceed 0.7 to be efficient. Hence, 30% of slots remain empty, which leads to obvious memory waste.

Rehashing

When the table gets too full then “re-hash”: create a new larger table and new hash function

need to (eventually) transfer all the entries from the old table to the new one
If do so immediately, then
one can amortise the cost over many entries (as for Vector) and so get an average cost of O(1) again
but the worst case might be O(n) when the table is rehashed and this might be bad for a real time system
Option:
do not transfer all entries “in one go” but do “a few at a time”

搜索此博客

MikeChen's Blog