Hash tables are one of the great inventions in computer science.
Combination of lists, arrays, and a clever idea.
Usual use is a symbol table: associate value (data) with string (key).
Countless other applications exist; we'll talk about some.
The insight is to create from arbitrary data a small integer key (or just hash) that can be used to identify the data efficiently.
Usually (but not always) a subset of the data is used in making the key: the string in a string/value pair, for example.
Store the complete data under the key, usually in a hash table that uses the key as an index.
This makes the key a form of lookup address (index in a table).
Some hashing schemes eliminate the table altogether and make the key the machine address itself.
Why is it called a hash? Because the key is usually generated by mashing all the bits of the data together to make as random a mess as possible. OED: A mixture of mangled and incongruous fragments; a medley; a spoiled mixture; a mess, jumble. Often in phr. to make a hash of, to mangle and spoil in attempting to deal with.
This is the job of the hash function: produce a hash from the data to be used as the key.
The hash should be randomly distributed through a modest integer range. That range is the size of the hash table that stores the data.
For example, add up the bytes in a character string and take the result modulo 128, to give an index into a 128-entry array of string/value pairs.
What to do if two items hash to same key?
1. Perfect hashing
Arrange that there are no collisions. Can be done if the data is known in advance: choose the right hash function, or compute it. Rare, but nice when it can work.2. Open hashing
Put the colliding entry in the next slot of the table, or chain them together with pointers. When the table fills, rehash. Adds complexity; avoids memory allocation. Old-fashioned.3. Hash buckets
Each hash table entry is the head of a list of items that hash to that value. This is almost always the way to go these days. Can still rehash if the lists get long, but we can often arrange that that won't happen.
1. What is the hash function?
Guidelines:
1. It should be fast
2. It should depend on every bit of input data (as seen by the hash function).
3. Should produce a random distribution of keys
Note: adding up the bytes is bad: doesn't distinguish transpositions. doesn't distinguish x1 and y0
Let's assume data is a string. Here's a good hash function, from a C compiler (and K&R):
enum { NHASH = 1024 }; int hash(char *str) { unsigned long h; unsigned char *p; p = (unsigned char*)str; for(h = 0; *p != '\0'; p++) h = 37 * h + *p; return h % NHASH; }It's a shift register, with a 37 multiplier, and a modulus at the end.
Sign bits: pay attention
(Why not "h&(NHASH-1)" ?)
Due to Peter Weinberger:
#define PRIME 211 #define EOS '\0' int hashpjw(char *s) { char *p; unsigned int h, g; h = 0; for(p=s; *p!=EOS; p++){ h = (h<<4) + *p; if(g = h&0xF0000000){ h ^= g>>24; h ^= g; } } return h % PRIME; }Why so complicated?
2. How big should the table be?
A power of two is fine if the bits are good. Can use shift and mask instead of modulus, too.
Prime numbers are often used, but this adds nothing if the bits are good, and it slows down the code.
But how BIG?
Depends on expected data set: Hashing is O(1) only if the lists are short. These were from a 1990 (1024) C compiler and a 1980 (211); are these sizes OK?
The hash function of the form
h = X*h + string[i++]is generally good (the first one I showed, e.g.) (We call this style of function P(X))
But what should X be?
Common examples are 3, 5, 31, 33, 37, 65599, ...
Idea is to mix up the bits well. Some theorists prefer prime numbers (31 or 37, not 33) -- mixes up the bits better -- but in practice this doesn't seem to be an issue.
Choose the multiplier in concert with the table size. (E.g. don't use a power of two for both!)
If the multiplier has a small number of one bits
33 = 32+1 = 0x21or many ones and few zeros
31 = 32-1 = 0x1Ethe multiply can be implemented efficiently by shifts and adds or subtracts.
Why does this matter? One lousy multiply?
Answer: TEST!
Test due to Josh Bloch at Sun. Datasets: Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars). All of the strings in /bin/*, /usr/bin/*, /usr/lib/*, /usr/ucb/* and /usr/openwin/bin/* (66,304 strings, avg length 21 characters). A list of URLs (28,372 strings, avg length 49 characters).
Table shows average chain size:
Web Strings URLs Java 1.2509 1.2738 13.2560 P(37) [Java] 1.2508 1.2481 1.2454 P(65599) [Aho] 1.2490 1.2510 1.2450 P(31) [K+R] 1.2500 1.2488 1.2425 P(33) [Torek] 1.2500 1.2500 1.2453 hashpjw 6.5169 7.2142 30.6864 [1] Vo's Fn 1.2487 1.2471 1.2462 WAIS Fn 1.2497 1.2519 1.2452[1] broken implementation!
What stands out?
PJW is actually fine: they got the code wrong in Dr. Dobb's! Used == instead of = in
if(g = h&0xF0000000)After fix:
hashpjw(24) 1.3222 1.2791 1.9732 hashpjw(28) 1.2530 1.2506 1.2439P(37) (the one I showed before) is fine.
P(5) has worked well for me, e.g. in Mark V. Shaney.
Get the code right.
Test!
enum{
NHASH = 1024
};
int hash(char *name)
{
unsigned int h;
int i, c, d;
h = 0;
n = len name;
while(*name){
d = *name++;
c = d;
c ^= c<<6;
h += (c<<11) ^ (c>>1);
h ^= (d<<14) + (d<<7) + (d<<4) + d;
}
return h & (NHASH-1);
}
The person who invented this wanted to build a better hash function. It may even be pretty good, but it's not enough better (P(37) is really quite good) to be worth either the computational overhead or the work of testing it.
Beware of different data sets. E.g. Java once had a string hash function which skipped chars, only hashing some of the string. But what about URLs? They often begin with http:// and end with .html so this function produced horrible performance for that data set.
A hash function that's good for one input set (say, short variable names) might be poor for another (URLs). It's wise to evaluate a potential hash function on a variety of typical inputs. Does it hash short strings well? Long strings? Equal length strings with minor variations?
As in much of engineering, use what works, don't invent your own. Experiments show that for a wide variety of strings it's hard to construct a hash function that does appreciably better than the one we present above, but it's very easy to make one that does worse.
And the worse ones are usually more expensive.
Sometimes, though, we need to hash something other than a string. Then what?
Examples:
1. 3-dimensional coordinates of a particle
For sparse problems, reduce dimensionality2. IP addresses for networking
Generate address for routing tables3. Client URLs in a web search engine
Distribute the processing, but maintain cache state for efficiency.4. Chess board positions
Use hash to short-cut complete position evaluation.If you can hash it, you can cache it!
Test!
Generate the address of the head of a list of items that share the hash value.
Use standard list techniques to maintain the individual hash chains.
Pick the hash bucket and walk along the list looking for a perfect match.
If the item is found, it is returned. If not and the create flag is set, lookup adds the item to the table:
typedef struct Symbol Symbol; struct Symbol { char *name; int value; Symbol *next; }; Symbol* lookup(char *name, int create) { int i, h; Symbol *s; h = hash(name); for(s=symbol[h]; s; s=s->next) if(strcmp(name, s->name) == 0) return s; if(create){ s = emalloc(sizeof(Symbol)); s->name = name; s->value = UNDEFINED; s->next = symbol[h]; symbol[h] = s; } return s; }
The combination of lookup and insertion is common. Without it, there is duplication of effort; one must write
if(lookup(newitem) == 0) additem(newitem);and the hash function is called twice unless precautions are taken.
How to evaluate a hash table?
Check the code with all collisions: set hash to 1. (BUT SET IT BACK AFTERWARDS!)
Once that's working, turn on the full hash code. Run a large data set, check for even distribution of hash chain lengths. If you find an anomalous case, track it down.
Hash tables have their limitations.
If the hash function is poor or the table size is too small, the lists can grow long.
Since they're unsorted, this leads to n 2 behaviour.
There's no easy (let alone efficient) way to extract the elements in order.
Still, when used properly, hash tables' constant-time lookup, insertion, and deletion is unmatched by other techniques. And they are dead easy to implement.