Tabulation hashing
Tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work by Carter and Wegman extended this method to arbitrary fixed-length keys. Generalizations of tabulation hashing have also been developed that can handle variable-length keys such as text strings.
Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is 3-independent: every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values. However, it is not 4-independent. More sophisticated but slower variants of tabulation hashing extend the method to higher degrees of independence.
Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including linear probing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections.
Method
Let p denote the number of bits in a key to be hashed, and q denote the number of bits desired in an output hash value. Choose another number r, less than or equal to p; this choice is arbitrary, and controls the tradeoff between time and memory usage of the hashing method: smaller values of r use less memory but cause the hash function to be slower. Compute t by rounding p/r up to the nearest integer; this gives the number of r-bit blocks needed to represent a key. For instance, if r = 8, then an r-bit number is a byte, and t is the number of bytes per key. The key idea of tabulation hashing is to view a key as a vector of t r-bit numbers, use a lookup table filled with random values to compute a hash value for each of the r-bit numbers representing a given key, and combine these values with the bitwise exclusive or operation. The choice of r should be made in such a way that this table is not too large; e.g., so that it fits into the computer’s cache memory.
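The tradeoff controlled by r can be made concrete with a little arithmetic. The following is a minimal sketch; the function name table_parameters is a hypothetical helper introduced here for illustration, and it assumes the table stores one q-bit entry for each combination of block value and block position.

```python
def table_parameters(p, q, r):
    """Block count and table size for p-bit keys, q-bit hashes, r-bit blocks."""
    if not 1 <= r <= p:
        raise ValueError("r must satisfy 1 <= r <= p")
    t = -(-p // r)              # p/r rounded up: number of r-bit blocks per key
    entries = t * (1 << r)      # one entry per (block value, block position)
    return {"t": t, "entries": entries, "bits": entries * q}


if __name__ == "__main__":
    # Compare a few block sizes for 32-bit keys and 32-bit hash values.
    for r in (4, 8, 16):
        print(r, table_parameters(p=32, q=32, r=r))
```

For 32-bit keys and hash values, r = 8 needs four lookups per key and a table of 32768 bits (4 KiB), while r = 16 needs only two lookups but a table of 4194304 bits (512 KiB).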
The initialization phase of the algorithm creates a two-dimensional array T of dimensions 2^r by t, and fills the array with random q-bit numbers. Once the array T is initialized, it can be used to compute the hash value h(x) of any given key x. To do so, partition x into r-bit values, where x_0 consists of the low order r bits of x, x_1 consists of the next r bits, etc. For example, with the choice r = 8, x_i is just the ith byte of x. Then, use these values as indices into T and combine the resulting table entries with the exclusive or operation:
h(x) = T[x_0][0] ⊕ T[x_1][1] ⊕ ... ⊕ T[x_(t-1)][t-1]
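The initialization and evaluation steps can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the class name SimpleTabulationHash and its default parameters are assumptions made here, and Python's random module stands in for whatever source of random bits a real implementation would use.

```python
import random


class SimpleTabulationHash:
    """Simple tabulation hashing for p-bit integer keys (illustrative sketch)."""

    def __init__(self, p=32, q=32, r=8, seed=None):
        self.p = p
        self.r = r
        self.t = -(-p // r)                 # p/r rounded up: blocks per key
        self.mask = (1 << r) - 1
        rng = random.Random(seed)           # assumption: not a secure generator
        # table[v][i] is the random q-bit value for block value v at position i,
        # giving an array of dimensions 2^r by t.
        self.table = [[rng.getrandbits(q) for _ in range(self.t)]
                      for _ in range(1 << r)]

    def __call__(self, x):
        if not 0 <= x < (1 << self.p):
            raise ValueError("key does not fit in p bits")
        h = 0
        for i in range(self.t):
            block = (x >> (self.r * i)) & self.mask   # the i-th r-bit block of x
            h ^= self.table[block][i]                 # combine with exclusive or
        return h


if __name__ == "__main__":
    h = SimpleTabulationHash(seed=1)
    print(hex(h(0x12345678)))
```

With r = 8, evaluating h on a 32-bit key costs four table lookups and three exclusive or operations on table values.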
Among any three distinct keys, at least one key has a block value in some position that neither of the other two keys shares, so its hash value is independent of the other two. However, this reasoning breaks down for four keys because there are sets of keys w, x, y, and z where none of the four has a byte value that it does not share with at least one of the other keys. For instance, if the keys have two bytes each, and w, x, y, and z are the four keys that have either zero or one as their byte values, then each byte value in each position is shared by exactly two of the four keys. For these four keys, the hash values computed by tabulation hashing will always satisfy the equation h(w) ⊕ h(x) ⊕ h(y) ⊕ h(z) = 0, because every table entry involved appears exactly twice and cancels. For a 4-independent hashing scheme, the same equation would only be satisfied with probability 1/m, where m = 2^q is the number of possible hash values. Therefore, tabulation hashing is not 4-independent. Nevertheless, despite only being 3-independent, tabulation hashing provides the same constant-time guarantee for linear probing.
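Both claims can be checked exhaustively on a toy instance. The sketch below assumes deliberately tiny parameters that are not from the text (1-bit blocks, two blocks per key, 1-bit hash values), so that all 16 possible tables can be enumerated. The four keys 0, 1, 2, and 3 then play the role of w, x, y, and z.

```python
from collections import Counter
from itertools import combinations, product

R, T_BLOCKS, Q = 1, 2, 1   # toy sizes: 1-bit blocks, 2 blocks per key, 1-bit hashes


def tab_hash(table, x):
    """Simple tabulation hash of key x using table[block value][position]."""
    h = 0
    for i in range(T_BLOCKS):
        block = (x >> (R * i)) & ((1 << R) - 1)
        h ^= table[block][i]
    return h


def all_tables():
    """Yield every possible table of Q-bit entries, each exactly once."""
    n = (1 << R) * T_BLOCKS
    for flat in product(range(1 << Q), repeat=n):
        yield [list(flat[v * T_BLOCKS:(v + 1) * T_BLOCKS])
               for v in range(1 << R)]


def tuple_counts(keys):
    """Count how often each tuple of hash values occurs over all tables."""
    return Counter(tuple(tab_hash(tb, k) for k in keys) for tb in all_tables())


keys = range(1 << (R * T_BLOCKS))

# 3-independence: every triple of distinct keys hits all 8 output triples
# equally often when the table is drawn uniformly at random.
three_uniform = all(
    len(c) == (1 << Q) ** 3 and len(set(c.values())) == 1
    for c in (tuple_counts(ks) for ks in combinations(keys, 3))
)

# Failure of 4-independence: the four hash values always XOR to zero.
xor_always_zero = all(
    tab_hash(tb, 0) ^ tab_hash(tb, 1) ^ tab_hash(tb, 2) ^ tab_hash(tb, 3) == 0
    for tb in all_tables()
)

if __name__ == "__main__":
    print(three_uniform, xor_always_zero)   # prints: True True
```

Because the exclusive or of the four hash values is always zero, only 8 of the 16 possible 4-tuples of hash values can ever occur, whereas a 4-independent scheme would produce all 16 with equal probability.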
Cuckoo hashing, another technique for implementing hash tables, guarantees constant time per lookup (regardless of the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best known bound on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a cuckoo hash table for a static set of keys that does not change as the table is used.
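The static case can be illustrated with a small sketch that builds a cuckoo hash table from two simple tabulation hash functions and starts over with fresh random tables whenever an insertion fails. The names make_tabulation and StaticCuckooTable, the table sizing, and the eviction limit are all assumptions chosen here for illustration; they are not taken from any particular analysis.

```python
import random


def make_tabulation(rng, p=32, q=32, r=8):
    """Return a simple tabulation hash function for p-bit integer keys."""
    t = -(-p // r)
    mask = (1 << r) - 1
    table = [[rng.getrandbits(q) for _ in range(t)] for _ in range(1 << r)]

    def h(x):
        out = 0
        for i in range(t):
            out ^= table[(x >> (r * i)) & mask][i]
        return out

    return h


class StaticCuckooTable:
    """Cuckoo hash table for a fixed set of 32-bit keys (illustrative sketch)."""

    def __init__(self, keys, seed=None, max_rebuilds=100):
        keys = list(set(keys))
        if any(not 0 <= k < (1 << 32) for k in keys):
            raise ValueError("keys must be 32-bit non-negative integers")
        self.size = 1                       # slots per side, a power of two
        while self.size < 2 * len(keys):    # assumption: generous sizing
            self.size *= 2
        rng = random.Random(seed)
        for _ in range(max_rebuilds):
            # Draw two fresh hash functions and try to place every key.
            self.hashes = (make_tabulation(rng), make_tabulation(rng))
            if self._try_build(keys):
                return
        raise RuntimeError("could not build the table")

    def _try_build(self, keys):
        self.slots = [[None] * self.size, [None] * self.size]
        limit = max(16, 8 * len(keys).bit_length())   # assumed eviction limit
        for key in keys:
            cur, side = key, 0
            for _ in range(limit):
                pos = self.hashes[side](cur) & (self.size - 1)
                # Place cur and pick up whatever occupied the slot before.
                cur, self.slots[side][pos] = self.slots[side][pos], cur
                if cur is None:
                    break
                side ^= 1                   # evicted key moves to its other side
            else:
                return False                # too many evictions: signal a rebuild
        return True

    def __contains__(self, key):
        # A lookup inspects exactly two slots, one per hash function.
        return any(
            self.slots[s][self.hashes[s](key) & (self.size - 1)] == key
            for s in (0, 1)
        )
```

A lookup such as `key in table` always inspects exactly two slots, which is the constant-time guarantee described above; only construction depends on the quality of the hash functions.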
Extensions
Although tabulation hashing as described above (“simple tabulation hashing”) is only 3-independent, variations of this method can be used to obtain hash functions with much higher degrees of independence.
Siegel uses the same idea of using exclusive or operations to combine random values from a table, with a more complicated algorithm based on expander graphs for transforming the key bits into table indices, to define hashing schemes that are k-independent for any constant or even logarithmic value of k. However, the number of table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is still too large to be practical, and the use of expanders in Siegel's technique also makes it not fully constructive. Thorup provides a scheme based on tabulation hashing that reaches high degrees of independence more quickly, in a more constructive way.
He observes that using one round of simple tabulation hashing to expand the input keys to six times their original length, and then a second round of simple tabulation hashing on the expanded keys, results in a hashing scheme whose independence number is exponential in the parameter r, the number of bits per block in the partition of the keys into blocks.
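The two-round construction can be sketched by composing two simple tabulation hash functions. This is a minimal illustration under stated assumptions: the helper names make_simple_tabulation and make_double_tabulation are hypothetical, keys are padded up to a whole number of r-bit blocks before the sixfold expansion, and no attempt is made to reproduce the parameter choices needed for the independence bound.

```python
import random


def make_simple_tabulation(rng, key_bits, out_bits, r):
    """Return a simple tabulation hash from key_bits-bit keys to out_bits bits."""
    t = -(-key_bits // r)                   # blocks per key, rounded up
    mask = (1 << r) - 1
    table = [[rng.getrandbits(out_bits) for _ in range(t)]
             for _ in range(1 << r)]

    def h(x):
        out = 0
        for i in range(t):
            out ^= table[(x >> (r * i)) & mask][i]
        return out

    return h


def make_double_tabulation(p=32, q=32, r=8, seed=None):
    """Two rounds of simple tabulation: expand the key sixfold, then hash it."""
    rng = random.Random(seed)
    blocks = -(-p // r)
    expanded_bits = 6 * blocks * r          # six times the padded key length
    expand = make_simple_tabulation(rng, p, expanded_bits, r)
    finish = make_simple_tabulation(rng, expanded_bits, q, r)
    return lambda x: finish(expand(x))


if __name__ == "__main__":
    h = make_double_tabulation(seed=3)
    print(hex(h(42)))
```

With p = 32 and r = 8, the first round performs 4 lookups of 192-bit values and the second round performs 24 lookups of q-bit values, so the extra independence is paid for with larger tables and more lookups per key.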
Simple tabulation is limited to keys of a fixed length, because a different table of random values needs to be initialized for each position of a block in the keys.
Lemire studies variations of tabulation hashing suitable for variable-length keys such as character strings. The general type of hashing scheme studied by Lemire uses a single table T indexed by the value of a block, regardless of its position within the key.
However, the values from this table may be combined by a more complicated function than bitwise exclusive or. Lemire shows that no scheme of this type can be 3-independent. Nevertheless, he shows that it is still possible to achieve 2-independence. In particular, a tabulation scheme that interprets the values T[x_i] (where x_i is, as before, the ith block of the input) as the coefficients of a polynomial over a finite field and then takes the remainder of the resulting polynomial modulo another polynomial, gives a 2-independent hash function.
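The shape of such a computation can be sketched as follows. This is one possible reading of the description, not Lemire's exact construction, and it makes no claim to 2-independence: it assumes byte-sized blocks, treats each table value as a polynomial over GF(2), weights the value for each block by a power of x according to its position, and reduces modulo the degree-8 irreducible polynomial x^8 + x^4 + x^3 + x + 1. The name make_string_hash and the toy 8-bit output size are assumptions made here for illustration.

```python
import random

DEGREE = 8
MODULUS = 0x11B   # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)


def make_string_hash(seed=None):
    """Return (h, table) where h hashes byte strings of any length."""
    rng = random.Random(seed)
    # A single table indexed by block value, shared by every position.
    table = [rng.getrandbits(DEGREE) for _ in range(256)]

    def h(data):
        acc = 0
        for byte in data:
            acc <<= 1                 # multiply the running polynomial by x
            if acc >> DEGREE:         # degree reached DEGREE: reduce modulo MODULUS
                acc ^= MODULUS
            acc ^= table[byte]        # polynomial addition over GF(2) is XOR
        return acc

    return h, table


if __name__ == "__main__":
    h, _ = make_string_hash(seed=5)
    print(h(b"tabulation"), h(b"hashing"))
```

Because the same table serves every position, the memory cost does not grow with the key length; the position of each block is instead encoded in the power of x by which its table value is multiplied.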