DIMACS TR: 97-25

Linear Hashing

Authors: Noga Alon, Martin Dietzfelbinger, Peter Bro Miltersen, Erez Petrank and Gábor Tardos


Consider the set ${\cal H}$ of all linear (or affine) transf ormations between two vector spaces over a finite field $F$. We study how good $\cal H$ is as a class of hash functions, namely we consider hashing a set $S$ of size $n$ into a range having the same cardinali ty $n$ by a randomly chosen function from ${\cal H}$ and look at the expected size of the largest hash bucket. $\cal H$ is a universal class of hash functions for any fini te field, but with respect to our measure different fields behave differen tly.

If the finite field $F$ has $n$ elements then there is a bad set $S\subset F^2$ of size $n$ with expected maximal bucket size $\Omega(n^{1/3})$. If $n$ is a perfect square then there is even a bad set with largest bucket size {\em always} at least $\sqrt n$. (This is worst possible, since with respect to a universal class of hash functions every set of size $n$ has expected largest bucket size below $\sqrt n+1/2$.)

If, however, we consider the field of two elements then we get much better bounds. The best previously known upper bound on the expected size of the largest bucket for this class was $O( 2^{\sqrt{\log n}})$. We reduce this upper bound to $O(\log n\log\log n)$. Note that this is not far from the guarantee for a random function. There, the average largest bucket would be $\Theta(\log n/\log \log n)$.

In the course of our proof we develop a tool which may be of independent interest. Suppose we have a subset $S$ of a vector space $D$ over ${\bf Z}_2$, and consider a random linear mapping of $D$ to a smaller vector space $R$. If the cardinality of $S$ is larger than $c_\e|R|\log|R|$ then with probability $1-\e$, the image of $S$ will cover all elements in the range.

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1997/97-25.ps.gz
DIMACS Home Page