DIMACS TR: 2005-11
Summarizing and Mining Inverse Distributions on Data Streams
via Dynamic Inverse Sampling
Authors: Graham Cormode, S. Muthukrishnan and Irina Rozenbaum
ABSTRACT
Emerging data stream management systems approach the challenge
of massive data distributions which arrive at high speeds while
there is only small storage by summarizing and mining the
distributions using samples or sketches. However, data distributions
can be "viewed" in different ways. A data stream of integer values
can be viewed either as the forward distribution f(x), ie., the
number of occurrences of x in the stream, or as its inverse,
$f^{-1}(i)$, which is the number of items that appear i times. While
both such "views" are equivalent in stored data systems, over data
streams that entail approximations, they may be significantly
different. In other words, samples and sketches developed for the
forward distribution may be ineffective for summarizing or mining
the inverse distribution. Yet, many applications such as IP traffic
monitoring naturally rely on mining inverse distributions.
We formalize the problems of managing and mining inverse
distributions and show provable differences between summarizing the
forward distribution vs the inverse distribution. We present
methods for summarizing and mining inverse distributions of data
streams: they rely on a novel technique to maintain a dynamic
sample over the stream with provable guarantees which can be
used for variety of summarization tasks (building quantiles or
equidepth histograms) and mining (anomaly detection: finding heavy
hitters, and measuring the number of rare items), all with provable
guarantees on quality of approximations and time/space used by our
streaming methods. We also complement our analytical and algorithmic
results by presenting an experimental study of the
methods over network data streams.
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2005/2005-11.ps.gz
DIMACS Home Page