## DIMACS TR: 2002-21

## Estimating Rarity and Similarity over Data Stream Windows

### Authors: Mayur Datar and S. Muthukrishnan

**
ABSTRACT
**

In the windowed data stream model, we observe items
coming in over time. At any time $t$, we consider the window
of the last $N$ observations $a_{t-(N-1)},a_{t-(N-2)},\ldots,a_t$,
each $a_i \in \{1,\ldots,u\}$; we are allowed to ask queries
about the data in the window,
say, we wish to compute the minimum
or the median of the items in the window.
A crucial restriction is that we are
only allowed $o(N)$ (often polylogarithmic in $N$)
storage space, that is, space smaller than
the window size, so the items within the window can not be
archived. Window data stream model arose out of the need to
formally reason about the underlying data
analyses problems in applications like inter-networking
and transactions processing.
In this paper, we study two basic problems in the
windowed data stream model. The first is the estimation
of the rarity of items in the window. While
previous work has studied simple distributional
parameters such as the number of distinct items in the window,
no prior work has addressed the general problem of
estimating the rarity.
Our second problem is one of estimating similarity between
two data stream windows using the Jacard's coefficient.
Prior work has focused on $L_p$ norms and set similarity
measures such as the Jacard's coefficient have been studied before
in the windowed data stream model.
The problems of estimating rarity and similarity have
many applications in mining massive data.

We present novel, simple algorithms for estimating
rarity and similarity on windowed data streams, accurate up to
factor $1 \pm \epsilon$ using space only
logarithmic in the window size. In both cases,
our solutions are based on
modifications of the powerful min-wise hashing technique.
We expect our solutions to find applications in practice.

Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2002/2002-21.ps.gz

DIMACS Home Page