In this paper, we study two basic problems in the windowed data stream model. The first is the estimation of the rarity of items in the window. While previous work has studied simple distributional parameters such as the number of distinct items in the window, no prior work has addressed the general problem of estimating the rarity. Our second problem is one of estimating similarity between two data stream windows using the Jacard's coefficient. Prior work has focused on $L_p$ norms and set similarity measures such as the Jacard's coefficient have been studied before in the windowed data stream model. The problems of estimating rarity and similarity have many applications in mining massive data.
We present novel, simple algorithms for estimating
rarity and similarity on windowed data streams, accurate up to
factor $1 \pm \epsilon$ using space only
logarithmic in the window size. In both cases,
our solutions are based on
modifications of the powerful min-wise hashing technique.
We expect our solutions to find applications in practice.
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2002/2002-21.ps.gz