**A Windowed Data Stream Model for Detecting Patterns in
Streaming Data.**

In many areas of application, the sheer volume of
data requires us to make a quick decision about it as it ``streams''
by. This is particularly true in applications of fraud detection in
telecommunications, in surveillance in homeland security, and in
financial transactions, as well as in astrophysics, environmental
modeling, etc. We have been working on methods to detect patterns in
streaming data. S. Muthukrishnan (Rutgers) and Mayur Datar (Stanford)
developed a unique ``windowed data stream model'' that is based on
observing a ``window'' of the last *N* observations. With storage
space less than the window size *N*, they studied the problem of
estimating the ``rarity'' of items in the window, a crucial problem in
telephone credit card fraud detection and anomaly detection in
counter-terrorism surveillance. They also studied the problem of
estimating the ``similarity'' between two data stream windows, an
important problem in sharing data between datasets obtained by
different agencies/companies, a vital problem in the homeland security
area. Dr. Muthukrishnan visited the NJ Office of Insurance Fraud in
the Division of Criminal Justice to discuss analogous problems they
are having. Muthukrishnan and Datar found novel, simple algorithms for
estimating rarity and similarity in this windowed situation.

**Constructing the First Known Data Stream Algorithm for Estimating
Dominance of Multiple Signals.**

Graham Cormode (University of
Warwick) and S. Muthukrishnan (Rutgers) considered streams of multiple
signals *(i,a _{i},j)* where the

**Space Lower Bounds for Distance Approximation in
the
Data Stream Model.** Michael Saks (Rutgers) and Xiaodong Sun (Rutgers)
investigated the problem of approximating the distance between two
*d*-dimensional vectors * x* and

M. Saks and X. Sun, Space Lower Bounds for Distance Approximation in
the
Data Stream Model, *Proceedings 34th Annual ACM Symposium on Theory
of Computing (STOC)*, 2002, 360-369.

A revised and expanded version has been submitted for journal publication.

** Brian Babcock, Mayur Datar, and Laidan O'Callaghan** are
developing techniques for k-median clustering over "sliding
windows." Thus, a stream of data points is arriving, and the most
recent N points are considered "relevant", but memory is not
large enough to store N points. Their algorithm maintains a running
constant-approximate k-median clustering of the relevant points in
polylog space.

** Adam Meyerson, Laidan O'Callaghan, and Serge Plotkin** have a
Monte Carlo algorithm that produces a clustering that with
high-probability is approximately optimal under the k-median
measure of clustering. Lower bounds are produced on the running
time required and on the sizes of the samples that must be
clustered.

**This material is based upon work supported by the National Science Foundation under Grant No. 0100921**

Index of Special Focus on Data Analysis and Mining

DIMACS Homepage

Contacting the Center

Document last modified on April 29, 2003.