DIMACS Workshop on Streaming Data Analysis and Mining

November 5, 2001 DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:
Adam Buchsbaum, AT&T Labs - Research, alb@research.att.com
Rajeev Motwani, Stanford University, rajeev@cs.stanford.edu
Jennifer Rexford, AT&T Labs, jrex@research.att.com
Presented under the auspices of the Special Focus on Data Analysis and Mining.

This material is based upon work supported by the National Science Foundation under Grant No. 0100921

Abstracts:


1.

Brian Babcock (Stanford)
Title:	Characterizing Memory Requirements for Queries over
Continuous Data Streams
Slides(ppt.gzip)

Abstract:  Many queries over continuous data streams can require
unbounded storage to answer (that is, storage that is proportional
to the size of the data streams).  When building a system for
processing queries over data streams, knowing the maximum memory
resources among concurrent queries.  We consider conjunctive
queries with equality and inequality predicates over multiple
data streams.  For this class of queries, we specify an algorithm
for determining whether or not a query can be answered using a
bounded amount of memory regardless of the size and contents of
the input data streams.  Our algorithm is constructive:  if a
query is answerable using bounded memory, our algorithm outputs
a description of how to build constant-sized synopses of the
data streams and then use those synopses to answer the query.

2.

Kevin Chen, University of California, Berkeley
Title:  Finding Frequent Items in Data Streams
Slides(.ppt.gzip)

Abstract:  We present algorithms for estimating the most
frequent items in a data stream, using limited storage space.
Our algorithms achieve better pace bounds than the best known
algorithms for this problem for several common distributions.
We intruduce a notion of a count sketch which lets us estimate
the frequency of the most common items in a stream.  The
sketches for two streams can be added or subtracted.  Thus, our
algorithm is easily adapted to estimate the objects with the
largest change in frequency for two data streams.  Previous
approaches were not able to solve this latter problem.  This is
joint work with Moses Charikar and Martin Farach-Colton.

3.

Mayur Datar (Stanford)
Title:  Maintaining Stream Statistics over Sliding Windows
Slides(.ppt.gzip)

Abstract:  We consider the problem of maintaining aggregates
and statistics over data streams, with respect to the last $N$
data elements seen so far. We refer to this model as the {\em
sliding window} model. We consider the following basic problem:
Given a stream of bits, maintain a count of the number of $1$'s
in the last $N$ elements seen from the stream. We show that
using $O(\frac{1}{\epsilon} \log^2 N)$ bits of memory, we can
estimate the number of $1$'s to within a factor of $1 + \epsilon$. We also give a matching lower bound of
$\Omega(\frac{1}{\epsilon}\log^2 N)$ memory bits for any
deterministic or randomized algorithms. We extend our scheme to
maintain the sum of the last $N$ positive integers. We provide
matching upper and lower bounds for this more general problem
as well. We apply our techniques to obtain efficient algorithms
for the $L_p$ norms (for $p \in [1,2]$) of vectors under the
sliding window model. Using the algorithm for the basic
counting problem, one can adapt many other techniques to work
for the sliding window model, with a multiplicative overhead of
$O(\frac{1}{\epsilon}\log N)$ in memory and a $1 +\epsilon$
factor loss in accuracy. These include maintaining approximate
histograms, hash tables, and statistics or aggregates such as
sum and averages.

4.

Phil Gibbons (Bell Labs)
Title:	Distinct Sampling of Streams: Theory and Practice

Abstract:  One of the earliest interesting streaming algorithms
was due to Flajolet and Martin, who described in the early 80's
an O(log n) space synopsis for approximately counting the number
of distinct values in a data stream.   In many applications,
however, the distinct values query of interest is not over the
entire stream, but over a subset of the stream specified by an
ad-hoc predicate.  What is needed, then, is a single synopsis
that can handle predicates specified only after the stream has
flown by.

We show how a new synopsis, called a "distinct sample", can handle
this more general problem.   We present both analytical guarantees
and experimental results demonstrating the effectiveness of our
synopsis.  Moreover, we show how distinct samples have broad
applicability in practice for session-based event recording
environments.

5.

Greg Humphreys (Stanford)
Title: A Streaming Framework for Scalable Visualization on Clusters
Slides(.ppt.gzip)

Abstract:  I'll be talking about my research on cluster
graphics.  Our group has developed systems for supporting
unmodified applications in novel display environments, as well
as enabling cluster-parallel applications to use the aggregated
rendering power of commodity graphics accelerators housed in the
cluster nodes.	To make this possible, we abstract the entire
graphics API as a stream, and manipulate these streams to form
traditional parallel graphics pipelines using commodity parts
as building blocks.

I'll describe the basic parallel rendering problem, and show
how stream manipulation has allowed us to build a very flexible
and general system based entirely on commodity technology.
Our software is currently in use in over 100 installations
worldwide, and its underlying stream processing technology has
been used to accomplish tasks other than scalability, such as
non-photorealistic rendering.

6.

Raul Jimenez (Rutgers)
Title:	Massive Lossless Data Compression and Multiple Parameter
Estimation from Galaxy Spectra
Slides(.ps.gzip)

Abstract:  We present a method for radical linear compression
of datasets where the data are dependent on some number $M$ of
parameters. We show that, if the noise in the data is independent
of the parameters, we can form $M$ linear combinations of the
data which contain as much information about all the parameters
as the entire dataset, in the sense that the Fisher information
matrices are identical; i.e. the method is lossless. We explore
how these compressed numbers fare when the noise is dependent on
the parameters, and show that the method, although not precisely
lossless, increases errors by a very modest factor. The method
is general, but we illustrate it with a problem for which it is
well-suited: galaxy spectra, whose data typically consist of
$\sim 10^3$ fluxes, and whose properties are set by a handful
of parameters such as age, brightness and a parametrised star
formation history. The spectra are reduced to a small number of
data, which are connected to the physical processes entering the
problem. This data compression offers the possibility of a large
increase in the speed of determining physical parameters. This
is an important consideration as datasets of galaxy spectra reach
$10^6$ in size, and the complexity of model spectra increases. In
offer a classification scheme for galaxy spectra which is based
rather directly on physical processes.

Reference:

Alan F. Heavens, Raul Jimenez, OferLahav, Massive lossless data
compression and multiple parameter estimation from galaxy spectra.
Mon. Not. R. Astron. Soc. 317, 965-972 (2000)

7.

Sampath Kannan (U. Penn)
Title:  Open Problems in Data Stream Algorithmics
Slides(.ppt.gzip)

8.

Stephen North (AT&T Labs)
Title:  A Large-Scale Network Visualization System

Abstract:  Visualization is part of a feedback loop in
analyzing large data sets.  Not just the end stage of analysis,
visualization itself can provide new metaphors for expressing
data, which can help spot new trends unseen by other methods
of analysis.  We have built an interactive viewer for large data
sets of events and transactions on networks, with the initial goal
of operating on a day's worth of telephone call records (about
400 million).  We then adapted the viewer to a large frame/ATM
packet data network.  I will talk about the engineering of this
system, its applications, and some ideas about future goals for
such systems.

9.

Title:  Clustering Data Streams
Slides(.ppt.gzip)

Abstract:  Clustering a set of data points means grouping them
according to some notion of similarity.  For example, we might
want to cluster web pages by content, putting similar web pages
into the same group and ensuring that web pages that are very
different are in different groups.  We might want to classify
phone call records so that we can identify which customers would
benefit from certain promotions, or so that we can more easily
identify fraud.

For many recent clustering applications, the {\em data stream}
model is more appropriate than the conventional data set model.
By nature, a stored data set is an appropriate model when
significant portions of the data are queried again and again,
and updates are small and/or relatively infrequent.  In contrast,
a data stream is an appropriate model when a large volume of
data is arriving continuously and it is either unnecessary
or impractical to store the data in some form of memory. Data
streams are also appropriate as a model of access to large data
sets stored in secondary memory where performance requirements
necessitate access via linear scans.

In the data stream model~\cite{henzinger98digital}, the data
points can only be accessed in the order in which they arrive.
Random access to the data is not allowed; memory is assumed to
be small relative to the number of points, and so only a limited
amount of information can be stored. In general, algorithms
operating on streams will be restricted to fairly simple
calculations because of the time and space constraints. The
challenge facing algorithm designers is to perform meaningful
computation with these restrictions.

I will discuss an approximation algorithm for clustering
according to the $k$--Median objective function; this
algorithm, due to Charikar and Guha, is based on their local
search algorithm for the related Facility Location problem.
Next I will discuss improvements made by Meyerson that maintain
most of the theoretical soundness of the original algorithm
but allow faster clustering, given some additional (reasonable)
assumptions.  Finally, I will discuss a general method of applying
any clustering algorithm to the clustering of data streams, even
if the algorithm of choice is not itself usable on data streams.
This method itself has approximation guarantees for $k$--Median.

10.

Jennifer Rexford (AT&T Labs)
Title: Computing Traffic Demands From Flow-Level Measurements
Slides(.ppt.gzip)

Abstract:  Controlling the distribution of traffic in a large
Internet Service Provider (ISP) backbone requires an
understanding of the network topology, traffic demands, and
routing policies.  Shifts in user behavior, the emergence of
new applications, and the failure of network elements can
result in significant (and sudden) fluctuations in the
distribution of traffic across the backbone.  Network operators
can alleviate congestion by adapting the routing configuration
to the prevailing traffic.  This talk describes how to acquire
an accurate, network-wide view of the traffic demands on an
operational ISP backbone.  We show that traffic demands can be
computed based on flow-level measurements collected at each
entry point to the provider network and information about the
destinations reachable from each exit point.  We describe our
experiences applying this methodology to routing and
measurement data collected in the AT&T IP Backbone.

References:

1)   Anja Feldmann, Albert Greenberg, Carsten Lund, Nick Reingold,
Jennifer Rexford, and Fred True, Deriving traffic demands for
operational IP networks: Methodology and experience, IEEE/ACM
Transactions on Networking, June 2001, pp. 265-279.

2)   Anja Feldmann, Albert Greenberg, Carsten Lund, Nick Reingold, and
Jennifer Rexford, NetScope: Traffic engineering for IP networks,
IEEE Network Magazine, March/April 2000, pp. 11-19.

11.

Anne Rogers (AT&T Labs)
Title:	Analyzing Transaction Streams with Hancock

Abstract:  Massive transaction streams present a number of
opportunities for data mining techniques. Transactions might
represent calls on a telephone network, commercial credit card
purchases, stock market trades, or HTTP requests to a web server.
While historically such data have been collected for billing
or security purposes, they are now being used to discover how
clients use the underlying services.

For several years, we have computed evolving profiles (called
signatures) of the clients in transaction streams using
handwritten C code.  The signature for each client captures
the salient features of the client's transactions through time.
These programs were carefully hand-optimized to ensure that the
data could be processed in a timely fashion.  They achieved
the necessary performance but at the expense of readability,
which led to programs that were difficult to verify and maintain.

Hancock is a domain-specific language created to analyze
transactions streams efficiently without sacrificing readability.
In this talk, I will describe the obstacles to computing with
large streams and explain how Hancock addresses these problems.

Hancock is joint work with Corinna Cortes, Kathleen Fisher,
Karin Hogstedt, Daryl Pregibon, and Fred Smith.

12.

Martin Strauss (AT&T Labs)
Title:  Fast, Small-Space Algorithms for Approximate Histogram
Maintenance

Consider an array, A, of N values, defined implicitly by a
We find a B-bucket piecewise-constant representation H for A whose
sum-square-error is at most (1+epsilon) times the error of the
optimal representation.  We consider three computational costs:
the time to process an update, the time to reconstruct H from the
synopsis data structure, and total space used; each of these is
polynomial in B, log(N), and 1/epsilon.  We also give extensions
to piecewise-linear representations, Haar wavelet representations,
and piecewise-constant representations under absolute (L1) error.
Recently this problem has received attention, especially in
the database community, where it corresponds to the problem of
maintaining approximate histograms.

Our work advances previous streaming work on arrays because we
estimate the entire array, not just a statistical summary such
as a norm or quantile.	We hope our results will be more useful
in practice than simple summaries.  Furthermore, unlike a norm,
that has a concise, data-independent, declarative definition,
the best representation for A is defined as the optimum over the
exponentially-large set of all representations.  Thus estimating
it quickly requires using a number of optimization techniques.

Joint work with Anna C. Gilbert, Sudipto Guha, Piotr Indyk,
Yannis Kotidis, and S. Muthukrishnan.

13.

Nitin Thaper (MIT)
Title: Space-Efficient Algorithms for Maintaining
Multi-Dimensional Histograms
Slides(.ps.gzip)

Abstract:  Maintaining approximate histograms of data attracted
recently a significant amount of attention in the database
community.  However, most of the research has been focused on
the case of one-dimensional data. In this talk we present novel
algorithms for maintaining histograms for 2 or more dimensional
data, as well as their experimental evaluation. Unlike earlier
work in this area, our algorithms also have *provable* guarantees
on their performance.

This is a joint work with S. Guha, P. Indyk and N. Koudas

Working Group Presentations:
1.

Corinna Cortes (AT&T Labs - Research)
Title: Communities of Interest

Reference:

Corinna Cortes, Daryl Pregicon, and Chris Volinsky,
Communities of Interest, In 4th Int'l. Symp. on
Intelligent Data Analysis (IDA 2001), Lisbon, Portugal,  2001.


Previous: Participation
Next: Registration
Workshop Index
DIMACS Homepage
Contacting the Center