DIMACS Workshop on Streaming Data Analysis and Mining

November 5, 2001
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers: Adam Buchsbaum, AT&T Labs - Research, alb@research.att.com; Rajeev Motwani, Stanford University, rajeev@cs.stanford.edu; Jennifer Rexford, AT&T Labs, jrex@research.att.com
Presented under the auspices of the Special Focus on Data Analysis and Mining.
Working Group on Streaming Data Analysis and Mining Home Page.
This material is based upon work supported by the National Science Foundation under Grant No. 0100921.
Abstracts:


1.

Brian Babcock (Stanford)
	Title:	Characterizing Memory Requirements for Queries over 
	Continuous Data Streams 
	Slides(.ppt.gzip)
               

	Abstract:  Many queries over continuous data streams can require
	unbounded storage to answer (that is, storage that is proportional
	to the size of the data streams).  When building a system for
	processing queries over data streams, knowing the maximum memory
	requirements for a query in advance can be helpful in allocating
	resources among concurrent queries.  We consider conjunctive
	queries with equality and inequality predicates over multiple
	data streams.  For this class of queries, we specify an algorithm
	for determining whether or not a query can be answered using a
	bounded amount of memory regardless of the size and contents of
	the input data streams.  Our algorithm is constructive:  if a
	query is answerable using bounded memory, our algorithm outputs
	a description of how to build constant-sized synopses of the
	data streams and then use those synopses to answer the query.



2.

Kevin Chen (University of California, Berkeley)
	Title:  Finding Frequent Items in Data Streams
	Slides(.ppt.gzip) 

	Abstract:  We present algorithms for estimating the most
	frequent items in a data stream, using limited storage space.
	Our algorithms achieve better space bounds than the best known
	algorithms for this problem for several common distributions.
	We introduce a notion of a count sketch which lets us estimate
	the frequency of the most common items in a stream.  The
	sketches for two streams can be added or subtracted.  Thus, our
	algorithm is easily adapted to estimate the objects with the
	largest change in frequency for two data streams.  Previous
	approaches were not able to solve this latter problem.  This is
	joint work with Moses Charikar and Martin Farach-Colton.
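
	The following is a minimal sketch of a count-sketch-style data
	structure, written to illustrate the ideas in the abstract rather
	than to reproduce the authors' construction.  The class name, the
	default depth and width, and the use of a keyed BLAKE2 hash are
	all assumptions made for this example.  Each row hashes an item
	to one signed counter, an item's frequency is estimated as the
	median over rows, and two sketches built with the same hash
	functions can be subtracted counter by counter.

```python
import hashlib
import statistics


class CountSketch:
    """Small count sketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, row, item):
        # One keyed hash per row supplies both the bucket and the +/-1 sign.
        digest = hashlib.blake2b(
            repr(item).encode(), digest_size=8, key=b"row-%d" % row
        ).digest()
        h = int.from_bytes(digest, "big")
        bucket = (h >> 1) % self.width
        sign = 1 if h & 1 else -1
        return bucket, sign

    def add(self, item, count=1):
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        # Each row gives one estimate; the median damps the effect of collisions.
        guesses = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, item)
            guesses.append(sign * self.table[row][bucket])
        return statistics.median(guesses)

    def subtract(self, other):
        """Return the sketch of (self - other); shapes must match."""
        assert (self.depth, self.width) == (other.depth, other.width)
        diff = CountSketch(self.depth, self.width)
        for r in range(self.depth):
            diff.table[r] = [a - b for a, b in zip(self.table[r], other.table[r])]
        return diff


# Two streams summarized separately, then compared through their sketches.
before, after = CountSketch(), CountSketch()
for item, n in [("a", 100), ("c", 3)]:
    before.add(item, n)
for item, n in [("a", 40), ("b", 5)]:
    after.add(item, n)
change_in_a = before.subtract(after).estimate("a")  # close to 100 - 40
```

	Because the sketch is linear in the item counts, the subtracted
	sketch is exactly the sketch of the difference of the two
	streams, which is what makes the change-in-frequency query
	possible.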


3.

Mayur Datar (Stanford)
	Title:  Maintaining Stream Statistics over Sliding Windows
	Slides(.ppt.gzip) 

	Abstract:  We consider the problem of maintaining aggregates
	and statistics over data streams, with respect to the last $N$
	data elements seen so far. We refer to this model as the {\em
	sliding window} model. We consider the following basic problem:
	Given a stream of bits, maintain a count of the number of $1$'s
	in the last $N$ elements seen from the stream. We show that
	using $O(\frac{1}{\epsilon} \log^2 N)$ bits of memory, we can
	estimate the number of $1$'s to within a factor of $1 +
	\epsilon$. We also give a matching lower bound of
	$\Omega(\frac{1}{\epsilon}\log^2 N)$ memory bits for any
	deterministic or randomized algorithms. We extend our scheme to
	maintain the sum of the last $N$ positive integers. We provide
	matching upper and lower bounds for this more general problem
	as well. We apply our techniques to obtain efficient algorithms
	for the $L_p$ norms (for $p \in [1,2]$) of vectors under the
	sliding window model. Using the algorithm for the basic
	counting problem, one can adapt many other techniques to work
	for the sliding window model, with a multiplicative overhead of
	$O(\frac{1}{\epsilon}\log N)$ in memory and a $1 +\epsilon$
	factor loss in accuracy. These include maintaining approximate
	histograms, hash tables, and statistics or aggregates such as
	sums and averages.
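
	As an illustration of the basic counting problem, here is a
	small sketch of a bucket-based sliding window counter.  It keeps
	buckets whose sizes are powers of two, merges the two oldest
	buckets of a size when too many accumulate, and estimates the
	count as the total minus half of the oldest bucket.  The class
	name and the parameter max_per_size are assumptions for this
	example; the exact bucket invariant and constants used in the
	talk are not given in the abstract.

```python
class SlidingWindowCounter:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window, max_per_size=3):
        self.window = window
        self.max_per_size = max_per_size  # larger value -> smaller error, more memory
        self.time = 0
        # Each bucket is (timestamp of its most recent 1, number of 1s), oldest first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop buckets whose most recent 1 has left the window.
        while self.buckets and self.buckets[0][0] <= self.time - self.window:
            self.buckets.pop(0)
        if not bit:
            return
        self.buckets.append((self.time, 1))
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= self.max_per_size:
                break
            # Merge the two oldest buckets of this size into one of twice the size.
            oldest, second = same[0], same[1]
            merged = (self.buckets[second][0], 2 * size)
            self.buckets[oldest:second + 1] = [merged]
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only the oldest bucket can straddle the window boundary.
        return total - self.buckets[0][1] // 2


counter = SlidingWindowCounter(window=4)
for bit in [1, 1, 1]:
    counter.add(bit)
recent_ones = counter.estimate()
```

	Only the oldest bucket is uncertain, so the absolute error is at
	most half of that bucket's size; keeping more buckets per size
	makes the oldest bucket a smaller fraction of the total.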


4.

Phil Gibbons (Bell Labs)
	Title:	Distinct Sampling of Streams: Theory and Practice

	Abstract:  One of the earliest interesting streaming algorithms
	was due to Flajolet and Martin, who described in the early 80's
	an O(log n) space synopsis for approximately counting the number
	of distinct values in a data stream.   In many applications,
	however, the distinct values query of interest is not over the
	entire stream, but over a subset of the stream specified by an
	ad-hoc predicate.  What is needed, then, is a single synopsis
	that can handle predicates specified only after the stream has
	flowed by.

	We show how a new synopsis, called a "distinct sample", can handle
	this more general problem.   We present both analytical guarantees
	and experimental results demonstrating the effectiveness of our
	synopsis.  Moreover, we show how distinct samples have broad
	applicability in practice for session-based event recording
	environments.
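
	To make the idea of answering predicates after the fact
	concrete, here is a simplified level-based sampling sketch.  It
	is not claimed to be the synopsis from the talk: the class name,
	the capacity parameter, and the hashing scheme are assumptions.
	Each distinct value is hashed to a level (level at least l with
	probability about 2^-l), the sample keeps every distinct value at
	or above the current level, and the level rises whenever the
	sample outgrows its capacity.

```python
import hashlib


def level_of(value, max_level=32):
    """Number of trailing zero bits in a hash of `value`, capped at max_level."""
    digest = hashlib.blake2b(repr(value).encode(), digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    level = 0
    while level < max_level and (h >> level) & 1 == 0:
        level += 1
    return level


class DistinctSample:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.level = 0
        self.sample = {}  # distinct value -> its level

    def add(self, value):
        lvl = level_of(value)
        if lvl < self.level or value in self.sample:
            return
        self.sample[value] = lvl
        # When the sample is too large, raise the level and evict lower values.
        while len(self.sample) > self.capacity:
            self.level += 1
            self.sample = {v: l for v, l in self.sample.items() if l >= self.level}

    def estimate_distinct(self, predicate=lambda v: True):
        # Each retained value stands for about 2**level distinct values.
        matching = sum(1 for v in self.sample if predicate(v))
        return matching * (2 ** self.level)


synopsis = DistinctSample(capacity=64)
for value in list(range(50)) * 2:  # every value appears twice
    synopsis.add(value)
distinct_even = synopsis.estimate_distinct(lambda v: v % 2 == 0)
```

	The predicate is supplied only at query time, after the stream
	has been consumed, which is the property the abstract asks for.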


5.

Greg Humphreys (Stanford)
	Title: A Streaming Framework for Scalable Visualization on Clusters
	Slides(.ppt.gzip) 

	Abstract:  I'll be talking about my research on cluster
	graphics.  Our group has developed systems for supporting
	unmodified applications in novel display environments, as well
	as enabling cluster-parallel applications to use the aggregated
	rendering power of commodity graphics accelerators housed in the
	cluster nodes.	To make this possible, we abstract the entire
	graphics API as a stream, and manipulate these streams to form
	traditional parallel graphics pipelines using commodity parts
	as building blocks.

	I'll describe the basic parallel rendering problem, and show
	how stream manipulation has allowed us to build a very flexible
	and general system based entirely on commodity technology.
	Our software is currently in use in over 100 installations
	worldwide, and its underlying stream processing technology has
	been used to accomplish tasks other than scalability, such as
	non-photorealistic rendering.


6.

Raul Jimenez (Rutgers)
	Title:	Massive Lossless Data Compression and Multiple Parameter
	Estimation from Galaxy Spectra
	Slides(.ps.gzip) 

	Abstract:  We present a method for radical linear compression
	of datasets where the data are dependent on some number $M$ of
	parameters. We show that, if the noise in the data is independent
	of the parameters, we can form $M$ linear combinations of the
	data which contain as much information about all the parameters
	as the entire dataset, in the sense that the Fisher information
	matrices are identical; i.e. the method is lossless. We explore
	how these compressed numbers fare when the noise is dependent on
	the parameters, and show that the method, although not precisely
	lossless, increases errors by a very modest factor. The method
	is general, but we illustrate it with a problem for which it is
	well-suited: galaxy spectra, whose data typically consist of
	$\sim 10^3$ fluxes, and whose properties are set by a handful
	of parameters such as age, brightness and a parametrised star
	formation history. The spectra are reduced to a small number of
	data, which are connected to the physical processes entering the
	problem. This data compression offers the possibility of a large
	increase in the speed of determining physical parameters. This
	is an important consideration as datasets of galaxy spectra reach
	$10^6$ in size, and the complexity of model spectra increases. In
	addition to this practical advantage, the compressed data may
	offer a classification scheme for galaxy spectra which is based
	rather directly on physical processes.
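
	The lossless claim can be checked by hand in the simplest case
	of a single parameter.  The derivation below is a sketch under
	assumptions that go beyond the abstract: the data vector is
	written as $x = \mu(\theta) + n$, the noise $n$ is taken to be
	Gaussian with covariance $C$ independent of $\theta$, and
	$\mu_{,\theta}$ denotes the derivative of the mean with respect
	to $\theta$.  All of this notation is introduced here for
	illustration.

```latex
% Assumed model: x = \mu(\theta) + n, Gaussian noise n, covariance C independent of \theta.
\begin{align*}
F_x &= \mu_{,\theta}^{T} C^{-1} \mu_{,\theta}
  && \text{(Fisher information of the full data set)} \\
b &= \frac{C^{-1}\mu_{,\theta}}{\sqrt{\mu_{,\theta}^{T} C^{-1} \mu_{,\theta}}},
  \qquad y = b^{T} x
  && \text{(one linear combination of the data)} \\
\operatorname{Var}(y) &= b^{T} C\, b = 1,
  \qquad
  \frac{\partial \langle y \rangle}{\partial \theta}
  = b^{T}\mu_{,\theta}
  = \sqrt{\mu_{,\theta}^{T} C^{-1} \mu_{,\theta}} \\
F_y &= \frac{\left(b^{T}\mu_{,\theta}\right)^{2}}{b^{T} C\, b}
  = \mu_{,\theta}^{T} C^{-1} \mu_{,\theta} = F_x
\end{align*}
```

	Under these assumptions the single number $y$ carries the same
	Fisher information about $\theta$ as the whole data vector, which
	is the sense of "lossless" used in the abstract for $M = 1$.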

	Reference:

	Alan F. Heavens, Raul Jimenez, and Ofer Lahav, Massive lossless
	data compression and multiple parameter estimation from galaxy
	spectra, Mon. Not. R. Astron. Soc. 317, 965-972 (2000).


7.

Sampath Kannan (U. Penn)
	Title:  Open Problems in Data Stream Algorithmics
	Slides(.ppt.gzip) 


8.

Stephen North (AT&T Labs)
	Title:  A Large-Scale Network Visualization System

	Abstract:  Visualization is part of a feedback loop in
	analyzing large data sets.  Not just the end stage of analysis,
	visualization itself can provide new metaphors for expressing
	data, which can help spot new trends unseen by other methods
	of analysis.  We have built an interactive viewer for large data
	sets of events and transactions on networks, with the initial goal
	of operating on a day's worth of telephone call records (about
	400 million).  We then adapted the viewer to a large frame/ATM
	packet data network.  I will talk about the engineering of this
	system, its applications, and some ideas about future goals for
	such systems.


9.

Liadan O'Callaghan (Stanford)
	Title:  Clustering Data Streams
	Slides(.ppt.gzip) 

	Abstract:  Clustering a set of data points means grouping them
	according to some notion of similarity.  For example, we might
	want to cluster web pages by content, putting similar web pages
	into the same group and ensuring that web pages that are very
	different are in different groups.  We might want to classify
	phone call records so that we can identify which customers would
	benefit from certain promotions, or so that we can more easily
	identify fraud.

	For many recent clustering applications, the {\em data stream}
	model is more appropriate than the conventional data set model.
	By nature, a stored data set is an appropriate model when
	significant portions of the data are queried again and again,
	and updates are small and/or relatively infrequent.  In contrast,
	a data stream is an appropriate model when a large volume of
	data is arriving continuously and it is either unnecessary
	or impractical to store the data in some form of memory. Data
	streams are also appropriate as a model of access to large data
	sets stored in secondary memory where performance requirements
	necessitate access via linear scans.

	In the data stream model~\cite{henzinger98digital}, the data
	points can only be accessed in the order in which they arrive.
	Random access to the data is not allowed; memory is assumed to
	be small relative to the number of points, and so only a limited
	amount of information can be stored. In general, algorithms
	operating on streams will be restricted to fairly simple
	calculations because of the time and space constraints. The
	challenge facing algorithm designers is to perform meaningful
	computation with these restrictions.

	I will discuss an approximation algorithm for clustering
	according to the $k$--Median objective function; this
	algorithm, due to Charikar and Guha, is based on their local
	search algorithm for the related Facility Location problem.
	Next I will discuss improvements made by Meyerson that maintain
	most of the theoretical soundness of the original algorithm
	but allow faster clustering, given some additional (reasonable)
	assumptions.  Finally, I will discuss a general method of applying
	any clustering algorithm to the clustering of data streams, even
	if the algorithm of choice is not itself usable on data streams.
	This method itself has approximation guarantees for $k$--Median.
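
	One way to picture the general method in the last paragraph is
	a chunk-then-recluster scheme: cluster each chunk of the stream
	with any base algorithm, keep only the weighted centers, and
	finally cluster those centers.  The sketch below assumes this
	reading of the abstract.  The function names, the restriction to
	one-dimensional points, and the brute-force base algorithm are
	all choices made for this example and are not taken from the talk.

```python
from itertools import combinations


def weighted_k_median(points, k):
    """Brute-force 1-D k-median on (value, weight) pairs.

    Returns a sorted list of (center, total weight assigned to it).
    Stands in for "any clustering algorithm"; only suitable for small inputs.
    """
    values = sorted({v for v, _ in points})
    if len(values) <= k:
        merged = {}
        for v, w in points:
            merged[v] = merged.get(v, 0) + w
        return sorted(merged.items())
    best_cost, best = None, None
    for centers in combinations(values, k):
        cost = sum(w * min(abs(v - c) for c in centers) for v, w in points)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, centers
    weights = {c: 0 for c in best}
    for v, w in points:
        nearest = min(best, key=lambda c: abs(v - c))
        weights[nearest] += w
    return sorted(weights.items())


def cluster_stream(stream, k, chunk_size):
    """Cluster each chunk, keep its weighted centers, then cluster the centers."""
    summary, chunk = [], []
    for x in stream:
        chunk.append((x, 1))
        if len(chunk) == chunk_size:
            summary.extend(weighted_k_median(chunk, k))
            chunk = []
    if chunk:
        summary.extend(weighted_k_median(chunk, k))
    return weighted_k_median(summary, k)


stream = [1, 2, 3, 100, 101, 102, 1, 2, 3, 100, 101, 102]
centers = cluster_stream(stream, k=2, chunk_size=6)
```

	Only one chunk and the list of weighted centers are held at any
	time.  In this sketch the list of centers still grows with the
	number of chunks; a fuller version would recluster that list
	whenever it becomes too large.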


10.

Jennifer Rexford (AT&T Labs)
	Title: Computing Traffic Demands From Flow-Level Measurements
	Slides(.ppt.gzip) 

	Abstract:  Controlling the distribution of traffic in a large
	Internet Service Provider (ISP) backbone requires an
	understanding of the network topology, traffic demands, and
	routing policies.  Shifts in user behavior, the emergence of
	new applications, and the failure of network elements can
	result in significant (and sudden) fluctuations in the
	distribution of traffic across the backbone.  Network operators
	can alleviate congestion by adapting the routing configuration
	to the prevailing traffic.  This talk describes how to acquire
	an accurate, network-wide view of the traffic demands on an
	operational ISP backbone.  We show that traffic demands can be
	computed based on flow-level measurements collected at each
	entry point to the provider network and information about the
	destinations reachable from each exit point.  We describe our
	experiences applying this methodology to routing and
	measurement data collected in the AT&T IP Backbone.
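
	The core aggregation step can be sketched in a few lines.  The
	example assumes that each flow record has already been reduced
	to an entry point, a destination prefix, and a byte count, and
	that reachability is given as a map from prefix to the set of
	exit points.  These record layouts, the function name, and the
	router and prefix names are hypothetical; an operational system
	would also need to match destination addresses to prefixes and
	handle measurement gaps.

```python
from collections import defaultdict


def compute_demands(flows, egress_for_prefix):
    """Sum flow volume per (entry point, set of candidate exit points).

    flows: iterable of (ingress, dest_prefix, num_bytes) tuples (hypothetical layout).
    egress_for_prefix: dict mapping dest_prefix -> iterable of exit points.
    """
    demands = defaultdict(int)
    for ingress, dest_prefix, num_bytes in flows:
        # Destinations with no known exit point map to the empty set.
        egress_set = frozenset(egress_for_prefix.get(dest_prefix, ()))
        demands[(ingress, egress_set)] += num_bytes
    return dict(demands)


flows = [
    ("nyc", "10.1.0.0/16", 500),
    ("nyc", "10.2.0.0/16", 200),
    ("sfo", "10.1.0.0/16", 300),
    ("nyc", "10.1.0.0/16", 100),
]
egress_for_prefix = {
    "10.1.0.0/16": ["chi", "dal"],
    "10.2.0.0/16": ["chi"],
}
demands = compute_demands(flows, egress_for_prefix)
```

	Keying each demand on the set of possible exit points, rather
	than on a single exit, keeps the result independent of the
	current routing configuration, so the same demands can be reused
	when evaluating a different configuration.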

	References:

	1)   Anja Feldmann, Albert Greenberg, Carsten Lund, Nick Reingold,
	Jennifer Rexford, and Fred True, Deriving traffic demands for
	operational IP networks: Methodology and experience, IEEE/ACM
	Transactions on Networking, June 2001, pp. 265-279. 

	2)   Anja Feldmann, Albert Greenberg, Carsten Lund, Nick Reingold, and
  	Jennifer Rexford, NetScope: Traffic engineering for IP networks,
	IEEE Network Magazine, March/April 2000, pp. 11-19.


11.

Anne Rogers (AT&T Labs)
	Title:	Analyzing Transaction Streams with Hancock

	Abstract:  Massive transaction streams present a number of
	opportunities for data mining techniques. Transactions might
	represent calls on a telephone network, commercial credit card
	purchases, stock market trades, or HTTP requests to a web server.
	While historically such data have been collected for billing
	or security purposes, they are now being used to discover how
	clients use the underlying services.

	For several years, we have computed evolving profiles (called
	signatures) of the clients in transaction streams using
	handwritten C code.  The signature for each client captures
	the salient features of the client's transactions through time.
	These programs were carefully hand-optimized to ensure that the
	data could be processed in a timely fashion.  They achieved
	the necessary performance but at the expense of readability,
	which led to programs that were difficult to verify and maintain.

	Hancock is a domain-specific language created to analyze
	transaction streams efficiently without sacrificing readability.
	In this talk, I will describe the obstacles to computing with
	large streams and explain how Hancock addresses these problems.

	Hancock is joint work with Corinna Cortes, Kathleen Fisher,
	Karin Hogstedt, Daryl Pregibon, and Fred Smith.


12.

Martin Strauss (AT&T Labs)
	Title:  Fast, Small-Space Algorithms for Approximate Histogram
	Maintenance
	
	Abstract:  Consider an array, A, of N values, defined implicitly by a
	stream of updates of the form "add/subtract 3 to/from A[5]".
	We find a B-bucket piecewise-constant representation H for A whose
	sum-square-error is at most (1+epsilon) times the error of the
	optimal representation.  We consider three computational costs:
	the time to process an update, the time to reconstruct H from the
	synopsis data structure, and total space used; each of these is
	polynomial in B, log(N), and 1/epsilon.  We also give extensions
	to piecewise-linear representations, Haar wavelet representations,
	and piecewise-constant representations under absolute (L1) error.
	Recently this problem has received attention, especially in
	the database community, where it corresponds to the problem of
	maintaining approximate histograms.
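
	To make the objective concrete, the sketch below builds an
	array from a list of updates and then finds the B-bucket
	piecewise-constant representation with the smallest
	sum-square-error by offline dynamic programming.  This is not
	the small-space streaming algorithm from the talk: it stores the
	whole array and is shown only to define the optimum that the
	streaming algorithm approximates.  The function names are
	assumptions for this example.

```python
def apply_updates(n, updates):
    """Build A from updates of the form (index, delta)."""
    a = [0.0] * n
    for index, delta in updates:
        a[index] += delta
    return a


def best_histogram(a, num_buckets):
    """Return (error, buckets) for the optimal piecewise-constant fit.

    Each bucket is (start, end, value) covering a[start:end].
    Requires 1 <= num_buckets <= len(a).
    """
    n = len(a)
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, x in enumerate(a):
        prefix[i + 1] = prefix[i] + x
        prefix_sq[i + 1] = prefix_sq[i] + x * x

    def sse(i, j):
        # Sum-square-error of representing a[i:j] by its mean.
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(1, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j], cut[b][j] = c, i

    # Walk the recorded cut points backwards to recover the buckets.
    buckets, j = [], n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        buckets.append((i, j, (prefix[j] - prefix[i]) / (j - i)))
        j = i
    buckets.reverse()
    return cost[num_buckets][n], buckets


a = apply_updates(8, [(i, 1) for i in range(3)] + [(i, 5) for i in range(3, 6)]
                  + [(6, 9), (7, 12), (7, -3)])
error, buckets = best_histogram(a, 3)
```

	The dynamic program takes time quadratic in N per bucket and
	space linear in N, which is exactly what a streaming algorithm
	with cost polynomial in B, log(N), and 1/epsilon has to avoid.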

	Our work advances previous streaming work on arrays because we
	estimate the entire array, not just a statistical summary such
	as a norm or quantile.  We hope our results will be more useful
	in practice than simple summaries.  Furthermore, unlike a norm,
	which has a concise, data-independent, declarative definition,
	the best representation for A is defined as the optimum over the
	exponentially-large set of all representations.  Thus estimating
	it quickly requires using a number of optimization techniques.

	Joint work with Anna C. Gilbert, Sudipto Guha, Piotr Indyk,
	Yannis Kotidis, and S. Muthukrishnan.


13.

Nitin Thaper (MIT)
	Title: Space-Efficient Algorithms for Maintaining
	Multi-Dimensional Histograms
	Slides(.ps.gzip) 

	Abstract:  Maintaining approximate histograms of data has
	recently attracted a significant amount of attention in the
	database community.  However, most of the research has focused
	on the case of one-dimensional data. In this talk we present
	novel algorithms for maintaining histograms of data in two or
	more dimensions, as well as their experimental evaluation.
	Unlike earlier work in this area, our algorithms also have
	*provable* guarantees on their performance.

	This is joint work with S. Guha, P. Indyk, and N. Koudas.




Working Group Presentations:
1.

Corinna Cortes (AT&T Labs - Research)
        Title: Communities of Interest

	Reference:

	Corinna Cortes, Daryl Pregibon, and Chris Volinsky,
	Communities of Interest, In 4th Int'l. Symp. on 
	Intelligent Data Analysis (IDA 2001), Lisbon, Portugal,  2001.

Document last modified on April 16, 2002.