CCR/DIMACS Workshop on Mining Massive Data Sets and Streams: Mathematical Methods and Algorithms for Homeland Defense

June 17 - 19, 2002
Center for Communications Research (CCR), Princeton, NJ

Organizers:: Bob Grossman,chair, University of Illinois and Two Cultures Group, grossman@uic.edu; Paul Kantor, Rutgers University, kantorp@cs.rutgers.edu; Muthu Muthukrishnan, AT&T Labs - Research and Rutgers University, muthu@cs.rutgers.edu
Workshop Coordinator:: Dolores Koch, DJK@idaccr.org

Co-sponsored by the Center for Communications Research (CCR) and DIMACS.

Abstracts:

A variety of massive data sets exhibit an underlying structure that can be modeled as dynamic weighted multi-digraphs. Their sizes range from tens of gigabytes to petabytes. The sheer volume of these graphs brings with it an interesting series of computational and visualization challenges.

We will discuss external memory algorithms for connectivity, minimum spanning trees and maximal matchings together with heuristics for quasiclique finding and diameter computations. From the visualization store we will describe an external memory hierarchy that allow us to use computer graphics techniques like dynamic visibility to provide navigation control. We will present experimental results with graphs having on the order of 200 million vertices and we will point out some mathematical problems that have surfaced along the way.

(some of this work has been done in cooperation with A. Buschbaum, J. Korn, M. Kreuseler, S. Krishnan, P. Pardalos, M. Resende, S. Sudarsky, A. Ucko, and J. Westbrook)

Moses S. Charikar, Princeton University

Title: Algorithmic Techniques for Clustering in the Streaming Data Model

Clustering can be viewed as an optimization problem where the the goal is to cluster the data so as to optimize the quality of the clustering measured by an objective function. Such problems have traditionally been studied in the offline model assuming arbitrary access to the data to be clustered. We will present some recent results on clustering in the streaming data model. The goal here is to produce an implicit description of the clusters in one pass over the data using a small amount of storage so that the corresponding clustering solution is approximately optimal.

To facilitate clustering in a streaming fashion, our model assumes that data items are represented in some form (e.g. as sets or vectors) so that the similarity (or distance) between pairs of items can be computed from their representations. A closely related issue is the design of compact representation schemes that allow possibly complicated data items (e.g. large documents or images) to be represented in a small amount of space so that the similarity between data items can be estimated from their compact representations. We will briefly discuss some recently developed schemes to achieve this.

Graham Cormode, University of Warwick

Title: Algorithmic Embedding for Comparing Large Text Streams

Texts are ubiquitous in daily life, varying in size from small (SMS and email) to potentially immense (automatically generated reports, biological sequences). When scanning for similarities between new data and previously stored examples, we need a model that takes account of how texts are changed: pieces are inserted or deleted, and sections are moved around. With a large number of large texts, trying to compare all pairs is not feasible, and for the largest texts we may not even be able to hold the whole of any text in memory at once. We describe an approach to approximately comparing large texts quickly and in sublinear space. It relies on finding combinatorial structures present in any string of characters, and generating a vector representation. This allows rapid comparison of sequences based on a succint representation, and the application of clustering, nearest neighbor searching, and other data mining techniques.

Allen Gorin, AT&T

Semantic Information Processing of Spoken Language in Dialog

The next generation of voice-based user interface technology enables easy-to-use automation of new and existing communication services, achieving a more natural human-machine interaction. By natural, we mean that the machine understands what people actually say, in contrast to what a system designer expects them to say. This approach is in contrast with menu-driven or strongly-prompted systems, where many users are unable or unwilling to navigate such highly structured interactions. AT&T's 'How May I Help You?' (tm) technology shifts the burden from human to machine, wherein the system adapts to peoples' language, as contrasted with forcing users to learn the machine's jargon. This technology has been nationally deployed by AT&T for customer care, handling millions of calls each month. In this talk I will describe methods for mining and modeling of speech, language and dialog information from this natural spoken dialog system.

Johannes Gehrke, Cornell University

Title: Distributed Mining and Monitoring

We are witnessing the emergence of a new class of applications that is best characterized as monitoring applications. Examples are homeland security services, environmental observation, surveillance and tracking, network management, and sensor data management. Monitoring applications are heavily network-centric, and need to process high-speed data streams in real time with a potentially huge number of complex continuous queries. The current generation of data management architectures and data mining systems are completely inadequate for monitoring applications.

In this talk I will address several technical challenges in building monitoring applications with an emphasis on (1) distributed processing of high-speed data streams, and (2) a framework for quantifying and defining the difference between two datasets.

Sudipto Guha, University of Pennsylvania

Title: Learning Mixture of Markov Chains

We consider the problem of inferring a ``mixture of Markov Chains'' based on observing a stream of interleaved outputs from these chains. In the first variant we consider, the mixing process chooses one of the Markov chains independently according to a fixed set of probabilities at each point, and only that chain makes a transition and outputs its new state. For this case, when the individual Markov Chains have disjoint state sets we show that a polynomially-long stream of observations is sufficient to infer arbitrarily good approximations to the correct chains. We also explore a generalizations.

Geoff Hulten, University of Washington

Title: A General Framework for Mining Very Large Databases and Data Streams

Much work in KDD has focused on scaling machine learning and statistical algorithms to large databases. The goal is generally to obtain algorithms whose running time is linear (or near-linear) in the size of the database, and that only access the data sequentially. So far this has been done mainly for one algorithm at a time, in a slow and laborious process. In this talk we propose a scaling-up method that is applicable to essentially any induction algorithm based on discrete search. The result of applying the method to an algorithm is that its running time becomes independent of the size of the database, while the decisions made are essentially identical to those that would be made given infinite data. The method works within pre-specified memory limits and, as long as the data is iid, only requires accessing it sequentially. It gives anytime results, and can be used to produce batch, stream, time-changing and active-learning versions of an algorithm. We have applied the method to learning decision trees and Bayesian networks, developing algorithms that are substantially faster than previous ones, while achieving essentially the same predictive performance. We observe these gains on a series of large databases generated synthetically, from benchmark networks, and on a Web logs containing 100 million requests.

Marco Mazzucco, University of Illinois at Chicago

Title: Merging of High Bandwidth Data Streams

Merging two data streams on their keys is one of the fundamental problems in data mining streams. We present practical results for merging streams over optical networks. We examine two algorithms for merging and measure their real world performance. We discuss the limitations of each algorithm and discuss which algorithm is best suited for which data stream enviroment.

Rafail Ostrovsky, Telcordia Technologies

Title: Content-Sensitive Fingerprinting and its Applications

A standard notion of a hash-function is the one that maps all documents to short "random-looking" outputs. That is, if two documents differ only in a few bits, a "classical" hash function is geared to output "unrelated" results. In many settings, there is a need for hash-functions which produce "similar" short outputs for "similar" documents. This is especially relevant when searching for a "related" documents in a large database, where one must find answers faster then searching through the entire database, with applications to information distillation and data mining algorithms. In this talk, I will show how to construct such a hash function (for a number of different metric spaces) and describe the underling ideas of the algorithm. I will then show applications of such a hash function to approximate searching, dimension-reduction, clustering, approximate matching, facility location, nearest neighbor search and other approximation problems. The talk will be self-contained.

H. Vincent Poor, Princeton University

Title: Change Detection: A Tutorial Overview

Many problems in data mining involve detecting the sudden onset of anomolous behavior in time series data. After a brief discussion of several motivating applications in this area, the fundamentals underlying algorithms for this purpose will be reviewed.

Daryl Pregibon, AT&T

Title: Graph Mining: Discovery in Large Networks

Large financial and telecommunication networks provide a rich source of problems for the data mining community. The problems are inherently quite distinct from traditional data mining in that the data records, representing transactions between pairs of entities, are not independent. Indeed, it is often the linkages between entities that are of primary interest. A second factor, network dynamics, induces further challenges as new nodes and edges are introduced through time while old edges and nodes disappear.

We discuss our approach to representing and mining large sparse graphs. Several applications in telecommunications fraud detection are used to illustrate the benefits our approach.

Rebecca Wright, DIMACS

Title: Protecting Privacy in Data-mining Applications

Data mining is concerned with revealing information contained in some data. However, in some data mining applications, it is also desirable to protect information other than the computed output at the same time. Doing so can provide protection of personal information, protection of sensitive information, and foster collaboration between different agencies. For example, two different governmental agencies may be more willing to work together to compute outputs of their joint data if they need not reveal that data to each other in order to do so. In this talk, I will discuss some recent work in two areas: secure multiparty computation of approximations and privacy-preserving statistics computations.

Secure multiparty approximations. Approximations are often useful to reduce computation and communication costs in a distributed setting where the inputs are held by different parties and are extremely large. Secure multiparty computation of approximations addresses applications in which the parties want to cooperate to compute a function of their inputs without revealing more information than they have to. Secure multiparty computation is a well-studied tool for computing exact functions without revealing unnecessary information, but these definitions can not be applied directly to approximation functions without a potential loss of privacy. We extend these definitions to apply to secure multiparty approximate computations, and we present an efficient, sublinear-communication, private approximate computation for the Hamming distance; we also give an efficient, polylogarithmic-communication solution for the L2 distance in a relaxed model.

Privacy-preserving statistics: Suppose a client wishes to compute some aggregate statistics on a privately-owned data base. The data owner wants to protect the privacy of the personal information in the data base, while the client does not want to reveal his selection criteria. Privacy-protecting statistical analysis allows the client and data owner to interact in such a way that the client learns the desired aggregate statistics, but does not learn anything further about the data; the data owner leans nothing about the client's query. Motivated by this application, we consider the more general problem of "selective private function evaluation," in which a client can privately compute an arbitrary function over a database. We present various approaches for constructing efficient selective private function evaluation protocols, both for the general problem and for privacy-protecting statistical analysis.

(This talk includes joint work with Ran Canetti, Joan Feigenbaum, Yuval Ishai, Ravi Kumar, Tal Malkin, Kobbi Nissim, Michael Reiter, Ronitt Rubinfeld, and Martin Strauss.)