Synopsis Data Structures for Dynamic Massive Data Sets

Yossi Matias


             Bell Laboratories and Tel Aviv University

                    matias@research.bell-labs.com

Data warehouses with hundreds of gigabytes or more of raw data are becoming commonplace. There is a growing need for algorithms and data structures that enable fast response times for various classes of queries on such data. Existing data structures are often ineffective on large data sets. Indeed, the traditional viewpoint in the algorithms literature -- that a linear space data structure enabling constant response times is a good one, if not "optimal" -- fails to hold for massive data sets.

In this talk, we describe a framework for algorithmic work relevant to massive data sets. We introduce, study, and advocate the use of "synopsis" data structures, which use very little space to capture the demographics of massive data sets. Synopsis data structures are designed to support fast, and typically approximated, answers to queries from a pre-specified classes of queries. We discuss several concrete examples of synopsis data structures, and describe fast algorithms for keeping them up-to-date in the presence of online updates to the data set. Example data structures include compressed histograms, backing samples, and concise samples; example applications include maintaining frequency moments, and maintaining the most frequently occurring values in a dynamic data (multi-)set. We present both analytical and experimental evidence of the effectiveness of our data structures and algorithms.

(Joint work with Phil Gibbons. Some of the work is also joint with Noga Alon, Christos Faloutsos, Vishy Poosala, Avi Silberschatz, and Mario Szegedy.)