DIMACS Workshop on Sublinear Algorithms

September 24 - 27, 2000
Nassau Inn, Princeton, New Jersey

Shafi Goldwasser, MIT, shafi@THEORY.LCS.MIT.EDU
Prabhakar Raghavan, Verity, pragh@verity.com
Ronitt Rubinfeld, NEC Research Institute, ronitt@research.nj.nec.com
Martin Strauss, AT&T Labs - Research, mstrauss@research.att.com
Presented under the auspices of the Special Year on Massive Data Sets, the Special Year on Computational Intractability and
Exploratory Workshops/Special Programs.



Speaker: Tugkan Batu
Title: Fast Approximate PCPs for Multidimensional Bin-Packing Problems

We consider approximate PCPs for multidimensional bin-packing
problems.  In particular, we show how a verifier can be quickly
convinced that a set of multidimensional blocks can be packed into a
small number of bins.  The running time of the verifier is bounded by
$O(T(n))$, where $T(n)$ is the time required to test for {\em
heaviness}.  We give heaviness testers that can test heaviness of an
element in the domain $[1,\ldots,n]^d$ in time $O((\log n)^d)$.  We
also also give approximate PCPs with efficient verifiers for
recursive bin packing and multidimensional routing.

2. Cache-Oblivious Algorithms and Data Structures Michael A. Bender SUNY Stony Brook We present methods for constructing cache-oblivious algorithms and data structures. Cache-oblivious algorithms run efficiently on a hierarchical memory, even though they completely avoid any memory-specific parameterization (such as cache block sizes or access times). We first review the regular and static problems in the literature. Then we show new methods for designing memory efficient cache-oblivious algorithms for dynamic and irregular problems, such as data structures. We show how to design a cache-oblivious B-tree, which matches the performance of the standard B-tree. Then we outline future directions in this area. Joint work with Erik Demaine and Martin Farach-Colton.
3. Speaker: Adam L. Buchsbaum, AT&T Labs--Research Title: External Memory Algorithms Most current external memory algorithmic work is based on the I/O model of Aggarwal and Vitter (1988) or the parallel disk model of Vitter and Shriver (1994), both of which trace their roots to work done by Demuth, Floyd, and Knuth in the 1960s and 1970s. In this talk, I will briefly survey these two models and some of the main results derived from them: algorithms for sorting, priority queue manipulation, and classical graph problems like connected components and minimum spanning trees. While much progress has been made on these problems in the last decade, experience and intuition suggest that some of the core problems involved are intrinsically hard in these models. New models for these problems thus seem necessary.
4. Speaker: Graham Cormode Title: Short string signatures Abstract: When communication between two parties is the resource, sublinearity is vital to improve on the trivial solution of one party sending their data wholesale to the other. In the particular case of approximating distances between strings, sublinear "signatures" can be efficiently computed and exchanged. These signatures have applications beyoned communication scenarios, in particular in approximate nearest neighbor algorithms.
5. Speaker: Artur Czumaj Title: Property Testing in Computational Geometry Abstract: We consider the notion of property testing as applied to computational geometry. We aim at developing efficient algorithms which determine whether a given (geometrical) object has a predetermined property or is 'far' away from any object having the property. We show that many basic geometric properties have very efficient testing algorithms, whose running time is significantly smaller than the object description size. 6. Speaker: William DuMouchel Title: Data Squashing: Constructing Summary Data Sets Abstract: One of the chief obstacles to effective data mining is the clumsiness of managing and analyzing data in very large files. The process of model search and model fitting often require many passes over a large dataset, or random access to the elements of a large dataset. Many statistical fitting algorithms assume that the entire dataset being analyzed fits into computer memory, restricting the number of feasible analyses. Here we define "large dataset" as one that cannot be analyzed using some particular desired combination of hardware and software because of computer memory constraints. There are two basic approaches to this problem: either switch to a different hardware/software/analysis strategy or else substitute a smaller dataset for the large one. Here we assume that the former strategy is unavailable or undesirable and consider ways of constructing a smaller substitute dataset. This latter approach was named data squashing by DuMouchel, Volinsky, Johnson, Cortes and Pregibon [KDD99]. A trivial form of data squashing is to take a simple random sample of the rows of the data set, which is often used as a comparison method. This presentation will summarize the results of three recent papers, which describe very different methodologies and theoretical rationales for data squashing. Comparing and contrasting them helps to define the limits of data squashing and to suggest future directions for data squashing research.
7. Speaker: Eldar Fischer, NEC Research Institute, Princeton. Title: On the strength of comparisons in property testing Abstract: An $\epsilon$-test for a property $P$ of functions from ${\cal D}=\{1,\ldots,d\}$ to the positive integers is a randomized algorithm, which makes queries on the value of an input function at specified locations, and distinguishes with high probability between the case of the function satisfying $P$, and the case that it has to be modified in more than $\epsilon d$ places to make it satisfy $P$. We prove that an $\epsilon$-test for any property defined in terms of the order relation, such as the property of being a monotone nondecreasing sequence, cannot perform less queries (in the worst case) than the best $\epsilon$-test which uses only comparisons between the queried values. In particular, we show that an adaptive algorithm for testing that a sequence is monotone nondecreasing performs no better than the best non-adaptive one, with respect to query complexity. From this follows a tight lower bound on tests for this property.
8. Speaker: Phillip B. Gibbons, Bell Labs Title: Synopsis Data Structures for Massive Data Sets Abstract: Massive data sets with terabytes of data are becoming commonplace. There is an increasing demand for algorithms and data structures that provide fast response times to queries on such data sets. In this talk, we argue that traditional algorithmic work is unsuitable for massive data sets, and that even work designed for massive data sets (i.e., the large body of literature on external memory algorithms) is inadequate. Specifically, both bodies of literature fail to consider "synopsis" data structures, which use very little space and provide much faster (typically approximated) answers to queries. Synopsis data structures obtain their speed advantage by being substantively smaller than their base data sets, permitting them to reside higher in the memory hierarchy and also making them suitable for queries on streaming and remote data. The design and analysis of effective synopsis data structures offer many algorithmic challenges. We will describe a number of concrete examples of synopsis data structures, as well as fast algorithms for keeping them up-to-date in the presence of online updates to the data sets. We also introduce a framework / cost model for studying work relevant to massive data sets. Finally, we show how this work is being incorporated into Aqua, the first system to provide fast, highly-accurate approximate answers for a broad class of queries arising in data warehousing scenarios.
9. SPEAKER: Oded Goldreich TITLE: An Introduction to Property Testing ABSTRACT: Property Testing is concerned with the question of deciding whether a given object has a predetermined property or is ``far'' from any object having the property. Specifically, objects are modeled by functions (or oracles), and distance between functions is measured as the fraction of the domain on which the functions differ. We consider (randomized) algorithms which may query the function at arguments of their choice, and seek algorithms which query the function at relatively few places. This yields a natural notion of approximation for decision problems, with focus on sublinear-time algorithms. Property Testing plays a central role in PCP, where one typically tests whether a given oracle is a codeword (w.r.t some fixed code) or is far from any codeword. In contrast, our focus will be on testing properties of objects represented in a natural way (rather than by an error-correcting code). Specifically, we will focus on testing properties of graphs, when represented either by adjacency matrix or by incidence lists. These two standard representations of graphs yield two different models for testing graph properties. The difference between the models will be illustrated by considering the problem of testing Bipartiteness in them.
10. Speaker: Yuval Ishai, DIMACS and AT&T Labs Research Title: Reducing the Servers' Computation in Private Information Retrieval: PIR with Preprocessing Abstract Private information retrieval (PIR) enables a user to retrieve a specific data item from a database, replicated among one or more servers, while hiding from each server the identity of the retrieved item. Since the introduction of the problem in '95 by Chor et al., several protocols with sub-linear communication were suggested. However, in all these protocols the servers' computation for each retrieval is at least linear in the size of entire database, even if the user requires just one bit. In this work we study the computational complexity of PIR. In the standard PIR model, where the servers hold only the database, linear computation cannot be avoided. To overcome this problem we propose the model of {\em PIR with preprocessing}: Before the execution of the protocol each server may compute and store polynomially-many information bits regarding the database; later on, this information should enable the servers to answer each query of the user with more efficient computation. We demonstrate that preprocessing can significantly save work in multi-server PIR protocols, and prove a lower bound on the work of the servers when they are only allowed to store a small number of extra bits. Finally, we present some alternative approaches to saving computation, by batching queries or by allowing off-line interaction between users and servers. While generally more restrictive than the preprocessing model, these alternative approaches apply in the single-server case as well. Joint work with Amos Beimel and Tal Malkin.
11. Speaker: Michael Krivelevich, Tel Aviv University, Israel Title: Using the Regularity Lemma for Property Testing ABSTRACT: The Regularity Lemma, proved by Endre Szemeredi in the mid-seventies, is commonly acknowledged by now to be one of the most powerful tools of Modern Combinatorics. Great many beautiful theorems have been proven by applying this result. Roughly speaking, the Regularity Lemma guarantees that any graph can be partitioned into a bounded number of equal parts so that most of the bipartite graphs between the partition pieces behave essentially like random graphs of a corresponding density. Recently, the relevance of the Regularity Lemma to Graph Property Testing has been revealed by Alon, Fischer, Krivelevich and Szegedy, who described a wide class of graph properties whose testability can be proven by applying the Regularity Lemma. In this (survey-type) talk I will introduce the Regularity Lemma and will indicate how it can be applied to prove testability of various graph properties. I will also discuss the above mentioned result of Alon et al. No previous experience with the Regularity Lemma will be assumed.
12. Speaker: Frederic Magniez Title: Multi-linearity Self-Testing with Relative Error Abstract: We investigate self-testing programs with relative error by allowing error terms proportional to the function to be computed. Until now, in numerical computation, error terms were assumed to be either constant or proportional to the $p$-th power of the magnitude of the input, for $p\in[0,1)$. We construct new self-testers with relative error for real-valued multi-linear functions defined over finite rational domains. The existence of such self-testers positively solves an open question in~\cite{kms99}. Moreover, our self-testers are very efficient: they use few queries and simple operations. \bibitem[KMS99]{kms99} M.~Kiwi, F.~Magniez, and M.~Santha. \newblock Approximate testing with relative error. \newblock In {\em Proc. 31st STOC}, pages 51--60, 1999.
13. Speaker: Nina Mishra Fox Title: Sublinear Approximate (PAC) Clustering Abstract: We present a sampling-based algorithm for approximate k-median clustering that improves upon previous approximation results in the case that the number of points n being clustered is the dominant problem parameter, assuming a distance metric on [0,M]^d. Unlike previous approximate clustering results, our work has time complexity independent of n. This independence is very attractive for the large clustering problems typically associated with large data sources like the web, phone records or transactional data. We also provide a generalization of the algorithm for use when no bound M is known for the points being clustered, and we describe how a previously developed dimension reduction technique can be applied to eliminate our dependence on the dimension when d > log n. We also apply our sampling approach to the related problem of property testing, i.e., given A and k, determining whether there exists a k-median clustering with average distance from points to centers A. A new definition of clustering similar in spirit to the PAC model of learning is developed that allows for clustering (possibly infinite) probability distributions and naturally allows for conceptual descriptions of clusters. We show that the k--median problem is PAC-clusterable, and conclude with an algorithm for PAC-clustering disjoint k-term DNF expressions. Joint work with Dan Oblinger and Leonard Pitt.
14. Title: Optimal aggregation algorithms for middleware Speaker: Moni Naor Let D be a database of N objects where each object has m fields. The objects are given in m sorted lists (where the ith list is sorted according to the ith field). Our goal is to find the top k objects according to a monotone aggregation function t, while minimizing access to the lists. The problem arises in several contexts. In particular Fagin (JCSS 1999) considered it for the purpose of aggregating information in a multimedia database system. We are interested in instant optimality, i.e. that our algorithm will be as good as any other (correct) algorithm on any instance. We give an elegant and simple algorithm (the Threshold Algorithm) that is optimal on every instance, i.e. for every database and for any monotone aggregation function (with some natural restrictions). Joint work with Ron Fagin, Michael Franklin and Amnon Lotem
15. Speaker: Sofya Raskhodnikova Title: Monotonicity Testing Abstract: We present algorithms for testing monotonicity of functions. Namely, given the ability to query an unknown function $\genf$, where $\Sigma$ and $\Xi$ are finite ordered sets, the test always accepts a monotone $f$, and rejects $f$ with high probability if it is $\e$-far from being monotone (i.e., every monotone function differs from $f$ on more than an $\e$ fraction of the domain). For any $\e>0$, the query complexity of the test is $O((n/\e) \cdot \log |\Sigma|\cdot \log |\Xi|)$. The talk is based on "Improved Testing Algorithms for Monotonicity" by Y. Dodis, O. Goldreich, E. Lehman, S. Raskhodnikova, D. Ron, and A. Samorodnitsky from RANDOM '99.
16. Speaker: Dana Ron Title: Testing of Clustering Abstract: A set $X$ of points in $\Re^d$ is $(k,b)${\em-clusterable\/} if $X$ can be partitioned into $k$ subsets ({\em clusters\/}) so that the diameter (alternatively, the radius) of each cluster is at most $b$. We present algorithms that by sampling from a set $X$, distinguish between the case that $X$ is $(k,b)$-clusterable and the case that $X$ is $\epsilon$-far from being $(k,b')$-clusterable for any given $0<\epsilon\leq 1$ and for $b' \geq b$. In $\epsilon$-far from being $(k,b')$-clusterable we mean that more than $\epsilon \cdot |X|$ points should be removed from $X$ so that it becomes $(k,b')$-clusterable. We give algorithms for a variety of cost measures that use a sample of size independent of $|X|$, and polynomial in $k$ and $1/\epsilon $. Our algorithms can also be used to find {\em approximately good\/} clusterings. Namely, these are clusterings of all but an $\epsilon$-fraction of the points in $X$ that have optimal (or close to optimal) cost. The benefit of our algorithms is that they construct an {\em implicit representation\/} of such clusterings in time independent of $|X|$. That is, without actually having to partition all points in $X$, the implicit representation can be used to answer queries concerning the cluster any given point belongs to.
17. Speaker: Christian Sohler Title: Soft Kinetic Data Structures Authors: Artur Czumaj & Christian Sohler We introduce the framework of Soft Kinetic Data Structures (SKDS). A soft kinetic data structure is an approximate data structure that can be used to answer queries on a set of moving objects with unpredictable motion. We analyse the quality of a soft kinetic data structure by giving a competetive analysis with respect to the dynamics of the system. We illustrate our approach by presenting soft kinetic data structures for maintaining classical data structures: sorted arrays, balanced search trees, heaps, and range trees. We also describe soft kinetic data structures for maintaining the Euclidean minimum spanning trees.
18. Speaker: Martin J. Strauss, AT&T Labs Title: Secure Multiparty Computation of Approximations Abstract: Approximation algorithms can sometimes be used to obtain efficient solutions where no efficient exact computation is known. Furthermore, for some applications, several parties will want to cooperate to compute a function of their inputs without revealing more information than necessary. For the first time, we present definitions and protocols of secure multiparty approximate computation that show how to realize most of the cost savings available by using an approximation instead of an exact computation of a function, f, without revealing more than necessary to compute f exactly. We make three contributions. First, we give formal definitions of secure multiparty approximate computations. Second, we introduce some general techniques for constructing secure multiparty approximations. Finally, we present an efficient, sublinear-communication, private approximate computation for the Hamming and L^1 distances. This is joint work with Joan Feigenbaum, Jessica Fong, and Rebecca Wright.
19. Speaker: Luca Trevisan, UC Berkeley Title: Sublinear Time Error-Correction and Error-Detection Abstract: Error-correcting codes based on low-degree multivariate polynomials admit efficient procedures to decode parts of the message given a corrupted codeword (by "polynomial reconstruction" algorithms) and to test whether a given string is close or not to a codeword (by "low degree test" algorithms). Constructing codes with polylogarithmic time, or even constant-time, decoding algorithms is useful in several settings, including worst-case to average-case hardness reductions and private information retrieval. For this problem, constructions not based on polynomials are known, as well as some negative results (giving a trade-off between the rate of the code and the efficiency of the decoding). We will present the known results and the main open questions in this area. Codes with efficient error-detecting procedures (along with additional properties) are at the hearth of PCP constructions. Only constructions based on polynomials are known in this setting, the analysis of testing algorithms tends to be quite hard, and no negative results are known. Positive results are much better than in the setting of error-correction, and some separation results between the two settings are known. We will present some codes not based on polynomials and some folklore testing algorithms for them, whose analysis is still an open question. In the long term, the study of such codes could yield a simpler proof and new insight on the PCP theorem.
20. Speaker: Mahesh Viswanathan Title: Testing of Data Streams Two programming paradigms for massive data-sets are sampling and streaming. Rather than take time even to read a massive data-set, a sampling algorithm extracts a small random sample and computes on it. By constrast, a streaming algorithm takes the time to read all the input, but little more time and little total space. Input to a streaming algorithm is a sequence of items; the streaming algorithm is given the items in order, lacks the space to record more than a few bits of the input, and is required to perform its per-item processing quickly in order to keep up with its unbuffered input. We consider the task of testing properties of large data-sets. We ask whether every property that can be efficiently tested by sampling testers can be efficiently tested by streaming testers. Though the notions of efficiency for sampling and streaming testers are not directly comparable, there are interesting comparisons to make. We present sampling and streaming testers for illustrative examples. (Joint work with Joan Feigenbaum, Sampath Kannan, and Martin Strauss)
21. Speaker: Patrick White Title: Testing that Distributions are Close (with Applications to Markov Chains Abstract: Given two distributions over an $n$ element set, we wish to check whether these distributions are statistically close by only sampling. We give a sublinear algorithm which uses $O(n^{2/3} \epsilon^{-4} \log n)$ independent samples from each distribution, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small ( less than $\frac{\epsilon^2}{32 \root{3}{n}$) or large (more than $\epsilon$) in $L_1$-distance. We also give an $\Omega(n^{2/3}\epsilon^{-2/3})$ lover bound. Our algorithm has application to the problem of checking whether a given Markov process is rapidly mixing. We develop sublinear algorithms for this problem as well.
Other Workshops
DIMACS Homepage
Contacting the Center
Document last modified on November 2, 2000.