### DIMACS Workshop on Sublinear Algorithms

#### September 24 - 27, 2000 Nassau Inn, Princeton, New Jersey

Organizers:
Shafi Goldwasser, MIT, shafi@THEORY.LCS.MIT.EDU
Prabhakar Raghavan, Verity, pragh@verity.com
Ronitt Rubinfeld, NEC Research Institute, ronitt@research.nj.nec.com
Martin Strauss, AT&T Labs - Research, mstrauss@research.att.com
Presented under the auspices of the Special Year on Massive Data Sets, the Special Year on Computational Intractability and
Exploratory Workshops/Special Programs.

### ABSTRACTS


1.

Speaker: Tugkan Batu
Title: Fast Approximate PCPs for Multidimensional Bin-Packing Problems

Abstract:
We consider approximate PCPs for multidimensional bin-packing
problems.  In particular, we show how a verifier can be quickly
convinced that a set of multidimensional blocks can be packed into a
small number of bins.  The running time of the verifier is bounded by
$O(T(n))$, where $T(n)$ is the time required to test for {\em
heaviness}.  We give heaviness testers that can test heaviness of an
element in the domain $[1,\ldots,n]^d$ in time $O((\log n)^d)$.  We
also give approximate PCPs with efficient verifiers for
recursive bin packing and multidimensional routing.

2.

Cache-Oblivious Algorithms and Data Structures

Michael A. Bender
SUNY Stony Brook

We present methods for constructing cache-oblivious algorithms and data
structures.  Cache-oblivious algorithms run efficiently on a
hierarchical memory, even though they completely avoid any
memory-specific parameterization (such as cache block sizes or access
times).  We first review the regular and static problems in the
literature.  Then we show new methods for designing memory efficient
cache-oblivious algorithms for dynamic and irregular problems, such as
data structures.  We show how to design a cache-oblivious B-tree, which
matches the performance of the standard B-tree.  Then we outline future
directions in this area.

Joint work with Erik Demaine and Martin Farach-Colton.

3.

Speaker: Adam L. Buchsbaum, AT&T Labs--Research
Title: External Memory Algorithms

Most current external memory algorithmic work is based on the I/O model
of Aggarwal and Vitter (1988) or the parallel disk model of Vitter and
Shriver (1994), both of which trace their roots to work done by Demuth,
Floyd, and Knuth in the 1960s and 1970s.  In this talk, I will briefly
survey these two models and some of the main results derived from them:
algorithms for sorting, priority queue manipulation, and classical
graph problems like connected components and minimum spanning trees.
While much progress has been made on these problems in the last decade,
experience and intuition suggest that some of the core problems involved
are intrinsically hard in these models.  New models for these problems
thus seem necessary.

4.

Speaker: Graham Cormode
Title: Short string signatures

Abstract:
When communication between two parties is the resource, sublinearity is
vital to improve on the trivial solution of one party sending their data
wholesale to the other.  In the particular case of approximating distances
between strings, sublinear "signatures" can be efficiently computed and
exchanged.  These signatures have applications beyond communication
scenarios, in particular in approximate nearest neighbor algorithms.

5.

Speaker: Artur Czumaj
Title: Property Testing in Computational Geometry

Abstract: We consider the notion of property testing as applied to
computational geometry. We aim at developing efficient algorithms which
determine whether a given (geometrical) object has a predetermined
property or is 'far' away from any object having the property. We show
that many basic geometric properties have very efficient testing
algorithms, whose running time is significantly smaller than the object
description size.

6.

Speaker: William DuMouchel
Title: Data Squashing: Constructing Summary Data Sets

Abstract:

One of the chief obstacles to effective data mining is the clumsiness of
managing and analyzing data in very large files.  The process of model
search and model fitting often require many passes over a large dataset, or
random access to the elements of a large dataset.  Many statistical fitting
algorithms assume that the entire dataset being analyzed fits into computer
memory, restricting the number of feasible analyses.  Here we define "large
dataset" as one that cannot be analyzed using some particular desired
combination of hardware and software because of computer memory constraints.
There are two basic approaches to this problem: either switch to a different
hardware/software/analysis strategy or else substitute a smaller dataset for
the large one.  Here we assume that the former strategy is unavailable or
undesirable and consider ways of constructing a smaller substitute dataset.
This latter approach was named data squashing by DuMouchel, Volinsky,
Johnson, Cortes and Pregibon [KDD99].   A trivial form of data squashing is
to take a simple random sample of the rows of the data set, which is often
used as a comparison method.  This presentation will summarize the results
of three recent papers, which describe very different methodologies and
theoretical rationales for data squashing.  Comparing and contrasting them
helps to define the limits of data squashing and to suggest future
directions for data squashing research.

7.

Speaker: Eldar Fischer, NEC Research Institute, Princeton.
Title: On the strength of comparisons in property testing

Abstract:

An $\epsilon$-test for a property $P$ of functions from
${\cal D}=\{1,\ldots,d\}$ to the positive integers is a randomized
algorithm, which makes queries on the value of an input function at
specified locations, and distinguishes with high probability between the
case of the function satisfying $P$, and the case that it has to be
modified in more than $\epsilon d$ places to make it satisfy $P$.

We prove that an $\epsilon$-test for any property defined in terms of the
order relation, such as the property of being a monotone nondecreasing
sequence, cannot perform fewer queries (in the worst case) than the best
$\epsilon$-test which uses only comparisons between the queried values. In
particular, we show that an adaptive algorithm for testing that a sequence
is monotone nondecreasing performs no better than the best non-adaptive
one, with respect to query complexity. From this follows a tight lower
bound on tests for this property.

8.

Speaker: Phillip B. Gibbons, Bell Labs
Title: Synopsis Data Structures for Massive Data Sets

Abstract:

Massive data sets with terabytes of data are becoming
commonplace.  There is an increasing demand for algorithms and data
structures that provide fast response times to queries on such data
sets.  In this talk, we argue that traditional algorithmic work is
unsuitable for massive data sets, and that even work designed for
massive data sets (i.e., the large body of literature on external memory
algorithms) is inadequate.  Specifically, both bodies of literature fail
to consider "synopsis" data structures, which use very little space and
provide much faster (typically approximate) answers to queries.

Synopsis data structures obtain their speed advantage by being
substantively smaller than their base data sets, permitting them to
reside higher in the memory hierarchy and also making them suitable for
queries on streaming and remote data. The design and analysis of
effective synopsis data structures offer many algorithmic challenges.
We will describe a number of concrete examples of synopsis data
structures, as well as fast algorithms for keeping them up-to-date in
the presence of online updates to the data sets. We also introduce a
framework and cost model for studying work relevant to massive data sets.
Finally, we show how this work is being incorporated into Aqua, the
first system to provide fast, highly-accurate approximate answers for a
broad class of queries arising in data warehousing scenarios.
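As one concrete illustration of a synopsis maintained under online updates, consider a uniform random sample of fixed size kept over a stream; reservoir sampling is a classic textbook synopsis of this kind (shown here as a generic example, not necessarily one of the specific structures from the talk):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform random sample of size s over a stream,
    using O(s) space regardless of the stream's length."""
    sample = []
    for t, item in enumerate(stream):
        if t < s:
            sample.append(item)            # fill the reservoir
        else:
            j = rng.randrange(t + 1)       # item survives w.p. s/(t+1)
            if j < s:
                sample[j] = item
    return sample
```

The sample can then answer approximate queries (selectivities, quantiles) far faster than a scan of the base data.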

9.

SPEAKER: Oded Goldreich
TITLE: An Introduction to Property Testing

ABSTRACT:
Property Testing is concerned with the question of deciding
whether a given object has a predetermined property
or is "far" from any object having the property.
Specifically, objects are modeled by functions (or oracles),
and distance between functions is measured as the fraction
of the domain on which the functions differ.
We consider (randomized) algorithms which may query the function
at arguments of their choice, and seek algorithms which query the
function at relatively few places.
This yields a natural notion of approximation for decision problems,
with focus on sublinear-time algorithms.

Property Testing plays a central role in PCP,
where one typically tests whether a given oracle is a codeword
(w.r.t some fixed code) or is far from any codeword.
In contrast, our focus will be on testing properties
of objects represented in a natural way
(rather than by an error-correcting code).
Specifically, we will focus on testing properties of graphs,
when represented either by adjacency matrix or by incidence lists.
These two standard representations of graphs
yield two different models for testing graph properties.
The difference between the models will be illustrated by
considering the problem of testing Bipartiteness in them.

10.

Speaker: Yuval Ishai, DIMACS and AT&T Labs Research
Title: Reducing the Servers' Computation in Private Information Retrieval:
PIR with Preprocessing

Abstract

Private information retrieval (PIR) enables a user to retrieve a
specific data item from a database, replicated among one or more
servers, while hiding from each server the identity of the retrieved
item. Since the introduction of the problem in '95 by Chor et al.,
several protocols with sub-linear communication were suggested.
However, in all these protocols the servers' computation for each
retrieval is at least linear in the size of the entire database, even if
the user requires just one bit.

In this work we study the computational complexity of PIR.  In the
standard PIR model, where the servers hold only the database, linear
computation cannot be avoided.  To overcome this problem we propose
the model of {\em PIR with preprocessing}: Before the execution of the
protocol each server may compute and store polynomially-many
information bits regarding the database; later on, this information
should enable the servers to answer each query of the user with more
efficient computation.  We demonstrate that preprocessing can
significantly save work in multi-server PIR protocols, and prove a
lower bound on the work of the servers when they are only allowed to
store a small number of extra bits.  Finally, we present some
alternative approaches to saving computation, by batching queries or
by allowing off-line interaction between users and servers. While
generally more restrictive than the preprocessing model, these
alternative approaches apply in the single-server case as well.

Joint work with Amos Beimel and Tal Malkin.
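For intuition about the hiding requirement, here is the textbook two-server XOR scheme (a toy protocol with linear communication, unlike the sublinear protocols discussed above): each server sees a uniformly random subset of indices and learns nothing about which bit the user wants.

```python
import secrets

def xor_bits(db, subset):
    """A server's answer: XOR of the database bits in the subset."""
    acc = 0
    for i in subset:
        acc ^= db[i]
    return acc

def two_server_pir(db, i):
    """Toy 2-server PIR: the user sends a random subset to server 1
    and the same subset with index i's membership flipped to server 2.
    Each query is uniformly random in isolation, so neither server
    learns i; XORing the answers recovers db[i]."""
    n = len(db)
    s1 = {j for j in range(n) if secrets.randbits(1)}
    s2 = s1 ^ {i}                  # symmetric difference flips i
    a1 = xor_bits(db, s1)
    a2 = xor_bits(db, s2)
    return a1 ^ a2                 # equals db[i]
```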

11.

Speaker:  Michael Krivelevich, Tel Aviv University, Israel
Title:    Using the Regularity Lemma for Property Testing

ABSTRACT:

The Regularity Lemma, proved by Endre Szemeredi in the mid-seventies, is
commonly acknowledged by now to be one of the most powerful tools of
Modern Combinatorics. A great many beautiful theorems have been proved by
applying this result. Roughly speaking, the Regularity Lemma guarantees
that any graph can be partitioned into a bounded number of equal parts
so that most of the bipartite graphs between the partition pieces behave
essentially like random graphs of a corresponding density.

Recently, the relevance of the Regularity Lemma to Graph Property
Testing has been revealed by Alon, Fischer, Krivelevich and Szegedy, who
described a wide class of graph properties whose testability can be
proven by applying the Regularity Lemma.

In this (survey-type) talk I will introduce the Regularity Lemma and
will indicate how it can be applied to  prove testability of various
graph properties. I will also discuss the above mentioned result of Alon
et al. No previous experience with the Regularity Lemma will be assumed.

12.

Speaker: Frederic Magniez
Title: Multi-linearity Self-Testing with Relative Error

Abstract:
We investigate self-testing programs with relative error by allowing
error terms proportional to the function to be computed.
Until now, in numerical computation, error terms were assumed to be
either constant or proportional to the $p$-th power of the magnitude
of the input, for $p\in[0,1)$. We construct new self-testers with
relative error for real-valued multi-linear functions defined over
finite rational domains. The existence of such self-testers positively
solves an open question in~\cite{kms99}. Moreover, our self-testers are
very efficient: they use few queries and simple operations.

\bibitem[KMS99]{kms99}
M.~Kiwi, F.~Magniez, and M.~Santha.
\newblock Approximate testing with relative error.
\newblock In {\em Proc. 31st STOC}, pages 51--60, 1999.

13.

Speaker: Nina Mishra Fox
Title: Sublinear Approximate (PAC) Clustering

Abstract: We present a sampling-based algorithm for approximate
k-median clustering that improves upon previous approximation
results in the case that the number of points n
being clustered is the dominant problem parameter, assuming a
distance metric on [0,M]^d. Unlike previous approximate clustering
results, our work has time complexity independent of n.
This independence is very attractive for the large clustering
problems typically associated with large data sources like
the web, phone records or transactional data.

We also provide a generalization of the algorithm for use
when no bound M is known for the points being clustered,
and we describe how a previously developed dimension reduction
technique can be applied to eliminate our dependence on the
dimension when d > log n. We also apply our sampling approach
to the related problem of property testing, i.e., given A and k,
determining whether there exists a k-median clustering with
average distance A from points to centers.

A new definition of clustering similar in spirit to the
PAC model of learning is developed that allows for clustering
(possibly infinite) probability distributions and naturally
allows for conceptual descriptions of clusters. We show that the
k--median problem is PAC-clusterable, and conclude with an
algorithm for PAC-clustering disjoint k-term DNF expressions.

Joint work with Dan Oblinger and Leonard Pitt.

14.

Title: Optimal aggregation  algorithms for middleware
Speaker: Moni Naor

Let D be a database of N objects where each object has m fields.
The objects are given in m sorted lists (where the ith list is
sorted according to the ith field). Our goal is to find the top k
objects according to a monotone aggregation function t, while
accessing as few objects as possible. This problem arises in several
contexts. In particular, Fagin (JCSS 1999) considered it for the
purpose of aggregating information in a multimedia database
system.

We are interested in instance optimality, i.e., that our algorithm
will be as good as any other (correct) algorithm on any instance.
We give an elegant and simple algorithm (the Threshold Algorithm)
that is optimal on every instance, i.e. for every database and for
any monotone aggregation function (with some natural
restrictions).

Joint work with Ron Fagin, Michael Franklin and Amnon Lotem
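A minimal sketch of the Threshold Algorithm (illustrative only: tie handling and the exact sorted/random access accounting are simplified, and every object is assumed to appear in every list):

```python
import heapq

def threshold_algorithm(lists, t, k):
    """lists: m lists of (object_id, grade) pairs, each sorted by grade
    in decreasing order; t: monotone aggregation function on an m-tuple
    of grades; k: number of top objects to report."""
    m = len(lists)
    grade = [dict(lst) for lst in lists]       # supports "random access"
    top = []                                   # min-heap of (aggregate, obj)
    seen = set()
    for depth in range(max(len(lst) for lst in lists)):
        last = []                              # last grade seen per list
        for i in range(m):
            obj, g = lists[i][min(depth, len(lists[i]) - 1)]
            last.append(g)
            if obj not in seen:                # random access + aggregate
                seen.add(obj)
                agg = t(tuple(grade[j][obj] for j in range(m)))
                heapq.heappush(top, (agg, obj))
                if len(top) > k:
                    heapq.heappop(top)
        threshold = t(tuple(last))             # no unseen object can beat this
        if len(top) == k and top[0][0] >= threshold:
            break                              # top-k certified; stop early
    return sorted(top, reverse=True)
```

The stopping rule is what gives instance optimality: once k seen objects aggregate at least t applied to the frontier of the sorted accesses, no unseen object can enter the top k.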

15.

Title: Monotonicity Testing
Abstract:

We present algorithms for testing monotonicity of functions.
Namely, given the ability to query an unknown function
$f:\Sigma^n \to \Xi$, where
$\Sigma$ and $\Xi$ are finite ordered sets, the test always accepts a
monotone $f$, and rejects $f$ with high probability if it is $\epsilon$-far
from being monotone (i.e., every monotone function differs from $f$ on
more than an $\epsilon$ fraction of the domain). For any $\epsilon>0$, the query
complexity of the test
is $O((n/\epsilon) \cdot \log |\Sigma| \cdot \log |\Xi|)$.

The talk is based on "Improved Testing Algorithms for Monotonicity" by
Y. Dodis, O. Goldreich, E. Lehman, S. Raskhodnikova, D. Ron, and
A. Samorodnitsky from RANDOM '99.
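A hedged sketch of a pair tester in this spirit (the sampling distribution and trial count here are illustrative, not those of the Dodis et al. analysis):

```python
import random

def monotonicity_test(f, n, sigma, trials=200, rng=random):
    """Pair tester sketch for f over the domain sigma^n, with sigma a
    sorted list of values. A monotone f is always accepted; a sampled
    comparable pair on which f decreases is a witness for rejection."""
    for _ in range(trials):
        x = [rng.choice(sigma) for _ in range(n)]
        i = rng.randrange(n)
        y = list(x)
        y[i] = rng.choice(sigma)     # re-sample one coordinate
        lo, hi = sorted([x, y])      # differ in one place, so comparable
        if f(tuple(lo)) > f(tuple(hi)):
            return False             # violated pair found
    return True
```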

16.

Speaker: Dana Ron
Title: Testing of Clustering

Abstract:

A set $X$ of points in $\Re^d$ is $(k,b)${\em-clusterable\/}
if $X$ can be partitioned into $k$ subsets ({\em clusters\/}) so
that the diameter (alternatively, the radius) of each cluster is at most $b$.
We present algorithms that by sampling from a set $X$, distinguish
between the case that $X$ is $(k,b)$-clusterable and the case that
$X$ is $\epsilon$-far from being $(k,b')$-clusterable for any given
$0<\epsilon\leq 1$ and for $b' \geq b$. By $\epsilon$-far from
being $(k,b')$-clusterable we mean that more than $\epsilon \cdot |X|$
points should be removed from $X$ so that it becomes $(k,b')$-clusterable.
We give algorithms for a variety of cost measures that use a sample
of size independent of $|X|$, and polynomial in $k$ and $1/\epsilon$.

Our algorithms can also be used to find {\em approximately good\/}
clusterings. Namely, these are clusterings of all but an
$\epsilon$-fraction of the points in $X$ that have optimal (or
close to optimal) cost.
The benefit of our algorithms is that they construct an {\em implicit
representation\/} of such clusterings in time independent of $|X|$.
That is, without actually having to partition all points in $X$, the
implicit representation can be used to answer queries concerning the
cluster any given point belongs to.
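For the radius cost, one natural sampling tester (a sketch in the spirit of the talk, with illustrative parameters and the standard factor-2 relaxation $b' = 2b$) greedily collects sampled points at pairwise distance more than $2b$; finding more than $k$ such points certifies non-clusterability:

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def radius_cluster_test(points, k, b, sample_size, rng=random):
    """Sampling tester sketch for (k, b)-clusterability, radius cost.
    Every (k, b)-clusterable set is accepted: two points in one
    radius-b cluster lie within 2b, so the greedy set keeps at most
    one representative per cluster. More than k representatives at
    pairwise distance > 2b certify rejection."""
    sample = [rng.choice(points) for _ in range(sample_size)]
    reps = []                           # pairwise more than 2b apart
    for p in sample:
        if all(dist(p, c) > 2 * b for c in reps):
            reps.append(p)
            if len(reps) > k:
                return False            # reject
    return True
```

The representatives are exactly the implicit representation mentioned above: a new point is assigned to the cluster of its nearest representative, without partitioning all of $X$.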

17.

Speaker: Christian Sohler
Title: Soft Kinetic Data Structures

Authors: Artur Czumaj & Christian Sohler

We introduce the framework of Soft Kinetic Data Structures (SKDS).
A soft kinetic data structure is an approximate data structure
that can be used to answer queries on a set of moving objects
with unpredictable motion. We analyse the quality of a soft
kinetic data structure by giving a competitive analysis with respect
to the dynamics of the system.

We illustrate our approach by presenting soft kinetic data structures
for maintaining classical data structures: sorted arrays, balanced
search trees, heaps, and range trees. We also describe soft kinetic
data structures for maintaining the Euclidean minimum spanning tree.

18.

Speaker:  Martin J. Strauss, AT&T Labs
Title:        Secure Multiparty Computation of Approximations

Abstract:

Approximation algorithms can sometimes be used to obtain efficient
solutions where no efficient exact computation is known.
Furthermore, for some applications, several parties will want to
cooperate to compute a function of their inputs without revealing
those inputs to one another.  We give definitions and protocols
of secure multiparty approximate computation that show how to
realize most of the cost savings available by using an approximation
instead of an exact computation of a function, f, without revealing
more than necessary to compute f exactly.

We make three contributions.  First, we give formal definitions of
secure multiparty approximate computations.  Second, we introduce
some general techniques for constructing secure multiparty
approximations.  Finally, we present an efficient,
sublinear-communication, private approximate computation for the
Hamming and L^1 distances.

This is joint work with Joan Feigenbaum, Jessica Fong, and Rebecca
Wright.

19.

Speaker: Luca Trevisan, UC Berkeley
Title: Sublinear Time Error-Correction and Error-Detection

Abstract:
Error-correcting codes based on low-degree multivariate
polynomials admit efficient procedures to decode parts
of the message given a corrupted codeword (by "polynomial
reconstruction" algorithms) and to test whether a given
string is close or not to a codeword (by "low degree
test" algorithms).

Constructing codes with polylogarithmic
time, or even constant-time, decoding algorithms is
useful in several settings, including worst-case to
average-case hardness reductions and private information
retrieval. For this problem, constructions not
based on polynomials are known, as well as some
negative results (giving a trade-off between the
rate of the code and the efficiency of the decoding).
We will present the known results and  the main open
questions in this area.

Codes with efficient error-detecting procedures (along
with additional properties) are at the heart of PCP
constructions. Only constructions based on polynomials
are known in this setting, the analysis of testing
algorithms tends to be quite hard, and no negative
results are known. Positive results are much better
than in the setting of error-correction, and some
separation results between the two settings are known.
We will present some codes not based on polynomials
and some folklore testing algorithms for them, whose
analysis is still an open question. In the long term,
the study of such codes could  yield a simpler proof
and new insight on the PCP theorem.

20.

Speaker: Mahesh Viswanathan
Title:   Testing of Data Streams

Two programming paradigms for massive data-sets are sampling and
streaming. Rather than take time even to read a massive data-set, a
sampling algorithm extracts a small random sample and computes on
it. By contrast, a streaming algorithm takes the time to read all the
input, but little more time and little total space. Input to a
streaming algorithm is a sequence of items; the streaming algorithm is
given the items in order, lacks the space to record more than a few
bits of the input, and is required to perform its per-item processing
quickly in order to keep up with its unbuffered input.

We consider the task of testing properties of large data-sets. We ask
whether every property that can be efficiently tested by sampling
testers can be efficiently tested by streaming testers. Though the
notions of efficiency for sampling and streaming testers are not
directly comparable, there are interesting comparisons to make. We
present sampling and streaming testers for illustrative examples.

(Joint work with Joan Feigenbaum, Sampath Kannan, and Martin Strauss)
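The contrast between the two access models can be made concrete with a toy statistic, the mean (an illustration of the paradigms only, not of the testers from the talk):

```python
import random

def sampling_mean(data, s, rng=random):
    """Sampling paradigm: touch only s randomly chosen items;
    sublinear time, but only an estimate."""
    return sum(data[rng.randrange(len(data))] for _ in range(s)) / s

def streaming_mean(stream):
    """Streaming paradigm: one pass over every item, O(1) space,
    exact answer."""
    total = 0.0
    count = 0
    for x in stream:
        total += x
        count += 1
    return total / count
```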

21.

Speaker: Patrick White
Title:   Testing that Distributions are Close (with Applications to
Markov Chains)

Abstract:

Given two distributions over an $n$ element set, we wish to check whether
these distributions are statistically close by only sampling. We give a
sublinear algorithm which uses $O(n^{2/3} \epsilon^{-4} \log n)$
independent samples from each distribution, runs in time linear in the
sample size, makes no assumptions about the structure of the distributions,
and distinguishes the cases when the distance between the distributions is
small (less than $\frac{\epsilon^2}{32 \sqrt[3]{n}}$) or large (more than
$\epsilon$) in $L_1$-distance. We also give an
$\Omega(n^{2/3}\epsilon^{-2/3})$ lower bound.

Our algorithm has application to the problem of checking whether a given
Markov process is rapidly mixing. We develop sublinear algorithms for this
problem as well.

