Princeton University, Princeton, NJ

**Organizers:****Moses Charikar**, Princeton University, moses@CS.Princeton.EDU**Piotr Indyk**, MIT, indyk@theory.lcs.mit.edu**Nati Linial**, Hebrew University, nati@cs.huji.ac.il**Jiri Matousek**, Charles University, matousek@kam.mff.cuni.cz**Yuri Rabinovich**, University of Haifa, yuri@cs.haifa.ac.il**Gideon Schechtman**, Weizmann Institute, gideon@wisdom.weizmann.ac.il

Presented under the auspices of the DIMACS Special Focus on Data Analysis and Mining

Title: Approximation Algorithms for Embedding Metrics into Low-dimensional Spaces

In this talk, we consider the problem of computing an embedding of an arbitrary metric into low-dimensional spaces. We present algorithms for several problems of this flavor. We first show a polynomial-time approximation algorithm for computing an embedding of an arbitrary metric into R^2 under the l_1 norm. The algorithm finds an embedding whose maximum additive distortion is at most a constant times the smallest additive distortion possible. Then, we show a quasi-polynomial-time approximation algorithm for the same problem under the l_2 norm. We also present other results for related problems, such as embedding into a line minimizing the average distortion and several results where we are given some extra information about with the metric.

This is a survey of results from several papers.

Title: Ramsey Theorem for Metric Spaces

We provide the following metric Ramsey theorem:

For any e>0, every n point metric space contains a subset of size at least n^{1-e} which is embeddable in an ultrametric with O(log(1/e)/e) distortion.

This in particular means such a subset has the same distortion in Euclidean space. The bound on the distortion is nearly tight even for embedding in arbitrary Euclidean spaces.

Our theorem can be viewed as a non-linear analog of Dvoretzky's theorem, a cornerstone of modern Banach space theory and convex geometry.

Joint work with Nati Linial, Manor Mendel and Assaf Naor.

Title: On the Impossibility of Dimension Reduction in L1

The Johnson-Lindenstrauss Lemma shows that any $n$ points in Euclidean space (with distances measured by the $\ell_2$ norm) may be mapped down to $O((\log n)/\epsilon^2)$ dimensions such that no pairwise distance is distorted by more than a $(1+\epsilon)$ factor. Determining whether such dimension reduction is possible in $\ell_1$ has been an intriguing open question. Charikar and Sahai recently showed lower bounds for dimension reduction in $\ell_1$ that can be achieved by linear projections, and positive results for shortest path metrics of restricted graph families. However the question of general dimension reduction in $\ell_1$ was still open. For example, it was not known whether it is possible to reduce the number of dimensions to $O(\log n)$ with $1+\epsilon$ distortion.

We show strong lower bounds for general dimension reduction in $\ell_1$. We give an explicit family of $n$ points in $\ell_1$ such that any embedding with distortion $\delta$ requires $n^{\Omega(1/\delta^2)}$ dimensions. This proves that there is no analog of the Johnson-Lindenstrauss Lemma for $\ell_1$; in fact embedding with any constant distortion requires $n^\epsilon$ dimensions. Our proof establishes this lower bound for shortest path metrics of series-parallel graphs.

Title: Exploiting Embeddings of Biosequence Similarity Scores

A key enabling technology for modern molecular biology is the ability to search large databases of biosequences for substrings similar to a given query sequence. Abstractly, biosequence similarity search is a type of inexact string matching. However, biologists have very definite ideas (derived from evolutionary models) of how to measure the similarity of two DNA or protein sequences, and any practical search tool must respect these measures.

In this talk, I will describe recent work by my group and collaborators on approximating common biosequence similarity score functions by metric distances, then building isometries or low-distortion embeddings of these distances into the Hamming metric. By combining these embeddings with Hamming-space similarity search methods such as locality sensitive hashing, we can devise similarity search algorithms and indexing schemes for DNA and protein sequence data with provably high sensitivity and reasonable efficiency. Our results provide a way to insert more biologically relevant information into search heuristics commonly used by the bioinformatics community, improving the extent to which these heuristics recover sequence alignments of real interest to biologists.

Title: Similarity estimation, rounding algorithms and metric embeddings

We consider the problem of representing complicated objects by compact sketches so that distances between pairs of objects can be estimated from their sketches. We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as distance preserving hashing schemes, which yield embeddings into L1 and compact representations for several interesting collections of objects. In particular, we consider compact representation schemes for the Earth Mover Distance, a distance on distributions that has been proposed as a distance metric for images in graphics and vision. We show that the rounding algorithm of Kleinberg and Tardos for the problem of classification with pairwise relationships yields a compact representation scheme for estimating the Earth Mover Distance.

Title: Approximating steiner k-cuts

We consider the Steiner $k$-cut problem, which is a common generalization of the $k$-cut problem and the multiway cut problem: given an edge-weighted undirected graph $G=(V, E)$, a subset of vertices $X \subseteq V$ called {\em terminals}, and an integer $k \le |X|$, the objective is to find a minimum weight set of edges whose removal results in $k$ disconnected components, each of which contains at least one terminal. We present a $2$ approximation for this problem by rounding a generalization of a linear programming relaxation suggested by Naor and Rabani. The rounding uses the Goemans and Williamson primal-dual algorithm (and analysis) for the Steiner tree problem in an interesting way and differs from the rounding of Naor and Rabani which relied on an l_1 embedding of the metric induced by the LP solution.

Title: Generic Global Rigidity

Suppose a finite graph G is realized in Euclidean d-dimensional space. Regard the edges of G as straight line segments between the vertices of G. Suppose that H is another realization of G with corresponding edge lengths the same. When are the configurations of vertices of G and H congruent? In other words, when are ALL the corresponding pairwise distances between the vertices of G and H the same, when just the edges of G are guaranteed to have the same length as the edges of H. A sufficient condition will be shown, when the configuration of vertices is in generic position. Generic position means that the coordinates of all the vertices involved do not satisfy any non-zero polynomial equation over the integers. This sufficient condition, together with a recent combinatorial graph theoretical result of B. Jackson and T. Jordan, provide a proof of correctness of a polynomial time algorithm, known for some time, conjectured to determine whether a graph is generically rigid in the plane.

Title: Zeroing in on the L0 metric

The L0 metric is defined to be the number of positions where two vectors differ. This can be challenging to build small space approximations for when the vectors represent dynamic data sets that are being updated (streaming fashion). The general solution presented here gives a highly flexible approximation scheme, which can be applied to other problems such as approximating the Dominance Norm, a measure of worst case influence, of very large numbers of signals.

Title: Approximating minimum average distortion of non-contracting embeddings

We study the problem of embedding arbitrary finite metrics into a path metric in a non-contracting fashion to minimize average distortion. The worst case distortion for such an embedding could be as high as \Omega(n) (e.g. for the case of uniform metric.)

Hence, we concentrate on approximating the best possible embedding for given input metric. We will describe a constant factor approximation for the problem of embedding general metrics into the line metric. For the case of the metrics which can be represented as trees, we will give a better polytime approximation as well as a QPTAS.

Based on joint work with Anupam Gupta and R. Ravi.

Title: (Nearly) Optimal Embeddings into Additive and Ultrametric Trees

We consider the problem of embedding a graph into a tree, where the distance between tree nodes is either their path length -- an additive metric -- or the distance is the largest edge between them -- an ultrametric. Rather than producing universal bounds, for which large lower-bounds are known, we show how to embed the graph so that we achieve close to the best possible embedding.

Title: Experiments with Random Projections for Machine Learning

Dimensionality reduction via Random Projections has attracted considerable attention in recent years. We report a number of experiments to evaluate Random Projections in the context of supervised machine learning. In particular, we compare Random Projections and PCA on a number of different datasets and using different machine learning methods. While we find that the Random Projection approach predictively underperforms PCA, its computational advantages may make it attractive for certain applications.

Based on: D. Fradkin and D. Madigan. ``Experiments with Random Projections for Machine Learning'', KDD 2003.

Title: Bounded geometries, fractals, and low-distortion embeddings.

We take a look at metric spaces whose "volume growth" is uniformly bounded. We exhibit some tight bounds on Euclidean distortion and prove a surprising result on the embeddings of slowly growing trees (which confirms a special case of a conjecture of Assouad).

Joint work with Robi Krauthgamer and James R. Lee.

Title: Experiments with Embedding Earth-Mover Distance into Normed Spaces

In this talk we address the issue of similarity search in large data sets of images. For the purposes of defining similarity between the images, we use the Earth-Mover Distance (EMD). EMD has been experimentally verified to capture well the perceptual notion of a difference between images. However, the current algorithms for finding the nearest neighbor under EMD have running times linear in the number of images. This makes them inefficient when used for large data sets.

In this paper we propose a very fast algorithm for nearest neighbor search under EMD. Its main idea is to embed the EMD metric into Euclidean space, and then use fast nearest neighbor algorithms designed for the latter space (in particular, we use a variant of the Locality-Sensitive Hashing algorithm). This results in an algorithm that is order(s) of magnitude faster than the linear scan approach.

Title: Lipschitz quotient maps between Banach spaces.

A map f between metric spaces is said to be co-Lipschitz provided the image of every ball B(x,r) in the domain contains a ball B(f(x),r/C) of proportional radius in the range, where the constant C is independent of the ball. Note that when f is one to one, f is co-Lipschitz iff f is surjective with Lipschitz inverse.

A map is called a Lipschitz quotient mapping provided it is both Lipschitz and co-Lipschitz. This is the natural notion of quotient map in the category of metric spaces when the Lipschitz maps are the morphisms. If there exists a Lipschitz quotient mapping from X onto Y we say that Y is a Lipschitz quotient of X.

What structural properties of X are inherited by Y when Y is a Lipschitz quotient of X? When X and Y are general metric spaces (even when X is finite) this has not been much studied but looks like a very interesting direction.

When X and Y are normed spaces and Y is a Lipschitz quotient of X, the most basic question is: What additional condition (if any) will guarantee that Y is also a linear quotient of X? I'll survey what is known on this question. (Most of the results are due to the speaker, J. Lindenstrauss, D. Preiss, and G. Schechtman; partly in collaboration with S. Bates.) Some results involve the notion of metric tree and these are more naturally discussed in the class of metric spaces.

Title: Preduals of spaces of Lipschitz functions and applications to Lipschitz structure

We discuss recent results with Gilles Godefroy on the structure of the predual of the space of Lipschitz functions on a Baanch space and give applications to Lipschitz and uniform structure of metric spaces.

Title: The intrinsic dimensionality of graphs

We resolve the following conjecture raised by Levin together with Linial, London, and Rabinovich [Combinatorica, 1995]: The dimension of a graph $G$, defined as the smallest $d$ such that $G$ occurs as a (not necessarily induced) subgraph of the infinite $d$-dimensional "grid with diagonals", is characterized by the growth rate of $G$, which is the minimum $\rho$ such that every ball of radius $r>1$ in $G$ contains at most $r^\rho$ vertices. Our results extend to a variant of the conjecture for finite-dimensional Euclidean spaces due to Linial [Haifa Workshop, 2002].

(Joint work with James R. Lee.)

Title: Lower bounds for L_infty approximations in data streams

Alice and Bob are given a bit each, and wish to probabilistically compute the AND of their bits by exchanging messages; at the same time they would like the transcript of their messages to reveal as little information about their bits as possible. We formulate questions of this nature precisely, and present a technique to obtain sharp lower bounds on the amount of information that needs to be revealed. The motivation for studying these problems arises from a brand new method that we develop for obtaining communication complexity lower bounds.

The latter have applications for deriving lower bounds for data stream algorithms. Specifically, we show that computing L_p norms to within a factor of n^c requires n^{1 - 4c - 2/p} space in the data stream model; this generalizes a bound of Saks and Sun (2002). We also show that computing the frequency moments F_k for k > 2 requires polynomial space in the data stream model; this resolves the conjecture of Alon, Matias and Szegedy (1996).

(Joint work with Ziv Bar-Yossef, T.S. Jayram, and D. Sivakumar)

Title: A unified approach to Lipschitz extension.

Let X be a metric space, Y a subspace of X, and Z a normed space. A classical problem in analysis is whether a Lipschitz map f: Y -> Z can be extended to a Lipschitz map f': X -> Z and to determine how close the Lipschitz norm of f' can be to that of f.

We present some general techniques for constructing Lipschitz extensions, and in the process, we unify, improve, and extend a number of known results. Our main tool is the relationship between partitions of unity (a classical analytical device) and stochastic decomposition of metric spaces (a standard technique in computer science).

Joint work with Assaf Naor.

Title: Embedding Hamming-metric of Boxes into L_2

Consider the metric space $B$ of axis-parallel boxes in $R^D$ equipped with the hamming distance. Namely, the distance between two boxes is the (D-dim) volume of their symmetric difference. $B$ is obviously in $L_1$, but how does it relate to $L_2$? We show that $B$ is embeddable into $L_2$ with distortion $O(\sqrt{D})$, and that this is tight. A special case of our result is that the space of intervals on the line is embeddable with constant distortion in $L_2$, and a special case of that is that the metric on R $d(i,j) = min(M,|i-j|)$ (aka linear truncated metric) is also nearly $L_2$. A stronger version of this observation is in fact implicit in a work of Schoenberg(1938) and was also noted in a recent work of Mendel and Naor.

Title: On metric Ramsey-type Phenomena

The metric Ramsey question asks for the largest subset of a given finite metric space that has a low distortion embedding into some "nice" host space (e.g., Hilbert space). This question may be viewed as a non-linear analog of Dvoretzky's Theorem in Banach space theory or as part of Ramsey Theory in combinatorics, and it also has some applications in online computation.

In this talk I intend to survey recent results regarding these, and closely related, questions.

Joint work with Yair Bartal, Nati Linial, and Assaf Naor.

Title: The impossibility of dimension reduction in L_1

We present a very simple proof of the recent result of Brinkman and Charikar which settles the long-standing open question of whether there is an L_1 analogue of the Johnson-Lindenstrauss dimension reduction lemma. Unlike the linear programming argument appearing there, our proof relies on geometric intuition.

Joint work with James R. Lee.

Title: On Sobolev spaces on finite graphs

The talk will be devoted to some analogies between Sobolev spaces on ${\bf R}^n$ and Sobolev spaces on graphs. The main point is that some of the results on Sobolev spaces on ${\bf R}^n$ can be transferred to Sobolev spaces on graph with only minor changes in proofs.

Title: Embedding treewidth-2 graphs into l_1 revisited

Improving and simplifying the analysis of an embedding procedure from [GNRS], we show that the shortest-path metric of any treewidth-2 graph embeds into l_1 with distortion at most 2. The best previously known bound was ~13.92. The talk is based on a joint work with Y. Avraham.

Title: On the hardness of approximate proximity search for non-metric/non-geometric spaces

In this talk we will discuss hardness of several approximate proximity search problems for various combinatorial objects such as sets, vectors, strings and graphs under various distance measures of interest, based on a complexity hierarchy of approximate proximity search problems implied by a general class of reductions. We will start by showing that approximate proximity search problem under many non-metric distances of interest (e.g. set intersection, longest common subsequence) have efficient solutions if and only if the exact nearest neighbor search under Hamming distance is efficiently solvable. As a further evidence of hardness we will also show that an efficient O(1)-approximate nearest neighbor search (ANN) data structure is achievable for set intersection if and only if efficient k-ANN algorithms are achievable for all distance measures that are "reducible" to set intersection for all k=O(1). Similar results hold for approximate furthest neighbors under set intersection. As a consequence of these reductions we will also make observations about pairwise approximate distance computations; in particular we will describe a simple linear time 2-approximation algorithm for permutation edit distance computation.

Joint work with Andrey Utis.

Title: Approximating arbitrary metrics by tree metrics

Given a metric (V,d), we consider the problem of embedding V into a distribution over tree metrics. We improve the results of Bartal, and show thay any metric can be probabilistically embedded into a distribution over expansive ultrametrics with distortion O(\log n). This gives improved approximation algorithms for several optimization problems, and our techniques also apply to some other optimization problems.

Joint work with Jittat Fakcharoenphol and Satish Rao.

Title: Tight Lower Bounds for the Distinct Elements Problem

We prove strong lower bounds for the space complexity of (epsilon,delta)- approximating the number of distinct elements F_0 in a data stream. Let m be the size of the universe from which the stream elements are drawn. We show that any one-pass streaming algorithm for (epsilon, delta)-approximating F_0 must use Omega(1/epsilon^2) space when epsilon = Omega(m^{-1/(9+k)}), for any k > 0, improving upon the known lower bound of Omega(1/epsilon) for this range of epsilon. This lower bound is tight up to a factor of log log m. Our lower bound is derived from a reduction from the one-way communication complexity of approximating a boolean function in Euclidean space. The reduction makes use of a low-distortion embedding from an l_2 to an l_1 norm.

Previous: Participation

Next: Registration

Workshop Index

DIMACS Homepage

Contacting the Center

Document last modified on August 1, 2003.