DIMACS/DyDAn Center, CoRE Building, Rutgers University

**Organizers:**

**Aiyou Chen**, Bell Labs, aychen at research.bell-labs.com
**Graham Cormode**, AT&T Labs, grahamr at research.att.com
**Andrew McGregor**, UCSD, amcgregor.web at gmail.com
**Olgica Milenkovic**, UIUC, milenkov at uiuc.edu
**S. Muthukrishnan**, Rutgers University, muthu at cs.rutgers.edu
**Fred Roberts**, Rutgers University, froberts at dimacs.rutgers.edu
**Emina Soljanin**, Bell Labs, emina at bell-labs.com

Title: Approximating Edit Distance in Near-Linear Time

We show how to compute the edit distance between two strings of length n up to a factor of 2^{\tilde O(\sqrt{\log n})} in n^{1+o(1)} time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^{1/3+o(1)} approximation of [Batu-Ergun-Sahinalp'06]. Previously, an approximation of the same form was known only for embedding edit distance into L_1 [Ostrovsky-Rabani'05], and it is not known whether that embedding can be computed in sub-quadratic time.
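For contrast, the exact algorithm that near-linear approximations aim to beat is the classical quadratic-time dynamic program; a minimal sketch (this is the standard baseline, not the talk's algorithm):

```python
def edit_distance(s, t):
    """Exact edit distance via the classical O(mn)-time dynamic program,
    keeping only two rows of the table at a time."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances from the empty prefix of s
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,              # delete s[i-1]
                         cur[j - 1] + 1,           # insert t[j-1]
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitute
        prev = cur
    return prev[n]
```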

Joint work with Krzysztof Onak (MIT).

Title: A Fast and Compact Method for Unveiling Significant Patterns in High Speed Networks

Identification of significant patterns in network traffic, such as IPs or flows that contribute large volume (heavy hitters) or introduce large changes (heavy changers), has many applications such as accounting and network anomaly detection. As network speed and the number of flows grow rapidly, tracking per-IP or per-flow statistics becomes infeasible due to both the computational overhead and the memory requirements. In this paper, we propose a novel sequential hashing scheme that requires only $O(H\log N)$ memory and computational overhead, both close to optimal, where $N$ is the number of all possible keys (e.g., flows, IPs) and $H$ is the maximum number of heavy keys. Moreover, the generalized sequential hashing scheme makes it possible to trade off among memory, update cost, and detection cost over a large range, which different computer architectures can exploit to optimize overall performance. In addition, we propose statistically efficient algorithms for estimating the values of heavy hitters and heavy changers. Using both theoretical analysis and experimental studies of Internet traces, we demonstrate that our approach achieves the same accuracy as existing methods while using much less memory and computational overhead.
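The paper's sequential hashing scheme is not spelled out in this abstract; purely as an illustrative baseline, here is a minimal Count-Min sketch, the standard multi-hash counting summary that such heavy-hitter schemes build on (the class name and parameters are ours, not the paper's):

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: depth hash rows over a small table.
    Point queries never underestimate the true count."""
    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, key, row):
        digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, key, count=1):
        for r in range(self.depth):
            self.table[r][self._hash(key, r)] += count

    def estimate(self, key):
        # minimum over rows upper-bounds the true count of key
        return min(self.table[r][self._hash(key, r)]
                   for r in range(self.depth))
```

Sequential hashing goes further than a plain sketch: it also locates the heavy keys without enumerating all $N$ possibilities.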

Joint work with Tian Bu, Aiyou Chen and Patrick P. C. Lee.

Title: Lower Bounds for Gap-Hamming-Distance and Consequences for Data Stream Algorithms

The Gap-Hamming-Distance problem arose in the context of proving space lower bounds for a number of key problems in the data stream model. In this problem, Alice and Bob have to decide whether the Hamming distance between their $n$-bit input strings is large (i.e., at least $n/2 + \sqrt{n}$) or small (i.e., at most $n/2 - \sqrt{n}$); they do not care if it is neither large nor small. This $\Theta(\sqrt{n})$ gap in the problem specification is crucial for capturing the approximation allowed to a data stream algorithm.
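The promise problem can be stated directly in code; this toy decider (names are ours) simply evaluates the gap condition that the communication lower bound concerns:

```python
import math

def gap_hamming(x, y):
    """Decide the Gap-Hamming promise problem on equal-length bit strings.
    Returns "large", "small", or None when the promise is not satisfied
    (in which case any answer is acceptable)."""
    assert len(x) == len(y)
    n = len(x)
    d = sum(a != b for a, b in zip(x, y))  # Hamming distance
    if d >= n / 2 + math.sqrt(n):
        return "large"
    if d <= n / 2 - math.sqrt(n):
        return "small"
    return None
```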

Thus far, for randomized communication, an $\Omega(n)$ lower bound on this problem was known only in the one-way setting. We prove an $\Omega(n)$ lower bound for randomized protocols that use any constant number of rounds.

As a consequence we conclude, for instance, that $\epsilon$-approximately counting the number of distinct elements in a data stream requires $\Omega(1/\epsilon^2)$ space, even with multiple (a constant number of) passes over the input stream. This extends earlier one-pass lower bounds, answering a long-standing open question. We obtain similar results for approximating the frequency moments and for approximating the empirical entropy of a data stream.

In the process, we also obtain tight $n - \Theta(\sqrt{n}\log n)$ lower and upper bounds on the one-way deterministic communication complexity of the problem. Finally, we give a simple combinatorial proof of an $\Omega(n)$ lower bound on the one-way randomized communication complexity.

Joint work with Joshua Brody (Dartmouth).

Title: Sketches and Conditional Random Sampling (CRS) for Large Sparse Data Sets

We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are strongly associated or not. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward statistical estimation methods.

Recently, we have extended the original simple sketch algorithm into a general-purpose sampling algorithm called "Conditional Random Sampling (CRS)," which is applicable to sampling real-valued data and dynamic streaming data, as long as the data are sparse. Compared with many other sketching/sampling algorithms such as stable random projections, CRS exhibits a significant advantage in that it is "one-sketch-for-all," meaning that it is capable of estimating any type of linear summary statistic, including pairwise Lp distances (for any p), two-way and multi-way inner products, chi-squared distances, etc.
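As an illustration of the idea only (not the authors' implementation), a CRS-style sketch can be built by randomly permuting the coordinates, keeping each sparse vector's k nonzero entries with the smallest permuted positions, and estimating pairwise statistics from the prefix of coordinates that both sketches fully cover:

```python
def crs_sketch(vec, k, perm):
    """vec: sparse vector as {coordinate: value}; perm: dict mapping each
    coordinate to its position under a fixed random permutation.
    Keep the k nonzero entries with smallest permuted positions."""
    return sorted((perm[i], v) for i, v in vec.items())[:k]

def crs_inner_product(s1, s2, n):
    """Estimate <v1, v2> over n coordinates from two CRS sketches."""
    d = min(s1[-1][0], s2[-1][0]) + 1  # prefix both sketches fully cover
    m1 = {i: v for i, v in s1 if i < d}
    m2 = {i: v for i, v in s2 if i < d}
    sample = sum(v * m2.get(i, 0.0) for i, v in m1.items())
    return sample * n / d  # scale the size-d sample up to n coordinates
```

The same conditioning trick extends to other linear statistics (Lp distances, multi-way inner products), which is the "one-sketch-for-all" property.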

We recommend CRS as a promising tool for building highly scalable systems, in machine learning, data mining, recommender systems, and information retrieval.

Joint work with Ping Li and Trevor Hastie.

Title: Weighted Superimposed Codes and Constrained Integer Compressed Sensing

We introduce a new family of codes, termed weighted superimposed codes (WSCs), motivated by applications in wireless communications, biosensing and digital fingerprinting. WSCs can be viewed as a special instance of compressive sensing schemes, where guaranteed noise tolerance is achieved by restricting the alphabet of the input signals. Our main results are lower and upper bounds on the largest achievable code rates of several classes of WSCs. These bounds suggest that, with the codeword and weighting vector constraints at hand, one can improve the code rates achievable by standard compressive sensing techniques.

Joint work with Olgica Milenkovic (UIUC).

Title: Sparsity Recovery: Limits, Algorithms, and Wireless Applications

This talk considers the problem of recovering the support of a sparse vector from random linear measurements with noise. We provide new necessary and sufficient conditions on the number of measurements for reliable detection as a function of the sparsity level, signal-to-noise ratio and dynamic range of the non-zero components. The performance of the popular lasso and orthogonal matching pursuit (OMP) methods is compared against these bounds.
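OMP itself is easy to state; this is a textbook sketch in NumPy (the standard greedy algorithm, not the talk's sequential OMP variant):

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick the column most
    correlated with the residual, then re-fit by least squares."""
    residual = y.copy()
    support = []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # re-solve on the current support and update the residual
        x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ x_s
    x = np.zeros(A.shape[1])
    x[support] = x_s
    return x, sorted(support)
```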

We also consider an application to the random on-off multiple access channel for wireless communications. We show that sparsity pattern detection can be used as a form of multiuser detection offering improved near-far resistance. Further gains at high SNR are also possible with a new sparsity detection algorithm which we call sequential OMP. The framework illustrates the role of power control and successive interference cancellation.

Title: QUAPO: Quantitative Analysis of Pooling in High-Throughput Drug Screening

High-throughput screening (HTS) is a standard step in drug discovery used to identify potential drugs from among thousands or millions of compounds. These compounds are screened for activity on a biological target by testing them in a high-throughput biochemical experiment or assay. Typically a small fraction of compounds tested display activity, so testing a large compound library can be wasteful. Pooling in HTS involves testing mixtures of compounds in each biochemical assay to minimize the number of tests required to screen a library for active compounds. Recent work in our group has focused on designing pooling schemes, inspired by group testing algorithms and tailored for drug screening, that guarantee the identification of active compounds while correcting for errors in testing. In this talk, we will present an overview of the pooling design problem in the context of HTS, highlighting the experimental constraints, current implementations and design-theoretic issues. Furthermore, by using simple biochemical models we extend the ability of pooling schemes not only to identify the active compounds but also to obtain quantitative information about their activity. To achieve this quantitative ability, we developed a biophysically relevant pooling strategy called Quantitative Analysis of Pooling (QUAPO). QUAPO uses a compressive sensing approach to design efficient compound pooling strategies and decode the results to obtain robust and quantitative values for compound activity. We will also highlight potential problems in HTS that could benefit from advances in compressive sensing.
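QUAPO is quantitative, but the classical noiseless group-testing baseline it builds on fits in a few lines; this COMP-style decoder (an illustration, not QUAPO itself) eliminates any compound that appears in a negative pool:

```python
import itertools

def comp_decode(pools, results):
    """COMP decoding for noiseless binary group testing.
    pools: list of lists of compound ids mixed into each assay.
    results: parallel list of booleans (True = assay came back positive).
    Any compound appearing in a negative pool is inactive; with a
    sufficiently disjunct design, the survivors are exactly the actives."""
    candidates = set(itertools.chain.from_iterable(pools))
    for pool, positive in zip(pools, results):
        if not positive:
            candidates -= set(pool)
    return sorted(candidates)
```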

Joint work with Anna Gilbert, Paul Shearer and Peter Woolf (University of Michigan, Ann Arbor).

Title: Sketching and Streaming Entropy via Approximation Theory

I will present near-optimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves upon prior results that obtain suboptimal space bounds in the general model, and near-optimal bounds in the insertion-only model without sketching. Our high-level approach is simple: we give algorithms to estimate Rényi and Tsallis entropy, and use them to extrapolate an estimate of Shannon entropy. The accuracy of our estimates is proven using approximation theory arguments and extremal properties of Chebyshev polynomials. Our work also yields the best-known and near-optimal additive approximations for entropy, and hence also for conditional entropy and mutual information.
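The extrapolation idea can be illustrated on a known distribution (this toy computes exact Rényi entropies rather than streaming estimates, and the symmetric-averaging trick is ours for illustration): H_alpha tends to the Shannon entropy as alpha approaches 1, and averaging evaluations at 1 - eps and 1 + eps cancels the first-order error term:

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy: H_alpha(p) = log(sum p_i^alpha) / (1 - alpha)."""
    return math.log(sum(q ** alpha for q in p)) / (1.0 - alpha)

def shannon_via_renyi(p, eps=1e-4):
    # two-sided average around alpha = 1 cancels the first-order term
    # in the Taylor expansion of H_alpha, leaving O(eps^2) error
    return 0.5 * (renyi_entropy(p, 1 - eps) + renyi_entropy(p, 1 + eps))
```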

Joint work with Nicholas J. A. Harvey (MSR/UWaterloo) and Krzysztof Onak (MIT).

Title: Exact Low-rank Matrix Completion via Convex Optimization

In this talk, I will discuss very general settings in which one can perfectly recover all of the missing entries of a low-rank matrix from a sufficiently large random subset of entries by solving a convex programming problem. Out of all of the matrices whose entries agree with the given subset, this program finds the matrix with the minimum nuclear norm (also known as the Ky-Fan norm or the Schatten 1-norm). I will show that if the number of sampled entries is greater than C n^{1.25} r log n for some positive numerical constant C, then with very high probability, most n x n matrices of rank r are perfectly recovered by the nuclear norm heuristic. The techniques for proving the main results build upon geometric ideas from the recent literature on compressed sensing together with probabilistic tools such as the powerful techniques of Bourgain and of Rudelson for bounding norms of operators between Banach spaces. These results illustrate a general program for perfectly reconstructing geometric objects from very limited information.
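One common way to attack the nuclear-norm heuristic in practice is singular value thresholding; the following is an illustrative sketch (parameter choices are ours), not the talk's analysis:

```python
import numpy as np

def sv_shrink(Y, tau):
    """Singular value soft-thresholding: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def svt_complete(M, mask, tau=1.0, step=1.0, iters=300):
    """Simple SVT-style iteration for nuclear-norm matrix completion:
    alternate singular value shrinkage with a correction step that
    pushes the iterate toward the observed entries (mask == 1)."""
    Y = np.zeros_like(M)
    X = Y
    for _ in range(iters):
        X = sv_shrink(Y, tau)
        Y = Y + step * mask * (M - X)  # ascent on the observed entries
    return X
```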

Title: Coresets and Sketches for High Dimensional Subspace Approximation Problems

Given a point set $P$, a subspace approximation problem asks for a subspace that minimizes a certain objective function, for example, the sum of distances to the subspace. In this work we develop techniques for subspace approximation algorithms that deal with massive data sets. In particular, we develop coresets and sketches for the linear subspace approximation problem, i.e., small-space representations that approximate the input point set $P$ with respect to a subspace approximation problem. Using these techniques, we develop new approximation algorithms and data streaming algorithms for subspace approximation.

Joint work with Dan Feldman, Morteza Monemizadeh and David Woodruff.

Title: The Price of Privacy and the Limits of LP Decoding

This work is at the intersection of two lines of research. One line, initiated by Dinur and Nissim, investigates the price, in accuracy, of protecting privacy in a statistical database. The second, growing from an extensive literature on compressed sensing and explicitly connected to error-correcting codes by Candes and Tao, is in the use of linear programming for error correction.

Our principal result is the discovery of a sharp threshold \rho^* \approx 0.239, so that if \rho < \rho^* and A is a random m \times n encoding matrix of independently chosen standard Gaussians, where m = O(n), then with overwhelming probability over the choice of A, for all x in R^n, LP decoding corrects up to \rho m arbitrary errors in the encoding Ax, while decoding can be made to fail if the error rate exceeds \rho^*. Our bound resolves an open question of Candès, Rudelson, Tao, and Vershynin [3] and (oddly, but explicably) refutes empirical conclusions of Donoho [11] and Candès et al. [3].
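The decoder itself is just an l1-regression linear program; a sketch, assuming SciPy is available (the formulation is standard, the variable names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def lp_decode(A, y):
    """Recover x from a corrupted encoding y = A x + e by minimizing
    ||y - A x||_1 over x, cast as an LP with slack variables t:
    minimize sum(t) subject to -t <= y - A x <= t."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])  # objective: sum of slacks
    I = np.eye(m)
    A_ub = np.block([[A, -I], [-A, -I]])  # encodes |y - Ax| <= t
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * n + [(0, None)] * m  # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]
```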

In the context of privacy-preserving data-mining our results say that any privacy mechanism, interactive or non-interactive, providing reasonably accurate answers to a 0.761 fraction of randomly generated weighted subset sum queries, and arbitrary answers on the remaining 0.239 fraction, is blatantly non-private.

Joint work with Cynthia Dwork and Frank McSherry (Microsoft Research).

Title: Statistical recovery in high dimensions: Practical and information-theoretic limitations

For many statistical inference problems (e.g., estimating structured vectors, matrices, or graphs from noisy data), there is a range of possible algorithms of varying computational complexity. An interesting question is the trade-off between computational complexity and statistical efficiency, where the latter corresponds to the minimum number of samples required for successful recovery. We describe some work on this trade-off in the context of high-dimensional sparse principal components analysis and for learning graphical models from noisy data. We discuss results both for various polynomial-time algorithms (e.g., thresholding estimators, convex relaxations) and optimal algorithms (information-theoretic analysis).

Based on joint works with: Arash Amini, John Lafferty, Pradeep Ravikumar, Prasad Santhanam.


Document last modified on March 4, 2009.