DIMACS/DyDAn Center, CoRE Building, Rutgers University

**Organizers:**

- **Aiyou Chen**, Bell Labs, aychen at research.bell-labs.com
- **Graham Cormode**, AT&T Labs, grahamr at research.att.com
- **Andrew McGregor**, UCSD, amcgregor.web at gmail.com
- **Olgica Milenkovic**, UIUC, milenkov at uiuc.edu
- **S. Muthukrishnan**, Rutgers University, muthu at cs.rutgers.edu
- **Fred Roberts**, Rutgers University, froberts at dimacs.rutgers.edu
- **Emina Soljanin**, Bell Labs, emina at bell-labs.com

Title: Versions of Random Forests: Properties and Performance

Random Forests is a combination of tree predictors in which each tree depends on the values of a random vector. Since it was suggested by Leo Breiman, the method has received much attention due to its remarkable empirical success, but there have been relatively few theoretical results. In this talk, I will introduce several versions of Random Forests for classification, present their theoretical properties, and report some of their performance results.

Title: Deterministic Compressive Sensing Matrices

Compressed Sensing aims to capture attributes of a signal using very few measurements. The Restricted Isometry Property (RIP) is the condition that the sensing matrix acts as a near isometry on all k-sparse signals. Candes and Tao showed that this condition is sufficient for sparse reconstruction and that random matrices, whose entries are generated by an i.i.d. Gaussian or Bernoulli process, satisfy the RIP with high probability. This approach treats all k-sparse signals as equally likely, in contrast to mainstream signal processing, where the filtering is deterministic and the signal is described probabilistically. In the mainstream framework the sensing matrix is deterministic and is required to act as a near-isometry on k-sparse vectors with high probability. We provide weak conditions that are sufficient to show that a deterministic sensing matrix satisfies this Statistical Restricted Isometry Property (STRIP). The proof is elementary and avoids intricate combinatorial arguments involving the coherence of orthonormal bases.

The new framework encompasses many families of deterministic sensing matrices, including those formed from discrete chirps and Delsarte-Goethals codes, for which there exist very efficient reconstruction algorithms that are resilient to noise. In the case of Delsarte-Goethals matrices, it is possible to strengthen the STRIP property: the reconstruction algorithm involves pointwise multiplication of the k-sparse superposition with a shift of itself, followed by the Fourier / Walsh-Hadamard transform, and a refinement of the STRIP property holds for all offsets and every Fourier / Walsh-Hadamard coefficient.

Title: Finding Frequent Items in Data Streams

The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. In this talk, I will survey the problem and present the most important algorithms in a common framework. I'll also show some experimental comparisons, and discuss some more recent results which explain why some of these algorithms give even better results than their initial analysis led us to believe.
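One of the classical deterministic algorithms in this line of work is the Misra-Gries summary, a generalization of the Boyer-Moore majority algorithm. A minimal Python sketch (parameter names are illustrative, not from any particular implementation in the talk):

```python
def misra_gries(stream, k):
    """Maintain at most k-1 counters; any item occurring more than
    len(stream)/k times is guaranteed to survive in the summary,
    and each counter undercounts the true frequency by at most len(stream)/k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Example: 'a' occurs 50 times in a stream of 100 items, so with k=10
# (threshold 100/10 = 10) it must appear in the summary.
stream = ['a'] * 50 + [f'x{i}' for i in range(50)]
summary = misra_gries(stream, k=10)
```

The update is O(1) amortized, and the summary uses O(k) space regardless of stream length, which is the kind of trade-off the surveyed algorithms share.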

The talk covers joint work with Marios Hadjieleftheriou, Piotr Indyk, Radu Berinde and Martin Strauss.

Title: Sparse Recovery Using Sparse Random Matrices

Over recent years, a new *linear* method for compressing high-dimensional data (e.g., images) has been discovered. For any high-dimensional vector x, its *sketch* is equal to Ax, where A is an m x n matrix (possibly chosen at random). Although typically the sketch length m is much smaller than the number of dimensions n, the sketch contains enough information to recover an *approximation* to x. At the same time, the linearity of the sketching method is very convenient for many applications, such as data stream computing and compressed sensing.
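The convenience of linearity for streaming can be made concrete with a sparse sketching matrix in the Count-Sketch style (one random signed entry per column, chosen here purely for illustration): each stream update to a single coordinate of x touches only one sketch entry, and the result coincides with computing Ax on the final vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 50          # dimension and sketch length (illustrative sizes)

# Sparse sketching matrix: one random +/-1 entry per column.
row = rng.integers(0, m, size=n)
sign = rng.choice([-1.0, 1.0], size=n)

def sketch_update(s, i, delta):
    """Process a stream update x[i] += delta in O(1) time."""
    s[row[i]] += sign[i] * delta

# Linearity: sketching a stream of updates equals A @ x for the final x.
x = np.zeros(n)
s = np.zeros(m)
for i, delta in [(3, 2.0), (10, -1.0), (3, 1.0), (999, 4.0)]:
    x[i] += delta
    sketch_update(s, i, delta)

# Materialize A only to check the equivalence; streaming never needs it.
A = np.zeros((m, n))
A[row, np.arange(n)] = sign
```

This sparsity is exactly what gives the combinatorial approach its fast sketching time: one memory update per stream item instead of m.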

The major sketching approaches can be classified as either combinatorial (using sparse sketching matrices) or geometric (using dense sketching matrices). They achieve different trade-offs, notably between the compression rate and the running time. Thus, it is desirable to understand the connections between them, with the goal of obtaining the "best of both worlds" solution.

In this talk we show that, in a sense, the combinatorial and geometric approaches are based on different manifestations of the same phenomenon. This connection will enable us to obtain several novel algorithms and constructions, which inherit advantages of sparse matrices, such as lower sketching and recovery times.

Joint work with: Radu Berinde, Anna Gilbert, Howard Karloff, Milan Ruzic and Martin Strauss

Title: Data Stream Algorithms: Proving Memory Lower Bounds

The recent decade has witnessed tremendous innovations in the processing of continuous streams of data using only a limited amount of memory. Intensive research on the subject has resulted in a broad theoretical understanding of this model. This talk will address the limitations of the data stream model. In particular, how do we prove lower bounds on the memory needed by data stream algorithms?

As with NP-completeness, reductions play a major role, involving a toolkit of certain classical problems. We will describe the methodology used to prove lower bounds for these classical problems.

The techniques are drawn from many diverse areas such as communication complexity, information theory and statistics.

Title: Compressed Counting

Compressed Counting (CC) is a new method for approximating the pth frequency moments of data streams, using maximally-skewed stable random projections. Under the relaxed strict-Turnstile data stream model, it is known that the first moment (i.e., the sum, p=1) can be computed trivially using a simple counter. CC successfully captures the intuition that an intelligent counting system should have low complexity when p is around 1. Compared to previous classical work based on symmetric stable random projections, CC dramatically reduces the estimation errors, especially around p=1. CC has natural applications in estimating the entropy of data streams, because one can approximate the Shannon entropy using functions of the pth frequency moments around p=1. In large-scale networks, the Shannon entropy has been demonstrated effective in detecting anomalous events such as network failures and distributed denial of service (DDoS) attacks.
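One standard way to express "Shannon entropy as a function of frequency moments around p=1" is via the Tsallis entropy T_p = (1 - F_p / m^p) / (p - 1), where F_p is the pth frequency moment and m = F_1; as p -> 1, T_p converges to the Shannon entropy. A small numerical illustration (the moment here is computed exactly on toy counts rather than via CC's projections):

```python
import math

counts = [50, 30, 20]        # frequencies of a toy stream, m = 100
m = sum(counts)

def tsallis(p):
    # T_p = (1 - F_p / m^p) / (p - 1), with F_p = sum_i c_i^p
    F_p = sum(c ** p for c in counts)
    return (1 - F_p / m ** p) / (p - 1)

shannon = -sum((c / m) * math.log(c / m) for c in counts)

# As p -> 1, the Tsallis entropy approaches the Shannon entropy,
# so an accurate F_p estimator near p = 1 yields an entropy estimator.
approx = tsallis(1.001)
```

In CC's setting one would replace the exact F_p above with the moment estimate from maximally-skewed stable projections; the accuracy gain near p=1 then translates directly into entropy accuracy.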

Title: Identifiability in Network Tomography under Dependence

We study the problem of identifiability of the joint distribution of flows in a computer network from aggregate measurements collected on its edges. Conceptually, this is a canonical example of a statistical linear inverse problem. In a departure from previous approaches, we investigate situations where flow volumes exhibit dependence. We introduce a number of models that capture spatial, temporal and inter-modal (i.e., between packets and bytes) dependence between flow volumes. These models are fairly general, but specific instances that incorporate structural features of network traffic are also investigated. We provide sufficient, and sometimes necessary, conditions for the identifiability of the flow-volume distribution (up to its mean) under these models. Further, we investigate conditions on network routing that are sufficient for identifiability of the flow-volume distribution.
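The linear inverse structure can be seen in a toy example (the network and routing matrix below are hypothetical, not from the talk): link loads are aggregates y = Rx of flow volumes, and when the routing matrix R has fewer independent rows than there are flows, mean link loads alone cannot identify mean flow volumes, which is why distributional or dependence assumptions enter.

```python
import numpy as np

# Toy network: 3 origin-destination flows observed on 2 links.
# Row j of R indicates which flows traverse link j (hypothetical routing).
R = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)

# Link measurements are aggregates of flow volumes: y = R @ x.
x = np.array([5.0, 2.0, 7.0])
y = R @ x

# rank(R) = 2 < 3 flows: the mean flow vector is not identifiable from
# mean link loads alone, motivating the distributional conditions studied.
rank = np.linalg.matrix_rank(R)
```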

Title: Matrix Completion from Fewer Entries

Low-rank models are frequently used in machine learning and statistics. An important domain of application is provided by collaborative filtering, whereby a low-rank matrix describes the ratings that a large set of users attribute to a large set of products. The problem is in this case to predict future ratings from a sparse subset currently available. The dataset released for the Netflix Challenge provides an ideal testbed for theory and algorithms for learning low-rank matrices.

Given M, a random n x n matrix of rank r, we assume that a uniformly random subset E of its entries is observed. We describe an efficient procedure that reconstructs M from |E| = O(rn) observed entries with arbitrarily small root mean square error. Further, the matrix can be reconstructed exactly from O(rn log n) entries.

This improves (by an unbounded factor) over recent results by Candes and Recht. The complexity of our algorithm is O(|E| r log n), which opens the way to its use for massive data sets. In the process of proving these statements, we obtain a generalization of a celebrated result by Friedman-Kahn-Szemeredi and Feige-Ofek on the spectrum of sparse random matrices.
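The first step of such spectral reconstruction procedures can be sketched as follows: zero-fill the unobserved entries, rescale by the inverse sampling rate, and project onto rank r via the SVD. This is a simplified illustration only; the full procedure described in the talk also involves trimming over-represented rows/columns and a local refinement step.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, frac = 100, 2, 0.5     # illustrative sizes and sampling rate

# Ground-truth rank-r matrix M = U V^T.
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
M = U @ V.T

# Observe a uniformly random subset E of entries; zero-fill the rest.
mask = rng.random((n, n)) < frac
M_obs = np.where(mask, M, 0.0)

# Spectral estimate: rescale so the observed matrix is unbiased for M,
# then keep only the top-r singular directions.
M_scaled = M_obs / frac
Us, s, Vt = np.linalg.svd(M_scaled, full_matrices=False)
M_hat = Us[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Relative error; it shrinks as n grows or more entries are observed.
rel_err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
```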

Based on joint work with R. H. Keshavan and S. Oh.

Title: Approximate Sparse Recovery: Optimizing Time and Measurements

A Euclidean approximate sparse recovery system consists of parameters epsilon, k, d, an n-by-d measurement matrix, Phi, and a decoding algorithm, D. Given a vector, x, the system approximates x by y = D(Phi x), which must satisfy ||y - x||_2 <= (1+epsilon) ||yopt_k - x||_2, where yopt_k denotes the optimal k-term approximation to x. For each vector x, the system must succeed with probability at least 3/4. Many previous works have addressed this problem. Among the goals are minimizing the number n of measurements and the runtime of the decoding algorithm, D.

In this result, we give a system with n=O(k log(d/k)/epsilon) measurements---matching a lower bound, up to constant factors---and decode time k*poly( log(d)/epsilon), matching a lower bound up to factors of log(d)/epsilon.

Joint work with Anna C. Gilbert (Michigan), Yi Li (Michigan), Ely Porat (Bar-Ilan), and Martin J. Strauss (Michigan).

Title: Iterative signal recovery from incomplete and inaccurate measurements

Compressive Sampling (CoSa) offers a new paradigm for acquiring signals that are compressible with respect to an orthobasis. The major algorithmic challenge in CoSa is to approximate a compressible signal from noisy samples. Until recently, all provably correct reconstruction techniques have relied on large-scale optimization, which tends to be computationally burdensome.

This talk describes an iterative, greedy recovery algorithm, called CoSaMP, that delivers the same guarantees as the best optimization-based approaches. Moreover, this algorithm offers rigorous bounds on computational cost and storage. It is likely to be extremely efficient for practical problems because it requires only matrix-vector multiplies with the sampling matrix. For many cases of interest, the running time is just O(N log^2 N), where N is the length of the signal.
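The CoSaMP iteration (form a signal proxy from the residual, merge the 2k largest proxy components with the current support, solve least squares on the merged support, prune to k terms) can be sketched in a few lines of numpy. Dimensions, the stopping rule, and the dense least-squares step here are illustrative simplifications of the algorithm in the talk:

```python
import numpy as np

def cosamp(Phi, u, k, iters=30, tol=1e-10):
    """Recover a k-sparse x from samples u = Phi @ x (noiseless illustration)."""
    n = Phi.shape[1]
    x = np.zeros(n)
    for _ in range(iters):
        r = u - Phi @ x                               # current residual
        proxy = Phi.T @ r                             # signal proxy
        omega = np.argsort(np.abs(proxy))[-2 * k:]    # 2k largest components
        T = np.union1d(omega, np.flatnonzero(x))      # merge supports
        b, *_ = np.linalg.lstsq(Phi[:, T], u, rcond=None)  # least squares on T
        x = np.zeros(n)
        keep = np.argsort(np.abs(b))[-k:]             # prune to k largest
        x[T[keep]] = b[keep]
        if np.linalg.norm(u - Phi @ x) <= tol:
            break
    return x

# Demo: recover a 5-sparse signal from 60 Gaussian samples in R^200.
rng = np.random.default_rng(0)
m, n, k = 60, 200, 5
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x_hat = cosamp(Phi, Phi @ x0, k)
```

The fast O(N log^2 N) running time quoted above relies on structured sampling matrices with fast multiplies, whereas this sketch uses a dense Gaussian Phi for clarity.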

Joint work with D. Needell.


Document last modified on March 25, 2009.