DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms

August 27 - 28, 2015
DIMACS Center, CoRE Building, Rutgers University

Organizers:: Alexandr Andoni, Columbia University; Muthu Muthukrishnan, Rutgers University, smewtoo at gmail.com; Grigory Yaroslavtsev, University of Pennsylvania, grigory.yaroslavtsev at gmail.com

Presented under the auspices of the DIMACS Special Focus on Information Sharing and Dynamic Data Analysis.

Abstracts:

Nina Balcan, Carnegie Mellon University

Title: Distributed Machine Learning

We consider the problem of learning from distributed data and analyze fundamental algorithmic and communication complexity questions involved. Broadly, we consider a framework where information is distributed between several locations, and our goal is to learn a low-error hypothesis with respect to the overall data by using as little communication, and as few rounds of communication, as possible. As an example, suppose k research groups around the world have collected large scientific datasets, such as genomic sequence data or sky survey data, and we wish to perform learning over the union of all these different datasets without too much communication.

In this talk, I will first discuss a general statistical or PAC style framework for analyzing communication complexity issues involved when doing distributed supervised machine learning, i.e., learning from annotated data distributed across multiple locations. I will discuss general lower bounds on the amount of communication needed to learn a given class and broadly-applicable techniques for achieving communication-efficient supervised learning.

I will also discuss algorithms with good communication complexity for unsupervised learning and dimensionality reduction problems, with interesting connections to efficient distributed coreset construction.

Mark Braverman, Princeton University

Title: Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality

We study the tradeoff between the statistical error and communication cost for distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the m machines receives n data points from a d-dimensional Gaussian distribution with unknown mean theta which is promised to be k-sparse. The machines communicate by message passing and aim to estimate the mean theta. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines.

This bound directly leads to a lower bound for the distributed sparse linear regression problem: to achieve the statistical minimax error, the total communication is at least Omega(min{n,d}m), where n is the number of observations that each machine receives and d is the ambient dimension. As our main technique, we prove a distributed data processing inequality, as a generalization of usual data processing inequalities, which might be of independent interest. Finally, we give a communication-optimal protocol for distributed Gaussian mean estimation, improving the number of rounds of the previous such protocol from O(log m) to 1.

Joint work with Ankit Garg, Tengyu Ma, Huy Nguyen, and David Woodruff

Amit Chakrabarti, Dartmouth College

Title: Optimization While Streaming

Dealing with the space limitations and sequential-access constraints of the data-streaming model has infused new life into a wide array of algorithmic problems. A recent and growing trend has been to revisit classic combinatorial optimization problems in the streaming setting. In this talk I shall address two recent works in this vein, the first on constrained submodular maximisation, and the second on set cover.

Each of these problems is defined over a "ground set" or "universe", which we shall take to be [n] := {1,2,...,n}. The input is then a stream of subsets of this universe: the sets themselves in the case of set cover, and (say) the edges of a graph on [n] in the case of maximum-submodular matching. Our focus shall be on semi-streaming algorithms: those running in space O(n poly(log n)).

For the matching problem (and, more generally, submodular maximization) we give new one-pass and multi-pass upper bounds, leaving the establishment of corresponding lower bounds as interesting open problems. For set cover, we give tight upper and lower bounds that fully resolve the pass/approximation tradeoffs achievable: in short, p semi-streaming passes lead to an approximation ratio of n^{1/(p+1)}. The set cover lower bound uses a novel algebraic construction of an abstract incidence geometry, which may be interesting in its own right as a combinatorial result.

Based on separate joint works with Sagar Kale and Tony Wirth.

Graham Cormode, University of Warwick

Title: Trusting the Cloud with Practical Interactive Proofs

When handling large quantities of data, it is often necessary to outsource the computational effort to a third party: this captures current efforts in cloud computing, but also scenarios within trusted computing systems and inter-organizational data sharing. When the third party is not fully trusted, it is important to give assurance that the computation has been performed correctly. This talk presents some recent results in designing new protocols for verifying computations which are streaming in nature: the verifier (data owner) needs only a single pass over the input, storing a sublinear amount of information, and follows a simple protocol with a prover (service provider) that takes a small number of rounds. A dishonest prover fools the verifier with only vanishingly small probability, while an honest prover's answer is always accepted. Starting from basic aggregations, interactive proof techniques allow a quite general class of computations to be verified, leading to practical implementations.

This is joint work with Amit Chakrabarti, Andrew McGregor, Justin Thaler and Suresh Venkatasubramanian.

Artur Czumaj, University of Warwick

Title: On Testing Properties in Directed Graphs

We study property testing algorithms in directed graphs with bounded maximum indegree/outdegree. For directed graphs with bounded degree, there are two different models in property testing introduced by Bender and Ron (2002). In the bidirectional model, one can access both ingoing and outgoing edges while in the one-directional model, which seems to be more natural in most of applications, one can only access outgoing edges. We provide a new relation between the two models: we prove that every property that can be tested with constant query complexity in the bidirectional model, can be also tested with sublinear query complexity in the one-directional model.

A corollary of this result is that in the one-directional model, every property in hyperfinite digraphs is testable with sublinear query complexity.

Joint work with Pan Peng and Christian Sohler.

Alina Ene, University of Warwick

Title: The Power of Randomization: Distributed Submodular Maximization on Massive Datasets

A wide variety of problems in machine learning, including exemplar clustering, document summarization, and sensor placement, can be cast as constrained submodular maximization problems. Unfortunately, the resulting submodular optimization problems are often too large to be solved on a single machine. In this talk, we describe a simple distributed algorithm that is embarrassingly parallel and it achieves provable, constant factor, worst-case approximation guarantees. We demonstrate its efficiency in large problems with different kinds of constraints with objective values always close to what is achievable in the centralized setting.

This is joint work with Rafael Barbosa, Huy Nguyen and Justin Ward.

Funda Ergun, Indiana University, Bloomington & Simon Fraser University

Title: Palindrome Recognition In The Streaming Model

A palindrome is defined as a string which reads forwards the same as backwards, such as the string “racecar”. In this talk we discuss the problem of locatinging the longest (or close to the longest) palindrome in a given data stream of length n, under strict space bounds.

We explore the problem in the streaming context, i.e., the algorithm reads the input stream from left to right, using o(n) space, and at the end outputs the location of the longest palindrome, or a palindrome whose length is a small factor away from the length of the longest palindrome.

We discuss several randomized algorithms; we show how to compute the location of palindromes whose length is an additive epsilon approximation to that of the longest palindrome in O(sqrt(n)) space as well as a multiplicative approximation in O(log(n)) space.

We also briefly discuss the exact solution to the longest palindrome problem as well as lower bounds showing that our solutions are tight.

This is joint work with P. Berenbrink, F. Mallmann-Trenn, E. Sadeqi-Azer.

Piotr Indyk, MIT

Title: Keynote: Streaming Algorithms for Set Cover

In the Set Cover problem, we are given a collection of sets S_1...S_m that cover {1...n} and want to find a minimum cardinality sub-collection that covers the same set. In the streaming version of this problem, the sets are stored consecutively in a read-only memory and an algorithm can access them by performing sequential scans of the input. Over the last few years several algorithms for this problem have been discovered. The algorithms achieve various tradeoffs between the number of passes over the input, the amount of storage used and the approximation factor.

InIn this talk I will give an overview of what we know about this problem, followed by several new upper and lower bounds. In particular, I will describe a recent algorithm that uses O(p) passes and O(m n^(1/p)) storage while guaranteeing O(p log n) approximation factor, as well as lower bounds that indicate that some of the tradeoffs achieved by this algorithm might be tight.

InJoint work with Sepideh Mahabadi and Ali Vakilian.

Sanjeev Khanna, University of Pennsylvania

Title: Approximate Matchings in Dynamic Graph Streams

We consider the problem of approximating a maximum matching in dynamic graph streams where the input graph is revealed as a stream of edge updates that may include both edge insertions and deletions. The goal is to design a streaming algorithm that computes an approximate matching in sublinear space. Essentially, the only known technique for designing algorithms in the dynamic streaming model is linear sketching where the algorithm maintains a linear projection of the input data. In this talk, we will present a resolution of the space needed to approximate maximum matching via linear sketching in the dynamic streaming model. Combined with a recent characterization of dynamic streaming ! algorithms, this in fact resolves the space complexity of approximating matchings by any streaming algorithm.

This is based on joint work with Sepehr Assadi, Yang Li, and Grigory Yaroslavtsev.

Silvio Lattanzi, Google Research, NYC

Title: Expanders via Local Edge Flips

Designing distributed and scalable algorithms to improve network connectivity is a central topic in peer-to-peer networks. In this paper we focus on the following well-known problem: given an n-node d-regular network for d = Omega(log n), we want to design a decentralized, local algorithm that transforms the graph into one that has good connectivity properties (low diameter, expansion, etc.) without affecting the sparsity of the graph. To this end, Mahlmann and Schindelhauer introduced the random “flip” transformation, where in each time step, a random pair of vertices that have an edge decide to ‘swap a neighbor’. They conjectured that performing O(nd) such flips at random would convert any connected d-regular graph into a d-regular expander graph, with high probability. However, the best known upper bound for the number of steps is roughly O(n^17d^23), obtained via a delicate Markov chain comparison argument.

Our main result is to prove that a natural instantiation of the random flip produces an expander in at most O(n^2d^2\sqrt{\log n}) steps, with high probability. Our argument uses a potential-function analysis based on the matrix exponential, together with the recent beautiful results on the higher-order Cheeger inequality of graphs. We also show that our technique can be used to analyze another well-studied random process known as the ‘random switch’, and show that it produces an expander in O(nd) steps with high probability.

This is joint work with Zeyuan Allen-Zhu, Aditya Bhaskara, Vahab Mirrokni and Lorenzo Orecchia.

John Langford, Microsoft Research, NYC

Title: Logarithic Time Prediction

Most multiclass learning algorithms require time O(K) to predict one of K different classes. Yet the lower bound is Omega(log K).

Why is there an exponential gap? Can it be closed? I will discuss a number of results related to the feasibility of logarithmic time prediction, algorithms for doing so, and the results they achieve.

Andrew McGregor, University of Massachusetts, Amherst

Title: The Latest on Linear Sketching for Large Graphs: Lots of Problems, Little Space

In this talk, we survey recent work on using random linear projections, a.k.a. sketches, to solve graph problems. Sketches are useful in a variety of computational models including the dynamic graph stream model were the input is defined by a stream of edge insertions and deletions that need to be processed in small space. A large number of problems have now been considered in this model including edge and vertex connectivity, sparsification, densest subgraph, correlation clustering, vertex cover and matching.

Michael Mahoney, University of California, Berkeley

Title: Linear and Sublinear Aspects of Combining SGD and RLA

In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. SGD methods are easy to implement and applicable to a wide range of convex optimization problems. In contrast, RLA algorithms provide much stronger performance guarantees but are applicable to a narrower class of problems. In addition, both SGD and RLA methods have been studied from an algorithmic perspective, where one must at least touch all of the data in order to obtain nontrivial results, as well as from a statistical perspective, where one can obtain stronger results without even looking at the entire data under quite strong statistical assumptions. We discuss some of these results, and we describe how to use ideas from stochastic optimization to develop a hybrid algorithm for overdetermined L1 and L2 regression that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system.

Vahab Mirrokni, Google Research, NYC

Title: Large-scale Graph Mining at Google NYC: Problems and Frameworks

In this talk, I discuss large-scale graph mining project at Google NYC. The goal of the project is to develop a distributed graph algorithm library for analyzing graphs with hundreds of billions of edges. I present an overview of challenges from three perspectives: application-inspired problems, distributed computation challenges, and finally combined system+algorithms research. In the first topic, I discuss the model of public-private graphs. On the 2nd topic, I discuss randomized composable core-sets for distributed submodular maximization and clustering. Finally, I discuss hybrid algorithmic and system problems for computing connected components in Mapreduce, and also a new graph mining framework based on asynchronous message passing, referred to as ASYMP.

Kamesh Munagala, Duke University

Title: Efficient Primal-dual Graph Algorithms on MapReduce

We present improved algorithms for some key graph-theoretic problems in the popular MapReduce framework. We first consider the densest subgraph problem and present a primal-dual algorithm that provides a (1 + eps) approximation and takes O(1/eps^2 log n) MapReduce iterations, each iteration having a shuffle size of O(m) and a reduce-key-complexity of O(dmax). Here m is the number of edges, n is the number of vertices, and dmax is the maximum degree of a node. Our key idea is to carefully control the width of the underlying polytope so that the number of iterations becomes small, but an approximate primal solution can still be recovered from the approximate dual solution. Our technique also applies to maximum multicommodity flows with bounded path lengths, and our algorithms also naturally map to the PRAM model. For the problems we consider, the best previous distributed algorithms needed number of iterations depending on 1/eps^4, in order to achieve a (1+eps) approximation.

Jelani Nelson, Harvard University

Title: An Introduction to Chaining, and Applications to Sublinear Algorithms

This talk will give a short introduction to chaining methods in probability, present in the works of Kolmogorov and developed by many since (see the book by Talagrand). Applications related to sublinear algorithms will be highlighted, such as to compressed sensing and dimensionality reduction.

This talk is based on works by many authors, none of whom is me.

Krzysztof Onak, IBM Research, NY

Title: Communication Complexity of Learning Discrete Distributions

The bounds on the sample complexity of most fundamental learning and testing problems for discrete distributions are well understood. We consider the scenario when samples are collected by a number of disjoint players and ask the question how much communication is necessary for them to solve the learning or testing problem.

In the talk, I will focus on the problem of learning the distribution and show that players have to essentially transmit their samples in order to solve the problem.

Joint work with Ilias Diakonikolas, Elena Grigorescu, and Abhiram Natarajan.

Ronitt Rubinfeld, MIT

Title: Testing and Correcting Structured Distributions

Given samples of a distribution from an unknown source, one would like to understand what properties the underlying distribution exhibits. When the distribution is over a very large underlying domain, traditional statistics techniques do not usually give a satisfying answer. We survey recent results which give general purpose techniques for addressing such questions. These results apply to a variety of structured properties defined in terms of "shape constraints".

We next suggest a principled methodology for "correcting" samples of distributions which come from a noisy or imperfect source. We show connections between these correction procedures and distribution learning and property testing algorithms.

Some of the results described represent joint work with Clement Canonne, Ilias Diakonikolas and Themis Gouleakis.

Justin Thaler, Yahoo! Research, NYC

Title: Parallel Peeling Algorithms

Consider the following peeling process: starting with a random hypergraph, repeatedly remove vertices with degree less than k, together with their incident edges, until the hypergraph is empty or the process cannot continue. In a surprising variety of settings, this simple process produces algorithms with very fast running times, typically linear in the size of the hypergraph.

After surveying applications of such peeling algorithms to various big data problems, I will discuss methods for parallelizing the peeling process. Our analysis yields parallel peeling algorithms that are highly efficient, completing in just O(log log n) rounds.

Joint work with Jiayang Jiang and Michael Mitzenmacher

Sergei Vassilvitskii, Google Research, NYC

Title: Graph Connectivity in MapReduce: How Hard Could it Be?

We use the connected components problem to review algorithmic approaches for MapReduce algorithms and describe some recent progress on lower bounds in this setting.

David Woodruff, IBM Research, Almaden

Title: Tutorial: A Survey of Results in the Message Passing Communication Model

I'll discuss various tools for proving bounds in the message passing communication model, where there are k players each with an input who speak to each other in a point-to-point communication network, with the goal being to solve a function of their inputs while minimizing total communication. I'll discuss graph problems, linear-algebraic problems, and statistical problems.

Previous: Program

Workshop Index

DIMACS Homepage

Contacting the Center
Document last modified on August 27, 2015.