DIMACS Center, CoRE Building, Rutgers University

**Organizing Committee:****Baruch Schieber**, IBM, sbar at us.ibm.com**Hadas Shachnai**, Technion, hadas at cs.technion.ac.il**Lisa Zhang**, Nokia Bell Labs, ylz at research.bell-labs.com

Title: On Coflow scheduling

Coflows have emerged as a basic communication problem in large systems (see Chowdhury, Zhong and Stoica [SIGCOMM 2014] and Chowdhury and Stoica [Hot Topics on Networks 2012]), especially in the context of Mapreduce. Abstractly, communication issues in data centers can be thought of as a single non blocking switch with m input and m output ports.

A coflow is specified by a communication matrix that specifies the amount of data that needs to be sent from input port i to output port j. At each step, after a matching is selected between the input and output ports, one unit of data can be transmitted simultaneously on each edge of the matching. A single coflow can be scheduled optimally However, we wish to schedule n coflows with the goal of minimizing the weighted completion time (a coflow only completes when all of its data is sent).

This problem is NP-hard and is a significant generalization of the open concurrent shop problem. We show that in fact it can be solved using techniques similar to those used to solve open concurrent shop and a factor 4 approximation can be obtained this way, even combinatorially.

Joint work with Manish Purohit and Sheng Yang.

Title: Communication-Aware (Data Center) Network Design : Some Results and Challenges

We currently witness the emergence of interesting new network topologies optimized towards the traffic matrices they serve, such as demand-aware datacenter interconnects (e.g., ProjecToR) and demand-aware peer-to-peer overlay networks (e.g., SplayNets). This work introduces a formal framework and approach to reason about and design such topologies.

In particular, we establish a connection between the communication request distribution and the expected path length in the network and show that this relationship depends on entropy measures of the communication matrix.

We derive a general lower bound and present asymptotically optimal network-aware design algorithms for important distribution families (such as sparse distributions and distributions of locally bounded doubling dimensions, among others.)

Title: Classic Network Measurements meets Software Defined Networking

In modern cloud infrastructures, each physical server often runs multiple virtual machines and employs a software Virtual Switch (VS) to handle their traffic. In addition to switching, the VS performs network measurements such as identifying the most frequent flows, which are essential for networking applications such as load balancing.

In this talk, I will present new sampling techniques that allow the switch to significantly reduce its CPU consumption thereby increasing its throughput.

Title: Recent Advances in Coflow-Based Networking

Over the past decade, the confluence of an unprecedented growth in data volumes and the rapid rise of cloud computing has fundamentally transformed systems software and corresponding infrastructure. To deal with massive datasets, more and more applications today are scaling out to large datacenters. Communication between the distributed computation tasks of these applications often result in massive data transfers over the network. Consequently, concentrated efforts in both industry and academia have gone into building high-capacity, low-latency datacenter networks at scale. At the same time, researchers and practitioners have proposed a wide variety of solutions to minimize flow completion times or to ensure per-flow fairness based on the point-to-point flow abstraction that forms the basis of the TCP/IP stack.

We observe that despite rapid innovations in both applications and infrastructure, application- and network-level goals are moving further apart. Data-parallel applications care about all their flows, but today's networks treat each point-to-point flow independently. This fundamental mismatch has resulted in complex point solutions for application developers, a myriad of configuration options for end users, and an overall loss of performance.

The recently proposed coflow abstraction bridges the gap between application-level performance and network-level optimizations. Each multipoint-to-multipoint coflow represents a collection of flows with a common application-level performance objective, enabling application-aware decision making in the network. We describe complete solutions including architectures, algorithms, and implementations that apply coflows to multiple scenarios using central coordination, and we demonstrate through large-scale cloud deployments and trace-driven simulations that simply knowing how flows relate to each other is enough for better network scheduling, meeting more deadlines, and providing higher performance isolation than what is otherwise possible using today's application-agnostic solutions.

Title: Explicit Expanding Expanders as Datacenter Topologies

In this talk I will explore the use of expander graphs as datacenter topologies, from a mostly theoretical perspective. I will mainly discuss the problem of making expanders "expandable". Since datacenters grow over time, we would like the ability to expand a datacenter by adding new servers or racks without requiring completely rewiring the network. One way to do this is of course to leave extra, unused capacity, but in order to maximize datacenter performance we would like to use all available capacity, and instead simply bound the amount of rewiring necessary. Recent proposals for such "expandable" datacenters include using random graphs as the underlying topology (the "Jellyfish" system of Singla et al.). We show that we can build explicit, deterministic expanders with strong throughput and rewiring guarantees (comparable to Jellyfish). Time permitting, I will also discuss recent results showing that two competing approaches for designing optimal topologies (expanders and degree/diameter graphs) are intertwined: any good degree/diameter graph is in fact a good expander.

Joint work with Asaf Valadarsky, Gal Shahaf, and Michael Schapira

Title: Towards Lossless Data Center Reconfiguration: Consistent Network Updates in SDNs

Data center networks (DCNs) are in a constant state of change: Traffic matrices shift, equipment gets repaired, VMs are migrated, and so on. As such, it is necessary to optimize the network behavior, e.g., change routing paths and modify flow splitting rules. Software defined networking (SDN) allows us to implement these optimizations, bringing the power of classical mathematical optimization algorithms into the domain of networking.

But, even though SDN comes with the promise of central control, the implementation still happens in the switches themselves, a decentralized and asynchronous environment. In the ideal case, the DCN should be oblivious to reconfiguration via so called network updates, not suffering data loss due to optimization procedures.

In this talk, we are going to discuss the inherent algorithmic issues that arise when trying to keep network updates consistent under two fundamental consistency properties, loop freedom and congestion avoidance. The talk will be based on various recent papers, please see http://www.foerster.me for more information.

Title: Tutorial: Performance Challenges in Distributed Machine Learning on Large-scale Data Centers

Machine learning is one of today's most rapidly growing fields: in particular, deep learning has achieved breakthrough results in key problems of computer vision, speech recognition, natural language processing, and robot control. To successfully train neural networks with many hidden layers, efficient algorithms, large datasets, and large scale data centers all play a fundamental role. When scaling up is not enough, data and computation must be distributed among multiple nodes, e.g., multiple CPUs, GPUs, or even FPGAs with dedicated hardware architectures. In this tutorial, we will review the solutions adopted in practice to distribute stochastic gradient descent computations at the scale of data centers. We will consider performance issues that can be encountered by these approaches when communication between nodes uses TCP/IP over Ethernet, a common scenario in modern data centers, and highlight possible models/methods to select optimal resource allocations.

Title: Policy Driven Software Defined Multicast Using Efficient Selection of Unicast Branching Points

Conventional multicast forwarding state is nonaggregatable and non-scalable and hence multicast is disabled in most public IP networks. We present a new, SDN based multicast technique designed to overcome these limitations. This technique, which we call Unicast Branching (UB), works by translating incoming multicast flows to unicast flows that are sent via a customized point-to-point distribution tree of branching switches to egress switches. By mapping multicast to unicast, UB eliminates multicast state from intermediate switches and enables interworking across layers and protocols and full overlay and underlay virtualization. Given a candidate set of branching switches for a given multicast flow, the standard Steiner tree methods for building optimal multicast trees are not directly usable for UB. We therefore devise an efficient two-way mapping between this problem and the Steiner tree problem in a related graph. This allows us to leverage the considerable body of work on the Steiner tree problem to provide full policy support in the selection of the number, type and location of branching switches for each multicast flow.

Joint work with: Gordon Wilfong and T. V. Lakshman Nokia Bell Labs

Title: Reusing Network Services Logic to Improve Network Performance

Network traffic inside a datacenter network flow through chains of services. These services include middlebox applications such as packet processing and monitoring, caching, and billing, as well as endpoint applications such as web servers. In many cases, traffic goes through highly similar processing steps over and over, in different services. A simple, yet highly resource-consuming example step is packet classification, which is done in almost any network service.

In this talk we present OpenBox - a software-defined framework for developing, deploying, and managing network functions in large-scale managed networks. OpenBox is designed to centrally aggregate the logic of multiple services such that packets would traverse a shorter processing path, as similar processing steps are merged. We implemented and deployed the OpenBox framework, and we show that it improves network performance and resource utilization.

As OpenBox is a relatively clean-slate solution, we also discuss a gradual approach, in which existing services could offload some of their logic to supporting infrastructure or to other services, to improve the overall network performance.

Title: Survey: Approximating Spanners

The basic minimum k-spanner problem is the following. Given a graph G(V,E) and a number k, find a subgraph G'=G'(V,E') of G so that dist_{G'} (u,v) \leq k* dist_{G}(u,v), for every pair of vertices u,v, such that the number of edges in G' is minimized. In this survey talk, we present some general existential upper bounds for k-spanners. But mostly, we survey approximating algorithm for spanners, including very recent results. In particular, the survey is partially based on a paper with Chlamtac, Dinitz, and Laekhanukit, presented in SODA 2017.

Title: Algorithmic Challenges in Building Efficient Cloud Infrastructure

I will talk about two algorithmic questions that arise in the context of building efficient cloud infrastructure. The first problem is about expressing an n × n doubly stochastic matrix as a linear combination using the smallest number of (sub)permutation matrices. This problem arises in the setting of routing in reconfigurable data centers.

The second problem I will talk about is scheduling and pricing virtual machines in the cloud: How should we price and schedule VMs online to maximize the total profit? The problem is motivated by cloud services such Microsoft Azure or Amazon Elastic Cloud.

Title: Sublinear Random Access Generators for Preferential Attachment Graphs

We consider the problem of generating random graphs in evolving random graph models. In the standard approach, the whole graph is chosen randomly according to the distribution of the model before answering queries to the adjacency lists of the graph. Instead, we propose to answer queries by generating the graphs on-the-fly while respecting the probability space of the random graph model. We focus on two random graph models: the Barabasi-Albert Preferential Attachment model (BA graphs) and the random recursive tree model. We present sublinear randomized generating algorithms for both models. Per query, the running time, the increase in space, and the number of random bits consumed are $\poly\log(n)$ with probability $1-1/\poly(n)$, where n denotes the number of vertices. This result shows that, although the BA random graph model is defined sequentially, random access is possible without chronological evolution. In addition to a conceptual contribution, on the-fly generation of random graphs can serve as a tool for simulating sublinear algorithms over large BA-graphs.

Joint work with Guy Even, Reut Levi and Adi Rosen.

Title: Load Balancing on the Cloud Infrastructure

I will discuss a number of online load balancing problems for cloud infrastructure: one in the context of consistent hashing with bounded loads for dynamic environments, one is in the context of online scheduling with overcommitment, and finally one in the context of multi-dimensional load balancing. Other than presenting theoretical results on these topics, we show how some of our new algorithmic techniques have been applied by Google and other companies, and confirm their significance in practice.

Title: Online Vector Scheduling

The online load balancing problem, introduced by Graham in the1960s, is one of the oldest and most extensively studied problems in the scheduling literature. Recent interest in data center scheduling has led to renewed efforts at solving this problem in the context of vector loads that characterize jobs with multi-dimensional resource requirements. In this talk, I will describe nearly tight algorithms for the vector scheduling problem on identical, related, and unrelated machines. This talk will be based on a joint paper with Sungjin Im, Nat Kell, and Janardhan Kulkarni in FOCS 2015 and a very recent set of results with Sungjin Im, Nat Kell, and Maryam Shadloo.

Title: Tutorial: Network Function Virtualization (NFV) Placement Theory and Practice

This tutorial focuses on the importance of modeling for the development optimization algorithms for the Network Function Virtualization (NFV) and Software Define Networks (SDN) paradigm. Since the systems we deal with are very complex and the actual behavior depends on many different parameters, we usually create a model that captures the relevant aspects of the system, define the important business or operation objectives and apply algorithmic techniques to find a scheme that optimizes the objectives over this model. We then deploy these algorithms on the real systems, hoping to get optimal performance. A key observation here is that the definition of the model has a critical impact on the actual results and small changes in the model can result in very different allocation techniques and system performance. In this talk I will illustrate this point by describing several optimization problems in the area of NFV/SDN with specific emphasis on the model, its ability to capture the important aspects and the quality of the final optimization solution.

This tutorial covers work done while I was at Bell Labs (these results will appear in IM 2017 and INFOCOM 2017).

Title: Scheduling with Energy and Network Constraints

Energy efficient scheduling is a critical task in modern data centers. It has been observed that proper scheduling mechanisms that assign jobs to parallel machines can play a major role in saving energy by selectively shutting down machines. In traditional scheduling literature (called the restricted assignment setting), each job has a fixed subset of machines where it can be assigned. However, such a framework severely restricts the possible assignments, and may become a performance bottleneck.

In this talk, we will discuss our preliminary work that allows for some migration of jobs through data center network to both improve performance, and energy consumption. The model generalizes other well-known combinatorial optimization problems such as capacitated vertex cover etc., and gives rise to many exciting new directions of research.

Title: Preemptive Resource-Constrained Scheduling with Deadlines

Resource constrained scheduling problems have received considerable attention since the work of Graham and Gary in 1960s. Such problems naturally arise also in data center networks, where commonly shared resources are storage space, processors, or network bandwidth. We consider a throughput and a machine minimization variant of the following problem in this domain: given a set of non-parallel jobs with resource requirements, release times and deadlines, and a set of identical machines, schedule all (in the machine minimization variant) or a maximum weight subset (for the throughput variant) of jobs on these machines feasibly. Preemption and migration are allowed.

While the machine minimization variant (in the "slotted time" model) generalizes the well-known vector packing problem, the throughput variant has applications in ad campaign scheduling. I will present some recent results in a joint work with Baruch Schieber and Hadas Shachnai.

Title: Tutorial: Algorithmic Issues in Network Resource Reservation Problems

Cloud-based applications, including batch processing, streaming, and scale-out databases, generate a significant amount of network traffic and a considerable fraction of their runtime is due to network activity. Accordingly, interference on the network can lead to an unpredictable performance (e.g., execution times). This tutorial gives an overview of the algorithmic problems underlying network resource reservation problems. In particular, we will discuss challenges and algorithmic solutions for the (virtual) network embedding problem, both at deployment time (offline) as well as at runtime (online). We conclude with an overview of models and open research questions.

Title: Towards Optimal-Performance Datacenters

Despite extensive efforts to meet the ever-growing demands, today's datacenters often exhibit far-from-optimal performance in terms of network utilization, resiliency to failures, cost efficiency, incremental expandability, and more. Consequently, many novel architectures for high performance datacenters have been proposed. We show that the benefits of state-of-the-art proposals are, in fact, derived from the fact that they are (implicitly) utilizing "expander graphs" (aka expanders) as their network topologies, thus unveiling a unifying theme of these proposals. We observe, however, that these proposals are not optimal with respect to performance, do not scale, or suffer from seemingly insurmountable obstacles to deployment. We leverage these insights to present Xpander, a novel datacenter architecture that achieves near-optimal performance and provides a tangible alternative to existing datacenter designs.

We contrast expander-based datacenters a la Xpander with another recently proposed approach for improving datacenter performance: the design of reconfigurable topologies that can be dynamically adapted to changes in traffic patterns. We report on initial results comparing Xpander to reconfigurable datacenter designs of the same cost. Our results suggest that expander-based datacenters achieve comparable performance benefits, thus constituting a simpler path for progress beyond today's datacenters than contending with the significant deployment barriers facing dynamic topologies.

This is a joint work with Michael Dinitz (Johns Hopkings University), Simon Kassing and Ankit Singla (ETH Zurich), Gal Shahaf and Michael Schapira (Hebrew University)

Previous: Program

Workshop Index

DIMACS Homepage

Contacting the Center

Document last modified on June 1, 2017.