Title: A Survey on Power-Reduction Techniques for Data-Center Storage Systems
As data-intensive, network-based applications proliferate, the power consumed by the data-center storage subsystem surges. This talk surveys a decade of research on power-aware enterprise storage systems. All of the existing power-reduction techniques are classified according to the disk-power factor and storage-stack layer addressed. The majority of power-reduction techniques are based on dynamic power management. We also consider alternative methods that reduce disk access time, conserve space, or exploit energy-efficient storage hardware. For every energy-conservation technique, the fundamental trade-offs between power, capacity, performance, and dependability are uncovered. With this survey, we intend to stimulate integration of different power-reduction techniques in new energy-efficient file and storage systems.
Title: Availability and Locality in Distributed Storage
Currently, several storage clusters are deploying modern distributed storage codes to increase storage efficiency. In this talk, we will first give an overview of how Hadoop HDFS can be modified to use codes with locality, and of the benefits involved. We will then discuss challenges that arise in Facebook warehouses when storing cold and hot data, and how different code properties influence performance and energy savings.
Title: Providing Performance Guarantees for Cloud Applications
Applications with a dynamic workload demand need access to a flexible infrastructure to meet performance guarantees and minimize resource costs. While cloud computing provides the elasticity to scale the infrastructure on demand, cloud service providers lack control and visibility of user-space applications, making it difficult to accurately scale the underlying infrastructure. Thus, the burden of scaling falls on the user. That is, the user must determine when to trigger scaling, and how much to scale by.
In this talk, I will present our ongoing work at IBM aimed at providing a cloud service that automatically scales the infrastructure to meet the user-specified performance requirements, even when multiple user applications are running concurrently. We leverage application-level metrics, along with resource usage metrics, to more accurately scale the infrastructure when compared with existing cloud scaling technologies that only use resource usage metrics. We employ Kalman filtering to automatically learn the (possibly changing) system parameters for each application, thereby proactively scaling the infrastructure to meet the user-specified performance requirements.
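As a rough illustration of how Kalman filtering can track a slowly changing system parameter, here is a minimal scalar sketch. It is not the IBM service described above: the random-walk model, the noise variances `q` and `r`, the choice of per-request service demand as the tracked parameter, and the `servers_needed` sizing rule are all assumptions made for the example.

```python
import math


def kalman_step(x_est, p_est, z, q=1e-4, r=0.01):
    """One predict/update step of a scalar Kalman filter.

    x_est, p_est: current estimate and its variance.
    z: new noisy measurement of the parameter.
    q, r: assumed process and measurement noise variances (hypothetical values).
    """
    # Predict: random-walk model, the parameter is assumed to drift slowly.
    x_pred = x_est
    p_pred = p_est + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new


def estimate(measurements, x0=0.0, p0=1.0, q=1e-4, r=0.01):
    """Run the filter over a sequence of measurements; return all estimates."""
    x, p = x0, p0
    history = []
    for z in measurements:
        x, p = kalman_step(x, p, z, q, r)
        history.append(x)
    return history


def servers_needed(arrival_rate, service_demand, target_utilization=0.7):
    """Hypothetical sizing rule: keep per-server utilization under a target."""
    return math.ceil(arrival_rate * service_demand / target_utilization)
```

In this sketch, each new application-level measurement refines the estimated service demand, and the estimate can then feed a sizing rule such as `servers_needed(arrival_rate, estimate(samples)[-1])`.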
Title: Dynamic Power Management in Data Centers
Energy costs for data centers continue to rise, already exceeding ten billion dollars yearly. Sadly, much of this power is wasted. Servers are only busy 10-30% of the time, but they are often left on while idle, utilizing 60% or more of peak power in the idle state. The obvious solution is dynamic power management: turning servers off, or re-purposing them, when idle. The drawback is a prohibitive "setup cost" to get servers back "on." The purpose of this talk is to understand the effect of the "setup cost" and whether dynamic power management makes sense.
We first turn to theory and study the effect of setup cost in an M/M/k queue. We present the first analysis of the M/M/k/setup queueing system. We do this by introducing a new technique for analyzing infinite, repeating, Markov chains, which we call Recursive Renewal Reward (RRR).
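For orientation, the baseline M/M/k queue without setup cost has a well-known closed form (the Erlang C formula). The sketch below computes that baseline only; it does not model setup times and is not the RRR technique. The function names and the arguments `lam` (arrival rate) and `mu` (per-server service rate) are chosen for this example.

```python
import math


def erlang_c(k, lam, mu):
    """Probability that an arriving job must queue in an M/M/k system.

    k: number of servers, lam: arrival rate, mu: service rate per server.
    No setup cost is modeled: servers are assumed to be always on.
    """
    a = lam / mu   # offered load
    rho = a / k    # per-server utilization
    if rho >= 1.0:
        raise ValueError("unstable system: need lam < k * mu")
    below_k = sum(a ** n / math.factorial(n) for n in range(k))
    tail = a ** k / (math.factorial(k) * (1.0 - rho))
    return tail / (below_k + tail)


def mean_response_time(k, lam, mu):
    """Mean response time: service time plus expected queueing delay."""
    return 1.0 / mu + erlang_c(k, lam, mu) / (k * mu - lam)
```

With `k = 1` this reduces to the M/M/1 result `1 / (mu - lam)`, which gives a quick sanity check before comparing against any model that includes setup.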
We then turn to implementation, where we implement and evaluate dynamic power management in a multi-tier data center with key-value store workload, reminiscent of Facebook or Amazon. We propose a new dynamic algorithm, AutoScale, which is ideally suited to the case of unpredictable, time-varying load, and we show that AutoScale dramatically reduces power in data centers.
Joint work with Anshul Gandhi, Alan Scheller-Wolf, and Mike Kozuch.
Title: Performance Metrics and Protocols for Data Centers in Multimedia
The design and use of data centers involves tradeoffs among cost of transmission, cost of storage, and different performance metrics. In this talk, we present a few simple case studies that illustrate these interactions, in the context of using coding. First, we consider the use of coding for trading off use of a costly resource, say a local cache or network with higher cost, with the probability of interruption of a progressive download video and its buffering delay. Next, we consider a peer-aided edge cache system, where coding is used to provide smooth use of edge cache, peers and data centers, in a way that envisages both storage and transmission costs. Finally, we discuss the use of coding in delivery of video, both when the video is kept uncoded but delivered in a coded fashion, using HTTP over TCP, and when the video is stored in a coded fashion.
Joint work with Flavio du Pin Calmon, Jason Cloud, Ulric Ferner, Kerim Fouli, Minji Kim, Doug Leith, Qian Long, Asu Ozdaglar, Ali Parandehgheibi, Marco Pedroso, Srinivas Shakkottai, Emina Soljanin, Leo Urbina, Luis Voloch, Weifei Zeng.
Title: Greening Datacenters Through Self-Generation of Renewable Energy
Datacenters consume an enormous amount of electricity, which translates into high operational cost and high carbon emissions, since most of this electricity is produced using fossil fuels. Interest has been growing in building "green" datacenters that are partially or completely powered by renewable ("green") sources of energy such as solar or wind. Green datacenters have the potential to reduce both the electricity costs and the carbon footprint. However, solar and wind energy production is variable, making it challenging to use in datacenters. In this talk, I will first explore self-generation with solar and/or wind as an approach to greening datacenters. I will then describe Parasol, a prototype green datacenter that we have built as a research platform. Parasol comprises a small container, a set of solar panels, a battery bank, and a grid-tie. Finally, I will describe our work on matching a datacenter's computational load to the green energy supply. I will present real experiments run on Parasol to show that intelligent workload and energy source management can significantly reduce grid electricity consumption (thereby lowering the carbon footprint) and cost.
Title: Speeding up Content Retrieval in Distributed Systems: Do Redundancy and Cooperation Help?
We address the problem of reducing latency in content retrieval for two large-scale system settings: (a) distributed storage and (b) distributed content delivery.
For coded distributed storage systems, we analyze retrieval latency performance through the lens of queueing theory. Building on this, we ask: when do redundant requests reduce system latency? Intuitively, when multiple copies of a piece of data are stored, service finishes as soon as the first copy is retrieved. However, this benefit must be weighed against the extra system delay introduced by the redundant requests. Who wins? Our goal is to study this in an analytical framework under various system settings, as well as to come up with efficient dynamic redundancy-requesting scheduling policies.
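The benefit side of this trade-off can be illustrated with a small simulation, under the assumption (made only for this example) that each copy's retrieval time is an independent exponential with rate `mu`. The sketch deliberately ignores queueing, so it shows only the gain from waiting for the first copy, not the extra load that redundant requests add.

```python
import random


def mean_first_copy_time(n_copies, mu, trials=20_000, seed=0):
    """Estimate the mean time until the first of n_copies requests completes.

    Assumes i.i.d. exponential retrieval times with rate mu and no queueing,
    which is a simplification chosen for illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Service finishes as soon as the fastest copy is retrieved.
        total += min(rng.expovariate(mu) for _ in range(n_copies))
    return total / trials
```

Under these assumptions the estimate should approach `1 / (n_copies * mu)`, the mean of the minimum of independent exponentials. A fuller model would add queues at each server to capture the delay that redundancy imposes on other requests.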
We then address the value of peer-to-peer cooperation in improving the latency of distributed content delivery systems. We describe our recent work on a robust distributed content delivery system made possible by massively aggregating the micro-resources (of storage, bandwidth, and network connectivity) available at the exponentially growing number of "edge" devices like smartphones, tablets, and laptops. We present a distributed optimization algorithm as well as extensive system simulations validating the power of cooperation.
Title: Performance Diagnosis and Improvements in Data Center Networks
Performance problems in data center networks are hard to detect and resolve because there are large numbers of hosts, switches, and applications with diverse workloads. In this talk, we first present SNAP, a scalable network-application profiler that guides developers in identifying and fixing performance problems. Next, we discuss our experience from a one-week deployment of SNAP in a production data center (with over 8,000 servers and over 700 application components), which helped developers uncover 15 major performance problems in application software, the network stack on the server, and the underlying network. Finally, to solve one of the performance problems we discovered with SNAP, we present our preliminary solution that shares buffer capacity of nearby switches during heavy congestion.