DIMACS Workshop on Parallelism: A 2020 Vision

March 14 - 16, 2011
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:
Phil Gibbons, Intel
Howard Karloff, AT&T Research, howard at research.att.com
Sergei Vassilvitskii, Yahoo! Research
Abstracts:

Foto Afrati, National Technical University of Athens

Title: Cluster Computing, Recursion and Datalog

The introduction of map-reduce for parallel computation on commodity clusters focused a great deal of commercial and intellectual interest on this model and on similar approaches to managing large-scale data. Various extensions have been implemented, and some of them also handle recursion. Extending the map-reduce framework to include recursive queries presents many challenges, such as recovering from compute-node failures and managing file transmission. Datalog has been used by various researchers and implementors to express queries on large data. We will present issues that arise when we execute Datalog queries on a cluster of low-end compute nodes. Datalog also provides a convenient way to express queries and allows us to place a lower bound on the performance of the algorithms: the number of derivations of a Datalog program. Thus we compare the execution time and the data-volume cost of our algorithms to the number of derivations.

We first present a computation model for evaluating algorithms and then, using transitive closure as an example, examine the cost of executing recursions in several different ways. We will discuss how to generalize to all Datalog programs. We also observe that in the cluster computing environment it is more economical to use a small number of rounds in the recursion. However, for simple Datalog programs this comes at the cost of increasing the number of derivations. We discuss how linear Datalog programs can benefit from the transitive closure algorithms we present.
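
To make the round/derivation trade-off concrete, here is a minimal Python sketch (for illustration only, and not tied to any particular cluster framework) of two ways to evaluate transitive closure: the linear rule, which joins the current result with the base edge relation and may need as many rounds as the longest path, and the nonlinear (recursive-doubling) rule, which joins the result with itself and converges in a logarithmic number of rounds at the cost of more derivations.

    # Sketch: transitive closure two ways. Helper names are illustrative,
    # not the speaker's implementation. Edges are a set of (x, y) pairs.

    def join(r, s):
        """Relational join: {(x, z) : (x, y) in r and (y, z) in s}."""
        by_first = {}
        for y, z in s:
            by_first.setdefault(y, []).append(z)
        return {(x, z) for x, y in r for z in by_first.get(y, [])}

    def tc_linear(edges):
        """Linear rule path(x,z) <- path(x,y), edge(y,z); up to n-1 rounds."""
        paths, frontier = set(edges), set(edges)
        while frontier:
            frontier = join(frontier, edges) - paths  # semi-naive: only new facts
            paths |= frontier
        return paths

    def tc_doubling(edges):
        """Nonlinear rule path(x,z) <- path(x,y), path(y,z); O(log n) rounds,
        but generally more derivations."""
        paths = set(edges)
        while True:
            new = join(paths, paths) - paths
            if not new:
                return paths
            paths |= new

    edges = {(1, 2), (2, 3), (3, 4), (4, 5)}
    assert tc_linear(edges) == tc_doubling(edges)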


Werner Backes, Stevens Institute

Title: Lattice Basis Reduction and Multi-Core

Lattice theory and in particular lattice basis reduction is of great importance in cryptography. Not only does it provide effective cryptanalysis tools but it is also believed to bring about new cryptographic primitives that exhibit strong security even in the presence of quantum computers.

In this talk we will present our work on developing parallel algorithms for lattice basis reduction. Recently, we developed the first parallel LLL algorithm based on the Schnorr-Euchner algorithm that is also efficient in practice. Our experiments with sparse and dense lattice bases show that the algorithm works well with up to 8 threads (and more, depending on the speed and latency of the main memory). In this talk we will focus on the newest improvements to our algorithm, which have allowed us to significantly improve the speed-up factor in lower dimensions and for lattice bases that caused problems for our initial parallel algorithm. Our work concentrates not only on developing parallel implementations of known algorithms but also on finding alternative ways to solve the underlying problems of finding a 'good' basis or 'short' lattice vectors. In this context, we will present BoostReduce, a new generic framework for strong lattice basis reduction (with BoostBKZ as an instance). At the core of this framework is an iterative method that uses a newly developed algorithm for finding short lattice vectors and integrating them efficiently into an improved lattice basis. Using these alternative approaches, we demonstrate how we were able to develop a prototype lattice basis reduction algorithm that is capable of efficiently using 32 cores (and more) for specific classes of lattice bases.


Guy Blelloch, Carnegie-Mellon University

Title: Programming Parallel Algorithms

The first parallel algorithms were developed well over 30 years ago, yet there is still little convergence on a common way to express or even analyze them. Should one use transactions, futures, nested parallelism, map-reduce, GPU programming, data parallelism, the PRAM, bulk synchronization, threads, message passing, parallel I/O models, partitioned global address space, coordination languages, concurrent data structures, ... ? Although it is unlikely we will ever settle on a single model, as we move towards 2020 it seems time for some consolidation.

In the talk I will argue for some programming-model features I believe are important for wide acceptance of parallel algorithms and programming, including support for deterministic parallelism, a strong tie between the programming language and the cost model, support for dynamic and divide-and-conquer parallelism, and a simple way to account for locality. The talk is meant more to provoke discussion than to suggest any definitive solutions.


Graham Cormode, AT&T Labs--Research

Title: Distributed Summaries

In dealing with massive distributed data, exact computations may be possible but can still be very expensive to perform. Instead, it is often desirable to look to more lightweight computations that give (guaranteed) approximations. A central approach is the idea of the mergeable summary: a compact data structure that summarizes a data set, which can be merged with others to summarize the union of the data without significant growth in size. This framework enables parallel computation.

Samples and sketches are two examples of well-known, effective mergeable summaries. I'll present recent results on new, compact mergeable summaries for various tasks, such as approximating frequent items, order statistics, and range counts.
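
As one concrete (and hedged) illustration of mergeability, the following Python sketch merges two Misra-Gries frequent-items summaries of size k into another size-k summary of the union of the two data sets, so the summary does not grow no matter how many times it is merged; the function names are illustrative.

    # Sketch of a mergeable frequent-items (Misra-Gries) summary of size k.
    # Illustrative only; over N items the count error is at most N/(k+1).

    def mg_summarize(stream, k):
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k:
                counters[x] = 1
            else:
                for key in list(counters):      # decrement all counters,
                    counters[key] -= 1          # dropping those that reach zero
                    if counters[key] == 0:
                        del counters[key]
        return counters

    def mg_merge(a, b, k):
        """Merge two size-k summaries into one size-k summary of the union."""
        merged = dict(a)
        for key, c in b.items():
            merged[key] = merged.get(key, 0) + c
        if len(merged) > k:
            cut = sorted(merged.values(), reverse=True)[k]   # (k+1)-st largest
            merged = {key: c - cut for key, c in merged.items() if c > cut}
        return merged

    s1 = mg_summarize(["a"] * 50 + ["b"] * 30 + list("cdefg"), k=4)
    s2 = mg_summarize(["a"] * 40 + ["h"] * 25 + list("ijkl"), k=4)
    print(mg_merge(s1, s2, k=4))   # "a" dominates the merged summary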


Ran Dror, D. E. Shaw Research

Title: Overcoming Communication Latency Barriers in Massively Parallel Molecular Dynamics Simulations on Anton

Strong scaling of scientific applications on parallel architectures is increasingly limited by communication latency. This talk will describe the techniques used to reduce latency and mitigate its effects on performance in Anton, a massively parallel special-purpose machine that accelerates molecular dynamics (MD) simulations by orders of magnitude compared with the previous state of the art. Achieving this speedup required both specialized hardware mechanisms and a restructuring of the application software to reduce network latency, sender and receiver overhead, and synchronization costs. Key elements of Anton's approach, in addition to tightly integrated communication hardware, include formulating data transfer in terms of counted remote writes and leveraging fine-grained communication. Anton delivers end-to-end inter-node latency significantly lower than any other large-scale parallel machine, and the total critical-path communication time for an Anton MD simulation is less than 3% that of the next-fastest MD platform.


Jonathan Eckstein, MSIS Department and RUTCOR, Rutgers University (Previously of Thinking Machines Corporation)

Title: A Survey of Parallelism in Solving Numerical Optimization and Operations Research Problems

This talk gives a broad-brush description of the current state of the art and the remaining challenges in performing numerical optimization on parallel architectures. We consider two broad classes of optimization algorithms. The first consists of continuous methods for convex problems or for finding local optima. Here there are two subclasses: decomposition algorithms, which tend to be relatively well adapted to parallelism but require special problem structure and tend to have slow "tail" convergence, and mainstream algorithms, which present parallel implementation difficulties for linear algebra operations on matrices whose sparsity patterns may differ significantly from those of classical PDE problems.

The second class involves branch-and-bound methods for global and integer optimization. Here, the problems are less daunting and there have been numerous successful test implementations using a variety of implementation philosophies. Still, there remain some issues of scalability, nondeterminism, and global information sharing between distant tree nodes.

The talk closes with some speculations about the "right" way to implement numerical optimization algorithms in parallel, both in terms of "turnkey" solvers versus toolkits and in terms of the most appropriate languages and programming environments.


James Edwards, University of Maryland

Title: Can PRAM Graph Algorithms Provide Practical Speedups on Many-Core Machines?

It has proven quite difficult for programmers to obtain significant performance improvements on current parallel computing platforms. Graph algorithms in particular tend to be difficult to implement efficiently. Many doubt the practical relevance of PRAM algorithms, and past work provided very limited evidence to alleviate these doubts: previous studies reported speedups of up to 4x on biconnectivity using a 12-processor Sun machine and up to 2.5x on maximum flow using a hybrid CPU-GPU implementation, in both cases relative to the best serial implementations.

New results show that parallel graph algorithms derived from PRAM theory can provide significantly stronger speedups than alternative algorithms. These results include potential speedups of 5.4x and 73x on breadth-first search (BFS) and 2.2x to 4x on connectivity when compared to optimized GPU implementations, and potential speedups of 9x to 33x on biconnectivity and up to 108x on maximum flow when compared to the best serial implementations on modern CPU architectures. These results were made possible by the Explicit Multi-Threading (XMT) architecture, which was designed to support the efficient execution of code derived from PRAM algorithms and for which a 64-processor FPGA hardware prototype and a software toolchain (compiler and simulator) are available. These experimental results show not only that theory-based algorithms can provide good speedups in practice, but also that they are sometimes the only ones that can do so. Perhaps most surprising to theorists is that the nominal number of processors is not as important for a fair comparison among same-generation many-core platforms as silicon area.


Ashish Goel, Stanford

Title: Fast Incremental PageRank and Collaborative Filtering

We will analyze the efficiency of Monte Carlo methods for incremental computation of PageRank, personalized PageRank, and similar random walk-based methods (with focus on SALSA), on large-scale dynamically evolving social networks. We assume that the graph of friendships is stored in distributed shared memory, as is the case for large social networks such as Twitter. For global PageRank, we assume that the social network has n nodes, and m adversarially chosen edges arrive in a random order. We show that with a reset probability of epsilon, the total work needed to maintain an accurate estimate (using the Monte Carlo method) of the PageRank of every node at all times is

O((n log m)/epsilon^2).

This is significantly better than all other known bounds for incremental PageRank.

We will then outline how this approach can be extended to compute personalized PageRank efficiently and to perform random walk-based collaborative filtering. We also present experimental results from the social networking site Twitter, verifying our assumptions and analyses. The overall result is that this algorithm is fast enough for real-time queries over a dynamic social network.
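
For intuition, here is a hedged sketch of the basic Monte Carlo estimator such an approach builds on: run a modest number R of random walks from every node, terminating each step with the reset probability epsilon, and estimate a node's PageRank by the fraction of all walk visits it receives. (The incremental aspect, which the bound above concerns, comes from storing the walk segments and re-routing only the affected ones when an edge arrives; the parameters below are illustrative.)

    # Sketch: Monte Carlo PageRank estimation via short random walks.
    # Illustrative only; not the speaker's incremental implementation.
    import random
    from collections import Counter

    def monte_carlo_pagerank(out_edges, eps=0.15, R=100):
        """out_edges: dict node -> list of out-neighbours.
        Visit frequencies of eps-terminated walks approximate PageRank."""
        visits = Counter()
        for u in out_edges:
            for _ in range(R):
                v = u
                while True:
                    visits[v] += 1
                    if random.random() < eps or not out_edges[v]:
                        break                     # reset: this walk ends here
                    v = random.choice(out_edges[v])
        total = sum(visits.values())
        return {v: c / total for v, c in visits.items()}

    graph = {1: [2], 2: [3], 3: [1], 4: [1]}
    print(monte_carlo_pagerank(graph))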


Michael T. Goodrich, UC Irvine

Title: Sorting, Searching, and Simulation in the MapReduce Framework

We study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately putting the MapReduce framework on an equal theoretical footing with the well-known PRAM and BSP parallel models, which would benefit both the theory and practice of MapReduce algorithms. We describe efficient MapReduce algorithms for sorting, multi-searching, and simulations of parallel algorithms specified in the BSP and CRCW PRAM models. We also provide some applications of these results to problems in parallel computational geometry for the MapReduce framework, which result in efficient MapReduce algorithms for sorting, 2- and 3-dimensional convex hulls, and fixed-dimensional linear programming. For the case when mappers and reducers have a memory/message-I/O size of M=Theta(N^epsilon), for a small constant epsilon>0, all of our MapReduce algorithms for these applications run in a constant number of rounds.
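
As a rough illustration of why a per-machine memory of M = Theta(N^epsilon) permits constant-round algorithms, the Python sketch below simulates a generic sample-sort organized as MapReduce rounds: one round picks splitters from a sample, a single shuffle range-partitions the input so that each reducer holds roughly M items, and a final round sorts each bucket locally. This is a generic sketch under those assumptions, not the paper's exact algorithm.

    # Sketch of a sample-sort organized as MapReduce rounds (illustrative).
    import random

    def mapreduce_sample_sort(data, M):
        # Round 1 (map): sample the input to choose splitters that define
        # roughly len(data)/M key ranges, one per reducer.
        num_buckets = max(1, len(data) // M)
        sample = sorted(random.sample(data, min(len(data), 10 * num_buckets)))
        step = max(1, len(sample) // num_buckets)
        splitters = sample[step::step][:num_buckets - 1]

        # Shuffle: route each item to the reducer owning its key range.
        buckets = [[] for _ in range(num_buckets)]
        for x in data:
            i = sum(1 for s in splitters if x >= s)
            buckets[i].append(x)

        # Round 2 (reduce): each reducer sorts its bucket locally; the
        # concatenation of buckets in order is globally sorted.
        return [x for b in buckets for x in sorted(b)]

    data = [random.randint(0, 10**6) for _ in range(10_000)]
    assert mapreduce_sample_sort(data, M=1_000) == sorted(data)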


John Gustafson, Intel

Title: Billion-Core Computing

By 2020, some computers will have over one billion processor cores. Simple extrapolation of current trends is not likely to get us there, because of both the laws of physics and the limits of human economics. We will get there nonetheless, by changing both the way we design and the way we use extreme-scale computers. This talk will present some of the challenges to billion-core computer design and the architectural choices those challenges mandate, a simple approach to predicting the parallel performance of such machines, potential "sea changes" in the programming environment, and finally, some ways that applications might make effective use of such a large number of processors.


Robert Jacob, Argonne National Lab

Title: Parallel Coupling in Climate Models

Climate models are made of components, such as atmosphere and ocean models, that are traditionally developed by separate groups. This creates special software challenges when coupling the components together to make a climate model. When the components are each parallelized with their own methods and data types, there are new challenges in achieving parallel coupling. The Model Coupling Toolkit (MCT) was created to address many of the problems encountered when coupling numerical models of two physical sub-systems into a model of a larger system. MCT creates standard data types to which each model can adapt, to facilitate efficient parallel coupling. MCT has been used successfully in one of the main U.S. climate models.


John Kubiatowicz, UC Berkeley

Title: Locally Limited, but Globally Unbounded: Dealing with Resources in an Explicitly Parallel World

The year 2002 will be remembered as the beginning of a fundamental phase change in the computing world, one in which computing systems became explicitly parallel at all levels of the "world computing graph". Despite this apparent abundance of resources, resource imbalances still exist in this graph and will continue to exist for the foreseeable future. Such imbalances impact the ability of systems to guarantee responsiveness, realtime behavior, power efficiency, security, and correctness. I will argue that a new OS primitive, called a "Cell", should replace processes as the basic unit of isolation, protection, and scheduling. Cells contain guaranteed fractions of system resources, including gang-scheduled groups of processors, caches, network and memory bandwidth, and system services. I will describe "two-level scheduling", which separates the assignment of resources to Cells from the user-level scheduling of those resources. I will give some specific examples to highlight the adaptive resource assignment problem as it applies to local parallel computations and to cloud-backed mobile devices. Cells and two-level scheduling are a fundamental component of Tessellation, a research operating system being designed at UC Berkeley.


John Langford, Yahoo! Research

Title: A Summary of Parallel Learning Efforts

Ron Brachman, Misha Bilenko, and I have been editing a book (due out in a few months) on parallel learning efforts by a large and diverse set of groups covering a diverse set of learning problems. I plan to summarize these efforts, and discuss where the challenges are in parallel learning.


Dinesh Manocha, University of North Carolina at Chapel Hill

Title: 13 Years of GPGPU and Many-Core Computing: What have we learned?

For years the performance and functionality of graphics processors (GPUs) have been increasing at a faster pace than Moore's Law. The latest GPUs consist of more than 3 billion transistors and can offer a few TFlops of peak performance. They consist of tens to hundreds of cores, offer high memory bandwidth, and have a different programming model and underlying architecture. In this talk, we will give an overview of our work on GPUs as many-core accelerators for different applications, including scientific computing, database computations, sorting, and geometric algorithms. These include:

1. Scientific computation, including linear algebra solvers, FFT, differential equation solvers, and applications to fluid dynamics, physical simulation, and acoustic simulation.

2. Geometric computations including Voronoi diagrams, motion planning, collision detection, ray tracing, hierarchical computations, multi-agent simulation, etc.

3. Database operations and stream data mining on complex datasets. In 2006, we beat the 20+ year old Sorting benchmark using GPUs.

We will highlight our results on these applications and describe some of the hurdles we faced in using GPUs for a variety of applications. We also make the case that GPU-like architectures can be an integral part of future heterogeneous processors. We will highlight results from our recent work on work distribution methods, FFT, numeric sound simulation, and interactive crowd and traffic simulation.

http://gamma.cs.unc.edu


S. Muthukrishnan, Rutgers University

Title: MapReduce with Parallelizable Reduce

We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by map-reduce-style systems. We show that in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm, with comparable space and communication complexity. We also introduce an extension of the mud model to multiple keys and multiple rounds.
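
A hedged illustration of the model: each record is mapped to a small partial state, partial states are combined by an order-insensitive merge operator applied in an arbitrary tree, and a final post-processing step produces the answer. The sketch below computes the mean this way; the function names are chosen for illustration only.

    # Sketch of a "mud" (massive, unordered, distributed) computation.
    # Any symmetric function with a small mergeable state fits this shape.
    from functools import reduce

    def local(x):
        """Map one record to a partial state: (sum, count)."""
        return (x, 1)

    def merge(a, b):
        """Order-insensitive combiner; may be applied in any tree shape."""
        return (a[0] + b[0], a[1] + b[1])

    def post(state):
        """Final step: turn the merged state into the answer (here, the mean)."""
        s, n = state
        return s / n

    records = [3, 1, 4, 1, 5, 9, 2, 6]
    print(post(reduce(merge, map(local, records))))   # 3.875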


Jeffrey D. Oldham, Google

Title: Experiences Scaling Use of Google's Sawzall

Sawzall is a procedural language developed at Google for parallel analysis of very large data sets. Given a log sharded into many separate files, its companion tool named saw runs Sawzall interpreters to perform an analysis.

Hundreds of Googlers have written thousands of saw+Sawzall programs, which form a significant minority of Google's daily data processing. Short programs grew to become longer programs, which were not easily shared or tested. In other words, scaling naively written Sawzall led to unmaintainable programs.

The simple idea of writing programs functionally, not iteratively, yielded shareable, testable programs. The functions reflect fundamental map reduction concepts: mapping, reducing, and iterating. Each can be easily tested.

This case study demonstrates that developers of parallel processing systems should also simultaneously develop ways for users to decompose code into sharable pieces that reflect fundamental underlying concepts. As importantly, they must develop ways for users to easily write tests of their code.
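
A toy sketch (in Python rather than Sawzall) of the decomposition being advocated: the extraction, filtering, and aggregation steps are separate named functions that can be unit-tested in isolation and then composed into the full analysis. The log format and field names are made up for illustration.

    # Toy sketch: decompose a log analysis into small, testable functions.

    def parse_record(line):
        """Extract (user, latency_ms) from one log line."""
        user, latency = line.split(",")
        return user, int(latency)

    def is_slow(record, threshold_ms=500):
        """Filter: keep only slow requests."""
        return record[1] >= threshold_ms

    def count_by_user(records):
        """Aggregate: number of slow requests per user."""
        counts = {}
        for user, _ in records:
            counts[user] = counts.get(user, 0) + 1
        return counts

    def analyze(lines):
        return count_by_user(r for r in map(parse_record, lines) if is_slow(r))

    # Each piece can be tested on its own:
    assert parse_record("alice,750") == ("alice", 750)
    assert is_slow(("alice", 750)) and not is_slow(("bob", 10))
    assert analyze(["alice,750", "bob,10", "alice,600"]) == {"alice": 2}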


Lenny Oliker, Lawrence Berkeley Laboratory

Title: Green Flash: Designing An Energy Efficient Climate Supercomputer

Traditional approaches to scaling commodity desktop processor technology that focus on raw per-core sequential performance are running into an insurmountable power wall. It is clear from both the cooling demands and the electricity costs that the growth in scientific computing capabilities of the last few decades is not sustainable unless fundamentally new ideas are brought to bear. In this talk we propose a novel approach to supercomputer design that leverages the sophisticated tool chains of the consumer electronics marketplace. We analyze our framework in the context of high-resolution global climate change simulations -- an application with multi-trillion-dollar ramifications for the world's economies. A key aspect of our methodology is hardware-software co-tuning, which utilizes fast and accurate FPGA-based architectural emulation. This enables the design of future exaflop-class supercomputing systems to be driven by scientific requirements instead of constraining the science to fit the machine configurations. Our talk will provide detailed design requirements for kilometer-scale, global cloud-system-resolving climate models and point the way toward Green Flash: an application-targeted exascale machine that could be efficiently implemented using mainstream embedded design processes.


Kunle Olukotun, Stanford University

Title: Taming Heterogeneous Parallelism with Domain Specific Languages

Exploiting the power of heterogeneous parallel hardware is complicated because it requires application-code development with multiple programming models. In this talk, I will present a much simpler single-programming-model approach that uses domain specific languages (DSLs). I will describe OptiML, a DSL for machine learning, and Liszt, a DSL for mesh-based PDEs. Both of these DSLs are embedded in the Scala programming language using a technique called modular staging, which generates efficient parallel code for heterogeneous hardware.


Alejandro Salinger, University of Waterloo

Title: Theoretical Modeling of Multicore Computation

The paradigm shift to multicore computing has led to a revival of research activity on theoretical models of parallel computation. In this talk we review some of the various attempts to define models for computation on multicore architectures, from the perspective of the natural trade-off that any model makes between accuracy and simplicity. We describe the key characteristics of each model and show examples of its applicability, ranging from a simple model that focuses on computation only to models that consider multiple levels of memory hierarchy and multiple parameters. We then discuss the problem of deciding which model or models should be adopted for multicore computation, from the point of view of both teaching and usability. Can these models be combined, or even coexist, to achieve a framework that is manageable and realistic at the same time?


Tamas Sarlos, Yahoo! Research

Title: On Scheduling in Map-Reduce and Flow-Shops

The map-reduce paradigm is now standard in industry and academia for processing large-scale data. In this work, we formalize job scheduling in map-reduce as a novel generalization of the two-stage classical flexible flow shop problem: instead of a single task at each stage, a job now consists of a set of tasks per stage. For this generalization, we consider the problem of minimizing the total flowtime and give an efficient 12-approximation in the offline setting and an online (1+epsilon)-speed O(1/epsilon^2)-competitive algorithm.

Motivated by map-reduce, we revisit the two-stage flow shop problem, where we give a (Q)PTAS based on dynamic programming for minimizing the total flowtime when all jobs arrive at the same time. This gives the first improvement in approximation of flowtime for the two-stage flow shop problem since the trivial 2-approximation algorithm of Gonzalez and Sahni in 1978, and the first known approximation for the flexible flow problem. We then consider the generalization of the two-stage flexible flow problem to the unrelated machines case, where we give an offline 6-approximation and an online (1+epsilon)-speed O(1/epsilon^4)-competitive algorithm.


Gokarna Sharma, Louisiana State University

Title: Scalable Transactional Memory Scheduling

Transactional memory (TM) is a prominent optimistic concurrent programming API suitable for distributed environments in multicore/multiprocessor architectures. The advantage of the TM API is the convenience it offers to the programmer in accessing shared memory resources lock-free and wait-free through transactions consisting of shared memory accesses. At the heart of any software TM (STM) API is the contention manager, which handles transaction conflicts (memory access collisions) and schedules transactions appropriately. The efficiency of STM systems relies on the good performance of contention managers, and it is of great importance to design contention managers that scale gracefully with the size and complexity of the distributed system. We discuss our work on the design, formal analysis, and experimental evaluation of contention managers that exhibit both non-trivial formal performance guarantees and promising empirical performance. We first present formal metrics and a performance analysis for the problem of implementing the STM API in tightly-coupled shared memory systems, along with experimental evaluations. Second, we give performance analysis results for the problem of supporting the STM API in large-scale distributed networked systems.


Siddharth Suri, Yahoo! Research

Title: Counting Triangles and the Curse of the Last Reducer

The clustering coefficient of a node in a social network is a fundamental measure that quantifies how tightly-knit the community around the node is. Its computation can be reduced to counting the number of triangles incident on the particular node in the network. When the graph is too big to fit into memory, this is a non-trivial task, and previous researchers showed how to estimate the clustering coefficient in this scenario. A different avenue of research is to perform the computation in parallel, spreading it across many machines. In recent years MapReduce has emerged as a de facto programming paradigm for parallel computation on massive data sets. The main focus of this work is to give MapReduce algorithms for counting triangles, which we use to compute clustering coefficients.

Our contributions are twofold. First, we describe a sequential triangle counting algorithm and show how to adapt it to the MapReduce setting. This algorithm achieves a factor of 10-100 speedup over the naive approach. Second, we present a new algorithm designed specifically for the MapReduce framework. A key feature of this approach is that it allows for a smooth tradeoff between the memory available on each individual machine and the total memory available to the algorithm, while keeping the total work done constant. Moreover, this algorithm can use any triangle counting algorithm as a black box and distribute the computation across many machines. We validate our algorithms on real-world datasets comprising millions of nodes and over a billion edges. Our results show that both algorithms effectively deal with skew in the degree distribution and lead to dramatic speedups over the naive implementation.
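
For concreteness, here is a hedged two-round sketch of the wedge-based flavor of MapReduce triangle counting (illustrative only; the algorithms in the talk handle the "last reducer" skew problem far more carefully): round one emits, for every node, the pairs of its neighbours (wedges), and round two checks which wedges are closed by an edge.

    # Toy two-round sketch of triangle counting in a MapReduce style.
    from itertools import combinations
    from collections import defaultdict

    def count_triangles(edges):
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)

        # Round 1 (per node): emit every wedge, i.e. pair of neighbours,
        # keyed by its two endpoints.
        wedges = defaultdict(int)
        for u, nbrs in adj.items():
            for a, b in combinations(sorted(nbrs), 2):
                wedges[(a, b)] += 1

        # Round 2 (per node pair): a wedge (a, b) closes into a triangle iff
        # the edge (a, b) exists; each triangle is counted once per apex.
        closed = sum(c for (a, b), c in wedges.items() if b in adj[a])
        return closed // 3

    edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
    assert count_triangles(edges) == 1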


Marc Snir, University of Illinois

Title: The Post-Moore Era and Exascale Computing: On the Need For New Foundations

The pace of continued decrease in feature size, described by Moore's Law, is expected to slow down in the coming decade, while the power requirements of top supercomputers hit the limits of current infrastructure. One will need an increased emphasis on compute efficiency: performing more computations with a given budget of silicon and energy. Traditional complexity theory, with its emphasis on operation counts, does not reflect the real performance bottlenecks of systems and is a poor guide to compute efficiency; new foundations are needed.

The talk will provide some evidence for these views and will propose several research problems that may lead toward such foundations.


Kanat Tangwongsan, Carnegie-Mellon University

Title: Efficient Parallel Approximation Algorithms: What We Learn From Facility Location

How can one design a parallel approximation algorithm that obtains non-trivial speedups over its sequential counterpart, even on a modest number of processors?

In this talk, I will discuss our lessons from deriving an RNC O(m log_(1+eps) m)-work and factor-(1.861+eps) algorithm for metric facility location---matching essentially the work and approximation bounds of its sequential counterpart. The core technique here can be viewed as a natural generalization of maximal independent set, which is applicable more broadly to other problems, including (weighted) set-cover, max k-cover, and min-sum set-cover. Finally, I will mention other work in progress along this line and describe a few open problems.

(This presentation will touch on our recent works in the area of efficient parallel approximation algorithms. This collection of results is joint with Guy Blelloch, Anupam Gupta, Ioannis Koutis, Gary Miller, and Richard Peng.)
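
Since the core technique is described as a generalization of maximal independent set, the following minimal round-based sketch of Luby-style parallel MIS may help fix intuition; it is the classical MIS routine, shown only to illustrate the round structure, not the facility location algorithm itself.

    # Sketch of Luby-style parallel maximal independent set (classical routine).
    import random

    def luby_mis(adj):
        """adj: dict vertex -> set of neighbours.
        Each while-iteration corresponds to one parallel round."""
        alive = set(adj)
        mis = set()
        while alive:
            prio = {v: random.random() for v in alive}
            # A vertex joins the MIS if its priority beats all live neighbours.
            winners = {v for v in alive
                       if all(prio[v] < prio[u] for u in adj[v] if u in alive)}
            mis |= winners
            removed = set(winners)
            for v in winners:                 # winners and their neighbours
                removed |= adj[v]             # drop out of the computation
            alive -= removed
        return mis

    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    s = luby_mis(graph)
    assert all(u not in graph[v] for u in s for v in s)   # independence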


Philippas Tsigas, Chalmers University, Sweden

Title: Design Challenges for Scalable Concurrent Data Structures

Concurrent data structure designers are striving to maintain consistency of data structures while keeping the use of mutual exclusion and expensive synchronization to a minimum, in order to prevent the data structure from becoming a sequential bottleneck.

First we discuss the challenges that come from eliminating locks and mutual exclusion from the design space of concurrent data structures, together with part of our effort to design efficient lock-free data structures. After that, we discuss the opportunities of using new memory access capabilities for synchronization and the challenge of modeling these capabilities. Hopefully some of the expensive synchronization needed in data structure design can be replaced by less expensive memory access primitives. For the strong synchronization that cannot be replaced by the new advanced memory access primitives, new synchronization hardware is needed that scales better than the hardware widely deployed in conventional architectures.
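
To ground the lock-free discussion, here is a toy sketch of the classic Treiber stack built around a compare-and-swap retry loop. The _cas helper is only a stand-in for a hardware CAS instruction (Python itself does not expose one), so the sketch illustrates the retry structure rather than a production design.

    # Toy sketch of a lock-free (Treiber) stack built on compare-and-swap.

    class Node:
        def __init__(self, value, nxt):
            self.value, self.next = value, nxt

    class TreiberStack:
        def __init__(self):
            self.head = None

        def _cas(self, expected, new):
            """Stand-in for an atomic compare-and-swap on self.head."""
            if self.head is expected:
                self.head = new
                return True
            return False

        def push(self, value):
            while True:                      # retry loop instead of a lock
                old = self.head
                if self._cas(old, Node(value, old)):
                    return

        def pop(self):
            while True:
                old = self.head
                if old is None:
                    return None
                if self._cas(old, old.next):
                    return old.value

    s = TreiberStack()
    s.push(1); s.push(2)
    assert s.pop() == 2 and s.pop() == 1 and s.pop() is None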


Leslie G. Valiant, Harvard University

Title: Do Parallel Algorithms and Programs Need to be Parameter Aware?

We suggest that efficient exploitation of parallel computers will be severely limited unless the algorithms that run on them are designed to be aware of the resource parameters of the particular machines on which they are running. Parameter-free parallel languages or models, and algorithms that are embarrassingly parallel or oblivious to some parameters for other reasons, may continue to be appealing in restricted domains. We suggest, however, that the parameter-free approaches will always remain inadequate for the full range of tasks that need to be computed.

With parameter-awareness the main promise is that of portable algorithms, those that contain efficient designs for all reasonable ranges of the basic resource parameters and input sizes. Such portable algorithms need to be designed just once, but, once designed, they can be compiled to run efficiently on any machine. In this way the considerable intellectual effort that goes into parallel program design becomes reusable.

To permit such portable algorithms some standard bridging model is needed - a common understanding between hardware and algorithm designers of what the costs of a computation are in terms of some agreed parameters, such as communication bandwidth, communication latency, numbers of processors, and memory sizes. We shall describe the Multi-BSP model as a candidate for this bridging role when multicore architectures are involved. We show that for several fundamental problems, namely matrix multiplication, fast Fourier Transform, and sorting, portable algorithms do exist that are optimal in a defined sense, for all combinations of input size and parameter values for this model.
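
As a rough illustration of what parameter awareness means here, the sketch below represents one level of a Multi-BSP machine by the kind of per-level parameters the model exposes (number of subcomponents, communication cost, synchronization cost, memory size). The field names are illustrative and the nesting mirrors the model's level structure rather than reproducing its exact cost formulas; see the reference below for the precise definition.

    # Illustrative representation of Multi-BSP machine parameters.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Level:
        p: int                      # number of sub-components at this level
        g: float                    # data rate (cost per word) to the level above
        L: float                    # synchronization cost of a superstep here
        m: int                      # memory/cache size at this level, in words
        child: Optional["Level"] = None   # next level down, if any

    # A made-up two-level machine: 4 chips, each with 8 cores.
    machine = Level(p=4, g=16.0, L=10_000, m=2**33,
                    child=Level(p=8, g=2.0, L=100, m=2**21))

    def describe(level, depth=0):
        while level:
            print("  " * depth +
                  f"level: p={level.p}, g={level.g}, L={level.L}, m={level.m}")
            level, depth = level.child, depth + 1

    describe(machine)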

Reference: L. G. Valiant, "A bridging model for multi-core computing," Journal of Computer and System Sciences, 77:1 (2011) 154-166.


Vijaya Ramachandran, University of Texas at Austin

Title: Resource Oblivious Parallel Computing

We introduce the notion of a resource oblivious parallel computation, which contains no machine parameters in the algorithm, and yet runs efficiently in a parallel setting. In this talk, we consider a multicore computing environment with a global shared memory and multiple cores, each having a private cache, and with data organized in blocks of words. A resource oblivious computation would have the desirable feature of being portable across multicores with a wide range of machine parameters.

Resource oblivious computing has two components: a resource oblivious algorithm for a computational problem, and a scheduler that maps tasks to cores. On the algorithms side, we characterize the class of `Hierarchical Balanced Parallel (HBP)' multithreaded computations, and on the scheduling side, we present the `Priority Work Stealing (PWS)' scheduler. We establish that several natural classes of HBP computations are scheduled optimally by PWS, when accounting for both parallelism and cache misses, including cache misses incurred due to false sharing of blocks. HBP algorithms also have good performance under the familiar randomized work stealing scheduler.

We illustrate our results with a new HBP sorting algorithm, SPMS, which has O(log n log log n) critical path length. We also have efficient HBP algorithms for matrix computations, FFT, lists and graphs. We have applied constructs similar to HBP algorithms to other parallel environments including network-oblivious algorithms for BSP-style network computations, and multicore-oblivious algorithms for multicores with a hierarchy of shared caches.

(Joint work with Richard Cole. The work on network-oblivious and multicore-oblivious algorithms is joint with Rezaul Chowdhury, Francesco Silvestri and Brandon Blakeley.)


Uzi Vishkin, University of Maryland Institute for Advanced Computer Studies

Title: From Asymptotic PRAM Speedups To Easy-To-Obtain Concrete XMT Ones

Considering a horizon where the transistor count of a chip would be in the billions, our SPAA 1998 explicit multi-threading (XMT) platform paper envisioned a framework extending from PRAM-style algorithms to architecture. Through steady and methodical progress, 2011 marks a point where prototyping meets many more of the 1998 objectives: we can now show strong speedups on some of the most advanced PRAM algorithms, often far exceeding the speedups obtained on commercial platforms. XMT speedups are also much easier to derive, and the thermal envelope of XMT appears not to fall behind that of commercial processors. The main industry players have designated multi-core machines coupled with parallel programming as the only remaining avenue for keeping the celebrated Moore's Law on track for general-purpose computing. Time permitting, I will review a variety of recent data points suggesting significant gaps between stated industry objectives and what has been delivered to date.

Homepage of the XMT project: www.umiacs.umd.edu/users/vishkin/XMT/


Document last modified on March 11, 2011.