Scaling the Performance for Multicore Processors
Pedro Trancoso, University of Cyprus
8:40 a.m.
Demands for higher performance and lower power consumption, along with technology advances, have led to the release of multicore microprocessors. Such processors are used in general-purpose systems, such as laptops and servers, and in special-purpose systems, such as graphics cards and gaming machines. These trends will lead to future processors with a larger number of cores as well as cores of different characteristics. While there are no major technological obstacles to producing large-scale heterogeneous multicore microprocessors, many challenging issues still need to be solved. Trancoso will present some of the work his team has been pursuing to exploit the performance of future multicores. He addresses this issue in three ways. First, he proposes using certain cores of the multicore processor as HelperCores to perform tasks that indirectly improve the performance of the main application. Second, he exploits the parallelism offered by specialized multicore processors, such as GPUs, for general-purpose applications. As future multicore processors will include GPU cores, the results achieved will be applicable to a wide domain of systems. Third, he proposes a new programming model, which is based on the dataflow model of execution but applied at the granularity of a thread of instructions. He proposes a portable platform that virtualizes the underlying hardware and executes on commodity systems. The benefits observed for these techniques will be used to exploit the potential of future large-scale heterogeneous multicore microprocessors.
Autotuning Memory-Intensive Kernels for Multicore
Kaushik Datta, Berkeley
8:40 a.m.
Datta will present an autotuning approach to optimizing application performance on emerging multicore architectures. This work applies autotuning to sparse matrix-vector multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). Datta explores one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, Datta's team develops a code generator for each kernel that allows them to identify a highly optimized version for each platform while amortizing the human programming effort. Results show that the autotuned kernels often achieve better than a 4x improvement over the original code.
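To make the idea of autotuning concrete, here is a minimal sketch, not Datta's code generator. It assumes a toy 1D explicit heat-equation step whose only tunable parameter is a hypothetical loop block size, times each candidate on the current machine, and keeps the fastest. The names `heat_step` and `autotune` are illustrative assumptions. In pure Python the block size has little real effect; the point is only the search-and-measure structure.

```python
import time

def heat_step(u, alpha, block):
    """One explicit heat-equation step on a 1D grid, sweeping interior points in blocks."""
    n = len(u)
    out = list(u)  # boundary values are carried over unchanged
    for start in range(1, n - 1, block):
        stop = min(start + block, n - 1)
        for i in range(start, stop):
            out[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return out

def autotune(kernel, args, candidates, repeats=3):
    """Time kernel(*args, c) for each candidate c; return (fastest candidate, timings)."""
    timings = {}
    for c in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(*args, c)
            best = min(best, time.perf_counter() - t0)
        timings[c] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings

# Hypothetical search space: four block sizes for a 2000-point grid.
grid = [0.0] * 2000
grid[1000] = 1.0
best_block, timings = autotune(heat_step, (grid, 0.1), [16, 64, 256, 1024])
print("fastest block size on this machine:", best_block)
```

Every variant computes the same result, so the search only changes how the work is organized, which is the property that lets the tuning be automated per platform.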
Data-Parallel Algorithms: Design and Implementation
John Owens, University of California, Davis
10:30 a.m.
Crucial to the adoption of GPU computing is the development of high-performance algorithmic building blocks that are useful across a wide range of GPU applications. Owens will outline recent advances in data-parallel algorithms and their implementation on GPUs.
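The abstract does not name specific building blocks. As an illustration only, the sketch below uses prefix sum (scan), a commonly cited data-parallel primitive, in the Hillis-Steele formulation. It is written in sequential Python, so each pass of the loop merely models one step that a GPU could execute across all elements at once. The function name `inclusive_scan` is an assumption made for this example.

```python
def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum; each pass models one data-parallel step."""
    data = list(values)
    offset = 1
    while offset < len(data):
        # Every element of `nxt` depends only on `data`, so a GPU could compute
        # all of them concurrently; here the pass runs sequentially.
        nxt = [data[i] + data[i - offset] if i >= offset else data[i]
               for i in range(len(data))]
        data = nxt
        offset *= 2
    return data

print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # prints [3, 4, 11, 11, 15, 16, 22, 25]
```

This formulation finishes in about log2(n) passes at the cost of more total additions than a sequential loop, a tradeoff that such algorithm designs typically have to weigh.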
Performance, Productivity, and Accuracy of Graphics Processing Units for a Quantum Monte Carlo Application
Jeremy Meredith, Oak Ridge National Laboratory
10:30 a.m.
The increased programmability and performance of graphics processing units (GPUs) can have a profound positive impact on developer productivity. Meredith will discuss the acceleration of a Quantum Monte Carlo application using GPUs. Topics include the impact of GPU features on performance and accuracy, the tradeoffs between a library approach and a hand-optimized acceleration path, and the implications of combining these approaches.
Accelerating Scientific Applications with GPUs
John Stone, University of Illinois at Urbana-Champaign
10:30 a.m.
For many years graphics processing units (GPUs) have been an untapped computational resource for scientific computations due to limitations in the hardware and programming interfaces they provided. State-of-the-art GPUs and software development tools have begun to address these problems, expanding their applicability to scientific computation and easing integration with existing applications. Stone will present an overview of performance results for several GPU-accelerated computational biology applications based on CUDA, and the key algorithm design and performance optimization techniques used in each case. The talk will also include a brief review of recent advancements in GPU hardware and software, and early experiences and results running on GPU-accelerated clusters.
Parallel GPU Computing with CUDA
Michael Garland, NVIDIA
1:30 p.m.
Modern GPUs provide a level of massively parallel computation that was once the preserve of supercomputers like the MasPar and Connection Machine. NVIDIA's Tesla architecture for GPU Computing provides a fully programmable, massively multithreaded chip with up to 128 scalar processor cores, capable of delivering hundreds of billions of operations per second. Applications across many scientific and engineering disciplines are being accelerated by up to two orders of magnitude on this platform. Garland will provide an overview of the Tesla architecture and explore the transition it represents in massively parallel computing: from the domain of supercomputers to that of commodity "manycore" hardware available to all. He will also introduce CUDA, a scalable parallel programming model and software environment. By providing a small set of readily understood extensions to the C/C++ languages, CUDA allows programmers to focus on writing efficient parallel algorithms without the burden of learning a multitude of new programming constructs. Finally, Garland will sketch some techniques for implementing common data-parallel algorithms in CUDA.
AMD Stream Computing
Michael Houston, AMD
1:30 p.m.
Houston will introduce AMD GPU architectures and the company's Stream software stack, concentrating on the compute capabilities of the hardware and on how the different layers of the software stack work with the hardware so that programmers can write efficient programs and build compilers that use the hardware well.
The Cell Broadband Engine: Architecture and Roadmap
Hema Reddy, IBM Corporation
1:30 p.m.
The Cell Broadband Engine (Cell BE), developed by the STI (Sony, Toshiba, IBM) alliance, defines an architecture well suited to a variety of next-generation compute-intensive applications, such as gaming and signal processing, and to the needs of the most demanding graphics developers. The Cell BE achieves a significant performance-per-watt and performance-per-chip-area advantage over conventional high-performance processors. This presentation gives a brief architectural overview and covers programming models that leverage the Cell BE's computational power to outperform conventional processors by significantly more than an order of magnitude.
Accelerating Numerically Intensive Life Science Codes for Molecular and Quantum Mechanics
Simon McIntosh-Smith, ClearSpeed
1:30 p.m.
Until now, accelerators have typically been used to speed up integer-intensive life science codes, such as BLAST and Smith-Waterman for sequence comparison. With the advent of accelerators capable of significantly speeding up 64-bit floating-point operations, whole new classes of life science applications can now benefit from the greater performance and increased compute density offered by these solutions. ClearSpeed has been collaborating with Bristol University to accelerate two important classes of life science applications: molecular dynamics-based drug docking codes and quantum mechanics-based codes. McIntosh-Smith will describe the results achieved so far and discuss the end applications enabled by this radical change in the compute power delivered by accelerators.