High Performance Parallelism Pearls Volume One
Multicore and Many-core Programming Approaches


About this book

High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming approach, illustrating the most effective ways to tap the computational potential of systems with Intel Xeon Phi coprocessors and Intel Xeon or other multicore processors. The book includes examples of successful programming efforts drawn from industries and domains such as chemistry, engineering, and environmental science. Each chapter in this edited work includes detailed explanations of the programming techniques used, while showing high-performance results on both Intel Xeon Phi coprocessors and multicore processors. Learn from dozens of new examples and case studies illustrating "success stories" that demonstrate not just the features of these powerful systems, but also how to leverage parallelism across these heterogeneous systems.

- Promotes consistent standards-based programming, showing in detail how to code for high performance on multicore processors and Intel® Xeon Phi™
- Examples from multiple vertical domains illustrate parallel optimizations that modernize real-world codes
- Source code is available for download to facilitate further exploration


Chapter 1

Introduction

James Reinders Intel Corporation

Abstract

This chapter introduces a book that shares the experience of software developers who have written highly scalable code to take advantage of both multicore (Intel Xeon or other) and many-core (Intel Xeon Phi) machines. Such modernization of code can come from concurrent algorithms, vectorization and data locality, management of power usage, and other techniques. The advantages of neo-heterogeneous systems are apparent because the programming techniques used benefit both multicore and many-core devices. Sixty-nine experts contributed to this book so that we can all learn from their experiences.
Keywords
Heterogeneous
Many-core
Multicore
Neo-heterogeneous
Xeon Phi
AVX-512

New era in programming

“We should create a cookbook” was a frequent comment that Jim Jeffers and I heard after Intel® Xeon Phi™ Coprocessor High-Performance Programming was published. Guillaume Colin de Verdière was early in his encouragement to create such a book and was pleased when we moved forward with this project. Guillaume matched action with words by coauthoring the first contributed chapter with Jason Sewall, From “correct” to “correct & efficient”: a Hydro2D case study with Godunov’s scheme. Their chapter reflects a basic premise of this book: sharing experience and success can be highly educational to others. It also contains a theme familiar to those who program the massive parallelism of the Intel Xeon Phi family: getting code to run on Intel Xeon Phi coprocessors is easy. This lets you quickly focus on optimization and the achievement of high performance—but we do need to tune for parallelism in our applications! Notably, such optimization work improves performance on processors and coprocessors alike. As the authors note, “a rising tide lifts all boats.”

Learning from successful experiences

Learning from others is what this book is all about. It brings together the collective work of numerous experts in parallel programming to share their experience. The examples were selected for their educational content, applicability, and success—and you can download the codes and try them yourself! All the examples demonstrate successful approaches to parallel programming, but not all of them scale well enough to make an Intel Xeon Phi coprocessor run faster than a processor. This is what we face in the real world, and it reinforces something we are not bashful about pointing out: a common programming model matters a great deal. You will see that notion emerge over and over in real-life examples, including those in this book.
We are indebted to the many contributors to this book, in which you will find a rich set of examples and advice. Given that this is the introduction, we offer a little perspective to bind the collection together. Most of all, we encourage you to dive into the rich examples, found starting in Chapter 2.

Code modernization

It is popular to talk about “code modernization” these days. Having experienced the “inspired by 61 cores” phenomenon, we are excited to see it has gone viral and is now being discussed by more and more people. You will find lots of “modernization” shown in this book.
Code modernization is reorganizing the code, and perhaps changing algorithms, to increase the amount of thread parallelism, vector/SIMD operations, and compute intensity to optimize performance on modern architectures. Thread parallelism, vector/SIMD operations, and an emphasis on temporal data reuse are all critical for high-performance programming. Many existing applications were written before these elements were required for performance, and therefore, such codes are not yet optimized for modern computers.
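
As a minimal, illustrative sketch of what this kind of modernization often looks like in practice (this example is not drawn from any chapter of the book; the function name and the use of SAXPY are our own), a plain scalar loop can pick up both thread parallelism and SIMD vectorization with a single standards-based OpenMP pragma:

```c
#include <stddef.h>

/* A classic SAXPY loop, "modernized" with OpenMP: the pragma requests
 * both thread parallelism and SIMD vectorization from one directive.
 * Compiled without OpenMP support the pragma is simply ignored and the
 * loop still runs correctly in serial, which is one reason incremental,
 * standards-based modernization is attractive. The restrict qualifiers
 * promise the compiler the arrays do not alias, which helps it
 * vectorize the loop. */
void saxpy_modern(float a, const float *restrict x,
                  float *restrict y, size_t n) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The same source runs unchanged on a multicore Intel Xeon processor and on an Intel Xeon Phi coprocessor; only the thread count and vector width differ.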

Modernize with concurrent algorithms

Examples of opportunities to rethink approaches to better suit the parallelism of modern computers are scattered throughout this book. Chapter 5 encourages using barriers with an eye toward more concurrency. Chapter 11 stresses the importance of not statically decomposing workloads because neither workloads nor the machines we run them on are truly uniform. Chapter 18 shows the power of not thinking that the parallel world is flat. Chapter 26 juggles data, computation, and storage to increase performance. Chapter 12 increases performance by ensuring parallelism in a heterogeneous node. Enhancing parallelism across a heterogeneous cluster is illustrated in Chapter 13 and Chapter 25.

Modernize with vectorization and data locality

Chapter 8 provides a solid examination of data layout issues in the quest to process data as vectors. Chapters 27 and 28 provide additional education and motivation for doing data layout and vectorization work.
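
To give a flavor of the data layout work these chapters examine (this sketch is generic and illustrative, not code from Chapter 8), converting an array-of-structures into a structure-of-arrays turns strided gathers into unit-stride loads that compilers vectorize much more readily:

```c
#include <stddef.h>

/* Array-of-structures (AoS): reading only the x components forces a
 * strided access pattern, because consecutive x values sit three
 * floats apart in memory, a gather in vector terms. */
typedef struct { float x, y, z; } PointAoS;

/* Structure-of-arrays (SoA): each component is contiguous, so a loop
 * over x is a unit-stride stream that vectorizes cleanly. */
typedef struct { float *x, *y, *z; } PointsSoA;

float sum_x_aos(const PointAoS *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;              /* stride-3 access (gather) */
    return s;
}

float sum_x_soa(const PointsSoA *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];             /* unit-stride, SIMD-friendly */
    return s;
}
```

Both functions compute the same result; the point is that the SoA version presents the hardware with the memory access pattern it is best at.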

Understanding power usage

Power usage is mentioned in enough chapters that we invited Intel’s power tuning expert, Claude Wright, to write Chapter 14. His chapter looks directly at methods to measure power, including building a simple software-based power analyzer with the Intel MPSS tools, and at the difficulties of measuring idle power: you are not idle if you are busy measuring power!

ISPC and OpenCL anyone?

While OpenMP and TBB dominate as parallel programming solutions in the industry and this book, we have included some mind-stretching chapters that make the case for other solutions.
SPMD programming offers interesting solutions for vectorization, including help with data layout, at the cost of dropping sequential consistency. Is that okay? Chapters 6 and 21 include usage of ispc and its SPMD approach for your consideration. SPMD thinking resonates well when you approach vectorization, even if you do not adopt ispc.
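
For readers new to the SPMD style, the essence (sketched here in plain C rather than actual ispc syntax, so names and the gang size are our own illustration) is that you write the kernel body for a single "program instance," and the compiler maps a gang of instances onto SIMD lanes executing conceptually in lockstep:

```c
#include <stddef.h>

#define GANG_SIZE 8   /* an SPMD compiler like ispc picks this from the target ISA */

/* SPMD in miniature: the body is written for one program instance,
 * identified by its index. An SPMD compiler maps each instance to a
 * SIMD lane; instances that run past the end of the data are simply
 * masked off. Here we emulate the gang with an inner loop that a
 * vectorizer can turn into SIMD code. */
static inline void scale_one_instance(size_t i, float a,
                                      const float *in, float *out,
                                      size_t n) {
    if (i < n)                    /* per-lane mask */
        out[i] = a * in[i];
}

void scale_spmd(float a, const float *in, float *out, size_t n) {
    for (size_t base = 0; base < n; base += GANG_SIZE)
        for (size_t lane = 0; lane < GANG_SIZE; ++lane)
            scale_one_instance(base + lane, a, in, out, n);
}
```

In real ispc the inner "gang" loop does not appear in the source at all; you write only the per-instance body, which is what makes the style feel like scalar programming while producing vector code.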
Chapter 22 is written to advocate for OpenCL usage in a heterogeneous world. The contributors describe results from the BUDE molecular docking code, which sustains over 30% of peak floating point performance on a wide variety of systems.

Intel Xeon Phi coprocessor specific

While most of the chapters move algorithms forward on processors and coprocessors, three chapters are dedicated to a deeper look at Intel Xeon Phi coprocessor specific topics. Chapter 15 presents current best practices for managing Intel Xeon Phi coprocessors in a cluster. Chapters 16 and 20 give valuable insights for users of Intel Xeon Phi coprocessors.

Many-core, neo-heterogeneous

The adoption rate of Intel Xeon Phi coprocessors has been steadily increasing since they were first introduced in November 2012. By mid-2013, the cumulative number of FLOPs contributed by Intel Xeon Phi coprocessors in TOP 500 machines exceeded the combined FLOPs contributed by all the graphics processing units (GPUs) installed as floating-point accelerators in the TOP 500 list. In fact, the only device type contributing more FLOPs to TOP 500 supercomputers was Intel Xeon® processors.
As we mentioned in the Preface, the 61 cores of an Intel Xeon Phi coprocessor have inspired a new era of interest in parallel programming. As we saw in our introductory book, Intel Xeon Phi Coprocessor High-Performance Programming, the coprocessors use the same programming languages, parallel programming models, and the same tools as processors. In essence, this means that the challenge of programming the coprocessor is largely the same challenge as parallel programming for a general-purpose processor. This is because the design of both processors and the Intel Xeon Phi coprocessor avoided the restricted programming nature inherent in heterogeneous programming when using devices with restricted programming capabilities.
The experiences of programmers using the Intel Xeon Phi coprocessor time and time again have reinforced the value of a common programming model—a fact that is independently and repeatedly emphasized by the chapter authors in this book. The take-away message is clear that the effort spent to tune for scaling and vectorization for the Intel Xeon Phi coprocessor is time well spent for improving performance for processors such as Intel Xeon processors.

No “Xeon Phi” in the title, neo-heterogeneous programming

Because the key programming challenges are generically parallel, we knew we needed to emphasize the applicability to both multicore and many-core computing instead of focusing only on Intel Xeon Phi coprocessors, which is why “Xeon Phi” does not appear in the title of this book.
However, systems that combine coprocessors and processors do usher in two unique challenges that are addressed in this book. The first is hiding the latency of moving data to and from an attached device, a challenge common to any “attached” device, including GPUs and coprocessors. Future Intel Xeon Phi products will offer configurations that eliminate this data-movement challenge by being offered as processors instead of packaged coprocessors. The second, broader challenge lies in programming heterogeneous systems. Previously, heterogeneous programming referred to systems that combined incompatible computational devices: incompatible in that they used programming methods different enough to require separate development tools and coding approaches. The Intel Xeon Phi products changed all that. Intel Xeon Phi coprocessors offer...

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Contributors
  6. Acknowledgments
  7. Foreword
  8. Preface
  9. Chapter 1: Introduction
  10. Chapter 2: From “Correct” to “Correct & Efficient”: A Hydro2D Case Study with Godunov’s Scheme
  11. Chapter 3: Better Concurrency and SIMD on HBM
  12. Chapter 4: Optimizing for Reacting Navier-Stokes Equations
  13. Chapter 5: Plesiochronous Phasing Barriers
  14. Chapter 6: Parallel Evaluation of Fault Tree Expressions
  15. Chapter 7: Deep-Learning Numerical Optimization
  16. Chapter 8: Optimizing Gather/Scatter Patterns
  17. Chapter 9: A Many-Core Implementation of the Direct N-Body Problem
  18. Chapter 10: N-Body Methods
  19. Chapter 11: Dynamic Load Balancing Using OpenMP 4.0
  20. Chapter 12: Concurrent Kernel Offloading
  21. Chapter 13: Heterogeneous Computing with MPI
  22. Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor
  23. Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment
  24. Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors
  25. Chapter 17: NWChem: Quantum Chemistry Simulations at Scale
  26. Chapter 18: Efficient Nested Parallelism on Large-Scale Systems
  27. Chapter 19: Performance Optimization of Black-Scholes Pricing
  28. Chapter 20: Data Transfer Using the Intel COI Library
  29. Chapter 21: High-Performance Ray Tracing
  30. Chapter 22: Portable Performance with OpenCL
  31. Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations
  32. Chapter 24: Profiling-Guided Optimization
  33. Chapter 25: Heterogeneous MPI application optimization with ITAC
  34. Chapter 26: Scalable Out-of-Core Solvers on a Cluster
  35. Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization
  36. Chapter 28: Morton Order Improves Performance
  37. Author Index
  38. Subject Index
