Chapter 1
Introduction
In this book, we bring together the essentials of high-performance programming for an Intel® Xeon Phi™ coprocessor. As we’ll see, programming for Intel Xeon Phi coprocessors is mostly about programming in the same way as you would for an Intel® Xeon® processor-based system, but with extra attention on exploiting lots of parallelism. This extra attention pays off for processor-based systems as well. You’ll see this “Double Advantage of Transforming-and-Tuning” to be a key aspect of why programming for the Intel Xeon Phi coprocessor is particularly rewarding and helps protect investments in programming.
The Intel Xeon Phi coprocessor is both generally programmable and tailored to tackle highly parallel problems. As such, it is ready to consume very demanding parallel applications. We explain how to make sure your application is constructed to take advantage of such a large parallel capability. As a natural side effect, these techniques generally improve performance on less parallel machines and prepare applications better for computers of the future as well. The overall concept can be thought of as “Portable High Performance Programming.”
Sports cars are not designed for a superior experience driving around on slow-moving congested highways. As we’ll see, the similarities between an Intel Xeon Phi coprocessor and a sports car will give us opportunities to mention sports cars a few more times in the next few chapters.
Sports Car in Two Situations: Left in Traffic, Right on Race Course.
Trend: more parallelism
To squeeze more performance out of new designs, computer designers rely on the strategy of adding more transistors to do multiple things at once. This represents a shift away from relying on higher speeds, which demanded higher power consumption, to a more power-efficient parallel approach. Hardware performance derived from parallel hardware is more disruptive for software design than speeding up the hardware because it benefits parallel applications to the exclusion of nonparallel programs.
It is interesting to look at a few graphs that quantify the factors behind this trend. Figure 1.1 shows the end of the “era of higher processor speeds,” which gives way to the “era of higher processor parallelism” shown by the trends graphed in Figures 1.2 and 1.3. This switch is possible because, while both eras required a steady rise in the number of transistors available for a computer design, trends in transistor density continue to follow Moore’s law as shown in Figure 1.4. A continued rise in transistor density will continue to drive more parallelism in computer design and result in more performance for programs that can consume it.
Figure 1.1 Processor/Coprocessor Speed Era [Log Scale].
Figure 1.2 Processor/Coprocessor Core/Thread Parallelism [Log Scale].
Figure 1.3 Processor/Coprocessor Vector Parallelism [Log Scale].
Figure 1.4 Moore’s Law Continues, Processor/Coprocessor Transistor Count [Log Scale].
Why Intel® Xeon Phi™ coprocessors are needed
Intel Xeon Phi coprocessors are designed to extend the reach of applications that have demonstrated the ability to fully utilize the scaling capabilities of Intel Xeon processor-based systems and fully exploit available processor vector capabilities or memory bandwidth. For such applications, the Intel Xeon Phi coprocessors offer additional power-efficient scaling, vector support, and local memory bandwidth, while maintaining the programmability and support associated with Intel Xeon processors.
Most applications in the world have not been structured to exploit parallelism. This leaves a wealth of capabilities untapped on nearly every computer system. Such applications can be extended in performance by a highly parallel device only when the application expresses a need for parallelism through parallel programming.
Advice for successful parallel programming can be summarized as “Program with lots of threads that use vectors with your preferred programming languages and parallelism models.” Since most applications have not yet been structured to take advantage of the full magnitude of parallelism available in any processor, understanding how to restructure to expose more parallelism is critically important to enable the best performance for Intel Xeon processors or Intel Xeon Phi coprocessors. This restructuring itself will generally yield benefits on most general-purpose computing systems, a bonus due to the emphasis on common programming languages, models, and tools across the processors and coprocessors. We refer to this bonus as the dual-transforming-tuning advantage.
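As a rough illustration of “lots of threads that use vectors,” the following minimal sketch combines both levels of parallelism in one loop. It is only one way to express the idea, not a method prescribed here: the function name `saxpy` is a hypothetical helper, and the combined `parallel for simd` directive assumes a compiler that supports OpenMP 4.0 or later (for example, GCC with `-fopenmp`). Without OpenMP support the directive is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an assumption, not from the text):
 * computes y[i] = a * x[i] + y[i] for i in [0, n).
 *
 * The OpenMP directive asks the compiler to divide the iterations among
 * threads and to vectorize each thread's share of the loop. The restrict
 * qualifiers promise that x and y do not overlap, which is what makes
 * the iterations independent. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);

    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The same source can be built for a processor or, with suitable tools, for a coprocessor; nothing in the loop names a core count or a vector width.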
It has been said that a single picture can speak a thousand words; for understanding Intel Xeon Phi coprocessors (or any highly parallel device) it is Figure 1.5 that speaks a thousand words. We should not dwell on the exact numbers as they are based on some models that may be as typical as applications can be. The picture speaks to this principle: Intel Xeon Phi coprocessors offer the ability to make a system that can potentially offer exceptional performance while still being buildable and power efficient. Intel Xeon processors deliver performance much more readily for a broad range of applications but do reach a practical limit on peak performance as indicated by the end of the line in Figure 1.5. The key is “ready to use parallelism.” Note from the picture that more parallelism is needed to make the Intel Xeon Phi coprocessor reach the same performance level, and that requires programming adapted to deliver that higher level of parallelism. In exchange for the programming investment, we may reach otherwise unobtainable performance. The transforming-and-tuning double advantage of these Intel products is that using the same parallel programming models, programming languages, and familiar tools greatly enhances the preservation of programming investments. We’ll revisit this picture later.
Figure 1.5 This Picture Speaks a Thousand Words.
Platforms with coprocessors
A typical platform is diagrammed in Figure 1.6. Multiple such platforms may be interconnected to form a cluster or supercomputer. A platform cannot consist of only coprocessors. Processors are cache coherent and share access to main memory with other processors. Coprocessors are cache-coherent SMP-on-a-chip1 devices that connect to other devices via the PCIe bus, and are not hardware cache coherent with other processors or coprocessors in the node or the system.
Figure 1.6 Processors and Coprocessors in a Platform Together.
The Intel Xeon Phi coprocessor runs Linux. It really is an x86 SMP-on-a-chip running Linux. Every card has its own IP address. We logged onto one of our pre-production systems in a terminal window. We first got a shell on the host (an Intel Xeon processor), and then we did “ssh mic0”, which logged us into the first coprocessor card in the system. Once we had this window, we listed /proc/cpuinfo. The result is 6100 lines long, so we’re showing the first 5 and last 26 lines in Figure 1.7.
Figure 1.7 Preproduction Intel® Xeon Phi™ Coprocessor “cat /proc/cpuinfo”.
In some ways, for us, this really makes the Intel Xeon Phi coprocessor feel very familiar. From this window, we can “ssh” to the world. We can run “emacs” (you can run “vi” if that is your thing). We can run “awk” scripts or “perl.” We can start up an MPI program to run across the cores of this card, or to connect with any other computer in the world.
If you are wondering how many cores are in an Intel Xeon Phi coprocessor, the answer is “it depends.” It turns out there are, and will be, a variety of configurations available from Intel, all with more than 50 cores. Preserving programming investments is greatly enhanced by the transforming-and-tuning double advantage. For years, we have been able to buy processors in a variety of clock speeds. More recently, an additional variation in offerings is based on the number of cores. The results in Figure 1.7 are from a 61-core pre-production Intel Xeon Phi coprocessor that is a precursor to the production parts known as an Intel Xeon Phi coprocessor SE10x. The last entry reports processor number 243 because the threads are enumerated 0 through 243, meaning there are 244 threads (61 cores times 4 threads per core).
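Because the core count “depends,” a program is better off asking the operating system at run time than assuming a number. The sketch below shows one possible way to do that on Linux, together with the thread arithmetic described above. The function names are hypothetical, and `_SC_NPROCESSORS_ONLN` is a widely available extension (glibc and others) rather than something this chapter specifies.

```c
#include <assert.h>
#include <unistd.h>

/* Logical processors currently online, as reported by the OS.
 * These are the same entries that /proc/cpuinfo enumerates.
 * Falls back to 1 if the query is unsupported or fails. */
long online_logical_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Hardware threads implied by a core count and threads per core
 * (hypothetical helper for the arithmetic in the text). */
long total_hw_threads(long cores, long threads_per_core)
{
    assert(cores > 0 && threads_per_core > 0);
    return cores * threads_per_core;
}

/* Highest "processor" number when enumeration starts at 0. */
long highest_processor_number(long total_threads)
{
    assert(total_threads > 0);
    return total_threads - 1;
}
```

For the 61-core part described above, `total_hw_threads(61, 4)` is 244 and `highest_processor_number(244)` is 243, matching the last entry in Figure 1.7. On other hardware, `online_logical_cpus()` returns whatever that system reports, which is the point of not hard-coding the value.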
The first Intel® Xeon Phi™ coprocessor
The first Intel® Xeon Phi™ coprocessor was known by the code name Knights Corner early in development. While programming does not require deep knowledge of the implementation of the device, it is definitely useful to know some attributes of the coprocessor. From a programming standpoint, treating it as an x86-based SMP-on-a-chip with over fifty cores, with multiple hardware threads per core, and 512-bit SIMD instructions, is the key. It is not critical to completely absorb everything else in this part of the chapter, including the microarchitectural diagrams in Figures 1.8 and 1.9 that we chose to include for those who enjoy such things as we do.
Figure 1.8 Architecture of a Single Intel® Xeon Phi™ Coprocessor Core.
Figure 1.9 Microarchitecture of the Entire Coprocessor.
The cores are in-order, dual-issue x86 processor cores that trace some history to the original Pentium® design. With the addition of 64-bit support, four hardware threads per core, power management, ring interconnect support, 512-bit SIMD capabilities, and other enhancements, however, these are hardly the Pentium cores of 20 years ago. The x86-specific logic (excluding L2 caches) makes up less than 2 percent of the die area for an Intel Xeon Phi coprocessor.
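To make “512-bit SIMD” concrete, it helps to count how many data elements fit in one vector register. The short sketch below does only that arithmetic; the function name `simd_lanes` is a hypothetical helper, and the element sizes in the note that follows assume the common 4-byte `float` and 8-byte `double`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of elements (lanes) that fit in one SIMD register.
 * register_bits: register width in bits (512 for this coprocessor).
 * element_bytes: size of one element, e.g. sizeof(float). */
size_t simd_lanes(size_t register_bits, size_t element_bytes)
{
    assert(element_bytes > 0);
    return register_bits / (element_bytes * CHAR_BIT);
}
```

With 4-byte single-precision values, `simd_lanes(512, 4)` is 16; with 8-byte double-precision values, `simd_lanes(512, 8)` is 8. A loop that is not vectorized therefore uses only a small fraction of the data width each core provides.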
Here are key facts about the first Intel Xeon Phi coprocessor product:
• A coprocessor (requires at least one processor in the system), in production in 2012.
• Runs Linux (source code available http://intel.com/software/mic).
• Manufactured using Intel’s 22 nm process technology with 3-D Trigate transistors.
• Supported by standard tools including Intel® Parallel Studio XE 2013. A list of additional tools available can be found online (http://intel.com/software/mic).
• Many cores:
– More than 50 cores (it will vary within a generation of products, and between generations; it is good advice to avoid hard-coding applications to a particular number).
– In-order cores support 64-bit x86 instru...