Computer Science

Single Instruction Multiple Data (SIMD)

SIMD is a type of parallel computing architecture that allows multiple processing elements to simultaneously execute the same instruction on different data. It is commonly used in applications that require high-performance computing, such as video and audio processing, scientific simulations, and machine learning. SIMD can significantly improve processing speed and efficiency by reducing the number of instructions needed to perform a task.

Written by Perlego with AI-assistance

12 Key excerpts on "Single Instruction Multiple Data (SIMD)"

  • Computer Architecture: A Quantitative Approach
    • John L. Hennessy, David A. Patterson (Authors)
    • 2011 (Publication Date)
    • Morgan Kaufmann (Publisher)
    A question for the SIMD architecture, which Chapter 1 introduced, has always been just how wide a set of applications has significant data-level parallelism (DLP). Fifty years later, the answer is not only the matrix-oriented computations of scientific computing, but also the media-oriented image and sound processing. Moreover, since a single instruction can launch many data operations, SIMD is potentially more energy efficient than multiple instruction multiple data (MIMD), which needs to fetch and execute one instruction per data operation. These two answers make SIMD attractive for Personal Mobile Devices. Finally, perhaps the biggest advantage of SIMD versus MIMD is that the programmer continues to think sequentially yet achieves parallel speedup by having parallel data operations.
    This chapter covers three variations of SIMD: vector architectures, multimedia SIMD instruction set extensions, and graphics processing units (GPUs).
    The first variation, which predates the other two by more than 30 years, means essentially pipelined execution of many data operations. These vector architectures are easier to understand and to compile to than other SIMD variations, but they were considered too expensive for microprocessors until recently. Part of that expense was in transistors and part was in the cost of sufficient DRAM bandwidth, given the widespread reliance on caches to meet memory performance demands on conventional microprocessors.
    The second SIMD variation borrows the SIMD name to mean basically simultaneous parallel data operations and is found in most instruction set architectures today that support multimedia applications. For x86 architectures, the SIMD instruction extensions started with the MMX (Multimedia Extensions) in 1996, which were followed by several SSE (Streaming SIMD Extensions) versions in the next decade, and they continue to this day with AVX (Advanced Vector Extensions). To get the highest computation rate from an x86 computer, you often need to use these SIMD instructions, especially for floating-point programs.
    The third variation on SIMD comes from the GPU community, offering higher potential performance than is found in traditional multicore computers today. While GPUs share features with vector architectures, they have their own distinguishing characteristics, in part due to the ecosystem in which they evolved. This environment has a system processor and system memory in addition to the GPU and its graphics memory. In fact, to recognize those distinctions, the GPU community refers to this type of architecture as heterogeneous
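The excerpt above notes that getting the highest computation rate from an x86 processor often means using the multimedia SIMD extensions (MMX, SSE, AVX). Below is a minimal sketch, not taken from the book, assuming an AVX-capable CPU and a compiler flag such as -mavx: a single AVX intrinsic performs eight single-precision additions at once. The array contents are purely illustrative.

#include <immintrin.h>  // AVX intrinsics; assumes an AVX-capable CPU and -mavx (or equivalent)
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // load eight floats into one 256-bit register
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // one instruction, eight additions
    _mm256_store_ps(c, vc);

    for (float x : c) std::printf("%g ", x);
    std::printf("\n");
    return 0;
}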
  • Single-Instruction Multiple-Data Execution
    • Christopher J. Hughes (Author)
    • 2022 (Publication Date)
    • Springer (Publisher)
    This combination means the instruction fetch and decode units do less work than if each operation were separate instructions. When we combine dependent operations into one instruction, as in our example, we may also avoid writing intermediate results to registers. In this book, we focus on a combination of the second and third options, single-instruction multiple-data (SIMD) execution. SIMD borrows the CISC concept of an instruction that specifies multiple operations, but to combine independent operations. Further, the independent operations are the same arithmetic function, but on different data elements. That is, SIMD instructions tell the hardware to apply the same operation to a set of independent data elements. Thus, SIMD specifically targets data parallelism.
    2.2 SIMD EXECUTION
    SIMD is aptly named—a single instruction tells the hardware to perform a given operation on multiple data elements. This explicitly exposes parallelism to the hardware—the processor knows that the operations specified by the instruction can be done simultaneously. Architects can leverage this to increase the performance and/or energy efficiency of a core, as we will explain in Section 2.3. Supercomputers were the first to adopt SIMD execution. Illiac IV was the first machine to use SIMD execution [Barnes et al., 1968], and was followed by the CDC STAR-100 [Hintz and Tate, 1972] and the Texas Instruments ASC [Watson, 1972]. The latter two more specifically contained the first vector processors. A vector processor operates on groups of independent data elements, i.e., vectors. For example, they may specify with one instruction that 64 elements of array A should be added to 64 elements of array B; the underlying hardware performs one or more additions per cycle, in a pipelined fashion, until it completes all 64 additions. The operands are typically stored in vector registers, registers capable of holding an entire vector.
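The 64-element vector add in the excerpt can be written as an ordinary loop of identical, independent operations; with optimization enabled (for example -O3), most modern compilers will map such a loop onto SIMD instructions automatically. A minimal sketch with illustrative names, not code from the book:

#include <array>
#include <cstddef>

// Element-wise add of two 64-element vectors, the operation the excerpt describes.
// Each iteration applies the same arithmetic function to different data elements,
// so an auto-vectorizing compiler can execute several iterations per SIMD instruction.
void vector_add(const std::array<float, 64>& a,
                const std::array<float, 64>& b,
                std::array<float, 64>& c) {
    for (std::size_t i = 0; i < 64; ++i) {
        c[i] = a[i] + b[i];
    }
}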
  • Parallel Processing from Applications to Systems
    4.1 SINGLE-INSTRUCTION MULTIPLE-DATA (SIMD) COMPUTERS
    An SIMD computer consists of an array of processing elements, memory elements (M), a control unit, and an interconnection network (IN). Such computers are attached to a host machine, which from the user's point of view is a front-end system. The role of the host computer is to perform compilation, load programs, perform I/O operations, and execute other operating system functions. We will examine three distinct organizations of SIMD computers: local-memory, shared-memory, and three-dimensional wafer-scale.
    4.1.1 Local-Memory SIMD Model
    Specific to the model shown in Figure 4.1 is the fact that each PE has its own memory unit. The control unit fetches instructions from the CU's memory. It executes the instructions if they are control-type instructions or if they have scalar operands. If they are vector instructions, the CU broadcasts the instructions to the PE array. The CU can also broadcast data words as they may be needed for vector-scalar operations. In the case in which the PEs need to fetch data from their own memories, the CU broadcasts the addresses to the PEs. The instruction issued by the CU normally supplies the same address to all processors. However, in order to increase the addressing capability of processors, an index register may be used to modify the memory address supplied by the controller. The programmer may set the index register in each processor. Flexible addressing schemes are possible. Data communication is realized by moving data from one processor to another via an interconnection network. Each processor, or group of processors, has communication ports and data buffers. The interconnection network performs several mapping functions, depending on the network. [Figure 4.1: Local-memory SIMD computer]
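A minimal sketch, not from the book, of the local-memory organization just described: the control unit broadcasts one vector instruction and one set of addresses, and every processing element applies the operation to its own local memory. The type and function names are hypothetical.

#include <array>
#include <cstddef>
#include <vector>

// One processing element (PE) with its own private memory unit, as in Figure 4.1.
struct PE {
    std::vector<float> memory;
};

// The control unit broadcasts the same instruction and the same addresses to
// every PE; each PE then operates on its own local data (conceptually in lockstep).
void broadcast_vector_add(std::array<PE, 8>& pe_array,
                          std::size_t addr_a, std::size_t addr_b, std::size_t addr_c) {
    for (PE& pe : pe_array) {
        pe.memory[addr_c] = pe.memory[addr_a] + pe.memory[addr_b];
    }
}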
  • The Computer Engineering Handbook
    • Vojin G. Oklobdzija (Author)
    • 2019 (Publication Date)
    • CRC Press (Publisher)
    A sequential machine is considered to have a single instruction stream executing on a single data stream; this is called SISD. An SIMD machine has a single instruction stream executing on multiple data streams in the same cycle. MIMD has multiple instruction streams executing on multiple data streams simultaneously. All are shown in Fig. 1.17. An MISD is not shown but is considered to be a systolic array. Four categories of MIMD systems, dataflow, multithreaded, out-of-order execution, and very long instruction word (VLIW), are of particular interest and seem to be the tendency for the future. These categories can be applied to a single CPU, providing parallelism by having multiple functional units. All four attempt to use fine-grain parallelism to maximize the number of instructions that may be executing in the same cycle. They also use fine-grain parallelism to help recover cycles that could otherwise be lost to large latency in the execution of an instruction. Latency increases when the execution of one instruction is temporarily stalled while waiting for some resource that is not currently available, such as the result of a cache miss (or even a cache fetch), the result of a floating-point instruction (which takes longer than a simpler instruction), or the availability of a needed functional unit. This can delay the execution of other instructions. If there is very fine-grain parallelism, other instructions can use the available resources while the stalled instruction waits. This is one area where much computing power has been reclaimed. Two other compelling issues exist in parallel systems: portability (once a program has been developed, it should not need to be recoded to run efficiently on a parallel system) and scalability (the performance of a system should increase in proportion to the size of the system). Scalability is problematic, since unexpected bottlenecks occur when more processors are added to many parallel systems.
  • Computer Architecture: Software Aspects, Coding, and Hardware
    • John Y. Hsu (Author)
    • 2017 (Publication Date)
    • CRC Press (Publisher)
    CHAPTER 8: Vector and Multiple-Processor Machines

    8.1 VECTOR PROCESSORS

    A SIMD machine provides a general-purpose set of instructions to operate on arrays, namely, vectors. As an example, one add-vector instruction can add two arrays and store the result in a third array. That is, each corresponding word in the first and second arrays is added and stored in the corresponding word of the third array. This also means that after a single instruction is fetched and decoded, its EU (execution unit) provides control signals to fetch many operands and execute the operation in a loop. As a consequence, the overhead of instruction retrievals and decodes is reduced. Because a vector means an array in programming, the terms vector processor, array processor, and SIMD machine are all synonymous. A vector processor provides general-purpose instructions, such as integer arithmetic, floating-point arithmetic, logical, and shift operations, on vectors. Each instruction contains an opcode, the size of the vector, and the addresses of the vectors. A SIMD or vector machine may have its data stream transmitted serially or in parallel. A parallel data machine uses more hardware logic than a serial data machine.
    8.1.1 Serial Data Transfer
    The execution unit is called the processing element (PE); this is where the operations are performed. If one PE is connected to one processing element memory (PEM), we have a SIMD machine with serial data transfer, as shown in Figure 8.1a. That is, after decoding a vector instruction in the CU (control unit), an operand stream is fetched and executed serially in a hardware loop: serial data are transferred on the data bus between the PE and the PEM on a continuous basis until the execution is completed. In a serial data SIMD machine, there is one PE and one PEM. However, one instruction retrieval is followed by many operand fetches.
    8.1.2 Parallel Data Transfer
    If multiple PEs are tied to the CU and each PE is connected to a PEM, we have a parallel data machine, as shown in Figure 8.1b.
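A minimal sketch, not from the book, of the pattern Section 8.1 describes: one decoded add-vector instruction carries the vector length and the operand addresses, after which the execution unit loops over many operand fetches. The names are hypothetical.

#include <cstddef>

// Hypothetical encoding of an "add vector" instruction: the opcode is implied by
// the type, and the fields give the vector size and the addresses of the vectors.
struct AddVectorInstr {
    std::size_t length;   // number of elements in each vector
    const float* src1;    // address of the first source vector
    const float* src2;    // address of the second source vector
    float* dst;           // address of the destination vector
};

// Fetched and decoded once, then executed as a loop of operand fetches and stores.
void execute(const AddVectorInstr& in) {
    for (std::size_t i = 0; i < in.length; ++i) {
        in.dst[i] = in.src1[i] + in.src2[i];
    }
}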
  • Programmable Digital Signal Processors: Architecture, Programming, and Applications
    • Yu Hen Hu (Author)
    • 2001 (Publication Date)
    • CRC Press (Publisher)
    The subword on the right end of a register will have index = n and therefore be in the least significant position. [...] operations on these subwords with a single instruction, as in single-instruction multiple-data (SIMD) parallelism. SIMD parallelism is said to exist when a single instruction operates on multiple data elements in parallel. In the case of subword parallelism, the multiple data elements will correspond to the subwords in the packed register. Traditionally, however, the term SIMD was used to define a situation in which a single instruction operated on multiple registers, rather than on the subwords of a single register. To address this difference, the parallelism exploited by the use of subword parallel instructions is defined as microSIMD parallelism [2]. Thus, an add instruction operating on packed data can be viewed as a microSIMD instruction, where the single instruction is the add and the multiple data elements are the subwords in the packed source registers. For a given processor, the ISA needs to be enhanced to exploit microSIMD parallelism (see Fig. 2). New instructions are added to allow parallel processing of packed data types. Minor modifications to the underlying functional units will also be necessary. Fortunately, the register file and the pipeline structure need not be changed to support packed data types. We define packed instructions as the instructions that are specifically designed to operate on packed data types. A packed add, for example, is an add instruction with the regular definition of addition, but it operates on packed data types. Packed subtract and packed multiply are other obvious instructions needed to efficiently manipulate packed data types. All of the architectures in this chapter include varieties of packed instructions. More often than not, they also include other instructions that cannot be classified as packed arithmetic operations.
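A minimal sketch of the packed (microSIMD) add the excerpt describes, assuming an x86 CPU with SSE2: eight 16-bit subwords are packed into one 128-bit register and added with a single instruction. The values are illustrative, and the ordinary wrap-around variant is shown rather than a saturating one.

#include <emmintrin.h>  // SSE2 integer intrinsics
#include <cstdint>
#include <cstdio>

int main() {
    alignas(16) std::int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) std::int16_t b[8] = {100, 200, 300, 400, 500, 600, 700, 800};
    alignas(16) std::int16_t c[8];

    __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_add_epi16(va, vb);   // packed add: eight 16-bit subwords at once
    _mm_store_si128(reinterpret_cast<__m128i*>(c), vc);

    for (std::int16_t x : c) std::printf("%d ", x);
    std::printf("\n");
    return 0;
}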
  • Obstacle Avoidance In Multi-robot Systems, Experiments In Parallel Genetic Algorithms
    • Mark A C Gill, Albert Y Zomaya (Authors)
    • 1998 (Publication Date)
    • World Scientific (Publisher)
    SIMD machines are extremely efficient in handling matrix and vector operations where there is inherent parallelism in the data. [Figure 2.4: SIMD architecture]
    2.2.7.3 MISD Machines
    In this class, there are N processing units (Processing unit i, i = 1, 2, ..., N), as shown in Figure 2.5. Each processing unit has its own control unit (Control unit i, i = 1, 2, ..., N), but they share a common memory containing data. There are N separate instructions (Instruction stream i, i = 1, 2, ..., N) that operate simultaneously on the same item of data. Each processing unit does different things to the same data. This type of architecture is very rare and impractical. Systolic arrays fall into this category (Hwang 1993).
    2.2.7.4 MIMD Machines
    These machines are the most general and most powerful (Akl 1989). In this machine there are N processing units (Processing unit i, i = 1, 2, ..., N), along with their own instruction streams (Instruction stream i, i = 1, 2, ..., N) from their own control units (Control unit i, i = 1, 2, ..., N). Each processing unit receives data from its own data stream (Data stream i, i = 1, 2, ..., N), as shown in Figure 2.6. This machine is like a collection of SISD machines operating together asynchronously. [Figure 2.6: MIMD architecture]
    There are several varieties of MIMD machines. These range from fine-grained to coarse-grained, and the coarse-grained systems can be further subdivided into loosely coupled and tightly coupled systems. Data flow machines are fine-grained systems, and are data-driven systems (Dennis 1980). Unlike von Neumann machines, instructions are activated by the availability of their operands and not under the control of a control unit. In the loosely coupled coarse-grained systems each processor contains its own local memory and is connected to the others via an interconnection network and shares various resources on the network.
  • Parallel Computing
    • Eduard L Lafferty (Author)
    • 2012 (Publication Date)
    • William Andrew (Publisher)
    A 64-element vector X is stored in vector register V0 and a 64-element vector Y is stored in vector register V1. The X-MP is executing the vector add instruction, which will cause the 64-element vector Z = X + Y to be computed and stored in vector register V7. As can be seen from the figure, parts of eight different adds are going on at the same time. [Figure 2-15: Snapshot of Cray X-MP vector operation]
    The essence of SIMD parallelism is that the same operation is being performed on different data elements at the same time. When the programmer uses a vector operation, he is telling the system that it may perform the same operation on different data elements at the same time. Thus, from the programmer's perspective a vector operation is a SIMD operation. In Subsection 2.3, describing SIMD computers, we described how SIMD parallelism can be provided by replication. In a vector processor, it is typically provided by pipelining. Therefore, a vector processor is essentially a pipelined SIMD and is sometimes referred to as such in the literature [Hockney: 88]. The essence of MIMD parallelism is that different operations are being performed on different data elements at the same time. It is also possible to implement a MIMD by pipelining within a single processor. The process execution module (PEM) of the Denelcor Heterogeneous Element Processor (HEP) is a good example of MIMD pipelining. The HEP is no longer being manufactured, but is interesting enough to warrant further discussion. This machine could have up to 16 PEMs connected by a packet-switched network and up to 128 data memory modules (DMM). Thus, the HEP is a shared-memory MIMD where each processor in turn is a pipelined MIMD.
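As the excerpt puts it, a vector operation lets the programmer state that the same operation may be applied to every element, and the hardware is free to realize that by replication or by pipelining. A small sketch, not from the book, using C++ std::valarray, whose whole-array expressions mirror that vector-at-a-time style; the 64-element size follows the X-MP example.

#include <valarray>
#include <cstdio>

int main() {
    std::valarray<double> x(1.0, 64);   // 64-element vector X, all ones
    std::valarray<double> y(2.0, 64);   // 64-element vector Y, all twos

    // One whole-vector expression: the library applies the add element-wise,
    // much as a single Cray vector add instruction covers all 64 elements.
    std::valarray<double> z = x + y;

    std::printf("z[0] = %g, z[63] = %g\n", z[0], z[63]);
    return 0;
}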
  • GPGPU Programming for Games and Science
    Chapter 3 SIMD Computing
    3.1 Intel Streaming SIMD Extensions
    Current CPUs have small-scale parallel support for 3D mathematics computations using single-instruction-multiple-data (SIMD) computing. The processors provide 128-bit registers, each register storing four 32-bit float values. The fundamental concepts are
    • to provide addition and multiplication of four numbers simultaneously (a single instruction applied to multiple data) and
    • to allow shuffling, sometimes called swizzling, of the four components.
    Of course, such hardware has support for more than just these operations. In this section I will briefly summarize the SIMD support for Intel CPUs, discuss a wrapper class that GTEngine has, and cover several approximations to standard mathematics functions. The latter topic is necessary because many SIMD implementations do not provide instructions for the standard functions. This is true for Intel's SIMD, and it is true for Direct3D 11 GPU hardware. You might very well find that you have to implement approximations for both the GPU and SIMD on the CPU. The original SIMD support on Intel CPUs is called Intel Streaming SIMD Extensions (SSE). New features were added over the years, and with each the version number was appended to the acronym. Nearly everything I do with GTEngine requires the second version, SSE2. To access the support for programming, you simply need to include two header files, #include <xmmintrin.h> and #include <emmintrin.h>. These give you access to data types for the registers and compiler intrinsics that allow you to use SIMD instructions within your C++ programs. The main data type is the union __m128, whose definition is found in xmmintrin.h. It has a special declaration so that it is 16-byte aligned, a requirement to use SSE2 instructions. If you require dynamic allocation to create items of this type, you can use Microsoft's _aligned_malloc and _aligned_free.
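Following the excerpt, a minimal sketch assuming an SSE2-capable x86 CPU: it includes the two headers mentioned, adds four floats with a single instruction, and swizzles the result with a shuffle. The data values are illustrative.

#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_shuffle_ps
#include <emmintrin.h>  // SSE2, as used throughout GTEngine per the excerpt
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};

    __m128 va = _mm_load_ps(a);          // four floats in one 128-bit register
    __m128 vb = _mm_load_ps(b);
    __m128 sum = _mm_add_ps(va, vb);     // four additions with a single instruction

    // Swizzle: reverse the component order of 'sum' (lanes 3,2,1,0 -> 0,1,2,3).
    __m128 rev = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(0, 1, 2, 3));

    alignas(16) float out[4];
    _mm_store_ps(out, rev);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}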
  • Computational Physics: Problem Solving with Python
    • Rubin H. Landau, Manuel J. Páez, Cristian C. Bordeianu (Authors)
    • 2015 (Publication Date)
    • Wiley-VCH (Publisher)
    The processors in a parallel computer are placed at the nodes of a communication network. Each node may contain one CPU or a small number of CPUs, and the communication network may be internal to or external to the computer. One way of categorizing parallel computers is by the approach they utilize in handling instructions and data. From this viewpoint there are three types of machines:
    Single instruction, single data (SISD): These are the classic (von Neumann) serial computers executing a single instruction on a single data stream before the next instruction and next data stream are encountered.
    Single instruction, multiple data (SIMD): Here instructions are processed from a single stream, but the instructions act concurrently on multiple data elements. Generally, the nodes are simple and relatively slow but are large in number.
    Multiple instructions, multiple data (MIMD): In this category, each processor runs independently of the others with independent instructions and data. These are the types of machines that utilize message-passing packages, such as MPI, to communicate among processors. They may be a collection of PCs linked via a network, or more integrated machines with thousands of processors on internal boards, such as the Blue Gene computer described in Section 10.15. These computers, which do not have a shared memory space, are also called multicomputers. Although these types of computers are some of the most difficult to program, their low cost and effectiveness for certain classes of problems have led to their being the dominant type of parallel computer at present.
    The running of independent programs on a parallel computer is similar to the multitasking feature used by Unix and PCs. In multitasking (Figure 10.4a), several independent programs reside in the computer's memory simultaneously and share the processing time in a round-robin or priority order.
  • Computational Vision
    MIMD is concerned with interactive processes that share resources and is thus characterized by asynchronous parallelism. Loosely coupled and tightly coupled PEs are the two main subdivisions within MIMD. Multicomputers is the first subclass, and its operating mode is message passing, while the second subclass is labeled multiprocessors, which derives its functionality from PEs sharing memory. As we move from SIMD to MIMD, note that SIMD and MIMD are equivalent in that they can simulate each other. An SIMD machine could interpret the PE data as different instructions, while the MIMD could execute only one instruction rather than many across the PE array. There are no strict boundaries between architectures as we have seen so far, and the basic question is that of efficiency and cost for solving a specific problem. We proceed by looking into the multicomputers and multiprocessors classes, respectively.
    9.4.1 Message Passing Multicomputers
    Message-passing multicomputers are lattices of PE nodes connected by a message-passing network. The basic computational paradigm is that of concurrency of processes, where processes are instances of programs. The PEs include private memory, and there is a global name space (PE #, process #) for variables across the multicomputer. The (N x N) network is the binary n-cube or mesh and facilitates locality of communication between the N nodes. Multiprogramming operating systems available at the PEs and coordination through message passing facilitate concurrency. Thus, the multicomputers constitute a physical and logical distributed system. Athas and Seitz (1988) provide a good review and taxonomy for such systems. According to the grain size—medium (Mbyte of memory per PE) or fine (Kbyte of memory per PE)—different architectures can be defined.
  • Evaluation of Multicomputers for Image Processing
    However, an all-MIMD algorithm would be expected to outperform the all-SIMD one where all of the subimages are rather sparse and there is not a large variation in the distribution of object pixels. The current implementation of the simulator does not allow results for the all-MIMD algorithm to be obtained. Future studies will consider employing an adaptive thinning algorithm that would begin as an all-SIMD task for the early thinning passes and revert to an all-MIMD task for the later passes.
    5. SUMMARY
    Performance measures such as execution time, speedup, and utilization can be used to help understand the interactions of an algorithm with a particular parallel processing system. The earlier work on SIMD performance measures was based on a complexity analysis of algorithms and included the effects of arithmetic operations, masking operations, and network operations. Simulation studies and further analysis led to enhanced performance measures for SIMD mode and an extension of the measures for MIMD mode. This chapter demonstrated that the components of and relationships between these measures are often complex and that intuition about an algorithm's predicted performance on a given parallel machine can often be wrong. Clearly, performance measures which deal with other system components such as I/O will be needed to further aid programmers in analyzing how an algorithm of interest will perform.
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.