Superscalar Architecture
Superscalar architecture is a type of CPU design that allows multiple instructions to be executed simultaneously. It achieves this by having multiple execution units within the CPU, each capable of executing a different instruction at the same time. This results in faster processing times and improved performance.
Written by Perlego with AI-assistance
Key excerpts on "Superscalar Architecture"
Computer Architecture
Software Aspects, Coding, and Hardware
- John Y. Hsu(Author)
- 2017(Publication Date)
- CRC Press(Publisher)
Chapter 7: Superscalar Machine Principles
If we view the instruction flow and data flow in a computer system, all the machines can be grouped into three classes: SISD (Single Instruction Single Data), SIMD (Single Instruction Multiple Data), and MIMD (Multiple Instruction Multiple Data). A SISD machine contains a scalar processor: each instruction can operate on one piece of data. A SIMD machine contains a vector processor, as each instruction can operate on a vector, i.e., an array of data. A MIMD machine means that many processors are interconnected. This chapter discusses the design principles of a superscalar machine. The discussions of SIMD and MIMD machines are found in the next chapter.
7.1 PARALLEL OPERATIONS
A superscalar processor is a processor that allows the execution of many instructions at the same time, but each instruction still operates on only one piece of data. In terms of instruction functions, a superscalar processor is no different from a scalar processor except in speed and complexity. That is, a superscalar processor is able to run much faster because all the operations are performed in parallel. This parallelism includes memory too, so the CPU speed is balanced with its operand access time. A powerful CPU is usually supported by a memory system that has many components operating at different speeds. The goal is to achieve parallel operations at all levels. That is to say, the CPU and all the memory components must be kept busy. In practice, there are two approaches to designing a fast CPU: a pipe or a decoupled pipe. Let us explore the memory structures before discussing CPU design in a balanced system.
7.1.1 Storage Hierarchy
Memory and storage are synonymous. As shown in Figure 7.1, a hierarchical storage system commonly has four levels: the registers, cache, central memory, and disk. The first two levels are on chip and the next two levels are off-chip. The closer a level is to the CPU, the smaller the storage is in size but the faster it is in speed.
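To make the size/speed tradeoff concrete, here is a minimal Python sketch of such a four-level hierarchy; the capacities and access times are illustrative assumptions, not figures from Hsu's text:

    # Four-level storage hierarchy: closer to the CPU means smaller
    # but faster. All sizes and latencies are assumed values chosen
    # only to illustrate the trend.
    from dataclasses import dataclass

    @dataclass
    class StorageLevel:
        name: str
        capacity_bytes: int
        access_time_ns: float

    hierarchy = [
        StorageLevel("registers",      512,           0.3),  # on chip
        StorageLevel("cache",          1 << 20,       1.0),  # on chip
        StorageLevel("central memory", 8 << 30,      80.0),  # off chip
        StorageLevel("disk",           1 << 40, 100_000.0),  # off chip
    ]

    for level in hierarchy:
        print(f"{level.name:>14}: {level.capacity_bytes:>14,} B, "
              f"{level.access_time_ns:>9.1f} ns")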
Computer Architecture
Fundamentals and Principles of Computer Design, Second Edition
- Joseph D. Dumas II(Author)
- 2016(Publication Date)
- CRC Press(Publisher)
The logic required to detect and resolve all of these problems is complex to design and adds significantly to the amount of chip area required. The multiple pipelines also take up more room, making superscalar designs very space-sensitive and thus more amenable to implementation technologies with small feature (transistor) sizes. (Superpipelined designs, by contrast, are best implemented with technologies that have short propagation delays.) These many difficulties of building a superscalar CPU are offset by a significant advantage: with multiple pipelines doing the work, clock frequency is not as critical in superscalar machines as it is in superpipelined ones. Because generating and distributing a high-frequency clock signal across a microprocessor is far from a trivial exercise, this is a substantial advantage in favor of the superscalar approach.
Superscalar and superpipelined design are not mutually exclusive. Many CPUs have been implemented with multiple, deep pipelines, making them both superpipelined and superscalar. Sun's UltraSPARC processor, introduced in 1995, was an early example of this hybrid approach: it was both superpipelined (nine stages) and four-way superscalar. The AMD Athlon (first introduced in 1999), Intel Pentium 4 (2000), IBM PowerPC 970 (2003), and ARM Cortex A8 (2005), among others, followed suit by combining superscalar and superpipelined design in order to maximize performance. Given sufficient chip area, superscalar design is a useful enhancement that makes superpipelining much more practical.
When a branch is encountered, a superscalar/superpipelined machine can use one (or more) of its deep pipelines to continue executing sequential code while another pipeline executes speculatively down the branch target path. Whichever way the branch decision goes, at least one of the pipelines will have correct results; any that took the wrong path can be flushed. Some work is wasted, but processing never comes to a complete halt.
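The flush-the-wrong-path idea can be sketched in a few lines of Python. This is a toy model under our own assumptions (two pipelines, a branch that resolves after both have started), not a description of any of the processors named above:

    # Toy model of dual-path speculation: one pipeline continues down
    # the sequential (fall-through) path while another speculates down
    # the branch target path; the loser is flushed when the branch
    # resolves.

    def run_pipeline(instructions):
        return [f"done({i})" for i in instructions]

    fall_through  = ["i1", "i2", "i3"]   # sequential code
    branch_target = ["j1", "j2", "j3"]   # code at the branch target

    speculative = {
        "not taken": run_pipeline(fall_through),
        "taken":     run_pipeline(branch_target),
    }

    branch_taken = True                  # resolved later in the pipe
    winner = "taken" if branch_taken else "not taken"
    loser  = "not taken" if branch_taken else "taken"

    print("commit:", speculative[winner])
    print("flush :", speculative[loser])  # wasted work, but no stall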
- David Loshin(Author)
- 2014(Publication Date)
- Academic Press(Publisher)
The architecture descriptions in this chapter are limited to those specifically designed for vector operations. This includes vector processors, attached array processors, and SIMD (single instruction, multiple data) machines. The following chapter will cover more general multiple-processor machines, such as Shared Memory machines and Scalable High Performance machines.
5.1 Pipelined Supercomputers
Pipelined architectures are designed to allow overlapping of a number of different phases of a vector operation to achieve vector parallelism. The idea of a pipelined computer conjures up images of a manufacturing assembly line, with unfinished products moving down the line, constantly being modified until complete products emerge at the end of the line. A pipeline architecture is designed to allow overlapping of partial computations over a sequence of operands. Overlapping steps of a number of different computations is an example of temporal parallelism. Through the use of multiple functional units inside the Arithmetic/Logical Unit (ALU), partial results of many sets of operands can be computed at the same time. Let's take a look at our vector operation example from Chapter 2:
A(1:128) = X*B(1:128) + C(1:128)
In a pipeline architecture, a number of sets of operations would be partially computed simultaneously in the different functional units. The actual action of a pipeline machine is first to fill the pipes with values. Once all the pipes are full, they can all compute their partial values at the same time. As soon as the stream of inputs has been completely eaten up, the pipes drain, pumping out the results until they are clear. Given a pipelined machine with three memory units, an addition unit, and a multiplication unit, our vector example from before would be executed in the pipeline in this sequence:
1. Fill Stage
• Load X
• Load B(1)
• Load C(1)
2. Fill Stage
• T1 = X * B(1)
• Load B(2)
• Load C(2)
3.
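A short Python sketch of that fill/compute/drain schedule may help; the four named stages and the one-cycle-per-stage timing are simplifying assumptions on our part, not Loshin's exact machine:

    # Pipelined evaluation of A(1:128) = X*B(1:128) + C(1:128).
    # Element i enters the pipe at cycle i; stages overlap across
    # elements (temporal parallelism).

    N = 128
    X = 2.0
    B = [float(i) for i in range(N)]
    C = [1.0] * N

    STAGES = ["load", "multiply", "add", "store"]

    A = [X * B[i] + C[i] for i in range(N)]   # what the pipe computes

    pipelined_cycles = (N - 1) + len(STAGES)  # fill once, then 1/cycle
    serial_cycles    = N * len(STAGES)        # no overlap at all
    print(f"pipelined: {pipelined_cycles} cycles, "
          f"serial: {serial_cycles} cycles")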
- David H. Bailey, Robert F. Lucas, Samuel Williams(Authors)
- 2010(Publication Date)
- CRC Press(Publisher)
In effect, execution is decoupled from the rest of the processor. Although programs must still exhibit instruction-level parallelism to attain peak performance, dedicating reservation stations for different classes of instructions (functional units) allows out-of-order processors to overlap (and thus hide) instruction latencies.
Superscalar Processors: Once out-of-order processors were implemented with multiple functional units and dedicated reservation stations, system architects attempted to process multiple instructions per cycle. Such a bandwidth-oriented approach potentially could increase peak performance without increasing frequency. In a "superscalar" processor, the processor attempts to fetch, decode, and issue multiple instructions per cycle from a sequential instruction stream. Collectively, the functional units may complete several instructions per cycle to the reorder buffer. As such, the reorder buffer must be capable of committing several instructions per cycle. Ideally, this architecture should increase peak performance. However, the complexity involved in register renaming with multiple (potentially dependent) instructions per cycle often negatively impacts clock frequency and power costs. For this reason, most out-of-order processors are limited to four-way designs (i.e., four instructions per cycle). Moreover, by Little's Law, the latency-bandwidth product increases in proportion to the number of instructions that can be executed per cycle, only exacerbating the software challenge.
VLIW Processors: Discovering four-way parallelism within one cycle has proven to be rather expensive. System architects can mitigate this cost by adopting a very long instruction word (VLIW) architecture. The VLIW paradigm statically (at compile time) transforms the sequential instruction stream into N parallel instruction streams. The instructions of these streams are grouped into N-instruction bundles.
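The Little's Law point is easy to quantify: the number of independent instructions that must be in flight equals average latency times issue bandwidth. The latency figure below is an assumed round number, used only to show the scaling:

    # Little's Law: in_flight = latency * bandwidth. As issue width
    # grows, software must expose proportionally more independent
    # instructions to keep the machine busy.

    avg_latency_cycles = 20   # assumed average instruction latency
    for issue_width in (1, 2, 4, 8):
        in_flight = avg_latency_cycles * issue_width
        print(f"{issue_width}-way issue: ~{in_flight} independent "
              f"instructions must be in flight")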
- Yan Solihin(Author)
- 2015(Publication Date)
- Chapman and Hall/CRC(Publisher)
Parallel architectures were initially a natural idea because there were not enough transistors on a chip to implement a complete microprocessor. Hence, it was natural to have multiple chips that communicated with each other, either when those chips implemented different components of a processor or when they implemented components of different processors. Initially, all levels of parallelism were considered in parallel computer architectures: instruction-level parallelism, data parallelism, etc. What defined parallel architecture was unclear but has solidified over time. Almasi and Gottlieb [3] defined a parallel computer as: "A parallel computer is a collection of processing elements that communicate and cooperate to solve a large problem fast." While the definition appears straightforward, there is a broad range of architectures that fit the definition. For example, take the phrase "collection of processing elements". What constitutes a processing element? A processing element is logic that has an ability to process an instruction. It can be a functional unit, a thread context on a processor, a processor core, a processor chip, or an entire node (processors in a node, local memory, and disk). From this definition, instruction-level parallelism can be thought of as parallel processing of instructions with functional units as the processing elements. Does it mean that a superscalar processor can be thought of as a parallel computer? A superscalar processor detects dependences between instructions and executes independent instructions in parallel on different functional units whenever possible. Here the definition seems to include a superscalar processor as a parallel computer. In contrast, today, many people do not consider a superscalar processor a parallel computer. Such an ambiguity was understandable at the time because the level of parallelism being exploited was still fluid.
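As a concrete illustration of "detects dependences between instructions", the toy check below marks two instructions as issueable together only if neither writes a register the other reads or writes. The tuple encoding (dest, src1, src2) is invented for this sketch:

    # Toy pairwise independence test for dual issue. Real superscalar
    # hardware does this (plus renaming) across a whole issue window;
    # this sketch only shows the idea.

    def independent(a, b):
        dest_a, srcs_a = a[0], a[1:]
        dest_b, srcs_b = b[0], b[1:]
        return (dest_a != dest_b            # no WAW hazard
                and dest_a not in srcs_b    # no RAW hazard
                and dest_b not in srcs_a)   # no WAR hazard

    i1 = ("r1", "r2", "r3")   # r1 = r2 op r3
    i2 = ("r4", "r5", "r6")   # r4 = r5 op r6
    i3 = ("r7", "r1", "r6")   # r7 = r1 op r6 (needs i1's result)

    print(independent(i1, i2))   # True : can issue together
    print(independent(i1, i3))   # False: RAW dependence on r1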
Microprocessor Architecture
From Simple Pipelines to Chip Multiprocessors
- Jean-Loup Baer(Author)
- 2009(Publication Date)
- Cambridge University Press(Publisher)
3 Superscalar Processors
3.1 From Scalar to Superscalar Processors
In the previous chapter we introduced a five-stage pipeline. The basic concept was that the instruction execution cycle could be decomposed into nonoverlapping stages with one instruction passing through each stage at every cycle. This so-called scalar processor had an ideal throughput of 1, or in other words, ideally the number of instructions per cycle (IPC) was 1. If we return to the formula giving the execution time, namely,
EX_CPU = Number of instructions × CPI × cycle time
we see that in order to reduce EX_CPU in a processor with the same ISA, that is, without changing the number of instructions N, we must either reduce CPI (increase IPC) or reduce the cycle time, or both. Let us look at the two options.
The only possibility to increase the ideal IPC of 1 is to radically modify the structure of the pipeline to allow more than one instruction to be in each stage at a given time. In doing so, we make a transition from a scalar processor to a superscalar one. From the microarchitecture viewpoint, we make the pipeline wider in the sense that its representation is not linear any longer. The most evident effect is that we shall need several functional units, but, as we shall see, each stage of the pipeline will be affected. The second option is to reduce the cycle time through an increase in clock frequency. In order to do so, each stage must perform less work. Therefore, a stage must be decomposed into smaller stages, and the overall pipeline becomes deeper. Modern microprocessors are therefore both wider and deeper than the five-stage pipeline of the previous chapter. In order to study the design decisions that are necessary to implement the concurrency caused by the superscalar effect and the consequences of deeper pipelines, it is convenient to distinguish between the front-end and the back-end of the pipeline.
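Plugging assumed numbers into that formula shows the two levers side by side; this is a back-of-the-envelope sketch, not data from the book:

    # EX_CPU = number_of_instructions * CPI * cycle_time.
    # Halving CPI (a 2-wide superscalar at its ideal IPC of 2) and
    # halving the cycle time (a deeper pipeline) each halve EX_CPU.

    def ex_cpu(n_instructions, cpi, cycle_time_ns):
        return n_instructions * cpi * cycle_time_ns

    n = 1_000_000
    print(ex_cpu(n, 1.0, 1.0), "ns  scalar baseline (IPC = 1)")
    print(ex_cpu(n, 0.5, 1.0), "ns  wider pipe (ideal IPC = 2)")
    print(ex_cpu(n, 1.0, 0.5), "ns  deeper pipe (doubled clock)")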
- Zbigniew J. Czech(Author)
- 2017(Publication Date)
- Cambridge University Press(Publisher)
5 Architectures of Parallel Computers
5.1 CLASSIFICATION OF ARCHITECTURES
Roughly speaking, computer architecture is the structure of computer system components. Architecture, in addition to manufacturing technology, is a major factor determining the speed of a computer. Therefore designers devote a great deal of attention to improving computer architectures. One classification of architectures is Flynn's taxonomy, which is based on the concepts of instruction stream and data stream. An instruction stream is a sequence of instructions executed by a processor, and a data stream is a sequence of data processed by an instruction stream. Depending on the multiplicity of instruction and data streams occurring in a computer, Flynn distinguished four classes of architectures (Figure 5.1).
Computers of SISD architecture, in brief SISD computers, are conventional computers wherein a processor executes a single instruction stream processing a single data stream. In modern processors, regularly more than one instruction is executed within a single clock cycle. Processors are equipped with a certain number of functional units enabling implementation of instructions in a pipelined fashion. Processors with multiple functional units are called superscalar.
Suppose that the process of executing an instruction consists of six sequentially performed microoperations (also termed microinstructions): fetch instruction (FI), decode instruction (DI), calculate operand address (CA), fetch operand (FO), execute instruction (EI), write result (WR). A sequence of microoperations making up the process of implementing an instruction is called a pipeline. Each microoperation in the sequence is also called a stage of the pipeline, so in our example we have a 6-stage pipeline. Assume that separate functional units (hardware circuitry) J1, J2, ..., J6 have been implemented in a processor to perform the particular microoperations.
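The overlap in such a pipeline is easy to tabulate. The sketch below prints, for each instruction, the cycle in which it occupies each of the six stages, assuming one cycle per stage and no stalls (our simplification):

    # Ideal schedule for the 6-stage pipeline FI-DI-CA-FO-EI-WR:
    # instruction k occupies stage s during cycle k + s, so after the
    # pipe fills, one instruction completes every cycle.

    STAGES = ["FI", "DI", "CA", "FO", "EI", "WR"]

    def schedule(n_instructions):
        return [{stage: k + s for s, stage in enumerate(STAGES)}
                for k in range(n_instructions)]

    for k, row in enumerate(schedule(4)):
        print(f"I{k}: " + "  ".join(f"{st}@c{c}" for st, c in row.items()))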
- Vojin G. Oklobdzija(Author)
- 2019(Publication Date)
- CRC Press(Publisher)
3.4 Power-Efficient Microarchitecture Paradigms
Now that we have examined specific microarchitectural constructs that aid power-efficient design, let us examine the inherent power-performance scalability and efficiency of selected paradigms that are currently established or are emerging in the high-end processor roadmap. In particular, we consider (1) wide-issue, speculative superscalar processors, (2) multicluster superscalars, (3) SMT processors, and (4) chip multiprocessors (CMPs): those that use single-program speculative multithreading, as well as those that are general multicore symmetric multiprocessing (SMP) or throughput engines. In illustrating the efficiency advantages or deficiencies, we use the following running example. It shows one iteration of a loop trace that we consider in simulating the performance and power characteristics across the above computing platforms. Let us consider the following floating-point loop kernel, shown below (coded using the PowerPC instruction set architecture):
Example loop test case
[P] [A] fadd fp3, fp1, fp0
[Q] [B] lfdu fp5, 8(r1)
[R] [C] lfdu fp4, 8(r3)
[S] [D] fadd fp4, fp5, fp4
[T] [E] fadd fp1, fp4, fp3
[U] [F] stfdu fp1, 8(r2)
[V] [G] bc loop_top
The loop body consists of seven instructions, the final one being a conditional branch that causes control to loop back to the top of the loop body. The instructions are labeled A through G. (The labels P through V are used to tag the corresponding instructions for a parallel thread, when we consider SMT and CMP.) The lfdu/stfdu instructions are load/store instructions with update, where the base address register (e.g., r1, r2, or r3) is updated after execution by holding the newly computed address.
3.4.1 Single-Core Superscalar Processor Paradigm
One school of thought anticipates a continued progression along the path of wider, aggressively superscalar paradigms.
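For readers who prefer high-level code, here is one plausible reading of what that kernel computes, rewritten in Python. The mapping of registers to streams (r1 and r3 walk the input arrays, r2 the output, fp0 a loop-invariant addend) is our interpretation of the listing, not something stated in the excerpt:

    # One iteration per loop trip; note the loop-carried dependence
    # through fp1, which limits how much ILP the kernel exposes.

    def loop_kernel(b, c, fp0=0.0, fp1=0.0):
        out = []
        for bi, ci in zip(b, c):
            fp3 = fp1 + fp0         # [A] fadd  fp3, fp1, fp0
            fp5 = bi                # [B] lfdu  fp5, 8(r1)
            fp4 = ci                # [C] lfdu  fp4, 8(r3)
            fp4 = fp5 + fp4         # [D] fadd  fp4, fp5, fp4
            fp1 = fp4 + fp3         # [E] fadd  fp1, fp4, fp3
            out.append(fp1)         # [F] stfdu fp1, 8(r2)
        return out                  # [G] bc    loop_top

    print(loop_kernel([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))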
- Michel Dubois, Murali Annavaram, Per Stenström(Authors)
- 2012(Publication Date)
- Cambridge University Press(Publisher)
However, we also explain extensions required for more complex instruction sets, such as the Intel x86, as the need arises. Since this book is about parallel architectures, we do not expose architectures that execute instructions one at a time. Thus the starting point is the 5-stage pipeline, which concurrently processes up to five instructions in every clock cycle. The 5-stage pipeline is a static pipeline in the sense that the order of instruction execution (or the schedule of instruction execution) is dictated by the compiler, an order commonly referred to as the program, thread, or process order, and the hardware makes no attempt to re-order the execution of instructions dynamically. The 5-stage pipeline exploits basic mechanisms, such as stalling, data forwarding, and pipeline stage flushing. These mechanisms are the fundamental hardware mechanisms exploited in all processor architectures and therefore must be fully understood.
The 5-stage pipeline can be extended to static superpipelined and superscalar processors. Superpipelined processors are clocked faster than the 5-stage pipeline, and some functions processed in one stage of the 5-stage pipeline are spread across multiple stages. Additionally, more complex instructions (such as floating-point instructions) can be directly pipelined in the processor execution unit. Static superscalar processors fetch and execute multiple instructions in every cycle. Static pipelines rely exclusively on compiler optimizations for their efficiency. Whereas the compiler has high-level knowledge of the code and can easily identify loops, for example, it misses some of the dynamic information available to hardware, such as memory addresses. Dynamically scheduled, out-of-order (OoO) processors can take advantage of both statically and dynamically available information. Out-of-order processors exploit the instruction-level parallelism (ILP) exposed by the compiler in each thread.
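The static-versus-dynamic point can be illustrated with a toy out-of-order issue loop: each cycle, any instruction whose sources are ready may issue, regardless of program order. The (dest, srcs) encoding and the three-instruction window are invented for this sketch:

    # Toy OoO issue: i1 depends on i0, i2 is independent, so i2
    # issues ahead of i1 even though it is later in program order.
    # Results are (unrealistically) available in the same cycle.

    window = [
        ("r1", ("r2", "r3")),   # i0: r1 = r2 op r3
        ("r4", ("r1", "r5")),   # i1: r4 = r1 op r5 (needs i0)
        ("r6", ("r7", "r8")),   # i2: r6 = r7 op r8 (independent)
    ]
    ready = {"r2", "r3", "r5", "r7", "r8"}

    cycle = 0
    while window:
        issued = [ins for ins in window if all(s in ready for s in ins[1])]
        for ins in issued:
            window.remove(ins)
            ready.add(ins[0])       # result becomes available
        print(f"cycle {cycle}: issued {[ins[0] for ins in issued]}")
        cycle += 1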
- Harry Wechsler(Author)
- 2014(Publication Date)
- Academic Press(Publisher)
We are most interested, however, in multifunctional pipelines and systolic architectures, as well as their image processing and computational vision applications.
9.2.1. Vector Supercomputers
Vector-processing architecture can be found among both attached scientific processors and supercomputers. The attached processors (AP) enhance the floating-point (FP) and vector-processing capabilities of the host computer. Multiplicity in the processor organization and concurrency of operations is achieved through software development of specific μ-coded packages, usually for matrix operations. Furthermore, the interface between the AP and the host might require data reformatting and programmed direct memory access (DMA). Supercomputers or vector-processing architectures are unhampered by the overhead usual for loop-control mechanisms. Four basic vector operations are handled: (i) vector to vector (such as A(I) = A(I)**2), (ii) vector to scalar (such as s = ΣA(I)), (iii) (vector × vector) to vector (such as A(I) = B(I) ± C(I)), and (iv) (vector × scalar) to vector (such as A(I) = constant * A(I)). Furthermore, basic compare operations that yield Boolean vectors are then used as masks for compress and/or merge tasks. Basic design issues include fast memory access using vector registers (and base registers, offsets, and vector lengths), the setup and flushing time already mentioned, and how the theoretical peak performance degrades to the average speed when the system is faced with a mixed load of tasks including the I/O bottleneck.
The CRAY generation is characteristic of vector processing, and its performance reflects setting multiple pipelines. A front-end (host) computer sets up computations and retrieves the results. The host is connected via multiple I/O channels to a CPU made up of a computation unit (CU) and an interleaved memory. The CU includes functional unit pipelines of the vector, floating point, scalar, and address type.
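The four operation classes map directly onto list operations; a small Python rendering for concreteness (the array values are arbitrary):

    # The four basic vector operations from the excerpt.
    A = [1.0, 2.0, 3.0]
    B = [4.0, 5.0, 6.0]
    C = [0.5, 0.5, 0.5]
    k = 2.0

    v_to_v  = [a ** 2 for a in A]            # (i)   A(I) = A(I)**2
    v_to_s  = sum(A)                         # (ii)  s = sum of A(I)
    vv_to_v = [b + c for b, c in zip(B, C)]  # (iii) A(I) = B(I) + C(I)
    vs_to_v = [k * a for a in A]             # (iv)  A(I) = constant * A(I)

    print(v_to_v, v_to_s, vv_to_v, vs_to_v)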
- (Author)
- 2014(Publication Date)
- Learning Press(Publisher)
The medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines.
Classes of parallel computers
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes. These classes are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.
Multicore computing
A multicore processor is a processor that includes multiple execution units (cores) on the same chip. These processors differ from superscalar processors, which can issue multiple instructions per cycle from one instruction stream (thread); by contrast, a multicore processor can issue multiple instructions per cycle from multiple instruction streams. Each core in a multicore processor can potentially be superscalar as well; that is, on every cycle, each core can issue multiple instructions from one instruction stream. Simultaneous multithreading (of which Intel's HyperThreading is the best known) was an early form of pseudo-multicoreism. A processor capable of simultaneous multithreading has only one execution unit (core), but when that execution unit is idling (such as during a cache miss), it uses that execution unit to process a second thread. IBM's Cell microprocessor, designed for use in the Sony PlayStation 3, is another prominent multicore processor.
Symmetric multiprocessing
A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors.
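A crude way to compare the issue models just described is the upper bound streams × issue width. The designs and widths below are hypothetical, and the bound ignores sharing conflicts, dependences, and memory:

    # Peak instructions per cycle = instruction streams * issue width.
    # These are theoretical ceilings only, chosen for illustration.

    designs = {
        "superscalar core (1 thread, 4-wide)":     (1, 4),
        "4 simple cores (4 threads, 1-wide each)": (4, 1),
        "4 superscalar cores (4 threads, 4-wide)": (4, 4),
    }
    for name, (streams, width) in designs.items():
        print(f"{name}: up to {streams * width} instructions/cycle")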
- (Author)
- 2014(Publication Date)
- College Publishing House(Publisher)
The medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines. Classes of parallel computers Parallel computers can be roughly classified according to the level at which the hard- ware supports parallelism. This classification is broadly analogous to the distance bet-ween basic computing nodes. These are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common. Multicore computing A multicore processor is a processor that includes multiple execution units (cores) on the same chip. These processors differ from superscalar processors, which can issue multiple instructions per cycle from one instruction stream (thread); by contrast, a multicore processor can issue multiple instructions per cycle from multiple instruction streams. Each core in a multicore processor can potentially be superscalar as well—that is, on every cycle, each core can issue multiple instructions from one instruction stream. Simultaneous multithreading (of which Intel's HyperThreading is the best known) was an early form of pseudo-multicoreism. A processor capable of simultaneous multithreading has only one execution unit (core), but when that execution unit is idling (such as during a cache miss), it uses that execution unit to process a second thread. IBM's Cell microprocessor, designed for use in the Sony PlayStation 3, is another prominent multicore processor. ________________________ WORLD TECHNOLOGIES ________________________ Symmetric multiprocessing A symmetric multiprocessor (SMP) is a computer system with multiple identical pro-cessors that share memory and connect via a bus. Bus contention prevents bus arc-hitectures from scaling. As a result, SMPs generally do not comprise more than 32 pro-cessors.