Part 1
Multicore and Many-Core (MC) Systems-On-Chip
1
A Reconfigurable On-Chip Interconnection Network for Large Multicore Systems
Mehdi Modarressi and Hamid Sarbazi-Azad
Contents
1.1 Introduction
1.1.1 Multicore and Many-Core Era
1.1.2 On-Chip Communication
1.1.3 Conventional Communication Mechanisms
1.1.4 Network-on-Chip
1.1.5 NoC Topology Customization
1.1.6 NoCs and Topology Reconfigurations
1.1.7 Reconfiguration Policy
1.2 Topology and Reconfiguration
1.3 The Proposed NoC Architecture
1.3.1 Baseline Reconfigurable NoC
1.3.2 Generalized Reconfigurable NoC
1.4 Energy and Performance-Aware Mapping
1.4.1 The Design Procedure for the Baseline Reconfigurable NoC
1.4.1.1 Core-to-Network Mapping
1.4.1.2 Topology and Route Generation
1.4.2 Mapping and Topology Generation for Cluster-Based NoC
1.5 Experimental Results
1.5.1 Baseline Reconfigurable NoC
1.5.2 Performance Evaluation with Cost Constraints
1.5.3 Comparison with Cluster-Based NoC
1.6 Conclusion
References
1.1 Introduction
1.1.1 Multicore and Many-Core Era
With the recent scaling of semiconductor technology, coupled with the ever-increasing demand for high-performance computing in embedded, desktop, and server systems, general-purpose microprocessors have moved from single-core to multicore, and eventually to many-core, architectures containing tens to hundreds of identical cores [1]. Major manufacturers already ship 10-core [2], 16-core [3, 4], and 48-core [5] chip multiprocessors, while some special-purpose processors have pushed the limit further to 188 [6], 200 [7], and 336 [8] cores.
Following the same trend, multicore systems-on-chip (SoCs) have grown in size and complexity and now consist of tens to hundreds of logic blocks of different types communicating with each other at very high data rates.
1.1.2 On-Chip Communication
As the core count scales up, the rate and complexity of intercore communication increase dramatically. Consequently, the efficiency of the on-chip communication mechanism has emerged as a critical determinant of the overall performance of complex multicore systems-on-chip (SoCs) and chip multiprocessors (CMPs). Beyond performance considerations, the on-chip interconnect of a conventional SoC or CMP accounts for a considerable fraction of the consumed power, and this fraction is expected to grow with every new technology node. The advent of deep submicron and nanoscale technologies and of supply voltage scaling also brings about several signal integrity and reliability issues [9]. As a result, interconnect design poses a whole new set of challenges for SoC and CMP designers.
1.1.3 Conventional Communication Mechanisms
Conventional small-scale SoCs and CMPs use legacy buses and ad hoc dedicated links to manage on-chip traffic. With dedicated point-to-point links, intercore data travel on dedicated wires that directly connect the two end-point cores. Such links can potentially yield ideal performance and power results when connecting a few cores. However, as the number of on-chip components increases, this scheme requires a huge amount of wiring to connect every pair of components directly, with the wires utilized, on average, less than 10% of the time [10]. Consequently, poor scalability due to considerable area overhead is a prohibitive drawback of dedicated links. In addition, dedicated wires in submicron and nanoscale technologies need special attention to manage hard-to-predict power, signal integrity, and performance issues. Furthermore, due to their ad hoc nature, dedicated links are not reusable. These issues make design effort the second major drawback of dedicated wires.
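To put the wiring overhead in perspective, the following back-of-the-envelope sketch (our own Python illustration, not a figure from [10]; all names and parameters are hypothetical) compares the number of links required by full point-to-point wiring against the link count of a simple 2D-mesh network:

    # Dedicated point-to-point wiring: every core pair needs its own link.
    def dedicated_links(n_cores: int) -> int:
        return n_cores * (n_cores - 1) // 2

    # A 2D mesh links only adjacent nodes: rows*(cols-1) + cols*(rows-1).
    def mesh_links(rows: int, cols: int) -> int:
        return rows * (cols - 1) + cols * (rows - 1)

    for k in (4, 8, 16):  # k x k arrays of cores
        n = k * k
        print(f"{n:4d} cores: {dedicated_links(n):6d} dedicated links"
              f" vs {mesh_links(k, k):4d} mesh links")

For 64 cores, the dedicated scheme already needs 2016 links against 112 for the mesh, and the gap grows quadratically with the core count.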
Bus architectures are the most common and cost-effective on-chip communication solution for traditional multicore SoCs and CMPs with a modest number of processors. However, bus-based communication schemes, even those employing hierarchies of buses, can support only a few concurrent communications. Connecting more components to a shared bus also leads to long bus wires, which in turn cause considerable energy overhead and unmanageable clock skew. Therefore, when the number of communicating devices is high, bus-based systems show poor power and performance scalability [9]. These scalability problems continue to grow as technology advances allow more cores to be integrated on a single chip. Indeed, the scalability and bandwidth limits of the bus have already forced a shift in the board-level interchip communication paradigm, where the widely used PCI bus has been replaced by the switch-based PCI Express network.
On-chip communication has traveled the same path over the past decades: the problems of buses and dedicated links, together with the proven efficiency of packet-based interconnection networks in parallel machines, motivated researchers to propose switch-based networks-on-chip (NoCs) to connect the cores in a high-performance, flexible, scalable, and reusable manner [10–12].
1.1.4 Network-on-Chip
Networks-on-chip have now evolved from an interesting area of research into a viable industrial solution for multicore processors ranging from high-end server processors [5] to embedded SoCs [13]. The building blocks of an on-chip network are the routers at every node, interconnected by short local on-chip wires. Routers multiplex multiple communication flows (in the form of data packets) over the links and manage the traffic in a distributed fashion. Relying on a modular and scalable infrastructure, NoCs can potentially deliver high-bandwidth, low-latency, and low-power communication; from the communication perspective, this makes it feasible to integrate many components on a single chip.
The benefits of NoCs in providing scalable and high-bandwidth communication are substantial. However, the need for complex, multistage pipelined routers presents several challenges in reaching the potential latency and throughput of NoCs within their tight area and power budgets. The authors of [1] show that the bandwidth demands of future server and embedded applications are expected to grow greatly, and project that in future CMPs and multicore SoCs, the power consumption of NoCs implemented with current methodologies will be about 10 times greater than the power budget that can be devoted to them. Therefore, much research has focused on improving NoC efficiency to bridge the gap between current and ideal NoC power/performance metrics.
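As a rough illustration of why multistage router pipelines matter, packet latency can be approximated by a textbook-style first-order model (our own sketch, not a result from [1]; all parameter names are hypothetical): the head of the packet pays the full router pipeline at every hop, plus one cycle per link, plus the serialization delay of the packet body.

    # First-order latency model: H hops, P pipeline stages per router,
    # one cycle per link traversal, plus serialization of the packet body.
    def packet_latency_cycles(hops: int, stages: int,
                              packet_bits: int, link_bits: int) -> int:
        router_delay = hops * stages                  # cycles inside routers
        link_delay = hops                             # one cycle per link
        serialization = -(-packet_bits // link_bits)  # ceiling division
        return router_delay + link_delay + serialization

    # Example: 6 hops through 4-stage routers, 512-bit packet, 128-bit links:
    # 6*4 + 6 + ceil(512/128) = 34 cycles.
    print(packet_latency_cycles(6, 4, 512, 128))

Under this model, shortening the router pipeline or reducing the hop count cuts latency linearly, which is exactly what the customization techniques discussed next aim to do.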
Application-specific optimization is one of the most effective methods to increase the efficiency of the NoC [1]. This class of optimization methods tries to customize the architecture and characteristics of an NoC for a target application. These methods can work at either design time, if the application and its traffic characteristics are known in advance (which is the case for most embedded applications running on multicore SoCs), or at run time for the NoCs used in general-purpose CMPs.
There has been substantial research on application-specific optimization of NoCs, ranging from simple methods that update routing tables for each application to sophisticated schemes for router microarchitecture and topology reconfiguration [14].
1.1.5 NoC Topology Customization
The performance of an NoC is extremely sensitive to its topology, which determines the placement and connectivity of the network nodes. The topology, consequently, is an important target for many NoC customization methods. An equally important problem in specialized multicore SoCs is core-to-NoC-node mapping, which determines on which NoC node each processing core is physically placed. Mapping algorithms generally try to place frequently communicating cores near each other; when the number of intermediate routers between two communicating cores is reduced, the power consumption and latency of the communication between them decrease proportionally.
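A minimal sketch of such a mapping heuristic is given below (our own illustrative Python, not the chapter's algorithm; all names are hypothetical): cores are placed on a 2D mesh one at a time, heaviest communicators first, each on the free node that minimizes the traffic-weighted hop distance to the cores already placed.

    from itertools import product

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def greedy_map(traffic, dim):
        """traffic[i][j]: communication volume between cores i and j;
        returns a dict mapping each core to a (row, col) mesh node."""
        free = set(product(range(dim), range(dim)))
        placement = {}
        # Place heavy communicators first so they can claim adjacent nodes.
        for core in sorted(range(len(traffic)), key=lambda c: -sum(traffic[c])):
            best = min(free, key=lambda node: sum(
                traffic[core][other] * manhattan(node, pos)
                for other, pos in placement.items()))
            placement[core] = best
            free.remove(best)
        return placement

    def comm_cost(traffic, placement):
        """First-order cost: total volume x hop distance over all core pairs."""
        return sum(traffic[i][j] * manhattan(placement[i], placement[j])
                   for i in placement for j in placement if i < j)

Minimizing this volume-times-hops cost is a common first-order objective; the mapping and topology generation procedure actually used in this chapter is developed in Section 1.4.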
Topology and mapping deal with the physical placement of network nodes and links. As a result, the mapping and topology cannot be modified once the chip is fabricated and remain unchanged over the system's lifetime. Due to this physical constraint, most current design flows for application-specific multicore SoCs are effective only in providing design-time mapping and topology optimization for a single application [15–18]. In other words, they generate and synthesize an optimized topology and mapping based on the traffic pattern of a single application.
This is problematic for today's multicore SoCs, which run several different applications (often unknown at design time). Since intercore communication characteristics can differ greatly across applications, a topology designed for the traffic pattern of one application does not necessarily meet the design constraints of the others. Even the traffic generated by a single application may vary significantly across different phases of its operation. For example, the IEEE 802.11n standard (WiFi) supports 144 communication modes, each with different communication demands among cores [19]. In [20], more than 1500 different NoC configurations (topology, buffer size, and so on) were investigated, and it was shown that no single NoC configuration provides optimal performance across a range of applications.
1.1.6 NoCs and Topology Reconfigurations
In this chapter, we introduce a NoC with reconfigurable topology, which ...