CHAPTER 1
Extreme Heterogeneity in Deep Learning Architectures
Jeff Anderson, Armin Mehrabian, Jiaxin Peng, and Tarek El-Ghazawi
The George Washington University
CONTENTS
1 Introduction
1.1 Deep Learning
1.2 Deep Learning Operations
1.3 Deep Learning Network Communications
1.3.1 Data and Model Parallelism
2 Hardware Architectures for Deep Learning
2.1 Microprocessors
2.2 Digital Signal Processors
2.3 Graphics Processing Units
2.4 Coarse Grained Reconfigurable Architectures
2.5 Tensor Processing Units
2.6 Mapping Deep Learning to an Architecture
3 FPGAs in Deep Learning
3.1 FPGA Optimizations for NN Operations
3.1.1 Communications Path Optimizations
3.1.2 Processing Element Optimizations
3.1.3 NN Model Optimizations
3.2 FPGA Reconfigurability for NNs
3.2.1 FPGA Reconfigurability in Inference Processing
3.2.2 FPGA Reconfigurability in NN Training
4 Discussion
4.1 Neural Network Models for Heterogeneous Systems
5 Conclusion
Further Reading
References
1 Introduction
Within the past few years, electronic devices that process voice commands have become ubiquitous in society. Some of these devices, such as the Amazon Echo and the Google Home, provide information to users and control their homes, while other devices, such as cellular phones, are more mobile in nature but perform similar functions.
The recent success of voice-activated electronics can be attributed to the field of Machine Learning (ML), and more specifically to the development of Convolutional Neural Networks (CNNs) and Deep Learning (DL). Due to the development of these advanced techniques, neural networks (NNs) can now successfully perform classification tasks such as far-field voice recognition, speech-to-text translation, natural language processing, and computer vision [1].
Researchers are now turning NN-based systems to other application areas, such as identification of radio frequency (RF) wave modulation [2–4]. These functions are building blocks for higher level functionality, such as cognitive radio and cognitive radar, where efficiency is gained through automatic adjustments made in response to the system’s knowledge of its current RF environment [3].
Current research in ML spans different application areas and efficient training methods, but the bulk of the work targets NNs implemented on server-based computing clusters [1–8]. This is a poor match for cognitive radio and cognitive radar, which are implemented as small, embedded systems where real-time performance is expected and power and energy consumption are watched very closely [9,10]. These systems are characterized by constraints on compute resources, size, weight, and power, and follow a different usage model than servers [11]: instead of increasing efficiency by servicing multiple users in a batch-processing fashion, they must execute one task at a time quickly and efficiently.
Before implementing NNs on embedded computing platforms, it is worth reviewing the state-of-the-art hardware architectures that implement NNs for various applications. The remainder of this section reviews Deep Neural Networks (DNNs) and advances in ML. Section 2 summarizes hardware architectures that are likely to be useful for NN implementations in embedded systems. Section 3 then examines field-programmable gate arrays (FPGAs) in DL, and Section 4 discusses the future of heterogeneous embedded systems in DL.
1.1 Deep Learning
Machine Learning is the practice of enabling computers to learn how to perform a task through exposure to data, as opposed to simply executing routines explicitly coded to accomplish specific tasks. During their learning phase, machines go through exhaustive iterations of a training procedure; over the course of the training, they try to minimize the error arising from the mismatch between their predictions and the ground truth.
Since their introduction by McCulloch and Pitts in 1943 [12], NNs have become the primary architecture used for machine learning applications. These networks comprise multiple layers of neurons, where the connections between layers are selected to maximize the likelihood of a correct classification. Initially an academic curiosity, NNs have recently gained mainstream acceptance due to their successes in solving many complex artificial intelligence (AI) problems. In 2006, Hinton et al. [13] proposed a method to train an NN with many layers (hence "deep"), which had not been feasible beforehand. This field, known as Deep Learning, put NNs back in the spotlight and has become the mainstream solution to many AI problems including, but not limited to, voice and speech recognition [1], image classification/segmentation [5,6], natural language processing (NLP) [14,15], gaming AI [10], and analysis of particle physics data [16].
Almost all types of ML (including DL) go through two phases of operation, namely training and inference. During the training phase, an NN learns to perform a particular task: input vectors are presented and the network's output is checked against the ground truth. Errors are propagated backward through the network, from the output layer to the input layer (back-propagation), and the parameters (weights) of each layer are adjusted so that the output moves toward the desired result. This is repeated over a large training data set until the NN reaches an acceptable probability of correct classification. The trained NN is then used to perform the designated task during the inference phase.
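The two phases can be sketched for the simplest possible case, a single sigmoid neuron trained by gradient descent (the one-layer degenerate case of back-propagation). The task (logical AND), learning rate, and iteration count below are illustrative assumptions, not from the chapter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # input vectors
y = np.array([0., 0., 0., 1.])                          # ground truth (AND)

w = rng.normal(size=2)  # weights, randomly initialized
b = 0.0                 # bias

# Training phase: adjust the weights to reduce the mismatch between
# the network output and the ground truth (gradient descent on MSE).
for _ in range(5000):
    out = sigmoid(X @ w + b)
    delta = (out - y) * out * (1 - out)  # error signal at the output
    w -= 0.5 * (X.T @ delta)
    b -= 0.5 * np.sum(delta)

# Inference phase: the trained weights are frozen and simply reused.
pred = (sigmoid(X @ w + b) > 0.5).astype(int)
```

In a multi-layer (deep) network, the same error signal is propagated backward through every layer via the chain rule, which is what back-propagation automates.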
Different types of NNs, called NN models, such as CNNs, Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks, have been shown to be efficient for specific classification tasks, and each has its own set of operations with diverse computational and communication requirements. For instance, CNNs are widely used to solve image classification problems in DL, as described by Champlin et al. [17]. The input of a CNN is typically an image, and the CNN uses several filters, each comprising multiple neurons, to derive feature maps, which capture the distinguishing features for image classification tasks. There are hundreds of filters in each convolutional layer, and each CNN has several convolutional layers. In fact, 2D convolution, implemented as multiply-accumulate (MAC) operations inside of a neuron, occupies more than 90% of the CNN computation time [18]. After flowing through several convolutional layers, the feature maps are sent to several fully connected layers, where the CNN produces classification categories.
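The 2D convolution that dominates CNN computation can be written as nothing but nested MAC operations, which is why MAC throughput is the key hardware metric for these networks. A minimal sketch (the 4×4 input and 2×2 filter sizes are illustrative assumptions):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution expressed as explicit MAC operations."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    fmap = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each feature-map pixel costs kh*kw multiply-accumulates
            # over the filter's receptive field in the input image.
            for u in range(kh):
                for v in range(kw):
                    fmap[i, j] += image[i + u, j + v] * kernel[u, v]
    return fmap

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1., 0.],
                   [0., -1.]])  # simple diagonal-difference filter
fmap = conv2d(image, kernel)
```

A real convolutional layer repeats this for hundreds of filters across many channels, which is where the >90% computation share cited above comes from.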
RNNs and LSTMs, while similar to CNNs from the standpoint of network architecture, introduce a time dependency to the network: a neuron's output is stored and then fed back into the neuron during subsequent calculations. Time-dependent networks such as these have proven useful for natural language processing tasks [2].
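The feedback path that defines an RNN can be sketched as a hidden state carried across time steps; the layer sizes and tanh activation below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
W_x = rng.normal(scale=0.5, size=(3, 2))  # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(3, 3))  # hidden-to-hidden feedback weights
b = np.zeros(3)

def rnn_step(x, h_prev):
    # The stored output from the previous time step (h_prev) is fed
    # back in alongside the current input -- the RNN time dependency.
    return np.tanh(W_x @ x + W_h @ h_prev + b)

h = np.zeros(3)  # initial hidden state
sequence = [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])]
for x in sequence:
    h = rnn_step(x, h)  # h now summarizes everything seen so far
```

An LSTM replaces `rnn_step` with a gated cell that controls what is written to and read from the stored state, which eases training over long sequences.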
While the focus of the DL community has primarily been on functionality and the introduction of novel DL approaches, hardware and software performance optimizations of the existing NN models are now receiving more attention. Each new implementation attempts to optimize specific facets of performance, from specific network architectures designed to reduce latency and increase throughput, to computation accelerators designed to increase the performance of specific calculations. Prior to elaborating on specific architectures, it is beneficial to understand which facets of DL can benefit most from optimization.
1.2 Deep Learning Operations
The most obvious facet of DL that can benefit from optimization is the weighted summation function, usually implemented as a MAC. Each neuron in each layer is required to perform...