Composed of three sections, this book presents the most popular training algorithm for neural networks: backpropagation. The first section presents the theory and principles behind backpropagation as seen from different perspectives such as statistics, machine learning, and dynamical systems. The second presents a number of network architectures that may be designed to match the general concepts of Parallel Distributed Processing with backpropagation learning. Finally, the third section shows how these principles can be applied to a number of different fields related to the cognitive sciences, including control, speech recognition, robotics, image processing, and cognitive psychology. The volume is designed to provide both a solid theoretical foundation and a set of examples that show the versatility of the concepts. Useful to experts in the field, it should also be most helpful to students seeking to understand the basic principles of connectionist learning and to engineers wanting to add neural networks in general -- and backpropagation in particular -- to their set of problem-solving methods.
David E. Rumelhart, Richard Durbin, Richard Golden, and Yves Chauvin
Department of Psychology, Stanford University
INTRODUCTION
Since the publication of the PDP volumes in 1986, learning by backpropagation has become the most popular method of training neural networks. The reason for the popularity is the underlying simplicity and relative power of the algorithm. Its power derives from the fact that, unlike its precursors, the perceptron learning rule and the Widrow-Hoff learning rule, it can be employed for training nonlinear networks of arbitrary connectivity. Since such networks are often required for real-world applications, such a learning procedure is critical. Nearly as important as its power in explaining its popularity is its simplicity. The basic idea is old and simple; namely, define an error function and use hill climbing (or gradient descent, if you prefer going downhill) to find a set of weights which optimize performance on a particular task. The algorithm is so simple that it can be implemented in a few lines of code, and there have no doubt been many thousands of implementations of the algorithm by now.
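To illustrate this simplicity, the following is a minimal sketch of ours, not from the original text: it defines a squared-error function over a small one-hidden-layer sigmoid network and follows the error gradient downhill. The weight names, learning rate, and toy task are all illustrative assumptions.

```python
# A minimal sketch of the idea: define a squared error over a one-hidden-layer
# sigmoid network and step its weights down the gradient. Names (W1, W2, lr)
# and the XOR-like toy task are illustrative, not from the text.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # input vectors
d = (X[:, :1] * X[:, 1:] > 0).astype(float)   # targets (a nonlinear rule)

W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))
lr = 0.5
for step in range(2000):
    h = sigmoid(X @ W1)                       # hidden activations
    y = sigmoid(h @ W2)                       # network output
    err = y - d                               # dE/dy for E = 0.5*sum((y-d)^2)
    # Backpropagate the error through the output and hidden sigmoids.
    delta2 = err * y * (1 - y)
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ delta2) / len(X)        # gradient descent: step downhill
    W1 -= lr * (X.T @ delta1) / len(X)

y = sigmoid(sigmoid(X @ W1) @ W2)             # error after training
print("final mean squared error:", float(np.mean((y - d) ** 2)))
```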
The name back propagation actually comes from the term employed by Rosenblatt (1962) for his attempt to generalize the perceptron learning algorithm to the multilayer case. There were many attempts to generalize the perceptron learning procedure to multiple layers during the 1960s and 1970s, but none of them were especially successful. There appear to have been at least three independent inventions of the modern version of the back-propagation algorithm: Paul Werbos developed the basic idea in 1974 in a Ph.D. dissertation entitled “Beyond Regression,” and David Parker and David Rumelhart apparently developed the idea at about the same time in the spring of 1982. It was, however, not until the publication of the paper by Rumelhart, Hinton, and Williams in 1986 explaining the idea and showing a number of applications that it reached the field of neural networks and connectionist artificial intelligence and was taken up by a large number of researchers.
Although the basic character of the back-propagation algorithm was laid out in the Rumelhart, Hinton, and Williams paper, we have learned a good deal more about how to use the algorithm and about its general properties. In this chapter we develop the basic theory and show how it applies in the development of new network architectures.
We will begin our analysis with the simplest case, namely that of the feedforward network. The pattern of connectivity may be arbitrary (i.e., there need not be a notion of a layered network), but for our present analysis we will eliminate cycles. An example of such a network is illustrated in Figure 1.
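To make this generality concrete, the sketch below (ours, not from the text) computes the forward pass of an arbitrary acyclic network by visiting units in topological order, so every unit's inputs are computed before the unit itself. The particular graph, weights, and unit indices are hypothetical.

```python
# Forward pass over an arbitrary acyclic (not necessarily layered) network.
# The graph and weights here are a made-up example.
import math

# weights[(i, j)] is the weight on the connection from unit i to unit j
weights = {(0, 2): 0.5, (1, 2): -0.3, (0, 3): 0.8, (2, 3): 1.2}
inputs = {0: 1.0, 1: -2.0}          # clamped input units
order = [0, 1, 2, 3]                # any topological order of the DAG

activation = dict(inputs)
for j in order:
    if j in inputs:
        continue                    # input units keep their clamped values
    net = sum(activation[i] * w for (i, k), w in weights.items() if k == j)
    activation[j] = 1.0 / (1.0 + math.exp(-net))   # sigmoid unit
print(activation)
```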
For simplicity, we will also begin with a consideration of a training set which consists of a set of ordered pairs $\langle \vec{x}, \vec{d} \rangle$, where we understand each pair to represent an observation in which outcome $\vec{d}$ occurred in the context of event $\vec{x}$. The goal of the network is to learn the relationship between $\vec{x}$ and $\vec{d}$. It is useful to imagine that there is some unknown function relating $\vec{x}$ to $\vec{d}$, and we are trying to find a good approximation to this function. There are, of course, many standard methods of function approximation. Perhaps the simplest is linear regression. In that case, we seek the best linear approximation to the underlying function. Since multilayer networks are typically nonlinear, it is often useful to understand feedforward networks as performing a kind of nonlinear regression. Many of the issues that come up in ordinary linear regression are also relevant to the kind of nonlinear regression performed by our networks.
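For contrast with the nonlinear regression performed by a network, the small sketch below (ours, purely illustrative) fits the simplest approximator mentioned above, a linear regression, by closed-form least squares to samples from a hypothetical nonlinear function.

```python
# Ordinary linear regression: the best linear approximation to an unknown
# (here, made-up) nonlinear function, solved by least squares.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(50, 1))
d = np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)   # noisy observations

X = np.hstack([x, np.ones_like(x)])                  # add an intercept column
w, *_ = np.linalg.lstsq(X, d, rcond=None)            # best linear fit
print("slope, intercept:", w.ravel())
```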
One important example comes up in the case of “overfitting.” We may have too many predictor variables (or degrees of freedom) and too little training data. In this case, it is possible to do a great job of “learning” the data but a poor job of generalizing to new data. The ultimate measure of success is not how closely we approximate the training data, but how well we account for as yet unseen cases. It is possible for a sufficiently large network to merely “memorize” the training data. We say that the network has truly “learned” the function when it performs well on unseen cases. Figure 2 illustrates a typical case in which accounting exactly for noisy observed data can lead to worse performance on new data. Combating this “overfitting” problem is a major challenge for complex networks with many weights.
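The situation in Figure 2 is easy to reproduce numerically. The sketch below (ours; the degrees, noise level, and sample sizes are arbitrary choices) fits a low-degree and a high-degree polynomial to the same noisy sample and compares training error with error on unseen cases; the high-degree fit “memorizes” the training data yet predicts new data worse.

```python
# Overfitting in miniature: a degree-11 polynomial passes through all 12
# noisy training points (near-zero training error) but generalizes worse
# than a smoother degree-3 fit. All specifics are illustrative.
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.2 * rng.normal(size=n)

x_tr, d_tr = sample(12)          # small, noisy training set
x_te, d_te = sample(200)         # "as yet unseen" cases

for degree in (3, 11):
    coeffs = np.polyfit(x_tr, d_tr, degree)
    tr = np.mean((np.polyval(coeffs, x_tr) - d_tr) ** 2)
    te = np.mean((np.polyval(coeffs, x_te) - d_te) ** 2)
    print(f"degree {degree:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```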
Given the interpretation of feedforward networks as a kind of nonlinear regression, it may be useful to ask what features the networks have which might give them an advantage over other methods. For these purposes it is useful to compare the simple feedforward network with one hidden layer to the method of polynomial regression. In the case of polynomial regression we imagine that we transform the input variables $x_1, x_2, \ldots, x_n$ into a large number of variables by adding a number of the cross terms $x_1x_2, x_1x_3, \ldots, x_1x_2x_3, x_1x_2x_4, \ldots$. We can also add terms with higher powers $x_1^2, x_1^3, \ldots$, as well as cross terms with higher powers. In doing this we can, of course, approximate any output surface we please. Given that we can produce any output surface with a simple polynomial regression model, why should we want to use a multilayer network? The structures of these two networks are shown in Figure 3.
Figure 1. A simple three-layer network. The key to the effectiveness of the multilayer network is that the hidden units learn to represent the input variables in a task-dependent way.
Figure 2. Even though the oscillating line passes directly through all of the data points, the smooth line would probably be the better predictor if the data were noisy.
We might suppose that the feedforward network would have an advantage in that it might be able to represent a larger function space with fewer parameters. This does not appear to be true. Roughly, it seems that the “capacity” of both networks is proportional to the number of parameters in the network (cf. Cover, 1965; Mitchison & Durbin, 1989). The real difference is in the different kinds of constraints the two representations impose. Notice that for the polynomial network the number of possible terms grows rapidly with the size of the input vector. It is not, in general, possible even to use all of the first-order cross terms, since there are n(n + 1)/2 of them. Thus, we need to be able to select the subset of input variables that are most relevant, which often means selecting the lower-order cross terms and thereby representing only the pairwise or, perhaps, three-way interactions.
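One way to see the n(n + 1)/2 count: the n input variables themselves plus the n(n - 1)/2 pairwise cross terms total n(n + 1)/2 first-order terms. The short sketch below (ours, purely illustrative) enumerates them for a few input sizes to show how quickly the expansion grows.

```python
# Counting first-order terms: the n raw variables plus all pairwise cross
# terms x_i * x_j (i < j) give n*(n+1)/2 terms in total.
from itertools import combinations

def first_order_terms(names):
    return list(names) + [a + b for a, b in combinations(names, 2)]

for n in (4, 10, 100):
    names = [f"x{i}" for i in range(1, n + 1)]
    print(n, "inputs ->", len(first_order_terms(names)), "terms")  # n*(n+1)/2
```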
Figure 3. Two networks designed for nonlinear regression problems. The multilayer network has a set of hidden units designed to discover a “low-order” representation…
Table of contents
Front Cover
Half Title
DEVELOPMENTS IN CONNECTIONIST THEORY
Title Page
Copyright
Contents
Preface
1. Backpropagation: The Basic Theory
2. Phoneme Recognition Using Time-Delay Neural Networks
3. Automated Aircraft Flare and Touchdown Control Using Neural Networks
4. Recurrent Backpropagation Networks
5. A Focused Backpropagation Algorithm for Temporal Pattern Recognition
6. Nonlinear Control with Neural Networks
7. Forward Models: Supervised Learning with a Distal Teacher
8. Backpropagation: Some Comments and Variations
9. Graded State Machines: The Representation of Temporal Contingencies in Feedback Networks
10. Spatial Coherence as an Internal Teacher for a Neural Network
11. Connectionist Modeling and Control of Finite State Systems Given Partial State Information
12. Backpropagation and Unsupervised Learning in Linear Networks
13. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity
14. When Neural Networks Play Sherlock Holmes
15. Gradient Descent Learning Algorithms: A Unified Perspective
Author Index
Subject Index