TensorFlow Reinforcement Learning Quick Start Guide
eBook - ePub

Get up and running with training and deploying intelligent, self-learning agents using Python

  1. 184 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Leverage the power of TensorFlow to create powerful software agents that can self-learn to perform real-world tasks

Key Features

  • Explore efficient Reinforcement Learning algorithms and code them using TensorFlow and Python
  • Train Reinforcement Learning agents for problems ranging from computer games to autonomous driving
  • Formulate and apply selected algorithms and techniques in your applications in no time

Book Description

Advances in reinforcement learning algorithms have made it possible to use them for optimal control in several different industrial applications. With this book, you will apply Reinforcement Learning to a range of problems, from computer games to autonomous driving.

The book starts by introducing you to essential Reinforcement Learning concepts such as agents, environments, rewards, and advantage functions. You will master the distinctions between on-policy and off-policy algorithms, as well as between model-free and model-based algorithms. You will then learn about several Reinforcement Learning algorithms, such as SARSA, Deep Q-Networks (DQN), Deep Deterministic Policy Gradients (DDPG), Asynchronous Advantage Actor-Critic (A3C), Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO). The book will show you how to code these algorithms in TensorFlow and Python and apply them to solve computer games from OpenAI Gym. Finally, you will learn how to train a car to drive autonomously in the Torcs racing car simulator.

By the end of the book, you will be able to design, build, train, and evaluate feed-forward neural networks and convolutional neural networks. You will also have mastered coding state-of-the-art algorithms and training agents for various control problems.

What you will learn

  • Understand the theory and concepts behind modern Reinforcement Learning algorithms
  • Code state-of-the-art Reinforcement Learning algorithms with discrete or continuous actions
  • Develop Reinforcement Learning algorithms and apply them to training agents to play computer games
  • Explore DQN, DDQN, and Dueling architectures to play Atari's Breakout using TensorFlow
  • Use A3C to play CartPole and LunarLander
  • Train an agent to drive a car autonomously in a simulator

Who this book is for

Data scientists and AI developers who wish to quickly get started with training effective reinforcement learning models in TensorFlow will find this book very useful. Prior knowledge of machine learning and deep learning concepts (as well as exposure to Python programming) is expected.


Deep Q-Network

Deep Q-Networks (DQNs) revolutionized the field of reinforcement learning (RL). I am sure you have heard of Google DeepMind, which used to be a British company called DeepMind Technologies until Google acquired it in 2014. In 2013, DeepMind published a paper titled Playing Atari with Deep Reinforcement Learning, where they used Deep Neural Networks (DNNs) in the context of RL – the resulting networks are referred to as DQNs, an idea that is seminal to the field. This paper revolutionized the field of deep RL, and the rest is history! Later, in 2015, they published a second paper, titled Human-level Control through Deep Reinforcement Learning, in Nature, which contained further ideas that improved on the former paper. Together, the two papers led to a Cambrian explosion in the field of deep RL, with several new algorithms that have improved the training of agents using neural networks, and have also pushed the limits of applying deep RL to interesting real-world problems.
In this chapter, we will investigate a DQN and also code it using Python and TensorFlow. This will be our first use of deep neural networks in RL. It will also be our first effort in this book to use deep RL to solve real-world control problems.
In this chapter, the following topics will be covered:
  • Learning the theory behind a DQN
  • Understanding target networks
  • Learning about the replay buffer
  • Getting introduced to the Atari environment
  • Coding a DQN in TensorFlow
  • Evaluating the performance of a DQN on Atari Breakout

Technical requirements

Knowledge of the following will help you to better understand the concepts presented in this chapter:
  • Python (version 2 and above)
  • NumPy
  • TensorFlow (version 1.4 or higher)

Learning the theory behind a DQN

In this section, we will look at the theory behind a DQN, including the math behind it, and learn the use of neural networks to evaluate the value function.
Previously, we looked at Q-learning, where Q(s,a) was stored and evaluated as a multi-dimensional array, with one entry for each state-action pair. This worked well for grid-world and cliff-walking problems, both of which are low-dimensional in both the state and action spaces. So, can we apply this to higher-dimensional problems? Well, no, due to the curse of dimensionality, which makes it infeasible to store a very large number of states and actions. Moreover, in continuous control problems, the actions vary as real numbers in a bounded range, so an infinite number of actions are possible, which cannot be represented as a tabular Q array. This gave rise to function approximation in RL, particularly with the use of DNNs – that is, DQNs. Here, Q(s,a) is represented as a DNN that outputs the value of Q.
The following are the steps that are involved in a DQN:
  1. Update the state-action value function using the Bellman equation, where (s, a) are the state and action at a time t; s' and a' are, respectively, the state and action at the subsequent time t+1; and γ is the discount factor:

     Q(s, a) = E[r + γ max_a' Q(s', a')]

  2. We then define a loss function at iteration step i to train the Q-network as follows:

     L_i(θ_i) = E[(y_i - Q(s, a; θ_i))²]

     Here, θ denotes the neural network parameters, hence the Q-value is written as Q(s, a; θ).
  3. y_i is the target for iteration i, and is given by the following equation:

     y_i = r + γ max_a' Q(s', a'; θ_{i-1})

  4. We then train the neural network of the DQN by minimizing this loss function L(θ) using optimization algorithms, such as gradient descent, RMSprop, and Adam.
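The target and loss computations from the steps above can be sketched in plain NumPy (an illustrative sketch with assumed array shapes, not the book's TensorFlow implementation; here `q_next` holds the target Q-values for every action in the next state):

```python
import numpy as np

def dqn_targets(rewards, q_next, dones, gamma=0.99):
    """y_i = r + gamma * max_a' Q(s', a'); the bootstrap term is
    zeroed for terminal transitions (done == 1)."""
    max_q_next = q_next.max(axis=1)                 # max over actions a'
    return rewards + gamma * max_q_next * (1.0 - dones)

def l2_loss(targets, q_taken):
    """Mean squared (L2) loss between the targets y_i and Q(s, a; theta)."""
    return np.mean((targets - q_taken) ** 2)

# Toy batch of three transitions; the last one is terminal.
rewards = np.array([1.0, 0.0, 1.0])
q_next  = np.array([[0.5, 2.0],                     # Q(s', a') per action
                    [1.0, 0.0],
                    [0.3, 0.4]])
dones   = np.array([0.0, 0.0, 1.0])
y = dqn_targets(rewards, q_next, dones)             # [2.98, 0.99, 1.0]
```

In a full implementation, the gradient of this loss with respect to θ is what the optimizer (gradient descent, RMSprop, or Adam) steps along.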
We used the least-squares loss previously for the DQN loss function, also referred to as the L2 loss. You can also consider other losses, such as the Huber loss, which combines the L1 and L2 losses: it behaves like the L2 loss in the vicinity of zero and like the L1 loss in regions far away from zero. The Huber loss is less sensitive to outliers than the L2 loss.
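As a sketch, the Huber loss can be written as follows (the threshold `delta` marks the crossover from quadratic to linear; `delta = 1.0` is an assumed default):

```python
import numpy as np

def huber_loss(error, delta=1.0):
    """Quadratic (L2-like) for |error| <= delta, linear (L1-like) beyond,
    so large outlier errors contribute less than under the L2 loss."""
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)
```

Note that the two branches meet smoothly at |error| = delta, which is what keeps gradients well-behaved near zero while capping the influence of outliers.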
We will now look at the use of target networks. This is a very important concept, required to stabilize training.

Understanding target networks

An interesting feature of a DQN is the utilization of a second network during the training procedure, which is referred to as the target network. This second network is used for generating the target Q-values that are used to compute the loss function during training. Why not just use one network for both estimations, that is, for choosing the action a to take, as well as for updating the Q-network? The issue is that, at every step of training, the Q-network's values change, and if we use a constantly changing set of values to update our network, then the estimations can easily become unstable – the network can fall into feedback loops between the target and estimated Q-values. In order to mitigate this instability, the target network's weights are held fixed for long periods, or updated only slowly toward the primary Q-network's values. This leads to training that is far more stable and practical.
We have a second neural network, which we will refer to as the target network. It is identical in architecture to the primary Q-network, although its parameter values differ. Once every N steps, the parameters are copied from the Q-network to the target network; for example, N = 10,000 steps can be used. This results in stable training. Another option is to slowly update the weights of the target network (here, θ is the Q-network's weights, and θt is the target network's weights):

θt ← τθ + (1 - τ)θt

Here, τ is a small number, say, 0.001. This latter approach of using an exponential moving average is the preferred choice in this book.
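Both update schemes can be sketched with the weights stored as lists of NumPy arrays (an illustrative sketch; real code would assign to the TensorFlow variables instead):

```python
import numpy as np

def hard_update(theta):
    """Copy the Q-network weights into the target network (done every N steps)."""
    return [w.copy() for w in theta]

def soft_update(theta, theta_t, tau=0.001):
    """theta_t <- tau * theta + (1 - tau) * theta_t: an exponential
    moving average of the Q-network weights."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(theta, theta_t)]
```

With τ = 0.001, the target network lags roughly 1,000 steps behind the Q-network, which is what damps the feedback loop between the target and the estimated Q-values.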
Let's now learn about the use of the replay buffer in off-policy algorithms.

Learning about the replay buffer

We need the tuple (s, a, r, s', done) for updating the DQN, where s and a are, respectively, the state and action at time t; s' is the new state at time t+1; and done is a Boolean value that is False while the episode is still in progress and True once it has ended, also referred to as the terminal value in the literature. This Boolean done or terminal variable is used so that, in the Bellman update, the last terminal state of an episode is properly handled (since we cannot do an r + γ max Q(s', a') update for the terminal state). One problem in DQNs is that contiguous samples of the (s, a, r, s', done) tuple are strongly correlated, and so training on them can overfit.
To mitigate this issue, a replay buffer is used, where the tuple (s, a, r, s', done) is stored from experience, and a mini-batch of such experiences is randomly sampled from the replay buffer and used for training. This ensures that the samples drawn for each mini-batch are approximately independent and identically distributed (IID). Usually, a large replay buffer is used, say, 500,000 to 1 million samples. At the beginning of training, the replay buffer is first filled to a sufficient number of samples, and it is continually populated with new experiences thereafter. Once the replay buffer reaches its maximum number of samples, the older samples are discarded one by one. This is because the older samples were generated by an inferior policy, and are not desired for training at a later stage, once the agent has advanced in its learning.
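A minimal uniform replay buffer can be sketched with a deque (an illustrative sketch; the class name and default capacity are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s', done) tuples; once full,
    the oldest samples are discarded as new experiences arrive."""
    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # contiguous transitions, giving approximately IID mini-batches.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The `deque(maxlen=...)` handles the one-by-one eviction of old samples automatically, so no explicit write-pointer bookkeeping is needed.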
In a more recent paper, DeepMind came up with a prioritized replay buffer, where the absolute value of the temporal difference error is used to give importance to a sample in the buffer. Thus, samples with higher errors have a higher priority and so have a bigger chance of being sampled. This prioritized replay buffer results in faster learning than the vanilla replay buffer. However, it is slightly harder to code, as it uses a SumTree data structure, which is a binary tree where the value of every parent node is the sum of the values of its two child nodes. This prioritized experience replay will not be discussed further for now!
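Although we will not use it here, the SumTree data structure mentioned above can be sketched with a flat array (a sketch of the data structure only, assuming a power-of-two capacity; both updating a priority and sampling proportionally to priority are O(log n)):

```python
import numpy as np

class SumTree:
    """Array-backed binary tree: each parent stores the sum of its two
    children, so the root holds the total priority mass."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves

    def update(self, leaf, priority):
        """Set the priority of a leaf and propagate the change to the root."""
        idx = leaf + self.capacity - 1
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx > 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def sample(self, value):
        """Walk down from the root; returns the leaf whose cumulative
        priority interval contains `value` (0 <= value <= total())."""
        idx = 0
        while idx < self.capacity - 1:          # until we reach a leaf
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx - (self.capacity - 1)

    def total(self):
        return self.tree[0]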
The prioritized experience replay buffer is based on this DeepMind paper: https://arxiv.org/abs/1511.05952
We will now look into the Atari environment. If you like playing video games, you will love this section!

Getting introduced to the Atari environment

The Atari 2600 game suite was originally released in the 1970s, and was a big hit at that time. It involves several games that are p...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. About Packt
  5. Contributors
  6. Preface
  7. Up and Running with Reinforcement Learning
  8. Temporal Difference, SARSA, and Q-Learning
  9. Deep Q-Network
  10. Double DQN, Dueling Architectures, and Rainbow
  11. Deep Deterministic Policy Gradient
  12. Asynchronous Methods - A3C and A2C
  13. Trust Region Policy Optimization and Proximal Policy Optimization
  14. Deep RL Applied to Autonomous Driving
  15. Assessment
  16. Other Books You May Enjoy

Frequently asked questions

Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription
No, books cannot be downloaded as external files, such as PDFs, for use outside of Perlego. However, you can download books within the Perlego app for offline reading on mobile or tablet. Learn how to download books offline
Perlego offers two plans: Essential and Complete
  • Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
  • Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.
Both plans are available with monthly, semester, or annual billing cycles.
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 990+ topics, we’ve got you covered! Learn about our mission
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more about Read Aloud
Yes! You can use the Perlego app on both iOS and Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app
Yes, you can access TensorFlow Reinforcement Learning Quick Start Guide by Kaushik Balakrishnan in PDF and/or ePUB format, as well as other popular books in Computer Science & Artificial Intelligence (AI) & Semantics. We have over one million books available in our catalogue for you to explore.