Collecting Bananas with a Deep Q-Network

Controlling a simulated agent through the Unity environment

David Rose · Jan 3

Using a simulated Unity environment, this agent learns a policy of collecting yellow bananas while avoiding blue bananas, using no preset instructions other than the rewards obtained through exploration.

To skip ahead to the code, check the GitHub repo here: https://github.com/cipher982/DRL-DQN-Model

A short sample of the environment in Unity

Reinforcement Learning → Q-Learning → Deep Q-Learning

Under the umbrella of machine learning, we typically describe three foundational perspectives:

- Unsupervised Learning
- Supervised Learning
- Reinforcement Learning

Each of these relates to a different way of finding patterns in data, but the one we focus on here is reinforcement learning, in which we build up an optimal policy of actions using a simulation of positive and/or negative rewards.

The specific model used is referred to as a Deep Q-Network.

First proposed by DeepMind in 2015 (see the paper in Nature), it incorporates deep neural networks and traditional Q-Learning into a unified framework that can better generalize between environments and deal with the larger complexity of continuous state spaces and visual inputs (relative to the state space of a game such as chess).

The basics of a Q-Network

We start with the traditional method of Q-Learning, in which we have a table of possible states and actions Q[S,A], holding the expected reward of each combination.

When the agent needs to act, it chooses the action with the maximum expected value for that state, according to the Q-table.

As stated above, this quickly gets out of hand as environments scale up, and we would like to generalize between state/action rewards rather than purely memorize the past.

As an example, the table below shows a very simple environment in which there are only 3 states and 3 actions.

The green circles represent the expected rewards for each combination.

While this may be manageable for a human, and more complex environments may still be manageable for a computer, eventually the continuous nature of the real world and the unavoidable issue of exponentially increasing combinations come with a vengeance.
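To make the tabular setting concrete, here is a minimal sketch of Q-Learning on a 3-state, 3-action table. The states, rewards, and hyperparameter values here are invented for illustration, not taken from the banana environment:

```python
import numpy as np

n_states, n_actions = 3, 3
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration rate

Q = np.zeros((n_states, n_actions))  # Q[S, A]: expected reward per combination

def select_action(state, rng):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    # Classic Q-Learning update toward reward + discounted best next value
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

rng = np.random.default_rng(0)
# One hypothetical step: taking action 1 in state 0 earned a reward of 1.0
update(state=0, action=1, reward=1.0, next_state=2)
```

After that single update, Q[0, 1] moves a fraction (alpha) of the way toward the observed return, which is exactly the "memorizing the past" behavior that a neural network will later replace with generalization.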

Enter neural networks

A core idea behind neural networks is pattern fitting: specifically, the ability to represent compressed connections from input → output, which in the case of supervised learning may be image → cat or audio → text.

In the case of reinforcement learning, we are trying to learn state → actions/rewards.

So given a particular state (for example, the relative positions of different bananas compared to the agent), the model will output the expected reward for each of the actions forward/backward/left/right.

It is expected to learn that if a blue banana is in front of the agent while a yellow one is to the left, the agent should turn left and then head forward.

So with these two ideas above we can see how combining them solves the problems we came across regarding complexity, and the following, I think, is the most important part to understand:

As opposed to the standard approach of explicitly mapping out a reward for each state/action combination, we can use a neural network as a function approximator.

That is, if there are two locations (states) near each other that continually produce positive rewards, we can generalize that the locations between those two should also produce positive rewards.

Now this is a very basic example, but it helps to get the point across.

Below is a diagram of the model that DeepMind used in one of their first attempts at playing through various Atari games.

Using convolutional and fully connected layers, the neural network can progressively learn more and more detailed and intricate patterns connecting the image on the screen, the actions to take, and the expected rewards.

Diagram of DeepMind's DQN for Atari 2600

In our model we have a more simplified version that uses only fully-connected layers, consisting of:

- an input state space of 37 dimensions, containing the agent's velocity along with ray-based perception of objects around the agent's forward direction
- 2 hidden layers (64 nodes each)
- an output of 4 actions

Here is what a diagram of all the layers, nodes, and connections looks like:

(there may be some weird aliasing going on depending on the screen; the original image had to be scaled down a bit)

See this GIST by craffel for the code on making this image.

It looks pretty impressive all laid out like this! But this model is much simpler than most image-receiving networks such as the ones for Atari.
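A minimal sketch of what such a network might look like in PyTorch follows; the class and variable names here are illustrative and not necessarily those used in the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Maps a state vector to the expected value of each action."""

    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)   # 37 -> 64
        self.fc2 = nn.Linear(hidden, hidden)       # 64 -> 64
        self.fc3 = nn.Linear(hidden, action_size)  # 64 -> 4 action values

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # one expected reward per action

# Given a state, the network returns one value per action;
# the greedy policy simply picks the argmax.
net = QNetwork()
state = torch.rand(1, 37)
action = net(state).argmax(dim=1).item()
```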

The GitHub repo linked above contains the full code used to create the DQN class. That is all a high-level overview of the basics of this approach, but by itself it would not work very admirably.

So below I outline a couple of issues and the techniques used to overcome them.

Correlated inputs

With this model we are feeding in a time series of frames that are mostly the same and highly correlated from step to step, which violates the assumption of independent and identically distributed (i.i.d.) inputs.

To solve this we implement a couple of features:

- Random sampling of past experience
- A fixed target, using two separate networks

Replay Buffer

We can synthesize random experiences using a stored buffer of past experiences, which we sample from during training to update the Q-Network with random state/action combinations.

Unlike the buffer your computer uses when streaming online video, this one is more like a giant pile of data plucked out at random, as opposed to the First-In, First-Out (FIFO) queue you would want for sequential replay.

In the code below you can see how random indices are pulled from the experiences variable and fed to the specified PyTorch device (GPU or CPU, depending on whether you can run CUDA).
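The following is a representative sketch of such a buffer rather than the repo's exact code; names like `ReplayBuffer`, `buffer_size`, and `batch_size` are illustrative:

```python
import random
from collections import deque, namedtuple

import torch

# Send sampled batches to the GPU when CUDA is available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, buffer_size=100_000, batch_size=64):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences fall off the end
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform random sampling breaks the temporal correlation between steps
        experiences = random.sample(self.memory, k=self.batch_size)
        states = torch.tensor([e.state for e in experiences], dtype=torch.float32).to(device)
        actions = torch.tensor([e.action for e in experiences], dtype=torch.int64).unsqueeze(1).to(device)
        rewards = torch.tensor([e.reward for e in experiences], dtype=torch.float32).unsqueeze(1).to(device)
        next_states = torch.tensor([e.next_state for e in experiences], dtype=torch.float32).to(device)
        dones = torch.tensor([e.done for e in experiences], dtype=torch.float32).unsqueeze(1).to(device)
        return states, actions, rewards, next_states, dones
```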

Dual Networks (Target and Local)

Another need that pops up when using a neural-network policy is avoiding unwanted divergence when tracking the error/loss from your network during evaluation.

The trick here is creating two networks (call them Target and Local) and temporarily freezing the weights of one (Target) while following the gradients of the Local network.

Otherwise we would be chasing a moving target, which can lead to runaway divergence in the gradients rather than a stable learning process.

In the code below you can see the learning step that computes the loss and gradient, followed by a soft update that blends a small fraction of the local parameters into the target network, slowing the changes so they are not too immediate and unstable.

I hope that was a sufficient overview to grasp at a conceptual level how it works.

To learn more, I highly recommend the Udacity Reinforcement Learning Nanodegree that I have been using for most of my guidance recently.

Run this yourself!

Download the environment from one of the links below. You need only select the environment that matches your operating system:

- Linux: click here
- Mac OSX: click here
- Windows (32-bit): click here
- Windows (64-bit): click here

Place the file in the GitHub repository, and unzip (or decompress) the file.

Run ‘python main.py’ in your terminal to begin the training process.

