Qrash Course: Deep Q Networks from the Ground Up in 10 MinutesShaked ZychlinskiBlockedUnblockFollowFollowingJan 9This article assumes no prior knowledge in Reinforcement Learning, but it does assume some basic understanding of neural networks.

Out of all the different types of Machine Learning fields, the one fascinating me the most is Reinforcement Learning.

For those who are less familiar with it — while Supervised Learning deals with predicting values or classes based on labeled data and Unsupervised Learning deals with clustering and finding relations in unlabeled data, Reinforcement Learning deals with how some arbitrary being (formally referred to as an “Agent”) should act and behave in a given environment.

The way it is done is by giving the Agent rewards or punishments based on the actions it has performed on different scenarios.

One of the first practical Reinforcement Learning methods I learned was Deep Q Networks, and I believe it’s an excellent kickstart to this journey.

So allow me to walk you through the path I walked on when attempted to learn RL —including a “Hello World” exercise, which helped me more than I can explain.

Defining the worldA Reinforcement Learning task is about training an Agent which interacts with its Environment.

The Agent transitions between different scenarios of the Environment, referred to as states, by performing actions.

Actions, in return, yield rewards, which could be positive, negative or zero.

The Agent’s sole purpose is to maximize the total reward it collects over an episode, which is everything that happens between an initial state and a terminal state.

Hence, we reinforce the Agent to perform certain actions by providing it with positive rewards, and to stray away from others by providing negative rewards.

This is how an Agent learns to develop a strategy, or a policy.

Take Super Mario as an example: Mario is the Agent interacting with the world (the Environment).

The states are exactly what we see on the screen, and an Episode is a level: the initial state is how the level begins, and the terminal state is how the level ends, whether we completed it or perished while trying.

The actions are move forward, move backwards, jump, etc.

Rewards are given depending on actions outcome: when Mario collects coins or bonuses, it receives a positive reward, and when it falls or being hit by an enemy, it receives a negative reward.

When Mario just wonders around, the reward it receives is zero, as if to say “you did nothing special”.

In a Super Mario game, Mario is the Agent interacting with the world — the EnvironmentBut there’s a problem here: to be able to collect rewards, some “non-special” actions are needed to be taken — you have to walk towards the coins before you can collect them.

So an Agent must learn how to handle postponed rewards by learning to link those to the actions that really caused them.

In my opinion, this is the most fascinating thing in Reinforcement Learning.

Markov Decision ProcessesEach state the Agent is in is a direct consequence of the previous state and the chosen action.

The previous state is also a direct consequence of the one that came before it, and so on till we reach the initial state.

Each one of these steps, and their order, hold information about the current state — and therefore have direct effect on which action should the Agent choose next.

But there’s an obvious problem here: the further we go, the more information the Agent needs to save and process at every given step it takes.

This can easily reach the point where it is simply unfeasible to perform calculations.

To tackle this, we assume all states are Markov States; that is — we assume that any state depends solely on the state that came before it, and the transition from that state to the current one (the action performed and reward given).

Let’s see an example — look at these two Tic Tac Toe games:Both games reached the same state, but in different ways.

Still, in both cases, the blue player must capture the top-right cell, or he will lose.

All we needed in order to determine this was the last state, nothing else.

It’s important to remember that when using the Markov assumption, data is being lost — in complex games such as Chess or Go, the order of the moves might have some implicit information on the opponent’s strategy or way of thought.

Still, the Markov States assumption is fundamental when attempting to calculate long-term strategies.

The Bellman EquationLet’s go ahead and develop our first strategy.

Consider the simplest case: assume we already know what is the expected reward for each action on each step.

How will we choose an action in this case?.Quite simply — we’ll choose the sequence of action that will eventually generate the highest reward.

This cumulative reward we’ll receive is often referred to as Q Value (an abbreviation of Quality Value), and we can formalize our strategy mathematically as:The Bellman EquationThe above equation states that the Q Value yielded from being at state s and selecting action a, is the immediate reward received, r(s,a), plus the highest Q Value possible from state s’ (which is the state we ended up in after taking action a from state s).

We’ll receive the highest Q Value from s’ by choosing the action that maximizes the Q Value.

We also introduce γ, usually called the discount factor, which controls the importance of longterm rewards versus the immediate one.

This equation is known as the Bellman Equation, and its Wikipedia page provides a comprehensive explanation of its mathematical derivation.

This elegant equation is quite powerful and will be very useful to us due to two important characteristics:While we still retain the Markov States assumptions, the recursive nature of the Bellman Equation allows rewards from future states to propagate to far-off past states.

There’s no need to really know what are the true Q Values when we start off; Since its recursive, we can guess something, and it will eventually converge to the real values.

Q LearningWe now have a basic strategy — at any given state, perform the action that will eventually yield the highest cumulative reward.

Algorithms like this are called greedy, for an obvious reason.

How would we implement this to solve real-life challenges?.One way is drawing a table to store all possible state-action combinations, and use it to save Q Values.

We can then update it using the Bellman Equation as an update rule:Q Learning update rule.

The “:=” notation is used here to explicitly mark assignment rather than equality.

Let’s see an example:Now remember some states are terminal states.

When the Agent reaches one, no action or state-transition are possible.

So, if the future state s’ is a terminal state, we are left with:Q Learning update rule for terminal state s’Not done yet — our greedy algorithm has a serious problem: if you keep selecting the same best-actions, you’ll never try anything new, and you might miss a more rewarding approach just because you never tried it.

To solve this, we use an ε-greedy approach: for some 0 < ε < 1, we choose the greedy action (using our table) with a probability p = 1-ε, or a random action with probability p = ε.

We thus give the Agent a chance to explore new opportunities.

This algorithm is known as Q Learning (or Q-Table).

Congratulations!.you just learned your very first Reinforcement Learning algorithm!Deep Q NetworksYou might have asked yourself how does Q Learning scale — and if you haven’t, let’s ask it together: what happens when the number of states and actions becomes very large?.This is actually not that rare — even a simple game such as Tic Tac Toe has has hundreds of different states (try to calculate this), and don’t forget we multiply this number by 9, which is the number of possible actions.

So how will we solve really complex problems?Enter Deep Learning!.We combine Q Learning and Deep Learning, which yields Deep Q Networks.

The idea is simple: we’ll replace the the Q Learning’s table with a neural network that tries to approximate Q Values.

It is usually referred to as the approximator or the approximating function, and denoted as Q(s,a; θ), where θ represents the trainable weights of the network.

Now it only makes sense to use the Bellman Equation as the cost function — but what exactly will we minimize?.Let’s take another look at it:The “=” sign marks assignment, but is there any condition which will also satisfy an equality?.Well, yes — when the Q Value reached its converged and final value.

And this is exactly our goal — so we can minimize the difference between the left-hand side and the right-hand side — and, viola!.Our cost function:DQN cost functionDoes this looks familiar?.Probably — it’s the Mean Square Error function, where the current Q Value is the prediction (y), and the immediate and future rewards are the target (y’):Mean square error functionThis is why Q(s’,a; θ) is usually referred to as Q-target.

Moving on: Training.

In Reinforcement Learning, the training set is created as we go; we ask the Agent to try and select the best action using the current network — and we record the state, action, reward and the next state it ended up at.

We decide on a batch size b, and after every time b new records were recorded, we select b records at random (!!) from the memory, and train the network.

The memory buffers used are usually referred to as Experience Replay.

Several types of such memories exist — one very common is a cyclic memory buffer.

This makes sure the Agent keeps training over its new behavior rather than things that might no longer be relevant.

Things are getting real, so let’s talk architecture: if imitating a table, the network should receive as input the state and action, and should output a Q Value:While correct, this architecture is very inefficient from a technical point of view.

Note that the cost function requires the maximal future Q Value, so we’ll need several network predictions for a single cost calculation.

So instead, we can use the following architecture:Here we provide the network only the state s as input, and receive Q Values for all possible actions at-once.

Much better.

And well, what do you know — that’s pretty much it.

Congratulations again!.You just learned how to design a Deep Q Network!Bonus— Double Deep Q LearningBefore we wrap it up, here’s something extra: a few paragraphs ago we compared the Deep Q Network cost function to Mean Square Error.

But MSE compares predictions y to the true labels y’ — and the true labels are constant throughout the entire training procedure.

Obviously, this is not the case in Deep Q Networks: both y and y’ are predicted by the network itself, and therefore might vary at every iteration.

The impact is clear.

Introducing: Double Deep Q Network, which uses semi-constant labels during training.

How?.We keep two copies of the Q Network, but only one is being updated — the other one remains still.

Every once in a while though, we replace the constant network with a copy of the trained Q Network, hence the reason we call it “semi-constant”.

And so:DDQN cost functionHere, ϑ represents the semi-constant weights, so Q(s’,a; ϑ) means the Q Value predicted by the semi-constant network.

That’s it, you got it.

Hello World!I personally believe the best way to grasp new concepts is by trying to implement them yourself.

In order to try Q Learning and Deep Q Networks, I made up a simple game: a board with 4 slots, which should be filled by the Agent.

When the Agent selects an empty slot, it receives a reward of +1, and the slot is filled.

If it selects a non-vacant slot, it receives a reward of -1.

The game ends when the entire board is full.

Give it a shot and try to implement an Agent that learns to master this game using both methods.

You can find my attempts right here.

Good luck, and for the third time today — Congratulations!.. More details