He pointed out that the poor performance is caused by a large overestimation of action values due to the use of max Q(s′, a) in Q-learning. To remedy this problem he proposed the Double Q-Learning method.

The Problem

Consider an MDP with four states, two of which are terminal. State A is always the start state and has two actions, Right and Left; taking Left leads to state B. The max operator is biased: when updating Q(s, a) with max Q(s′, a), Q(s, a) is not moving towards the expected value of the actions at state B, which is -0.5. This scenario gives an intuition for why Q-Learning over-estimates action values.

Double Q-Learning keeps two estimates, QA and QB. To select actions, the two can be merged, for example by averaging the values for each action, and then applying epsilon-greedy. The algorithm updates QA and QB in an equiprobable manner. (Algorithm taken from "Double Q-learning" by Hado van Hasselt.)

The charts below compare Double Q-Learning and Q-Learning when the number of actions at state B is 10 and 100, respectively. It is clear that Double Q-Learning converges faster than Q-learning. Notice that as the number of actions at B increases, Q-learning needs far more training than Double Q-Learning.

Why Does It Work?

Van Hasselt proves in his paper that E[Q2(s′, a*)] ≤ max_a Q1(s′, a), where a* is the action that maximizes Q1 at s′. So, over a sufficient number of experiments, the expected value of Q2(s′, a*) is less than or equal to max_a Q1(s′, a), which means Q1(s, a) is not updated with a maximum value. The table below shows the evolution of the Q-value of the Left action at state A as the number of episodes increases. Notice that in Q-Learning, Q(A, Left) is positive because it is affected by the occasional positive rewards sampled at state B.
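The toy experiment above can be sketched in Python. This is a minimal sketch, not the paper's exact setup: I assume Right at A ends the episode with reward 0, Left moves to B with reward 0, every action at B ends the episode with a normally distributed reward of mean -0.5, and the step size, epsilon, and episode count are illustrative choices.

```python
import random

N_B_ACTIONS = 10          # the article compares 10 and 100 actions at B
ALPHA, GAMMA, EPS = 0.1, 1.0, 0.1   # assumed hyperparameters
LEFT, RIGHT = 0, 1

def greedy(q, s):
    return q[s].index(max(q[s]))

def run(double, episodes=300, seed=0):
    rng = random.Random(seed)
    qa = {"A": [0.0, 0.0], "B": [0.0] * N_B_ACTIONS}
    qb = {"A": [0.0, 0.0], "B": [0.0] * N_B_ACTIONS}

    def eps_greedy(s):
        # Behave with respect to the merged estimate: the sum of the two
        # tables, which has the same argmax as their average.
        if rng.random() < EPS:
            return rng.randrange(len(qa[s]))
        merged = [x + y for x, y in zip(qa[s], qb[s])]
        return merged.index(max(merged))

    def update(q1, q2, s, a, r, s_next):
        if s_next is None:
            target = r
        else:
            # Select the action with q1, evaluate it with q2.
            # When q1 is q2, this reduces to standard Q-learning.
            target = r + GAMMA * q2[s_next][greedy(q1, s_next)]
        q1[s][a] += ALPHA * (target - q1[s][a])

    for _ in range(episodes):
        a = eps_greedy("A")
        if a == RIGHT:
            steps = [("A", RIGHT, 0.0, None)]
        else:
            b = eps_greedy("B")
            steps = [("A", LEFT, 0.0, "B"),
                     ("B", b, rng.gauss(-0.5, 1.0), None)]
        for s, act, r, s_next in steps:
            if not double:
                update(qa, qa, s, act, r, s_next)   # max over its own table
            elif rng.random() < 0.5:                # equiprobable table choice
                update(qa, qb, s, act, r, s_next)
            else:
                update(qb, qa, s, act, r, s_next)
    return qa, qb

q, _ = run(double=False)
da, db = run(double=True)
print("Q-learning        Q(A, Left):", round(q["A"][LEFT], 3))
print("Double Q-learning Q(A, Left):", round((da["A"][LEFT] + db["A"][LEFT]) / 2, 3))
```

With mean reward -0.5 at B, plain Q-learning's Q(A, Left) tends to stay positive for many episodes, because the max over noisy estimates at B is biased upward; the decoupled select-and-evaluate update removes that bias.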