In short, we update our estimate of the value of the current state, denoted V(s), based on a combination of its current estimate and the outcome of taking the action to the next state in the episode.

For example, say we start the learning process and our first action is to throw from state [-5,-5]. If it successfully hits the bin, then we have a positive reward of +1 for reaching the goal.

Therefore, we have the following update:

V(s) ← V(s) + α( r + γ·V(s′) − V(s) ) = 0 + 0.5 × ( +1 + 0.5 × 0 − 0 ) = 0.5

This may seem like a trivial calculation, but it is important to remember that success is not guaranteed.

Therefore, if we think about all possible actions, the result of this first throw means we believe that this throw action is currently the best choice.

This throw action has a value of 0.5 compared to 0 for all other actions that haven't been tested yet.

Therefore, under the ϵ−greedy selection process, we would try this again.
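The ϵ-greedy rule itself is simple to sketch in code. This is an illustrative stand-in, not code from the notebook; the function name and action labels are hypothetical:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if random.random() < epsilon:
        return random.choice(list(action_values))
    # Otherwise exploit: take the action with the highest current value
    return max(action_values, key=action_values.get)

# After the first successful throw, the throw action (value 0.5) leads the rest
values = {"throw": 0.5, "move_N": 0.0, "move_E": 0.0}
print(epsilon_greedy(values))  # usually "throw"
```

With ϵ=0.1 the agent still explores a random action roughly one time in ten, which is how the untested moves eventually get sampled.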

However, this time the paper does not go in the bin but misses, and we therefore have a negative terminal reward of −1:

V(s) ← V(s) + α( r + γ·V(s′) − V(s) ) = 0.5 + 0.5 × ( −1 + 0.5 × 0 − 0.5 ) = −0.25

So we see that the value of this state has now reduced to account for the second throw.
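The two TD(0) updates described above can be checked numerically. This is a minimal sketch assuming, as in the example, that a throw ends the episode so the next-state value is zero:

```python
# TD(0) update applied twice from V([-5,-5]) = 0, using alpha = gamma = 0.5:
#   V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
alpha, gamma = 0.5, 0.5

V = 0.0
V = V + alpha * (+1.0 + gamma * 0.0 - V)  # first throw hits the bin: reward +1
print(V)  # 0.5

V = V + alpha * (-1.0 + gamma * 0.0 - V)  # second throw misses: reward -1
print(V)  # -0.25
```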

The core concept in Reinforcement Learning is that we test actions by repeated sampling; we need to keep sampling until the results converge to an estimate of the true probabilistic outcomes.

For example, if we toss a coin 2 times, we are fairly likely to see both outcomes come up heads, but if we toss it 100 times then we would likely see close to a 50/50 split between heads and tails.
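A quick simulation makes the point; the seed and sample sizes here are arbitrary choices for illustration:

```python
import random

random.seed(42)  # fixed seed so the runs are repeatable

def heads_fraction(n_flips):
    """Simulate n_flips fair coin tosses and return the fraction of heads."""
    return sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips

print(heads_fraction(2))    # with so few samples, often 0.0, 0.5 or 1.0
print(heads_fraction(100))  # much closer to the true 0.5
```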

In our example, if throwing from state [-5,-5] is a good action, then repeatedly trying this should, overall, lead to a positive result.

This can be quite difficult to comprehend at first but in simple terms, we are testing the action by trial and error and making our algorithm do all the work so we don’t have to.

Note: For now, we will fix the start state to [-5,-5] and the parameters to ϵ=0.1, α=0.5 and γ=0.5 until we demonstrate parameter changes later.

After 100 episodes, we see that the states around our fixed start point have updated, but if we compare the following heat-map side by side with the previous line plot, we see that learning has not fully converged after 100 episodes and is still updating.

Initial TD(0) Output

We therefore increase the number of episodes greatly, from 100 to 1,000. As this is starting to take longer and longer, a good idea is to introduce a method to keep track of the progress of the loop.

To do this, I applied a method introduced in this post.
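The linked post's exact method isn't reproduced here; a minimal stand-in that prints a text progress bar inside the training loop could look like this:

```python
def progress_bar(episode, total, width=30):
    """Return a text progress bar string for the current episode."""
    filled = int(width * (episode + 1) / total)
    return f"Episode {episode + 1}/{total} [{'#' * filled}{'-' * (width - filled)}]"

for episode in range(1000):
    # ... run one TD(0) episode here ...
    print("\r" + progress_bar(episode, 1000), end="")
print()
```

Printing with a leading carriage return overwrites the same line each episode, so the loop reports progress without flooding the output.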

Increase to 1,000 episodes

Varying Rewards

We note that the resulting state values are very negative and that they are diverging (i.e. not stable).

There are a few steps we can take to improve this; first, we will introduce rewards for other actions.

Currently, the only rewards we have are given when the algorithm throws: +1 for the positive goal or −1 for the negative goal.

This is part of the process in Reinforcement Learning that gives us control over what the algorithm optimises for.

For example, say we want to discourage the algorithm from throwing, we could introduce a small positive reward (say 0.1) for each move action as shown below.
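As a sketch, the reward scheme described so far might be wrapped in a small function; the names and signature are illustrative, not taken from the notebook:

```python
# Reward values as described: +1 for hitting the bin, -1 for missing,
# and a small positive reward of 0.1 for every move action.
R_HIT, R_MISS, R_MOVE = 1.0, -1.0, 0.1

def reward(action, hit_bin=False):
    """Return the reward for an action; hit_bin only matters for throws."""
    if action == "throw":
        return R_HIT if hit_bin else R_MISS
    return R_MOVE
```

Keeping the reward values in one place like this makes it easy to experiment with different schemes later.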

Add reward for moving: r_move = 0.1

Although this appears worse at first, the oscillation of the state value shows that there is a value it is trying to find, but our choice of parameters is causing it to diverge. At the very least, we can see that it is getting closer to converging.

We could start varying parameters, but part of the issue is that we are summarising the value of the state over a large number of actions (360 throw directions and 8 move directions).

Therefore, instead of summarising this into one value, it would be better to consider the quality of each state-action pair individually.
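To get a feel for the size of this state-action space, here is an illustrative Q-table keyed by (state, action) pairs; the action encoding is assumed, not taken from the notebook:

```python
# Every state has 360 throw directions plus 8 move directions = 368 actions.
throw_actions = [("throw", bearing) for bearing in range(360)]
move_actions = [("move", d) for d in ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]]
actions = throw_actions + move_actions

# Q-table keyed by (state, action), initialised to 0 for the start state
Q = {((-5, -5), a): 0.0 for a in actions}
print(len(actions))  # 368 values to learn for this one state alone
```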

To do this, we can introduce our second model-free method: Q-learning.

Q-Learning

Much like TD(0), Q-learning learns as we take each action, but it also searches over the possible subsequent actions to learn faster.

Definition: Q-Learning Update Rule (Wikipedia):

Q(s_t, a_t) ← Q(s_t, a_t) + α( r_t + γ · max_a Q(s_t+1, a) − Q(s_t, a_t) )

where:
- Q(s_t, a_t) is the value of the state-action pair,
- α is the learning rate parameter,
- r_t is the reward,
- γ is the discount factor parameter and,
- max_a Q(s_t+1, a) is the value of the best action in the next state.
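A single application of this rule can be sketched as follows; the dictionary-based Q-table and function signature are my own illustrative choices:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.5, terminal=False):
    """Apply one Q-learning update to Q, a dict keyed by (state, action)."""
    # Look ahead to the best available action from the next state
    best_next = 0.0 if terminal else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

# First successful throw from [-5,-5]: terminal reward +1, as in the TD(0) example
Q = {}
actions = ["throw", "move_E"]  # reduced action set for illustration
q_update(Q, (-5, -5), "throw", +1.0, None, actions, terminal=True)
print(Q[((-5, -5), "throw")])  # 0.5
```

Note that only the (state, action) pair actually taken is updated, but the max over next-state actions is what lets information propagate backwards quickly.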

As before, we will fix the parameters to ϵ=0.1, α=0.5 and γ=0.5.

Q-Learning Initial Run

Varying Parameters

We have three main parameters to vary: the learning rate α, the discount factor γ and our ϵ value for the ϵ-greedy action selection.

The following explanations for each are taken directly from Wikipedia; the ϵ parameter has already been introduced in some detail under Explore vs exploit.

Learning rate

The learning rate or step size determines to what extent newly acquired information overrides old information.

A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities).

In fully deterministic environments, a learning rate of α_t = 1 is optimal.

When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero.
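One schedule that decreases to zero is α_t = 1/t, which turns the update into an exact running average of the samples seen so far; this particular schedule is an illustrative choice, not one used in the notebook:

```python
def alpha_schedule(t):
    """Decaying learning rate: alpha_t = 1/t."""
    return 1.0 / t

estimate = 0.0
rewards = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]  # noisy samples of an underlying value
for t, r in enumerate(rewards, start=1):
    # With alpha_t = 1/t this incremental update equals the sample mean
    estimate += alpha_schedule(t) * (r - estimate)
print(estimate)  # the running mean of the samples
```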

In practice, often a constant learning rate is used, such as α_t = 0.1 for all t. [3]

Discount factor

The discount factor γ determines the importance of future rewards.

A factor of 0 will make the agent “myopic” (or short-sighted) by only considering current rewards, i.e. r_t (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward.

If the discount factor meets or exceeds 1, the action values may diverge.

For γ=1, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.
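The effect of γ on the return of a constant reward stream can be computed directly; this is a small illustrative calculation rather than anything from the notebook:

```python
# Discounted return of a constant reward r = 1 over many steps: G = sum_t gamma^t * r
def discounted_return(gamma, r=1.0, steps=1000):
    return sum((gamma ** t) * r for t in range(steps))

print(discounted_return(0.0))  # 1.0: myopic, only the immediate reward counts
print(discounted_return(0.9))  # ~10: approaches the limit r / (1 - gamma)
print(discounted_return(1.0))  # 1000.0: grows without bound as steps increase
```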

[4] Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network.

[5] In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.

[6]

So what do these mean, and, more importantly, what are we trying to achieve with our parameter selection? The overall aim is to find the optimal action for any given state whilst achieving this with a reasonable amount of effort (measured by the number of episodes, computation needed or time).

A good way to think about the learning rate is that a high value means we weight the information gained from each action more heavily, and so learn faster, but may find it hard to fully converge, whilst a small value will take longer but will converge steadily.
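This trade-off can be demonstrated with a toy example: alternating +1/−1 rewards whose true average value is 0, used here as a deterministic stand-in for the noisy throws:

```python
def oscillation(alpha, cycles=100):
    """Apply TD-style updates to alternating +1/-1 rewards (true value 0)
    and return how far the final estimate ends up from 0."""
    v = 0.0
    for i in range(2 * cycles):
        r = 1.0 if i % 2 == 0 else -1.0
        v += alpha * (r - v)
    return abs(v)

print(oscillation(0.9))  # large swings: the estimate keeps overshooting
print(oscillation(0.1))  # small swings: the estimate settles near the true value 0
```

A large alpha chases each individual reward, so the estimate bounces between extremes; a small alpha averages over many rewards and settles close to the true value.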

A good analogy is playing golf with just one club: a high alpha corresponds to using a big-hitting club, whilst a small alpha is akin to using a small-hitting club.

The big-hitting club will initially get us closer to the green but once we get close it is hard to accurately hit the hole.

However, a small-hitting club will take more attempts to reach the green but once it does we have more control and can reach the hole easier.

We have already observed the effect of a large alpha parameter in our earlier applications where the values oscillated between each episode.

Therefore, we need to use a small value but this introduces the challenge regarding the number of episodes required to converge.

We already need thousands of episodes to converge for a fixed start state and we have 100 to consider for the full environment.

This is the trade-off we have to consider and the best decision may be to only learn for one fixed start state at a time when needed rather than trying to find the optimal policy for all states.

When we observe the trends for varying alpha in the animation below, we see that, if we fix the start state, we are able to use a small alpha value without needing an impossibly large number of episodes.

If we were to consider all states, we would likely need to use a slightly larger alpha value to reach a suitable result in good time.

We therefore go back to considering a single fixed start state of [-5,-5] and choose an alpha value of α=0.1.

With these set, we then evaluate the choice of γ.

From the Wikipedia explanation, we understand that the value corresponds to whether we consider future rewards important or not.

When considering this, it helps to remind ourselves of the Q-learning update rule.

Within the Q-learning update rule, we see that γ scales the Q value taken for the best action from the next state.

This term sits in the same bracket as the reward of the action itself, so if we shrink it with a small gamma value then the immediate reward carries more weight.

Conversely, if we take a high gamma value, we consider the information obtained from this next state to be more important.

Therefore, we would ideally choose a value that gives weight to future rewards so that our decisions lead optimally towards the bin, and we select a value of γ=0.9.

Alpha Analysis: alpha = 0.9 vs alpha = 0.1

Gamma Analysis: gamma = 0.9

Final Parameter Output:
- alpha = 0.1
- gamma = 0.9
- epsilon = 0.1
- 10,000 episodes
- Fixed start state: [-5,-5]

Output: The optimal action from the start state is to MOVE in direction: 2

Conclusion

We see that the final output for the start state is to move EAST.

As mentioned, we may want to consider changing the rewards so that moving is discouraged slightly as it seems that our algorithm is collecting rewards for moving rather than reaching the end goal.

The results for all states covered by the episodes converge within the 10,000 episodes though it appears that many have yet to be fully explored and are not optimal.

But if we are only concerned with the start state, then these do not matter significantly.

I hope this notebook/write-up is useful for demonstrating the impact each parameter has on learning and the overall process of RL in a self-contained example.

Thanks.