Those directly north, east, south or west can move in multiple directions, whereas the corner states (1,1), (1,-1), (-1,-1) and (-1,1) can either move or throw towards the bin. Lastly, I decided to show how the optimal policy changes with each update by exporting each plot and passing them into a small animation.

## Stage 3: Finding the Optimal Policy with Reinforcement Learning where the Probabilities are Hidden

### Q-Learning Algorithm

We will now imagine that the probabilities are unknown to the person, and therefore experience is needed to find the optimal actions. First, let's try to find the optimal action if the person starts in a fixed position and the bin is fixed to (0,0) as before.

We will apply Q-learning, initialising all state-action pairs with a value of 0 and using the Q-learning update rule:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') - Q(s, a) ]

We give the algorithm the choice to throw in any 360 degree direction (to a whole degree) or to move to any position surrounding the current one. There are therefore 8 places it can move: north, north-east, east, etc. When it chooses to throw the paper, it will receive either a positive reward of +1 or a negative reward of -1, depending on whether it hits the bin, and the episode ends. Through a number of trial-and-error attempts it will need to establish where the bin is located, and then whether it is better to move first or throw from the current position.

### Q-Learning Pseudo-code

First, as before, we initialise the Q-table with arbitrary values of 0. For now, the start of each episode will be fixed to one state, and we also introduce a cap on the number of actions per episode so that it doesn't accidentally keep going endlessly. Each episode ends naturally once the paper is thrown.

The action the algorithm performs is decided by the epsilon-greedy action selection procedure, whereby an action is selected randomly with probability epsilon and greedily (the current max) otherwise. To balance the random selection slightly between moving and throwing actions (as there are only 8 move actions but 360 throwing actions), I decided to give the algorithm a 50/50 chance of moving or throwing, after which it picks an action randomly from the chosen group.

As before, a random movement action cannot go beyond the boundary of the room, and once the action is chosen we update the current Q(s,a) dependent upon the max of Q(s',a) over all possible subsequent actions. For example, if we move from (-9,-9) to (-8,-8), Q( (-9,-9), (1,1) ) will update according to the maximum of Q( (-8,-8), a ) over all possible actions, including the throwing ones. If the algorithm throws the paper, the probability of success is calculated for this throw and we simulate whether it was successful, receiving a positive terminal reward, or unsuccessful, receiving a negative terminal reward. The algorithm continues to update the Q values for each state-action pair until the results converge.

We will analyse the effect of varying the parameters in the next post, but for now simply introduce some arbitrary parameter choices:

- num_episodes = 100
- alpha = 0.5
- gamma = 0.5
- epsilon = 0.2
- max_actions = 1000
- pos_terminal_reward = 1
- neg_terminal_reward = -1

Running the algorithm with these parameters 10 times, we produce the following 'optimal' actions for state (-5,-5):

Clearly these are not aligned, which heavily suggests the actions are not in fact optimal. Therefore, we need to consider how the parameters we have chosen affect the output and what can be done to improve the results.

### Conclusion

We have introduced an environment from scratch in Python and found the optimal policy. Furthermore, I have begun to introduce the method for finding the optimal policy with Q-learning. I will continue this in a follow-up post and improve these initial results by varying the parameters. For now, I hope this demonstrates enough for you to begin trying your own algorithms on this example.

If you have any questions, please feel free to comment below or on the Kaggle pages.

Thanks

Phil
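As an appendix, the Q-learning loop described in the pseudo-code section can be sketched in Python. This is a minimal illustration, not the article's actual implementation (which is on the Kaggle pages): the grid size, the `throw_success_prob` formula, and all helper names here are assumptions chosen for the sketch, with only the parameter values taken from the text.

```python
import math
import random

# Assumed room geometry: x, y in [-GRID, GRID], bin at (0, 0).
GRID = 10
MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
THROWS = [("throw", deg) for deg in range(360)]  # one action per whole degree

def throw_success_prob(state, deg):
    """Placeholder model: success falls off with distance and angular error."""
    x, y = state
    dist = math.hypot(x, y)
    if dist == 0:
        return 1.0
    target = math.degrees(math.atan2(-y, -x)) % 360  # bearing towards the bin
    err = min(abs(deg - target), 360 - abs(deg - target))
    return max(0.0, (1 - dist / (2 * GRID)) * (1 - err / 180))

def epsilon_greedy(Q, state, epsilon):
    """Random action with prob epsilon (50/50 move vs throw), else current max."""
    if random.random() < epsilon:
        return random.choice(MOVES) if random.random() < 0.5 else random.choice(THROWS)
    return max(MOVES + THROWS, key=lambda a: Q.get((state, a), 0.0))

def run_q_learning(num_episodes=100, alpha=0.5, gamma=0.5, epsilon=0.2,
                   max_actions=1000, start=(-5, -5), seed=0):
    random.seed(seed)
    Q = {}  # Q-table as a dict, all pairs implicitly initialised to 0
    for _ in range(num_episodes):
        state = start
        for _ in range(max_actions):  # cap so an episode cannot run endlessly
            action = epsilon_greedy(Q, state, epsilon)
            old = Q.get((state, action), 0.0)
            if action[0] == "throw":
                # Simulate the throw: +1 terminal reward on success, -1 otherwise
                reward = 1 if random.random() < throw_success_prob(state, action[1]) else -1
                Q[(state, action)] = old + alpha * (reward - old)
                break  # throwing ends the episode
            # Movement: clip to the room boundary, then bootstrap on max Q(s', a')
            nxt = (max(-GRID, min(GRID, state[0] + action[0])),
                   max(-GRID, min(GRID, state[1] + action[1])))
            best_next = max(Q.get((nxt, a), 0.0) for a in MOVES + THROWS)
            Q[(state, action)] = old + alpha * (gamma * best_next - old)
            state = nxt
    return Q
```

After training, the greedy action for the start state can be read off with `max(MOVES + THROWS, key=lambda a: Q.get(((-5, -5), a), 0.0))`; repeating the run with different seeds reproduces the kind of disagreement between 'optimal' actions described above.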