The Other Type of Machine LearningA brief introduction to Reinforcement LearningGenevieve HayesBlockedUnblockFollowFollowingFeb 9This is the first in a series of articles on reinforcement learning and OpenAI Gym.

IntroductionSuppose you’re playing a video game.

You enter a room with two doors.

Behind Door 1 are 100 gold coins, followed by a passageway.

Behind Door 2 is 1 gold coin, followed by a second passageway going in a different direction.

Once you go through one of the doors, there is no going back.

Which of the doors should you choose?If you made your decision based solely on maximizing your immediate reward (or score), then your answer would be Door 1.

However, the aim of most video games is not to maximize your score in a single section of the game, but to maximize your score for the entire game.

After all, there could be 1000 gold coins at the end of the passageway behind Door 2, for all you know.

The only way to really answer this question is to play the game multiple times; try a different door each time; and then based on the information you’ve collected, determine your optimal strategy.

The scenario I’ve just described is a classic example of reinforcement learning (RL), the often overlooked “other” type of machine learning.

In this post, I will take you through the basic characteristics of RL and introduce Q-learning, the fundamental RL algorithm.

What is Reinforcement Learning?Reinforcement learning, along with supervised learning and unsupervised learning, is one of the three basic types of machine learning.

In supervised learning, you are given a labelled dataset and the aim is to use that dataset to determine a general rule that will allow you to label any new datapoints you might come across.

For example, using a dataset of labelled pictures of pets, to create a model that will allow you to label any new pictures of pets you might come across.

In unsupervised learning, you are given an unlabelled dataset and the aim is to draw conclusions about the underlying structure of the data, simply by examining the relationships that exist between the datapoints.

For example, identifying clusters of similar-looking images in the MNIST handwritten digits dataset.

In reinforcement learning, however, rather than being presented with a dataset up front, you are typically presented with an initially unknown “environment” (such as a maze, the US stock exchange, or even the real world) and you must gather data by talking actions in that environment (for example, choosing which of two doors to go through in a video game) and observing the consequences.

The types of problems that RL is applied to are referred to as “sequential decision problems” and the ultimate goal is to determine the optimal sequence of actions that will maximize your long-term gain.

The fact that the environment is initially unknown is important, because if the environment were fully known, then you wouldn’t need to interact with it in order to gather information about it.

The key characteristics of RL problems are then:An initially unknown environment: you must explore the environment in order to gain information about it;Delayed feedback: after taking an action, it may take some time to fully realize the long-term consequences of that action; andSequential decision making: the overall reward received is typically the result of a sequence of multiple actions rather than a single stand-alone action.

Much research has been put into developing algorithms for solving RL problems.

One of the best known of these is Q-learning.

Q-LearningRL algorithms come in two varieties: “model-based” and “model-free”.

With “model-based” RL, we determine the optimal course of action (also known as the optimal policy) using a pre-determined model of our environment (which may or may not be accurate), while with “model-free” RL, our aim is to determine the optimal policy by interacting with the environment.

Model-free RL algorithms can, in turn, be split into two types: “on-policy” and “off-policy”, relating to how we interact with the environment in order to gather information about it.

With “on-policy” algorithms, we make decisions about our actions based on our “best” policy at the time of making the decision and then use the information gathered from taking that action to improve on the best policy.

With “off-policy” algorithms, on the other hand, our behavior in interacting with (or exploring) the environment may be unrelated to what we believe to be the best optimal at the time of taking the action.

For example, we may deliberately choose to take an action that we know to be sub-optimal in the short term in order to determine if it results in a greater reward in the long term.

Q-Learning is a “model-free, off-policy” RL algorithm.

Q-learning works by building a table of Q values, Q(s,a), representing the expected discounted (long term) reward of taking action a in environment state s, and then iteratively improving on this table by interacting with the environment until the optimal Q table is found.

Initially, all of the values in the Q table are set equal to small random numbers (with the exception of the Q(s,a) values for terminal states, which are set equal to 0).

Each time an action is taken in the environment, a (state, action, reward, new state) tuple is produced.

This represents the state of the environment immediately prior to the action, s, the action taken in that state, a, the immediate reward received following that action, r, and the new state of the environment following the action, s’.

These (state, action, reward, new state) tuples are used to iteratively update the values in the Q table, where the updated Q(s,a) value, denoted Q’(s,a), is set equal to the weighted average of the old Q(s,a) value and the Q value implied by the new observation.

That is, the sum of (i) the immediate reward and (ii) the expected discounted reward received from the new state onwards, assuming you always choose the optimal action.

This can be expressed by the following update rule:Q’(s,a) = (1 — w) * Q(s, a) + w *(r + d * Q(s’, argmax a’ : Q(s’, a’)))where:w is the weight applied to the new information (also known as the learning rate); andd is the discount rate, which allows for the fact that $1 received now is more valuable than $1 received in the future.

Once the optimal Q table is found, the optimal action in a given state is the action which maximizes Q(s,a) for that state (that is, argmax a: Q(s, a)).

How well this algorithm works depends on the choice of exploration strategy used to interact with the environment — if you never visit a particular state of the environment or take a particular action in that state, then you are never going to know the consequences of taking that action in that state.

Nevertheless, we don’t just want to randomly move through our environment either.

Ideally, once we have collected some information about our environment, we would like to exploit this information by focusing our future exploration on the states and actions that we believe are likely to result in the greatest reward.

One way of doing this is using the epsilon greedy strategy.

Under the epsilon greedy strategy, at a given state, s, the greedy best action (that is, the action that maximizes Q(s,a) for that state) is chosen with probability (1 — epsilon), and otherwise a random action is selected.

Often epsilon is decayed over time.

This encourages exploration of the environment in early iterations of the algorithm, but then, as time progresses, reduces the amount of exploration, allowing us to focus more on exploiting that information.

If all state-action pairs are visited a sufficiently large number of times, the Q-learning algorithm is guaranteed to eventually converge on the optimal table, although how long convergence takes is another matter.

ExampleConsider the following maze (taken from here):Your aim is to train a robot to find the optimal path through this maze, starting from cell (0, 0) and ending in cell (6, 6), given no prior knowledge of the environment.

To encourage the robot to find the shortest path, a small penalty of 0.

04 units is applied each time the robot moves into an empty (white) cell, and obstacles are places around the maze (marked in gray) which result in a larger penalty of 0.

75 units if the robot enters a cell containing one of them.

The robot can only move up, down, left or right (that is, diagonal moves are not allowed).

However, a level of uncertainty is associate with each movement, such that there is only an 80% chance the robot will move in the intended direction and a 20% chance the robot moves at right angles to the intended direction (split evenly between the two possibilities).

The robot is unable to move outside the boundaries of the maze, and if it attempts to do so, bumps into the wall and its position remains unchanged.

If the robot successfully makes it to the end of the maze, it receives a reward of 1 unit.

Assuming a discount rate of 0.

9, a learning rate of 0.

3 and an epsilon greedy exploration strategy with (constant) epsilon equal to 0.

5, after 50,000 iterations of the Q-learning algorithm we get the following policy.

The diagram shows the optimal direction for the robot to take in each square of the grid.

Python code for finding this solution can be found here.

Reinforcement Learning in the Real WorldAlthough it may not seem obvious from the example given above, researchers have found numerous ways to apply RL to the real world.

RL has been successfully used to develop automated stock trading systems; optimize chemical reactions; train self-driving cars; and to teach a computer to play video games better than any human can; among other things.

For those looking to get started in RL, however, one of the biggest obstacles has historically been getting hold of an interesting and challenging environment in which to experiment.

This is the reason why RL is the least known member of the machine learning family.

Yet, since the introduction of OpenAI Gym, this is no longer the case.

OpenAI Gym is a Python package comprising a selection of RL environments, ranging from simple “toy” environments, such as the one described in the example, to more challenging environments, including simulated robotics environments and Atari video game environments.

Examples of some of the environments available in OpenAI GymThis package makes it possible for computer scientists, both amateur and professional, to experiment with a range of different RL algorithms, and even, potentially, to develop their own.

SummaryReinforcement learning is the branch of machine learning that deals with learning from interacting with an environment where feedback may be delayed.

In this tutorial, we discussed the basic characteristics of RL and introduced one of the best known of all RL algorithms, Q-learning.

Q-learning involves creating a table of Q(s,a) values for all state-action pairs and then optimizing this table by interacting with the environment.

The algorithm is guaranteed to converge provided each state-action pair is visited a sufficiently large number of times.

But what happens when the number of possible states and/or actions becomes so large that this is no longer possible?In my next article, I will go through how to get started with OpenAI Gym, which includes several environments that will allow us to explore this question in further detail.

About the AuthorGenevieve Hayes is an Data Scientist with experience in the insurance, government and education sectors, in both managerial and technical roles.

She holds a PhD in Statistics from the Australian National University and a Master of Science in Computer Science (Machine Learning) from Georgia Institute of Technology.

You can learn more about her here.

.. More details