Reinforcement Learning — Model Based Planning Methods

In this article, we will together explore RL methods with environment as a model.

The following will be structured as:Start with basic idea of how to model an environmentImplement an example in Python using the theory we just learntFurther ideas to extend the theory to more general casesModel the EnvironmentAn agent starts from a state, by taking an available action in that state, the environment gives it feedback, and accordingly the agent lands into next state and receive reward if any.

In this general settings, the environment gives an agent two signals, one is its next state in the setting, and the other is reward.

So when we say to model an environment, we are modelling a function mapping (state, action) to (nextState, reward) .

For example, consider a situation in a grid world setting, an agent bashes its head into the wall, and in response, the agent stays where it is and gets a reward of 0, then in the simplest format, the model function will be (state, action)–>(state, 0) , indicating that the agent with this specific state and action, the agent will stay at the same place and get reward 0.

AlgorithmLet’s now look into how a model of environment can help improve the process of Q-learning.

We start by introducing the simplest form of an algorithm called Dyna-Q:The way Q-learning leveraging models to backup policy is simple and straight forward.

Firstly, the a, b, c, d steps are exactly the same as general Q-learning steps(if you are not familiar with Q-learning, please check out my examples here).

The only difference lies in step e and f , in step e , a model of the environment is recorded based on the assumption of deterministic environment(for non-deterministic and more complex environment, a more general model can be formulated based on the particular case).

Step f can be simply summarised as applying the model being learnt and update the Q function n times, where n is a predefined parameter.

The backup in step f is totally the same as it is in step d , and you may think it as repeating what the agent has experienced several times in order to reinforce the learning process.

Typically, as in Dyna-Q, the same reinforcement learning method is used both for learning from real experience and for planning from simulated experience.

The reinforcement learning method is thus the “final common path” for both learning and planning.

The general Dyna ArchitectureThe graph shown above more directly displays the general structure of Dyna methods.

Notice the 2 upward arrows in Policy/value functons , which in most cases are Q functions that we talked before, one of the arrow comes from direct RL update through real experience , which in this case equals the agent exploring around the environment, and the other comes from planning update through simulated experience , which, in this case, is repeating the model the agent learnt from real experience.

So in each action taking, the learning process is strengthened by updating the Q function from both actual action taking and model simulation.

Implement Dyna MazeI believe the best way to understand an algorithm is to implement an actual example.

I will take the example from reinforcement learning an introduction, implement it in Python and compare it with general Q learning without planning steps(model simulation).

Game SettingDyna Maze BoardConsider the simple maze shown inset in the Figure.

In each of the 47 states there are four actions, up, down, right, and left, which take the agent deterministically to the corresponding neighbouring states, except when movement is blocked by an obstacle or the edge of the maze, in which case the agent remains where it is.

Reward is zero on all transitions, except those into the goal state, on which it is +1.

After reaching the goal state (G), the agent returns to the start state (S) to begin a new episode.

The whole structure of this implementation is to have 2 classes, the first class represents the board, which is also the environment, that is able toTake in an action and output the agent next state(or position)Give reward accordinglyAnd the second class represents the agent, which is able toExplore around the boardKeep track of a model of the environmentUpdate the Q functions along the way.

Board ImplementationThe first class of the board settings are similar with many board games we talked before, you can checkout full implementation here.

I will eliminate my explanation here(you can check my previous articles to see more examples), as a result, we will have a board look like this:Board ImplementationThe board is represented in an numpy array, where z indicates the block, * indicates the agent’s current position and 0 indicates empty and available spots.

Agent ImplementationInitilisationFirstly, in the init function we will initialise all the parameters required for the algorithm.

Besides those general Q-learning settings(learning rate, state_actions, …), a model of (state, action) -> (reward, state) is also initialised as python dictionary, and the model will only be updated along with the agent’s exploration in the environment.

The self.

steps is the number of time the model is used to update the Q function in each action taking, and self.

steps_per_episode is used to record the number of steps in each episode(we will take it as a key metrics in the following algorithm comparison).

Choose ActionsIn the chooseAction function, the agent will still take ϵ-greedy action, where it has self.

exp_rate probability to take a random action and 1 — self.

exp_rate probability to take a greedy action.

Model learning and Policy updatingNow let’s get to the key point of policy updating using models being learnt along agent’s exploration.

This implementation follows exactly as the algorithm we listed above.

At each episode(game playing), after the first round of Q function update, the model will also be updated with self.


state][action]=(reward, nxtState) , and then Q function will be repeatedly updated by self.

steps number of times.

Notice that in side the loop, the state and action are both randomly selected from the previously observations.

Experimenting with different stepsWhen the number of steps is set to 0, the Dyna-Q method is essentially Q-learning.

Let’s compare the learning process with steps of 0, 5 and 50.

Dyna-Q with different stepsThe x-axis is the number of episodes and y-axis is the number of steps to reach the goal.

The task is to get to the goal as fast as possible.

From the learning curve, we observe that the learning curve of planning agent(with simulated model) stabilises faster than non-planing agent.

Referring to the words in Sutton’s book:Without planning (n = 0), each episode adds only one additional step to the policy, and so only one step (the last) has been learned so far.

With planning, again only one step is learned during the first episode, but here during the second episode an extensive policy has been developed that by the end of the episode will reach almost back to the start stateThe additional model simulation and backup further reinforced the agent’s experience, thus resulted in a faster and more stable learning process.

(checkout the full implementation)How to generalise the idea?The example we explored here surely has limited use, as the state is discretised an the action is deterministic.

But the idea of modelling environment to accelerate learning process has unlimited use.

For discrete states with non-deterministic actionA probability model could be learnt rather than a straight forward 1 to 1 mapping we introduced above.

The probability model should be constantly updated through the learning process, and during the backup stage, the (reward, nextState) could be chosen non-deterministically with a probability distribution.

For continuous statesThe Q function update will be slightly different(I will introduce it in further articles), and the key would be to learn a more complex and general parametric model of the environment.

This process could involve general supervised learning algorithms with current state, action as input and next state and reward as output.

In next post, we will learn further ideas to improve Dyna methods and talk about situations when the model is wrong!And lastly, please check out the full code here.

You are welcomed to contribute, and if you have any questions or suggestions, please raise comment below!Reference:http://incompleteideas.




. More details

Leave a Reply