Model-Free Prediction: Reinforcement Learning

Model-Free Prediction: Reinforcement LearningPart 4: Model-Free Predictions with Monte-Carlo Learning, Temporal-Difference Learning and TD( λ)Ryan WongBlockedUnblockFollowFollowingFeb 3Previously, we looked at planning by dynamic programming to solve a known MDP.

In this post, we will use model-free prediction to estimate the value function of an unknown MDP.

i.

e We will look at policy evaluation of an unknown MDP.

This series of blog posts contain a summary of concepts explained in Introduction to Reinforcement Learning by David Silver.

Part: 1・ 2・3・4・…The three main methods that will be explained for model-free predictions are:Monte-Carlo LearningTemporal-Difference LearningTD(λ)This post will mainly look at evaluating a given policy in an unknown MDP and not finding the optimal policy.

Monte-Carlo LearningMonte Carlo methods are model-free which learn directly from episodes of experience.

Monte Carlo learns from complete episodes with no bootstrapping.

One drawback to MC is that it can only apply to episodic Markov Decision Processes where all episodes must terminate.

Model-Free: no knowledge of MDP transitions / rewardsBootstrapping: update involves an estimateMonte-Carlo Policy EvaluationGoal: Given a policy π, learn v_π (value for the policy) from episodes of experience.

Given policy π with each state, action and associated reward for taking that actionRecall: return is the total discounted rewardRecall: value function is the expected returnMonte-Carlo policy evaluation uses empirical mean return instead of expected return.

Two approaches to evaluate the value function of a policy at a state is to use First-Visit Monte-Carlo Policy Evaluation or Every-Visit Monte-Carlo Policy Evaluation.

First-Visit Monte-Carlo Policy EvaluationEvaluate the value of state s of a given policyThe first time-step (t) that state (s) is visited in an episodeIncrement counter: N(s) ← N(s) + 1Increment total return: S(s) ← S(s) + GₜValue is estimated by mean return: V(s) = S(s)/N(s)V(s) → v_π(s) as N(s) → ∞Every-Visit Monte-Carlo Policy EvaluationEvaluate the value of state s of a given policyEvery time-step (t) that state (s) is visited in an episodeIncrement counter: N(s) ← N(s) + 1Increment total return: S(s) ← S(s) + GₜValue is estimated by mean return: V(s) = S(s)/N(s)V(s) → v_π(s) as N(s) → ∞In both of the above evaluation approaches, we had to track the statistics of our algorithm.

i.

e we can only compute the value after we have completed all the episodes.

To solve this problem we can use the Incremental Mean equation to update the value incrementally.

Incremental MeanThe mean µ₁, µ₂, … of a sequence x₁, x₂, … can be computed incrementally.

Incremental MeanIncremental Monte-Carlo UpdatesUpdate V(s) incrementally after episode S₁, A₁, R₂, …, Sₜ.

For each state Sₜ with return Gₜ :Replace step 3–5 with the above.

(Gₜ − V(Sₜ)) can be viewed as the error between the return and the mean at time step tIn non-stationary problems (where things are drifting around and you don’t need to remember things that happened long time ago), we can use the running mean approach, i.

e.

forget old episodes.

Incremental Monte-Carlo updatesTemporal-Difference LearningTemporal-Difference is model-free.

Temporal Difference methods learn directly from experience / interaction with the environment.

Temporal Difference learns from incomplete episodes, by bootstrapping (update the guess of the value function).

In both MC and TD the goal is to learn v_π online from experience under policy π.

If we were to apply incremental every-visit Monte-Carlo we update the value V(Sₜ) towards the actual return GₜThe simplest temporal-difference learning algorithm, TD(0) differs as we update the value V(Sₜ) towards an estimated return Rₜ₊₁ + γV(Sₜ₊₁)Rₜ₊₁ + γV(Sₜ₊₁) is the TD target and δₜ=Rₜ₊₁+γV(Sₜ₊₁)-V(Sₜ) is the TD error.

TD learning updates the value function immediately which allows it to learn before knowing the final outcome after every step unlike MC which must wait until the end of the episode before the return is known.

TD works in continuing (non-terminating) environments while MC only works for episodic (terminating) environments / complete sequences.

An example to illustrate the difference between TD and MC, is if we were to try to predict the how long it takes to drive home at each state along the way.

In MC, we would assign each state the value we receive at the end of the journey (actual outcome).

In TD, we would update the value along the way at each state using the impact that the next state has on the current state (estimated outcome).

There is a trade-off between the bias and the variance.

MC has a high variance and zero bias as it uses the return Gₜ which depends on many random actions, transitions and rewards.

Therefore it has good convergence properties even with the function approximation and is not sensitive to initial value.

TD has low variance and some bias as the TD target depends on one random action, transition and reward.

It is usually more efficient than MC.

TD(0) converges to v_π(s) but not always with function approximation.

Unlike MC it is more sensitive to the initial value.

Batch MC and TDSo we have seen that MC and TD converge: V(s) → v_π(s) as experience → ∞But in practise we cannot go on forever, so how do these algorithms converge for batch solution for finite experience?Suppose we have two states A, B with no discounting and 8 episodes of experience.

AB ExampleWhat is the value at state A.

V(A) ?MC converges to solution which best fits the observed returns with minimum mean-squared error.

Therefore V(A)=0.

As the only time the state A appears in an episode is when the return is 0.

TD(0) converges to solution of max likelihood Markov model.

It is the solution to the MDP that best fits the data.

Therefore V(A)=0.

75.

Since we got a reward 6 out of 8 episodes.

TD exploits the Markov property unlike MC.

Comparison Between Backup MethodsMonte-Carlo Backup:Value of a state Sₜ can only be computed once a terminal state is reachedTemporal-Difference TD(0) Backup: Value of a state Sₜ is computed using only one step look ahead.

Dynamic Programming Backup:Value at Sₜ is computed with a one step look at every possible state and computes an expected value.

n-Step ReturnAn approach between TD(0) and MC, where we have n-step temporal-difference learning.

Therefore the value will be computed by looking ahead n-steps and apply the temporal-difference learning method.

TD(λ)Instead of looking at each n-step return Gₜ⁽ⁿ⁾, we can use a decaying weighted sum to combine all n-step returns called the λ-return.

Forward-view TD(λ)The value at a state can now be computed using a forward-view TD(λ)The forward-view looks into the future to compute the λ-return and updates the value function towards it and can only be computed from complete episodes.

Backward-view TD(λ)The backward-view provides mechanism to update the value online, every step, from incomplete sequences.

We keep an eligibility trace for every state s and update the V(s) for every state s in proportion to TD-error δₜ and eligibility trace Eₜ(s).

Eligibility TraceEligibility traces combine both frequency heuristic and recency heuristic.

– Frequency heuristic: assign credit to most frequent states- Recency heuristic: assign credit to most recent statesEligibility Trace EquationsSummaryWe have looked at various methods for model-free predictions such as Monte-Carlo Learning, Temporal-Difference Learning and TD(λ).

These methods allowed us to find the value of a state when given a policy.

In the next post, we will look at finding the optimal policies using model-free methods.

ReferencesUCL Course on RL — Lecture 4An Introduction to Reinforcement Learning, Sutton and Barto, 1998If you enjoyed this post and want to see more don’t forget follow and/or leave a clap.