→ the agent should take the future into account as well → the cumulative reward over time → however, when the time horizon goes to infinity the sum can diverge → hence the discount factor.
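The discounted objective being described can be written as:

```latex
V^{\pi}(s) \;=\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right], \qquad 0 \le \gamma < 1
```

With bounded rewards, the geometric weighting keeps the infinite sum finite.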

(there also exists a model that optimizes the average reward).

→ but the average-reward model cannot differentiate between a policy that gets a good score at the beginning and one that gets it at the end.

How the different optimality models look when visualized.

(speed of convergence is also a factor that needs to be taken into consideration when comparing which model is optimal → converging as fast as possible → might lead to a worse final policy).

Regret → the expected reward lost by not acting optimally.
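One common formalization, here for the bandit setting, with $\mu^{*}$ the expected reward of the best arm:

```latex
\mathrm{Regret}(T) \;=\; T\mu^{*} \;-\; E\!\left[\sum_{t=1}^{T} r_{t}\right]
```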

→ but this is hard to measure → since we would need to know the optimal value of every action in every state → and we might not know this.

(Adaptive control → very strong theory).

Explore and exploit → the k-armed bandit problem → gambling machines and pulls → the most famous example problem when discussing the exploration/exploitation dilemma.

(there is a dynamic programming approach to solve this problem).

The dynamic programming solution can be seen above.

(this section describes possible solutions to the k-armed bandit problem → such as learning automata or dynamic programming).

→ still, more mathematical theory has to be developed.

Ad hoc methods → actually give good results → such as greedy or ε-greedy methods.
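As a concrete illustration, a minimal ε-greedy sketch on a 3-armed bandit (a made-up toy example, not code from the survey; the arm means and Gaussian noise are assumptions):

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Sample-average epsilon-greedy on a k-armed bandit."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                            # explore
        else:
            a = max(range(k), key=lambda i: estimates[i])   # exploit
        reward = true_means[a] + rng.gauss(0, 1)            # noisy payoff
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a] # incremental mean
    return estimates, counts

estimates, counts = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

With enough pulls, the best arm (mean 0.9) ends up pulled most often.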

One of the methods → uses the softmax function → a Boltzmann distribution with a temperature parameter.
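A minimal sketch of Boltzmann (softmax) action selection; the Q-values and temperatures below are illustrative only:

```python
import math

def boltzmann_probs(q_values, temperature=1.0):
    """Boltzmann (softmax) action probabilities; lower temperature = greedier."""
    m = max(q_values)  # subtract max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

p_hot = boltzmann_probs([1.0, 2.0, 3.0], temperature=10.0)   # near-uniform
p_cold = boltzmann_probs([1.0, 2.0, 3.0], temperature=0.1)   # near-greedy
```

High temperature explores almost uniformly; low temperature concentrates on the best-looking action.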

(finally, there are also interval-based methods).

(many ad hoc methods lack theoretical guarantees).

Delayed reward → we now have to think about the case in which → the reward arrives only at a much later time step.

→ here an MDP has to be assumed.

Policy → can be understood as a solution → here → the optimal policy is the one that maximizes the state value function → for every state.

To find the optimal policy → find the optimal value function → a smaller search space compared to finding the state-action value function.
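The optimal value function mentioned here satisfies the Bellman optimality equation (writing $T(s,a,s')$ for the transition probability and $R(s,a)$ for the expected immediate reward):

```latex
V^{*}(s) \;=\; \max_{a}\Big( R(s,a) \;+\; \gamma \sum_{s'} T(s,a,s')\, V^{*}(s') \Big)
```

The optimal policy then just acts greedily with respect to $V^{*}$.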

(this has a bound).

→ good theoretical guarantees → and principled stopping criteria.
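A minimal value iteration sketch with the max-update stopping criterion; the tiny 2-state MDP at the bottom is a made-up example:

```python
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """P[s][a] = list of (prob, next_state); R[s][a] = expected reward.
    Sweep Bellman backups until the largest update falls below theta."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

# made-up 2-state MDP: from state 0, action 1 moves to absorbing state 1
# with reward 1; every other transition gives reward 0
P = [[[(1.0, 0)], [(1.0, 1)]],  # state 0: action 0 stays, action 1 -> state 1
     [[(1.0, 1)]]]              # state 1: absorbing, one action
R = [[0.0, 1.0], [0.0]]
V = value_iteration(P, R)
```

The stopping test on `delta` is what gives the convergence guarantee the notes allude to.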

(very flexible).

The simple equation above → contains a lot of theory as well as beautiful statistics.

The above → directly optimizes the policy → rather than first finding the optimal state value or state-action value function.

→ and in practice → this method does work.

Value iteration → much faster → but is it better? → there is still debate on which method is better → there are also methods that speed up the convergence of these algorithms.

→ also, computational complexity is another angle to consider.

(the MDP can also be solved via linear programming).

→ with good theoretical results.

Now → we do not have the transition probabilities → the model is not known in advance → either skip learning a model → or learn one and simulate with it → both approaches are viable → but their sample efficiency differs.

Model-free → what if the state space is huge? → the central problem in RL → the credit assignment problem.

→ the temporal difference learning method → there are two components to this algorithm → it deals with multiple states → maximizes the immediate reward → and in turn, this maximizes the overall reward.

Whenever a state is visited → its value is updated right away.

→ this is the general form of the TD algorithm → a more advanced version of the TD algorithm → TD(λ) → more computationally expensive but converges faster.
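A minimal TD(0) sketch on the classic 5-state random walk (a standard toy example, not code from the survey); TD(λ) would add eligibility traces on top of this same update:

```python
import random

def td0_random_walk(episodes=5000, alpha=0.05, seed=0):
    """TD(0) evaluation of the uniform random policy on a 5-state random walk.
    States 1..5, terminals at 0 and 6, reward 1 only for exiting to the right;
    the true values are i/6 for state i."""
    rng = random.Random(seed)
    V = [0.0] * 7              # V[0] and V[6] stay 0 (terminal)
    for _ in range(episodes):
        s = 3                  # every episode starts in the middle
        while s not in (0, 6):
            s2 = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s2 == 6 else 0.0
            # TD(0): shift V(s) toward the one-step bootstrapped target
            V[s] += alpha * (r + V[s2] - V[s])   # gamma = 1 here
            s = s2
    return V

V = td0_random_walk()
```

Note the update happens as soon as a state is visited, matching the "updates right away" point above.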

Q-learning → learns directly what to do → at every state → no transition model is needed.

No transition model is needed → just learn the action values directly.

(but during learning we need to handle the exploration/exploitation dilemma correctly → ad hoc methods can be applied here).
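A minimal tabular Q-learning sketch with ε-greedy exploration; the 4-state chain is a made-up example, not from the survey:

```python
import random

def q_learning_chain(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain. Actions: 0 = left, 1 = right;
    reward 1 only on entering terminal state 3."""
    rng = random.Random(seed)
    n_states, n_actions = 4, 2
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != 3:
            # epsilon-greedy action choice (ties broken randomly)
            if rng.random() < epsilon or Q[s][0] == Q[s][1]:
                a = rng.randrange(n_actions)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == 3 else 0.0
            # off-policy backup: max over next-state actions, no model needed
            target = r + (0.0 if s2 == 3 else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning_chain()
```

The greedy policy read off the learned Q moves right in every state, which is optimal here.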

There are also methods → for average-reward learning → but there are some difficulties → such as not finding the optimal policy → which is not a good result at all.

Model-based methods → now we model the world itself → learn the world and plan by simulation → one method is certainty equivalence → but this method has some problems → it needs a lot of data.
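A sketch of the certainty-equivalence idea under simplifying assumptions: estimate the transition and reward model from counts, then plan as if the estimates were exact (the function name and sample data below are made up for illustration):

```python
def certainty_equivalence_estimate(samples, n_states, n_actions):
    """Estimate T(s, a, s') and R(s, a) from observed (s, a, r, s') samples;
    certainty equivalence then plans with the estimates as if they were exact."""
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    reward_sums = [[0.0] * n_actions for _ in range(n_states)]
    totals = [[0] * n_actions for _ in range(n_states)]
    for s, a, r, s2 in samples:
        counts[s][a][s2] += 1
        reward_sums[s][a] += r
        totals[s][a] += 1
    T_hat = [[[counts[s][a][s2] / totals[s][a] if totals[s][a] else 0.0
               for s2 in range(n_states)]
              for a in range(n_actions)]
             for s in range(n_states)]
    R_hat = [[reward_sums[s][a] / totals[s][a] if totals[s][a] else 0.0
              for a in range(n_actions)]
             for s in range(n_states)]
    return T_hat, R_hat

# made-up data: action 0 in state 0 reached state 1 (reward 1) three times
# and stayed in state 0 (reward 0) once
samples = [(0, 0, 1.0, 1), (0, 0, 1.0, 1), (0, 0, 1.0, 1), (0, 0, 0.0, 0)]
T_hat, R_hat = certainty_equivalence_estimate(samples, n_states=2, n_actions=1)
```

The data-hunger mentioned above shows up here directly: every (s, a) pair needs enough samples before the estimates are trustworthy.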

(breaking up the agent's life into a learning phase and an acting phase → not a good idea).

Dyna → another model-based method → learn the transition model → and use it for Q-learning.

Dyna → better than naive model-based methods → six times more computationally expensive → but it converges much faster → so every method has its pros and cons.
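A minimal Dyna-Q sketch on the same kind of toy chain (a simplified hypothetical example; the survey's Dyna is more general): each real step is followed by a few planning updates replayed from the learned model:

```python
import random

def dyna_q(episodes=200, planning_steps=10, alpha=0.1, gamma=0.9,
           epsilon=0.1, seed=0):
    """Dyna-Q on a toy 4-state chain (actions: 0 = left, 1 = right;
    state 3 terminal, reward 1 on entering it). The world is assumed
    deterministic, so the model stores one (r, s') per (s, a)."""
    rng = random.Random(seed)
    n_states, n_actions = 4, 2

    def step(s, a):
        s2 = max(0, s - 1) if a == 0 else s + 1
        return (1.0 if s2 == 3 else 0.0), s2

    Q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}   # (s, a) -> (r, s2)
    for _ in range(episodes):
        s = 0
        while s != 3:
            if rng.random() < epsilon or Q[s][0] == Q[s][1]:
                a = rng.randrange(n_actions)       # explore / break ties
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1  # exploit
            r, s2 = step(s, a)
            target = r + (0.0 if s2 == 3 else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])  # real-experience update
            model[(s, a)] = (r, s2)
            for _ in range(planning_steps):        # simulated-experience updates
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                ptarget = pr + (0.0 if ps2 == 3 else gamma * max(Q[ps2]))
                Q[ps][pa] += alpha * (ptarget - Q[ps][pa])
            s = s2
    return Q

Q = dyna_q()
```

The extra planning loop is exactly the trade-off the notes describe: more computation per real step, far fewer real steps needed.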

(but Dyna itself is not the best method).

Prioritized sweeping / Queue-Dyna → more advanced methods → very similar to one another.

The general algorithm for those methods.

→ when a surprising transition happens → a lot of computation is done to propagate this state's new value back to its predecessor states.

(much faster than the original Dyna).
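A prioritized-sweeping sketch on a toy deterministic chain (a simplified illustration assuming the model is known): backups are popped from a priority queue ordered by Bellman error, and predecessors of a changed state are re-queued:

```python
import heapq

def prioritized_sweeping(gamma=0.9, theta=1e-9, max_backups=1000):
    """Prioritized sweeping on a known deterministic 4-state chain
    (actions: 0 = left, 1 = right; terminal state 3, reward 1 on entering it)."""
    n_states, n_actions = 4, 2
    model, preds = {}, {s: set() for s in range(n_states)}
    for s in range(3):                      # no transitions out of the terminal
        for a in range(n_actions):
            s2 = max(0, s - 1) if a == 0 else s + 1
            model[(s, a)] = ((1.0 if s2 == 3 else 0.0), s2)
            preds[s2].add((s, a))
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def error(s, a):
        r, s2 = model[(s, a)]
        target = r + (0.0 if s2 == 3 else gamma * max(Q[s2]))
        return abs(target - Q[s][a]), target

    # max-heap via negated priorities, seeded with all known pairs
    heap = [(-error(s, a)[0], s, a) for (s, a) in model]
    heapq.heapify(heap)
    for _ in range(max_backups):
        if not heap:
            break
        _, s, a = heapq.heappop(heap)
        e, target = error(s, a)             # recompute: priorities go stale
        if e <= theta:
            continue
        Q[s][a] = target                    # full deterministic backup
        for ps, pa in preds[s]:             # re-queue surprised predecessors
            pe, _ = error(ps, pa)
            if pe > theta:
                heapq.heappush(heap, (-pe, ps, pa))
    return Q

Q = prioritized_sweeping()
```

The surprising terminal reward gets backed up first, then its predecessors, and so on down the chain.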

(there are other model-based methods as well → RTDP is another example).

Now we need to talk about → generalization → going beyond tabular learning → RL has a lot of mappings → from states to actions → or deterministic transitions and more.

→ some of these mappings → can be learned via supervised learning.

State plays a huge role in RL → where the agent is in the current env.

Immediate reward → very similar to bandit problems → maximizes the reward that comes right away.

(other methods exist, such as CRBP → which also uses a feedforward network).

→ ARC → associative reinforcement comparison → another method, by Sutton.

REINFORCE algorithm → uses gradient-based optimization (backpropagation).

→ to know which action to take → a very famous algorithm.

(the gradient update → makes the more desirable actions more likely).
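A minimal REINFORCE sketch on a 2-armed bandit (a hypothetical toy: a softmax policy over action preferences with a running-average baseline stands in for a full network with backpropagation):

```python
import math
import random

def reinforce_bandit(true_means, steps=3000, lr=0.1, seed=0):
    """REINFORCE on a k-armed bandit: sample from a softmax policy, then
    take a gradient step scaled by (reward - baseline)."""
    rng = random.Random(seed)
    k = len(true_means)
    prefs = [0.0] * k            # action preferences (policy parameters)
    baseline = 0.0
    for t in range(1, steps + 1):
        m = max(prefs)           # stable softmax
        exps = [math.exp(p - m) for p in prefs]
        z = sum(exps)
        probs = [e / z for e in exps]
        u, a, acc = rng.random(), k - 1, 0.0
        for i, p in enumerate(probs):    # sample an action
            acc += p
            if u < acc:
                a = i
                break
        r = true_means[a] + rng.gauss(0, 0.1)
        baseline += (r - baseline) / t   # running-average baseline
        # gradient of log softmax: indicator(i == a) - probs[i]
        for i in range(k):
            grad = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * (r - baseline) * grad
    return prefs

prefs = reinforce_bandit([0.2, 0.8])
```

Actions that beat the baseline get their probability pushed up, which is the "more desirable actions become more likely" effect noted above.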

(another method is to learn a Boolean function).

Decision trees → also one of the methods to improve generalization in RL → a combination of traditional decision trees and RL.

(such a long history of study).

After discovering the good path → the agent follows it.

The above is the Parti-game algorithm → able to optimize in continuous env.

(fast method).

Many different methods exist to overcome the large state space problem → by dividing the large space into smaller spaces.

(Feudal Q-learning → hierarchy of learning modules).

Now in this section → a more realistic env is drafted → in which the full env is not observable → POMDP → the first method is just to ignore the partial observability → and learn to behave from the start without acknowledging that the env is partially observable.

(there are no guarantees).

State-free stochastic policy → this was another method for overcoming POMDPs.

→ but in general, it is hard to say that this method truly 'solves' the partial observability problem.

(recurrent Q-learning exists as well).

→ hence many of these methods are limited to simple env rather than complex env.

Many different areas of application exist → such as robotics, games, and much more.

(game playing is huge, chess and backgammon).

Several months of computer time → holy shit.

Scaling to a large problem in the general case → this is very hard.

Much more work is needed.

Reference

https://arxiv.org/pdf/cs/9605103.pdf