Monte Carlo Tree Search in Reinforcement LearningA recipe of the search algorithm at the heart of Deep Mind’s Alpha Zero AI.

Ziad SALLOUMBlockedUnblockFollowFollowingFeb 17https://unsplash.

com/photos/waAAaeC9hnsInstead of an alpha-beta search with domain-specific enhancements, AlphaZero uses a general purpose Monte Carlo tree search (MCTS) algorithm.

Each search consists of a series of simulated games of self-play that traverse a tree from root state root until a leaf state is reached.

(view source )OverviewMonte Carlo tree search (MCTS) algorithm consists of four phases: Selection, Expansion, Rollout/Simulation, Backpropagation.

1.

SelectionAlgorithm starts at root node R, then moves down the tree by selecting optimal child node until a leaf node L(no known children so far) is reached.

2.

ExpansionIf L is a not a terminal node (it does not terminate the game) then create one or more child nodes according to available actions at the current state (node), then and select the first of these new nodes M.

3.

SimulationRun a simulated rollout from M until a terminal state is found.

The terminal state contains a result (value) that will be returned to upwards in the is backpropagation phase.

NB.

The states or nodes in which the rollout passes through are not considered visited.

4.

BackpropagationAfter the simulation phase, a result is returned.

All nodes from M up to R will be updated by adding the result to their value and increase the count of visits at each node.

Selection & ExpansionThe algorithm starts at a node or state that is considered the root, then selects a child node to move to.

The selection is based on the Upper Confidence Bounds (UCB1) formula:where vᵢ is the value at state Si, nᵢ is number of visits at state Si, N is the total number of visits of the nodes at the same level (or in other words the number of visits at the parent of Si), C is a constant used for fine tuning.

After computing the UCB1 for every child of the current node, the one with the highest value is chosen.

If the selected node is new, meaning it is not visited yet, Rollout is called to find a terminal state with value.

Otherwise, if it is visited, then create child nodes for all actions available at the current node, the move to the first of the child nodes and begin a Rollout.

Rollout or SimulationThe Rollout or Simulation is the phase in which random actions are taken, retrieve the landing state then take another random action in order to land in a new state.

This process is iterated until a terminal state is reached, at that point the value of the terminal stated is returned.

Backpropagationthe back propagation gets the value of the Rollout and updates the nodes from the start of the Rollout till the root node.

The update consists of adding the Rollout result to the current value of each node, and to increase by one the count of visits at each of these nodes.

ExampleThe following is an example that unrolls all the different phases of the MCST algorithm.

Let’s start with root node S0.

We assume that it has been visited before, so it is not new.

So we need to expand it.

We assume that there are 2 actions (A1, A2) at this state.

We expand the tree by adding two states S1 and S2 resulting from the actions A1, A2.

Since the UCB1 for both S1 and S2 is infinity (since N1=0 and N2=0) we choose to go at first with A1.

Now S1 is new because it has not been visited yet (N1 = 0).

The algorithm says that in this case we should rollout until we reach a terminal state.

Let’s assume that we did just that and found a terminal node with value 20.

We propagate this value from S1 upwards to S0 (text in blue).

So S1 and S0 will have both values equal 20 and the number of visits equals 1.

Now in the second iteration, UCB1(S1) = 20 and UCB1(S2) = infinity, so we take action A2.

Also we notice that S2 is new (N2=0) so we rollout until we find a terminal state.

We assume that the terminal state has a value 10.

We backpropagate it from S2 up to S0 (text in blue).

At the end we get S2 (V2=10, N2=1) and S0(V0 = 30, N0=2)In the third iteration, UCB1(S1) = 20 and UCB1(S2) = 10, so we choose to go to S1.

Since now it is not new anymore (N1 > 0) we expand it by adding the available actions and their landing states.

We suppose that there are 2 actions A3, A4 with the respective states S3 and S4.

Since both S3 and S4 have UCB1 infinity we choose S3 by following actin A3.

S3 is new so we rollout until we find terminal state with value 15.

We backpropagate this value to node S3, S1, S0.

The updated values can be seen in the blue text.

AdvantageAheuristicMCTS does not require any knowledge about the given domain.

Apart from the game rules and conditions the does not need to know any strategy or tactics, which makes it reusable for other games with small modifications.

AsymmetricMCTS visits more interesting nodes more often, and focusses its search time in more relevant parts of the tree.

AnytimeIt is possible to stop the algorithm at any time to return the current best estimate.

ElegantThe algorithm is not complex and is easy to implement.

DisadvantagesPlaying StrengthMCTS, in its basic form, can fail to find reasonable moves for games even of medium complexity within a reasonable amount of time.

SpeedMCTS search can take many iterations to converge to a good solution, which can be an issue for more general applications that are difficult to optimise.

ImprovementsDomain KnowledgeDomain knowledge specific to the current game can be help filter out unlikely moves or to produce rollouts that are more similar to the ones that would occur between human opponents.

This has the benefit of rollout results that will be more realistic than random simulations and that nodes will require fewer iterations to yield realistic reward values.

Domain IndependentDomain independent enhancements do not tie the implementation to a particular domain, maintaining generality, and are hence the focus of most current work in the area.

.