Monte Carlo Methods — Estimate Blackjack PolicyJeremy zhangBlockedUnblockFollowFollowingJun 3We have been discussing several reinforcement learning problems, within each we are trying to get the optimal policy by keep playing the game and update our estimates.
In essence, we estimate the (state, value) pair or (state, action, value) pair and based on the estimates we generate policy by taking the action that gives the highest value.
But this is not the only way to do it.
In this paragraph, I will introduce Monte Carlo Method, which is another way to estimate the value of a state, or value of a policy.
The Monte Carlo Method involves a wide broad of methods, but all follows the same principal — sampling.
The idea is straightforward and intuitive, if you are not sure of the value of a state, just do sampling, which is keep visiting that state and averaging over the reward that got from simulated actions by interacting with the environment.
“Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
” — Sutton, Reinforcement Learning an IntroductionOur Blackjack RuleTo get a better understanding on how to leverage Monte Carlo Methods on reinforcement learning, let us dive into an example which comes from Sutton’s book.
General RuleThe game we are playing today is the widely known casino card game of blackjack, and our goal is to estimate a policy using sampling.
Following is a brief description of the rules:The game begins with two cards dealt to both dealer and player.
One of the dealer’s cards is face up and the other is face down.
If the player has 21 immediately (an ace and a 10-card), it is called a natural.
He then wins unless the dealer also has a natural, in which case the game is a draw.
If the player does not have a natural, then he can request additional cards, one by one (hits), until he either stops (sticks) or exceeds 21 (goes bust).
If he goes bust, he loses; if he sticks, then it becomes the dealer’s turn.
The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, and hits otherwise.
If the dealer goes bust, then the player wins; otherwise, the outcome—win, lose, or draw—is determined by whose final sum is closer to 21.
If the player holds an ace that he could count as 11 without going bust, then the ace is said to be usable.
Our PolicyOur policy is to stick if the our card’s sum is 20 or 21, and to his otherwise.
To find the state-value function for this policy by a Monte Carlo approach, we will simulate many blackjack games using the policy and averages the returns following each state.
The key point is that a dealer’s policy is fixed, sticking at sum ≥ 17, and we are going to simulate many blackjack games to measure the policy sticking at sum ≥ 20.
Blackjack Monte Carlo ImplementationstatesTo implement our simulation, let us first be clear about the states.
The states in blackjack that we need to consider about include firstly, player’s card sum, which ranges from 12–21(we exclude sum lower than 12 as in those scenarios we would always hit), secondly, dealer’s showing card, as it potentially indicates the dealer’s final card value, and lastly, player’s usable ace, which is also a factor that would slightly affect our wining chance.
So add up in total we have 10(from 12 to 21) * 10(from 1 to 10) * 2(true or false) = 200 states.
(full code here)InitAs our goal is to estimate the states’ value under our fixed policy, we will need to have a (state, value) dict to record all the states and number of winnings along our game simulation, and also a player_states to track the states of each game.
The player_win and player_draw are used to track the total wining of the game.
Deal CardThe fundamental mechanism that required is to randomly give a card to each side, in which case we assume that we have infinite card.
(J, Q, K are taken as 10)Dealer PolicyOur dealer policy is to hit when card value less than 17 and stand when 17 or above.
The function will be able to return a new value based on the action chosen and be able to tell if the game is ended.
This function can be called recursively until reaches its end as its returns is the same as inputs.
We keep track of the usable ace, when current value is less than 10, the ace will always be taken as 11, otherwise 1; When current value is over 21 and the dealer has usable ace on hand, then the usable ace will be taken as 1, total value is subtracted by 10 accordingly, and the usable ace indicator will be set to false .
Player PolicyThe implementation is exactly the same as dealer policy except that player stands at 20 or above in this setting.
Give CreditAt the end of each game, credit will be given to the winner by adding one to the state that leads to wining.
The player_states is a list that records all the states until game ends, and the credit is only given to the last state that with card value between 12 and 21, that is to say, if we have a card sum of 15, and take the action HIT, getting a card of value 7, which ends with total sum of 22, then 15 will be taken as the last state instead of 22.
Besides counting state_value , number of player winning and drawing is also counted, which will be used to measure the overall goodness of this policy.
MC SimulationWith all the above preparation, we are good to run the simulation!.The whole process is similar to general value iteration stuff, but except backpropagating the reward at the end of the game, we update the estimates of state, value pair by what we defined in _giveCredit function.
At the beginning of the game, dealer is given 2 cards, and we assume the first card is being shown.
Then to the player’s turn, only card value between 12 and 21 is included in the states, since it is meaningless to include value 22 or above in the states, as these values can not lead to a wining, thus needless to be counted.
And finally at the end of the game, states that not yet exist in the player_state_value dict will be initialised to 0, and game will be credited by _giveCredit .
Estimation VisualisationWe ran our simulation by 10000 rounds and plotted the estimates of state, value:MC SimulationThe z column of reward is the number of wining of that state.
As expected, when a player has value 20 or 21, which is at the rear of the plot, one is more likely to win.
In overall, this HIT20 policy gives us 30.
6% of wining, 24.
37% drawing and 45.
03% of losing against dealer’s HIT17, obviously this is a policy that leads to more losing than winning in casino.
(full code here)Let’s also have a look at how usable ace leads to the nuances of game playing:MC Simulation(usable ace)Simulation Result from Sutton’s bookThe result from Sutton’s book is clearer, in which he raised the question: why the frontmost values with usable ace are slightly higher than that without usable ace?.I guess usable ace does increase the chance of wining in that a player with usable ace has greater chance to hit, as even the value crosses above 21, he is still able to continue by making the usable ace non-usable.
To conclude, we together explored using MC method to evaluate a given policy.
The idea of sampling of MC method can be generalised to many game playing scenarios, so if you reach a state and don’t know where to go or what action to take, just sampling!Please check out full code here, and you are welcomed to contribute and raise issues!Reference http://incompleteideas.