Reinforcement Learning — Solving BlackjackJeremy zhangBlockedUnblockFollowFollowingJun 14We have talked about how to use Monte Carlo methods to evaluate a policy in reinforcement learning here, where we took the example of blackjack and set a fixed policy, and by repetitively sampling, we are able to get an unbiased estimates of the policy and the state, value pairs along the way.

The next direct thought would be that are we able to solve the blackjack problem by using value iteration which we’ve been introduced in previous passages?The answer is a sounded yes, as we can clearly define the three main components(state, action, reward of blackjack, and we also know the policy of the opponent, which will also be treated as part of the environment, so the process will be similar to what we have discussed in tic-tac-toe implementation, where we took the standpoint of 1 player, and the corresponding action from the opponent will be taken as the feedback from the environment(in this way we don’t need any model of our opponent, which is one of the advantages of reinforcement learning).

Blackjack RulesJust a quick review of the blackjack rules and the general policy that a dealer takes:The game begins with two cards dealt to both dealer and player.

One of the dealer’s cards is face up and the other is face down.

If the player has 21 immediately (an ace and a 10-card), it is called a natural.

He then wins unless the dealer also has a natural, in which case the game is a draw.

If the player does not have a natural, then he can request additional cards, one by one (hits), until he either stops (sticks) or exceeds 21 (goes bust).

If he goes bust, he loses; if he sticks, then it becomes the dealer’s turn.

The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, and hits otherwise.

If the dealer goes bust, then the player wins; otherwise, the outcome — win, lose, or draw — is determined by whose final sum is closer to 21.

If the player holds an ace that he could count as 11 without going bust, then the ace is said to be usable.

ImplementationLet’s first be clear about the state, action and reward.

The state of the game is the components that matter and affect the winning chance.

Firstly, the most important is card sum, the current value on hand.

Secondly, there are two more factors that contribute to game wining, which we described above in the rule introduction, are the usable ace and dealer’s showing card.

So the state would has 3 components player’s current card sum, usable ace and dealer’s showing card.

Actions are clear as one can has only 2 actions in blackjack, either HIT or STAND.

Reward would be based on the result of the game, where we give 1 to a win, 0 to a draw and -1 to a lose.

As I have talked about MC method on blackjack, in the following sections, I will introduce the major differences of implementation of the two and try to make the code more concise.

(full code)InitIn the init function, we define the global values that will be frequently used or updated in the following functions.

And as opposed to MC implementation where our player follows a fixed policy, here the player we control does not use a fixed policy, thus we need more components to update its Q-value estimates.

Components defined inside this init function are generally used in most cases of reinforcement learning problem.

The added parts compared to the init function in MC method include self.

player_Q_Values , which is the initialised estimation of (state, action) that will be updated after each episode, self.

lr , which is used to control updating speed and self.

exp , which is used for action taking.

Deal Card & Dealer PolicyThe giveCard and dealerPolicy function is exactly the same.

As our opponent, the dealer, still holds the same strategy in this game, and we are exploring a strategy that could be as competitive or even better than dealer’s policy.

Action ChoosingThis time our player no longer follows a fixed policy, so it needs to think about which action to take in terms of balancing the exploration and exploitation.

Our player has two actions to take, of which 0 stands for stand and 1 stands for hit.

When the current card sum is equal or less than 11, one would always hit as there is no harm in hitting a another card.

When the card sum exceeds 11, our player would take a ϵ-greedy policy, where exp_rate percentage of time, it takes random action, otherwise takes a greedy action which is the action that gains the most reward based on the current estimates of Q-value.

Judging the Next StateBy taking an action, our player moves from the current state to the next state, so the playerNxtState function will take in an action and output the next state and judge if it is the end of game.

In order to move to next state, the function needs to know what is the current state.

It does this at the beginning by assigning the current state to fixed variables.

The following logic is if our action is 1, which stands for HIT, our player will draw another card, and the current card sum will be added accordingly based on whether the drawing card is ace or not.

On the other hand, if the action is STAND, the game ends right away and the current state will be returned.

It is worth noting that at the end of the function we add another section to judge if the game ends according to whether the player has an usable ace on hand.

Give Reward and Update Q-valueLike all reinforcement learning updates, at the end of the game, which is considered one episode, our player receives a reward based on card value of itself and dealer’s and propagate this value backward to update the estimates of (state, action) .

Q-value updateThese 2 functions could be merged into 1, and I separate them to make it clearer in structure.

The winner function judges the winner of the game and returns reward correspondingly, and the _giveCredit function updates reward according to the formula above, which is exactly the same as we introduced in grid world Q-learning, the Q-values is updated in a reversed fashion, whereas the last updated value will be used to update the current Q value.

TrainingIn the training phase, we will simulate many games and let our player to play against the dealer in order to update the Q-values.

Different from MC method of blackjack, at the beginning I added a function deal2cards which just simply deal 2 cards in a row to a player.

The reason is to follow the rule that if either of the player gets 21 points with the first 2 cards, the game ends directly rather than continuing to wait the next player reaching its end.

This avoids cases that one player gets 21 points with the first 2 cards while the other also gets 21 points with more than 2 cards, but the game ends with a draw.

Play with Dealer & ResultWith our agent equipped with intelligence, we are good to let it play against a dealer(The save and load policy function is same as that in MC blackjack, and the playWithDealer function has similar structure with the training process, only to tune the exp_rate to 0).

After training 10,000 rounds with exp_rate=0.

2 and lr=0.

1 , I saved the policy and let it play against dealer with 10,000 rounds and got the result:wining: 4122drawing: 1639losing: 4239which was a policy that not able to overtake dealer’s.

There surly exists a policy that performs better than HIT17(in fact, this is an open secret), the reason that our agent did not learn the optimal policy and perform as well is that, I believe,Not enough rounds of trainingAdjustments on exploration rate and learning rate(fixing them might not be good for this case)1-step Q-value update has its own limitationWhat you can do?I strongly suggest you to try more based on the current implementation, which is both interesting and good for yourself in terms of deepen your understanding of reinforcement learning.

You can try:n-step updates rather than 1-step(which I will also introduce in future posts)Train an agent to play against itself based on current saved policy (this is an intriguing idea that worth a pop, as we know AlphaGo is strong by learning through playing with masters, but AlphaGo Zero beats it with 100:0 by learning to play against itself.

You can also leverage this idea and try to train an agent to play with itself)Please check out the full code here.

You are welcomed to contribute, and if you have any questions or suggestions, please raise comment below!Reference[1] http://incompleteideas.

net/book/the-book-2nd.

html.. More details