# Creating AI for GameBoy Part 4: Q-Learning and Variations

This is where the real magic happens — we’ve built our tools and now it is time to set them in motion.

A quick recap of our journey so far: first, we built a controller so that we could run the game using Python scripts; second, we used a combination of open source and homemade image processing tools to extract information from the game; and third, we automated the process so that the game could play itself with random inputs.

Now, we will implement a variation of Q-learning to beat the first level of this game.

If you want to take a deeper dive into reinforcement learning, I highly recommend these two posts (LINK 1, LINK 2); they do a great job of breaking down the algorithms and their implementation.

The final result — playing using the Q-Table’s most rewarded actions

## What is Q-Learning and How Do We Apply it Here?

To solve the problem of Fire Emblem, we will use a Q-table.

The rows of this table represent the possible states of the game and the columns represent the actions.

In this particular situation, the table has 150 states (the level is like a board with 150 squares — 10 rows and 15 columns) and 180 possible actions at any given turn.

This number of actions comes from our character, Lyn, being able to walk a maximum Manhattan distance of five units (which results in 60 different movement options) and having three options when she arrives (attack, item, and wait).
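As a sanity check, the 60 and 180 counts above can be reproduced by enumerating every square within a Manhattan distance of five (a minimal sketch; the tuple representation of actions is an assumption, not the article's actual data structure):

```python
# Squares reachable within Manhattan distance 5, excluding staying in place
moves = [(dx, dy)
         for dx in range(-5, 6)
         for dy in range(-5, 6)
         if 0 < abs(dx) + abs(dy) <= 5]

# Each movement can be followed by one of three options
options = ("attack", "item", "wait")
actions = [(move, option) for move in moves for option in options]

print(len(moves))    # → 60
print(len(actions))  # → 180
```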

The value that fills each (state, action) pair is the estimated reward associated with taking that action from that state.

This reward is often represented as a function Q(s,a), which is why this process is called Q-learning.

You may be familiar with the phrase “Deep Q-Learning” (DQN) and wondering why we aren’t implementing that instead — the game plays too slowly to generate a satisfactory amount of data for deep learning and we can’t speed it up in this current emulator setup.

Below is an example of a Q-table getting filled in on an arbitrary problem.

A small example of a Q-table getting filled in — Source

To fill this 150×180 table, we must play the game many times.

We start by taking random actions given our state and keeping track of the rewards as we progress.

The rewards, in this context, are the number of enemies we defeated in the last turn.

If we are at an arbitrary state s and take an arbitrary action a, our reward is 0 if all enemies survived that turn, or 1 per enemy slain.

As we play the game randomly, we are currently exploring the action/state space and keeping track of the rewards so that when we want to exploit this knowledge later, we can have a sense of what works well.

We must initially explore in order to find routes that work well just as we must ultimately use the information we have gained to improve our gameplay.

In practice, this is represented by a variable called exploration_rate (or epsilon) that decays as we progress.

Each time we attempt the level, we decrease our exploration rate (through an exponential decay) and compare it to a random number to see if we will be exploring or exploiting on the attempt.

At the end of the learning process, we will essentially be playing the game without any randomly-generated decisions, relying only on the knowledge we have accrued through playing the game, stored in our Q-table.

Walking through the implementation, we have three major parts:

1. Initializing the data structures
2. Initializing the hyperparameters
3. Implementing the algorithm

## Setup Code

The first block of code here sets up the data structures we need, namely the Q-table and dictionaries mapping states and actions to indices in said table.
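A minimal sketch of that setup, assuming the 10×15 grid and action space described earlier (variable names here are illustrative, not the article's actual code):

```python
import numpy as np

# States: one per square on the 10x15 map
n_rows, n_cols = 10, 15
states = [(r, c) for r in range(n_rows) for c in range(n_cols)]
state_to_index = {s: i for i, s in enumerate(states)}

# Actions: 60 moves (Manhattan distance <= 5) x 3 post-move options
moves = [(dx, dy) for dx in range(-5, 6) for dy in range(-5, 6)
         if 0 < abs(dx) + abs(dy) <= 5]
actions = [(m, opt) for m in moves for opt in ("attack", "item", "wait")]
action_to_index = {a: i for i, a in enumerate(actions)}

# The Q-table itself: 150 states x 180 actions, initialized to zero
q_table = np.zeros((len(states), len(actions)))
print(q_table.shape)  # → (150, 180)
```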

The second block of code initializes the hyperparameters we will use for this experiment.

The learning rate and gamma (the discount factor) are multipliers that control how much each observed reward changes the Q-table when we update it.
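A sketch of that hyperparameter block; these particular values are illustrative assumptions, not necessarily the ones used in the experiment:

```python
num_episodes = 100          # training runs (attempts at the level)
learning_rate = 0.1         # step size for Q-table updates
gamma = 0.9                 # discount factor for future rewards
exploration_rate = 1.0      # epsilon: start by always exploring
min_exploration_rate = 0.01 # floor so we never stop exploring entirely
exploration_decay = 0.01    # rate of the exponential epsilon decay
```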

The third block of code shows the actual Q-learning as it takes place.

We start each run by resetting the game and setting our state to the starting location and then begin taking turns.

For each turn, we decide whether to explore or exploit using a random number generator and evaluating if that number is higher than our current exploration rate.

If so, we are exploiting what knowledge we currently have in the Q-table.

Once we have made that decision, we either randomly (explore) generate an action or choose the highest-rewarded action (exploit) from the Q-table.

We then take a turn with that action using the take_turn* function I wrote, which returns the name of who moved, where they moved, what action they took, and the reward.

The boolean arguments indicate whether we are exploring or exploiting the Q-table.

Because some spaces are off limits (i.e., impassable terrain or squares occupied by enemy units), I included a check to see if the move is invalid.

We then set our new state using our prior state and where we moved, and update our Q-table with the results.

At the end of each training run, we decay our exploration rate so that we can use our information more fittingly.
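The loop described above can be condensed into a self-contained sketch. The real implementation drives the emulator through take_turn; here a tiny stub environment (a one-dimensional walk toward a goal square) stands in for the game so the loop actually runs, and all names and values are assumptions for illustration:

```python
import math
import random
import numpy as np

random.seed(0)
n_states, n_actions = 5, 2          # stub sizes; the real table is 150 x 180
q_table = np.zeros((n_states, n_actions))

learning_rate, gamma = 0.1, 0.9
exploration_rate, min_exploration_rate, decay = 1.0, 0.01, 0.05

def take_turn(state, action):
    """Stub standing in for the article's take_turn: (new_state, reward, done)."""
    new_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = new_state == n_states - 1
    return new_state, (1.0 if done else 0.0), done

for episode in range(200):
    state, done = 0, False                  # reset to the starting square
    while not done:
        # Decide whether to explore or exploit on this step
        if random.random() > exploration_rate:
            action = int(np.argmax(q_table[state]))   # exploit the table
        else:
            action = random.randrange(n_actions)      # explore randomly
        new_state, reward, done = take_turn(state, action)
        # Standard Q-learning update rule
        q_table[state, action] += learning_rate * (
            reward + gamma * np.max(q_table[new_state]) - q_table[state, action])
        state = new_state
    # Decay the exploration rate exponentially after each run
    exploration_rate = min_exploration_rate + (
        1 - min_exploration_rate) * math.exp(-decay * episode)

print(int(np.argmax(q_table[0])))  # → 1 (the move toward the goal)
```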

*Note regarding the take_turn function: OpenAI’s gym has a very similar method, env.step(action), that returns similar information after proceeding one step in a game.

I wanted to mimic their logic to make my implementation as applicable to other problems as possible.

If you want to look into my take_turn function, you can find it here on my GitHub.

## The Plot Thickens — Alternative Methods

This method, as currently described, is the typical manner for non-deep Q-learning.

For this problem, it doesn’t work.

Not only does the program play the game too slowly to generate data for DQN, it plays too slowly to even fill in a Q-table with optimal values.

Fret not, dear reader, for I have not wasted your time thus far; there are different variations of Q-learning that we will discuss moving forward that will solve this problem.

## Altering the Reward

Up until this point, the rewards have been given simply for killing enemies.

This does not motivate our hero to move toward the final, static boss.

The probability that she would randomly move to a square next to the boss to attack him and then stay there a second turn to finish him off is minuscule.

That isn’t to say that with enough training runs with a sped up emulation that this couldn’t be done, but frankly, I have better things to do with my time and computer.

To solve this problem, I included an addition to the reward based on Euclidean distance to a square next to the boss.

Instead of adding 1 per enemy killed, we now add 1 per enemy killed plus the percentage of progress made from the starting square to the target square.
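That altered reward might look something like the following sketch; the function name and the exact way the two terms are blended are assumptions based on the description above:

```python
import math

def reward(enemies_killed, position, start, target):
    """1 per enemy slain, plus the fraction of the start-to-target
    distance already covered (0.0 at the start, 1.0 at the target)."""
    total = math.dist(start, target)
    remaining = math.dist(position, target)
    progress = (total - remaining) / total
    return enemies_killed + progress

print(reward(0, (0, 0), (0, 0), (9, 14)))   # → 0.0 (start, no kills)
print(reward(1, (9, 14), (0, 0), (9, 14)))  # → 2.0 (target reached, one kill)
```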

## Sensibly Initializing the Q-Table

Altering the reward helped hugely, but we still struggled to finish the level: we had to balance the exploration required to kill our first enemy and heal ourselves against the exploitation of our known target square.

We don’t find the optimal route to the boss and spend a lot more time on our first randomly-selected path because that is what we know “works”.

This seems to me to be a problem that doesn’t need to be a problem — we already know what the distance-based reward terms will be before we start playing.

With this in mind, I looped through every possible state in the Q-table and rewarded each movement with how close it would put us to the boss at the end of the level.

This way, we start the process with access to the fastest route to the end of the level and only need to “learn” how to deal with the enemy in the way.

This also allows us to increase the exploration rate decay, as with more prior knowledge and less “learning” to do, we need to perform fewer random movements.
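Pre-filling the table with that prior knowledge might look like the sketch below. The grid size follows the earlier description, while the target square and the flat action indexing are illustrative assumptions:

```python
import math
import numpy as np

n_rows, n_cols = 10, 15
target = (9, 14)  # assumed square next to the boss
states = [(r, c) for r in range(n_rows) for c in range(n_cols)]
moves = [(dx, dy) for dx in range(-5, 6) for dy in range(-5, 6)
         if 0 < abs(dx) + abs(dy) <= 5]
options = ("attack", "item", "wait")
max_dist = math.dist((0, 0), target)

q_table = np.zeros((len(states), len(moves) * len(options)))
for s, (r, c) in enumerate(states):
    for m, (dr, dc) in enumerate(moves):
        dest = (r + dr, c + dc)
        if not (0 <= dest[0] < n_rows and 0 <= dest[1] < n_cols):
            continue  # off the map: leave the entry at zero
        # Seed every action ending on `dest` with its closeness to the target
        progress = 1 - math.dist(dest, target) / max_dist
        for o in range(len(options)):
            q_table[s, m * len(options) + o] = progress

print(q_table.shape)  # → (150, 180)
```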

Using Q-learning and sensible reward functions, I was able to beat the first level of Fire Emblem in the optimal number of turns.

The second level introduces us to allies, complicating our problem even further: how can we alter our method to use allies? I’ll be answering this multi-agent reinforcement learning question in the next edition of Creating AI for GameBoy!